
Module rssparser

Ultra-liberal RSS parser

Visit http://diveintomark.org/projects/rss_parser/ for the latest version

Handles RSS 0.9x and RSS 1.0 feeds

RSS 0.9x elements:
- title, link, description, webMaster, managingEditor, language,
  copyright, lastBuildDate, pubDate

RSS 1.0 elements:
- dc:rights, dc:language, dc:creator, dc:date, dc:subject,
  content:encoded

Things it handles that choke other RSS parsers:
- bastard combinations of RSS 0.9x and RSS 1.0 (most Movable Type feeds)
- illegal XML characters (most Radio feeds)
- naked and/or invalid HTML in description (The Register)
- content:encoded in item element (Aaron Swartz)
- guid in item element (Scripting News)
- fullitem in item element (Jon Udell)
- non-standard namespaces (BitWorking)

Requires Python 2.2 or later


Version: 2.3.1
Author: Mark Pilgrim (mark@diveintomark.org)
Copyright: Copyright 2003, Mark Pilgrim
License: GPL

Classes
  RSSParser
Functions
  decodeEntities(data)
  open_resource(source, etag=None, modified=None, agent=None, referrer=None)
    URI, filename, or string --> stream
  get_etag(resource)
    Get the ETag associated with a response returned from a call to open_resource().
  get_modified(resource)
    Get the Last-Modified timestamp for a response returned from a call to open_resource().
  format_http_date(date)
    Formats a tuple of 9 integers into an RFC 1123-compliant timestamp as required in RFC 2616.
  rfc1123_match(...)
    match(string[, pos[, endpos]]) --> match object or None.
  rfc850_match(...)
    match(string[, pos[, endpos]]) --> match object or None.
  asctime_match(...)
    match(string[, pos[, endpos]]) --> match object or None.
  parse_http_date(date)
    Parses any of the three HTTP date formats into a tuple of 9 integers as returned by time.gmtime().
  parse(uri, etag=None, modified=None, agent=None, referrer=None)
Variables
  __contributors__ = ['Jason Diamond (jason@injektilo.org)']
  __history__ = '\n1.0 - 9/27/2002 - MAP - fixed namespace proce...
  USER_AGENT = 'UltraLiberalRSSParser/2.3.1 +http://diveintomark...
  short_weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'S...
  long_weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',...
  months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Au...
  TEST_SUITE = ('http://www.pocketsoap.com/rssTests/rss1.0withMo...
Function Details

open_resource(source, etag=None, modified=None, agent=None, referrer=None)

URI, filename, or string --> stream

This function lets you define parsers that take any input source (URL, pathname to a local or network file, or actual data as a string) and deal with it in a uniform manner. The returned object is guaranteed to have all the basic stdio read methods (read, readline, readlines). Just call .close() on the object when you're done with it.

If the etag argument is supplied, it will be used as the value of an If-None-Match request header.

If the modified argument is supplied, it must be a tuple of 9 integers as returned by gmtime() in the standard Python time module. This MUST be in GMT (Greenwich Mean Time). The formatted date/time will be used as the value of an If-Modified-Since request header.

If the agent argument is supplied, it will be used as the value of a User-Agent request header.

If the referrer argument is supplied, it will be used as the value of a Referer [sic] request header.

The optional arguments are only used if the source argument is an HTTP URL and the urllib2 module is importable (i.e., you must be using Python version 2.0 or higher).
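The mapping from optional arguments to request headers described above can be sketched as follows. This is a minimal illustration, not the module's actual code: it assumes Python 3's urllib.request in place of urllib2, the helper name build_conditional_request is hypothetical, and it only builds the request without sending it.

```python
import calendar
import email.utils
import urllib.request


def build_conditional_request(url, etag=None, modified=None,
                              agent=None, referrer=None):
    """Hypothetical helper: attach the headers described for open_resource()."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if modified:
        # `modified` is a 9-tuple in GMT, as returned by time.gmtime().
        # formatdate(..., usegmt=True) emits an RFC 1123 date with English
        # day and month names regardless of the current locale.
        stamp = email.utils.formatdate(calendar.timegm(modified), usegmt=True)
        request.add_header("If-Modified-Since", stamp)
    if agent:
        request.add_header("User-Agent", agent)
    if referrer:
        # The header name keeps the historical misspelling "Referer".
        request.add_header("Referer", referrer)
    return request


# Usage: inspect the headers that would be sent (no network access needed).
example = build_conditional_request(
    "http://example.com/index.xml",
    etag='"abc123"',
    modified=(1994, 11, 6, 8, 49, 37, 6, 310, 0),
    agent="ExampleAgent/1.0",
)
for name, value in example.header_items():
    print(name, value)
```

A server that honors these headers can answer 304 Not Modified when the feed has not changed, which is the reason for passing back the etag and modified values from a previous fetch.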

get_etag(resource)

Get the ETag associated with a response returned from a call to open_resource().

If the resource was not returned from an HTTP server or the server did not specify an ETag for the resource, this will return None.

get_modified(resource)

Get the Last-Modified timestamp for a response returned from a call to open_resource().

If the resource was not returned from an HTTP server or the server did not specify a Last-Modified timestamp, this function will return None. Otherwise, it returns a tuple of 9 integers as returned by gmtime() in the standard Python time module.
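The behavior described for get_etag() and get_modified() can be approximated as below. This is a sketch under stated assumptions, not the module's implementation: it assumes the resource exposes a `headers` mapping when it came from an HTTP server (as Python 3 urllib responses do), and it uses email.utils from the standard library to parse the date. FakeResponse is a hypothetical stand-in used only for the demonstration.

```python
import email.utils
import io
import time
from email.message import Message


def get_etag(resource):
    """Return the ETag header value, or None if there is none."""
    headers = getattr(resource, "headers", None)
    if headers is None:
        return None  # not an HTTP response (e.g. a local file or string)
    return headers.get("ETag")


def get_modified(resource):
    """Return Last-Modified as a 9-tuple in GMT, or None if unavailable."""
    headers = getattr(resource, "headers", None)
    if headers is None:
        return None
    value = headers.get("Last-Modified")
    if value is None:
        return None
    parsed = email.utils.parsedate_tz(value)
    if parsed is None:
        return None  # header present but not a recognizable date
    # Normalize to GMT and to the same shape that time.gmtime() returns.
    return tuple(time.gmtime(email.utils.mktime_tz(parsed)))


class FakeResponse:
    """Hypothetical stand-in for an HTTP response object."""

    def __init__(self):
        self.headers = Message()
        self.headers["ETag"] = '"abc123"'
        self.headers["Last-Modified"] = "Sun, 06 Nov 1994 08:49:37 GMT"


print(get_etag(FakeResponse()))
print(get_modified(FakeResponse()))
print(get_etag(io.StringIO("<rss/>")))  # non-HTTP source
```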

format_http_date(date)

Formats a tuple of 9 integers into an RFC 1123-compliant timestamp as required in RFC 2616. We don't use time.strftime() since the %a and %b directives can be affected by the current locale (HTTP dates have to be in English). The date MUST be in GMT (Greenwich Mean Time).
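A locale-independent formatter along the lines described above might look like the following. This is a minimal sketch rather than the module's source: the lists reuse the names short_weekdays and months from the Variables section but are redefined here so the example is self-contained, and the month list is written out in full as an assumption since the documented value is truncated.

```python
import time

# English names are hard-coded so the result never depends on the locale.
short_weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def format_http_date(date):
    """Format a 9-tuple (GMT, as from time.gmtime()) as an RFC 1123 date."""
    year, month, day, hour, minute, second, weekday = date[:7]
    return "%s, %02d %s %04d %02d:%02d:%02d GMT" % (
        short_weekdays[weekday],   # tm_wday: Monday is 0
        day,
        months[month - 1],         # tm_mon: January is 1
        year, hour, minute, second,
    )


print(format_http_date((1994, 11, 6, 8, 49, 37, 6, 310, 0)))
# -> Sun, 06 Nov 1994 08:49:37 GMT
print(format_http_date(time.gmtime(0)))
# -> Thu, 01 Jan 1970 00:00:00 GMT
```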

rfc1123_match(...)

match(string[, pos[, endpos]]) --> match object or None. Matches zero or more characters at the beginning of the string.

rfc850_match(...)

match(string[, pos[, endpos]]) --> match object or None. Matches zero or more characters at the beginning of the string.

asctime_match(...)

match(string[, pos[, endpos]]) --> match object or None. Matches zero or more characters at the beginning of the string.

parse_http_date(date)

Parses any of the three HTTP date formats into a tuple of 9 integers as returned by time.gmtime(). This should not use time.strptime() since that function is not available on all platforms and could also be affected by the current locale.
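One way to parse the three HTTP date formats without time.strptime() is with one regular expression per format, in the spirit of the rfc1123_match, rfc850_match and asctime_match helpers listed above. The patterns below are illustrative assumptions, not the module's actual expressions. Returning None for unrecognized input and the pivot used for two-digit RFC 850 years are also assumptions of this sketch.

```python
import calendar
import re
import time

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
_MON = "(?P<mon>" + "|".join(MONTHS) + ")"
_TIME = r"(?P<hour>\d{2}):(?P<min>\d{2}):(?P<sec>\d{2})"

# Sun, 06 Nov 1994 08:49:37 GMT
RFC1123 = re.compile(
    r"[A-Z][a-z]{2}, (?P<day>\d{2}) " + _MON + r" (?P<year>\d{4}) " + _TIME + r" GMT$")
# Sunday, 06-Nov-94 08:49:37 GMT
RFC850 = re.compile(
    r"[A-Z][a-z]+, (?P<day>\d{2})-" + _MON + r"-(?P<year>\d{2}) " + _TIME + r" GMT$")
# Sun Nov  6 08:49:37 1994
ASCTIME = re.compile(
    r"[A-Z][a-z]{2} " + _MON + r" (?P<day>[ \d]\d) " + _TIME + r" (?P<year>\d{4})$")


def parse_http_date(date):
    """Return a 9-tuple like time.gmtime(), or None if nothing matches."""
    text = date.strip()
    match = None
    for pattern in (RFC1123, RFC850, ASCTIME):
        match = pattern.match(text)
        if match:
            break
    if match is None:
        return None
    year = int(match.group("year"))
    if year < 100:
        # Assumption: two-digit years below 70 belong to the 2000s.
        year += 2000 if year < 70 else 1900
    fields = (year, MONTHS.index(match.group("mon")) + 1,
              int(match.group("day")), int(match.group("hour")),
              int(match.group("min")), int(match.group("sec")), 0, 0, 0)
    # Round-trip through a timestamp so weekday and day-of-year are filled in.
    return tuple(time.gmtime(calendar.timegm(fields)))


for sample in ("Sun, 06 Nov 1994 08:49:37 GMT",
               "Sunday, 06-Nov-94 08:49:37 GMT",
               "Sun Nov  6 08:49:37 1994"):
    print(parse_http_date(sample))
```

All three sample strings denote the same instant, so each call should yield the same tuple. Month names are looked up in a fixed English list, which keeps the parser independent of the current locale.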


Variables Details

__history__

Value:
'''
1.0 - 9/27/2002 - MAP - fixed namespace processing on prefixed RSS 2.0\
 elements,
  added Simon Fell\'s test suite
1.1 - 9/29/2002 - MAP - fixed infinite loop on incomplete CDATA sectio\
ns
2.0 - 10/19/2002
  JD - use inchannel to watch out for image and textinput elements whi\
...

USER_AGENT

Value:
'UltraLiberalRSSParser/2.3.1 +http://diveintomark.org/projects/rss_par\
ser/'

short_weekdays

Value:
['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

long_weekdays

Value:
['Monday',
 'Tuesday',
 'Wednesday',
 'Thursday',
 'Friday',
 'Saturday',
 'Sunday']

months

Value:
['Jan',
 'Feb',
 'Mar',
 'Apr',
 'May',
 'Jun',
 'Jul',
 'Aug',
...

TEST_SUITE

Value:
('http://www.pocketsoap.com/rssTests/rss1.0withModules.xml',
 'http://www.pocketsoap.com/rssTests/rss1.0withModulesNoDefNS.xml',
 'http://www.pocketsoap.com/rssTests/rss1.0withModulesNoDefNSLocalName\
Clash.xml',
 'http://www.pocketsoap.com/rssTests/rss2.0noNSwithModules.xml',
 'http://www.pocketsoap.com/rssTests/rss2.0noNSwithModulesLocalNameCla\
sh.xml',
 'http://www.pocketsoap.com/rssTests/rss2.0NSwithModules.xml',
...