Using Regular Expressions

Using regular expressions to parse feeds may seem a little brutish, but it has two advantages. First, it sidesteps the differences between the feed standards, because the patterns look only for the handful of elements they need. Second, it is much easier to install: it requires no XML parsing modules or any of their dependencies.
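As a rough illustration of both points, here is a minimal sketch that pulls item titles and links out of a feed using nothing but core Perl. The sample feed, the variable names, and the patterns are assumptions made up for this illustration, not code from any particular aggregator, and the patterns assume simple markup with no CDATA sections and no attributes on the title or link elements.

```perl
use strict;
use warnings;

# Hypothetical sample feed, inlined for illustration only
my $rss = <<'END_RSS';
<rss version="2.0">
<channel>
<title>Example Feed</title>
<link>http://example.org/</link>
<item>
<title>First post</title>
<link>http://example.org/1</link>
</item>
<item rdf:about="http://example.org/2">
<title>Second post</title>
<link>http://example.org/2</link>
</item>
</channel>
</rss>
END_RSS

# Collect each item's title and link. The \b stops the pattern from
# matching <items>, and [^>]* tolerates attributes on the <item> tag.
my @items;
while ( $rss =~ m{<item\b[^>]*>(.*?)</item>}gs ) {
    my $body = $1;    # save the capture before matching again
    my ($title) = $body =~ m{<title>(.*?)</title>}s;
    my ($link)  = $body =~ m{<link>(.*?)</link>}s;
    push @items, {
        title => defined $title ? $title : '',
        link  => defined $link  ? $link  : '',
    };
}

print "$_->{title} <$_->{link}>\n" for @items;
```

Because the item pattern ignores whatever attributes the opening tag carries, the same loop accepts both the plain item and the one with an rdf:about attribute. The cost is that anything the patterns do not anticipate is silently skipped.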

Regular expressions, however, aren’t pretty. Consider Example 8-7, which is a section from Rael Dornfest’s lightweight RSS aggregator, Blagg.

Example 8-7. A section of code from Blagg
# Feed's title and link
my($f_title, $f_link) = ($rss =~ m#<title>(.*?)</title>.*?<link>(.*?)</link>#ms);

   
# RSS items' title, link, and description
while ( $rss =~ m{<item(?!s).*?>.*?(?:<title>(.*?)</title>.*?)?(?:<link>(.*?)</link>.*?)?(?:<description>(.*?)</description>.*?)?</item>}mgis ) {
     my($i_title, $i_link, $i_desc, $i_fn) = ($1||'', $2||'', $3||'', undef);
   
     # Unescape &amp; &lt; &gt; to produce useful HTML
     my %unescape = ('&lt;'=>'<', '&gt;'=>'>', '&amp;'=>'&', '&quot;'=>'"');

     my $unescape_re = join '|' => keys %unescape;
     $i_title && $i_title =~ s/($unescape_re)/$unescape{$1}/g;
     $i_desc && $i_desc =~ s/($unescape_re)/$unescape{$1}/g;
   
     # If no title, use the first 50 non-markup characters of the description
     unless ($i_title) {
          $i_title = $i_desc;
          $i_title =~ s/<.*?>//msg;
          $i_title = substr($i_title, 0, 50);
     }
     next unless $i_title;

Although this looks pretty nasty, it is actually an efficient way of pulling the data out of the RSS file, even if it is potentially much harder to extend. If ...
