Converting HTML to ASCII

Problem

You want to convert an HTML file into formatted plain ASCII.

Solution

If you have an external formatter such as lynx installed, call it from your program:

$ascii = `lynx -dump $filename`;
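
The backtick form hands the whole command line to the shell, so a filename containing spaces or shell metacharacters can be misread. A minimal sketch of one alternative, assuming a Perl recent enough to support the list form of a piped open: the helper name capture_output is hypothetical and not part of the original recipe.

```perl
use strict;
use warnings;

# Run a command without going through the shell and return its output.
# The list form of open() passes each argument to the program as-is,
# so spaces or metacharacters in a filename are not interpreted.
sub capture_output {
    my @cmd = @_;
    open(my $fh, '-|', @cmd) or die "Can't run $cmd[0]: $!";
    local $/;                 # slurp mode: read everything at once
    my $output = <$fh>;
    close($fh) or die "$cmd[0] exited with status $?";
    return defined $output ? $output : '';
}

# Hypothetical usage, assuming lynx is installed and $filename is set:
# my $ascii = capture_output('lynx', '-dump', $filename);
```

The helper takes the command as a list, so it can be tried with any program before pointing it at lynx.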

If you want to do it within your program and don’t care about the things that the HTML::FormatText formatter doesn’t yet handle (tables and frames):

use HTML::FormatText;
use HTML::Parse;

$html = parse_htmlfile($filename);
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
$ascii = $formatter->format($html);

Discussion

Both examples assume the HTML is in a file. If it is in a variable instead, you need to write it to a file for lynx to read. With HTML::FormatText, parse the variable’s contents using the HTML::TreeBuilder module:

use HTML::TreeBuilder;
use HTML::FormatText;

$html = HTML::TreeBuilder->new();
$html->parse($document);
$html->eof();    # signal end of input so the parser finishes building the tree

$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);

$ascii = $formatter->format($html);
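
For the lynx route, the step of writing the variable out to a file could look something like the following. This is a sketch under assumptions, not part of the original recipe: the helper name html_to_tempfile is hypothetical, and it relies on the File::Temp module to pick a safe temporary filename.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write an HTML string to a temporary file and return the file's name.
# The .html suffix is there so that a tool guessing the type from the
# extension treats the file as HTML. UNLINK => 1 asks File::Temp to
# delete the file when the program exits.
sub html_to_tempfile {
    my ($document) = @_;
    my ($fh, $tmpname) = tempfile(SUFFIX => '.html', UNLINK => 1);
    print {$fh} $document;
    close($fh) or die "Can't close $tmpname: $!";
    return $tmpname;
}

# Hypothetical usage, assuming lynx is installed and $document holds HTML:
# my $tmpname = html_to_tempfile($document);
# my $ascii   = `lynx -dump $tmpname`;
```

Closing the filehandle before calling the external program makes sure the data has been flushed to disk by the time lynx opens the file.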

If you use Netscape, its “Save As” option with the type set to “Text” does the best job with tables.

See Also

The documentation for the CPAN modules HTML::Parse, HTML::TreeBuilder, and HTML::FormatText; your system’s lynx(1) manpage; Section 20.6
