Converting HTML to ASCII
Problem
You want to convert an HTML file into formatted plain ASCII.
Solution
If you have an external formatter like lynx, call an external program:
$ascii = `lynx -dump $filename`;
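The backtick form passes $filename through the shell, so a name containing spaces or shell metacharacters can be misread. A minimal sketch of one way around that, assuming a Unix-like system and a Perl recent enough to support the list form of a pipe open; the helper name dump_file and its optional command argument are hypothetical, not part of the recipe:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the recipe): run a dump command on a file
# without involving the shell. Defaults to "lynx -dump".
sub dump_file {
    my ($filename, @cmd) = @_;
    @cmd = ('lynx', '-dump') unless @cmd;

    # List-form pipe open: each argument reaches the program as-is,
    # with no shell interpretation of the filename.
    open(my $fh, '-|', @cmd, $filename)
        or die "Cannot run $cmd[0]: $!";
    my $ascii = do { local $/; <$fh> };   # slurp all of the output
    close($fh)
        or die "$cmd[0] failed (exit status " . ($? >> 8) . ")";
    return $ascii;
}
```

Calling dump_file($filename) should then behave like the backtick version, except that it dies if the command cannot be run or exits with an error.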
If you want to do it within your program and don’t care about the things that the HTML::FormatText formatter doesn’t yet handle (tables and frames):
use HTML::FormatText;
use HTML::Parse;

$html = parse_htmlfile($filename);
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
$ascii = $formatter->format($html);
Discussion
These examples both assume you have the HTML text in a file. If your HTML is in a variable, you need to write it to a file for lynx to read. If you are using HTML::FormatText, use the HTML::TreeBuilder module:
use HTML::TreeBuilder;
use HTML::FormatText;

$html = HTML::TreeBuilder->new();
$html->parse($document);
$html->eof;    # tell the parser the document is complete

$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
$ascii = $formatter->format($html);
If you use Netscape, its “Save As” option with the type set to “Text” does the best job with tables.
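For the lynx route when the HTML is held in a variable, the text first has to be written to a file. A minimal sketch using the core File::Temp module; the helper name write_html_tempfile is hypothetical, and the .html suffix is used on the assumption that lynx relies on the file extension to recognize a local file as HTML:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not from the recipe): save an HTML string to a
# temporary file and return the file's name.
sub write_html_tempfile {
    my ($document) = @_;

    # UNLINK => 1 asks File::Temp to delete the file when the program exits.
    my ($fh, $tmpname) = tempfile(SUFFIX => '.html', UNLINK => 1);
    print {$fh} $document;
    close($fh) or die "Cannot close $tmpname: $!";
    return $tmpname;
}
```

The returned name can then be handed to lynx just as in the Solution, for example with $tmpname = write_html_tempfile($document) followed by $ascii = `lynx -dump $tmpname`.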
See Also
The documentation for the CPAN modules HTML::Parse, HTML::TreeBuilder, and HTML::FormatText; your system’s lynx(1) manpage; Section 20.6