How to do it...

  1. Start by downloading the dataset using the following command:
curl -L -O http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
  2. Decompress the bzip2 archive:
bunzip2 enwiki-latest-pages-articles-multistream.xml.bz2

This creates an uncompressed file named enwiki-latest-pages-articles-multistream.xml, which is about 56 GB.
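Note that bunzip2 deletes the .bz2 archive after decompressing by default; if you want to keep the compressed copy as well (for example, to avoid re-downloading 20+ GB), pass the -k flag. A minimal demonstration on a small throwaway file (sample.txt is just an illustration, not part of the dataset):

```shell
# Create a small sample file and compress it
echo "sample data" > sample.txt
bzip2 sample.txt               # produces sample.txt.bz2 and removes sample.txt

# Decompress while keeping the .bz2 archive (-k = keep)
bunzip2 -k sample.txt.bz2      # restores sample.txt, sample.txt.bz2 remains

ls sample.txt sample.txt.bz2   # both files are now present
```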

  3. Take a look at the first few lines of the Wikipedia XML file:
head -n50 enwiki-latest-pages-articles-multistream.xml

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" ...
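Because the uncompressed dump is far too large to load into memory, any inspection beyond `head` should stream the XML rather than parse it whole. The sketch below uses Python's standard-library `xml.etree.ElementTree.iterparse` to pull page titles out of a miniature inline sample; the sample document and its titles are hypothetical, but they follow the same `<mediawiki>`/`<page>`/`<title>` layout and namespace shown in the real dump above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature dump mimicking the MediaWiki export structure;
# the real 56 GB file nests <title> inside <page> the same way.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Anarchism</title></page>
  <page><title>Autism</title></page>
</mediawiki>"""

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_titles(stream):
    # iterparse reads the document incrementally, so memory use stays
    # flat regardless of file size
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == NS + "page":
            yield elem.findtext(NS + "title")
            elem.clear()  # discard the subtree we just processed

titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)  # ['Anarchism', 'Autism']
```

For the real file, replace `io.BytesIO(SAMPLE)` with `open("enwiki-latest-pages-articles-multistream.xml", "rb")`.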
