- You can start by downloading the dataset with the following command:
curl -L -O http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
- Next, decompress the bzip2 archive:
bunzip2 enwiki-latest-pages-articles-multistream.xml.bz2
This creates an uncompressed file named enwiki-latest-pages-articles-multistream.xml, which is about 56 GB.
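If you would rather not spend the disk space on the full uncompressed dump, you can also read the compressed file directly. Below is a minimal Python sketch using the standard-library bz2 module; the file name is assumed to match the download above, and the line limit is just for a quick peek.
import bz2

# Stream the compressed dump line by line without decompressing it to disk.
# Assumes the .bz2 file downloaded above is in the current directory.
path = "enwiki-latest-pages-articles-multistream.xml.bz2"

with bz2.open(path, mode="rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(line.rstrip())
        if i >= 20:  # only peek at the first few lines
            break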
- Let us take a look at the Wikipedia XML file:
head -n50 enwiki-latest-pages-articles-multistream.xml
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" ...
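To get a feel for the structure, here is a minimal Python sketch that streams the decompressed XML with xml.etree.ElementTree.iterparse and prints the first few page titles. The namespace URL matches the export-0.10 schema shown in the output above; the file name and the limit of 10 titles are assumptions for illustration.
import xml.etree.ElementTree as ET

# Stream <page> elements from the decompressed dump and print a few titles.
# Assumes enwiki-latest-pages-articles-multistream.xml from the steps above.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"
path = "enwiki-latest-pages-articles-multistream.xml"

count = 0
for event, elem in ET.iterparse(path, events=("end",)):
    if elem.tag == NS + "page":
        title = elem.find(NS + "title")
        if title is not None:
            print(title.text)
        elem.clear()  # free memory as we go; the file is huge
        count += 1
        if count >= 10:
            break
Iterating with iterparse and clearing each element after use keeps memory flat, which matters for a dump of this size.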