Downloading a complete dump of Wikipedia for a real-life Spark ML project

In this recipe, we will download and explore a dump of Wikipedia so that we can work with a real-life example. The dataset is a compressed dump of Wikipedia articles, about 13.6 GB at the time of writing. You can retrieve it either with the curl command-line tool or with a browser; given the file size, we recommend curl, since it can resume an interrupted transfer.
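One way to fetch the dump with curl is sketched below. The URL follows Wikimedia's standard `latest` naming convention (an assumption on our part; confirm the exact file name against the current listing at dumps.wikimedia.org before downloading), and the `-C -` flag lets curl resume a partial transfer:

```shell
# Standard Wikimedia "latest" dump location (assumed; verify the exact file
# name at https://dumps.wikimedia.org/enwiki/ before downloading).
DUMP_URL="https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

# -L follows redirects, -O keeps the remote file name, and -C - resumes a
# partial download -- important for a file of roughly 13.6 GB.
# The command is echoed here rather than executed, since the transfer can take
# hours; drop the echo to actually run it.
echo curl -L -O -C - "$DUMP_URL"
```

Once the download finishes, the `.bz2` file can be decompressed with `bzip2 -d` (or read directly by tools that understand bzip2) before loading it into Spark.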
