O'Reilly logo

Building Machine Learning Systems with Python - Second Edition by Luis Pedro Coelho, Willi Richert

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Fetching the data

Luckily for us, the team behind StackOverflow provides most of the data behind the StackExchange universe to which StackOverflow belongs under a cc-wiki license. At the time of writing this book, the latest data dump can be found at https://archive.org/details/stackexchange. It contains data dumps of all Q&A sites of the StackExchange family. For StackOverflow, you will find multiple files, of which we only need the stackoverflow.com-Posts.7z file, which is 5.2 GB.

After downloading and extracting it, we have around 26 GB of data in the format of XML, containing all questions and answers as individual row tags within the root tag posts:

<?xml version="1.0" encoding="utf-8"?>
<posts>
...
 <row Id="4572748" PostTypeId="2" ParentId="4568987" ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required