Practical Data Cleaning with Python
Tips and tools for data janitors
It's a commonly cited statistic that data scientists spend roughly 80% of their time processing, wrangling, and munging their data and only 20% actually analyzing it. Reducing the time you spend cleaning your data, even by a small amount, can lead to valuable gains down the line.
Join expert Katharine Jarmul for a hands-on, in-depth exploration of practical data cleaning with Python, as she highlights the tools that can help speed up the data wrangling process and automate (or at least script) some of the repetitive processes. You’ll get an overview of the best libraries and tools to use when handling messy data and learn how to apply software development practices to data wrangling problems by writing data unit tests, which allow you to catch problems before they create inaccurate data for your entire company. Along the way, you’ll explore a few case studies to see these techniques applied to real-world data problems.
What you'll learn and how you can apply it
By the end of this live, online course, you’ll understand:
- How to determine what messy data problems lie in your datasets
- How to approach writing data unit tests
- The best tools and approaches to use when automating your data cleaning
And you’ll be able to:
- Utilize Python libraries for data cleaning
- Determine what processes lend themselves to automation
- Write data unit tests to validate your workflows
This training course is for you because...
You are a data scientist or data engineer with at least one year of experience who needs to speed up, automate, and validate your reports, notebooks, and pipelines in Python.
Prerequisites:
- An intermediate grasp of Python
- Experience working with data analysis tools in Python
To test whether you will be able to run the Jupyter notebooks in your upcoming training, please:
- Navigate to https://attendee-testing.oreilly-jupyterhub.com (this is the link to the test site)
- Sign in with your Safari credentials
- Click "start my server"
- Click on "notebook .ipynb"
- Run each of the code cells: click the cell, then either press Shift+Return or click the triangle in the top menu
There may be a delay of a few seconds, but you should eventually see the graphs. If you do not, this probably means that your firewall is blocking JupyterHub's websockets. Please turn off your company VPN or ask your system administrator to allow them.
Materials and downloads needed:
- All of the coding exercises in the course will be hosted on JupyterHub, and we'll send the URL out at the start of class. Purely browser-based, no installations required.
However, if you would like to download the files and work on them locally, please follow these steps:
- A machine with a package manager (pip or Anaconda are preferred) installed
- Code, datasets, and package requirements downloaded prior to the training (A link to the repository will be provided prior to training. Katharine will be using Python 3, but examples will attempt to be multilingual.)
- If you would like to follow along, please fork or clone the repository: https://github.com/kjam/data-cleaning-101/ and follow the README to get all necessary libraries installed. Active participation by following along with the code locally is highly encouraged!
About your instructor
Katharine Jarmul is a data analyst and co-founder of KI Protect, a data security company based in Berlin, Germany. She has been wrangling data with Python since 2008 for both small and large companies. Automated data workflows, natural language processing, and data tests are her passions. She is co-author of Data Wrangling with Python and has authored several O'Reilly video courses focused on data analysis with Python.
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
Segment 1: Data Cleaning: A Look at Real World Data Problems (10 min)
- Instructors will show examples or real-world surveys of data cleanliness and begin a conversation about the data cleaning problems we all face as data scientists
- Participants will answer a poll about data problems faced in their work and share other issues they have seen in data in the chat
Break (5 min)
Segment 2: Data Cleanup Tools in Python (80 min, with 10 min break in middle)
- Instructors will introduce Python data tools for deduplication, parsing, string matching, date and measurement parsing, dealing with difficult formats (such as MP3s, PDFs, or websites), and managing nulls in pandas.
- Participants will follow along in Jupyter notebooks, respond to chat prompts, and answer questions
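To give a flavor of the deduplication, string matching, and null handling named in Segment 2, here is a minimal sketch using only the Python standard library. It is not taken from the course materials: the records, the `normalize`, `deduplicate`, and `similarity` helpers, and the choice of `difflib` are all illustrative assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical messy records: inconsistent casing, stray whitespace, a missing value.
records = [
    {"name": "Acme Corp ", "city": "Berlin"},
    {"name": "acme corp", "city": "berlin"},
    {"name": "Globex", "city": None},
]


def normalize(value):
    """Trim and lowercase a string; map None and blank strings to None."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned or None


def deduplicate(rows):
    """Keep the first row seen for each normalized (name, city) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (normalize(row["name"]), normalize(row["city"]))
        if key not in seen:
            seen.add(key)
            unique.append({"name": key[0], "city": key[1]})
    return unique


def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


cleaned = deduplicate(records)
# Rows whose city is still missing after cleaning.
missing_city = [row["name"] for row in cleaned if row["city"] is None]
```

A ratio from `similarity` can serve as a rough signal for near-duplicates that exact matching misses, such as "acme corp" versus "acme corp.".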
Break (5 min)
Segment 3: Automation and data cleanup (35 min)
- Instructors will introduce ways to automate data cleanup including using data pipelines, workflows and DAGs (with Dask).
- Participants will follow along with notebooks and respond in chat
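The idea of expressing cleanup steps as a DAG can be sketched without any third-party library. The example below is an assumption-laden illustration, not the course's Dask code: it uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) in place of Dask, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def load():
    # Hypothetical raw input with whitespace and a null.
    return [" 3", "4 ", None, "5"]


def drop_nulls(rows):
    return [r for r in rows if r is not None]


def parse(rows):
    return [int(r.strip()) for r in rows]


def summarize(rows):
    return sum(rows)


# Each task maps to (function, names of the tasks whose output it consumes).
tasks = {
    "load": (load, []),
    "drop_nulls": (drop_nulls, ["load"]),
    "parse": (parse, ["drop_nulls"]),
    "summarize": (summarize, ["parse"]),
}


def run_pipeline(tasks):
    """Run every task after its dependencies and collect the results by name."""
    graph = {name: set(deps) for name, (_, deps) in tasks.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = tasks[name]
        results[name] = func(*(results[d] for d in deps))
    return results


results = run_pipeline(tasks)
```

Declaring dependencies explicitly lets the execution order be derived from the graph, which is the same property a DAG-based scheduler relies on.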
Break (5 min)
Segment 4: Case Study: Web scraped data (30 min)
- Instructors will present a real-world case study of messy data and walk through how to begin cleaning it, looking particularly at how to document and automate the cleanup
- Participants will also begin exploring the data. Part of the optional “homework” will be to continue this analysis
Segment 5: Day one wrap-up (20 min)
- Instructors will present future research and new pursuits in data automation and cleaning
- Participants will have a chance for Q&A and survey for day one
Segment 6: Statistical Models and Data Inconsistency (30 min)
- Instructors will demonstrate building statistical models which show data inconsistency and elaborate on how this relates to data testing and validation
- Participants will follow along and complete a quiz to begin the data validation topic.
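As one simple stand-in for a statistical check that surfaces inconsistent data, the sketch below flags outliers with the modified z-score based on the median absolute deviation. This technique, the `flag_outliers` helper, the sample readings, and the conventional 3.5 threshold are assumptions chosen for illustration; the course may use different models.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation from the median.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median, so this score cannot be computed.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical sensor readings with one suspicious entry.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 101.0]
suspicious = flag_outliers(readings)
```

Because the median and MAD are robust to extreme values, a single bad reading does not mask itself the way it can with a mean and standard deviation.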
Segment 7: Writing Data Unit Tests (80 min, with 10 min break in middle)
- Instructors will demonstrate how to write data unit tests that help determine data validation and programming errors with a focus on creating automatable, reusable and simple tests. Libraries used will include engarde, mypy, faker, hypothesis and tdda.
- Participants will follow along with code from the repository, ask questions in chat, and apply problem-solving immediately by practicing alongside the instructor
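A data unit test, in its simplest form, is an assertion about the data itself rather than about the code. The sketch below uses plain Python assertions instead of the libraries listed in Segment 7; the `check_rows` helper, the `id` and `age` fields, and the 0 to 120 range are hypothetical choices made for illustration.

```python
def check_rows(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Every row needs a unique, non-null id.
        if row.get("id") is None:
            problems.append(f"row {i}: missing id")
        elif row["id"] in seen_ids:
            problems.append(f"row {i}: duplicate id {row['id']}")
        else:
            seen_ids.add(row["id"])
        # Age must be an integer in a plausible range.
        age = row.get("age")
        if not isinstance(age, int) or not 0 <= age <= 120:
            problems.append(f"row {i}: age out of range: {age!r}")
    return problems


def test_clean_rows_pass():
    assert check_rows([{"id": 1, "age": 30}, {"id": 2, "age": 41}]) == []


def test_bad_rows_are_reported():
    problems = check_rows([{"id": 1, "age": 30}, {"id": 1, "age": -5}])
    assert len(problems) == 2
```

Functions named `test_*` like these can be collected by a test runner such as pytest, so the same checks can run on every new batch of data.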
Break (5 min)
Segment 8: Implementing Data Unit Tests in your system (20 min)
- Instructors will cover problems of integrating data unit tests into particular frameworks and cover best practices for fitting tests to your system
- Participants will answer a few questions about the problems they have faced with implementing data unit tests and validation in their setup, and continue asking questions in the shared chat
Segment 9: Case Study: Data Validation (30 min)
- Instructors will cover a case study featuring data validation problems and apply the examples learned in Segment 7 to a new problem set
- Participants will follow along by writing code. They will also have some extra assignments for this to further their learning after the course is finished.
Break (5 min)
Segment 10: Day two wrap-up (20 min)
- Instructors will present future research and new pursuits in data unit tests
- Participants will have a chance for Q&A and survey for day two