Chapter 13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation

The contributor for this chapter is Claudia Perlich. Claudia has been the Chief Scientist at Media 6 Degrees (M6D) for the past few years. Before that she was in the data analytics group at the IBM center that developed Watson, the computer that won Jeopardy! (although she didn’t work on that project). Claudia holds a master’s in computer science and earned her PhD in information systems at NYU. She now teaches a data science class for business students, where she addresses how to assess data science work and how to manage data scientists.

Claudia is also a famously successful data mining competition winner. She won the KDD Cup in 2003, 2007, 2008, and 2009, the ILP Challenge in 2005, the INFORMS Challenge in 2008, and the Kaggle HIV competition in 2010.

More recently she’s turned to organizing data mining competitions, first the INFORMS Challenge in 2009, and then the Heritage Health Prize in 2011. Claudia claims to be retired from competition. Fortunately for the class, she provided some great insights into what can be learned from data competitions. From the many competitions she’s entered, she’s learned a great deal, particularly about data leakage and about how to evaluate the models she builds for them.

Claudia’s Data Scientist Profile

Claudia started by asking what people’s reference point might be to evaluate where they stand with their own data science profile ...
