Introduction

I.1. Natural Language Processing and manual annotation: Dr Jekyll and Mr Hyde?

I.1.1. Where linguistics hides

Natural Language Processing (NLP) has witnessed two major evolutions in the past 25 years: first, the extraordinary success of machine learning, which is now, for better or for worse (for an enlightening analysis of the phenomenon see [CHU 11]), overwhelmingly dominant in the field, and second, the multiplication of evaluation campaigns or shared tasks. Both involve manually annotated corpora, for the training and evaluation of the systems (see Figure I.1).

These corpora progressively became the hidden pillars of our domain, providing food for our hungry machine learning algorithms and reference for evaluation. Annotation is now the place where linguistics hides in NLP.

However, manual annotation was largely ignored for quite a while, and it took some time even for annotation guidelines to be recognized as essential [NÉD 06]. When the performance of systems began to stall, manual annotation finally started to generate some interest in the community, as potential leverage for improving results [HOV 10, PUS 12].

This is all the more important given that systems trained on badly annotated corpora have been shown to underperform. In particular, they tend to reproduce annotation errors when these errors follow a regular pattern rather than corresponding to simple noise [REI 08]. Furthermore, the quality of manual annotation is crucial when it is ...
