Natural Language Annotation for Machine Learning

Chapter 4. Building Your Model and Specification

Now that you’ve defined your goal and collected a relevant dataset, you need to create the model for your task. But what do we mean by “model”? Basically, the model is the practical representation of your goal: a description of your task that defines the classifications and terms that are relevant to your project. You can also think of it as the aspects of your task that you want to capture within your dataset. These classifications can be represented by metadata, labels that are applied to the text of your corpus, and/or relationships between labels or metadata. In this chapter, we will address the following questions:

The model is captured by a specification, or spec. But what does a spec look like?
You have the goals for your annotation project. Where do you start? How do you turn a goal into a model?
What form should your model take? Are there standardized ways to structure the phenomena?
How do you take someone else’s standard and use it to create a specification?
What do you do if there are no existing specifications, definitions, or standards for the kinds of phenomena you are trying to identify and model?
How do you determine when a feature in your description is an element in the spec versus an attribute on an element?

The spec is the concrete representation of your model. So, whereas the model is an abstract idea of what information you want your annotation to capture, and the interpretation of that information, the spec turns those abstract ideas into tags and attributes that will be applied to your corpus.

Some Example Models and Specs

Recall from Chapter 1 that the first part in the MATTER cycle involves creating a model for the task at hand. We introduced a model as a triple, M = <T,R,I>, consisting of a vocabulary of terms, T, the relations between these terms, R, and their interpretation, I. However, this is a pretty high-level description of what a model is. So, before we discuss more theoretical aspects of models, let’s look at some examples of annotation tasks and see what the models for those look like.

For the most part, we’ll be using XML DTD (Document Type Definition) representations. XML is becoming the standard for representing annotation data, and DTDs are the simplest way to show an overview of the type of information that will be marked up in a document. The next few sections will go through what the DTDs for different models will look like, so you can see how the different elements of an annotation task can be translated into XML-compliant forms.

A DTD is a set of declarations containing the basic building blocks that allow an XML document to be validated. DTDs have been covered in depth in other books (O’Reilly’s Learning XML and XML in a Nutshell) and websites (W3schools.com), so we’ll give a short overview here.

Essentially, the DTD defines what the structure of an XML document will be by defining what tags will be used inside the document and what attributes those tags will have. By having a DTD, the XML in a file can be validated to ensure that the formatting is correct.

So what do we mean by tags and attributes? Let’s take a really basic example: web pages and HTML. If you’ve ever made a website and edited some code by hand, you’re familiar with elements such as  and  . These are tags that tell a program reading the HTML that the text in between  and  should be bold, and that   indicates a newline should be included when the text is displayed. Annotation tasks use similar formatting, but they define their own tags based on what information is considered important for the goal being pursued. So an annotation task that is based on marking the parts of speech in a text might have tags such as <noun>, <verb>, <adj>, and so on. In a DTD, these tags would be defined like this:

<!ELEMENT noun ( #PCDATA ) >
<!ELEMENT verb ( #PCDATA ) >
<!ELEMENT adj ( #PCDATA ) >

The string !ELEMENT indicates that the information contained between the < and > is about an element (also known as a “tag”), and the word following it is the name of that tag (noun, verb, adj). The ( #PCDATA ) indicates that the information between the <noun> and </noun> tags will be parsable character data (other flags instead of #PCDATA can be used to provide other information about a tag, but for this book, we’re not going to worry about them).

By declaring the three tags in a DTD, we can have a valid XML document that has nouns, verbs, and adjectives all marked up. However, annotation tasks often require more information about a piece of text than just its type. This is where attributes come in. For example, knowing that a word is a verb is useful, but it’s even more useful to know the tense of the verb—past, present, or future. This can be done by adding an attribute to a tag, which looks like this:

<!ELEMENT verb ( #PCDATA ) >
<!ATTLIST verb tense ( past | present | future | none ) #IMPLIED >

The !ATTLIST line declares that an attribute called tense is being added to the verb element, and that it has four possible values: past, present, future, and none. The #IMPLIED shows that the information in the attribute isn’t required for the XML to be valid (again, don’t worry too much about this for now). Now you can have a verb tag that looks like this:

<verb tense="present">

You can also create attributes that allow annotators to put in their own information, by declaring the attribute’s type to be CDATA instead of a list of options, like this:

<!ELEMENT verb ( #PCDATA ) >
<!ATTLIST verb tense CDATA #IMPLIED >

One last type of element that is commonly used in annotation is a linking element, or a link tag. These tags are used to show relationships between other parts of the data that have been marked up with tags. For instance, if the part-of-speech (POS) task also wanted to show the relationship between a verb and the noun that performed the action described by the verb, the annotation model might include a link tag called performs, like so:

<!ELEMENT performs EMPTY >
<!ATTLIST performs fromID IDREF >
<!ATTLIST performs toID IDREF >

The EMPTY in this element tag indicates that the tag will not be applied to any of the text itself, but rather is being used to provide other information about the text. Normally in HTML an empty tag would be something like the   tag, or another tag that stands on its own. In annotation tasks, an empty tag is used to create paths between other, contentful tags.

In a model, it is almost always important to keep track of the order (or arity) of the elements involved in the linking relationship. We do this here by using two elements that have the type IDREF, meaning they will refer to other annotated extents or elements in the text by identifiable elements.

We’ll talk more about the IDs and the relationship between DTDs and annotated data in Chapter 5, but for now, this should give you enough information to understand the examples provided in this chapter.

Note

There are other formats that can be used to specify specs for a model. XML schema are sometimes used to create a more complex representation of the tags being used, as is the Backus–Naur Form. However, these formats are more complex than DTDs, and aren’t generally necessary to use unless you are using a particular piece of annotation software, or want to have a more restrictive spec. For the sake of simplicity, we will use only DTD examples in this book.

Film Genre Classification

A common task in Natural Language Processing (NLP) and machine learning is classifying documents into categories; for example, using film reviews or summaries to determine the genre of the film being described. If you have a goal of being able to use machine learning to identify the genre of a movie from the movie summary or review, then a corresponding model could be that you want to label the summary with all the genres that the movie applies to, in order to feed those labels into a classifier and train it to identify relevant parts of the document. To turn that model into a spec, you need to think about what that sort of label would look like, presumably in a DTD format.

The easiest way to create a spec for a classification task is to simply create a tag that captures the information you need for your goal and model. In this case, you could create a tag called genre that has an attribute called label, where label holds the values that can be assigned to the movie summary. The simplest incarnation of this spec would be this:

<!ELEMENT genre ( #PCDATA ) >
<!ATTLIST genre label CDATA #IMPLIED >

This DTD has the required tag and attribute, and allows for any information to be added to the label attribute. Functionally for annotation purposes, this means the annotator would be responsible for filling in the genres that she thinks apply to the text. Of course, a large number of genre terms have been used, and not everyone will agree on what a “standard” list of genres should be—for example, are “fantasy” and “sci-fi” different genres, or should they be grouped into the same category? Are “mystery” films different from “noir”? Because the list of genres will vary from person to person, it might be better if your DTD specified a list of genres that annotators could choose from, like this:

<!ELEMENT genre ( #PCDATA ) > 
<!ATTLIST genre label ( Action | Adventure | Animation | Biography | Comedy |
  Crime | Documentary | Drama | Family | Fantasy | Film-Noir | Game-Show | 
  History | Horror | Music | Musical | Mystery | News | Reality-TV | Romance | 
  Sci-Fi | Sport | Talk-Show | Thriller | War | Western ) >

The list in the label attribute is taken from IMDb’s list of genres. Naturally, since other genre lists exist (e.g., Netflix also has a list of genres), you would want to choose the one that best matches your task, or create your own list. As you go through the process of annotation and the rest of the MATTER cycle, you’ll find places where your model/spec needs to be revised in order to get the results you want. This is perfectly normal, even for tasks that seem as straightforward as putting genre labels on movie summaries—annotator opinions can vary, even when the task is as clearly defined as you can make it. And computer algorithms don’t really think and interpret the way that people do, so even when you get past the annotation phase, you may still find places where, in order to maximize the correctness of the algorithm, you would have to change your model.

For example, looking at the genre list from IMDb we see that “romance” and “comedy” are two separate genres, and so the summary of a romantic comedy would have to have two labels: romance and comedy. But if, in a significant portion of reviews, those two tags appear together, an algorithm may learn to always associate the two, even when the summary being classified is really a romantic drama or musical comedy. So, you might find it necessary to create a rom-com label to keep your classifier from creating false associations.

In the other direction, there are many historical action movies that take place over very different periods in history, and a machine learning (ML) algorithm may have trouble finding enough common ground between a summary of 300, Braveheart, and Pearl Harbor to create an accurate association with the history genre. In that case, you might find it necessary to add different levels of historical genres, ones that reflect different periods in history, to train a classifier in the most accurate way possible.

Note

If you’re unclear on how the different components of the ML algorithm can be affected by the spec, or why you might need to adapt a model to get better results, don’t worry! For now, just focus on turning your goal into a set of tags, and the rest will come later. But if you really want to know how this works, Chapter 7 has an overview of all the different ways that ML algorithms “learn,” and what it means to train each one.

Adding Named Entities

Of course, reworking the list of genres isn’t the only way to change a model to better fit a task. Another way is to add tags and attributes that will more closely reflect the information that’s relevant to your goal. In the case of the movie summaries, it might be useful to keep track of some of the Named Entities (NEs) that appear in the summaries that may give insight into the genre of the film. An NE is an entity (an object in the world) that has a name which uniquely identifies it by name, nickname, abbreviation, and so on. “O’Reilly,” “Brandeis University,” “Mount Hood,” “IBM,” and “Vice President” are all examples of NEs. In the movie genre task, it might be helpful to keep track of NEs such as film titles, directors, writers, actors, and characters that are mentioned in the summaries.

You can see from the list in the preceding paragraph that there are many different NEs in the model that we would like to capture. Because the model is abstract, the practical application of these NEs to a spec or DTD has to be decided upon. There are often many ways in which a model can be represented in a DTD, due to the categorical nature of annotation tasks and of XML itself. In this case there are two primary ways in which the spec could be created. We could have a single tag called named_entity with an attribute that would have each of the items from the previous list, like this:

<!ELEMENT named_entity ( #PCDATA ) >
<!ATTLIST named_entity role (film_title | director | 
  writer | actor | character ) >

Or each role could be given its own tag, like this:

<!ELEMENT film_title ( #PCDATA ) >
<!ELEMENT director ( #PCDATA ) >
<!ELEMENT writer ( #PCDATA ) >
<!ELEMENT actor ( #PCDATA ) >
<!ELEMENT character ( #PCDATA ) >

While these two specs seem to be very different, in many ways they are interchangeable. It would not be difficult to take an XML file with the first DTD and change it to one that is compliant with the second. Often the choices that you’ll make about how your spec will represent your model will be influenced by other factors, such as what format is easier for your annotators, or what works better with the annotation software you are using. We’ll talk more about the considerations that go into which formats to use in Chapter 5 and Chapter 6.

By giving ML algorithms more information about the words in the document that are being classified, such as by annotating the NEs, it’s possible to create more accurate representations of what’s going on in the text, and to help the classifier pick out markers that might make the classifications better.

Semantic Roles

Another layer of information that might be useful in examining movie summaries is to annotate the relationships between the NEs that are marked up in the text. These relationships are called semantic roles, and they are used to explicitly show the connections between the elements in a sentence. In this case, it could be helpful to annotate the relationships between actors and characters, and the staff of the movie and which movie they worked on. Consider the following example summary/review:

In Love, Actually, writer/director Richard Curtis weaves a convoluted tale about characters and their relationships. Of particular note is Liam Neeson (Schindler’s List, Star Wars) as Daniel, a man struggling to deal with the death of his wife and the relationship with his young stepson, Sam (Thomas Sangster). Emma Thompson (Sense and Sensibility, Henry V) shines as a middle-aged housewife whose marriage with her husband (played by Alan Rickman) is under siege by a beautiful secretary. While this movie does have its purely comedic moments (primarily presented by Bill Nighy as out-of-date rock star Billy Mack), this movie avoids the more in-your-face comedy that Curtis has presented before as a writer for Blackadder and Mr. Bean, presenting instead a remarkable, gently humorous insight into what love, actually, is.

Using one of the NE DTDs from the preceding section would lead to a number of annotated extents, but due to the density, an algorithm may have difficulty determining who goes with what. By adding semantic role labels such as acts_in, acts_as, directs, writes, and character_in, the relationships between all the NEs will become much clearer.

As with the DTD for the NEs, we are faced with a choice between using a single tag with multiple attribute options:

<!ELEMENT sem_role ( EMPTY ) >
<!ATTLIST sem_role from IDREF >
<!ATTLIST sem_role to IDREF >
<!ATTLIST sem_role label (acts_in | 
  acts_as | directs | writes | character_in ) >

or a tag for each semantic role we wish to capture:

<!ELEMENT acts_in ( EMPTY ) >
<!ATTLIST acts_in from IDREF >
<!ATTLIST acts_in to IDREF >

<!ELEMENT acts_as ( EMPTY ) >
<!ATTLIST acts_as from IDREF >
<!ATTLIST acts_as to IDREF >

<!ELEMENT directs ( EMPTY ) >
<!ATTLIST directs from IDREF >
<!ATTLIST directs to IDREF >

<!ELEMENT writes ( EMPTY ) >
<!ATTLIST writes from IDREF >
<!ATTLIST writes to IDREF >

<!ELEMENT character_in ( EMPTY ) >
<!ATTLIST character_in from IDREF >
<!ATTLIST character_in to IDREF >

You’ll notice that this time, the DTD specifies that each of these elements is EMPTY, meaning that no character data is associated directly with the tag. Remember that linking tags in annotation are usually defined by EMPTY tags specifically because links between elements do not generally have text associated with them in particular, but rather clarify a relationship between two or more other extents. We’ll discuss the application of linking and other types of tags in Chapter 5.

It may be the case that your annotation task requires more than one model to fully capture the data you need. This happens most frequently when a task requires information from two or more very different levels of linguistics, or if information from two different domains needs to be captured. For example, an annotation over a corpus that’s made up of documents that require training to understand, such as clinical notes, scientific papers, or legal documents, may require that annotators have training in those fields, and that the annotation task be tailored to the domain.

In general, employing different annotation models in the same task simply means that more than one MATTER cycle is being worked through at the same time, and that the different models will likely be focused on different aspects of the corpus or language being explored. In these cases, it is important that all the models be coordinated, however, and that changes made to one model during the MATTER cycle don’t cause conflict with the others.

If your corpus is made up of domain-specific documents (such as the clinical notes that we mentioned before), and your annotation task requires that your annotators be able to interpret these documents (e.g., if you are trying to determine which patients have a particular disease), then one of your models may need to be a light annotation task (Stubbs 2012).

A light annotation task is essentially a way to formulate an annotation model that allows a domain expert (such as a doctor) to provide her insight into a text without being required to link her knowledge to one of the layers of linguistic understanding. Such an annotation task might be as simple as having the domain expert indicate whether a file has a particular property (such as whether or not a patient is at risk for diabetes), or it may involve annotating the parts of the text associated with a disease state. However, the domain expert won’t be asked to mark POS tags or map every noun in the text to a semantic interpretation: those aspects of the text would be handled in a different model altogether, and merged at the end.

There is a slightly different philosophy behind the creation of light annotation tasks than that of more “traditional” annotations: light annotations focus on encoding an answer to a particular question about a text, rather than creating a complete record of a particular linguistic phenomenon, with the purpose of later merging all the different models into a single annotation. However, aside from the difference in goal, light annotation tasks still follow the MATTER and MAMA cycles. Because of this, we aren’t going to use them as examples in this book, and instead will stick to more traditional linguistic annotations.

If you are interested in performing an annotation task that requires domain-specific knowledge, and therefore would benefit from using a light annotation task, a methodology for creating a light annotation and incorporating it into the MATTER cycle is developed and presented in Stubbs 2012.

Adopting (or Not Adopting) Existing Models

Now that you have an idea of how specs can represent a model, let’s look a little more closely at some of the details we just presented. You might recall from Chapter 1 that when we discussed semantic roles we presented a very different list from acts_in, acts_as, directs, writes, and character_in. Here’s what the list looked like:

Agent: The event participant that is doing or causing the event to occur
Theme/figure: The event participant who undergoes a change in position or state
Experiencer: The event participant who experiences or perceives something
Source: The location or place from which the motion begins; the person from whom the theme is given
Goal: The location or place to which the motion is directed or terminates
Recipient: The person who comes into possession of the theme
Patient: The event participant who is affected by the event
Instrument: The event participant used by the agent to do or cause the event
Location/ground: The location or place associated with the event itself

Similarly, we also presented an ontology that defined the categories Organization, Person, Place, and Time. This set of labels can be viewed as a simple model of NE types that are commonly used in other annotation tasks.

So, if these models exist, why didn’t we just use them for our film genre annotation task? Why did we create our own sets of labels for our spec? Just as when defining the goal of your annotation you need to think about the trade-off between informativity and correctness, when creating the model and spec for your annotation task, you need to consider the trade-off between generality and specificity.

Creating Your Own Model and Specification: Generality Versus Specificity

The ontology consisting of Organization, Person, Place, and Time is clearly a very general model for entities in a text, but for the film genre annotation task, it is much too general to be useful for the kinds of distinctions we want to be able to make. Of the NE labels that we identified earlier, four of them (“director,” “writer,” “actor,” and “character”) would fall under the label “Person,” and “film title” doesn’t clearly fit under any of them. Using these labels would lead to unhelpful annotations in two respects: first, the labels used would be so generic as to be useless for the task (labeling everyone as “Person” won’t help distinguish one movie review from another); and second, it would be difficult to explain to the annotators that, while you’ve given them a set of labels, you don’t want every instance of those types of entities labeled, but rather only those that are relevant to the film (so, for example, a mention of another reviewer would not be labeled as a “Person”). Clearly, overly general tags in a spec can lead to confusion during annotation.

On the other hand, we could have made the tags in the spec even more specific, such as actor_star, actor_minor_character, character_main, character_minor, writer_film, writer_book, writer_book_and_film, and so on. But what would be gained from such a complicated spec? While it’s possible to think of an annotation task where it might be necessary to label all that information (perhaps one that was looking at how these different people are described in movie reviews), remember that the task we defined was, first, simply labeling the genres of films as they are described in summaries and reviews, and then expanding it to include some other information that might be relevant to making that determination. Using overly specific tags in this case would decrease how useful the annotations would be, and also increase the amount of work done by the annotators for no obvious benefit. Figure 4-1 shows the different levels of the hierarchy we are discussing. The top two levels are too vague, while the bottom is too specific to be useful. The third level is just right for this task.

We face the same dichotomy when examining the list of semantic roles. The list given in linguistic textbooks is a very general list of roles that can be applied to the nouns in a sentence, but any annotation task trying to use them for film-related roles would have to have a way to limit which nouns were assigned roles by the annotator, and most of the roles related to the NEs we’re interested in would simply be “agent”—a label that is neither helpful nor interesting for this task. So, in order to create a task that was in the right place regarding generality and specificity, we developed our own list of roles that were particular to this task.

Figure 4-1. A hierarchy of named entities

Note

We haven’t really gotten into the details of NE and semantic role annotation using existing models, but these are not trivial annotation tasks. If you’re interested in learning more about annotation efforts that use these models, check out FrameNet for semantic roles, and the Message Understanding Conferences (MUCs) for examples of NE and coreference annotation.

Overall, there are a few things that you want to make sure your model and specification have in order to proceed with your task. They should:

Contain a representation of all the tags and links relevant to completing your goal.
Be relevant to the implementation stated in your goal (if your purpose is to classify documents by genre, spending a lot of time annotating temporal information is probably not going to be of immediate help).
Be grounded in existing research as much as possible. Even if there’s no existing annotation spec that meets your goal completely, you can still take advantage of research that’s been done on related topics, which will make your own research much easier.

Specifically to the last point on the list, even though the specs we’ve described for the film genre annotation task use sets of tags that we created for this purpose, it’s difficult to say that they weren’t based on an existing model to some extent. Obviously some knowledge about NEs and semantic roles helped to inform how we described the annotation task, and helped us to decide whether annotating those parts of the document would be useful. But you don’t need to be a linguist to know that nouns can be assigned to different groups, and that the relationships between different nouns and verbs can be important to keep track of. Ultimately, while it’s entirely possible that your annotation task is completely innovative and new, it’s still worth taking a look at some related research and resources and seeing if any of them are helpful for getting your model and spec put together.

The best way to find out if a spec exists for your task is to do a search for existing annotated datasets. If you aren’t sure where to start, or Google results seem overwhelming, check Appendix A for the list of corpora and their annotations.

Using Existing Models and Specifications

While the examples we discussed thus far had fairly clear-cut reasons for us to create our own tags for the spec, there are some advantages to basing your annotation task on existing models. Interoperability is a big concern in the computer world, and it’s actually a pretty big concern in linguistics as well—if you have an annotation that you want to share with other people, there are a few things that make it easier to share, such as using existing annotation standards (e.g., standardized formats for your annotation files), using software to create the annotation that other people can also use, making your annotation guidelines available to other people, and using models or specifications that have already been vetted in similar tasks. We’ll talk more about standards and formats later in this chapter and in the next one; for now, we’ll focus just on models and specs.

Using models or specs that other people have used can benefit your project in a few ways. First of all, if you use the specification from an annotation project that’s already been done, you have the advantage of using a system that’s already been vetted, and one that may also come with an annotated corpus, which you can use to train your own algorithms or use to augment your own dataset (assuming that the usage restrictions on the corpus allow for that, of course).

In Background Research, we mentioned some places to start looking for information that would be useful with defining your goal, so presumably you’ve already done some research into the topics you’re interested in (if you haven’t, now is a good time to go back and do so). Even if there’s no existing spec for your topic, you might find a descriptive model similar to the one we provided for semantic roles.

Note

Not all annotation and linguistic models live in semantic textbooks! The list of film genres that we used was taken from IMDb.com, and there are many other places where you can get insight into how to frame your model and specification. A recent paper on annotating bias used the Wikipedia standards for editing pages as the standard for developing a spec and annotation guidelines for an annotation project (Herzig et al. 2011). Having a solid linguistic basis for your task can certainly help, but don’t limit yourself to only linguistic resources!

If you are lucky enough to find both a model and a specification that are suitable for your task, you still might need to make some changes for them to fit your goal. For example, if you are doing temporal annotation, you can start with the TimeML specification, but you may find that the TIMEX3 tag is simply too much information for your purposes, or too overwhelming for your annotators. The TIMEX3 DTD description is as follows:

<!ELEMENT TIMEX3 ( #PCDATA ) >
<!ATTLIST TIMEX3 start #IMPLIED >
<!ATTLIST TIMEX3 tid ID #REQUIRED >
<!ATTLIST TIMEX3 type ( DATE | DURATION | SET | TIME ) #REQUIRED >
<!ATTLIST TIMEX3 value NMTOKEN #REQUIRED >
<!ATTLIST TIMEX3 anchorTimeID IDREF #IMPLIED >
<!ATTLIST TIMEX3 beginPoint IDREF #IMPLIED >
<!ATTLIST TIMEX3 endPoint IDREF #IMPLIED >
<!ATTLIST TIMEX3 freq NMTOKEN #IMPLIED >
<!ATTLIST TIMEX3 functionInDocument ( CREATION_TIME | EXPIRATION_TIME | 
  MODIFICATION_TIME | PUBLICATION_TIME | RELEASE_TIME | RECEPTION_TIME | 
  NONE ) #IMPLIED >
<!ATTLIST TIMEX3 mod ( BEFORE | AFTER | ON_OR_BEFORE | ON_OR_AFTER | LESS_THAN |  
  MORE_THAN | EQUAL_OR_LESS | EQUAL_OR_MORE | START | MID | END | 
  APPROX )  #IMPLIED >
<!ATTLIST TIMEX3 quant CDATA #IMPLIED >
<!ATTLIST TIMEX3 temporalFunction ( false | true ) #IMPLIED >
<!ATTLIST TIMEX3 valueFromFunction IDREF #IMPLIED >
<!ATTLIST TIMEX3 comment CDATA #IMPLIED >

A lot of information is encoded in a TIMEX3 tag. While the information is there for a reason—years of debate and modification took place to create this description of a temporal reference—there are certainly annotation tasks where this level of detail will be unhelpful, or even detrimental. If this is the case, other temporal annotation tasks have been done over the years that have specs that you may find more suitable for your goal and model.

Using Models Without Specifications

It’s entirely possible—even likely—that your annotation task may be based on a linguistic (or psychological or sociological) phenomenon that has been clearly explained in the literature, but has not yet been turned into a specification. In that case, you will have to decide the form the specification will take, in much the same way that we discussed in the first section of this chapter. Depending on how fleshed out the model is, you may have to make decisions about what parts of the model become tags, what become attributes, and what become links. In some ways this can be harder than simply creating your own model and spec, because you will be somewhat constrained by someone else’s description of the phenomenon. However, having a specification that is grounded in an established theory will make your own work easier to explain and distribute, so there are advantages to this approach as well.

Many (if not all) of the annotation specifications that are currently in wide use are based on theories of language that were created prior to the annotation task being created. For example, the TLINK tag in ISO-TimeML is based largely on James Allen’s work in temporal reasoning (Allen 1984; Pustejovsky et al. 2003), and ISO-Space has been influenced by the qualitative spatial reasoning work of Randell et al. (1992) and others. Similarly, syntactic bracketing and POS labeling work, as well as existing semantic role labeling, are all based on models developed over years of linguistic research and then applied through the creation of syntactic specifications.

Different Kinds of Standards

Previously we mentioned that one of the aspects of having an interoperable annotation project is using a standardized format for your annotation files, as well as using existing models and specs. However, file specifications are not the only kind of standards that exist in annotation: there are also annotation specifications that have been accepted by the community as go-to (or de facto) standards for certain tasks. While there are no mandated (a.k.a. de jure) standards in the annotation community, there are varying levels and types of de facto standards that we will discuss here.

ISO Standards

The International Organization for Standardization (ISO) is the body responsible for creating standards that are used around the world for ensuring compatibility of systems between businesses and government, and across borders. ISO is the organization that helps determine what the consensus will be for many different aspects of daily life, such as the size of DVDs, representation of dates and times, and so on. There are even ISO standards for representing linguistic annotations in general and for certain types of specifications, in particular ISO-TimeML and ISO-Space. Of course, you aren’t required to use ISO standards (there’s no Annotation Committee that enforces use of these standards), but they do represent a good starting point for most annotation tasks, particularly those standards related to representation.

Note

ISO standards are created with the intent of interoperability, which sets them apart from other de facto standards, as those often become the go-to representation simply because they were there first, or were used by a large community at the outset and gradually became ingrained in the literature. While this doesn’t mean that non-ISO standards are inherently problematic, it does mean that they may not have been created with interoperability in mind.

Annotation format standards

Linguistic annotation projects are being done all over the world for many different, but often complementary, reasons. Because of this, in the past few years ISO has been developing the Linguistic Annotation Framework (LAF), a model for annotation projects that is abstract enough to apply to any level of linguistic annotation.

How can a model be flexible enough to encompass all of the different types of annotation tasks? LAF takes a two-pronged approach to standardization. First, it focuses on the structure of the data, rather than the content. Specifically, the LAF standard allows for annotations to be represented in any format that the task organizers like, so long as it can be transmuted into LAF’s XML-based “dump format,” which acts as an interface for all manner of annotations. The dump format has the following qualities (Ide and Romary 2006):

The annotation is kept separate from the text it is based on, and annotations are associated with character or element offsets derived from the text.
Each level of annotation is stored in a separate document.
Annotations that represent hierarchical information (e.g., syntax trees) must be either represented with embedding in the XML dump format, or use a flat structure that symbolically represents relationships.
When different annotations are merged, the dump format must be able to integrate overlapping annotations in a way that is compatible with XML.

The first bullet point—keeping annotation separate from the text—now usually takes the form of stand-off annotation (as opposed to inline annotation, where the tags and text are intermingled). We’ll go through all the forms that annotation can take and the pros and cons in Chapter 5.

The other side of the approach that LAF takes toward standardization is encouraging researchers to use established labels for linguistic annotation. This means that instead of just creating your own set of POS or NE tags, you can go to the Data Category Registry (DCR) for definitions of existing tags, and use those to model your own annotation task. Alternatively, you can name your tag whatever you want, but when transmuting to the dump format, you would provide information about what tags in the DCR your own tags are equivalent to. This will help other people merge existing annotations, because it will be known whether two annotations are equivalent despite naming differences. The DCR is currently under development (it’s not an easy task to create a repository of all annotation tags and levels, and so progress has been made very carefully). You can see the information as it currently exists at www.isocat.org.

LAF didn’t emerge as an ISO standard from out of nowhere. Here’s a quick rundown of where the standards composing the LAF model originated:

1987: The Text Encoding Initiative (TEI) is founded “to develop guidelines for encoding machine-readable texts in the humanities and social sciences.” The TEI is still an active organization today. See http://www.tei-c.org.
1990: The TEI releases its first set of Guidelines for the Encoding and Interchange of Machine Readable Texts. It recommends that encoding be done using SGML (Standard Generalized Markup Language), the precursor to XML and HTML.
1993: The Expert Advisory Group on Language Engineering Standards (EAGLES) is formed to provide standards for large-scale language resources (ex, corpora), as well as standards for manipulating and evaluating those resources. See http://www.ilc.cnr.it/EAGLES/home.html.
1998: The Corpus Encoding Standard (CES), also based on SGML, is released. The CES is a corpus-specific application of the standards laid out in the TEI’s Guidelines and was developed by the EAGLES group. See http://www.cs.vassar.edu/CES/.
2000: The Corpus Encoding Standard for XML (XCES) is released, again under the EAGLES group. See http://www.xces.org/.
2002: The TEI releases version P4 of its Guidelines, the first version to implement XML. See http://www.tei-c.org/Guidelines/P4/.
2004: The first document describing the Linguistic Annotation Framework is released (Ide and Romary 2004).
2007: The most recent version (P5) of the TEI Guidelines is released. See http://www.tei-c.org/Guidelines/P5/.
2012: LAF and the TEI Guidelines are still being updated and improved to reflect progress made in corpus and computational linguistics.

Annotation specification standards

In addition to helping create standards for annotation formats, ISO is working on developing standards for specific annotation tasks. We mentioned ISO-TimeML already, which is the standard for representing temporal information in a document. There is also ISO-Space, the standard for representing locations, spatial configurations, and movement in natural language. The area of ISO that is charged with looking at annotation standards for all areas of natural language is called TC 37/SC 4. Other projects involve the development of standards for how to encode syntactic categories and morphological information in different languages, semantic role labeling, dialogue act labeling, discourse relation annotation, and many others. For more information, you can visit the ISO web page or check out Appendix A of this book.

Community-Driven Standards

In addition to the committee-based standards provided by ISO, a number of de facto standards have been developed in the annotation community simply through wide use. These standards are created when an annotated resource is formed and made available for general use. Because corpora and related resources can be very time-consuming to create, once a corpus is made available it will usually quickly become part of the literature. By extension, whatever annotation scheme was used for that corpus will also tend to become a standard.

If there is a spec that is relevant to your project, taking advantage of community-driven standards can provide some very useful benefit. Any existing corpora that are related to your effort will be relevant, since they are developed using the spec you want to adopt. Additionally, because resources such as these are often in wide use, searching the literature for mentions of the corpus will often lead you to papers that are relevant to your own research goals, and will help you identify any problems that might be associated with the dataset or specification. Finally, datasets that have been around long enough often have tools and interfaces built around them that will make the datasets easier for you to use.

Warning

Community-driven standards don’t necessarily follow LAF guidelines, or make use of other ISO standards. This doesn’t mean they should be disregarded, but if interoperability is important to you, you may have to do a little extra work to make your corpus fit the LAF guidelines.

We have a list of existing corpora in Appendix C to help you get started in finding resources that are related to your own annotation task. While the list is as complete as we could make it, it is not exhaustive, and you should still check online for resources that would be useful to you. The list that we have was compiled from the LRE Map, a database of NLP resources maintained by the European Language Resources Association (ELRA).

Other Standards Affecting Annotation

While the ISO and community-driven standards are generally the only standards directly related to annotation and NLP, there are many standards in day-to-day life that can affect your annotation project. For example, the format that you choose to store your data in (Unicode, UTF-8, UTF-16, ASCII, etc.) will affect how easily other people will be able to use your texts on their own computers. This becomes especially tricky if you are annotating in a language other than English, where the alphabet uses different sets of characters. Even languages with characters that overlap with English (French, Spanish, Italian, etc.) can be problematic when accented vowels are used. We recommend using UTF-8 for encoding most languages, as it is an encoding that captures most characters that you will encounter, and it is available for nearly all computing platforms.

Other standards that can affect a project are those that vary by region, such as the representation of dates and times. If you have a project in which it is relevant to know when the document was created, or how to interpret the dates in the text, it’s often necessary to know where the document originated. In the United States, dates are often represented as MM-DD-YYYY, whereas in other countries dates are written in the format DD-MM-YYYY. So if you see the date 01-03-1999 in a text, knowing where it’s from might help you determine whether the date is January 3 or March 1. Adding to the confusion, most computers will store dates as YYYY-MM-DD so that the dates can be easily sorted.

Similarly, naming conventions can also cause confusion. When annotating NEs, if you’re making a distinction between given names and family names, again the origin of the text can be a factor in how the names should be annotated. This can be especially confusing, because while it might be a convention in a country for people to be referred to by their family name first (as in Hungary, South Korea, or Japan), if the text you are annotating has been translated, the names may have been (or may not have been) swapped by the translator to follow the convention of the language being translated to.

None of the issues we’ve mentioned should be deal breakers for your project, but they are definitely things to be aware of. Depending on your task, you may also run into regional variations in language or pronunciation, which can be factors that you should take into account when creating your corpus. Additionally, you may need to modify your model or specification to allow for annotating different formats of things such as dates and names if you find that your corpus has more diversity in it than you initially thought.

Summary

In this chapter we defined what models and specifications are, and looked at some of the factors that should be taken into account when creating a model and spec for your own annotation task. Specifically, we discussed the following:

The model of your annotation project is the abstract representation of your goal, and the specification is the concrete representation of it.
XML DTDs are a handy way to represent a specification; they can be applied directly to an annotation task.
Most models and specifications can be represented using three types of tags: document-level labels, extent tags, and link tags.
When creating your specification, you will need to consider the trade-off between generality and specificity. Going too far in either direction can make your task confusing and unmanageable.
Searching existing datasets, annotation guidelines, and related publications and conferences is a good way to find existing models and specifications for your task.
Even if no existing task is a perfect fit for your goal, modifying an existing specification can be a good way to keep your project grounded in linguistic theories.
Interoperability and standardization are concerns if you want to be able to share your projects with other people. In particular, text encoding and annotation format can have a big impact on how easily other people can use your corpus and annotations.
Both ISO standards and community-driven standards are useful bases for creating your model and specification.
Regional differences in standards of writing, text representation, and other natural language conventions can have an effect on your task, and may need to be represented in your specification.

Get Natural Language Annotation for Machine Learning now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Natural Language Annotation for Machine Learning by James Pustejovsky, Amber Stubbs

Chapter 4. Building Your Model and Specification

Some Example Models and Specs

Note

Film Genre Classification

Note

Adding Named Entities

Semantic Roles

Adopting (or Not Adopting) Existing Models

Creating Your Own Model and Specification: Generality Versus Specificity

Note

Using Existing Models and Specifications

Note

Using Models Without Specifications

Different Kinds of Standards

ISO Standards

Note

Annotation format standards

Annotation specification standards

Community-Driven Standards

Warning

Other Standards Affecting Annotation

Summary

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly