Appendix A. List of Available Corpora and Specifications

This appendix was compiled primarily from the LRE Resource Map. Many thanks to Nicoletta Calzolari and Riccardo del Gratta for their help in creating this appendix, and for allowing us to reprint this information here.

Please note that this appendix does not represent a complete list of all the existing corpora and specifications for the various tasks listed here. It is intended to provide a general overview of the different corpora and specifications available, to give you an idea of what resources you can use in your own annotation and machine learning (ML) tasks. For the most up-to-date list of resources, check the LRE Resource Map, or just do a web search to see what else is available.

Corpora

A Reference Dependency Bank for Analyzing Complex Predicates

Modality: Written

Languages: Hindi/Urdu

Annotation: Semantic dependencies

URL: http://ling.uni-konstanz.de/pages/home/pargram_urdu/main/Resources.html

A Treebank for Finnish (FinnTreeBank)

Modality: Written

Language: Finnish

Annotation: Treebank

Size: 17,000 model sentences

URL: http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/index.shtml

ALLEGRA (ALigned press reLEases of the GRisons Administration)

Modality: Written

Languages: German, Romansh, Italian

URL: http://www.latl.unige.ch/allegra/

AnCora

Modality: Written

Language: Catalan

Annotations: Lemma and part of speech, syntactic constituents and functions, argument structure and thematic roles, semantic classes of the verb, Named Entities, coreference ...
