Another option for loading the documents is the cc.mallet.pipe.iterator.CsvIterator class. Its constructor, CsvIterator(Reader, Pattern, int, int, int), assumes that all of the documents are in a single file, one per line, and returns one instance for each line matched by a regular expression. The constructor takes the following arguments:
- Reader: The object that specifies how to read from the file
- Pattern: A regular expression with three capturing groups: data, target label, and document name
- int, int, int: The indexes of the data, target, and name groups, in that order, as they are numbered in the regular expression
Consider a text document in the following format, specifying the document name, category, and content:
AP881218 local-news A 16-year-old student ...
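To make the role of the three group indexes concrete, here is a minimal sketch that uses only java.util.regex, so it runs without Mallet on the classpath. The regular expression and the class name CsvLineRegexDemo are assumptions chosen for illustration; they are not given in the text above. The pattern captures the name as group 1, the label as group 2, and the content as group 3, so the same pattern would presumably be passed to CsvIterator with the indexes 3, 2, 1 (data, target, name), as noted in the comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvLineRegexDemo {

    // Assumed pattern: group 1 = document name, group 2 = label, group 3 = content.
    // Fields are separated by whitespace and/or commas.
    static final Pattern LINE = Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");

    // Returns {data, target, name}, the order in which CsvIterator expects its indexes.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Line does not match: " + line);
        }
        return new String[] { m.group(3), m.group(2), m.group(1) };
    }

    public static void main(String[] args) {
        String[] fields = parse("AP881218 local-news A 16-year-old student ...");
        System.out.println("data   = " + fields[0]);
        System.out.println("target = " + fields[1]);
        System.out.println("name   = " + fields[2]);

        // With Mallet available, the equivalent iterator would be created roughly as:
        //   new CsvIterator(reader, LINE, 3, 2, 1);  // data, target, name group indexes
    }
}
```

Because the name comes first in the file but the constructor asks for the data index first, the indexes are passed in reverse order relative to how the fields appear on each line.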