Importing from file

Another option for loading documents is the cc.mallet.pipe.iterator.CsvIterator class, constructed as CsvIterator(Reader, Pattern, int, int, int). It assumes that all documents are stored in a single file, one document per line, and returns one instance per line, with the fields of each instance extracted by a regular expression. The constructor takes the following arguments:

  • Reader: The object that reads the input file
  • Pattern: A regular expression with three capturing groups: data, target label, and document name
  • int, int, int: The indexes of the data, target, and name groups, in that order, as they are numbered in the regular expression

Consider a text document in the following format, specifying the document name, category, and content:

AP881218 local-news A 16-year-old student ...
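
To make the role of the three group indexes concrete, the following sketch parses the sample line above with plain java.util.regex; it does not use Mallet. The regular expression, the class name CsvLineDemo, and the parse helper are assumptions chosen for illustration: the pattern captures the name as group 1, the category as group 2, and the content as group 3.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: shows how one line splits into name, label, and data.
class CsvLineDemo {
    // Assumed pattern: group 1 = document name, group 2 = target label,
    // group 3 = the rest of the line (the document content).
    static final Pattern LINE =
        Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");

    // Returns {data, target, name}, read from groups 3, 2, and 1.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unparseable line: " + line);
        }
        return new String[] { m.group(3), m.group(2), m.group(1) };
    }

    public static void main(String[] args) {
        String[] fields = parse("AP881218 local-news A 16-year-old student ...");
        System.out.println("data   = " + fields[0]);
        System.out.println("target = " + fields[1]);
        System.out.println("name   = " + fields[2]);
    }
}
```

With a pattern like this one, the content is the third group, the label the second, and the name the first, so the three int arguments to CsvIterator would be 3, 2, and 1, in that order.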
