Information extraction

In this phase, we extract the required details from the input data. The previous phase already identified the pieces that are of interest to us. Here we can adopt the following techniques for information extraction:

  • Identify and locate where the text is present
  • Analyze and come up with the best method of information extraction:
    • Tokenize and extract information
    • Go to offset and extract information
    • Regular expression-based information extraction
    • Complex algorithm-based information extraction

Depending on the complexity of the data, we might have to adopt one or more of the aforementioned techniques to extract the information from the target data.
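
As a rough illustration of how the first three techniques can be combined on one record, here is a minimal Python sketch. The record layout, the field names, and the helper functions (`extract_by_tokens`, `extract_by_offset`, `extract_by_regex`, `extract`) are all assumptions made up for this example; real data will need its own delimiters, offsets, and patterns.

```python
import re

# Hypothetical comma-separated log record, used only for illustration.
RECORD = "2018-03-01 12:45:10,ERROR,user=alice,Disk quota exceeded"

# Assumed pattern: the user name follows the literal "user=".
USER_PATTERN = re.compile(r"user=(\w+)")


def extract_by_tokens(record):
    """Tokenize on commas and pick fields by position."""
    # maxsplit=3 keeps any commas inside the trailing message intact.
    timestamp, level, _user_field, message = record.split(",", 3)
    return {"timestamp": timestamp, "level": level, "message": message}


def extract_by_offset(record):
    """Go to a known offset: the date is assumed to fill the first 10 characters."""
    return record[0:10]


def extract_by_regex(record):
    """Use a regular expression to find a value wherever it appears."""
    match = USER_PATTERN.search(record)
    return match.group(1) if match else None


def extract(record):
    """Combine the techniques to build one structured result."""
    fields = extract_by_tokens(record)
    fields["date"] = extract_by_offset(record)
    fields["user"] = extract_by_regex(record)
    return fields


if __name__ == "__main__":
    print(extract(RECORD))
```

Tokenizing suits delimited data, offsets suit fixed-width layouts, and regular expressions suit values whose position varies. Complex algorithm-based extraction is left out here because it depends heavily on the data at hand.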
