Like most real-world projects, this one involves both structured and unstructured data. Unstructured and semi-structured data include media articles, press releases, trade literature, blog posts, tweets, and so on. Unstructured files can reach a researcher as plain text, PDF, Word, HTML, and many other formats. Structured data usually comes as delimiter-separated text files (most often comma-separated, as in CSV, or tab-separated, as in TSV), with or without a header row. Cascading can use these structured files as they are, but unstructured data needs preprocessing.
The steps that we used to preprocess our unstructured data are: