Learning Cascading by Victoria Loewengart, Michael Covert

Project scope – understanding requirements

Like most real-world projects, this one involves both structured and unstructured data. Unstructured and semi-structured data include media articles, press releases, trade literature, blog posts, tweets, and so on. Unstructured files can reach a researcher as text, PDF, Word, HTML, and many other formats. Structured data usually comes as delimiter-separated text files (most often comma-separated CSV or tab-separated TSV), with or without a header. Cascading can use these structured files as they are, but unstructured data needs preprocessing.
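To make the structured case concrete, here is a minimal sketch in plain Java of reading delimiter-separated lines with an optional header. It is not Cascading's API. The class name `DelimitedSketch`, the helpers `delimiterFor` and `parseRows`, and the sample data are all hypothetical, and the naive split assumes that fields contain no quoted or escaped delimiters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class DelimitedSketch {

    // Hypothetical helper: choose the delimiter from the file extension.
    static String delimiterFor(String fileName) {
        String lower = fileName.toLowerCase();
        if (lower.endsWith(".tsv")) {
            return "\t";
        }
        if (lower.endsWith(".csv")) {
            return ",";
        }
        throw new IllegalArgumentException("Unsupported file type: " + fileName);
    }

    // Hypothetical helper: split each line into fields, skipping the
    // first line when the file is known to start with a header.
    // Assumes no quoted fields, so a plain split is enough.
    static List<String[]> parseRows(List<String> lines, String delimiter, boolean hasHeader) {
        List<String[]> rows = new ArrayList<>();
        String regex = Pattern.quote(delimiter);
        for (int i = hasHeader ? 1 : 0; i < lines.size(); i++) {
            // Limit -1 keeps trailing empty fields.
            rows.add(lines.get(i).split(regex, -1));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample content standing in for a small CSV file.
        List<String> lines = Arrays.asList(
                "name,source",
                "Acme,press release",
                "Globex,blog post");
        List<String[]> rows = parseRows(lines, delimiterFor("companies.csv"), true);
        for (String[] row : rows) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Whether a file has a header is passed in explicitly here, because a header cannot be detected reliably from the content alone.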

The steps that we used to preprocess our unstructured data are:

  1. First convert unstructured files of different formats ...