Understanding datasets

This section has two parts. The first part discusses the challenges we faced in generating the dataset. The second part describes the attributes of the dataset.

Challenges in obtaining the dataset

The health domain is highly regulated, which makes obtaining a dataset difficult. These are some of the challenges I want to highlight:

  • For summarization, we ideally need a corpus that contains each original text as well as a summary of that text. This is called a parallel corpus. Unfortunately, there is no good, free parallel corpus available for medical document summarization, so we need to obtain this kind of parallel dataset for the English language.
  • There ...
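
To make the idea of a parallel corpus concrete, it can be pictured as a list of (original text, summary) pairs. The following is only a minimal sketch: the names ParallelExample and build_parallel_corpus are hypothetical, and the sample strings are invented placeholders rather than records from any real medical dataset.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ParallelExample:
    """One entry of a parallel corpus (hypothetical structure)."""
    text: str      # the original document text
    summary: str   # the reference summary of that text


def build_parallel_corpus(texts: Sequence[str],
                          summaries: Sequence[str]) -> List[ParallelExample]:
    """Pair each document with its summary, skipping pairs with an empty side."""
    if len(texts) != len(summaries):
        raise ValueError("texts and summaries must have the same length")
    corpus = []
    for text, summary in zip(texts, summaries):
        text, summary = text.strip(), summary.strip()
        if text and summary:  # keep only complete pairs
            corpus.append(ParallelExample(text=text, summary=summary))
    return corpus


if __name__ == "__main__":
    # Invented placeholder strings, for illustration only.
    docs = ["Patient reports a mild headache for three days. No fever."]
    sums = ["Mild headache for three days, no fever."]
    for example in build_parallel_corpus(docs, sums):
        print(example.text, "->", example.summary)
```

Keeping the text and its summary in a single record means the two sides cannot drift out of alignment when the corpus is later shuffled or split.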
