How it works...

The dataset contains 768 observations. Each line (record) consists of eight features and a label value that can be used to train a supervised learning model (here, logistic regression). The label/class is either 1, meaning the patient tested positive for diabetes, or 0, meaning the test came back negative.

The fields of each record are as follows:

  • Number of times pregnant
  • Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • Diastolic blood pressure (mm Hg)
  • Triceps skin fold thickness (mm)
  • 2-hour serum insulin (μU/mL)
  • Body mass index (weight in kg/(height in m)^2)
  • Diabetes pedigree function
  • Age (years)
  • Class variable (0 or 1): 1 = tested positive, 0 = tested negative
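
To make the record layout concrete, the following is a minimal sketch in plain Python (not the recipe's Spark code) that splits one record into its eight features and its label. It assumes each record is a comma-separated line with the fields in the order listed above and the class variable last. The short feature names, the `LabeledRecord` type, and the `parse_record` helper are hypothetical and introduced only for illustration, and the two sample lines are illustrative records in that assumed layout.

```python
from typing import List, NamedTuple

# Hypothetical short names for the eight features, in the order listed above.
FEATURE_NAMES = [
    "pregnancies",
    "plasma_glucose",
    "diastolic_bp",
    "triceps_skin_fold",
    "serum_insulin",
    "bmi",
    "pedigree_function",
    "age",
]


class LabeledRecord(NamedTuple):
    """One observation: eight numeric features plus a 0/1 label."""
    features: List[float]
    label: int


def parse_record(line: str) -> LabeledRecord:
    """Parse one comma-separated record (assumed layout: 8 features, then label)."""
    fields = [f.strip() for f in line.split(",")]
    expected = len(FEATURE_NAMES) + 1
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    features = [float(v) for v in fields[:-1]]
    label = int(fields[-1])
    if label not in (0, 1):
        raise ValueError(f"label must be 0 or 1, got {label}")
    return LabeledRecord(features, label)


# Illustrative records in the assumed layout.
sample_lines = [
    "6,148,72,35,0,33.6,0.627,50,1",
    "1,85,66,29,0,26.6,0.351,31,0",
]
records = [parse_record(line) for line in sample_lines]

for rec in records:
    named = dict(zip(FEATURE_NAMES, rec.features))
    print(rec.label, named)
```

Keeping the features as a list of floats separate from the integer label mirrors the feature-vector-plus-label shape that supervised learners such as logistic regression typically expect.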
