Machine learning for Go programmers

We know that Go is powering amazing infrastructure projects like Kubernetes, etcd, and Docker; and we love Go's simplicity, ease of deployment, and tooling. But what if we want to infuse a little more intelligence into our Go applications? Specifically, what about all of these cool machine learning algorithms? Can we integrate those into our Go applications natively, and if so, how would we do that?

In fact, you can quickly and effectively implement machine learning natively in your Go application; and we will explore some of the relevant tooling in this Oriole. This will allow you to level up your algorithmic decision making. However, before jumping into implementation, let's define the machine learning process that we follow and look at the particular model that we will be using.

The machine learning process

"Machine learning" might sound pretty fancy, but in essence, machine learning boils down to the training and utilization of algorithms that are able to "learn" how to answer questions or make decisions. For example, a machine learning algorithm might make a decision about whether certain network traffic is fraudulent or not, or it might predict a company's sales figures.

But how do these algorithms "learn" how to answer questions? Well, we need to "train" them. Machine learning algorithms are like other functions or algorithms in that we input data and they produce output. In the training phase of machine learning we input a "training" dataset to the machine learning model; and the output of this process is a trained (or fit) model. Then, in a prediction, inference, or utilization phase, we can input new data into our trained model and the model will output its predictions or inferences.

For example, let's say that we want to create a machine learning model to predict the progression of diabetes based on various attributes of a patient. In a training phase, we could expose our machine learning model to "labeled" data in the form of various measured patient attributes paired with actual measurements of disease progression (this is called "supervised" learning). Our model would then "learn" how to predict unknown disease progressions based on newly input patient attributes.

Our first machine learning model

In this Oriole, we will be utilizing one of these machine learning algorithms called linear regression. Linear regression is widely used to model continuous variables (e.g., sales, or disease progression); and is used as the basis for many other models. It also produces models that are immediately interpretable. Thus, linear regression can provide an excellent starting point when introducing predictive capabilities in an organization.

Furthermore, exploring linear regression will give us an opportunity to implement a machine learning workflow in Go, without any unnecessary complications from more sophisticated models. This workflow, discussed further below, includes profiling variables in a data set, splitting a data set into training and test sets, and evaluating the performance of a model.

Linear regression itself uses the formula for a line y = m*x + b as the basis for its decision making. When we "train" a linear regression model, we determine the m and the b to be used in the formula for predicting y. In our use case, y will be an indicator of diabetes disease progression, so our training will produce a parameterized formula for a line that predicts disease progression based on the input of one or more features (or independent variables).
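For illustration, a trained single-feature model reduces to nothing more than this parameterized line. In the minimal sketch below, the m and b values are hypothetical placeholders (the training we cover later determines the real values):

// y = m*x + b: a trained single-feature linear regression model
// is just a parameterized line. The m and b values here are
// hypothetical placeholders, not the result of any real training.
m := 10.0  // slope (the coefficient on the feature)
b := 150.0 // intercept
// Predicted disease progression for a hypothetical bmi value of 0.05.
predictedProgression := m*0.05 + b
fmt.Println(predictedProgression)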

Go tooling

In this particular exercise, we will utilize the following Go packages/tooling to explore/manipulate our data set and build our linear regression model. A more complete listing of data science related tooling for Go can be found here.

import (
    "github.com/sajari/regression"
    "github.com/kniren/gota/dataframe"
    "github.com/gonum/plot"
    "github.com/gonum/plot/plotter"
    "github.com/gonum/plot/vg"
    "github.com/gonum/floats"
)

We are also going to import the following from stdlib:

import (
    "math"
    "fmt"
    "io/ioutil"
    "log"
    "os"
)

Import the data set

Our example data set is related to diabetes disease progression. The data set can be retrieved from here and is commonly used for testing regression models. The data set includes "Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements, for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline."

We are going to use the gota package (i.e., dataframe for Go) to import the CSV, infer the types of the columns, and print out useful information about the data set, including number of rows/columns, names of the columns, and some example values.

// Open the CSV dataset file.
f, err := os.Open("diabetes.csv")
if err != nil {
    fmt.Println(err.Error())
}

// Create a dataframe from the CSV file.
// The types of the columns will be inferred.
dataDF := dataframe.ReadCSV(f)

// Close the file.
f.Close()
fmt.Printf("Num of Rows: %d\nNum of Rows: %d\nColumn names: %v\n\n", dataDF.Nrow(), dataDF.Ncol(), dataDF.Names())

In the following models, we are going to utilize two of the ten features (specifically bmi and ltg) to model the quantitative measure of disease progression (which we will call "y"). Let's take a look at these columns:

fmt.Println(dataDF.Select([]string{"bmi", "ltg", "y"}).Subset([]int{0, 1, 2, 3, 4}))

Profile the data

Before building any machine learning model, it is essential that we gain some intuition about how the variables are distributed. Strictly speaking, linear regression's normality assumption concerns the model's residuals rather than the features themselves, but heavily skewed variables are a common warning sign and are worth examining up front. Here, we are going to explore the distributions of our features visually via histograms. We will generate a histogram for each feature, along with "y," normalize all of the histograms, and save each to an image file.

// Create a histogram for each of the features.
for _, colName := range dataDF.Names() {

    // Extract the columns as a slice of floats.
    floatCol := dataDF.Col(colName).Float()

    // Create a plotter.Values value and fill it with the
    // values from the respective column of the dataframe.
    plotVals := make(plotter.Values, len(floatCol))
    for i, floatVal := range floatCol {
        plotVals[i] = floatVal
    }

    // Make a plot and set its title.
    p, err := plot.New()
    if err != nil {
        log.Fatal(err)
    }
    p.Title.Text = fmt.Sprintf("Histogram of %s", colName)

    // Create a histogram of the column's values,
    // using 16 bins.
    h, err := plotter.NewHist(plotVals, 16)
    if err != nil {
        log.Fatal(err)
    }

    // Normalize the histogram.
    h.Normalize(1)

    // Add the histogram to the plot.
    p.Add(h)

    // Save the plot to a PNG file.
    if err := p.Save(4*vg.Inch, 4*vg.Inch, colName+"_hist.png"); err != nil {
        log.Fatal(err)
    }
}

Note, the features bmi and ltg appear to be somewhat normally distributed; at least they are not extremely skewed. y appears to follow a non-normal distribution. Strictly speaking, this is a violation of some of the assumptions we are making when we use linear regression. This should be noted and documented as you build your model.
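Beyond eyeballing histograms, we can put a number on the asymmetry. The following minimal sketch assumes an extra import of the github.com/gonum/stat package (not among the imports listed above) and uses its Skew function:

// Quantify the asymmetry of the response's distribution.
// A skewness near 0 suggests rough symmetry; large absolute
// values indicate heavy skew.
ySkew := stat.Skew(dataDF.Col("y").Float(), nil)
fmt.Printf("skewness of y: %0.2f\n", ySkew)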

In response to the information we gain about y above, we could do one of two things: (1) transform y to log(y) or to some power of y and then try to model that by bmi and ltg, or (2) ignore the non-normality of y, train the model anyway, and see how it goes. For now, let's try the latter (a sketch of option (1) is shown below for reference). It could very well be that this distribution of y does not prevent us from training a reasonable model, and transforming y has the downside of making our model less interpretable.
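For reference, option (1) would look something like the following minimal sketch; the logY values are purely illustrative, and we will not use them below:

// Transform y to log(y) to reduce the skew in its distribution.
// Note that predictions from a model trained on logY would need
// to be mapped back with math.Exp, hurting interpretability.
yRaw := dataDF.Col("y").Float()
logY := make([]float64, len(yRaw))
for i, yVal := range yRaw {
    logY[i] = math.Log(yVal)
}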

In addition to gaining some insight about how the variables are distributed, we should look at the correlation between our chosen features and the response, "y." This should confirm that the variables are linearly correlated with the response. It will also give us some intuition about what coefficients we might expect in our models (e.g., whether the coefficients will be positive or negative).

// Extract the target column.
yVals := dataDF.Col("y").Float()

// Create a scatter plot for each of the features in the dataset.
for _, colName := range dataDF.Names() {

    // Extract the columns as a slice of floats.
    floatCol := dataDF.Col(colName).Float()

    // pts will hold the values for plotting
    pts := make(plotter.XYs, len(floatCol))

    // Fill pts with data.
    for i, floatVal := range floatCol {
        pts[i].X = floatVal
        pts[i].Y = yVals[i]
    }

    // Create the plot.
    p, err := plot.New()
    if err != nil {
        log.Fatal(err)
    }
    p.X.Label.Text = colName
    p.Y.Label.Text = "y"
    p.Add(plotter.NewGrid())

    s, err := plotter.NewScatter(pts)
    if err != nil {
        log.Fatal(err)
    }
    s.GlyphStyle.Radius = vg.Points(3)

    // Save the plot to a PNG file.
    p.Add(s)
    if err := p.Save(4*vg.Inch, 4*vg.Inch, colName+"_scatter.png"); err != nil {
        log.Fatal(err)
    }
}

It does indeed appear that bmi and ltg are close to linearly correlated with y. As both bmi and ltg increase, y increases (or both exhibit an "up and to the right" behavior). This indicates that y is "positively correlated" with both bmi and ltg, and we would expect that the coefficient of each in a linear regression model would be positive. After training a model, we should use this information as a safety check to see if the coefficients behave as expected.
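We can also quantify this intuition. The following minimal sketch again assumes an extra import of github.com/gonum/stat, using its Correlation function to compute the Pearson correlation of each chosen feature with y:

// Compute the Pearson correlation coefficient between each
// feature and the response. Values near +1 indicate a strong
// positive linear relationship.
for _, colName := range []string{"bmi", "ltg"} {
    featureCol := dataDF.Col(colName).Float()
    fmt.Printf("correlation(%s, y) = %0.2f\n", colName, stat.Correlation(featureCol, yVals, nil))
}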

Split the data into training and test data sets

To avoid overfitting our model to our data, we will split the data up into training and test data sets. We will train our model (i.e., fit our linear regression line) using the training data; and we will evaluate the performance of our model using the test data. Note, there are many other methods for preventing overfitting (e.g., cross-validation); and for more complicated models we might want to consider regularization. However, training and test data sets should be appropriate here, and they will demonstrate some of the data set splitting functionality that can be the basis of other methods like cross-validation.

In this case, we will utilize 3/4 of the data for training and 1/4 of the data for testing:

// Calculate the number of elements in each set.
trainingNum := (3 * dataDF.Nrow()) / 4
testNum := dataDF.Nrow() / 4

if trainingNum+testNum < dataDF.Nrow() {
    trainingNum++
}

We then create two slices of integers, one for training and one for test, that enumerate the index values of our training and test rows in the dataframe. This will allow us to subset out the training and test portions of the dataframe.

// Create the subset indices.
trainingIdx := make([]int, trainingNum)
testIdx := make([]int, testNum)

// Enumerate the training indices.
for i := 0; i < trainingNum; i++ {
    trainingIdx[i] = i
}

// Enumerate the test indices.
for i := 0; i < testNum; i++ {
    testIdx[i] = trainingNum + i
}

Then, subsetting is as simple as providing these indices to the Subset method on the dataframe value:

// Create the subset dataframes.
trainingDF := dataDF.Subset(trainingIdx)
testDF := dataDF.Subset(testIdx)
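Note that this split simply takes the first 3/4 of the rows as they appear in the file, which is fine when the rows are in no particular order. If your data might be sorted, shuffling the indices first is safer; a minimal sketch using math/rand (an extra stdlib import) could look like this:

// Shuffle the row indices before splitting so that any ordering
// in the file does not bias the training or test sets.
rand.Seed(42) // fixed seed for reproducibility
perm := rand.Perm(dataDF.Nrow())
shuffledTrainingDF := dataDF.Subset(perm[:trainingNum])
shuffledTestDF := dataDF.Subset(perm[trainingNum:])

We will proceed with the sequential split above.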

Train and evaluate a linear regression model

We are going to train two linear regression models in this Oriole. First, we are going to try and model y using bmi only. Then we are going to build a multiple regression model by adding in ltg. We will evaluate each model to see if adding in ltg makes a significant impact on our model performance.

Train a single linear regression model

The bmi and y values in our training dataframe are extracted into simple slices of floats. This is required as we need to supply the raw float values to the regression model for training. gota provides a convenient Float method that allows us to extract these values.

// Extract the bmi as a slice of floats.
bmiTraining := trainingDF.Col("bmi").Float()

// Extract the y values as a slice of floats.
yTraining := trainingDF.Col("y").Float()

We then create a regression.Regression value using the github.com/sajari/regression package. This value will represent our model, along with all the training information. Those familiar with scikit-learn and other packages will be familiar with this flow: create a model value, add the training data, train the model, and use the trained model to make predictions.

var r regression.Regression
r.SetObserved("diabetes progression")
r.SetVar(0, "bmi")

The training values have to be added to the regression value. To do this, we loop over the values, adding them as DataPoints in the regression value.

// Loop over the records, adding the training data to the regression value.
for i, bmiVal := range bmiTraining {
    r.Train(regression.DataPoint(yTraining[i], []float64{bmiVal}))
}

To train the model we use the Run() method on the regression value. This returns an error value, which is nil if the training was successful. Note that here we are just outputting that error value to the notebook. In a real world scenario one would want to do a check on this error (i.e., if err != nil {...}) and handle it appropriately.

// Train/fit the regression model.
r.Run()
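For completeness, that check would look like the following in a standalone program:

// Train/fit the regression model, handling any training error.
if err := r.Run(); err != nil {
    log.Fatal(err)
}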

Ok great! We have a trained model. Now let's output the formula of the model to standard out to examine the coefficient and intercept. Remember that we expect the coefficient on bmi to be positive.

// Output the trained model parameters.
fmt.Printf("\nRegression Formula:\n%v\n\n", r.Formula)

Evaluate the model

The test values are now extracted as floats from the full dataframe for testing. Here we will utilize the "mean absolute error" (MAE) as our evaluation metric. This is a simple, but intuitive metric, and it has the advantage of being directly comparable to the scale of our y values.
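Concretely, for n test points with actual values y_i and predictions ŷ_i, MAE = (1/n) * Σ |y_i - ŷ_i|, i.e., the average absolute difference between the actual and predicted values.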

// Extract the bmi as a slice of floats.
bmiTest := testDF.Col("bmi").Float()

// Extract the y values as a slice of floats.
yTest := testDF.Col("y").Float()

To calculate the MAE, we define a float64 value.

// We are going to evaluate our model using the mean absolute error.
var mAE float64

Then we loop over the bmi values, accumulating the MAE. For each of the test bmi values, Predict is called on the trained regression model to make a prediction for y based on the particular value of bmi. Under the hood, this Predict function utilizes the formula printed above to make the predictions. Thus, after training a model, one could abandon the third-party regression package and implement a simple function to make the prediction based on the formula (as sketched after the evaluation below). This would be preferable in many cases to avoid third-party dependencies, assuming that you don't need to update your model online.

// Loop over the test data predicting y and evaluating the prediction
// with the mean absolute error.
for i, bmiVal := range bmiTest {

    // Predict y with our trained model.
    yPredicted, err := r.Predict([]float64{bmiVal})
    if err != nil {
        log.Fatal(err)
    }

    // Add to the mean absolute error.
    mAE += math.Abs(yTest[i]-yPredicted) / float64(testDF.Nrow())
}
// Output the MAE to standard out.
fmt.Printf("MAE = %0.2f\n\n", mAE)

Ok, so our model was evaluated to perform with an MAE of 50.78. One can compare this to the range of y values in the histogram above to get a sense of the scale of possible errors.
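As noted above, the trained model boils down to a formula, so a dependency-free predictor is easy to write by hand. Here is a minimal sketch, assuming the Coeff accessor of github.com/sajari/regression (where index 0 is the intercept and index 1 is the coefficient on bmi):

// Pull the fitted parameters out of the trained model and wrap
// them in a plain function with no third-party dependencies.
intercept := r.Coeff(0)
bmiCoeff := r.Coeff(1)
predict := func(bmi float64) float64 {
    return intercept + bmiCoeff*bmi
}
fmt.Printf("prediction for the first test row: %0.2f\n", predict(bmiTest[0]))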

Add another predictor to the model (multiple regression)

Let's not stop there. We have nine other features that might be relevant to this prediction, so let's try to add one of those into the model (making it a "multiple" regression model). We will add ltg into the model and then evaluate the multiple regression model to see if we did any better than our bmi-only model above.

// Extract the ltg as a slice of floats.
ltgTraining := trainingDF.Col("ltg").Float()

Here we will label both bmi and ltg as the predictors in the model using SetVar. The 0 and 1 in the SetVar calls indicate that bmi is the first predictor and ltg is the second; these indices must match the order in which we supply the values when training and predicting, while the names themselves mainly help the data scientist keep track of things. The response, y, can also be labeled. Here we label it as "diabetes progression."

var mr regression.Regression
mr.SetObserved("diabetes progression")
mr.SetVar(0, "bmi")
mr.SetVar(1, "ltg")

Again, we add in the training values to the regression value. This time we add in both the bmi and ltg values. It is important here that we maintain the ordering specified above to prevent confusion.

// Loop over the records, adding the training data to the regression value.
for i, bmiVal := range bmiTraining {
    mr.Train(regression.DataPoint(yTraining[i], []float64{bmiVal, ltgTraining[i]}))
}

Then, even though the model is slightly more complicated, it is trained with the same Run function, and we can print out the resulting formula, remembering to check our intuition against the sign of the coefficients.

// Train/fit the regression model.
mr.Run()
// Output the trained model parameters.
fmt.Printf("\nRegression Formula:\n%v\n\n", mr.Formula)

Evaluate the new model

We will evaluate the multiple regression model in much the same way as the single regression model above. However, first we need to extract the ltg test values.

// Extract the ltg as a slice of floats.
ltgTest := testDF.Col("ltg").Float()

We also need to re-initialize our MAE value to zero before accumulating the new error values from the multiple regression model.

// Re-initialize the MAE value.
mAE = 0

// Loop over the test data predicting y and evaluating the prediction
// with the mean absolute error.
for i, bmiVal := range bmiTest {

    // Predict y with our trained model.
    yPredicted, err := mr.Predict([]float64{bmiVal, ltgTest[i]})
    if err != nil {
        log.Fatal(err)
    }

    // Add to the mean absolute error.
    mAE += math.Abs(yTest[i]-yPredicted) / float64(testDF.Nrow())
}
// Output the MAE to standard out.
fmt.Printf("MAE = %0.2f\n\n", mAE)

Excellent! The extra complication we added to the model paid off with an ~11% decrease in the MAE.

Conclusion

We have successfully used Go to train a model predicting diabetes disease progression. This included performing all the common steps utilized by data scientists to train models (train/test split, profiling the data, etc.). We first trained a linear regression model with a single predictor and then increased the complexity of our model by adding in a second predictor.

To follow up on this exercise, we recommend exploring the other features in the data set. Can you build a model with some other combination of the features that performs better than the second model we trained above? Discuss your efforts with other data science Gophers in the #data-science channel in Gophers Slack, and be sure to check out all the other great open source tooling being developed for Go, including gophernotes, the Jupyter notebook kernel used to develop this Oriole.

Also note, data scientists will likely recognize that adding more features to such a model could introduce issues with evaluation metrics and overfitting. Hopefully we will explore topics like "regularization" in future Orioles; however, this exercise at least provides a starting point for building machine learning applications with Go.