Live online training

Enhanced security with machine learning

Harness machine learning to hunt potential threats quickly and efficiently

Charles Givre

Many machine learning resources focus heavily on the math behind the models and use datasets that are irrelevant to those working in security. Join expert Charles Givre for a hands-on introduction to machine learning created specifically for security professionals, who are being challenged by ever-increasing amounts of data and increasingly sophisticated attackers. Charles presents the theory you need along with datasets that are directly applicable to the problems you face in your role in security. Through a combination of lecture and exercises, you'll gain a practical understanding of the entire data science process, from efficiently gathering data from diverse sources, through exploring and visualizing data and building machine learning models, to scaling your solution to extremely large datasets.

What you'll learn and how you can apply it

By the end of this live, online course, you’ll understand:

  • Machine learning key concepts and how to apply them to security problems
  • How to build a machine learning model and evaluate its performance

And you’ll be able to:

  • Quickly and efficiently gather and prepare data for analysis
  • Explore data using basic statistical techniques
  • Create and evaluate basic machine learning models using security data

This training course is for you because...

  • You're a security professional with some scripting skills who wants to learn how to integrate machine learning into your workflow to identify potential threats.
  • You're a network analyst with some data analysis experience who wants to use machine learning techniques to better secure your network.

Prerequisites

  • A working knowledge of Python
  • A basic understanding of statistics and security and networking concepts

System Test:

To test whether you will be able to run the Jupyter notebooks in your upcoming training, please:

Navigate here: https://notebook.oreilly-jupyterhub.com (This is the link to the test site)

  • Sign in with your Safari credentials
  • Click "start my server"
  • Click on "notebook .ipynb"

  • Run each of the code cells: click the cell then either press Shift+Return or click the triangle in the top menu

  • There may be a delay of a few seconds, but you should eventually see the graphs. If you do not, your firewall is probably blocking JupyterHub's websockets. Please turn off your company VPN or ask your system administrator to allow websocket connections.

Required materials and setup:

  • A machine (8+ GB of RAM and 20-30 GB of available hard disk space) with the Griffon
    Virtual Machine for Data Science installed (https://github.com/gtkcyber/griffon-vm)
  • Materials downloaded from the course GitHub repository

Recommended preparation:

About your instructor

  • Charles Givre has always been interested in solving problems in unique ways. He's made a career of it as a data scientist at Booz Allen Hamilton, where he works as a technical leader on large government projects. Charles enjoys sharing his passion for data science with others and has developed comprehensive data science training programs at his firm. Previously, he worked as a counterterrorism analyst at the Central Intelligence Agency. Charles is a sought-after speaker and has delivered training and talks at international conferences, including Black Hat, Strata + Hadoop World, and Open Data Science Conference (ODSC), among others. He has contributed to the Apache Drill codebase and is the coauthor of the first O’Reilly book about Drill; he has also delivered numerous workshops on the topic. Charles holds an MA in Middle Eastern studies from Brandeis University and both a BS in computer science and a bachelor of music from the University of Arizona. He also holds a number of professional certifications, including CISSP and Security+. Charles blogs at Thedataist.com. In his nonexistent spare time, he enjoys spending time with his family and restoring classic cars.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Day 1: Introduction and exploratory data analysis

Introduction to machine learning (30 minutes)
- Lecture: An overview of machine learning; solving security problems with machine
learning; where does machine learning fit in your security ecosystem?

The machine learning process (30 minutes)
- Lecture: Prework (thinking through the question, figuring out the data); exploring
your data; gathering and cleaning the data; engineering and selecting features;
selecting, building, and training your model; evaluating your model’s performance;
putting your model in production

Break (10 minutes)

The machine learning ecosystem (15 minutes)
- Lecture: Scripting languages and modules; visualization tools; big data tools

Hands-on exercise (30 minutes)

Break (10 minutes)

Data ingestion and exploration (60 minutes)
- Lecture: An overview of pandas, Series, and DataFrame; summarizing data with pandas;
visualizing data with pandas, seaborn, and Yellowbrick

Break (10 minutes)

Hands-on exercise (30 minutes)
- Prepare your data

Day 2: Classification

Feature engineering and selection (45 minutes)
- Lecture: Selecting, preparing, and visualizing features
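
To give a flavor of feature engineering on security data, here is a minimal sketch that turns a URL into a handful of numeric features. It is not taken from the course materials: the function names (`shannon_entropy`, `url_features`) and the specific features chosen are illustrative assumptions, and only the Python standard library is used.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def shannon_entropy(text):
    """Return the Shannon entropy (bits per character) of a string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_features(url):
    """Turn a URL into a small dict of numeric features (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "uses_https": int(parsed.scheme == "https"),
        # Randomly generated hostnames tend to have higher character entropy.
        "host_entropy": shannon_entropy(host),
    }


features = url_features("http://example.com/login?id=123")
print(features)
```

Each resulting dictionary can become one row of a feature table, which is the shape most classifiers expect as input.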

Classification models (45 minutes)
- Lecture: Logistic regression; k-NN classifier; decision trees; random forest

Break (10 minutes)

Hands-on exercise (30 minutes)
- Build a classifier to classify malicious URLs

Evaluating performance (30 minutes)
- Lecture: Accuracy, precision, and recall; visualizing confusion matrices and
classification reports
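
As a rough illustration of the metrics named above, the sketch below computes accuracy, precision, and recall from the four cells of a binary confusion matrix. The labels are made-up toy data (1 = malicious, 0 = benign) and the helper names are hypothetical; in practice a library such as scikit-learn provides equivalent functions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall for a binary classifier."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        # Share of all predictions that were correct.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Of everything flagged as malicious, how much really was?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of everything truly malicious, how much did we catch?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels, invented for illustration: 1 = malicious, 0 = benign.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In security data the malicious class is usually rare, so accuracy alone can look good while recall is poor; that is why the three metrics are reported together.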

Break (10 minutes)

Fine-tuning your model (30 minutes)
- Lecture: Grid search for hyperparameter tuning; model selection
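
The idea behind grid search can be sketched in a few lines: try every combination of candidate parameter values and keep the one that scores best. The example below is a simplified stand-in, not the course's code. The rule-based "model", the toy samples, and the parameter names `max_length` and `max_digits` are all hypothetical, and the score is computed on the same data used to choose the parameters, whereas a real search (for example scikit-learn's `GridSearchCV`) would use cross-validation.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the best (params, score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        # Strict comparison: on ties, the first combination found is kept.
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy samples, invented for illustration: (url_length, digit_count, label).
SAMPLES = [
    (20, 0, 0),
    (25, 1, 0),
    (30, 4, 0),
    (45, 3, 1),
    (35, 9, 1),
    (80, 12, 1),
]


def rule_accuracy(max_length, max_digits):
    """Accuracy of a rule that flags a URL if it is too long or too digit-heavy."""
    correct = 0
    for length, digits, label in SAMPLES:
        predicted = int(length > max_length or digits > max_digits)
        correct += int(predicted == label)
    return correct / len(SAMPLES)


param_grid = {"max_length": [40, 50, 70], "max_digits": [2, 5, 10]}
best_params, best_score = grid_search(param_grid, rule_accuracy)
print(best_params, best_score)
```

The cost grows with the product of the list lengths (here 3 x 3 = 9 evaluations), which is why grids are usually kept coarse.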

Hands-on exercise (30 minutes)
- Tune your model

Break (10 minutes)

Advanced topics (30 minutes)
- Lecture: Neural nets and their applications to security; deep learning case studies;
hunting with data science

Day 3: Clustering and unsupervised learning

Measuring distances (30 minutes)
- Lecture: Cosine distances; Euclidean distances; other distance functions
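
For readers who want a concrete picture of the two distances named above, here is a minimal standard-library sketch. The function names and sample vectors are illustrative assumptions; numerical libraries offer faster, vectorized equivalents.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Two vectors pointing in the same direction but with different magnitudes.
a = [1, 2, 3]
b = [2, 4, 6]
print(euclidean_distance(a, b))  # non-zero: the points are far apart
print(cosine_distance(a, b))     # approximately zero: same direction
```

The contrast matters when choosing a metric: Euclidean distance is sensitive to magnitude, while cosine distance compares only direction, so two hosts with the same traffic mix but different volumes look close under cosine distance and far apart under Euclidean distance.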

Clustering models (30 minutes)
- Lecture: K-means; DBSCAN

Break (10 minutes)

Hands-on exercise (30 minutes)
- Detect anomalies using clustering techniques

Evaluating performance (30 minutes)
- Lecture: Performance metrics for clustering, which are harder to define than for
classification; using Yellowbrick to visualize model performance

Break (10 minutes)

Hands-on exercise (45 minutes)
- Improve your model

Taking a model to production (30 minutes)
- Lecture: Pipelines in scikit-learn; pickling your model
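
Pickling means serializing a trained model to bytes so it can be saved and reloaded later without retraining. The sketch below shows the round trip with Python's built-in `pickle` module. To keep it dependency-free, the "model" here is a hypothetical dictionary of hand-picked weights rather than a fitted scikit-learn estimator, and the `predict` helper is likewise an assumption for illustration.

```python
import pickle

# Hypothetical "trained model": parameters of a simple linear scoring rule.
model = {
    "feature_names": ["url_length", "num_digits"],
    "weights": [0.03, 0.5],
    "bias": -2.0,
}


def predict(model, features):
    """Return 1 (flag) if the weighted score is positive, else 0."""
    score = model["bias"] + sum(
        weight * value for weight, value in zip(model["weights"], features)
    )
    return int(score > 0)


# Serialize to bytes; these could be written to a file opened in "wb" mode.
blob = pickle.dumps(model)

# Later, or in another process: restore the object from the bytes.
restored = pickle.loads(blob)

print(predict(restored, [80, 3]))  # long URL with several digits
print(predict(restored, [20, 0]))  # short URL with no digits
```

One caution that is especially relevant for a security audience: unpickling can execute arbitrary code, so only load pickle files from sources you trust.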

Break (10 minutes)

Hands-on exercise (45 minutes)
- Final project

Wrap-up and Q&A (30 minutes)
- Lecture: Where to go from here; scaling your models