Live Online training

Data Science for Security Professionals

A hands-on overview of data science and machine learning

Charles Givre

As a security professional, you’re assailed by ever-increasing amounts of data and increasingly sophisticated attacks. Help is here in the form of Charles Givre’s hands-on introduction to data science for security professionals. Using your existing scripting skills, you’ll apply data science techniques to analyze your data more efficiently and machine learning techniques to keep your data and systems more secure.

Through a combination of lecture and exercises, you’ll gain a practical understanding of the entire data science process, including gathering data from diverse sources, exploring and visualizing data, machine learning, and ultimately, scaling your solution to extremely large datasets. By the end of this course, you’ll be confident in your ability to apply this knowledge to extract even more value from your data while shoring up your defenses.

What you'll learn, and how you can apply it

By the end of this course, you will understand:

  • The concepts behind machine learning and how to apply them to security problems
  • The process of transforming raw data into actionable information

And you'll be able to:

  • Quickly and efficiently gather and prepare data for analysis
  • Explore data using basic statistical techniques
  • Create and evaluate basic machine learning models using security data

This training course is for you because...

  • You are a security professional with some scripting skills and you want to apply data science techniques to your work to analyze data more efficiently
  • You are a network analyst with some scripting skills and you want to use machine learning techniques to better secure your network

Prerequisites

  • Have beginner-to-intermediate experience with the Python programming language
  • Be familiar with security and networking concepts

To test whether you will be able to run the Jupyter notebooks in your upcoming training, please:

Navigate to the test site: https://attendee-testing-2.oreilly-jupyterhub.com

  • Sign in with your Safari credentials
  • Click "start my server"
  • Click on "notebook .ipynb"

  • Run each of the code cells: click the cell then either press Shift+Return or click the triangle in the top menu

  • There may be a delay of a few seconds, but you should eventually see the graphs. If you do not, your firewall is probably blocking JupyterHub's websockets. Please turn off your company VPN or ask your system administrator to allow websocket connections.

SETUP INSTRUCTIONS:

A virtual machine with all data sources and all tools pre-configured is available HERE. Students should have access to a computer with at least 8GB of RAM and 20 to 30GB of hard disk space. Setup Instructions

Recommended Preparation:

Introduction to Python

Intermediate Python

About your instructor

  • Mr. Charles Givre has always been interested in solving problems in unique ways, and has worked to make a career of it as a data scientist at Booz Allen Hamilton. At Booz Allen, Mr. Givre worked as a technical leader on various large government projects. Mr. Givre enjoys sharing his passion for data science with others and has worked to develop comprehensive data science training programs at his firm. Prior to joining Booz Allen, Mr. Givre worked as a counterterrorism analyst at the Central Intelligence Agency for nearly five years.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Day 1—Get Data

In this section, you’ll learn how to quickly and efficiently ingest a variety of data types and prepare them for analysis. You’ll also learn the concepts behind vectorized computing.

Introduction: Data Preparation with Pandas (15 Min)

  • What is Pandas and why use it?
  • The Series, DataFrame, and Panel objects
  • The Pandas ecosystem: Scikit-learn, Seaborn, Bokeh

Vectorized Computing in One Dimension: The Series Object (90 Min)

  • Creating a series
  • Describing data
  • Filtering data
  • Other operations on data
  • Activity: Worksheet
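
The Series operations listed above can be sketched in a few lines. This is a minimal illustration rather than course material: it assumes pandas is installed, and the port numbers and variable names are invented sample data.

```python
import pandas as pd

# Hypothetical sample data: destination ports seen in a handful of connections.
ports = pd.Series([22, 80, 443, 443, 8080, 22, 443], name="dst_port")

# Describing data: count, mean, min/max, and quartiles in one call.
summary = ports.describe()

# Filtering data: a boolean mask selects matching elements without a loop.
high_ports = ports[ports > 1024]

# Other operations: frequency of each distinct value, most common first.
port_counts = ports.value_counts()

print(summary)
print(high_ports.tolist())
print(port_counts)
```

The comparison `ports > 1024` is evaluated for the whole Series at once, which is the sense in which the computation is vectorized.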

Vectorized Computing in Two Dimensions: The DataFrame (90 Min)

  • Creating a DataFrame
  • Reading logfiles, APIs and other sources
  • Manipulating data in data frames
  • Applying functions to data frames
  • Aggregating data in data frames
  • Activity: DataFrame Worksheet
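
A rough sketch of the DataFrame steps above (reading, manipulating, applying a function, aggregating) might look like the following. The log layout, column names, and values are invented for illustration and do not correspond to any real log format; pandas is assumed to be installed.

```python
import io
import pandas as pd

# Hypothetical space-delimited log excerpt (invented layout and values).
raw_log = io.StringIO(
    "src_ip status bytes\n"
    "10.0.0.1 200 512\n"
    "10.0.0.2 404 128\n"
    "10.0.0.1 500 64\n"
    "10.0.0.3 200 2048\n"
    "10.0.0.2 404 256\n"
)

# Reading: read_csv accepts any file-like object and a custom separator.
df = pd.read_csv(raw_log, sep=" ")

# Manipulating: add a derived column using a vectorized comparison.
df["is_error"] = df["status"] >= 400

# Applying functions: map each status code to a coarse class label.
df["status_class"] = df["status"].apply(lambda code: f"{code // 100}xx")

# Aggregating: total bytes and number of error responses per source address.
per_ip = df.groupby("src_ip").agg(
    total_bytes=("bytes", "sum"),
    errors=("is_error", "sum"),
)
print(per_ip)
```

In practice the `io.StringIO` object would be replaced by a path to a log file; it is used here only to keep the sketch self-contained.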

Homework: You’ll receive a series of data sources to prepare for analysis

Day 2—Explore Your Data

On day two, you’ll learn the concepts and techniques behind exploratory data analysis as well as practical data visualization techniques.

Statistical Summaries (90 Min)

  • 5-Number summaries
  • Normalizing data
  • Understanding Distributions
  • Correlations
  • Confidence Intervals and P-Values
  • Exercise: Complete EDA Worksheet
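
As a small illustration of three of the topics above (5-number summaries, normalization, and correlation), the following sketch uses pandas on invented measurements. The column names and values are assumptions made for the example, and z-scoring is only one of several common ways to normalize data.

```python
import pandas as pd

# Hypothetical per-session measurements (invented values).
df = pd.DataFrame({
    "duration_s": [1.0, 2.0, 3.0, 4.0, 5.0],
    "bytes_sent": [100, 210, 290, 405, 500],
})

# 5-number summary: minimum, first quartile, median, third quartile, maximum.
five_num = df["duration_s"].quantile([0.0, 0.25, 0.5, 0.75, 1.0])

# Normalizing: z-scores rescale a column to mean 0 and standard deviation 1.
z = (df["bytes_sent"] - df["bytes_sent"].mean()) / df["bytes_sent"].std()

# Correlations: Pearson correlation coefficient between the two columns.
r = df["duration_s"].corr(df["bytes_sent"])

print(five_num)
print(z.round(3).tolist())
print(round(r, 4))
```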

Concepts of Data Visualization (30 Min)

  • Creating effective visualizations
  • Choosing the correct visualization
  • Using visualization to explore data

Practical Data Visualization (90 Min)

  • Using Matplotlib to create basic charts
  • Overview of advanced charts with Seaborn
  • Creating dashboards with Superset
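
A basic Matplotlib chart of the kind mentioned above could be sketched as follows. The event categories and counts are invented sample data, and the example assumes Matplotlib is installed.

```python
import matplotlib.pyplot as plt

# Hypothetical counts of security events per category (invented values).
categories = ["login_fail", "port_scan", "malware", "phishing"]
counts = [42, 17, 5, 23]

# Create one figure with one set of axes and draw a bar chart on it.
fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.bar(categories, counts)

# Label the axes and title so the chart can be read on its own.
ax.set_xlabel("Event type")
ax.set_ylabel("Count")
ax.set_title("Events by type (sample data)")
fig.tight_layout()
```

In a Jupyter notebook the figure is normally rendered when the cell finishes; in a standalone script, `fig.savefig(...)` with a file name of your choice writes it to disk.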

Homework: Complete visualization worksheet

Day 3—Learn From It

Day three will introduce the machine learning process. We will cover model selection, feature engineering, and model evaluation.

Machine Learning Concepts (60 Min)

  • Machine learning process
  • Machine learning problem types
  • Supervised vs Unsupervised machine learning

Unsupervised Machine Learning in Practice (60 Min)

  • Distance measures
  • Nearest Neighbors
  • K-Means
  • Exercise: K-Means Worksheet
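
To make the distance and K-Means topics above concrete, here is a minimal sketch using NumPy and scikit-learn (both assumed to be installed). The six points are hand-constructed so that two clusters are obvious; what the features represent is left hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D feature vectors; two well-separated groups built by hand.
X = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],   # group near the origin
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],   # group near (5, 5)
])

# Distance measure: Euclidean distance between the first and fourth points.
dist = np.linalg.norm(X[0] - X[3])

# K-Means with k=2; random_state fixes the initialization for repeatability.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(dist)
print(labels)
print(model.cluster_centers_)
```

The numeric cluster labels are arbitrary: only the grouping matters, so the same partition may come back with the labels 0 and 1 swapped.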

Supervised Machine Learning: Classification (60 Min)

  • Feature engineering
  • Modeling with Decision Trees and Support Vector Machines
  • Model evaluation
  • Case Study: Classifier to identify SQL Injection
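
The classification workflow above (feature engineering, modeling, evaluation) can be sketched with a decision tree from scikit-learn. This is not the course's case study: the features, the keyword list, and every input string below are invented for illustration, and a dataset this small says nothing about real-world accuracy.

```python
import re
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical keyword list; a realistic detector would need far more.
SQL_KEYWORDS = re.compile(r"\b(select|union|drop|or|and)\b", re.IGNORECASE)

def extract_features(text):
    """Turn an input string into a small, hand-picked numeric feature vector."""
    return [
        text.count("'"),                  # number of single quotes
        len(SQL_KEYWORDS.findall(text)),  # number of SQL keywords
        int("--" in text),                # SQL comment marker present
    ]

# Invented training inputs: label 0 = benign, label 1 = injection attempt.
train_texts = [
    "alice", "bob smith", "search term", "o'brien", "hello world",
    "' OR '1'='1", "1; DROP TABLE users --",
    "' UNION SELECT password FROM users --", "admin' --",
]
train_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]

# Modeling: fit a decision tree on the engineered features.
clf = DecisionTreeClassifier(random_state=0)
clf.fit([extract_features(t) for t in train_texts], train_labels)

# Evaluation: score the model on inputs it has not seen during training.
test_texts = ["carol", "x' OR 'a'='a"]
test_labels = [0, 1]
predictions = clf.predict([extract_features(t) for t in test_texts])
accuracy = accuracy_score(test_labels, predictions)
print(predictions, accuracy)
```

Evaluating on held-out inputs rather than the training set is the important design choice here; with realistic data, scikit-learn's `train_test_split` would typically create that split.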

Final Project: DGA Classifier
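
One plausible starting point for a domain generation algorithm (DGA) classifier is a set of lexical features computed from the domain name. The sketch below is an assumption about how such features might be built, not the project's actual solution; the function names and the random-looking domain are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy of a non-empty string, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def domain_features(domain):
    """Hypothetical features for the label left of the first dot."""
    label = domain.split(".")[0].lower()
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "digit_ratio": sum(ch.isdigit() for ch in label) / len(label),
    }

# A dictionary-like name versus an invented random-looking name.
readable = domain_features("example.com")
random_like = domain_features("xk3j9qz7vb2m.example")
print(readable)
print(random_like)
```

Random-looking labels tend to have higher character entropy than dictionary-like ones, which is why entropy is a common candidate feature; feature dictionaries like these could then be fed to a classifier.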