Data Science for Security Professionals
A hands-on overview of data science and machine learning
As a security professional, you’re assailed by ever-increasing amounts of data and increasingly sophisticated attacks. Help is here in the form of Charles Givre’s hands-on introduction to data science for security professionals. Using your existing scripting skills, you’ll apply data science techniques to analyze your data more efficiently and machine learning techniques to keep your data and systems more secure.
Through a combination of lecture and exercises, you’ll gain a practical understanding of the entire data science process, including gathering data from diverse sources, exploring and visualizing data, machine learning, and ultimately, scaling your solution to extremely large datasets. By the end of this course, you’ll be confident in your ability to apply this knowledge to extract even more value from your data while shoring up your defenses.
What you'll learn and how you can apply it
By the end of this course, you will understand:
- The concepts behind machine learning and how to apply them to security problems
- The process of transforming raw data into actionable information
And you'll be able to:
- Quickly and efficiently gather and prepare data for analysis
- Explore data using basic statistical techniques
- Create and evaluate basic machine learning models using security data
This training course is for you because...
- You are a security professional with some scripting skills and you want to apply data science techniques to your work to analyze data more efficiently
- You are a network analyst with some scripting skills and you want to use machine learning techniques to better secure your network
Prerequisites
- Have beginner-to-intermediate experience with the Python programming language
- Be familiar with security and networking concepts
To test whether you will be able to run the Jupyter notebooks in your upcoming training, please:
- Navigate to the test site: https://attendee-testing-2.oreilly-jupyterhub.com
- Sign in with your Safari credentials
- Click "start my server"
- Click on "notebook .ipynb"
- Run each of the code cells: click the cell, then either press Shift+Return or click the triangle in the top menu
There may be a delay of a few seconds, but you should eventually see the graphs. If you do not, your firewall is probably blocking JupyterHub's websockets. Please turn off your company VPN or ask your system administrator to allow the connection.
You will need access to a virtual machine with all data sources and tools preconfigured. Students should have access to a computer with at least 8 GB of RAM and 20 to 30 GB of hard disk space.
About your instructor
Mr. Charles Givre has always been interested in solving problems in unique ways, and has worked to make a career of it as a data scientist at Booz Allen Hamilton. At Booz Allen, Mr. Givre worked as a technical leader on various large government projects. Mr. Givre enjoys sharing his passion for data science with others and has worked to develop comprehensive data science training programs at his firm. Prior to joining Booz Allen, Mr. Givre worked as a counterterrorism analyst at the Central Intelligence Agency for nearly five years.
The timeframes are only estimates and may vary according to how the class is progressing.
Day 1—Get Data
In this section, you’ll learn how to quickly and efficiently ingest a variety of data types and prepare them for analysis. You’ll also learn the concepts behind vectorized computing.
Introduction: Data Preparation with Pandas (15 Min)
- What is Pandas and why use it?
- The Series, DataFrame, and Panel objects
- The Pandas ecosystem: Scikit-learn, Seaborn, Bokeh
Vectorized Computing in One Dimension: The Series Object (90 Min)
- Creating a series
- Describing data
- Filtering data
- Other operations on data
- Activity: Worksheet
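As a rough illustration of the Series topics listed above (creating, describing, and filtering), here is a minimal sketch. The port numbers are made-up sample values, not course data, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample: destination ports seen in a handful of connections
ports = pd.Series([22, 80, 443, 443, 8080, 443, 22], name="dst_port")

# Describing data: count, mean, quartiles, and so on
summary = ports.describe()

# Filtering data: a boolean mask is applied to the whole Series at once
high_ports = ports[ports > 1024]

# Other operations: frequency of each distinct value, most common first
counts = ports.value_counts()

# Vectorized comparison: no explicit loop is needed
is_tls = ports == 443

print(summary)
print(high_ports.tolist())
print(counts)
print(int(is_tls.sum()))
```

The comparison `ports > 1024` is evaluated for every element in one step, which is the "vectorized" part: the loop happens inside the library rather than in your script.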
Vectorized Computing in Two Dimensions: The DataFrame (90 Min)
- Creating a DataFrame
- Reading data from log files, APIs, and other sources
- Manipulating data in data frames
- Applying functions to data frames
- Aggregating data in data frames
- Activity: DataFrame Worksheet
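The DataFrame topics above can be sketched in a few lines. This example assumes a small comma-separated log held in memory. The column names and values are invented for illustration, and a real file path could be passed to `read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical log excerpt (invented values)
raw_log = io.StringIO(
    "timestamp,src_ip,dst_port,bytes\n"
    "2024-01-01 00:00:01,10.0.0.5,443,1200\n"
    "2024-01-01 00:00:02,10.0.0.7,22,300\n"
    "2024-01-01 00:00:03,10.0.0.5,80,4500\n"
    "2024-01-01 00:00:04,10.0.0.7,22,250\n"
)

# Creating a DataFrame from a delimited source
df = pd.read_csv(raw_log, parse_dates=["timestamp"])

# Manipulating data: derive a new column from an existing one
df["kilobytes"] = df["bytes"] / 1000

# Applying a function to every value in a column
df["is_ssh"] = df["dst_port"].apply(lambda port: port == 22)

# Aggregating: total bytes and connection count per source address
per_host = df.groupby("src_ip")["bytes"].agg(["sum", "count"])

print(per_host)
```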
Homework: You’ll receive a series of data sources to prepare for analysis
Day 2—Explore Your Data
On day two, you’ll learn the concepts and techniques behind exploratory data analysis as well as practical data visualization techniques.
Statistical Summaries (90 Min)
- 5-Number summaries
- Normalizing data
- Understanding Distributions
- Confidence Intervals and P-Values
- Exercise: Complete EDA Worksheet
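To make the first two topics above concrete, here is a minimal sketch of a five-number summary and two common normalizations using pandas. The request sizes are invented sample values, and z-score and min-max scaling are only two of several possible normalization choices.

```python
import pandas as pd

# Hypothetical sample: request sizes in bytes, with one obvious outlier
sizes = pd.Series([200, 220, 250, 260, 300, 310, 5000])

# Five-number summary: minimum, first quartile, median, third quartile, maximum
five_num = sizes.quantile([0, 0.25, 0.5, 0.75, 1.0])

# Z-score normalization: subtract the mean, divide by the standard deviation
z_scores = (sizes - sizes.mean()) / sizes.std()

# Min-max normalization: rescale values to the range [0, 1]
min_max = (sizes - sizes.min()) / (sizes.max() - sizes.min())

print(five_num)
print(z_scores.round(2).tolist())
print(min_max.round(3).tolist())
```

Note how the single outlier dominates both normalizations, which is one reason to look at the summary before choosing a scaling method.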
Concepts of Data Visualization (30 Min)
- Creating effective visualizations
- Choosing the correct visualization
- Using visualization to explore data
Practical Data Visualization (90 Min)
- Using Matplotlib to create basic charts
- Overview of advanced charts with Seaborn
- Creating dashboards with Superset
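A basic Matplotlib chart of the kind mentioned above might look like the following sketch. The data is invented, and the `Agg` backend line is an assumption for running outside a notebook (inside Jupyter it can be dropped and the figure renders inline).

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt

# Hypothetical sample: destination ports seen in a handful of connections
ports = [22, 80, 443, 443, 8080, 443, 22]

# Count how often each distinct port appears
labels = sorted(set(ports))
counts = [ports.count(p) for p in labels]

# A bar chart suits a small number of discrete categories
fig, ax = plt.subplots()
ax.bar([str(p) for p in labels], counts)
ax.set_xlabel("Destination port")
ax.set_ylabel("Connections")
ax.set_title("Connections per destination port")
```

Ports are converted to strings so that Matplotlib treats them as categories rather than positions on a numeric axis.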
Homework: Complete visualization worksheet
Day 3—Learn From It
Day three will introduce the machine learning process. We will cover model selection, feature engineering, and model evaluation.
Machine Learning Concepts (60 Min)
- Machine learning process
- Machine learning problem types
- Supervised vs Unsupervised machine learning
Unsupervised Machine Learning in Practice (60 Min)
- Distance measures
- Nearest Neighbors
- Exercise: K-Means Worksheet
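As one possible illustration of distance measures and K-Means clustering, the sketch below groups hosts by two invented features using scikit-learn. The feature choice and values are assumptions made for this example only. In practice the features would usually be scaled first, since Euclidean distance is sensitive to units.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical features per host: [connections per minute, distinct ports contacted]
X = np.array([
    [5, 2], [6, 3], [4, 2], [5, 3],      # quiet hosts
    [90, 60], [95, 65], [88, 58],        # scanner-like hosts
], dtype=float)

# Distance measure: Euclidean distance between a quiet host and a noisy host
dist = np.linalg.norm(X[0] - X[4])

# K-Means assigns each host to the nearest of k cluster centers
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(round(float(dist), 1))
print(labels.tolist())
```

The cluster numbers themselves are arbitrary. What matters is which hosts end up grouped together.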
Supervised Machine Learning: Classification (60 Min)
- Feature engineering
- Modeling with Decision Trees and Support Vector Machines
- Model evaluation
- Case Study: Classifier to identify SQL Injection
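The classification steps above can be sketched end to end with a toy example. This is not the course's case study: the sample strings, the keyword list, and the `extract_features` helper are all hypothetical, and eight samples are far too few for a real model. Scoring on the training data, as done here, only confirms that the pipeline runs. A meaningful evaluation needs held-out data.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical list of tokens that often appear in injection attempts
SQL_KEYWORDS = ("select", "union", "drop", "or 1=1", "--")

def extract_features(text):
    """Turn a raw parameter value into a small numeric feature vector."""
    lowered = text.lower()
    return [
        len(text),                                  # overall length
        text.count("'") + text.count('"'),          # quote characters
        sum(kw in lowered for kw in SQL_KEYWORDS),  # suspicious tokens present
    ]

# Invented samples: (input string, label) where 1 marks an injection attempt
samples = [
    ("alice", 0),
    ("bob@example.com", 0),
    ("O'Brien", 0),
    ("blue widgets", 0),
    ("' OR 1=1 --", 1),
    ("1; DROP TABLE users", 1),
    ("x' UNION SELECT password FROM users --", 1),
    ("admin'--", 1),
]

# Feature engineering: convert each string to numbers the model can use
X = [extract_features(text) for text, _ in samples]
y = [label for _, label in samples]

# Modeling: fit a decision tree to the feature vectors
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Training accuracy only; use a held-out test set for real evaluation
print(model.score(X, y))
```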
Final Project: DGA Classifier
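A classifier for domain generation algorithm (DGA) domains needs numeric features derived from each domain name. The sketch below shows one plausible starting point using only the standard library. The feature set and the `domain_features` helper are assumptions for illustration, not the project's specification. Taking the leftmost label is a simplification that would mishandle names such as `www.example.com`.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def domain_features(domain):
    """Hypothetical feature set computed from the leftmost label of a domain."""
    label = domain.lower().split(".")[0]
    length = len(label)
    vowels = sum(ch in "aeiou" for ch in label)
    digits = sum(ch.isdigit() for ch in label)
    return {
        "length": length,
        "entropy": shannon_entropy(label),
        "vowel_ratio": vowels / length if length else 0.0,
        "digit_ratio": digits / length if length else 0.0,
    }

# One ordinary-looking name and one made-up random-looking name
print(domain_features("example.com"))
print(domain_features("xq7zk2vbp9.com"))
```

Feature dictionaries like these could be collected into a DataFrame and passed to a classifier.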