Learning Path: Scaling Python for Big Data

Video Description

If you have some Python experience and want to take it to the next level, this practical, hands-on Learning Path will be a helpful resource. Its video tutorials show you how to use Python for distributed task processing and how to perform large-scale data processing in Spark using the PySpark API; minimal code sketches of both topics follow the table of contents.

Table of Contents

  1. Building Data Pipelines with Python
    1. Welcome To The Course 00:02:53
    2. About The Author 00:01:55
    3. Introduction To Automation 00:02:48
    4. Adventures With Servers 00:06:37
    5. Being A Good Systems Caretaker 00:06:03
    6. What Is A Queue? 00:02:32
    7. What Is A Consumer? What Is A Producer? 00:02:00
    8. Why Celery? 00:01:49
    9. Celery Architecture & Set Up 00:05:25
    10. Writing Your First Tasks 00:07:49
    11. Deploying Your Tasks 00:06:08
    12. Scaling Your Workers 00:08:52
    13. Monitoring With Flower 00:05:05
    14. Advanced Celery Features 00:06:00
    15. Why Dask? 00:03:01
    16. First Steps With Dask 00:10:08
    17. Dask Bags 00:10:18
    18. Dask Distributed 00:09:58
    19. What Are Data Pipelines? What Is A DAG? 00:02:37
    20. Luigi And Airflow: A Comparison 00:05:50
    21. First Steps With Luigi 00:07:12
    22. More Complex Luigi Tasks 00:09:17
    23. Introduction To Hadoop 00:08:21
    24. First Steps With Airflow 00:08:07
    25. Custom Tasks With Airflow 00:09:16
    26. Advanced Airflow: Subdags And Branches 00:11:17
    27. Using Luigi With Hadoop 00:10:15
    28. Apache Spark 00:08:28
    29. Apache Spark Streaming 00:06:32
    30. Django Channels 00:09:39
    31. And Many More 00:05:59
    32. Introduction To Testing With Python 00:07:24
    33. Property-Based Testing With Hypothesis 00:06:09
    34. What's Next? 00:03:57
  2. Introduction to PySpark
    1. Introduction And Course Overview 00:02:01
    2. About The Author 00:01:02
    3. Installing Python 00:04:38
    4. Installing IPython And Using Notebooks 00:06:28
    5. Download And Setup 00:03:24
    6. Running The Spark Shell 00:05:35
    7. Running The Spark Shell With IPython 00:06:38
    8. What Is A Resilient Distributed Dataset - RDD? 00:04:54
    9. Reading A Text File 00:03:34
    10. Actions 00:02:13
    11. Transformations 00:02:30
    12. Persisting Data 00:04:11
    13. Map 00:03:04
    14. Filter 00:03:56
    15. FlatMap 00:03:16
    16. MapPartitions 00:04:07
    17. MapPartitionsWithIndex 00:01:51
    18. Sample 00:02:36
    19. Union 00:01:11
    20. Intersection 00:01:28
    21. Distinct 00:02:02
    22. Cartesian 00:03:17
    23. Pipe 00:03:40
    24. Coalesce 00:02:12
    25. Repartition 00:02:29
    26. RepartitionAndSortWithinPartitions 00:03:58
    27. Reduce 00:04:19
    28. Collect 00:01:56
    29. Count 00:03:05
    30. First 00:01:20
    31. Take 00:01:05
    32. TakeSample 00:03:03
    33. TakeOrdered 00:02:10
    34. SaveAsTextFile 00:04:09
    35. CountByKey 00:02:40
    36. ForEach 00:03:11
    37. GroupByKey 00:02:31
    38. ReduceByKey 00:03:30
    39. AggregateByKey 00:03:44
    40. SortByKey 00:02:47
    41. Join 00:04:16
    42. CoGroup 00:02:09
    43. WholeTextFile 00:03:15
    44. Pickle Files 00:03:59
    45. HadoopInputFormat 00:05:35
    46. HadoopOutputFormat 00:05:31
    47. Broadcast Variables 00:04:17
    48. Accumulators 00:05:08
    49. Using A Custom Accumulator 00:04:52
    50. Partitioning 00:07:56
    51. Spark Standalone Cluster 00:04:26
    52. Mesos 00:03:38
    53. YARN 00:02:28
    54. Client Versus Cluster Mode 00:02:41
    55. Spark Streaming 00:04:21
    56. DataFrames And SQL 00:03:28
    57. MLlib 00:04:29
    58. Resources And Where To Go From Here 00:01:02
    59. Wrap Up 00:01:28
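
The first course centers on distributed task processing with Celery: queues, producers, consumers, and workers. Below is a minimal sketch of that pattern, not code from the course; the app name "pipeline", the Redis broker URL, and the add() task are illustrative assumptions.

    # tasks.py -- minimal Celery sketch; assumes a Redis broker on localhost.
    from celery import Celery

    app = Celery(
        "pipeline",                          # illustrative app name
        broker="redis://localhost:6379/0",   # assumed broker URL
        backend="redis://localhost:6379/0",  # result store, so .get() works
    )

    @app.task
    def add(x, y):
        # Workers (consumers) execute this; callers (producers) enqueue it.
        return x + y

    if __name__ == "__main__":
        result = add.delay(2, 3)       # enqueue the task on the broker
        print(result.get(timeout=10))  # block until a worker returns 5

With a worker started in another shell (for example, celery -A tasks worker), running this script enqueues the task and prints the result. Scaling out is then a matter of starting more worker processes, the subject of the course's "Scaling Your Workers" video.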
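
The second course works through Spark's RDD API: reading text files, transformations such as map and filter, and actions such as count and reduce. Here is a minimal sketch of that flow, assuming a local Spark installation and a placeholder input file data.txt.

    # rdd_sketch.py -- minimal PySpark RDD sketch, run in local mode.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-sketch")

    lines = sc.textFile("data.txt")             # read a text file into an RDD
    lengths = lines.map(len)                    # transformation: line lengths
    nonempty = lengths.filter(lambda n: n > 0)  # transformation: drop empties

    print(nonempty.count())                     # action: number of lines
    print(nonempty.reduce(lambda a, b: a + b))  # action: total characters

    sc.stop()

Transformations are lazy: nothing is read or computed until an action such as count() or reduce() runs, which is the distinction the "Actions" and "Transformations" videos draw early in the course.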