The filter() API walks through a parallelized distributed collection (that is, an RDD) and applies the selection criteria supplied to filter() as a lambda, including or excluding each element from the resulting RDD. The combination of map(), which transforms each element, and filter(), which selects a subset of elements, is a powerful pattern in Spark ML programming.
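The following is a minimal sketch of this map()/filter() combination; the object name, app name, and sample values are illustrative and not taken from the source. It parallelizes a small local collection into an RDD, keeps only the even numbers, and squares each survivor.

```scala
import org.apache.spark.sql.SparkSession

object FilterMapSketch {
  def main(args: Array[String]): Unit = {
    // Local Spark session for demonstration purposes (assumed setup)
    val spark = SparkSession.builder()
      .appName("FilterMapSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Parallelize a small collection into an RDD
    val nums = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))

    // filter(): the lambda decides which elements survive
    val evens = nums.filter(n => n % 2 == 0)

    // map(): transform each surviving element
    val squared = evens.map(n => n * n)

    squared.collect().foreach(println)   // prints 4, 16, 36
    spark.stop()
  }
}
```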
We will see later how the DataFrame API offers a similar filter() method that achieves the same effect through a higher-level abstraction, comparable to the data frames used in R and in Python (pandas).
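As a brief preview, a hedged sketch of the same selection expressed with the DataFrame filter() method is shown below; the column name and values are illustrative assumptions, not from the source.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameFilterSketch {
  def main(args: Array[String]): Unit = {
    // Local Spark session for demonstration purposes (assumed setup)
    val spark = SparkSession.builder()
      .appName("DataFrameFilterSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build a one-column DataFrame and keep only the rows where n is even
    val df = Seq(1, 2, 3, 4, 5, 6).toDF("n")
    df.filter($"n" % 2 === 0).show()

    spark.stop()
  }
}
```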