Let's now compare the performance of the two approaches:
def test_pandas_pdf():
    return (big_df
        .withColumn('probability', pandas_pdf(big_df.val))
        .agg(f.count(f.col('probability')))
        .show()
    )

%timeit -n 1 test_pandas_pdf()

# row-by-row version with Python-JVM conversion
@f.udf('double')
def pdf(v):
    return float(stats.norm.pdf(v))

def test_pdf():
    return (big_df
        .withColumn('probability', pdf(big_df.val))
        .agg(f.count(f.col('probability')))
        .show()
    )

%timeit -n 1 test_pdf()
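The %timeit magic used above is only available in IPython and Jupyter. Outside a notebook, a comparable measurement could be sketched with the standard-library timeit module. The names best_time, normal_pdf, and stand_in_workload are hypothetical and introduced only for this illustration; the stand-in workload replaces the Spark jobs, which need a running session and the big_df DataFrame.

```python
import math
import timeit


def best_time(fn, repeat=3):
    """Call fn once per trial and return the fastest time in seconds.

    Hypothetical helper: number=1 mirrors the `-n 1` flag used with %timeit.
    """
    timings = timeit.repeat(fn, number=1, repeat=repeat)
    return min(timings)


def normal_pdf(v):
    """Standard normal PDF, used only so the stand-in has real work to do."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)


def stand_in_workload():
    """Placeholder for test_pandas_pdf() or test_pdf() (assumption)."""
    return sum(normal_pdf(i / 1000.0) for i in range(10_000))


elapsed = best_time(stand_in_workload)
print(f"best of 3 runs: {elapsed:.6f} s")
```

In a script with an active Spark session, best_time(test_pandas_pdf) and best_time(test_pdf) could be called the same way; keep in mind that each run also prints the .show() output.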
The test_pandas_pdf() function uses pandas_pdf(...) to compute the normal-distribution PDF for each value, counts the resulting rows with .count(...), and prints the result with .show(...). The test_pdf() function does the same but uses the