Making pd.DataFrame.apply Faster with Bodo

July 12, 2021

Niyousha Mohammadshafie

‍

The apply function in Pandas is a game-changer tool as it eases the execution of user-defined functions on all the rows. But the issue with the apply function is that it can be unbearably slow when working with big data. In today’s deadline-driven world, the efficiency of the code is a prerequisite for the success of any data science project. To boost the efficiency of our code, it is important to use the fastest alternative solutions, such as Bodo. Essentially, Bodo offers our code unmatched performance and unlimited scalability. Unlike spark or other similar solutions, Bodo does not need us to make any major changes to our code or learn a new skill. Bodo makes up for Pandas’ lack of speed while staying equally powerful and user-friendly as Pandas.

In this short blog post, we want to compare the performance of the apply function executed on a Pandas DataFrame in Pandas and Bodo. For this comparison, we use used New York city parking violation dataset, sourced from the New York City website. Basically, New York City parking violation datasets have been a notorious example of big data that data scientists struggle with. We aim to see what hour in a day is most likely for a parking violation to happen. To achieve this goal, we need to create a new column in our data frame, which can be done using pandas.DataFrame.apply with a lambda function. The New York City website publishes NYC parking violation datasets every year. Here, we only focus on the year 2017 dataset as an example. Since this is a relatively large dataset and not all the columns are useful, we only load two columns: violation time and plate ID.

Let’s check the data type of our columns and the numbers of data points we have:

To extract the hour from the violation time column, we need to run a user-defined function (UDF) using pandas.DataFrame.apply. Running a UDF in Pandas takes a long time as apply generally does not take advantage of Pandas’ vectorization. Let’s take a look at the execution time of the below cell:

Reading Time: 451.5508031845093
0 1.0
1 16.0    
2 14.0    
3 11.0    
4 17.0    
Name: hour, dtype: float64

As demonstrated in the previous cell, running the UDF with one core in Pandas took 451 seconds (7.5 minutes). Unlike Pandas, Bodo can directly compile the apply code to keep the vectorization while implementing scientifically correct parallelization. This means that with minor changes in the code, we can significantly speed up our UDF. Now, let’s check out the performance of our code with Bodo on 4 cores:

[stdout:0]
Reading Time: 13.471014976501465
0 1
1 16
2 14
3 11
4 17
Name: hour, dtype: Int64

Wow, that was so fast! Running the same code in Bodo took only 13 seconds. That a 30X improvement over Pandas. Now that we have created the ‘hour’ column; we can sum up the total number of violation tickets per hour and visualize the results using Matplotlib and Seaborn libraries.

We show that the total number of parking tickets per hour reaches its maximum around noon and is at its minimum around 4 am.

Conclusion

Without a doubt, the Pandas data frames have become a staple for data scientists, but even simple Pandas functions struggle to keep up with the exponential growth of data. As demonstrated with a simple example in this post, data exploratory analysis with large datasets can be painfully slow. Bodo is a modern solution to significantly speed up the data wrangling process without the need to make major modifications to your Pandas source code and bring scalability to your data science workflow. I hope this blog post is useful to you and encourage you to check out Bodo’s website at https://bodo.ai.

‍