For a long time, the python multiprocessing library has been a solution for many data scientists and engineers to get faster results when processing time is a pain point. I want to show you a much faster solution: Bodo.
Bodo is the parallel computing platform for extreme performance data analytics. Bodo includes a new type of compiler for python that automatically parallelizes your python code and generates machine code using Message Passing Interface (MPI). MPI is a protocol standard in parallel computing for efficiently passing messages between processes, designed for high performance and scalability.
Developers are moving to Bodo because you can get nearly the same performance as parallelized code written in C++ MPI, but without the pain of code conversion and building complex pipelines that must be carefully maintained.
This blog shows how the performance of the multiprocessing library in python compares to Bodo. Let me give you a hint: Bodo is unbelievably fast! But that's not all.
You can only run python multiprocessing on one machine, whereas with Bodo, you can parallelize your code on a cluster of machines. In fact, you can scale Bodo nearly linearly to thousands of machines! This helps when you need huge memory and have no solution other than converting your code to PySpark and running it on a Spark cluster.
Thanks to Bodo, you no longer have to put in extra effort to scale your code beyond your laptop. It's super easy to install Bodo, just like any library you install using pip. Let's get started.
You need to have python (and pip) installed. If not, I hope this blog is a nice read for you anyway!
Performance on big data: A billion rows
You may wonder if the reported performance gains were only applicable to a small dataframe. How does Bodo perform on larger datasets? Well, let's run on 100x larger data and see. In fact, you'll see that Bodo really shines when data is massive! Here are the results for running this example on a 100x larger dataset: a 1Bx2 dataframe with a size of 6GB saved as parquet. You can do so by running gen_data.py with per=1000 and rep=1_000_000.
Multiprocessing starts showing unexpected behavior as the number of cores grows: it takes longer and longer to complete the job. Regular pandas groupby takes 67 seconds on this relatively larger data set (run python pandas_groupby.py on the new dataset to test), but Bodo can reduce it to less than 14 sec.
I wanted to experiment with the compilation time, so I left cache=False in bodo.jit inside bodo_group.py. The table below shows that the compilation time is around 1 sec. This short time is not noticeable in large computations.
To sum up, we compared pandas, Bodo, and python multiprocessing libraries side by side. Bodo returns the fastest runtime, and python multiprocessing performance degrades as the data and the number of cores grows. For this benchmark and size of data, Bodo is almost 5x faster than pandas and 25x faster than multiprocessing. This difference increases as the size grows from 6GB to 60GB and so on to the point that you cannot scale pandas, i.e., running it on a cluster, as it takes hours, while Bodo takes minutes. Take a look at Bodo benchmarks on a larger dataset in Bodo blog posts. Bodo turned out to be 5x faster than pandas and ~25x faster than the best performance of multiprocessing.
From these results, it seems that it does not make sense to use multiprocessing as it is slower than pandas. Programmers usually develop their python application in a development environment with small data. One of the common mistakes they then make is that they run the same application at scale on larger servers with multiprocessing, assuming it is the way to scale. What they fail to notice is that multiprocessing is not always a good fit. If the number of groups in the data is small, then multiprocessing might be a viable option. For example, if you only have 4 groups in your data, then running with 4 cores using multiprocessing may give you a boost. In this example, you can try generating new data by setting pers=4 and reps = 250_000_000 in gen_data.py, and watch the run time drops from 733 seconds to 390 seconds using multiprocessing with 8 cores. With multiprocessing you need to manually partition your data with groups and then map them. Thus, when your list of dataframe is closer to the number of cores, each core can process one partitioned dataframe. How often does this happen in real life? Rarely! But with Bodo, you do not have this problem, and can safely assume that the code you developed in your development environment will scale very well linearly at large scale!