How to Scale Your Python Application: Multiprocessing Library vs. Bodo

Date
February 23, 2022
Author
Ali Reza Farhidzadeh

For a long time, the Python multiprocessing library has been the go-to solution for many data scientists and engineers when processing time is a pain point. I want to show you a much faster alternative: Bodo.

Bodo is a parallel computing platform for extreme-performance data analytics. Bodo includes a new type of compiler for Python that automatically parallelizes your Python code and generates machine code that uses the Message Passing Interface (MPI). MPI is a standard protocol in parallel computing for efficiently passing messages between processes, designed for high performance and scalability.

Developers are moving to Bodo because you can get nearly the same performance as parallelized code written in C++ MPI, but without the pain of code conversion and building complex pipelines that must be carefully maintained.

This blog shows how the performance of Python's multiprocessing library compares to Bodo's. Let me give you a hint: Bodo is unbelievably fast! But that's not all.

You can only run Python multiprocessing on one machine, whereas with Bodo you can parallelize your code across a cluster of machines. In fact, you can scale Bodo nearly linearly to thousands of machines! This helps when you need huge amounts of memory and would otherwise have no option but to convert your code to PySpark and run it on a Spark cluster.

Thanks to Bodo, you no longer have to put in extra effort to scale your code beyond your laptop. Installing Bodo is as easy as installing any other library with pip. Let's get started.

Prerequisites:

You need Python (and pip) installed. If not, I hope this blog is a nice read for you anyway!

Steps

  1. We need some data, so let's generate it. gen_data.py generates a (10M x 2) dataframe. Run it using Python:
    python gen_data.py
    It generates a parquet file which is stored in the same directory.
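The original gen_data.py is not reproduced in this post, but from the parameters mentioned later (per and rep, whose product is the row count), a minimal sketch could look like this; the column names A and B and the output file name are assumptions:

```python
import numpy as np
import pandas as pd

def gen_data(per=10_000, rep=1_000):
    """Build a (per * rep) x 2 dataframe: `per` distinct keys repeated `rep` times."""
    return pd.DataFrame({
        "A": np.tile(np.arange(per), rep),  # grouping key with `per` distinct values
        "B": np.random.rand(per * rep),     # random values to aggregate
    })

# As the blog's script presumably does, write the result to parquet in the
# current directory (file name is a placeholder):
# gen_data().to_parquet("groupby_data.pq")
```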
  2. Now run pandas groupby on this small (10M x 2) dataframe.
    python pandas_groupby.py
    Output:
    Total time: 0.29
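pandas_groupby.py is not shown in the post either; a minimal sketch, assuming key/value columns A and B and a sum aggregation, could be:

```python
import time
import pandas as pd

def pandas_groupby(df):
    # plain pandas: group by the key column and sum the values
    return df.groupby("A")["B"].sum()

# Timing it on the generated file (file and column names are assumptions):
# df = pd.read_parquet("groupby_data.pq")
# t0 = time.time()
# res = pandas_groupby(df)
# print(f"Total time: {time.time() - t0:.2f}")
```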
  3. Open multiprocess_ex.py. Change the ncores variable to the number of CPU cores you want to use. For example, ncores = 2. Then run multiprocess_ex.py:
    python multiprocess_ex.py
    Output:
    Total time: 2.54
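multiprocess_ex.py is also not reproduced here, but based on the description later in this post (manually partitioning the dataframe by group and mapping the partitions over a pool of workers), a plausible sketch is the following; the hash-based partitioning scheme and the column names are assumptions:

```python
import multiprocessing as mp
import pandas as pd

ncores = 2  # set to the number of CPU cores you want to use

def agg(part):
    # each worker aggregates its own partition independently
    return part.groupby("A")["B"].sum()

def mp_groupby(df, ncores=ncores):
    # manual partitioning: hash the key so each worker gets a disjoint set of
    # groups, then map the partitions over a process pool and concatenate
    parts = [g for _, g in df.groupby(df["A"] % ncores)]
    with mp.Pool(ncores) as pool:
        results = pool.map(agg, parts)
    return pd.concat(results)
```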
    I ran this code repeatedly with 1, 2, 4, and 8 cores and found that it had the shortest runtime with 4 cores. Now you may wonder why multiprocessing ran even slower than the regular pandas groupby. Two reasons:
    1. Multiprocessing has overhead for creating worker processes and passing data between them.
    2. Multiprocessing performance degrades as the number of groups in the groupby increases.
    You can test this by generating new data in which you reduce the period from 10_000 to 100 and increase the repetition to 100_000 (to keep 10M rows), and trying again. You will see the runtime drop! Now let's see how Bodo performs on groupby. Before running Bodo, let's install it. You can use either conda or pip. Here is an example with pip:
    pip install bodo
  4. Now we can run bodo_groupby.py using the command below. You may have noticed it's the same code as pandas_groupby.py but with import bodo and a @bodo.jit decorator. This code applies the same logic to the input dataframe but with the Bodo engine. You may also have noticed that with Bodo we do not need to partition the dataframe manually and use a map function. There is also a cache=True argument in bodo.jit: it caches the compiled binary to save compilation time the next time we run the code. Let's run with 1 core first.
    mpiexec -n 1 python bodo_groupby.py
    Output:
    Compute time: 0.25
    Compilation time: 0.85
    Total time second call: 1.1
    Let's rerun it:
    mpiexec -n 1 python bodo_groupby.py
    Output:
    Compute time: 0.31
    Compilation time: 0.04
    Total time second call: 0.35
    You can see the compilation time dropped to 0.04 sec vs. 0.85 in the previous run. That's the power of caching the compiled code.
  5. Let's run bodo_groupby.py with two cores to see the difference.
    mpiexec -n 2 python bodo_groupby.py
    In this code, we split out the compute and compilation times and report the total time so we can compare with the multiprocessing example. Output:
    Compute time: 0.17
    Compilation time: 0.03
    Total time second call: 0.20
    Notice the compute time is almost cut in half! That's expected because we doubled the number of cores. The chart below illustrates multiprocessing's performance versus Bodo's, and the table below shows the details.

    [Chart: runtime of multiprocessing vs. Bodo by number of cores]

    Number of Cores | multiprocessing | bodo_total | bodo_compute | bodo_compile
    1               | 3.61            | 0.35       | 0.31         | 0.04
    2               | 2.54            | 0.20       | 0.17         | 0.03
    4               | 2.02            | 0.15       | 0.12         | 0.03
    (all times in seconds)

    Source: Bodo.ai

    You can see that Bodo's runtime is about 10x faster than the multiprocessing library's. Now let's look at Bodo, pandas, and multiprocessing performance all together side by side. The best multiprocessing result comes with 4 cores; for a small dataframe, the overhead of using more cores, such as 8, actually increases multiprocessing's runtime.

    [Chart: pandas, multiprocessing, and Bodo runtimes compared]

Performance on big data: A billion rows

You may wonder whether the reported performance gains apply only to a small dataframe. How does Bodo perform on larger datasets? Let's run on 100x larger data and see. In fact, you'll see that Bodo really shines when data is massive! Here are the results of running this example on a 100x larger dataset: a (1B x 2) dataframe with a size of 6GB saved as parquet. You can generate it by running gen_data.py with per=1000 and rep=1_000_000.

Multiprocessing shows unexpected behavior as the number of cores grows: it takes longer and longer to complete the job. A regular pandas groupby takes 67 seconds on this larger dataset (run python pandas_groupby.py on the new dataset to test), but Bodo reduces it to less than 14 seconds.

[Chart: multiprocessing vs. Bodo runtime on the 1B-row dataset]

I wanted to measure the compilation time, so I set cache=False in bodo.jit inside bodo_groupby.py. The table below shows that the compilation time is around 1 second, which is negligible in large computations.

Number of Cores | multiprocessing | bodo_total | bodo_compute | bodo_compile
1               | 345.49          | 40.68      | 39.62        | 1.06
2               | 379.00          | 28.46      | 27.53        | 0.93
4               | 461.60          | 17.27      | 16.00        | 1.27
8               | 733.28          | 13.92      | 12.53        | 1.39
(all times in seconds)

Source: Bodo.ai

To sum up, we compared pandas, Bodo, and Python's multiprocessing library side by side. Bodo delivers the fastest runtime, while Python multiprocessing performance degrades as the data and the number of cores grow. For this benchmark and data size, Bodo is almost 5x faster than pandas and ~25x faster than the best multiprocessing result. The gap widens as the data grows from 6GB to 60GB and beyond, to the point where pandas cannot scale at all, since it cannot run on a cluster: it takes hours, while Bodo takes minutes. Take a look at Bodo's benchmarks on larger datasets in other Bodo blog posts.

[Chart: summary of pandas, multiprocessing, and Bodo runtimes]

From these results, it might seem that it never makes sense to use multiprocessing, since it is slower than pandas. Programmers usually develop their Python applications in a development environment with small data. A common mistake is to then run the same application at scale on larger servers with multiprocessing, assuming that is the way to scale, without noticing that multiprocessing is not always a good fit. If the number of groups in the data is small, multiprocessing might be a viable option. For example, if you only have 4 groups in your data, running with 4 cores using multiprocessing may give you a boost. You can try this by generating new data with per=4 and rep=250_000_000 in gen_data.py and watching the runtime drop from 733 seconds to 390 seconds with multiprocessing on 8 cores. With multiprocessing, you need to manually partition your data by group and then map the partitions; when the number of partitioned dataframes is close to the number of cores, each core can process one partition. How often does this happen in real life? Rarely! But with Bodo you do not have this problem, and you can safely assume that the code you developed in your development environment will scale nearly linearly at large scale!
