When it comes to efficient number crunching in Python, two powerful tools are Bodo and Numba. Both aim to speed up Python code using compilation techniques. While there is some overlap in the domains they support, there are notable performance differences, and each has unique areas of support that the other does not. Understanding the distinctions between Bodo and Numba will help developers optimize their code effectively.
In this blog post, the authors of Bodo and of Numba’s auto-parallelization feature will walk you through the differences between these two technologies, explore scenarios where each shines, and provide guidance on choosing the right tool for your performance needs.
TL;DR - Numba accelerates NumPy code on a single machine. Bodo supports NumPy plus other common data science packages like Pandas and Scikit-learn and excels on larger problems by using clusters.
Numba is an open-source Just-In-Time (JIT) compiler targeted at computationally intensive Python/NumPy code, ranging from data science to scientific computing. Numba supports a subset of Python and NumPy that it translates into fast machine code using the LLVM compiler infrastructure. By decorating Python functions with @numba.njit, programmers instruct Numba to compile those functions to native code the first time they are called for each unique combination of argument data types. This lets Numba specialize the code and generate native code (as if the function were rewritten in C), significantly accelerating computationally heavy tasks.
However, by default, Numba-compiled code is still single-threaded. To overcome this limitation and use all of the cores available on modern CPUs, Numba provides an auto-parallelization feature (enabled by adding parallel=True to the njit decorator) that recognizes many NumPy code patterns as parallelizable. Those patterns, along with loops the programmer explicitly marks with numba.prange, are then executed in parallel on all cores. It is important to note that a single Numba-compiled function can contain multiple parallelized regions, and each such parallel region executes in a fork-join style of parallelism.
To give Numba a try, you can install it via conda or pip. The code is available at Numba’s GitHub repo, and you can connect with other Numba users on the Numba Discourse forum.
Bodo is an open-source, high-performance compute engine for Python data processing, predominantly targeting data science and data engineering workloads. Using an innovative auto-parallelizing and auto-distributing just-in-time (JIT) compiler, Bodo simplifies scaling Python workloads from laptops to clusters without major code changes. Under the hood, Bodo’s parallel, distributed execution model relies on MPI-based high-performance computing (HPC) technology, making it both easier to use and often orders of magnitude faster than tools like Spark or Dask. For its compilation infrastructure, Bodo extends Numba and adds support for subsets of common data science packages such as Pandas and Scikit-learn. In addition, Bodo supports scalable I/O for a variety of common data formats such as CSV, JSON, Parquet, and Iceberg.
To use Bodo, programmers decorate Python functions with @bodo.jit, which, like Numba’s decorator, instructs Bodo to compile those functions to native code the first time they are called. However, Bodo goes a step further and specializes not only on argument types but also on the values of certain arguments, such as file names, allowing even more specialized and performant code.
By default, Bodo uses MPI to execute on all the cores of the current machine, but it can be configured to run on clusters of arbitrary size. Bodo does not require users to write any distribution code, but a solid understanding of Bodo’s automatic data distribution can help users make the most of its capabilities.
Bodo reuses part of Numba’s auto-parallelization feature to recognize which operations can be executed in parallel, but builds on top of it to decide how each array should be split between MPI ranks. By default, arrays and tables in Bodo are distributed in chunks along the first dimension, and operations that produce such arrays are parallelized. However, certain operations, like reductions or fancy indexing, cause arrays to be replicated across all ranks.
For typical data processing programs, the compiler follows the map-reduce and relational-table parallel patterns and decides automatically, and accurately, which arrays and tables to parallelize. However, if part of the program is not recognizable by the compiler, replicated arrays are created, which means the computation itself is duplicated across all ranks and yields no performance gain. Any other arrays used in these duplicated regions have to be replicated as well, which can cascade into widespread replication if the programmer is not careful. For this purpose, Bodo provides a tool that shows how it has decided to distribute each of your arrays. That said, some sequential regions between inherently parallel parts of a program are typically unavoidable, and here Bodo has the advantage that MPI’s SPMD style of parallelism has lower overhead than Numba’s fork-join style.
Bodo supports reading and writing a variety of common data science data formats (e.g., Iceberg, Snowflake, Parquet, CSV, JSON, NumPy, and HDF5) within Bodo compiled functions. This allows Bodo to parallelize and distribute the loading and storing of data which can be critical for data-heavy programs.
If you're curious how Bodo performs for your specific use case, install it via pip or conda, check out the GitHub repo for more examples, and connect with other users in the Community Slack to share insights and get support.
From these descriptions, we can summarize when it may be appropriate to use Bodo or Numba, as seen in the table below. In short, Bodo is often preferred when using Pandas or Scikit-learn APIs, or when data sets or computational needs are large enough to require the combined resources of a cluster. Conversely, Numba may be a better fit for smaller datasets, for algorithms that don’t parallelize well under Bodo, or when GPU acceleration is required.
If you’re on a single node using only NumPy, both Bodo and Numba can work, and the final decision will hinge on other programmatic or performance requirements.
Let’s take a look at some code examples. Consider the following k-means implementation written in NumPy. It starts by loading the data points from a file and, to make the comparison between Bodo and Numba fair, also loads a file containing the same set of initial randomized centroids. The example also highlights Bodo’s main data-distribution requirement: all “large” data variables (in this case, data) must be accessed using parallelizable operations along the first dimension (in this case, num_points) to be distributable. The algorithm is also parallelizable with Numba, enabling a direct performance comparison.
```python
import numpy as np
from bodo import jit
from math import sqrt

@jit
def kmeans_bodo(num_points, num_features, k, num_iterations):
    datafile = np.fromfile("kmeans_data", dtype=np.float64)
    data = datafile.reshape((num_points, num_features))
    centroidsfile = np.fromfile("orig_centroids", dtype=np.float64)
    centroids = centroidsfile.reshape((k, num_features))
    for it in range(num_iterations):
        # distance of every point to every centroid
        dist = np.array([[sqrt(np.sum((data[i, :] - centroids[j, :]) ** 2))
                          for j in range(k)]
                         for i in range(num_points)])
        # assign each point to its nearest centroid
        labels = np.array([dist[i, :].argmin() for i in range(num_points)])
        # recompute each centroid as the mean of its assigned points
        centroids = np.array([[np.sum(data[labels == i, j]) / np.sum(labels == i)
                               for j in range(num_features)]
                              for i in range(k)])
    centroids.tofile("centroids.out")
    labels.tofile("labels.out")
```
The following figure shows the runtime of Bodo and Numba for the above code and the given parameters, running on a 16-core Intel 14900K with 96 GB of RAM. The equivalent Numba code loads and saves the inputs and outputs outside of the Numba function, since Numba does not support file I/O, but in both cases the loading and saving time is negligible.
Bodo is about 40% faster than Numba in this case. In general, Bodo’s process-level parallelism with MPI has several advantages over Numba’s threading. First, MPI processes have their own isolated memory regions and communicate using messages, which avoids “false sharing” of cache lines, lock contention, and other resource-contention issues of threading. Second, the message-passing approach requires Bodo to make precise parallelism and data-communication decisions, avoiding expensive implicit data communication through shared memory’s cache coherence in some cases. Third, Bodo launches its MPI processes only once, whereas Numba may incur repeated fork-join overheads. In short, MPI process-style parallelism can have significant benefits over threading even on a single node.
Both Numba and Bodo offer powerful capabilities for accelerating Python code, but they serve different purposes and excel in different use cases. For strictly NumPy code on a single machine, Numba is often simpler to use and, in many cases, on par with Bodo in performance. On the other hand, Bodo is designed for large-scale data processing across clusters and as such supports additional Python packages like Pandas and Scikit-learn, which are frequently used in data engineering workflows. However, even for single-node, strictly NumPy code, we have shown that there are cases where Bodo has a performance advantage.
By carefully assessing your project's requirements, data size, computational needs, and execution environment, you can choose the tool that will deliver the best performance for your specific use case. Remember that the key to optimization is not just about choosing the right tool but also about understanding the nature of your problem and how best to leverage the tools at your disposal.