When it comes to efficient number crunching in Python, two powerful tools are Bodo and Numba. Both aim to speed up Python code using compilation techniques. While there is some overlap in the domains they support, there are notable performance differences, and each has unique areas of support that the other does not. Understanding the distinctions between Bodo and Numba will help developers optimize their code effectively.
In this blog post, the authors of Bodo and of Numba’s auto-parallelization feature will walk you through the differences between these two technologies, explore scenarios where each shines, and provide guidance on choosing the right tool for your performance needs.
TL;DR - Numba accelerates NumPy code on a single machine. Bodo supports NumPy plus other common data science packages like Pandas and Scikit-learn and excels on larger problems by using clusters.
Numba is an open-source Just-In-Time (JIT) compiler targeted at computationally intensive Python/NumPy code, ranging from data science to scientific computing. Numba supports a subset of Python and NumPy that it translates into fast machine code using the LLVM compiler infrastructure. By decorating Python functions with @numba.njit, programmers instruct Numba to compile those functions to native code the first time they are called for each unique combination of argument data types. This lets Numba specialize the code and generate native code (as if the function were rewritten in C), significantly accelerating computationally heavy tasks.
However, by default, Numba-compiled code is still single-threaded. To overcome this limitation and use all of the cores available on modern CPUs, Numba provides an auto-parallelization feature (enabled by adding parallel=True to the njit decorator) that recognizes many NumPy code patterns as parallelizable. Those patterns, along with loops the programmer explicitly marks with numba.prange, are then executed in parallel on all cores. It is important to note that a single Numba-compiled function can contain multiple parallelized regions, and each such parallel region executes in a fork-join style of parallelism.
To give Numba a try, you can install it via conda or pip. The code is available at Numba’s GitHub repo, and you can connect with other Numba users on the Numba Discourse forum.
Bodo is an open-source, high-performance compute engine for Python data processing, predominantly targeting data science and data engineering workloads. Using an innovative auto-parallelizing and auto-distributing just-in-time (JIT) compiler, Bodo simplifies scaling Python workloads from laptops to clusters without major code changes. Under the hood, Bodo’s parallel, distributed execution model relies on MPI-based high-performance computing (HPC) technology, making it both easier to use and often orders of magnitude faster than tools like Spark or Dask. For its compilation infrastructure, Bodo extends Numba and adds support for subsets of common data science packages such as Pandas and Scikit-learn. In addition, Bodo supports scalable I/O for a variety of common data formats such as CSV, JSON, Parquet, and Iceberg.
To use Bodo, programmers decorate Python functions with @bodo.jit, which, like Numba’s decorator, instructs Bodo to compile those functions to native code the first time they are called. However, Bodo goes a step further and specializes not only on argument types but also on the values of certain arguments, such as file names, allowing even more specialized and performant code.
By default, Bodo uses MPI to execute on all the cores of the current machine, but it can be configured to run on clusters of arbitrary size. Bodo does not require users to write any distribution code, but a solid understanding of Bodo’s automatic data distribution can help users make the most of its capabilities.
Bodo reuses part of Numba’s auto-parallelization feature to recognize which operations can be executed in parallel, but builds on top of it to decide how each array should be split between MPI ranks. By default, arrays and tables in Bodo are distributed in chunks along the first dimension, and operations that produce such arrays are parallelized. However, certain operations, like reductions or fancy indexing, cause arrays to be replicated across all ranks.
For typical data processing programs, the compiler follows the map-reduce and relational-table parallel patterns and decides automatically, and accurately, which arrays and tables to parallelize. However, if part of the program is not recognizable by the compiler, replicated arrays are created, which means the computation itself is duplicated across all ranks and yields no performance gain. Any other arrays used in these duplicated regions have to be replicated as well, which can cascade into widespread replication if the programmer is not careful. For this purpose, Bodo provides a tool that shows how it has decided to distribute each of your arrays. That said, some sequential regions between inherently parallel parts of a program are typically unavoidable, and here Bodo has the advantage that MPI’s SPMD style of parallelism has lower overhead than Numba’s fork-join style.
Bodo supports reading and writing a variety of common data science data formats (e.g., Iceberg, Snowflake, Parquet, CSV, JSON, NumPy, and HDF5) within Bodo compiled functions. This allows Bodo to parallelize and distribute the loading and storing of data which can be critical for data-heavy programs.
If you're curious how Bodo performs for your specific use case, install it via pip or conda, check out the GitHub repo for more examples, and connect with other users in the Community Slack to share insights and get support.
From these descriptions, we can summarize when it may be appropriate to use Bodo or Numba, as seen in the table below. In short, Bodo is often preferred when using Pandas or Scikit-learn APIs, or when data sets or computational needs are large enough to require the combined resources of a cluster. Conversely, Numba may be a better fit for smaller datasets, for algorithms that don’t parallelize well under Bodo, or when GPU acceleration is required.
If you’re on a single node using only NumPy, both Bodo and Numba can work, and the final decision will hinge on other programmatic or performance requirements.
Let’s take a look at some code examples. Consider the following k-means implementation written in NumPy. It starts by loading the data points from a file and, to make the comparison between Bodo and Numba fair, also loads a file containing the same set of initial randomized centroids. The example also highlights Bodo’s main data-distribution requirement: all “large” data variables (in this case, data) must be accessed using parallelizable operations along the first dimension (in this case, num_points) to be distributable. The algorithm is also parallelizable with Numba, enabling a direct performance comparison.
```python
import numpy as np
from bodo import jit
from math import sqrt

@jit
def kmeans_bodo(num_points, num_features, k, num_iterations):
    datafile = np.fromfile("kmeans_data", dtype=np.float64)
    data = datafile.reshape((num_points, num_features))
    centroidsfile = np.fromfile("orig_centroids", dtype=np.float64)
    centroids = centroidsfile.reshape((k, num_features))
    for it in range(num_iterations):
        # distance of every point to every centroid
        dist = np.array([[sqrt(np.sum((data[i, :] - centroids[j, :]) ** 2))
                          for j in range(k)]
                         for i in range(num_points)])
        # assign each point to its nearest centroid
        labels = np.array([dist[i, :].argmin() for i in range(num_points)])
        # recompute each centroid as the mean of its assigned points
        centroids = np.array([[np.sum(data[labels == i, j]) / np.sum(labels == i)
                               for j in range(num_features)]
                              for i in range(k)])
    centroids.tofile("centroids.out")
    labels.tofile("labels.out")
```
The following figure shows the runtime of Bodo and Numba for the above code and the given parameters, running on a 16-core Intel 14900K with 96 GB of RAM. The equivalent Numba code loads and saves the inputs and outputs outside of the Numba function, since Numba does not support file I/O, but in both cases the loading and saving time is negligible.
Bodo is about 40% faster than Numba in this case. In general, Bodo’s process-level parallelism with MPI has several advantages over Numba’s threading. First, MPI processes have their own isolated memory regions and communicate using messages, which avoids “false sharing” of cache lines, lock contention, and other resource-contention issues of threading. Second, the message-passing approach requires Bodo to make precise parallelism and data-communication decisions, avoiding expensive implicit data communication through shared memory’s cache coherence in some cases. Third, Bodo launches its MPI processes only once, whereas Numba may incur repeated fork-join overheads. In short, MPI process-style parallelism can have significant benefits over threading even on a single node.
Both Numba and Bodo offer powerful capabilities for accelerating Python code, but they serve different purposes and excel in different use cases. For strictly NumPy code on a single machine, Numba is often simpler to use and, in many cases, on par with Bodo in performance. On the other hand, Bodo is designed for large-scale data processing across clusters and as such supports additional Python packages like Pandas and Scikit-learn, which are frequently used in data engineering workflows. However, even for single-node, strictly NumPy code, we have shown that there are cases where Bodo has a performance advantage.
By carefully assessing your project's requirements, data size, computational needs, and execution environment, you can choose the tool that will deliver the best performance for your specific use case. Remember that the key to optimization is not just about choosing the right tool but also about understanding the nature of your problem and how best to leverage the tools at your disposal.