Bodo’s Approach to Open Platforms and Open Source

May 9, 2022

Ehsan Totoni

Bodo’s mission is to enable easy access to high-performance computing: to build a platform that makes working with petabyte-scale datasets as fast and straightforward as running pandas on small datasets on a laptop. We believe that Python/pandas (integrated with SQL) should be a “first-class” production solution in the industry, not just a prototyping tool. Bodo’s JIT compiler brings C++/MPI levels of speed and scalability to make it happen.

But big data infrastructure is extremely complex, and no individual or company can do it alone. So we have chosen to stand on the shoulders of giants: to build an open platform that leverages existing innovative ideas and tools in the data domain, especially from open source communities.

In short, we use – and actively contribute to – community-driven open source projects as much as possible, while creating commercially available optimized software and enterprise tools. In the spirit of community membership, we want to accomplish our goals through true collaborative participation, by design. Our native Python approach (e.g., the pandas APIs and using Numba) gives us a mandate to continuously improve open source software instead of trying to replace it.

As always, any thoughts or feedback on our approach are welcome!


Background: Open Platforms for Data

A platform is generally called “open” when it is compatible and interoperable with various technologies, tools, products, and platforms. This includes supporting standard protocols, common file formats, storage systems, workflow orchestration tools and so on. For example, supporting the Parquet file format allows data to move across many tools since Parquet has become a de facto standard. Furthermore, supporting standard APIs such as ANSI SQL and pandas allows workloads to be ported to other platforms more easily. Similarly, a workload should run anywhere you choose so you can have maximum control and flexibility.

In addition, “open platform” can also mean that the platform has built-in open source components, which gives customers confidence that they are leveraging the latest open source innovation. These two interpretations are related, since open source can also help with compatibility.

On the other hand, proprietary and/or closed platforms can be “walled gardens” that often allow vendors to customize and optimize their solutions more freely since they have full control over all components. In addition, some users may want to stick to a single vendor to simplify their operations.

Leveraging open source software brings the latest innovations to a platform, and often makes the platform more likely to be compatible with others. For example, since Parquet has open source implementations, other tools and platforms can support it with much less effort. In addition, if multiple platforms use the same open source Parquet reference implementation, the data exchange between them is more likely to be seamless. In essence, open source software components act as key connectors in the data ecosystem… even for proprietary software.

Given the intricate and constantly evolving nature of the data landscape, we believe that effective modern data processing platforms have to be open. No single company can solve all data problems effectively by itself, no matter the amount of resources it may have. You, the developer, should be in control of what components you use for each of your data applications.


The “Magic” of Community-Driven Open Source

We at Bodo believe that community-driven open source is critical for data infrastructure. The domain is complex and, as we mentioned earlier, no single company can build all the components effectively. Developers from different backgrounds need to work together on mutual problems in a transparent and collaborative manner without organizational boundaries. Community-driven open source produces a kind of innovation “magic” that is hard to replicate elsewhere.

The key word here is community-driven. Due to the popularity of open source, many companies want to be associated with open source software. However, the magic does not happen by just uploading a piece of code on GitHub. A diverse and vibrant community is critical for achieving the benefits of open source.

A community-driven open source project has these components:

Publicly available source code with permissive licenses
Open and transparent development
Open and transparent decision making
Multi-organizational engagement

Corporate-sponsored open source projects usually fulfill the first requirement but not the others. Communication about the project largely happens within the company’s own internal systems (e.g., corporate Slack), and the decision makers are appointed by the company. Furthermore, only the company can practically develop, support and distribute the software, which is not conducive to the openness of software platforms.

This doesn’t mean that corporate-sponsored open source software is not useful. It allows new software to get off the ground, and it lets users read the code, learn from it, and contribute bug fixes. But it definitely doesn’t have all the benefits of community-driven projects. As a sign of a successful community-driven open source project, I usually look to see whether multiple organizations develop and distribute the software.


Bodo’s Relationship with Open Source

Building an open and capable Bodo platform would not be possible without a collection of community-driven open source software. Bodo currently uses innovations from Numba, MPI, IPython Parallel, Arrow, pandas, NumPy, Scikit-learn, Calcite, and many others. We directly depend on these projects, and are committed to contributing to their continued success.

Often, when commercial vendors have such broad dependencies on open source, they work to own and control the projects to reduce risk. This may involve either gaining control over the upstream project or even forking it. However, this ultimately hurts the community, erodes trust, and diminishes the long-term independent success of the project.

In contrast, Bodo is committed to working collaboratively with these open source projects, including identifying opportunities and deficiencies that we can help address together. What’s good for the project will be good for Bodo.

We believe that alignment of goals and transparent communication among community members (including us) are critical for success in open source software development. The Bodo platform is very well aligned with open source communities by design – we support native open source APIs (e.g., pandas) instead of trying to replace open source or artificially augment it (e.g., “pandas-like”).


Example: Bodo and IPython Parallel

A good example of our open source model is our engagement with the IPython Parallel project. We started using IPython Parallel to build our parallel notebook support since IPython Parallel supports interactive MPI process management. However, we discovered several gaps and bugs with our use cases – plus, the project seemed relatively inactive. Given that notebook support is critical in our platform and our customers had a lot of urgent issues, our response could have been to fork or create something new from scratch. But instead we chose the community engagement route.

IPython Parallel is developed and maintained by Min Ragan-Kelley, who has been one of the main contributors to Jupyter projects for many years. But he didn’t have the bandwidth for active development of IPython Parallel. So we offered to sponsor the project and worked through the mechanics and paperwork details, but the main hurdle was building trust.

Min understandably needed to make sure we had a long-term interest in the project, that we appreciated his vision, and that we would be good collaborators. We also needed to make sure we were being heard, and that our needs aligned with IPython Parallel’s general direction. The conversations took a while, but our directions converged and we got started. A significant factor behind this convergence was Min's genuine interest in understanding real-world use cases and gaining user feedback. He also had a desire to share expertise and expand the maintainer base to improve the project's health.

Throughout the effort, we provided Min with feedback on use cases, designs, code reviews, and code contributions. In some areas, we initially had different preferences but trusted Min’s vision, which is not easy to do for companies since solving customer problems quickly and efficiently is paramount. But the decisions proved to be appropriate over time. Overall, I believe this was a major success since IPython Parallel is significantly improved for all users including us. We still need to work on adding regular maintainers to the project, which we plan to do soon as Bodo grows.

A key lesson I personally learned in this process was that trusting the vision of open source leaders is key for vendors and community users alike. On the other hand, open source leaders should always try to understand real-world use cases and seek feedback from users, and be willing to share expertise and expand the maintainer base.


Example: Bodo and Numba

Numba is a compiler toolkit for building high performance compilers and libraries in Python, and is key infrastructure for Bodo’s compute engine. Bodo’s user-facing JIT decorator and workflow are built on top of Numba’s JIT implementation. In addition, Bodo uses Numba to interface Python and LLVM, and reuses Numba’s implementations of Python/NumPy basics. Therefore, we heavily rely on Numba and want to ensure the project is successful in the long term. Bodo engineers regularly contribute to Numba, but the current rate of progress sometimes doesn’t meet our needs. So we work closely with the Numba team to see how and where we can help, while always respecting the project’s independence.

I’ll also offer a bit of history. I started contributing to Numba in 2017 as part of a 3-person Intel Labs team. We were looking for a way to port our compiler optimization and parallelization technologies from Julia to Python due to the growing popularity of Python. Numba was an excellent compiler infrastructure, allowing us to build optimization and threading-level parallelism support for Python (“@njit(parallel=True)” in Numba). The Python/Numba infrastructure was much easier to use than the Julia one, leading to rapid development. As a result, parallelism support in Numba has become a core feature that seems to be widely used in the Python community. Through this and other efforts, we have been an active part of the Numba community for years, and believe that there is an opportunity to help take Numba to the next level.
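
For readers who have not used it, here is a minimal sketch of what “@njit(parallel=True)” looks like in practice. The function name and input are made up for illustration, and the try/except fallback is my own addition so that the snippet still runs (serially, in plain Python) on a machine without Numba installed.

```python
try:
    from numba import njit, prange
except ImportError:
    # Assumption for this sketch only: fall back to plain Python so the
    # example still runs (without any speedup) when Numba is missing.
    prange = range

    def njit(*args, **kwargs):
        # No-op stand-in for numba.njit; supports both @njit and @njit(...).
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit(parallel=True)
def sum_of_squares(n):
    # prange tells Numba the iterations are independent, so it can split
    # them across threads; `total` is handled as a parallel reduction.
    total = 0
    for i in prange(n):
        total += i * i
    return total


print(sum_of_squares(1000))
```

The only changes relative to an ordinary Python loop are the decorator and the use of prange instead of range, which is a large part of why this feature is easy to adopt.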

Numba is enormously successful as a core infrastructure package of the Python ecosystem, with over 7 million monthly downloads, but it faces a lot of challenges. Due to the complex nature of compilers, especially for Python, Numba users are not necessarily able to contribute, which breaks the “users solving their mutual problems” assumption. Currently, Numba maintainers are full-time employees sponsored by Anaconda and other companies. The reason is that, from a practical perspective, developing the Numba core needs several months of ramp-up and then full-time focus. Furthermore, stability is a major focus because Numba is a critical dependency of the Python ecosystem, and that focus slows down development and contributions substantially. We have been working with Quansight Labs via OpenTeams, which specializes in scaling interactions like the one we had with Min to all of the maintainers of the PyData stack. Both directly and through Quansight Labs, we have been talking to the Numba team to identify an approach that addresses these concerns while helping accelerate Numba’s overall evolution.

We are still in the process of communicating our own interests, while understanding the Numba team’s vision. Establishing trust and effective communication takes a while, but I believe we are both on the right track. We want to make sure Numba supports Python basics more comprehensively, provides a better compiler infrastructure, provides better error reporting, and ultimately has a high-quality code base. As I said before, this is a matter of finding a collaboration path that works best for open source leaders and can benefit the community as a whole.


What about open sourcing the Bodo compiler?

Building a successful company around open source software has been, and continues to be, a hot discussion topic. The general consensus is that open sourcing an entire code base and then relying solely on support revenue is a challenging business model. So the question becomes what portions – if any – should be closed- versus open-source? And, are there business cases where open source simply makes no sense for Bodo?

One popular approach is the “open core” model, where the core functionality of the software is open, but some features are only available in a paid enterprise version. The popularity of the open source project brings credibility for the company to investors and customers. But in our view, the open core model has some drawbacks in general, and with the Bodo case in particular.

The open core model allows users to contribute and use the software for free, but the company usually owns and controls the open source project. This means that the project is not always driven by a community of users, and the broad open source community “magic” may not happen. In addition, large enterprises with enough engineering resources can deploy and even resell the software for free, while smaller organizations have to shoulder the burden of funding its development. Moreover, the vendor is always looking to build differentiated “premium” features to make the business model work, which can distract it from the core software project. Overall, making the open core model work with a thriving community is not a trivial task.

Thus, pursuing the open core model has several issues in the Bodo compiler case:

First, Bodo’s basic functionality itself is already available in pandas and other open source packages. Open sourcing Bodo’s optimization software would essentially require the company to build other products to differentiate itself.
Second, Bodo is built for simplicity, which makes it easy for enterprises to deploy or for hyperscalers to resell without funds coming back to Bodo. This makes it harder to fund further development.
Third, the Bodo compiler is even more complicated than Numba and we don’t expect significant contributions from users. Various early prototypes in Julia and Python were open source and received essentially no contributions. Plus, the optimization and parallelization piece of Numba has been available for 5 years with wide use, but has received no contributions beyond maintainers.
Fourth, the infrastructure components that benefit from open source like Numba, MPI, pandas, etc. are already open source.

These factors make a compelling case for the Bodo compiler to remain a commercial offering for the near future, allowing us to continue innovation and development.


Summary

We believe that supporting external community-driven open source development projects can and will benefit Bodo – as well as the whole community – in the long run. While many vendors try to own and control open source projects, we believe that being true collaborative community members and contributors is critical to the technologies that underpin our platform.

Our approach, therefore, is to continue to deliver the commercial value-add of the Bodo platform while actively supporting the vision and efforts of our open source partners.

Our open source roots have shown that we can navigate this path. Our powerful and differentiated technologies allow us to have a viable business model, yet be committed, vested community members in a number of open source projects.

In the meantime, we are working on various open source initiatives to improve pandas, Numba, Data APIs and others that we will announce soon. Stay tuned!
