2025 Wrapped: Bodo’s Year in Review

Date: December 18, 2025
Author: Bodo Engineering Team

2025 was a big year at Bodo. We delivered major performance improvements, richer APIs, tighter lakehouse integration, and practical support for AI workflows — and along the way, we launched entirely new products too. 

Here’s a look back at what we shipped, shared, and learned this year. 

Bodo Went Open Source!

Bodo has always been shaped by the open-source ecosystem, so this wasn’t a philosophical pivot as much as a natural step forward. We wanted to give back to the community that has inspired us from the beginning, ensuring everyone can benefit from—and contribute to—what we’ve built.

At its core, our compute engine transforms standard Python code into efficient, parallelized execution without requiring changes to the codebase, making it possible to achieve performance at scale with no HPC expertise required. We’re excited to see more teams try it, push it in new directions, and help shape what comes next!

🔗 Bodo GitHub

🔗 Bodo Open Source Announcement [blog]

🔗 Bodo Slack Community

Bodo DataFrames: A Drop-In Replacement for Pandas

In May, we launched Bodo DataFrames with the goal of combining the simplicity and usability of Pandas with database-grade optimization and HPC-class performance.

It’s a drop-in replacement for Pandas that uses compiler-driven execution instead of task orchestration. Under the hood, it uses Bodo Engine’s JIT compilation and a database-grade query optimizer to turn DataFrame pipelines into efficient, parallel execution plans. 

In practice, that means:

  • No rewrites: Just swap with import bodo.pandas as pd
  • Vectorized execution handles billion-row workloads without OOMs
  • 10×–100× faster on production workloads
  • Parallelize UDFs, ML inference, and ETL using the same pandas format
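To make the drop-in swap concrete, here is a minimal sketch of a typical pipeline. It is shown with stock pandas so the snippet runs anywhere; per the post, the only change needed under Bodo DataFrames is the import line.

```python
# The drop-in swap described above: the only change to an existing
# pipeline is the import line. With Bodo installed you would write
#   import bodo.pandas as pd
# instead, and the same code compiles to parallel execution.
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "B", "A", "B", "A"],
    "sales": [100, 200, 150, 250, 300],
})

# A typical pipeline: filter, group, aggregate -- unchanged under Bodo.
totals = (
    df[df["sales"] > 100]
    .groupby("store")["sales"]
    .sum()
)
print(totals.to_dict())  # {'A': 450, 'B': 450}
```

The point of the design is exactly this: the pipeline code stays idiomatic pandas, and the compiler decides how to parallelize it.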

Throughout the year, we also:

  • Expanded DataFrame and Series operations
  • Sped up planning and compilation for large workloads
  • Added Iceberg read support in the Pandas API (pd.read_iceberg())
  • Enhanced I/O connectors (Parquet, object storage workflows)
  • Made execution predictable across multi-node clusters
  • Added support for timezone-aware types
  • Enhanced groupby operations and Series statistical functions
  • Added column pruning and CTE optimization, yielding up to 4× improvements on some queries
  • Enabled scalable vector workflows with Amazon S3 Vectors, allowing teams to generate, store, and process embeddings at massive scale directly from Pandas-style code

🔗 Bodo DataFrames Launch Announcement [blog]

🔗 Scaling Amazon S3 Vectors Workflows Effortlessly in Python with Bodo [blog]

🔗 pip install Bodo DataFrames

Enhanced Iceberg Support

Many teams want Pandas-level expressiveness on top of Iceberg tables, but in practice that often means slow scans, heavy metadata overhead, or awkward handoffs to other engines. So this year, we invested in closing that gap:

  • Native integration with PyIceberg 0.10 — scalable Iceberg reads with Pandas APIs preserved
  • Compiler-level optimizations for Iceberg scan planning
  • Reduced metadata and I/O overhead for large table access
  • Better coordination of planning & object storage — closer to real-world workloads
  • Iceberg time travel hooks in readers and write support via DataFrame.to_iceberg()
  • Direct reads and writes to Iceberg tables in Amazon S3 Tables using pd.read_iceberg() and DataFrame.to_iceberg(), scaling transparently from a laptop to multi-node clusters—no JVM required
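The round trip described above can be sketched as follows. The table identifier and S3 location are placeholders for illustration; running the commented lines for real requires `import bodo.pandas as pd` plus an Iceberg catalog (such as S3 Tables) configured in your environment, so they are shown but not executed here.

```python
# Sketch of the Iceberg round trip via pd.read_iceberg() /
# DataFrame.to_iceberg(). Table name and bucket are placeholders.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [19.99, 5.00, 42.50],
})

# Write to an Iceberg table, then read it back (placeholder identifiers,
# requires a configured catalog):
# orders.to_iceberg("sales.orders", location="s3://my-bucket/warehouse")
# orders = pd.read_iceberg("sales.orders")

revenue = orders["amount"].sum()
print(round(revenue, 2))  # 67.49
```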

🔗 Bodo Native Integration in PyIceberg 0.10: Bringing Scalability to PyIceberg with Pandas APIs [blog]

Bodo DataFrames AI Toolkit

The AI toolkit extends Pandas and Series APIs to support LLM and embedding workloads directly, using the same compiler-driven parallel execution model as the rest of Bodo DataFrames.

Users can:

  • Run LLM inference and text embedding generation using familiar Series operations
  • Scale those workloads across cores or clusters without changing code
  • Batch, parallelize, and stream execution automatically via the compiler
  • Combine analytics, feature preparation, and AI inference in a single pipeline
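The pattern behind these bullets is that model calls become just another Series operation. The sketch below uses a deterministic stub in place of a real embedding model so it is self-contained; the stub function name and vector size are illustrative, not part of the toolkit’s API. Under Bodo DataFrames (`import bodo.pandas as pd`), the same per-row operation would be batched and parallelized by the compiler.

```python
# Treat embedding/LLM calls as an ordinary Series operation.
# A deterministic stub stands in for a real model call.
import hashlib
import pandas as pd

def embed(text: str) -> list[float]:
    """Stub embedding: derive a tiny fixed-size vector from a hash.
    In a real pipeline this would call an embedding model."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:4]]

reviews = pd.Series(["great product", "arrived late", "would buy again"])
vectors = reviews.map(embed)  # one embedding per row

print(len(vectors), len(vectors[0]))  # 3 4
```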

Under the hood, these operations benefit from the same infrastructure as the rest of Bodo DataFrames: Bodo engine’s JIT compilation, query optimization, MPI-backed distributed execution, and spill-to-disk support. 

This means teams can treat LLMs as just another step in a Pandas workflow, instead of a separate system bolted on at the edges. Analytics, embeddings, and inference all run in the same execution substrate, making pipelines easier to build, easier to debug, and easier to scale.

We demonstrated this end-to-end in a unified Iceberg-to-LoRA pipeline, where Iceberg tables serve as the system of record, Pandas-style transformations handle filtering and feature preparation, and Bodo scales the entire workflow through to LoRA-based fine-tuning. The same execution model powers analytics, training data preparation, and fine-tuning — no framework switching required.

🔗 From Iceberg to LoRA: A Unified LLM Fine Tuning Pipeline [blog]

🔗 Using LLMs at Scale: A Pandas-Native Approach [blog]


PyDough: A New Approach to Text-to-Analytics

We also introduced PyDough Community Edition, driven by a problem we kept seeing as natural language interfaces became more common in analytics: generating SQL is easy; generating correct, secure, trustworthy analytics is not.

At the center of PyDough is a formal but friendly, Python-native domain-specific language (DSL) built specifically for LLM-driven analytics. This DSL acts as a bridge between natural language and executable logic, making intent explicit rather than implicit.

  • Designed for LLMs: a bounded syntax reduces ambiguity, hallucination, and token bloat
  • Readable by people: clean, logical code that anyone can inspect, test, and reason about
  • Safe by default: embedded metadata enforces valid joins, filters, and relationships
  • Grounded execution: every operation ties back to known semantics and rules

PyDough is grounded in a knowledge graph of business semantics that resolves ambiguity, prevents invalid joins and overcounts, and enforces structure that mirrors how businesses actually reason about their data. And to further improve accuracy, PyDough also uses an AI ensemble to generate, review, and challenge proposed logic.

Early benchmark results have been encouraging, and we’re excited to share more as PyDough continues to mature!

🔗 Introducing PyDough CE: A Simpler, Safer Path for Natural Language Analytics [blog]

Benchmarking: Performance and Tradeoffs

We published benchmarks and comparisons not as a humble brag, but as a way to understand where different approaches shine. We focused on realistic workloads and the kinds of tradeoffs engineers actually have to make.

Community Engagement

Connecting with the community was a big part of 2025 for us.

We spoke at PyData Global, PyData Pittsburgh, and an Open Source Architect Community event about how we think about scaling Pandas in practice with Bodo DataFrames. We also presented at the Iceberg Summit, diving into Iceberg I/O performance and the optimizations we’ve built into the Bodo compute engine. And we spent time exhibiting with our friends at OpenTeams at PyTorch Con.

Beyond conferences, many of the most impactful discussions happened online. Our Slack community grew steadily this year, and the questions, feedback, and real-world use cases shared there have been invaluable.

Thank you to everyone who came to a talk, stopped by a booth, joined Slack, opened an issue, or asked a great question. Those conversations directly influence what we build next — and they’re a big part of what makes this work fun.

🔗 Iceberg Summit: Iceberg I/O Optimizations in Compute Engines [session recording]

🔗 OSA Community Event: Bodo DataFrames [session recording]

🔗 Bodo Slack Community

Looking Ahead

We’re incredibly proud of what we’ve built this year — and even more excited about what’s coming next. We’ve got big plans for 2026: CPU-GPU hybrid execution, advanced query optimizations, broader Pandas coverage, a new interactive BodoSQL, and improved I/O and Iceberg integration. Stay tuned!

Ready to see Bodo in action?
Schedule a demo with a Bodo expert

Let’s go