Using LangChain Agents + Bodo DataFrames on Large Datasets

October 21, 2025

Scott Routledge

AI agents are becoming powerful tools for automating data-driven workflows, yet ensuring they can analyze large datasets efficiently and reliably is an open challenge.

The problem is scale: most agents today lean on popular libraries like Pandas for analysis, which works fine for small to medium datasets but was never designed for billion-row, production-grade workloads. If you ask an agent to crunch through NYC Taxi’s billion-trip dataset using Pandas, you’ll hit hour-long runtimes or memory errors. The “intelligence” of the agent is irrelevant if the infrastructure can’t sustain the workload.

Bodo DataFrames was built to close this gap. It brings distributed execution, streaming computation, and HPC-grade performance to standard Pandas code without requiring refactoring. Swap in import bodo.pandas as pd, and the same agent logic that timed out on your laptop can now easily churn through billions of rows and scale seamlessly on a larger cluster.

In this post, we’ll show how to use langchain_bodo—Bodo’s integration with LangChain, a popular framework that lets you easily build AI agents—as well as the NYC Taxi dataset to create an agent capable of answering complex questions on real world data, such as estimating the average fare between Newark airport and Manhattan.

‍

Getting Started

First, let’s start by installing the packages we need:

pip install -U langchain-bodo langchain-openai

Since we are using OpenAI in this example, you will also need to set the environment variable OPENAI_API_KEY.

Next, open a new notebook or file and import the required packages:

import bodo.pandas as pd
from langchain.agents.agent_types import AgentType
from langchain_bodo import create_bodo_dataframes_agent
from langchain_openai import ChatOpenAI

‍

Using Bodo DataFrames on NYC Taxi Data

Our agent will need to have either a single DataFrame or list of DataFrames. So, before we create an agent, let's use Bodo to read NYC taxi data as a DataFrame. You can optionally download the dataset from the official site, which is about 25GiB in compressed, Parquet format. For convenience, we also provide the data in a publicly available S3 bucket.

The following code snippet creates a BodoDataFrame from taxi data (stored in Parquet format on S3) and displays the column names. Note that since the `pd.read_parquet` API is lazy, no data is actually loaded at this point:

path_to_taxi = "s3://bodo-example-data/nyc-taxi/fhvhv_tripdata/"‍

taxi_df = pd.read_parquet(    
	path_to_taxi,
)‍

taxi_df.columns

‍

Output:

Index(['hvfhs_license_num', 'dispatching_base_num', 'originating_base_num',

'request_datetime', 'on_scene_datetime', 'pickup_datetime',

'dropoff_datetime', 'PULocationID', 'DOLocationID', 'trip_miles',

'trip_time', 'base_passenger_fare', 'tolls', 'bcf', 'sales_tax',

'congestion_surcharge', 'airport_fee', 'tips', 'driver_pay',

'shared_request_flag', 'shared_match_flag', 'access_a_ride_flag',

'wav_request_flag', 'wav_match_flag'],

dtype='object')

When querying our agent, the head of this DataFrame will be embedded into the prompt as a markdown table so that the agent can infer things like column names and types. Let’s also display the head of the DataFrame to get a sense for what our agent “sees”. To simplify the output, let’s select a subset of the columns from the list above that might be relevant:

taxi_cols = ["PULocationID", "DOLocationID", "base_passenger_fare", "trip_time", "tips", "pickup_datetime"]

taxi_df = taxi_df[taxi_cols]

taxi_df.head()

‍

Output:

| PULocationID | DOLocationID | base_passenger_fare | trip_time | tips | pickup_datetime        |

|--------------|--------------|---------------------|-----------|------|------------------------|

| 245          | 251          | 9.35                | 579       | 0.0  | 2019-02-01 00:05:18    |

| 216          | 197          | 7.91                | 490       | 2.0  | 2019-02-01 00:41:29    |

| 261          | 234          | 44.96               | 2159      | 0.0  | 2019-02-01 00:51:34    |

| 87           | 87           | 7.19                | 179       | 3.0  | 2019-02-01 00:03:51    |

| 87           | 198          | 24.25               | 1799      | 4.0  | 2019-02-01 00:09:44    |

Notice that, in this dataset, there are only IDs for the pickup and drop-off location, and the agent will not inherently know which IDs correspond to "Newark Airport" or "Lower East Side".

To fix this, let's provide additional context with another DataFrame. A table mapping location IDs to their zone name can be downloaded from here under "Taxi Zone Maps and Lookup Tables" > "Taxi Zone Lookup Table (CSV)". We can read the CSV file using Bodo and inspect the head:

zone_map_df = pd.read_csv("taxi_zone_lookup.csv")‍

zone_map_df.head()

‍

Output:

| LocationID  | Borough       | Zone                     | service_zone  |

|-------------|---------------|--------------------------|---------------|

| 1           | EWR           | Newark Airport           | EWR           |

| 2           | Queens        | Jamaica Bay              | Boro Zone     |

| 3           | Bronx         | Allerton/Pelham Gardens  | Boro Zone     |

| 4           | Manhattan     | Alphabet City            | Yellow Zone   |

| 5           | Staten Island | Arden Heights            | Boro Zone     |

Now that we have all the data an agent would need to answer our question, let's set up a new Bodo DataFrames agent using the `create_bodo_dataframes_agent` function:

agent = create_bodo_dataframes_agent(
	ChatOpenAI(temperature=0, model="gpt-4o"),
    [taxi_df,
    zone_map_df],    
    verbose=True,    
    agent_type=AgentType.OPENAI_FUNCTIONS,    
    allow_dangerous_code=True
    )

Since this agent will be generating and executing arbitrary code in its own Python REPL, you will have to opt-in to this feature using `allow_dangerous_code=True`. For more details about specific parameters to this function and examples, see our tool documentation page.

Finally, we are ready to ask our question:

agent.invoke("How long does it take on average to go from Newark Airport to the Lower East Side?")

Which might provide intermediate reasoning steps and code that look like this (simplified for clarity):

>>> newark_airport_id = df2[df2['Zone'] == 'Newark Airport']['LocationID'].values[0]
>>> lower_east_side_id = df2[df2['Zone'] == 'Lower East Side']['LocationID'].values[0]
>>> newark_airport_id, lower_east_side_id
 
(1, 148)
 
Now, let's filter df1 for trips from ‘Newark Airport’ to ‘Lower East Side’ and calculate the average trip time.
 
>>> newark_to_les_trips = df1[(df1['PULocationID'] == newark_airport_id) & (df1['DOLocationID'] == lower_east_side_id)]
>>> average_trip_time = newark_to_les_trips['trip_time'].mean()
>>> average_trip_time
 
2651.775

Final output:

{'input': 'How long does it take on average to go from Newark Airport to the Lower East Side?',

 'output': 'On average, it takes approximately 2652 seconds (or about 44.2 minutes) to travel from Newark Airport to the Lower East Side.'}

This entire reasoning chain, which required crunching over one billion taxi trips, took about 4.5 minutes to complete on a 2024 MacBook Pro using 10 parallel workers.

‍

Wrapping Up

Our LLM agent was able to use Bodo DataFrames effectively even without prior exposure to its documentation or examples thanks to Bodo’s strong compatibility with the Pandas API. This example highlights the advantages of Bodo’s execution model: its MPI-based, parallel backend brings HPC-grade performance and scalability to familiar Python workflows while streaming data through operators prevents out of memory errors. With these features, agents can move beyond small in-memory samples and work directly with real, billion-row datasets. To learn more about Bodo's integrations with LangChain, check out our integrations page.

And even if you are not ready to rely on agents for all your data processing needs just yet, Bodo DataFrames can still improve the performance and scalability of your Python workloads. To get started using Bodo yourself: