
Scaling Amazon S3 Vectors Workflows Effortlessly in Python with Bodo

Date: August 11, 2025
Author: Ehsan Totoni

Amazon S3 Vectors is a new service that simplifies storing and querying vector embeddings in a cost-efficient manner. Vectors are a critical part of many AI and semantic search workloads including Retrieval Augmented Generation (RAG) use cases. RAG enhances the accuracy and relevance of large language models (LLMs) by incorporating information from external knowledge sources.

However, using S3 Vectors at scale currently requires manually calling Python SDKs in parallel or using complex distributed frameworks, both of which can be cumbersome. Bodo addresses this challenge with Pandas-compatible APIs that are auto-parallelized, letting you write plain Pandas code for end-to-end data processing workflows that use S3 Vectors. This substantially simplifies storing and querying vectors at scale.

What is Bodo?

Bodo is an open-source, high-performance DataFrame library for Python that is a drop-in replacement for Pandas. Bodo simplifies accelerating and scaling Python workloads from laptops to clusters without code rewrites. Under the hood, Bodo relies on MPI-based high-performance computing (HPC) technology and an innovative auto-parallelizing just-in-time (JIT) compiler—making it both easier to use and often orders of magnitude faster than tools like Spark or Dask.

S3 Vectors in Python with Pandas APIs

Bodo provides Pandas-compatible APIs for storing and querying S3 Vectors and scales the Pandas code automatically end-to-end by just replacing `import pandas as pd` with `import bodo.pandas as pd`. This means that you can write AI workloads using RAG with large datasets without needing to migrate data to another system or configure distributed frameworks manually.

To start using S3 Vectors with Bodo, create an S3 vector bucket and a vector index in the Amazon S3 console (follow the first two steps of the S3 Vectors tutorial). Then, make sure you have active AWS credentials (e.g., `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` set as environment variables, or credentials set in `~/.aws/config`) and that the necessary dependencies are installed and upgraded (see Getting Started below). Also, make sure the user associated with your credentials has permissions for S3 Vectors (see S3 docs here).
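For example, credentials can live in a local AWS config file; a minimal sketch with placeholder values (the profile name and key values here are illustrative, not real):

```ini
[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>
region = us-east-2
```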

To store vector data, call the `to_s3_vectors` method of a Bodo DataFrame. For example:

df.to_s3_vectors(
    vector_bucket_name="my-test-vector",
    index_name="my-test-ind",
    region="us-east-2",
)

The DataFrame should have `key`, `data`, and `metadata` columns for storage. The `key` column should hold strings, the `data` column should hold a list of floats (the vector) in each row, and the `metadata` column should be a struct type.
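As a sketch of that layout, here is a plain-Python mock of the rows such a DataFrame would hold (the keys, vector values, and metadata below are made up for illustration; real embedding vectors are much longer):

```python
# Minimal sketch of the row layout to_s3_vectors expects
# (column names from the post; the values are invented):
rows = [
    {
        "key": "Star Wars",              # string key
        "data": [0.12, -0.05, 0.33],     # vector: list of floats
        "metadata": {"genre": "scifi"},  # struct-typed metadata
    },
    {
        "key": "Finding Nemo",
        "data": [0.07, 0.41, -0.18],
        "metadata": {"genre": "family"},
    },
]

# Basic schema checks mirroring the column requirements above.
for row in rows:
    assert isinstance(row["key"], str)
    assert all(isinstance(x, float) for x in row["data"])
    assert isinstance(row["metadata"], dict)
```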

To query a vector index, use the `Series.ai.query_s3_vectors` method to find matching vector keys. For example:

out_df = df.data.ai.query_s3_vectors(
    vector_bucket_name="my-test-vector",
    index_name="my-test-ind",
    region="us-east-2",
    topk=3,
    filter={"genre": "scifi"},
    return_distance=True,
    return_metadata=True,
)

The output DataFrame contains a list of matching keys in each row, plus distances and metadata columns when requested.
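For instance, one row of such an output could be post-processed like this (the row contents below are invented for illustration, not real query results):

```python
# Hypothetical single result row, shaped like the query output described
# above: parallel lists of keys, distances, and metadata per input vector.
row = {
    "keys": ["Star Wars", "Jurassic Park"],
    "distances": [0.62, 0.74],
    "metadata": [{"genre": "scifi"}, {"genre": "scifi"}],
}

# Pair each matched key with its distance and metadata for downstream use,
# then pick the closest match (smallest distance).
matches = list(zip(row["keys"], row["distances"], row["metadata"]))
best_key, best_dist, best_meta = min(matches, key=lambda m: m[1])
```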

Getting Started

This feature is available in Bodo pip and Conda releases starting from 2025.8. To try it out on macOS, Linux or Windows, install Bodo with:

pip install bodo

For more details, see the Bodo documentation and our GitHub repository.

Using Bodo’s high-performance DataFrames with JIT compiler and MPI-based backends, you can now scale your AI workloads without complicated rewrites. Give Bodo a try and share your feedback on GitHub or our Slack community!

End-to-End Example

This end-to-end example demonstrates creating embeddings, writing them to S3 Vectors, and querying S3 Vectors all in parallel using Pandas-compatible APIs.

Make sure to create an S3 vector bucket and vector index following the first two steps of the S3 Vectors tutorial. The dimension should be 1536 for the OpenAI model used here, and the distance metric should be Cosine. Also, replace the OpenAI API key, vector bucket name, vector index name, and region in the code with your own values.

import bodo.pandas as pd

# Create embeddings using OpenAI API
texts = [
    "Star Wars: A farm boy joins rebels to fight an evil empire in space",
    "Jurassic Park: Scientists create dinosaurs in a theme park that goes wrong",
    "Finding Nemo: A father fish searches the ocean to find his lost son",
]
keys = ["Star Wars", "Jurassic Park", "Finding Nemo"]
genres = ["scifi", "scifi", "family"]


df = pd.DataFrame({"key": keys, "text": texts, "genre": genres})
df["data"] = df.text.ai.embed(model="text-embedding-3-small", api_key="my_api_key")


# Write embeddings into vector index with metadata.
df["metadata"] = df.apply(lambda row: {"source_text": row.text, "genre": row.genre}, axis=1)
df.to_s3_vectors(
    vector_bucket_name="my-test-vector",
    index_name="test-index",
    region="us-east-2",
)


# Query the vector index (with filtering)
input_text = "adventures in space"
df = pd.DataFrame({"text": [input_text]})
df["data"] = df.text.ai.embed(model="text-embedding-3-small", api_key="my_api_key")
out = df.data.ai.query_s3_vectors(
    vector_bucket_name="my-test-vector",
    index_name="test-index",
    region="us-east-2",
    topk=3,
    filter={"genre": "scifi"},
    return_distance=True,
    return_metadata=True,
)
print(out)

                            keys              distances                                           metadata
0  ['Star Wars' 'Jurassic Park']  [0.6218842 0.7364397]  ["{'genre': 'scifi', 'source_text': 'Star Wars...
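The distances in this output come from the index's Cosine metric. As a quick sanity check of what cosine distance means, here is a toy computation on tiny 2-D vectors (not the 1536-dimensional embeddings the real index stores):

```python
import math

# Cosine distance = 1 - cosine similarity; identical directions give ~0,
# orthogonal directions give 1.
def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

identical = cosine_distance([1.0, 2.0], [1.0, 2.0])   # ~0.0: same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # 1.0: unrelated
```

Smaller distances therefore mean closer matches, which is why "Star Wars" ranks ahead of "Jurassic Park" for the space-adventure query above.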
