PyTorch Conference 2025: Recap and Key Takeaways

Date: November 4, 2025
Author: Scott Routledge

Last week, some of us had the chance to attend PyTorch Conference 2025, joining our partners at OpenTeams to represent Bodo on the expo floor. It was an energizing mix of deep technical discussions and hands-on demos, as well as an up-close look at how the PyTorch ecosystem is rapidly scaling to meet the demands of modern AI workloads.

Everywhere we looked, the focus was on compute. Talks and demos revolved around squeezing more throughput from GPUs and specialized inference chips, and orchestrating workloads across thousands of nodes. It is exciting for us to see so much focus on high-performance computing (HPC) techniques, including optimizing compilers and efficient parallel communication libraries.

We wanted to share some of our key takeaways from the conference, from distributed training to efficient inference, and to highlight one aspect that is absolutely crucial but received less attention: data infrastructure.

Inference Was Everywhere

Inference dominated PyTorchCon this year. With demand for serving models outpacing available compute, vendors showcased managed GPU clusters, custom inference chips, and optimized software stacks built around projects like vLLM and Transformers. 

Despite this emphasis on inference, one topic was notably absent: data preparation. Most platforms assume input data is already available in JSONL or another model-ready format. For organizations running large-batch inference on real-world datasets, that assumption leaves a major gap. Without robust data infrastructure, even the fastest inference pipeline ends up waiting for input.
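To make that gap concrete, here is a small, hypothetical Pandas sketch of the kind of preparation step most inference platforms take for granted: turning a raw Parquet table into a JSONL request file for a batch-inference job. The file names, column names, and prompt template are placeholders for illustration, not any particular vendor's format.

    # Hypothetical sketch: converting raw tabular data into JSONL prompts
    # for batch inference. File names, columns, and the prompt template
    # are made up for illustration.
    import json

    import pandas as pd

    # Load raw records (e.g., support tickets) from Parquet.
    df = pd.read_parquet("tickets.parquet")

    # Typical cleanup before the data is "model-ready".
    df = df.dropna(subset=["subject", "body"])
    df["body"] = df["body"].str.strip()

    # Emit one JSON object per row in the format the inference service expects.
    with open("requests.jsonl", "w") as f:
        for row in df.itertuples(index=False):
            record = {
                "custom_id": str(row.ticket_id),
                "prompt": f"Summarize this ticket:\n{row.subject}\n{row.body}",
            }
            f.write(json.dumps(record) + "\n")

Even this toy version hints at the problem: at millions or billions of rows, the cleaning and formatting steps become a large, highly parallelizable compute job of their own, and the inference cluster sits waiting until they finish.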

The Distributed Training Ecosystem Is Developing Rapidly

Distributed training also featured prominently in many talks and sponsor booths. The surrounding ecosystem is evolving quickly, making large-scale PyTorch training more accessible than ever. Announcements included:

  • Monarch, a distributed execution engine that lets users spin up parallel actors across clusters directly from notebooks—similar in spirit to Bodo’s own Spawn model for parallel data processing.
  • TorchComm, a new API for scalable communication collectives such as allreduce, which are essential components of any distributed training workflow. These collectives can scale to hundreds of thousands of GPUs and include features such as fault tolerance and asynchronous communication support.
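For readers less familiar with collectives, here is a minimal sketch of what an allreduce does, written against today's torch.distributed API rather than the newly announced TorchComm interface (which we are not showing here). Each process contributes a local tensor, and every process receives the elementwise sum.

    # Minimal allreduce sketch using torch.distributed (gloo backend, CPU-only),
    # not the new TorchComm API. Each process contributes its rank; after the
    # collective, every process holds the sum of all ranks.
    import os

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp


    def worker(rank: int, world_size: int) -> None:
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        # In real training this tensor would hold local gradients.
        t = torch.tensor([float(rank)])
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        print(f"rank {rank}: allreduce result = {t.item()}")

        dist.destroy_process_group()


    if __name__ == "__main__":
        world_size = 4
        mp.spawn(worker, args=(world_size,), nprocs=world_size)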

These innovations improve usability and performance, but data engineering remains a bottleneck. Distributed training depends on massive, high-quality, sharded datasets, and most organizations still rely on tools like Pandas or PySpark, which often fail to scale effectively or leave performance untapped.

Compilers Were the Real MVP

Across inference, reinforcement learning, and distributed training, compilers are what make scale possible by bridging the gap between high-level productivity and low-level performance. Notable projects in this space include:

  • torch.compile: For automatic optimization of PyTorch models and operators with minimal code changes (a minimal usage sketch follows this list).
  • Triton: For fine-grained control over custom machine learning kernels.

And announced at this year’s conference:

  • Helion: An embedded Python DSL that combines the simplicity and portability of PyTorch with the performance characteristics of lower-level languages tuned for specific hardware.
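To make the torch.compile item concrete, here is a minimal usage sketch. The model architecture and tensor shapes are arbitrary placeholders; the point is that a single call wraps an eager-mode model so the compiler stack can generate optimized kernels for it.

    # Minimal torch.compile sketch: the same eager-mode model, wrapped so the
    # compiler can fuse and optimize the underlying kernels. Shapes are
    # arbitrary placeholders.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

    # One line turns eager execution into a compiled graph; the first call
    # triggers compilation, and subsequent calls reuse the optimized code.
    compiled_model = torch.compile(model)

    x = torch.randn(32, 512)
    out = compiled_model(x)
    print(out.shape)  # torch.Size([32, 10])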

More and more, engineers and scientists are moving away from hand-tuning hardware-specific, low-level kernels in favor of higher-level DSLs. For compiler developers like us at Bodo, this shift is exciting to see, and proof that the future of AI innovation is compiler-driven.

Data Infrastructure for AI: The Missing Piece

Robust data pipelines remain a critical part of AI infrastructure, even as advances in inference, distributed training, and performance tuning allow practitioners to scale workloads faster and more cost-effectively than ever before.

Modern AI workloads consume massive volumes of training and inference data that must be cleaned, joined, and transformed into model-ready formats. These steps are computationally intensive and highly parallelizable, yet they are still handled by outdated tools that either don’t scale or do so inefficiently. Without efficient preprocessing and loading, GPU clusters sit idle.

At Bodo, we believe data infrastructure for AI should be seamless. Our DataFrame library scales and accelerates existing Pandas code with minimal changes, and integrates directly with AI workflows such as distributed training in PyTorch to help teams iterate faster than ever.
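As a rough illustration of that handoff, the sketch below joins and cleans raw tables with plain Pandas and then streams the result into a PyTorch DataLoader. The file and column names are hypothetical, and this is not Bodo's actual API; the Pandas portion is simply the part of the pipeline that a drop-in accelerated DataFrame library is meant to scale.

    # Plain-Pandas sketch of DataFrame preprocessing feeding PyTorch training.
    # File and column names are hypothetical; this does not show Bodo's API.
    import pandas as pd
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Clean and join raw tables into model-ready features.
    events = pd.read_parquet("events.parquet")
    users = pd.read_parquet("users.parquet")
    df = events.merge(users, on="user_id").dropna()

    features = torch.tensor(
        df[["age", "session_length"]].to_numpy(), dtype=torch.float32
    )
    labels = torch.tensor(df["converted"].to_numpy(), dtype=torch.float32)

    loader = DataLoader(TensorDataset(features, labels), batch_size=1024, shuffle=True)

    for x, y in loader:
        ...  # forward/backward pass of the training loop goes here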

Conclusion

PyTorch Conference 2025 showcased an ecosystem rapidly maturing in its ability to scale training, inference, and reinforcement learning. The community’s progress in distributed compute and compiler technology is remarkable, and points to a future where large-scale AI workloads are increasingly accessible. 

But at the same time, as models grow larger and systems become more distributed, scalable and reliable data infrastructure will be more important than ever. Without it, even the most optimized training loop will end up waiting for data.

That’s the future we’re building toward, and we’re excited to collaborate with the PyTorch community to make it real! Join our community Slack to collaborate, ask questions, and shape the future of large-scale AI systems together.
