Building PyDough: Why We Built a New Analytics Language

February 2, 2026

Bodo Engineering Team

Analytics work often starts with a question, usually a relational one. For example: “Which customers contacted support before making a purchase?” The question is intuitive. Most people would interpret it the same way. It’s about the relationship between two events and their order in time.

Answering that question in SQL, however, requires a translation. You don’t describe the relationship directly. Instead, you decompose the question into tables, joins, filters, and comparisons. You decide which tables to join, which timestamps to compare, how to handle multiple related records, and where conditions should apply. The relationship you care about—support happened before purchase—isn’t stated explicitly. It’s implied by how the query is constructed.

And that implication hides a surprising amount of ambiguity. Does “before” mean any time prior, or within a specific window—an hour, a day, a week? What counts as “contacted support”? A phone call with an agent? An automated chatbot interaction? A ticket that was opened but never responded to?

None of those distinctions are visible in the query itself. They’re encoded indirectly through table choices, join logic, timestamp comparisons, and conventions that live outside the question. The SQL may be syntactically correct and logically consistent, but whether it actually reflects the intent of the original question is something you have to infer—often by reading between the lines.

Any time a question is translated into SQL, whether by a person, a BI tool, or a text-to-SQL system, there’s room for meaning to shift. Assumptions get baked in and relationships get flattened. It’s often hard to tell, just by looking at the SQL, whether the query you built actually matches what you meant to ask. The database can tell you that the query runs, but it can’t tell you whether it answers the question you thought you were asking.

As we spent more time thinking about what it would mean to genuinely “talk” to data—asking questions, refining them, and trusting the answers—it became clear that SQL wasn’t giving us the right level of expression for that kind of workflow. What we wanted instead was:

a way to express relationships directly, without reconstructing them every time
a way for intent to be visible in the query itself, not inferred from structure
a language that could sit closer to the question, so meaning could be inspected and reasoned about before execution

That’s why we built the PyDough DSL. In the rest of this post, we’ll walk through the specific limitations of SQL that led us here, and show how PyDough approaches analytics from a different starting point—one built around relationships and intent, rather than tables and joins.

A Concrete Example

Let’s go back to this question: “Which customers contacted support before making a purchase?”

In SQL, the question has to be decomposed into parts and then rewritten mechanically:

SELECT DISTINCT c.customer_id
FROM customers c
JOIN support_tickets s ON c.customer_id = s.customer_id
JOIN purchases p ON c.customer_id = p.customer_id
WHERE s.created_at < p.purchase_date;

The query runs. But it doesn’t read like the original question. There’s no explicit notion of before, and no clear representation of the idea that a support interaction precedes a purchase. Those ideas are encoded indirectly through joins and a timestamp comparison.

Whether this query actually matches the original intent depends on details that aren’t obvious from the question itself: how multiple purchases are handled, which timestamps are used, and whether multiple support interactions should matter. The logic is present, but it’s buried in structure.
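To make that ambiguity concrete, here is a small sketch using Python’s built-in sqlite3 module. The table and column names follow the query above. The sample rows and the “first purchase” variant are assumptions added for illustration. Both queries are reasonable readings of “contacted support before making a purchase”, yet they return different customers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE support_tickets (customer_id INTEGER, created_at TEXT);
    CREATE TABLE purchases (customer_id INTEGER, purchase_date TEXT);

    INSERT INTO customers VALUES (1), (2);

    -- Customer 1: contacted support, then made a first purchase.
    INSERT INTO support_tickets VALUES (1, '2025-01-01');
    INSERT INTO purchases VALUES (1, '2025-01-05');

    -- Customer 2: bought first, contacted support later, then bought again.
    INSERT INTO purchases VALUES (2, '2025-01-01');
    INSERT INTO support_tickets VALUES (2, '2025-02-01');
    INSERT INTO purchases VALUES (2, '2025-03-01');
""")

# Reading A (the query from the post): any ticket before any purchase.
ANY_PURCHASE = """
    SELECT DISTINCT c.customer_id
    FROM customers c
    JOIN support_tickets s ON c.customer_id = s.customer_id
    JOIN purchases p ON c.customer_id = p.customer_id
    WHERE s.created_at < p.purchase_date
    ORDER BY c.customer_id
"""

# Reading B (assumed alternative): a ticket before the customer's first purchase.
FIRST_PURCHASE = """
    SELECT DISTINCT c.customer_id
    FROM customers c
    JOIN support_tickets s ON c.customer_id = s.customer_id
    WHERE s.created_at < (
        SELECT MIN(p.purchase_date)
        FROM purchases p
        WHERE p.customer_id = c.customer_id
    )
    ORDER BY c.customer_id
"""

any_purchase_ids = [row[0] for row in conn.execute(ANY_PURCHASE)]
first_purchase_ids = [row[0] for row in conn.execute(FIRST_PURCHASE)]

print(any_purchase_ids)    # [1, 2]
print(first_purchase_ids)  # [1]
```

Customer 2 appears only under the first reading, because their ticket precedes a later purchase but not their first one. Nothing in either query states which reading was intended.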

Where SQL hides relationships in schema, PyDough expresses them in language.

In PyDough, this question would look like:

Customers.FILTER(
    EXISTS(
        support_tickets.timestamp < purchases.timestamp
    )
)

Or:

Customers.WHERE(
    FOLLOWED_BY(
        interaction = support_tickets,
        outcome = purchases
    )
)

Here, the query isn’t manually reconstructing relationships. It’s referencing relationships that already exist in the semantic layer. PyDough understands how customers relate to support tickets and purchases, which timestamps define ordering, and that support happens before a purchase. You can look at this query and immediately understand what it’s asking.

Queries can also build on each other—relationally.

Start with:

high_value_customers = Customers.FILTER(total_spent > 1000)

Then branch:

churn_risk = high_value_customers.FILTER(
    NO(purchases AFTER support_tickets)
)

For a data engineer, this approach changes how analytical logic is developed. Instead of flattening everything into a single query, logic can be composed, named, and extended over time. Each step preserves context and intent, mirroring how the underlying question evolves.

It also makes validation easier. An analyst can read these expressions and quickly confirm that they reflect what she’s trying to ask. The query doesn’t just run—it communicates.

The PyDough Approach

PyDough was designed around a simple idea: if a language is going to sit between people and their data, it needs to make intent explicit—not just executable. That shows up in a few key ways.

Readable

PyDough queries are structured to read like the questions they represent. Relationships are expressed directly, and the shape of the query mirrors the shape of the reasoning. This makes queries easier to review, easier to discuss, and easier to reuse without re-interpreting logic each time.

Inspectable

In SQL, intent is implicit. In PyDough, it’s part of the language. Concepts like before, after, exists, and followed by are first-class. That means you can inspect a query and understand the behavior it describes—not just the operations it performs.

This becomes especially important as queries evolve, get reused, or are generated by tools. Instead of asking “does this SQL look right?”, you can ask a more meaningful question: “does this expression reflect the behavior we care about?”

Constrained

One of SQL’s strengths is its flexibility. You can join almost anything to anything, and the database will try to execute it. That flexibility is also a source of ambiguity: many logically invalid or nonsensical queries are still perfectly executable.

With PyDough, queries are bound to a knowledge graph that defines what relationships exist and how entities connect. If a relationship isn’t defined, it can’t be queried. This prevents entire classes of errors, reduces ambiguity, and keeps queries grounded in the actual structure of the domain. Instead of relying on downstream validation, PyDough enforces structural validity at the language level.
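The idea of binding queries to a graph of defined relationships can be sketched in a few lines of plain Python. This is a hand-rolled illustration, not PyDough’s metadata format or API: the `RELATIONSHIPS` dictionary, `validate_path`, and `UndefinedRelationshipError` are hypothetical names, and for simplicity each relationship is named after the collection it points to.

```python
# Hypothetical, hand-rolled sketch; this is not PyDough's metadata format or API.
# Each collection maps to the relationships that may be traversed from it.
RELATIONSHIPS = {
    "customers": {"support_tickets", "purchases"},
    "support_tickets": set(),
    "purchases": set(),
}


class UndefinedRelationshipError(Exception):
    """Raised when a query path uses a relationship the graph does not define."""


def validate_path(path: str) -> list[str]:
    """Check a dotted path such as 'customers.purchases' against the graph."""
    steps = path.split(".")
    current = steps[0]
    if current not in RELATIONSHIPS:
        raise UndefinedRelationshipError(f"unknown collection: {current!r}")
    for step in steps[1:]:
        if step not in RELATIONSHIPS[current]:
            raise UndefinedRelationshipError(
                f"{current!r} has no relationship named {step!r}"
            )
        current = step
    return steps


print(validate_path("customers.purchases"))

try:
    # No relationship from purchases to support_tickets is defined above,
    # so this path is rejected before anything is executed.
    validate_path("purchases.support_tickets")
except UndefinedRelationshipError as err:
    print(f"rejected: {err}")
```

In SQL, the equivalent join between purchases and support tickets would run as long as the column types line up. In this sketch, the path is refused because the graph never declared it.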

What This Enables in Practice

PyDough unlocks a set of practical benefits that are difficult to achieve with SQL alone:

Built-in guardrails: PyDough enforces constraints at the language level, preventing entire classes of errors before execution.
Shared, reusable logic: Define a concept once and reuse it across analyses without flattening it into a single expression.
Policy-driven analytics: Teams can enforce analytical rules—such as preventing inappropriate aggregations or disallowed operations—directly in the language.
Safer by construction: Because queries are generated from structured expressions rather than raw strings, entire classes of issues like SQL injection are eliminated.
Extensible by design: New relational patterns, domain-specific operators, and organizational rules can be added without changing the underlying execution layer.
Intent-aware error messages: Errors explain what’s wrong with the question, not just what failed during execution, making debugging and onboarding easier.
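The “safer by construction” point in the list above can be illustrated with a minimal sketch, assuming a toy expression type and a schema whitelist. `GreaterThan`, `SCHEMA`, and `compile_filter` are hypothetical names for this example and are not part of PyDough. The sketch shows the general technique: identifiers are checked against known metadata, and values are passed as bound parameters rather than spliced into the SQL string.

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical schema whitelist; a real semantic layer would supply this.
SCHEMA = {"customers": {"customer_id", "total_spent"}}


@dataclass(frozen=True)
class GreaterThan:
    """Structured form of `column > value` (illustrative, not a PyDough type)."""

    column: str
    value: float


def compile_filter(table: str, expr: GreaterThan) -> tuple[str, tuple]:
    """Turn a structured filter into SQL text plus bound parameters."""
    if table not in SCHEMA or expr.column not in SCHEMA[table]:
        raise ValueError(f"unknown table or column: {table}.{expr.column}")
    # Identifiers come only from the whitelist; the value is bound, never spliced in.
    sql = (
        f"SELECT customer_id FROM {table} "
        f"WHERE {expr.column} > ? ORDER BY customer_id"
    )
    return sql, (expr.value,)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, total_spent REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, 250.0), (2, 4800.0)])

# Roughly the high_value_customers filter from earlier in the post.
sql, params = compile_filter("customers", GreaterThan("total_spent", 1000))
high_value_ids = [row[0] for row in conn.execute(sql, params)]
print(high_value_ids)  # [2]
```

A column name such as `"total_spent; DROP TABLE customers"` never reaches the database in this sketch, because it fails the whitelist check and raises `ValueError` first.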

Closing Thoughts

SQL remains an incredibly powerful execution language. It’s optimized for running queries efficiently over structured data, and it does that job well.

PyDough isn’t trying to replace SQL. It’s designed to sit upstream, where questions are formed, refined, and reasoned about. By making relationships explicit and intent inspectable, PyDough reduces the gap between the question you’re asking and the query you’re running.

As analytics becomes more conversational, iterative, and collaborative, that gap matters more than ever. PyDough is our attempt to close it—not by generating better SQL, but by giving people a language that’s closer to how they actually think about their data.

PyDough-CE is fully open source and available on GitHub. Check out the PyDough-CE repository, run the quick-start guide on your own data, and experience simpler, safer analytics generation firsthand.