Building PyDough: Why We Built a New Analytics LanguageBuilding PyDough: Why We Built a New Analytics Language

Building PyDough: Why We Built a New Analytics Language

Date
February 2, 2026
Author
Bodo Engineering Team

Analytics work often starts with a question——often a relational one. For example: “Which customers contacted support before making a purchase?”  The question is intuitive. Most people would interpret it the same way. It’s about the relationship between two events and their order in time.

Answering that question in SQL, however, requires a translation. You don’t describe the relationship directly. Instead, you decompose the question into tables, joins, filters, and comparisons. You decide which tables to join, which timestamps to compare, how to handle multiple related records, and where conditions should apply. The relationship you care about—support happened before purchase—isn’t stated explicitly. It’s implied by how the query is constructed.

And that implication hides a surprising amount of ambiguity. Does “before” mean any time prior, or within a specific window—an hour, a day, a week? What counts as “contacted support”? A phone call with an agent? An automated chatbot interaction? A ticket that was opened but never responded to?

None of those distinctions are visible in the query itself. They’re encoded indirectly through table choices, join logic, timestamp comparisons, and conventions that live outside the question. The SQL may be syntactically correct and logically consistent, but whether it actually reflects the intent of the original question is something you have to infer—often by reading between the lines.

Any time a question is translated into SQL—whether by a person, a BI tool, or a text-to-SQL system—there’s room for meaning to shift. Assumptions get baked in, relationships get flattened. And it’s often hard to tell, just by looking at the SQL, whether the query you build actually matches what you meant to ask. The database can tell you that the query runs, but it can’t tell you whether it answers the question you thought you were asking.

As we spent more time thinking about what it would mean to genuinely “talk” to data—asking questions, refining them, and trusting the answers—it became clear that SQL wasn’t giving us the right level of expression for that kind of workflow. What we wanted instead was:

  • a way to express relationships directly, without reconstructing them every time
  • a way for intent to be visible in the query itself, not inferred from structure
  • a language that could sit closer to the question, so meaning could be inspected and reasoned about before execution

That’s why we built the PyDough DSL. In the rest of this post, we’ll walk through the specific limitations of SQL that led us here, and show how PyDough approaches analytics from a different starting point—one built around relationships and intent, rather than tables and joins. 

A Concrete Example

Let’s look at this question: For every nation in Europe, what is the total quantity shipped to customers in that nation from Japanese suppliers that were ordered in the first quarter of 1996?

In SQL, your question has to be decomposed into parts. Then rewritten, mechanically:

SELECT N1.N_NAME AS NATION_NAME, COALESCE(T1.TOTAL_QUANTITY, 0) AS TOTAL_QUANTITY
FROM NATION N1
INNER JOIN REGION R
ON N1.N_REGIONKEY = R.R_REGIONKEY
LEFT JOIN (
  SELECT C.C_NATIONKEY, SUM(L.L_QUANTITY) AS TOTAL_QUANTITY
  FROM CUSTOMER
  INNER JOIN ORDERS O
  ON C.C_CUSTKEY = O.O_CUSTKEY
  INNER JOIN LINEITEM L
  ON O.O_ORDERKEY = L.L_ORDERKEY
  INNER JOIN SUPPLIER S
  ON L.L_SUPPKEY = S.S_SUPPKEY
  INNER JOIN NATION N2
  ON N2.N_NATIONKEY = S.S_NATIONKEY
  WHERE N2.N_NAME = 'JAPAN'
  AND STRFTIME('%Y-%m', o_orderdate) IN ('1996-01', '1996-02', '1996-03')
  GROUP BY C.C_NATIONKEY
) AS T1
WHERE R.R_NAME = 'EUROPE'

The query runs. But it doesn’t read like the original question. There’s no explicit notion of “from Japanese suppliers,” no clear expression of “customers in that nation,” and no visible concept of “ordered in the first quarter of 1996.” Those ideas are encoded indirectly through a web of joins, subqueries, filters, and grouping logic. The meaning of the question emerges only if you mentally reconstruct how nations, regions, customers, suppliers, orders, and line items relate to one another—and how those relationships are being constrained.

Whether this query actually matches the original intent depends on a long list of details that never appear in the question. You have to remember SQL’s order of operations, decide which joins should be inner versus left, keep track of all the join keys and grouping keys, and know when aggregation needs to happen and at what level. You also have to deal with dialect-specific quirks—like how to express “the first quarter of 1996” in this particular SQL engine—and compensate for structural side effects by coalescing nulls introduced by outer joins. The logic is present, but it’s buried under implementation mechanics. Instead of expressing the question directly, the query forces you to translate intent into a set of low-level rules and hope they recombine into the meaning you had in mind.

Where SQL hides relationships in schema, PyDough expresses them in language. 

In PyDough, this question would look like:

selected_orders = customers.orders.WHERE((YEAR(order_date) == 1996) & (QUARTER(order_date)==1)(

selected_lines = selected_orders.lines.WHERE(supplier.nation.name)==“JAPAN”)

result = nations.WHERE(region.name == “EUROPE”).CALCULATE(name, total_quantity=SUM(selected_lines.quantity))

Here, the query isn’t manually reconstructing how tables fit together. It’s referencing relationships that already exist in the semantic layer. PyDough understands how customers relate to orders, how orders relate to line items, how suppliers relate to nations, and how nations roll up into regions. It also understands how time is modeled—what it means for an order to fall in the first quarter of a given year—without relying on dialect-specific date logic. You can read this expression top to bottom and see the question taking shape: select the relevant orders, filter to Japanese suppliers, then aggregate shipped quantity by European nation. The query doesn’t just execute correctly; it makes the intent of the analysis explicit.

Queries can also build on each other—relationally.

Start by defining the scope of interest:

european_nations = nations.WHERE(region.name == "EUROPE")

Then refine the set of relevant orders:

q1_orders_1996 = customers.orders.WHERE(
    (YEAR(order_date) == 1996) & (QUARTER(order_date) == 1)
)

Then narrow to the subset that involves Japanese suppliers:

japanese_lines = q1_orders_1996.lines.WHERE(
    supplier.nation.name == "JAPAN"
)

Finally, bring those pieces together:

result = european_nations.CALCULATE(
    name,
    total_quantity = SUM(japanese_lines.quantity)
)

For a data engineer, this changes how analytical logic is developed. Instead of flattening everything into a single, monolithic query, logic is composed in stages, named, and reused. Each step captures a meaningful concept—European nations, first-quarter orders, Japanese suppliers—and preserves the context in which it was defined. As the question evolves, the analysis evolves with it.

It also makes validation easier. An analyst can read each expression and confirm that it matches a specific piece of the question. There’s no need to reverse-engineer intent from join order or aggregation keys. The query doesn’t just run—it communicates what it’s trying to say.

The PyDough Approach

PyDough was designed around a simple idea: if a language is going to sit between people and their data, it needs to make intent explicit—not just executable. That shows up in a few key ways.

Readable

PyDough queries are structured to read like the questions they represent. Relationships are expressed directly, and the shape of the query mirrors the shape of the reasoning. This makes queries easier to review, easier to discuss, and easier to reuse without re-interpreting logic each time.

Inspectable

In SQL, intent is implicit. In PyDough, it’s part of the language. Concepts like before, after, exists, and followed by are first-class. That means you can inspect a query and understand the behavior it describes—not just the operations it performs.

This becomes especially important as queries evolve, get reused, or are generated by tools. Instead of asking “does this SQL look right?”, you can ask a more meaningful question: “does this expression reflect the behavior we care about?”

Constrained

One of SQL’s strengths is its flexibility. You can join almost anything to anything, and the database will try to execute it. That flexibility is also a source of ambiguity: many logically invalid or nonsensical queries are still perfectly executable.

With PyDough, queries are bound to a knowledge graph that defines what relationships exist and how entities connect. If a relationship isn’t defined, it can’t be queried. This prevents entire classes of errors, reduces ambiguity, and keeps queries grounded in the actual structure of the domain. Instead of relying on downstream validation, PyDough enforces correctness at the language level.

What This Enables in Practice

PyDough unlocks a set of practical benefits that are difficult to achieve with SQL alone:

  • Built-in guardrails: PyDough enforces constraints at the language level, preventing entire classes of errors before execution.
  • Shared, reusable logic: Define a concept once and reuse it across analyses without flattening it into a single expression.
  • Policy-driven analytics: Teams can enforce analytical rules—such as preventing inappropriate aggregations or disallowed operations—directly in the language.
  • Safer by construction: Because queries are generated from structured expressions rather than raw strings, entire classes of issues like SQL injection are eliminated.
  • Extensible by design: New relational patterns, domain-specific operators, and organizational rules can be added without changing the underlying execution layer.
  • Intent-aware error messages: Errors explain what’s wrong with the question, not just what failed during execution, making debugging and onboarding easier.

Closing Thoughts

SQL remains an incredibly powerful execution language. It’s optimized for running queries efficiently over structured data, and it does that job well.

PyDough isn’t trying to replace SQL. It’s designed to sit upstream, where questions are formed, refined, and reasoned about. By making relationships explicit and intent inspectable, PyDough reduces the gap between the question you’re asking and the query you’re running.

As analytics becomes more conversational, iterative, and collaborative, that gap matters more than ever. PyDough is our attempt to close it—not by generating better SQL, but by giving people a language that’s closer to how they actually think about their data.

PyDough-CE is fully open source and available on GitHub. Check out the PyDough-CE repository, run the quick-start guide on your own data, and experience simpler, safer analytics generation firsthand.

Ready to see Bodo in action?
Schedule a demo with a Bodo expert

Let’s go