Natural language to SQL (NL2SQL) transformation is a very active field of research and product development, boosted by the advancing capabilities of LLMs. In keeping with the GenAI hype, ambitious claims are made about its potential, particularly the idea that it will increase access to the valuable business data in relational databases by 10-100x once the hurdle of having to know SQL is removed.
I’ve spent time experimenting with NL2SQL backed by an LLM, evaluating what it is capable of and, more importantly, what it can’t do. What spurred me to look into this was the recent open sourcing of a product in this area, Dataherald. Pulling directly from Dataherald’s GitHub page:
Dataherald is a natural language-to-SQL engine built for enterprise-level question answering over relational data. It allows you to set up an API from your database that can answer questions in plain English.
Dataherald’s APIs allow one to build a repository of target database schemas, validated queries, schema annotations and instructions, and then inject this as context into an LLM prompt. The idea is the additional context will help the LLM generate more accurate SQL queries. Dataherald will also execute the SQL query generated from a natural language question against the target database using agents created through its integration with LangChain.
Dataherald currently supports the following database targets: PostgreSQL, SQL Server, Databricks, Snowflake, Redshift, BigQuery, Athena, MariaDB and ClickHouse. I thought it would be an interesting exercise to extend the Dataherald code to add support for Yellowbrick. Previously, I created a connector for LangChain that enables Yellowbrick to be used as a vector store for similarity searches, so I decided to integrate this capability into Dataherald too, as well as adding Yellowbrick as a target database.
With my Dataherald extension anyone with data in Yellowbrick can ask natural language questions of their data and get answers via SQL queries that run against their databases. Dataherald and the LLM do the heavy lifting of converting the question into SQL based on the context provided by Dataherald, and then LangChain agents execute the query on Yellowbrick.
Yellowbrick comes with a set of sample datasets, including NOAA weather observations. The NOAA dataset contains two main tables of interest: the observations fact table, which holds 9.2Bn records representing weather measurements over the past 300 years; and the stations dimension table, which lists the 125,000 weather stations where the observations were made. The SQL queries generated ran on a single-node, 16 vCPU Yellowbrick compute cluster in AWS.
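To give a feel for the shape of the data, here is a minimal sketch of how the two tables relate. Apart from year_date (described later in this post), the column names here — station_id, element, data_value, station_name — are my assumptions based on the GHCN-Daily format, not necessarily the exact Yellowbrick sample schema:

-- Count rainy days at each station in June 2024; column names other
-- than year_date are assumptions based on the GHCN-Daily format.
SELECT
    s.station_name,
    COUNT(*) AS rainy_days
FROM
    noaa_ghcn_pds.observations o
JOIN
    noaa_ghcn_pds.stations s ON o.station_id = s.station_id
WHERE
    o.element = 'PRCP'                            -- GHCN code for precipitation
    AND o.data_value > 0                          -- any measurable rain
    AND o.year_date BETWEEN 20240601 AND 20240630
GROUP BY
    s.station_name;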
I ran experiments against the NOAA data set using Dataherald backed by gpt-4o. I didn’t manually add any context to the schema which meant only the metadata that Dataherald automatically collects when scanning tables and columns was added to the prompt. This metadata includes table and column names, data types and sample column values.
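Conceptually, the scan gathers the kind of information you could pull from the catalog yourself. A rough SQL sketch of the equivalent (Dataherald does this internally, so the queries below are illustrative rather than its actual implementation):

-- Table and column names plus data types, straight from the catalog
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'noaa_ghcn_pds';

-- Plus a handful of sample values per column (station_name is an assumed name)
SELECT DISTINCT station_name
FROM noaa_ghcn_pds.stations
LIMIT 5;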
The first question I asked was:
Find the start and end dates of the longest consecutive period with no rain in Cary, NC in June 2024
Dataherald and gpt-4o really struggled with this. Although the combination successfully ran the SQL generated in response to the question, it took 30 attempts asking the same question over and over before it finally hit upon a SQL query that produced the right answer:
The start and end dates of the longest consecutive period with no rain in Cary, NC in June 2024 are from June 11, 2024, to June 30, 2024.
I verified the answer on the NOAA page for the Cary weather station. One correct answer in 30 attempts works out to an accuracy of around 3%. It’s clear that without much context, gpt-4o does a poor job of generating an accurate SQL query in response to the plain English question. Looking over the LLM’s failed attempts, it was clear it was struggling to figure out how to craft the SQL needed to calculate consecutive periods between events. It needed a helping hand, so I set up the following instructions to be added to the prompt, along with a couple of pointers on date formats and weather station names:
If you are asked to calculate consecutive periods between events, base your answer on the following pattern:
WITH current_prior_date AS (
SELECT
year_date,
LAG(year_date) OVER (ORDER BY year_date) AS prev_date
FROM
noaa_ghcn_pds.observations
),
consecutive_period AS (
SELECT
year_date,
prev_date,
CASE
WHEN prev_date IS NULL OR year_date - prev_date > 1 THEN 1
ELSE 0
END AS is_new_period
FROM
current_prior_date
)
year_date is an integer field that stores a date following the pattern YYYYMMDD
Station names and Station states are all stored as UPPER CASE
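The pattern in the instructions deliberately stops at flagging where each new period starts. For completeness, here is a sketch of how that pattern is typically finished off to find the longest run, continuing the same CTEs: a running sum over is_new_period assigns an id to each consecutive stretch, and grouping by that id yields its start and end dates. This is my illustration of the standard gaps-and-islands technique, not part of the prompt instructions themselves:

-- Continuing the WITH clause from the pattern above
, numbered_periods AS (
    SELECT
        year_date,
        SUM(is_new_period) OVER (ORDER BY year_date) AS period_id
    FROM
        consecutive_period
)
SELECT
    MIN(year_date) AS start_date,
    MAX(year_date) AS end_date,
    COUNT(*) AS period_length
FROM
    numbered_periods
GROUP BY
    period_id
ORDER BY
    period_length DESC
LIMIT 1;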
Once provided with this guidance, the accuracy went up to around 80%. Not bad. How can I improve the accuracy further? One way is to “cheat” and inject the correct NL/SQL pair into the prompt from the start. This is easy to do by labelling the successful query as a validated Golden SQL query in Dataherald. Dataherald stores the NL question and a reference to the validated SQL text as an embedding in Yellowbrick, performs a vector similarity search to find the SQL queries that most closely match the given question, and then provides the matching SQL as context to the LLM prompt.
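Under the covers, this is the kind of SQL a similarity search reduces to. A minimal sketch, assuming embeddings are stored one component per row and are unit-normalized so that cosine similarity becomes a plain dot product; the table and column names are illustrative, not Dataherald’s actual schema:

-- Find the stored Golden SQL questions closest to the new question;
-- golden_sql_embeddings and question_embedding are illustrative names.
SELECT
    e.doc_id,
    SUM(e.component * q.component) AS cosine_similarity
FROM
    golden_sql_embeddings e
JOIN
    question_embedding q ON e.dim = q.dim   -- align vector dimensions
GROUP BY
    e.doc_id
ORDER BY
    cosine_similarity DESC
LIMIT 3;   -- top matches get injected as context into the LLM prompt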
Unsurprisingly, doing this yields the correct answer from the LLM 100% of the time. A more interesting experiment is to test whether a single example that shows how to join tables, find weather stations and handle dates yields more accurate answers to an adjacent weather question. Let’s try the following question:
Find the start and end dates of the longest period of rain at Portland International Airport, Oregon in February 2024.
The system provided the correct answer, this time in a tabular format:
start_date | end_date
-----------|---------
20240214 | 20240221
The NOAA website confirms this as correct.
It turns out the “cheat method” of adding examples of successful queries can yield more accurate results than spending lots of time annotating the database schema with metadata descriptions, as this paper points out. That’s not to say that providing the LLM with information about the schema is time wasted; it’s just that the more significant boosts to accuracy come from feeding it prior SQL examples.
If you really double down on the Golden SQL examples and schema annotation, then the accuracy starts to become quite interesting. Take, for example, the TPC-DS data set. Yellowbrick comes with the 1 TB scale variant, along with 99 SQL queries that answer business questions against this synthetic retail data set. This is a more complex snowflake schema comprising 24 fact and dimension tables. The TPC-DS documentation provides a natural language description of each of the 99 queries, so NL/SQL pairs can easily be loaded as embeddings into Yellowbrick using Dataherald. With this large corpus of validated example queries, gpt-4o becomes rather good at generating SQL and answers. For example, I can ask the question:
How many men’s watches were sold in Florida through the web in 2000 and what was the average age of the buyer?
The right answer is returned first time:
Number of men’s watches sold: 18,520. Average age of the buyer: 42.24 years
And the SQL generated to arrive at this answer is correct:
-- SQL query to find the number of men's watches sold in Florida through the web in 2000 and the average age of the buyer
SELECT
COUNT(ws.ws_quantity) AS mens_watches_sold, -- Count the number of men's watches sold
AVG(2000 - c.c_birth_year) AS average_age -- Calculate the average age of the buyers
FROM
tpcds_sf1000.web_sales ws
JOIN
tpcds_sf1000.customer c ON ws.ws_bill_customer_sk = c.c_customer_sk
JOIN
tpcds_sf1000.customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
JOIN
tpcds_sf1000.date_dim dd ON ws.ws_sold_date_sk = dd.d_date_sk
JOIN
tpcds_sf1000.item i ON ws.ws_item_sk = i.i_item_sk
WHERE
dd.d_year = 2000 -- Filter for the year 2000
AND ca.ca_state = 'FL' -- Filter for sales in Florida
AND i.i_class = 'mens watch' -- Filter for men's watches
GROUP BY
i.i_class;
And another example:
Find the top 5 most profitable and least returned items through the store channel in North Carolina in 1999
Which resulted in the SQL:
-- Find the top 5 most profitable and least returned items through the store channel in North Carolina in 1999
WITH item_profit AS (
SELECT
ss_item_sk,
SUM(ss_net_profit) AS total_net_profit
FROM
tpcds_sf1000.store_sales
JOIN
tpcds_sf1000.date_dim ON ss_sold_date_sk = d_date_sk
JOIN
tpcds_sf1000.store ON ss_store_sk = s_store_sk
WHERE
d_year = 1999
AND s_state = 'NC'
GROUP BY
ss_item_sk
),
item_returns AS (
SELECT
sr_item_sk,
SUM(sr_return_quantity) AS total_return_quantity
FROM
tpcds_sf1000.store_returns
JOIN
tpcds_sf1000.date_dim ON sr_returned_date_sk = d_date_sk
JOIN
tpcds_sf1000.store ON sr_store_sk = s_store_sk
WHERE
d_year = 1999
AND s_state = 'NC'
GROUP BY
sr_item_sk
)
SELECT
i.i_item_id,
i.i_item_desc,
COALESCE(ip.total_net_profit, 0) AS total_net_profit,
COALESCE(ir.total_return_quantity, 0) AS total_return_quantity
FROM
tpcds_sf1000.item i
LEFT JOIN
item_profit ip ON i.i_item_sk = ip.ss_item_sk
LEFT JOIN
item_returns ir ON i.i_item_sk = ir.sr_item_sk
ORDER BY
total_net_profit DESC,
total_return_quantity ASC
LIMIT 5;
This looks ok!
Thanks for reading this far. In this post, we’ve tinkered with Dataherald and seen how it performs at the task of NL2SQL given different levels of additional context about the schema and example SQL workloads. We’ve also discussed how Dataherald can be integrated with the Yellowbrick Data Platform, using it both as a target database for executing generated SQL queries and as a vector store for retrieving Golden SQL examples.
As we discovered, the accuracy of NL2SQL can vary wildly. The vision of NL2SQL democratizing access to the crown jewels of a company’s data estate for everyone is a huge stretch and probably an unrealistic goal, given the complexity of most enterprises’ data warehouses. It’s simply too easy for current LLMs to generate the wrong answer. I think this capability has application today as an exploratory tool and as a SQL co-pilot that generates a starting point tailored to a customer’s schema and data, but it won’t kill the SQL language or the jobs market any time soon.
As a sign-off, take a look at the short recorded session showing Dataherald, gpt-4o and Yellowbrick in action: Dataherald and Yellowbrick