Text-to-SQL with Dataherald and Yellowbrick
Yellowbrick | Mark Cusack
5 Min Read

Natural language to SQL transformation is a very active field of research and product development, boosted by the advancing capabilities of LLMs. In keeping with the GenAI hype, ambitious claims are made about the potential of NL2SQL, particularly the idea that it will increase access to the valuable business data in relational databases by 10-100x, once the hurdle of having to know SQL is removed.

I’ve spend time experimenting with NL2SQL backed by an LLM, evaluating what it was capable of, and more importantly, what it can’t do. What spurred me to look into this was the recent open sourcing of a product in this area, Dataherald. Pulling directly from Dataherald’s GitHub page:

Dataherald is a natural language-to-SQL engine built for enterprise-level question answering over relational data. It allows you to set up an API from your database that can answer questions in plain English.

Dataherald’s APIs allow one to build a repository of target database schemas, validated queries, schema annotations and instructions, and then inject this as context into an LLM prompt. The idea is the additional context will help the LLM generate more accurate SQL queries. Dataherald will also execute the SQL query generated from a natural language question against the target database using agents created through its integration with LangChain.

Dataherald currently supports the following database targets: PostgreSQL, SQL Server, Databricks, Snowflake, Redshift, BigQuery, Athena, MariaDB and Clickhouse. I thought it would be an interesting exercise to extend the Dataherald code to add support for Yellowbrick. Previously, I created a connector for LangChain to enable Yellowbrick to be used as a vector store for similarity searches, so I decided to integrate this capability into Dataherald too, as well as adding Yellowbrick as a target database.

With my Dataherald extension anyone with data in Yellowbrick can ask natural language questions of their data and get answers via SQL queries that run against their databases. Dataherald and the LLM do the heavy lifting of converting the question into SQL based on the context provided by Dataherald, and then LangChain agents execute the query on Yellowbrick.

Yellowbrick comes with a set of sample datasets, including NOAA weather observations. The NOAA dataset contains two main tables of interest: the observation fact table that contains 9.2Bn records representing weather measurements over the past 300 years; and the stations dimensions table that lists the 125,000 weather stations where the observations were made. The SQL queries generated ran on a single node 16 vCPU Yellowbrick compute cluster in AWS.

I ran experiments against the NOAA data set using Dataherald backed by gpt-4o. I didn’t manually add any context to the schema which meant only the metadata that Dataherald automatically collects when scanning tables and columns was added to the prompt. This metadata includes table and column names, data types and sample column values.

The first question I asked was:

Find the start and end dates of the longest consecutive period with no rain in Cary, NC in June 2024

Dataherald and gpt-4o really struggled with this. Although the combination successfully ran the SQL generated in response to this question, it took 30 attempts asking the same question over and over before it finally lucked on a SQL query that produced the right answer.

The start and end dates of the longest consecutive period with no rain in Cary, NC in June 2024 are from June 11, 2024, to June 30, 2024.

I verified the answer on the NOAA page for the Cary weather station. An accuracy of around 3%. It’s clear that without much context, gpt-4o does a poor job of generating an accurate SQL query in response to the plain English question. Looking over the LLM’s failed attempts, it was clear that it was struggling to figure out how to craft the SQL needed to calculate consecutive periods between events. It needed a helping hand, and so I set up these instructions to be added to the prompt, which also included a couple of pointers on date formats and weather station names:

    
     If you are asked to calculate consecutive periods between events, base your answer on the following pattern:

WITH current_prior_date AS (
SELECT 
  year_date,
  LAG(year_date) OVER (ORDER BY year_date) AS prev_date     
  FROM 
     noaa_ghcn_pds.observations
),
consecutive_period AS (
   SELECT 
       year_date,
       prev_date,
       CASE 
           WHEN prev_date IS NULL OR year_date - prev_date > 1 THEN 1
           ELSE 0
       END AS is_new_period
   FROM 
       current_prior_date
)

year_date is an integer field that stores a date following the pattern YYYYMMDD

Station names and Station states are all stored as UPPER CASE
    
   

Once provided with this guidance, the accuracy went up to around 80%. Not bad. How can I improve the accuracy further? One way is to “cheat” and inject the correct NL/SQL pair into the prompt from the start. This is easy to do by labelling the successful query as a validated Golden SQL query in Dataherald. Dataherald stores the NL question and a reference to the validated SQL text as an embedding in Yellowbrick, performs a vector similarity search to find SQL queries that most closely matches the given question,  and then provides the matching SQL as context to the LLM prompt.

Unsurprisingly, doing this yields the correct answer from the LLM 100% of the time. A more interesting experiment is to test whether knowing how to join tables, find weather stations and handle dates given one example yields more accurate answers to an adjacent weather question. Let’s try with the following question:

Find the start and end dates of the longest period of rain at Portland International Airport, Oregon in February 2024.

The system provided the correct answer, this time in a tabular format:

    
     start_date | end_date
-----------|---------
20240214 | 20240221
    
   

Which the NOAA website confirms as correct.

It turns out the “cheat method” of adding examples of successful queries can yield more accurate results than spending lots of time on annotating the database schema with metadata descriptions, as this paper points out. That’s not to say that providing the LLM with information about the schema is time wasted, it’s just that more significant boosts to accuracy come from feeding it prior SQL examples.

If you really double down on the Golden SQL examples and schema annotation, then the accuracy starts to become quite interesting. Take, for example, the TPC-DS data set. Yellowbrick comes with the 1 TB scale variant along with 99 SQL queries that answer business questions against this synthetic retail data set. This schema is more complex snowflake schema comprising of 24 fact and dimension tables. The TPC-DS documentation provides a natural language description of each of the 99 queries, and so NL/SQL pairs can be easily loaded as embeddings into Yellowbrick using Dataherald. With this large corpus of validated example queries, gpt-4o becomes rather good at generating SQL and answers. For example, I can ask the question:

How many mens watches were sold in Florida through the web in 2020 and what was the average age of the buyer?

The right answer is returned first time:

Number of men’s watches sold: 18,520. Average age of the buyer: 42.24 years

And the SQL generated to arrive at this answer is correct:

    
     -- SQL query to find the number of men's watches sold in Florida through the web in 2000 and the average age of the buyer

SELECT 
   COUNT(ws.ws_quantity) AS mens_watches_sold, -- Count the number of men's watches sold
   AVG(2000 - c.c_birth_year) AS average_age -- Calculate the average age of the buyers
FROM 
   tpcds_sf1000.web_sales ws
JOIN 
   tpcds_sf1000.customer c ON ws.ws_bill_customer_sk = c.c_customer_sk
JOIN 
   tpcds_sf1000.customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
JOIN 
   tpcds_sf1000.date_dim dd ON ws.ws_sold_date_sk = dd.d_date_sk
JOIN 
   tpcds_sf1000.item i ON ws.ws_item_sk = i.i_item_sk
WHERE 
   dd.d_year = 2000 -- Filter for the year 2000
   AND ca.ca_state = 'FL' -- Filter for sales in Florida
   AND i.i_class = 'mens watch' -- Filter for men's watches
GROUP BY 
   i.i_class;
    
   

-And another example:

Find the top 5 most profitable and least returned items through the store channel in North Carolina in 1999

Which resulted in the SQL:

    
     -- Find the top 5 most profitable and least returned items through the store channel in North Carolina in 1999

WITH item_profit AS (
    SELECT
        ss_item_sk,
        SUM(ss_net_profit) AS total_net_profit
    FROM
        tpcds_sf1000.store_sales
    JOIN
        tpcds_sf1000.date_dim ON ss_sold_date_sk = d_date_sk
    JOIN
        tpcds_sf1000.store ON ss_store_sk = s_store_sk
    WHERE
        d_year = 1999
        AND s_state = 'NC'
    GROUP BY
        ss_item_sk
),
item_returns AS (
    SELECT
        sr_item_sk,
        SUM(sr_return_quantity) AS total_return_quantity
    FROM
        tpcds_sf1000.store_returns
    JOIN
        tpcds_sf1000.date_dim ON sr_returned_date_sk = d_date_sk
    JOIN
        tpcds_sf1000.store ON sr_store_sk = s_store_sk
    WHERE
        d_year = 1999
        AND s_state = 'NC'
    GROUP BY
        sr_item_sk
)
SELECT
    i.i_item_id,
    i.i_item_desc,
    COALESCE(ip.total_net_profit, 0) AS total_net_profit,
    COALESCE(ir.total_return_quantity, 0) AS total_return_quantity
FROM
    tpcds_sf1000.item i
LEFT JOIN
    item_profit ip ON i.i_item_sk = ip.ss_item_sk
LEFT JOIN
    item_returns ir ON i.i_item_sk = ir.sr_item_sk
ORDER BY
    total_net_profit DESC,
    total_return_quantity ASC
LIMIT 5;
    
   

This looks ok!

Thanks for reading this far. In this post, we’ve tinkered with Dataherald, seen how it performs at the task of NL2SQL given different levels of additional context about the schema and example SQL workloads. We’ve also discussed how Dataherald can be integrated with the Yellowbrick Data Platform, using it both as a target database for executing generated SQL queries, and as a vector store for retrieving Golden SQL examples from.

As we discovered, the accuracy of can NL2SQL vary wildly. The vision of NL2SQL democratizing access to the crown jewels of a company’s data estate to all is a huge stretch and is probably an unrealistic goal, given the complexity of most enterprises’ data warehouses. It’s too easy for current LLMs to generate the wrong answer, right now. I think this capability has application today as an exploratory tool and as a SQL code co-pilot to generate a SQL starting point that is tailored to a customer’s schema and data, but it won’t kill the SQL language or jobs market any time soon.

As a sign-off, take a look at the short recorded session showing Dataherald, gpt-4o and Yellowbrick in action: Datahearld and Yellowbrick

Sign up for our newsletter and stay up to date