Stories From the Field: The Devil Is In the to_number Details

Stories From the Field: The Devil Is In the to_number Details

to_number

As a software engineer, a big part of my job is to develop (and later maintain) new features for our product. In my specific case, that can range from planner optimizations like join elimination to whole new user-facing SQL functions.

Most recently, I had to implement one of those customer-facing functions, called to_number.

What Is to_number?

to_number is a SQL function that, given a string and a pattern template the string is assumed to follow, converts the expression into a canonical number.

For instance:

    
     SELECT to_number('1234', '9999');
    
   

returns 1234.

    
      
SELECT to_number('-12.34', '99.99');
    
   

returns -12.34.

Easy. But these cases would have worked with just a simple cast, no need to use to_number. Now, imagine you store the numbers in a way that’s easily readable by a human being, by adding separators, like 1,234. Then a cast won’t work anymore. But to_number will: 

    
     SELECT to_number('1,234', '9,999');
    
   

return 1234.

But what if the separator differs depending on the language the database is in? As a French engineer, I would write 1 234,000. But my US or UK colleagues would write 1,234.000 and a Spanish one would use 1.234,000…

This is one of the cases where to_number shines:

    
     SELECT to_number('1,234.000', '9G999D999');
    
   

returns 1234.000 in a US locale database. So does:

    
     SELECT to_number('1 234,000', '9G999D999');
    
   

in a French one.

If you set up your database locale correctly, the same query can be used in all languages, saving people the confusion of whether to use spaces, commas, or periods.

Those are just some examples of how one can use to_number. The function actually can do much more, with more than 15 different tokens that one can use in the format string to customize the output – a full list can be found in the Oracle documentation.

PATTERNDESCRIPTION
9Digit position (can be dropped if insignificant)
0Digit position (will not be dropped, even if insignificant)
. (PERIOD)decimal point 
D decimal point (uses locale) 
, (COMMA)  group (thousands) separator
G group separator (uses locale) 
PR negative value in angle brackets
MI minus sign in specified position (if number < 0) 
PL plus sign in specified position (if number > 0) 
SG plus/minus sign in specified position 
S sign anchored to number (uses locale) 
V shift specified number of digits, i.e., divides the input values by 10n, where n is the number of digits following V

Most common patterns supported by TO_NUMBER

Given how useful that function can be, it’s no wonder it’s available in multiple databases and people were asking for it in Yellowbrick. And at first glance, implementing it in our product seemed like a nice and easy job, worth a couple of days at most. After all, a function so widely used is well documented…

And yet this ended up way more challenging than I expected, keeping me up at night to try and figure out how to work around a certain hurdle.

Implementing to_number

Implementing to_number is mostly straightforward:

  1. Parse the pattern and create a list of actions.
  2. Go through each character of the input expression and perform the related action from the pattern.

The first step is easy: go through the pattern string one character at a time and map the character to the corresponding action token for later. This step lets us error out on invalid patterns or bad usage – specifying more than one sign token for instance.

The second step is the main work: for each character in the input expression string, look at the action and perform it. For instance, if the current action token is “9,” parse one digit of the final number.

While the logic behind the implementation is easy, there were a couple of issues I encountered.

1. Follow Other Implementations

Because the function exists in several databases, the specifications were obvious: make it work like in the other databases. As such, reading their online documentation was my main source of information on how to_number should work and what each pattern of the format meant. In my case, that meant looking at the Oracle documentation (the provider our customer was used to) and PostgreSQL documentation (the database Yellowbrick was built upon). Unfortunately, it turns out those documentations (and subsequent implementations) don’t agree with each other, which can lead to different results:

    
     -- PG13 results : 
SELECT to_number('123,45', '999G99'); -> 12345 
SELECT to_number('123,45', 'FM999G99'); -> 1234 
SELECT to_number('123,45', '99G99'); -> 123 
SELECT to_number('123,45', '99G999'); -> 1234 
    
   
    
     -- Oracle results : 
SELECT to_number('123,45', '999G99'); -> 12345 
SELECT to_number('123,45', 'FM999G99'); -> 12345 
SELECT to_number('123,45', '99G99'); -> ORA-01722: invalid number 
SELECT to_number('123,45', '99G999'); -> ORA-01722: invalid number
    
   

Likewise, spaces are not dealt with in the same way: Oracle allows spaces only at the beginning of the input while PG allows up to one space in between a digit/period and another token:

    
     '+ 123,45' and '+123,45' 
    
   

are the same in PG but in Oracle, the first returns an error. Likewise,

    
     '123 456.99' and '12 3456.99' 
    
   

are only valid in PG.

Most of those small differences are not documented online, leaving testing in all cases as the only way to figure out the details of what my implementation should be.

Worse, because we had customers using both PG and Oracle implementations, we had to implement both behaviors with a configuration option so that one could switch from one to the other.

2. Mapping Format with Input

Another issue I encountered was the mapping between the format patterns and the actual input. At first, the mapping between format and input seemed evident: one token of the format means one action and I could go like that until the end of the input. The problem is that the format doesn’t always match the input perfectly. In fact, it’s very common to see things like:

    
     SELECT to_number(‘1.234’, ‘9999.9999’);
    
   

A one-on-one mapping would not work here because the code would try to read 4 digits before getting to the decimal separator. But the input only has one digit before the period. The code ended up having to allow reading a digit or decimal indifferently of the action token in the format. Likewise, it can support signs or spaces on unexpected placements in comparison with the format.

3. Predicting the Result Size

The to_number function takes a string and returns a number of the type decimal. In the Yellowbrick engine, a decimal is defined by two values: precision and scale. The precision defines the maximum total number of digits, including decimal places (its scale). The scale defines the maximum number of digits to the right of the decimal point. For example, the number 5999.95 has a precision of 6 and a scale of 2.

The new problem I encountered here is that I needed to know the precision and scale of the result before parsing the actual input string so that I could declare the right integer type to store the result (32, 64, or 128 bytes) and compute the correct sizes up the tree during query planning. The most evident way of doing so is to use the format argument instead of the input, as it’s a constant and can be parsed while still planning the query.

Following that logic, pattern 999.99 would result in a decimal (5, 2).

But that expects the customer to give a format exactly the same as the input. As pointed out above, that is not true as things like:

    
     SELECT to_number(‘1.234’, ‘9999.9999’); 
    
   

are perfectly valid.

If I just used the precision and scale from the format and read the input as is, I would end up with a decimal (8, 4) and read 0.1234, which is not what I want.

To solve that issue, some post-processing is required. While parsing the input, I count the actual number of digits before and after the decimal. Once done, I compare that with the assumed scale and shift the result by the difference of scales so that I output the correct value.

4. Performance

Even when the code yields correct results, another important issue is performance. How well does the code perform with real-world data and scale?

When investigating how to best implement that function, I had a look at the vanilla PostgreSQL implementation and was quite surprised. Their implementation works around most of the above issues by parsing the input string more than once.

First, they extract the relevant information (sign, digits…) and create a new string formatted correctly for reading using the database parameters. Then they convert that second string into a numeric, without caring for the format anymore. While this works fine and makes some issues a bit easier, this didn’t hold up performance-wise in our testing and going directly from the input string to the actual result was the best solution for us.

All in all, porting to_number to Yellowbrick ended up being way more complicated than I expected it to be. Not only was it a good reminder not to underestimate how much work a relatively small piece of code can be, but it also really showed the importance of good documentation and testing processes.

Get the latest Yellowbrick News & Insights
Why Private Data Cloud?
This blog post sheds light on user experiences with Redshift,...
Data Brew: Redshift Realities & Yellowbrick Capabilities –...
This blog post sheds light on user experiences with Redshift,...
DBAs Face Up To Kubernetes
DBAs face new challenges with Kubernetes, adapting roles in database...
Book a Demo

Learn More About the Only Modern Data Warehouse for Hybrid Cloud

Faster
Run analytics 10 to 100x FASTER to achieve analytic insights that have never been possible.

Simpler to Manage
Configure, load and query billions of rows in minutes.

Economical
Shrink your data warehouse footprint by as much as 97% and save millions in operational and management costs.

Accessible Anywhere
Achieve high speed analytics in your data center or in any cloud.