Application Oriented Memory Allocation for Performance

Jim Peterson

April 17, 2023

5 Min Read

Application Oriented Memory Allocation for Performance

The Yellowbrick Data Warehouse distributed database implementation was designed for high performance. Computation and data required to complete a SQL query is partitioned across a set of workers. Each worker runs a C++ client which in turn runs compiled query execution code. One of the problems faced in this system was memory allocation on the worker C++ clients.

Overall Requirements and Application Interface

An allocator was needed with the following characteristics:

Little memory overhead required for memory management.
Operate seamlessly with userland IO drivers, for NVMe and InfiniBand, providing physical addresses to mlock’ed memory.
High performance with the client application.
NUMA aware allocation.
Facilitate concurrent query execution, which includes:
1. Guaranteed recovery of all memory used by a canceled query.
2. Enable strong limits on query memory usage.

After experimenting with various available allocators, we elected to build a custom allocator with some special properties. We then made use of these properties when writing the application. The result is that the allocator and the application work closely together to get the most work out of the available hardware.

The current version has the following memory allocation interface for the core of the application:

The maximum memory allocation size is 2 MiB. The application adapts when it has larger memory needs by using data structures or algorithms which accommodate this limit. This forced the application writers to think carefully about how memory would be used and accessed, which generally resulted in better algorithm choices.
Memory is allocated in a small number of specific sizes – about a dozen. This again forces the application author to carefully consider memory usage.
All available allocations of 4 KiB or greater are in multiples of 4KiB, aligned to 4KiB boundaries, and locked to unchangeable physical memory addresses. This removes some of the burden of working with the storage subsystem, which requires these memory characteristics for any read or write from NVMe storage.
Allocated memory is delivered on the NUMA node on which the (pinned) application thread is running. In special circumstances, memory can be requested from a specific NUMA node.
There is no in-band metadata stored when memory is delivered to the application. Some allocators will use the first few bytes of an allocated region to track information related to the allocation. This can lead to fragmentation, as these bytes must be accounted for when attempting to use aligned memory for IO. If the application asks for a 4 KiB region, exactly that much is allocated, and it’s all delivered to the application.
The CPU cost of each allocation is extremely low. For most allocations, the time required is on the order of a few function calls, a few uncached accesses to memory, and a branch or two.
Each query executes using its own arena provided bythe memory allocator, to facilitate lossless return of all memory allocated to that query. The arena is entirely managed by the allocator subsystem, simplifying the application.
Access to a cross-query arena and the OS malloc allocator are also possible from within the application but are restricted to certain parts of the C++ client, and never used directly from the compiled query code.

Interface with the Operating System

The memory allocator is initialized during initial setup of the C++ client. All memory to be used by the allocator is mmap’ed in one contiguous virtual address region. The mmap request and subsequent analysis guarantee that the allocator only uses memory that is in either 2 MiB or 1GiB HugePage blocks. The virtually contiguous HugePages need not be physically contiguous. The use of HugePages decreases the time required for the hardware to perform virtual-to-physical translation.

These initial pages are mlock’ed, forcing them to remain in memory at their initial physical addresses.

The memory allocator works entirely within this contiguous virtual address space. It leverages the contiguity to enable addressing with fewer bits, which saves space in the memory metadata storage. The application can also use fewer bits if desired, but generally does not.

Overall Allocator Design

Some of the key points of the allocator design depend on the thread locality of the underlying executor. All threads in the C++ client are pinned to particular physical cores, which are located on particular NUMA nodes.

The overall allocator design is based on the following:

The lowest level allocates memory to queries in 2 MiB pages by NUMA node. Each of these pages has a metadata record that tracks the page owner (a query), the physical address, the NUMA node of the page, and other data. The rest of the memory allocator operates in the context of a particular query.
Next, the 2 MiB pages are divided using a slab allocation model. This provides the larger memory blocks. The slab allocation metadata is kept with the other page metadata.
Some of the slab-allocated sub-pages are then further divided using a sequential allocation mechanism. When a slab-allocated sub-page has been exhausted via sequential allocations, a new sub-page is acquired.
Freelists are kept for every query, for every thread, and for every possible memory size. These flists allow the lock-free reuse of recently released memory blocks. The lists have a maximum size-dependent length, to avoid tying up memory in released allocations.
A second set of freelists are kept per NUMA node, per size. When the per-thread lists fill, the free memory is moved to these per NUMA-node lists, so that they can be shared across threads on the same NUMA node.
If a slab of allocations is completely released by the application, the slab is reassembled into a 2 MiB page, and the page is released from the query to the overall free pool.
A statistics mechanism was included in the allocator. When it’s turned on (a compile-time option), the performance of the allocator can be examined. We’ve used this facility to tune the allocation algorithm parameters and the application design several times since the allocator was first put in service.

Allocator Design Evaluation

This allocator was originally constructed over approximately five days in the early startup phase of Yellowbrick, with two major upgrades since then. It’s generally served us well throughout this time. Some of the good points we’ve found with this design:

The limited number of different allocation sizes has not been a problem for the application. This is partly due to the tuning exercises above, as we tuned the application and allocator together.
There are about a dozen places in the application where we used local sub-allocators (mostly custom slab allocators). This was done to achieve certain benefits: minimal total allocation time, allocated block adjacency, and minimal fragmentation due to a shared lifetime for all sub-allocations.
Average allocation time is very short, and is often lock-free, as determining the allocation source and the allocation methods are simple processes and entirely based on the requested size of the allocation.
Memory overhead due to fragmentation and metadata overhead is extremely low, a small fraction of a byte per allocation.

The major downside of this design is that the deallocation process can be time-consuming and cumbersome. Some of the reasons for this:

The deallocation method for a returned memory block depends on the size of the block. We considered requiring the application to return this size with the block, but it was deemed too much of a burden. This means that the allocator must deduce the size of the returned block, which requires a look-up of the metadata for the block. This can take significant CPU time, as it may require 2-3 un-cached memory accesses to locate the size.
The per-NUMA node-free lists must be locked to operate correctly, as they are used by multiple threads.
The components of a slab must be removed from the free lists when the slab is reassembled for deallocation of the page. This requires cycles inside the lock protecting the free lists, introducing some contention across the threads.

Things which could be done to fix these issues:

Replace the sequential allocation algorithms with slab allocation. This would result in faster allocation, less metadata, and faster determination of the size of a deallocated memory region.
Replace the second level of free lists (by NUMA by size) with a simpler mechanism that leverages the slab allocation mechanism to also manage the free blocks at this level.
The above two changes would allow some operations which occur inside of locks to be done as processor atomic operations instead, which would greatly decrease the contention between the threads.

Conclusion

We believe that the simplicity, speed, and capabilities of this allocator have contributed to the overall speed and reliability of the Yellowbrick Data Warehouse database engine.

Go ahead and build an application-oriented C++ allocator, for your application – it’s worth it!

Platform

workloads

Resource Center

Customer Stories

About us

Newsroom

careers

partner

CONTACT US

Application Oriented Memory Allocation for Performance

More like this

The Case of the Exploding COALESCE Statements

Low-level Debugger (LLDB) Extension for Structure Visualization

Methods and Techniques for Fast and Efficient Logic Simulation

Stories From the Field: The Devil Is In the to_number Details

Yellowbrick Squares Up to the Data Warehousing Heavyweights

Simplifying Manufacturing Test Automation

Customers

Learn Why Customers Love Us!

Sign up for our newsletter and stay up to date

Application Oriented Memory Allocation for Performance

Product

Customers

Pricing

Resources

BLOG

Resources

competitive

Platform

workloads

Resource Center

Customer Stories

About us

Newsroom

careers

partner