Motivation
The logic team at Yellowbrick is responsible for the design and validation of silicon accelerators that are part of the on-premises Yellowbrick Data Warehouse offering. We are a small team of three engineers – two of us responsible for design and one responsible for simulation validation.
A fast, flexible, and efficient logic simulation infrastructure is an extremely important component of chip development – it enables you to thoroughly test and validate the design up front, which in turn makes hardware bring-up relatively painless. Logic simulation is the preferred environment for finding and fixing bugs because it is a controlled environment with 100% visibility into all of the chip's state. The turnaround time from finding a bug to debugging and fixing it is also orders of magnitude shorter in simulation.
Modern simulation infrastructure development has adopted a constrained random test methodology. A well-constructed and flexible constrained random simulation environment should allow for the creation of directed tests as needed. In addition to being efficient, the environment must be fast so that engineers can run interactive simulations to test new code or to debug and fix a failure.
Constraints
We needed to work within the constraints of a start-up while embarking on the development of simulation infrastructure. A product needed to be delivered under very aggressive schedules, with a small team, and with minimal spend on tool licensing. We didn’t have the luxury of buying a large number of logic simulation licenses or other auxiliary tools.
So, it was imperative for us to find alternate solutions to achieve our goals – developing the tools, techniques, and methods to maximize the use of these expensive resources and meet our deliverables. The word “maximize” only half conveys what we needed to do. We also needed to make the development and debugging process efficient so that we could rapidly debug issues found in simulation or during hardware tests.
Development Cycle
It is useful to outline the development cycle that the logic group goes through before highlighting the work that’s been done to build a fast, flexible, and efficient simulation environment.
We start with a requirements specification for the silicon accelerator and then work with the software team to come up with a detailed architecture specification. This specification includes all the information that the software and simulation engineers need to drive the accelerator and parse the output or results produced by the accelerator.
The next phase is implementing and validating the design in simulation. Once the simulation test-plan has been executed and the coverage reports have been reviewed, we release the design for validation on real hardware. If issues are found there, they are fed back into logic simulation: re-creating the failure there lets us debug it and also helps close the coverage gap in simulation. Once a fix is tested, we go back and test the hardware again.
The goal of a sound simulation methodology and environment is to ensure that the vast majority of bugs are found and fixed in simulation before going into hardware test. This goes a long way in ensuring that we can get to a product release as quickly as possible.
Simulation
This section highlights some key things that we did toward the goal of making logic simulation fast, efficient, and flexible.
Resource Management
Simulation licenses – and the servers that we run logic simulations on – are a shared resource, so we needed a resource management and job scheduling system to allow us to manage these resources.
Rather than spend money on another commercial tool, we used SLURM – an open-source cluster management and job scheduling system. This required some effort to make it work for us, but we were able to configure it exactly as we needed, setting up different policies for interactive simulation jobs, simulation regressions, and implementation and lint jobs, both during and outside of regular work hours.
SLURM manages the scheduling of jobs based on priority, machine and memory resources, and license availability. Jobs from various users are queued and scheduled when the required resources become available. With SLURM, we were able to use our servers and licenses very efficiently.
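To make this concrete, here is a minimal sketch of how a simulation job might be submitted under such a setup. The partition name (sim), the license name (vcs), and the run_test.sh wrapper are hypothetical, site-specific placeholders; sbatch's --licenses option is SLURM's standard mechanism for making a job wait until a license token is free.

```python
import subprocess

def submit_sim_job(test_name: str, seed: int) -> str:
    """Submit one simulation job to SLURM, counting a simulator license.

    The partition 'sim', license 'vcs', and 'run_test.sh' are site-specific
    placeholders; --licenses makes SLURM hold the job until a license token
    is free, so jobs never fail on license checkout.
    """
    cmd = [
        "sbatch",
        "--partition=sim",          # queue reserved for simulation jobs
        "--cpus-per-task=4",
        "--mem=16G",
        "--licenses=vcs:1",         # consume one simulator license token
        f"--job-name={test_name}_{seed}",
        "--wrap", f"run_test.sh {test_name} --seed {seed}",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return out.stdout.strip()       # e.g. "Submitted batch job 12345"

if __name__ == "__main__":
    print(submit_sim_job("pcie_smoke", seed=1))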
Flexible Constrained Random Stimulus Generation
Modern chips have become very large and complex, and the industry has moved to a constrained random test methodology to make it possible to test and validate these chips in a reasonable amount of time while hitting the desired coverage metrics.
It is impossible to complete development in a reasonable amount of time with a purely directed test strategy. This means that the simulation environment is not composed of a bunch of directed tests – rather, it is an environment that generates random stimuli within a set of defined constraints, driven by a few base tests.
We run these tests with different seeds and then measure code and functional coverage to see what gaps exist. If the simulation environment is flexible enough, we will be able to create directed tests for coverage gaps by adjusting the constraints and re-running the tests to hit the functional cover points that we are interested in.
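As a language-agnostic illustration (our actual environment is built in an HDL test-bench), the Python sketch below shows the core idea: a seeded generator draws transactions inside a constraint window, and narrowing that window turns a random test into a directed one. All field names and ranges here are invented.

```python
import random
from dataclasses import dataclass

@dataclass
class Constraints:
    # Invented example fields: the window a transaction is drawn from.
    min_len: int = 1
    max_len: int = 4096
    kinds: tuple = ("read", "write", "read_modify_write")

def gen_transaction(rng: random.Random, c: Constraints) -> dict:
    """Draw one random transaction inside the constraint window."""
    return {"kind": rng.choice(c.kinds),
            "length": rng.randint(c.min_len, c.max_len)}

# Base test: broad constraints, reproducible from the seed alone.
rng = random.Random(42)
stimulus = [gen_transaction(rng, Constraints()) for _ in range(100)]

# Directed variant: narrow the constraints to hit a coverage gap,
# e.g. only maximum-length writes.
directed = Constraints(min_len=4096, max_len=4096, kinds=("write",))
stimulus += [gen_transaction(rng, directed) for _ in range(10)]
```

The same base test, run with a different seed or a narrowed constraint set, becomes a new test – which is what makes closing coverage gaps cheap.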
We have followed this methodology at Yellowbrick with very good results. We have also been able to take issues that were discovered in hardware test and re-create those failures in simulation.
Short, Efficient Tests to Allow Rapid Code Development and Check-in
As a design is being developed, we need some quick tests that allow us to check in functional code. It is incredibly important to keep the code repository in a functional state so that everyone on the team can make rapid progress on their deliverables.
The test suite needs to be quick (less than 10 minutes to run) so that we can check in new code rapidly, and it needs good enough coverage to verify that a check-in does not break the base functionality of the chip and disrupt other engineers working on the project. This test suite is constantly updated as new functionality comes online so that it exercises more of what we’ve developed.
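A minimal sketch of such a check-in gate is shown below; the test names, the time budget, and the run_test.sh runner are illustrative placeholders, not our actual suite.

```python
import subprocess
import sys
import time

# Hypothetical smoke-test list, kept short enough to finish in ~10 minutes;
# run_test.sh is the same site-specific test runner assumed earlier.
SMOKE_TESTS = ["pcie_bringup", "ddr4_rw_basic", "hbm_rw_basic", "cmd_smoke"]
BUDGET_SEC = 10 * 60

def main() -> int:
    start = time.monotonic()
    for test in SMOKE_TESTS:
        result = subprocess.run(["run_test.sh", test, "--seed", "1"])
        if result.returncode != 0:
            print(f"check-in blocked: {test} failed")
            return 1
        if time.monotonic() - start > BUDGET_SEC:
            print("smoke suite exceeded its time budget; trim the list")
            return 1
    print("all smoke tests passed; OK to check in")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```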
Use Incremental Compile to Enable Faster Debug
Nothing is worse than waiting a long time for design and test-bench compilation while you are trying to debug a simulation. We have a large codebase of IP, design, and simulation code, and it would be prohibitive to compile the full database every time we run a test. The Synopsys VCS logic simulator provides the ability to compile only the parts of the design or test-bench that were modified. This saves us a ton of time in our daily work when we are analyzing simulations.
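The sketch below illustrates the general idea behind incremental compilation, using content hashes to decide what to recompile. This is not how VCS is driven (VCS tracks dependencies itself); it is only meant to show why recompiling just the modified units saves so much time. The rtl directory and analyze_unit.sh step are invented stand-ins.

```python
import hashlib
import json
import pathlib
import subprocess

# A toy illustration of incremental compilation: hash every source file and
# recompile only the ones whose contents changed. (VCS does its own
# dependency tracking internally; this sketch just shows the concept.)
STAMP = pathlib.Path(".compile_hashes.json")

def main() -> None:
    sources = sorted(pathlib.Path("rtl").glob("**/*.sv"))
    old = json.loads(STAMP.read_text()) if STAMP.exists() else {}
    new = {str(s): hashlib.sha256(s.read_bytes()).hexdigest() for s in sources}
    dirty = [s for s in sources if old.get(str(s)) != new[str(s)]]
    for src in dirty:
        # 'analyze_unit.sh' is a stand-in for the per-unit compile step.
        subprocess.run(["analyze_unit.sh", str(src)], check=True)
    STAMP.write_text(json.dumps(new))   # record hashes only after success

if __name__ == "__main__":
    main()
```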
Fast Behavioral Simulation Models for Mixed-signal IP
The silicon accelerator designs use quite a few mixed-signal IP components – such as PCI Express controllers and DDR4 and High Bandwidth Memory (HBM) memory controllers. Simulations with mixed-signal IP are typically much slower than pure digital logic simulations and thus set the limit on how fast simulations can run. We implemented a couple of solutions to get around this.
For PCI Express, we used the fast simulation modes in both the test-bench and the design component to significantly reduce the time spent in each link-training state. A full PCIe link training simulation would normally take ~100 milliseconds or more; with the fast simulation mode, we reduced this to approximately 60 microseconds. For the memory controllers, simple behavioral models were created specifically for simulation. These models were written with the same interfaces so we could use them interchangeably with the vendor IP.
Additionally, the behavioral models allowed us to enable modes of operation that would not have been possible with the vendor model. For example, in the HBM model, we wanted to test the behavior of one of our data-path accelerators when the HBM model returned data for various requests in an interleaved fashion.
This would have been more difficult to do with the vendor’s simulation model but was very easy to configure in the behavioral model. The memory latencies in the behavioral model could also be changed, if needed, for performance modeling.
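A minimal sketch of this kind of behavioral model is shown below. It is a toy: real HBM timing is far more complex, and all names, fields, and latency ranges are invented. The point is that interleaved response ordering and latency become simple knobs.

```python
import random
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    addr: int
    beats: int            # number of data beats this request returns

class BehavioralHBM:
    """Toy stand-in for a behavioral HBM model (all names invented).

    Responses for open requests are interleaved at random, and each beat's
    latency comes from a configurable range; these are the two knobs the
    vendor model did not easily expose.
    """
    def __init__(self, rng: random.Random, min_lat: int = 10, max_lat: int = 40):
        self.rng, self.min_lat, self.max_lat = rng, min_lat, max_lat
        self.inflight: list[deque] = []

    def issue(self, req: Request) -> None:
        # Each beat becomes one pending response word tagged with its req_id.
        self.inflight.append(deque((req.req_id, b) for b in range(req.beats)))

    def drain_interleaved(self):
        """Yield (latency, (req_id, beat)), interleaving across open requests."""
        while self.inflight:
            q = self.rng.choice(self.inflight)
            yield self.rng.randint(self.min_lat, self.max_lat), q.popleft()
            if not q:
                self.inflight.remove(q)

rng = random.Random(7)
hbm = BehavioralHBM(rng)
hbm.issue(Request(req_id=0, addr=0x1000, beats=4))
hbm.issue(Request(req_id=1, addr=0x2000, beats=4))
for latency, (req_id, beat) in hbm.drain_interleaved():
    print(f"after {latency} cycles: req {req_id} beat {beat}")
```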
The behavioral models for DDR4 and HBM provided a significant speed-up in simulations, offering between a 4x and 8x gain in simulation speed. This allowed us to get many more simulations through in our nightly regressions, and we ran the vast majority of our simulations with these behavioral models.
We do, however, always run one simulation with the vendor models prior to final sign-off for a release. At one point, we were working on two different projects in parallel, sharing the same pool of simulation servers and licenses; this sharing of resources was possible primarily because of the significant simulation speed-up achieved by these behavioral models.
Hardware Assertions and Consistency Checkers
Our silicon accelerator designs are quite complex, and when there is a simulation failure, it can be very tedious and time-consuming to track down the bug. Many different sub-blocks on the chip are involved in processing a command, and locating the source of the bug can be a challenge.
To address this and to aid debugging in both simulation and hardware tests, we have added assertions and consistency checkers in our design. These have proven to be immensely useful in diagnosing logic, software, and simulation infrastructure bugs.
Once one of these assertions fires, it provides a very important clue about the nature of the bug and which part of the chip or software to start looking at. A simple example of one of these assertions is something like this: check that the number of rows indicated in a chunklet header in the bitstream matches the number of rows produced by the data section of the chunklet.
If this assertion fires, we have a very focused area to look at when debugging. If we didn’t have such an assertion, the test might continue to run and cause secondary failures a few microseconds further into the simulation, which would take much longer to debug. These assertions are synthesizable, meaning they also exist in the hardware, so they provide value during hardware tests and eventually when the hardware is deployed.
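As an illustration, here is that chunklet consistency check expressed as a Python sketch. In the actual design it is a synthesizable RTL assertion; the field names below are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Chunklet:
    header_row_count: int     # rows the chunklet header claims
    data_rows: list           # rows actually decoded from the data section

def check_chunklet(c: Chunklet) -> None:
    """Consistency check: the header row count must match the decoded rows.

    In the design this is a synthesizable assertion that fires in both
    simulation and hardware; the field names here are illustrative.
    """
    assert c.header_row_count == len(c.data_rows), (
        f"chunklet header claims {c.header_row_count} rows, "
        f"but the data section produced {len(c.data_rows)}"
    )

check_chunklet(Chunklet(header_row_count=3, data_rows=[b"r0", b"r1", b"r2"]))
```

The value is in failing at the first inconsistent point, rather than microseconds later in some downstream block.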
We have also written bus monitors to identify data corruption from/to DDR4 memory. Again, this results in more localized debug, rather than having to spend time tracing a whole command through a series of blocks in an attempt to identify the problem.
Sharing Code Between Software and Simulation Infrastructure
Since some of the algorithms we were accelerating in silicon had already been implemented in software, we could (and did) re-use a lot of that reference code in building out the simulation infrastructure.
This was immensely useful since there was now a single reference, shared by software and simulation, against which to check the behavior of the logic. When we saw issues in hardware test, we did not have to waste any time figuring out whether the software and simulation references differed.
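The pattern is simple: one implementation of the algorithm serves as the oracle for both worlds. In the sketch below, zlib stands in purely for illustration; the real oracle is the pre-existing software implementation of the accelerated algorithm, and all function names are invented.

```python
import zlib

def reference_compress(rows: list[bytes]) -> bytes:
    """Single reference implementation shared by software and simulation.

    zlib is a stand-in here; the real code is the pre-existing software
    implementation of the algorithm the accelerator implements.
    """
    return zlib.compress(b"".join(rows))

def scoreboard_check(dut_output: bytes, rows: list[bytes]) -> None:
    """Simulation scoreboard: compare DUT output against the shared reference."""
    expected = reference_compress(rows)
    assert dut_output == expected, "DUT output diverges from the reference model"

# The software tooling calls reference_compress() directly; the simulation
# scoreboard calls the same function, so there is exactly one oracle.
scoreboard_check(zlib.compress(b"row0row1"), [b"row0", b"row1"])
```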
Logging
We also like to do a lot of our initial debugging by looking at various logs. To that end, over time we have put all kinds of useful information from both the design and the simulation environment in logs. Careful thought was put into the format of the log messages to allow us to easily search for things by a specific tag.
We also have logs of transactions on all the external interfaces of the chip like PCI Express, DDR4, and HBM memory that allow us to perform a fair amount of debugging before needing to go look at logic waveforms.
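A sketch of what such tagged logging might look like is below; the tags, timestamps, and message fields are illustrative, not our actual format.

```python
import logging

# Tagged log format: every message carries a fixed, grep-able tag such as
# [PCIE] or [HBM]. The exact tags and fields here are illustrative.
logging.basicConfig(format="%(simtime)12s ns [%(tag)s] %(message)s",
                    level=logging.INFO)
log = logging.getLogger("sim")

def log_txn(tag: str, simtime_ns: int, msg: str) -> None:
    """Emit one transaction log line with a searchable tag."""
    log.info(msg, extra={"tag": tag, "simtime": simtime_ns})

log_txn("PCIE", 1200, "TLP MemWr addr=0x8_0000 len=64B")
log_txn("HBM", 1240, "read req_id=7 addr=0x2000 beats=4")
# Triage can then start with a simple search, e.g.: grep '\[HBM\]' sim.log
```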
Conclusion
When you have a scarce resource, you innovate and learn to extract more efficiency out of it. That’s exactly what the logic team has done – we have designed, validated, and released three different silicon accelerators using the techniques and methods described above. High-quality logic releases for hardware tests have been delivered with only a handful of logic simulation test escapes.