I came into manufacturing test automation a little over a year ago when I joined Yellowbrick. We ship new generations of hardware fairly frequently because each year brings better CPUs, SSDs, networking, and so forth. Ensuring systems are built consistently and correctly is a nontrivial problem, and the work can be resource-intensive and error-prone. We also need to ensure that the software is installed the same way and that the tests are started the same way. Missing these issues before delivery can result in unhappy customers and a lot of heartache.
We also have two different contract hardware manufacturers. I was recently tasked with porting our processes from the first contract manufacturer to the second, and I thought I'd aim to make the process the same for everyone. When I showed this process to my co-worker, he said the operators at the second contract manufacturer would need an easier-to-follow process with fewer manual steps.
This made a lot of sense to me. I was trained in the current process and had to learn how to troubleshoot when things went wrong. I hadn’t thought about making the process less hands-on.
I also hadn’t considered the technical level of expertise each contract manufacturer had. Every manual step increases the risk of error and takes up someone’s valuable time, so we try to minimize the steps involved with the process. Many of our manufacturing processes are new and take time to refine. Regular reviews of the processes are needed to keep them as streamlined as possible.
On the other hand, you don’t want to change your manufacturing process frequently. It takes time to set up a process and the standard operating procedure, and then train the operators. Changing the process too often could lead to confusion and unexpected errors. It reminds me of the saying, “If it ain’t broke, don’t fix it.”
I knew the process was mostly automated already. But it required logging into the management processors through a browser for two servers per cluster to get the external IP addresses, which come up differently after each initial bring-up. These two logins were required to start the script, and again when the servers rebooted during the first part of the script.
Since I was familiar with this process, it didn't seem that complicated to me. But the operators at the second contract facility would not know how to log into a website, get a remote console redirect, and then run commands there. I had to come up with a way to get these external IP addresses programmatically.
We used a process with our first contract manufacturer that would log into the management processor programmatically. I dug up the code, saw how it was done, and figured this was the best route to solve the problem: get into the management processor using a Serial Over LAN (SOL) connection and grab what I needed without ever leaving the terminal.
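Just as a sketch (the address and credentials below are placeholders, not our real ones), the idea, using a tool like ipmitool, looks something like this:

# Open a Serial Over LAN session to the node's console (placeholder BMC address and credentials).
ipmitool -I lanplus -H 10.0.0.50 -U admin -P password sol activate
# Once logged in on the serial console, read the external address the node came up with.
ip -4 addr show

Our script drives this automatically and parses out the address rather than having someone type at a console, but the flow is the same.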
There was just one problem. When I tried using the process to get into a system at the second contract manufacturer, it didn’t let me in. The SOL connection wasn’t working.
I reached out to a senior developer to get some guidance on how to allow the SOL connection. While troubleshooting this, I noticed that our Ubuntu USB drive allowed SOL connections but when we booted into CentOS, we couldn’t get the SOL connection up.
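If you want to check the same thing yourself, one quick test (assuming the BMC redirects the console to ttyS1, as it does on our hardware) is to see whether a login getty is actually running on that serial port:

# Check whether anything is listening on the serial port the BMC uses for SOL.
systemctl status serial-getty@ttyS1.service
# Check whether the kernel was booted with a serial console at all.
cat /proc/cmdline

In our case the getty service wasn't set up in the CentOS image, which is what the fix below addresses.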
To fix the SOL connection in CentOS, I had to modify the clone images we use on the clusters. I copied over:
/usr/lib/systemd/system/serial-getty@.service to /etc/systemd/system/serial-getty@ttyS1.service.
This gives ttyS1 its own copy of the default serial login service; then I needed to edit the ExecStart line to read:
ExecStart=-/sbin/agetty 115200 %I $TERM
which removes the --keep-baud argument and fixes the baud rate at 115200.
After that, I linked the service into:
/etc/systemd/system/getty.target.wants/.
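Put together, the change to the clone image is roughly the following sketch (ttyS1 and 115200 baud are what our management processors expect; adjust as needed):

# Copy the stock serial getty unit so ttyS1 gets its own editable copy.
cp /usr/lib/systemd/system/serial-getty@.service /etc/systemd/system/serial-getty@ttyS1.service
# Drop --keep-baud and pin the rate at 115200 on the ExecStart line.
sed -i 's|^ExecStart=.*|ExecStart=-/sbin/agetty 115200 %I $TERM|' /etc/systemd/system/serial-getty@ttyS1.service
# Enable it by linking it into getty.target.wants.
mkdir -p /etc/systemd/system/getty.target.wants
ln -sf /etc/systemd/system/serial-getty@ttyS1.service /etc/systemd/system/getty.target.wants/serial-getty@ttyS1.service

In practice I edited the file by hand; the sed line just shows the end result.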
Once this was done, I rebooted the systems to let the service start up. This gave me a SOL connection into the management processor, which let me programmatically grab the IP addresses my script needed. All operator interaction was removed from the script, so it can now be called with just one command. This makes the process easier for me and everyone involved. Success.
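Purely for illustration (the script name and arguments here are made up, not our real ones), the operator-facing call is now on the order of:

# Hypothetical example: one command kicks off the whole bring-up, install, and test sequence.
./cluster_bringup.sh station-3 demo-cluster-01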
All this insight came from a process review by a colleague who had been outside the process. I'm very grateful for the help and glad to have co-workers who can help me see past my blind spots. We can get so caught up in putting out fires that we forget to think outside the box a little. We can free up our time by further automating things.
As far as plans for the future, it may be best for us to create a desktop application with input boxes for parameters such as which station the cluster is at and what we want to name it, which would then kick off all the installations and tests that we've got in manufacturing.
If we were to go this route, I think the button to start it should say “full send” because it would start a process that I pray works well and doesn’t break too frequently. The future is wide open and perhaps with operator feedback, we will decide to take it in a different direction.
Currently, the process involves two operator scripts that amount to just two commands, and the question is whether condensing them into something with a GUI instead of a command-line interface is worth it. I guess it comes down to how many engineering resources we want to commit to making this process simpler and more seamless. I've found that estimating how long something will take in engineering hours is a really hard task because I'm usually too optimistic. Then again, if I weren't optimistic, how could I take the constant error messages and setbacks that so often come up in software?
It’s been a wild ride with the engineering work I’ve done for Yellowbrick. But the work is satisfying because the issues we’re solving seem to be pressing and that motivates me to work hard for solutions.