Tardate 2016: On "random" CI failures

Saturday, June 25, 2016

I closed a bug yesterday that's been kicking around for almost a year as a "sometimes fails on CI but no-one can figure out why" frustration.

Sooner or later you'll hear someone suspect it must be a problem with CI. Which is ironically funny in a shoot-the-messenger kind of way!

Thankfully our "CI issue" turned out to be a for-real bug. In short, the code involved many classes with near-100% test coverage. It had been read and re-read, and everyone would swear there's no way it could fail.

No, of course we were wrong. The bug was basically a conspiracy of two bits of code in two very different places:

a record validation that ensured field1 was not the same as field2
a data collection routine that could be configured to filter/replace sensitive values with a random-length string of asterisks: "*" * rand(4..10)

And you can see where this leads: our problem was filtered data ending up by obscure and circuitous means in field1 and field2 ... with a 1 in 7 chance of the record validation failing (never happens on our machines of course). After that it was an easy fix.
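
Where does 1 in 7 come from? Here is a minimal sketch, assuming both fields get masked independently with the same "*" * rand(4..10) expression. The method name collision_probability is made up for illustration; it is not from our codebase.

```ruby
# Length range taken from the mask expression: "*" * rand(4..10)
MASK_LENGTHS = (4..10).to_a

# Exact probability that two independently generated masks are identical.
# Assumption: each length in the range is equally likely for both fields.
def collision_probability(lengths)
  pairs = lengths.product(lengths)
  # Masks made only of "*" are equal exactly when their lengths match.
  collisions = pairs.count { |a, b| "*" * a == "*" * b }
  Rational(collisions, pairs.size)
end

puts collision_probability(MASK_LENGTHS) # prints 1/7
```

Seven possible lengths give 49 equally likely pairs, and 7 of those pairs match.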

So once again we learn the lesson:

If CI says red but we can't figure out why, "must be a problem with CI" is 99.999% the wrong answer. It just means we haven't found the bug yet.

I've seen this scenario play out a dozen times in as many years, and CI was always right ;-) Since it keeps cropping up, it made me think about how best to knock these on the head. Five things:

start by assuming there is a bug until proven otherwise

It's too easy to give up, find scapegoats or "magic" explanations otherwise.
Take heart in the fact that if you assume CI is right, the odds are on your side.

put a canary in a coalmine

When we first encountered this issue and failed to find the root cause, we added code to catch the "this is about to fail in that unexpected way" situation and log/report appropriately.

So while the ticket got iced, that "canary" kept dying and kept the issue alive! When it died again yesterday, it was a painful reminder to get to the bottom of the issue once and for all.
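
A canary can be very small. This is only a sketch of the idea, not our actual code: the Record struct, canary_message, report_canary and the context hash are all hypothetical names. It flags the "about to fail" condition just before validation and logs whatever context is at hand.

```ruby
require "logger"

# Hypothetical record shape; only field1 and field2 come from the story above.
Record = Struct.new(:field1, :field2)

# Returns a message when the record is about to trip the
# "field1 must differ from field2" validation, otherwise nil.
def canary_message(record, context: {})
  return nil unless record.field1 == record.field2

  "canary: field1 == field2 (#{record.field1.inspect}), context: #{context.inspect}"
end

# Logs the canary message, if any, and returns it so callers can also report it.
def report_canary(record, context: {}, logger: Logger.new($stderr))
  message = canary_message(record, context: context)
  logger.warn(message) if message
  message
end
```

Logging the offending value is the important part: a log line showing field1 and field2 both set to a run of asterisks points straight at the filtering code.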

finding bugs .. is like looking for your keys

They're always found in the last place you look. So when you've homed in on the code you think is failing, studied it upside down and left to right, and still can't find the issue .. maybe it's time to consider that you might be right, and the bug isn't there. Throw out that hypothesis, pull back and fan out instead.

treat random errors like a lottery

If errors happen infrequently, reproducing them is like trying to win the lottery. The more entries, the better your chances.

So don't run tests a few times, run them millions of times if you have to. Computers are good at this. That's how I diagnosed this latest issue while tweaking logging and the test itself. Bash away:

# Loop forever, stopping at the first failing run (any non-zero exit status).
while true; do
  if ! rspec spec/that_wierd_spec.rb; then
    echo "JACKPOT!"
    break
  fi
done
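
How many lottery tickets do you need? Here is a rough back-of-the-envelope sketch, assuming every run fails independently with the same probability. The method names are mine, and real test runs may not be that well behaved.

```ruby
# Probability of seeing at least one failure in `runs` independent runs,
# given a per-run failure probability `per_run`.
def chance_of_failure(per_run, runs)
  1 - (1 - per_run)**runs
end

# Smallest number of runs that gives at least `confidence` chance
# of hitting one failure.
def runs_needed(per_run, confidence)
  (Math.log(1 - confidence) / Math.log(1 - per_run)).ceil
end

puts runs_needed(1.0 / 7, 0.99) # prints 30
```

With a 1 in 7 failure rate, 30 runs are enough to be 99% sure of a hit. The rarer the failure, the more runs it takes, which is exactly why the loop should be left to the computer.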

random failures ... might really be random

This sounds so simple that it's easy to overlook.

If things fail randomly .. it only takes a few moments to search the code to see if anything is using something similar to a random function.

Could it be possible that random failures and the use of rand() might be related?!
Maybe not, but if they are, that's a cheap win!
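
The search itself can be a few lines. Here is a sketch, assuming a Ruby codebase laid out under app/ and lib/. The pattern is my own guess at common randomness sources, and random_calls is a made-up helper name.

```ruby
# Assumed list of common Ruby randomness sources; extend it for your codebase.
RANDOM_PATTERN = /\b(?:rand|srand|Random|SecureRandom|shuffle|sample)\b/

# Returns [line_number, stripped_line] pairs for every line of `source`
# that mentions one of the randomness sources above.
def random_calls(source)
  source.each_line.with_index(1).filter_map do |line, number|
    [number, line.strip] if line.match?(RANDOM_PATTERN)
  end
end

# When run as a script, scan the assumed app/ and lib/ directories.
if __FILE__ == $PROGRAM_NAME
  Dir.glob("{app,lib}/**/*.rb").each do |path|
    random_calls(File.read(path)).each do |number, line|
      puts "#{path}:#{number}: #{line}"
    end
  end
end
```

It will also flag comments and strings that mention these names, which is fine for a quick first pass.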
