Detecting Performance Regressions for Pinterest Web

Posted on Saturday, November 25th, 2017

I’ve been at Pinterest since around December 2014 and have been loving it on the web platform team. A few months ago (June-ish?) we had our first reorg ever and I found myself, for the first time, on a new team. Our team had had some minor churn but nothing quite as large as this. Focusing on performance has been a pretty interesting change of pace.

The first order of business was to solve one problem: how do we make sure our site doesn’t get slower before developers release code into production? It turns out the standard operating procedure had been 1) notice that production numbers had regressed, judging from extremely variance-riddled statistics, 2) try to identify the commit that caused it (because of the variance in the metrics, there could be anywhere from 1 to 6 suspicious deploys, each full of 10-20 commits), 3) do a tedious bisect or just plain guess which commit it could have been, 4) try to find an owner and convince them to fix or revert the change.

The New World: Detect Regressions Per Build Candidate

So instead of detecting regressions 1-14 days after they happen, why not try to detect them as soon as the code goes in? The system goes like this:

Developer commits code
Build candidate is produced
We tell a fleet of test prod-like servers to deploy this build candidate
We tell a CI system to hit this fleet with multiple suites of performance tests and spit the numbers out to a graph
We set up alerts so that if the graph ever goes above a certain threshold, our team is notified.

We deploy roughly twice a day, so unless a developer puts in a bad change right before a deploy, we can detect (and have detected) regressions well before the code ever goes out.

The Performance Regression Framework Stack: BuildKite, GhostJS, AWS, Teletraan, StatsD

I really need to come up with a better name than "The Performance Regression Framework." Anyway, it uses AWS EC2 instances to serve the production build candidate of the site, and uses BuildKite to spawn GhostJS-controlled, Linux-based Firefox browser processes to hit the site with. This combination of a high-end device and a high-end network connection on the test-runner side isn’t a good approximation of our p90 users, but I found that as long as we run enough tests that the variance isn’t too bad, the tests do remain meaningful.
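To make the "run enough tests" point concrete, here is a small simulation sketch. It is not the framework's code: the log-normal timing model, its parameters, and the function names are all assumptions for illustration. It estimates a p90 from batches of different sizes and measures how much that estimate wobbles from batch to batch.

```python
import random
import statistics


def p90(samples):
    """Return the 90th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = (9 * len(ordered) + 9) // 10  # integer ceil(0.9 * n)
    return ordered[rank - 1]


def simulated_load_time(rng):
    # Hypothetical timing model (an assumption, not measured data):
    # log-normal page load times, in seconds.
    return rng.lognormvariate(0.7, 0.4)


def p90_spread(runs_per_batch, batches=300, seed=1):
    """Standard deviation of the p90 estimate across many simulated batches."""
    rng = random.Random(seed)
    estimates = [
        p90([simulated_load_time(rng) for _ in range(runs_per_batch)])
        for _ in range(batches)
    ]
    return statistics.stdev(estimates)


if __name__ == "__main__":
    for runs in (10, 50, 200):
        print(runs, "runs per batch -> p90 spread:", round(p90_spread(runs), 3))
```

Under this toy model the spread of the p90 estimate shrinks as the number of runs per batch grows, which is the property that makes a fixed alert threshold usable.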

A future feature will be to throttle the network, which we should be able to do with an interface to Chrome and its Devtools. This will help us detect regressions that cause the app to be network-bound. I’m also exploring using non-headless Puppeteer, which should give us easier access to the Chrome API.

Anyway, nothing about the framework is really coupled with the tools I chose. They’re really just separate components that work together to achieve the goal quickly with what we had and the limitations of our current app platform:

BuildKite is the CI, which gets triggered via the BuildKite API every time a build completes. It then builds a Docker container of the test framework repo, really a tiny collection of test files and dependencies like GhostJS, and then just runs the tests lots and lots of times in parallel. I like the configurability of BuildKite and the UI, though I had to submit a few tickets because BuildKite had never had a client that was running potentially hundreds of tasks in parallel >:D
GhostJS is the test runner that we chose because it was already set up to do some other integration testing with our specific site. I also found the codebase pretty easy to work with and extensible when I needed some new features, but really all it is responsible for is hitting the site and reporting back on some global variable values. I may end up using Puppeteer and some other script to do this if I start needing more features.
Teletraan is an open-source instance management UI that Pinterest released a bit ago. It has a nice API that BuildKite can hit to trigger deploys, report failures, and such. It’s a pretty good interface to our instances, and it lets us make sure that we’re locking serving environments when we run tests against them.
I’m using OpenTSDB to record the actual timings. I do maybe 100-200 test runs per suite of tests and take the p90 of that sample, then send it to a graph. I then set up alerts based on thresholds, and that’s how we get a visual of how performance is going on a per-build basis and can spot the moment it starts to climb.
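The p90-then-threshold step is easy to sketch. This is a minimal illustration, not the real pipeline: the function names, the returned dictionary, and the threshold values are hypothetical, and the actual system reports to a graph and alerts from there rather than deciding inline.

```python
def p90(samples):
    """Return the 90th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (9 * len(ordered) + 9) // 10  # integer ceil(0.9 * n)
    return ordered[rank - 1]


def check_build(timings_ms, threshold_ms):
    """Summarize one suite's runs for a build candidate.

    timings_ms: one timing per test run (e.g. 100-200 runs).
    threshold_ms: hypothetical alert threshold for this suite.
    """
    value = p90(timings_ms)
    return {"p90_ms": value, "alert": value > threshold_ms}


if __name__ == "__main__":
    # Made-up timings purely for illustration: 100 runs of 1..100 ms.
    runs = list(range(1, 101))
    print(check_build(runs, threshold_ms=85))
```

Taking a high percentile of the whole sample, rather than a mean, keeps the number focused on the slow tail while still being a single value per build that can be graphed.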

Regressions Caught!

I’m happy to say that after a few months with this framework in place, we’ve caught a number of regressions. It’s tough at first to trust the numbers, but after a while it proves itself out. I’m glossing over some of the bugs that we found, as well as the whole instrumentation framework that spits out these numbers in the first place. I’m also glossing over another tool my team built, called the investigation framework (we should really come up with a better name). When a regression does happen, it lets us automatically do a git bisect: it builds the commits inside the offending build (there could be anywhere from 10 to 20!) and runs the whole test suite against each one until it finds the commit that caused the regression.

Yeah, that tool is actually awesome. It takes the machines a few hours to narrow it down sometimes. Imagine a human doing that work!
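For a feel of why bisecting beats checking every commit, here is a sketch of the binary search that git bisect performs. It is not the investigation framework itself: `find_first_bad_commit` and the `is_regressed` callback are hypothetical names, and the sketch assumes that the commit list is ordered oldest to newest, that the last commit is known to be regressed, and that once a regression lands every later commit is also regressed.

```python
def find_first_bad_commit(commits, is_regressed):
    """Binary-search for the first regressed commit.

    commits: commit ids ordered oldest to newest; the last one is assumed bad.
    is_regressed: hypothetical callback that builds a commit, runs the
        performance suite against it, and returns True if it is regressed.
    """
    if not commits:
        raise ValueError("need at least one commit")
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_regressed(commits[mid]):
            hi = mid       # regression is at mid or earlier
        else:
            lo = mid + 1   # regression landed after mid
    return commits[lo]


if __name__ == "__main__":
    commits = [f"c{i}" for i in range(15)]
    first_bad = 9  # made-up position of the offending commit
    checked = []

    def fake_check(commit):
        checked.append(commit)
        return int(commit[1:]) >= first_bad

    print(find_first_bad_commit(commits, fake_check), "after", len(checked), "checks")
```

With 15 commits this needs at most 4 build-and-test cycles instead of up to 15, which matters when each cycle is a full deploy plus hundreds of test runs.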

Anyway, maybe I’ll write more about the instrumentation framework. I’m currently rewriting it, which I also think is interesting. Yay more things to write about.