Last week I was helping a client with some performance problems in one of their subsystems. Performance profiling is often a tricky subject where there’s no one clear preventative step, but I want to highlight a few positive qualities that it encourages in your codebase.
A wild performance problem appears!
The system in question started exhibiting performance problems. What was more interesting though, was the nature of the performance problem, with calls to save data bottlenecking for minutes at a time – all with a relatively small number of users.
If you’ve ever done any performance tuning in the past, this sounds like a classic resource contention issue – where a scarce resource locks and users are rate limited in their access to it. Conspicuously, there hadn’t been any significant code changes to the portion of the system that saved the data in question.
Reproducing Performance Problems
Like any kind of issue in software development, you can’t do anything to solve a problem unless you can see it, and until you can see it, you can’t start to identify what kind of fixes you could use. Discerning the “what” from a chorus of people frustrated with a system is pretty difficult and we both benefited and suffered from the fact that system in question is part of a small ecosystem of distributed systems.
To make matters worse, the methods that were taking a huge amount of time to execute, were methods that both interacted with a third party component, and other systems in the distributed architecture.
Perceptions of performance in distributed systems
Performance is frequently hidden behind smoke and mirrors – there’s a big difference between actual performance, and perceived performance. Actual performance is concrete and measurable, where perceived performance is an ephemeral feeling that a user has towards your software.
“Eventual consistency is a consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.” – Wikipedia
When a Facebook user posts some new content, that content is immediately added into their browsers DOM. Basically, what they’ve posted is put on their screen right in front of them. Other users of the site will see the content seconds later, when it eventually replicates across all their data-stores. This can take seconds to minutes if their system is running slowly, but the data is send and queued, rendered to the user, and the perception is that Facebook is performing very quickly.
The key take away, is that the way users feel about performance directly correlates to their experiences and what they can see, and rarely the performance of the system as a whole.
The perception of poor performance in a distributed system will always fall on the user-facing components of that system. To compound the problem, reproducing “production like” performance problems is often much more difficult, with the strain felt by various components in the system becoming very difficult to isolate and identify.
Performance is a feature of your system
Performance problems are notoriously hard to solve because “performance” is often a blanket term used to describe a wide variety of problems. To combat this, mature systems often have expected performance requirements as a feature – benchmarkable, verifiable performance characteristics that they can be measured and tested against.
I’m not a proponent of performance first design. Performance first design frequently leads to micro-optimisations that ruin the clarity and intent of a codebase, where a higher level macro-optimisation would yield far greater performance improvements. I am, however, a big fan of known, executable, performance verification tests.
Performance verification tests provide a baseline that you can test your system against – they’re often high level (perhaps a subset of your BDD or feature tests), and they’re run in parallel, frequently. These tests are important to establish a baseline for conversations about “performance increasing” or “performance decreasing” because they’ll give you tangible, real world numbers to talk about. The value of these tests varies throughout development – but they’re the easiest to add at the start of a project and evolve with it.
Performance is a feature, even if it isn’t necessarily the highest priority one.
Measurement vs. Testing
While performance tests are a great way to understand the performance of your system as you build it, real systems in a real production environment will always have a more diverse execution profile. Users use your system in ways that you’re not designing for. It’s ok. We all accept it.
Measurement on the other hand, is the instrumentation and observation of actual running code. Performance tests will help you understand what you expect of your system, but quality measurement and instrumentation will help you understand what’s going on right now.
Measuring the actual performance of your system is vital if you’re investigating performance problems, and luckily, there are great tools out there to do it. We used the excellent New Relic to verify some of our performance related suspicions.
NewRelic is not the only tool that does hardware and software measurement and instrumentation, but it’s certainly one of the best and it’s part of a maturing industry of software as a service offerings that support application, logging, and statistical reporting over servers and apps.
Code reading and profiling
Given that we had a suspicious looking hot-spot that we’d identified from IIS logs, we were also able to lean on old-fashioned code review and profiling tools.
Profiling tools are a bit of a headache. People struggle with them because they often deal with concepts that you’re not exposed to at any time other when you’re thrashing against performance issues. We’re lucky in the .NET ecosystem that we’ve got a couple of sophisticated profiling options to turn to, with both JetBrains’ dotTrace and RedGate’s ANTS Performance Profiler being excellent mature products.
We profiled and read our way through the code and doing so highlighted a few issues.
Firstly, we found some long running calls to another system. These were multi-second HTTP requests that were difficult to identify and isolate without deep code reading because there were no performance metrics or instrumented measurement around them.
Secondly, and much more significantly, was a fundamental design problem in a third party library. Due to some poor design in the internals of this library, it wasn’t able to cope with the capacity of data that we were storing in it. After some investigation, we established a work-around for this third party library problem, and prepared a fix.
How do we prevent this happening?
There are some useful takeaways from this performance journey. The first, is a set of principles should be considered whenever you’re building software.
Monitoring, instrumentation and alerting need to be first class principals in our systems.
This means that we should be recording timings for every single HTTP call we make. We should be alerting set on acceptable performance thresholds and this should all be built into our software from day one.
In order to get the visibility into the software that we need, we need great tooling.
New Relic was instrumental in helping us record the changes in performance while testing our solution. Further monitoring, instrumentation and aggregation of exceptions and stats would have made our lives much simpler – letting us identify potentially long running API calls much quicker.
There are tools on the market (StatsD, LogStash, Papertrail, Kibana, Raygun) that you can buy and implement trivially that’ll vastly increase visibility of these kinds of problems – they’re essential to reliably operate world class software in production, and they’re much cheaper to buy and outsource, than build and operate. If they save a few developer days a month, they pay for themselves.
Poor design ruins systems. In this case, the poor design was in a third party library, but it’s worth reiterating regardless. A design that can’t cope with an order of magnitude increase in load, needs to be evaluated and replaced.
Fit for purpose is very load dependant – we should consider if we can catch these potential problems while evaluating third party libraries that can’t be easily replaced – going to the effort of scripting and importing load, rather than discovering these issues when scaling to a point of failure.
Luckily, much of this will be things we already know – instrumentation is vital, and monitoring and performance metrics help us build great software – but these are some nice, practical and easy wins that can be implemented in your software today.