Archive for October, 2014

Performance Tuning and the Importance of Metrics

Sunday, October 26th, 2014

Last week I was helping a client with some performance problems in one of their subsystems. Performance profiling is often a tricky subject – there’s no single clear preventative step – but I want to highlight a few positive qualities that it encourages in your codebase.

A wild performance problem appears!

The system in question had started exhibiting performance problems. What was more interesting, though, was the nature of the problem: calls to save data were bottlenecking for minutes at a time – all with a relatively small number of users.

If you’ve ever done any performance tuning in the past, this sounds like a classic resource contention issue – where a scarce resource locks and users are rate limited in their access to it. Conspicuously, there hadn’t been any significant code changes to the portion of the system that saved the data in question.

Reproducing Performance Problems

Like any kind of issue in software development, you can’t do anything to solve a problem unless you can see it, and until you can see it, you can’t start to identify what kind of fixes you could use. Discerning the “what” from a chorus of people frustrated with a system is pretty difficult, and we both benefited and suffered from the fact that the system in question is part of a small ecosystem of distributed systems.

We were lucky in this case – the system was a JavaScript app that posted data over HTTP to web services hosted inside itself. This meant we had access to IIS logs of the requests, and could aggregate them to identify the slow calls that users were experiencing. This was the “canary in the coal mine”, highlighting some API methods that were taking a long time to execute.

To make matters worse, the methods that were taking a huge amount of time to execute were methods that interacted with both a third-party component and other systems in the distributed architecture.

Perceptions of performance in distributed systems

Performance is frequently hidden behind smoke and mirrors – there’s a big difference between actual performance and perceived performance. Actual performance is concrete and measurable, while perceived performance is an ephemeral feeling that a user has towards your software.

There’s a really great example of this that everyone recognises. Facebook uses an eventual consistency driven data store for much of their user data.

“Eventual consistency is a consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.” – Wikipedia

When a Facebook user posts some new content, that content is immediately added into their browser’s DOM. Basically, what they’ve posted is put on the screen right in front of them. Other users of the site will see the content seconds later, when it eventually replicates across all the data-stores. This can take seconds to minutes if the system is running slowly, but the data is sent and queued, rendered to the user, and the perception is that Facebook is performing very quickly.

The key takeaway is that the way users feel about performance correlates directly with their experiences and what they can see, and rarely with the performance of the system as a whole.

The perception of poor performance in a distributed system will always fall on the user-facing components of that system. To compound the problem, reproducing “production like” performance problems is often much more difficult, with the strain felt by various components in the system becoming very difficult to isolate and identify.

Performance is a feature of your system

Performance problems are notoriously hard to solve because “performance” is often a blanket term used to describe a wide variety of problems. To combat this, mature systems often have expected performance requirements as a feature – benchmarkable, verifiable performance characteristics that they can be measured and tested against.

I’m not a proponent of performance-first design. Performance-first design frequently leads to micro-optimisations that ruin the clarity and intent of a codebase, where a higher-level macro-optimisation would yield far greater performance improvements. I am, however, a big fan of known, executable performance verification tests.

Performance verification tests provide a baseline that you can test your system against – they’re often high level (perhaps a subset of your BDD or feature tests), and they’re run frequently, in parallel with development. These tests are important for establishing a baseline for conversations about “performance increasing” or “performance decreasing”, because they give you tangible, real-world numbers to talk about. The value of these tests varies throughout development – but they’re easiest to add at the start of a project so they can evolve with it.
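A minimal sketch of the idea – a high-level test that exercises a known scenario and fails when it gets slower than an agreed threshold (the endpoint and the numbers here are illustrative):

using System;
using System.Diagnostics;
using System.Net.Http;

class SaveEndpointPerformanceTest
{
    // illustrative baseline - agree real numbers with your team and revisit them as the system grows
    static readonly TimeSpan AcceptableDuration = TimeSpan.FromMilliseconds(500);

    static void Main()
    {
        using (var client = new HttpClient())
        {
            var stopwatch = Stopwatch.StartNew();
            var response = client.PostAsync("http://localhost/api/save", new StringContent("{}")).Result;
            stopwatch.Stop();

            if (!response.IsSuccessStatusCode || stopwatch.Elapsed > AcceptableDuration)
            {
                throw new Exception("Save took " + stopwatch.ElapsedMilliseconds + "ms - slower than the agreed baseline");
            }
        }
    }
}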

Performance is a feature, even if it isn’t necessarily the highest priority one.

Measurement vs. Testing

While performance tests are a great way to understand the performance of your system as you build it, real systems in a real production environment will always have a more diverse execution profile. Users use your system in ways that you’re not designing for. It’s ok. We all accept it.

Measurement, on the other hand, is the instrumentation and observation of actual running code. Performance tests will help you understand what you expect of your system, but quality measurement and instrumentation will help you understand what’s going on right now.

Measuring the actual performance of your system is vital if you’re investigating performance problems, and luckily there are great tools out there to do it. We used the excellent New Relic to verify some of our performance-related suspicions.

New Relic is not the only tool that does hardware and software measurement and instrumentation, but it’s certainly one of the best, and it’s part of a maturing industry of software-as-a-service offerings that support application monitoring, logging, and statistical reporting across your servers and apps.

Code reading and profiling

Given that we had a suspicious looking hot-spot that we’d identified from IIS logs, we were also able to lean on old-fashioned code review and profiling tools.

Profiling tools are a bit of a headache. People struggle with them because they often deal with concepts that you’re not exposed to at any time other than when you’re thrashing against performance issues. We’re lucky in the .NET ecosystem that we’ve got a couple of sophisticated profiling options to turn to, with both JetBrains’ dotTrace and RedGate’s ANTS Performance Profiler being excellent, mature products.

We profiled and read our way through the code, and doing so highlighted a few issues.

Firstly, we found some long running calls to another system. These were multi-second HTTP requests that were difficult to identify and isolate without deep code reading because there were no performance metrics or instrumented measurement around them.

Secondly, and much more significantly, there was a fundamental design problem in a third-party library. Due to some poor design in the internals of this library, it wasn’t able to cope with the volume of data that we were storing in it. After some investigation, we established a work-around for this third-party library problem, and prepared a fix.

How do we prevent this happening?

There are some useful takeaways from this performance journey. The first is a set of principles that should be considered whenever you’re building software.

Monitoring, instrumentation and alerting need to be first-class principles in our systems.

This means that we should be recording timings for every single HTTP call we make. We should have alerting set up on acceptable performance thresholds, and this should all be built into our software from day one.
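As a sketch of what that might look like in .NET, a DelegatingHandler can time every outbound HttpClient call – in a real system the timing would go to your metrics pipeline rather than the console:

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class TimingHandler : DelegatingHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var stopwatch = Stopwatch.StartNew();
        var response = await base.SendAsync(request, cancellationToken);
        stopwatch.Stop();

        // record the timing - replace the console with StatsD, New Relic, your logs, etc.
        Console.WriteLine("{0} {1} took {2}ms", request.Method, request.RequestUri, stopwatch.ElapsedMilliseconds);
        return response;
    }
}

// usage:
// var client = new HttpClient(new TimingHandler { InnerHandler = new HttpClientHandler() });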

In order to get the visibility that we need into our software, we need great tooling.

New Relic was instrumental in helping us record the changes in performance while testing our solution. Further monitoring, instrumentation and aggregation of exceptions and stats would have made our lives much simpler – letting us identify potentially long-running API calls much more quickly.

There are tools on the market (StatsD, LogStash, Papertrail, Kibana, Raygun) that you can buy and implement trivially, and that will vastly increase the visibility of these kinds of problems – they’re essential to reliably operating world-class software in production, and they’re much cheaper to buy and outsource than to build and operate. If they save a few developer days a month, they pay for themselves.

Poor design ruins systems. In this case, the poor design was in a third-party library, but it’s worth reiterating regardless: a design that can’t cope with an order-of-magnitude increase in load needs to be evaluated and replaced.

Fitness for purpose is very load-dependent – we should consider whether we can catch these potential problems while evaluating third-party libraries that can’t easily be replaced, going to the effort of scripting and importing realistic load up front rather than discovering these issues when scaling to the point of failure.

Luckily, much of this is stuff we already know – instrumentation is vital, and monitoring and performance metrics help us build great software – but these are some nice, practical, easy wins that can be implemented in your software today.

Lessons learnt running a public API from #dddnorth

Sunday, October 19th, 2014

Yesterday I gave a talk at #DDDNorth (a free, community-led conference in the “Developer! Developer! Developer!” series) about running public-facing RESTful APIs. Many thanks for all the kind feedback I’ve had about the talk on social media – I’m thrilled that so many people enjoyed it. It was a variant on previous talks I’ve given at a couple of user groups on the topic – so here are the updated slides.

Google presentation link

Deferred execution in C# – fun with Funcs

Thursday, October 16th, 2014

I want to talk a little about deferred execution in C#.

I use deferred execution a lot in my code – frequently using it to configure libraries and build for extensibility – but I’ve met lots of people that understand the concept of deferred execution (“this happens later”) but have never really brushed up against it in C#.

Deferred execution is where code is declared but not immediately run – instead being invoked later.

Deferred execution is formally described by MSDN as meaning “the evaluation of an expression is delayed until its realized value is actually required”.

It’s common in JavaScript to supply a method as a callback which is later invoked:

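(A sketch – the callback body here is illustrative.)

doSomething(function () {
    // this function isn't run when doSomething() is called -
    // doSomething() invokes it later, as a callback
    console.log("called back!");
});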

In the above example, we’re declaring a new function while calling doSomething(). That function is executed later as a callback by the doSomething() method, rather than at the point we call it. Because JavaScript is single-threaded and event-driven, it makes extensive use of callbacks in just about every part of the language.

By contrast, deferred execution is less obvious in C#, even though there have been keywords and types that leverage deferred execution available for years. There are two common ways that deferred execution is implemented in C#:

The Yield Keyword

The yield keyword was introduced in C#2 as some sweet syntactic sugar to help people implement iterators and enumerators without boilerplate code. Using the yield keyword generates a state machine at compile time and actually does a surprising amount. There’s a really great (and ancient) post by Raymond Chen about how yield is implemented – but the short version is, you can “yield return” in methods that return an IEnumerable<T> and the compiler will generate a class with a whole bunch of goto statements in it. It’s an elegant compiler trick that a lot of you will have used, even if you didn’t realise it at the time.
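A minimal example of that deferral in action – nothing in Numbers() runs until the caller starts iterating:

using System;
using System.Collections.Generic;

class Program
{
    static IEnumerable<int> Numbers()
    {
        Console.WriteLine("computing 1");
        yield return 1;   // execution pauses here until the caller asks for the next item
        Console.WriteLine("computing 2");
        yield return 2;
    }

    static void Main()
    {
        foreach (var number in Numbers())
        {
            Console.WriteLine(number);
        }
    }
}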

The Action and Func Types

Now we get to the fun stuff. The Action and Func delegate types have been in the framework since .NET 2.0 and 3.5 respectively, but they became much more common from C#3 onwards, when lambdas were introduced to the language. They look like this:

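(A sketch – the declarations and names are illustrative.)

using System;

class Program
{
    static void Main()
    {
        // an Action is a delegate that returns nothing
        Action sayHello = () => Console.WriteLine("Hello!");

        // a Func returns a value - here, an int calculated from two ints
        Func<int, int, int> add = (a, b) => a + b;

        // neither body above has executed yet - these lines invoke them
        sayHello();
        Console.WriteLine(add(2, 3));
    }
}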

Delegates and lambdas in C# are collectively called “anonymous functions”, and (sometimes) act as closures. What this means is that when an anonymous function is declared, it can capture “outer variables” so that you can use them later. There’s a good answer by Eric Lippert explaining the exact semantics in a StackOverflow post here, and there are numerous examples around the web. This is interesting because you can use a closure to capture some context in one place in the application, and invoke it somewhere else.
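A minimal sketch of that capture – the lambda hangs on to the variable itself, not a copy of its value:

using System;

class Program
{
    static void Main()
    {
        var greeting = "Hello";   // the "outer variable"

        // the lambda captures 'greeting' - nothing is evaluated yet
        Action<string> greet = name => Console.WriteLine(greeting + ", " + name);

        greeting = "Goodbye";     // mutate the captured variable

        greet("world");           // invoked later - prints "Goodbye, world"
    }
}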

There are lots of fun use cases for this, and I want to highlight a couple.

Templating methods

Sometimes referred to as the “hole in the middle” pattern – there are plenty of scenarios where you have a block of repetitive, near-identical code with four or five tiny variations. This is one of the most frequent sources of copy-paste code, and a great refactor for lots of older codebases. You could create an abstract base class and a whole bunch of inheritance to solve this problem – or you could do something simpler, where this

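(A sketch – the report names and boilerplate are illustrative.)

using System;

public class ReportService
{
    public string BuildSalesReport()
    {
        Console.WriteLine("opening connection");   // identical boilerplate
        var body = "sales figures";                // the only line that differs
        Console.WriteLine("closing connection");   // identical boilerplate
        return body;
    }

    public string BuildStockReport()
    {
        Console.WriteLine("opening connection");
        var body = "stock levels";
        Console.WriteLine("closing connection");
        return body;
    }
}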

can trivially become this

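(The same sketch, with the boilerplate pulled into a single templating method.)

using System;

public class ReportService
{
    public string BuildSalesReport()
    {
        return BuildReport(() => "sales figures");
    }

    public string BuildStockReport()
    {
        return BuildReport(() => "stock levels");
    }

    // the boilerplate lives in one place; the Func fills the "hole in the middle"
    private string BuildReport(Func<string> body)
    {
        Console.WriteLine("opening connection");
        var report = body();
        Console.WriteLine("closing connection");
        return report;
    }
}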

Entirely removing the repetition from the codebase, without having to build a bunch of abstract cruft. The actual bodies of the methods are entirely different, but with some creative use of deferred execution, we can make sure we only include the boilerplate code once. There are more compelling examples though – consider this pattern if you’re boilerplating HTTP requests, or doing repetitive serialization.

Method Interception

If you’re authoring libraries and want to give your users a way to “meddle with the default behaviour”, providing them with optional Funcs is a really neat way to do it without compromising your design. Imagine a library call that by design swallows exceptions – while that’s not a great idea, we’ll run with it. Users might rightfully want to know when this happens, so you can leverage optional parameters and Action callbacks to give them a hook without compromising your library call.

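(A sketch – the FileStore class and its behaviour are illustrative.)

using System;
using System.IO;

public class FileStore
{
    // the optional onError Action gives callers a hook into behaviour the method would otherwise hide
    public void Save(string content, Action<Exception> onError = null)
    {
        try
        {
            File.WriteAllText("store.txt", content);
        }
        catch (Exception ex)
        {
            if (onError != null)
            {
                onError(ex);
            }
            // swallowed by design
        }
    }
}

// usage:
// new FileStore().Save("hello", ex => Console.WriteLine("Save failed: " + ex.Message));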

This is a very simplistic example – there are whole frameworks built on the concept of wiring up large collections of Funcs and Actions that are chained together. The Nancy web framework’s Before and After hooks are nothing more than a chain of Funcs that get executed in sequence. In fact, the whole of the OWIN work-in-progress spec for the next generation of .NET webservers revolves around the use of a single “AppFunc”.
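That “AppFunc” is nothing more exotic than a delegate signature – the whole application is a single Func from the request environment dictionary to a Task:

// the OWIN application delegate - an entire app is one Func
using AppFunc = System.Func<
    System.Collections.Generic.IDictionary<string, object>,
    System.Threading.Tasks.Task>;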

Mutator Actions and Funcs that leverage current context

I use Funcs for configuration in just about every library that I own. They’re a great way to allow people to configure the future behaviour of a library in a repeatable way. In the following example I’m going to create a factory class that stores an Action, which it uses to modify the type it creates each time Create is called.

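(A sketch – Widget, its Owner property, and the bootstrapping call are illustrative.)

using System;

public class Widget
{
    public string Owner { get; set; }
}

public class WidgetFactory
{
    private readonly Action<Widget> _configure;

    // the Action is captured once, at start-up...
    public WidgetFactory(Action<Widget> configure)
    {
        _configure = configure;
    }

    // ...and evaluated again on every call to Create
    public Widget Create()
    {
        var widget = new Widget();
        _configure(widget);
        return widget;
    }
}

// bootstrapping:
// var factory = new WidgetFactory(w => w.Owner = CurrentUser.Name);   // CurrentUser is illustrative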

This is especially useful if you want to do something like “get a value out of the current request”, or some other thing that changes each time the factory is called – if you design your Actions and Funcs right, passing in the “current executing context”, you can declaratively define behaviours at start-up that evaluate differently on each execution.

I’ve commonly used these kinds of configuration overrides to do things like “fetch the current tenant from context” or “get the ISession from a session scoped IoC container” – the configuration model of ReallySimpleEventing is a small example of using deferred execution and Actions to override the default behaviour when the library encounters an unhandled exception. A default “Throw all” implementation is provided, with the ability to override by configuration.

Working around problems with classes you can’t control the creation of

I recently had to dip into doing some WCF code – and one of the less desirable parts about working with WCF is that it creates your services for you. What this means is that if you want to use some kind of DI container across your codebase based around constructor injection, you’re out of luck. It’s annoying, and it can ruin your testing day, but with a little bit of legwork you can use Func’s to plug that hole.

Given a class that looks like this that’s created by some framework magic

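(A sketch – the MyService class and the work it does are illustrative.)

public class MyDependency
{
    public void DoWork() { }
}

public class MyService
{
    private readonly MyDependency _dependency;

    // the framework news this class up for us, so there's no way
    // to pass the dependency in through the constructor
    public MyService()
    {
        _dependency = new MyDependency();
    }

    public void SaveSomething()
    {
        _dependency.DoWork();
    }
}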

You can make a publicly settable static Func that’ll act as a proxy to your container, and bind it up at bootstrapping time. That’s a bit of a mouthful, so let me illustrate it.

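(The same sketch, with the static Func acting as the proxy to the container – the container.Resolve call stands in for whatever your IoC container provides.)

using System;

public class MyService
{
    // wired up to the container at bootstrapping time, and replaceable with a fake in tests
    public static Func<MyDependency> ResolveDependency = () => new MyDependency();

    private readonly MyDependency _dependency;

    public MyService()
    {
        _dependency = ResolveDependency();
    }

    public void SaveSomething()
    {
        _dependency.DoWork();
    }
}

// bootstrapping:
// MyService.ResolveDependency = () => container.Resolve<MyDependency>();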

In the above example, you can wire up a public static Func<MyDependency> to your container in your bootstrapping code. This means that even if you don’t control the lifecycle of your class, you have a hook to call back to the container and grab a current, valid instance of your dependency, without relying on deep static class usages or service locators. It’s preferable because you can override this behaviour in test classes, giving you a way to test previously hard-to-test code. This is especially useful if you want to remove references to HttpContext or some other framework-provided static from your code.

Dictionaries of methods for control flow

Here’s a fun example. Let’s say you write a command line app that takes one of five parameters, and you want to execute a different method based on the parameter passed. Simple enough – let’s write an if statement!

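(A sketch – the five commands and their bodies are illustrative.)

using System;

class Program
{
    static void Main(string[] args)
    {
        var command = args[0];

        if (command == "import") { Import(); }
        else if (command == "export") { Export(); }
        else if (command == "validate") { Validate(); }
        else if (command == "purge") { Purge(); }
        else if (command == "help") { Help(); }
    }

    static void Import()   { Console.WriteLine("importing"); }
    static void Export()   { Console.WriteLine("exporting"); }
    static void Validate() { Console.WriteLine("validating"); }
    static void Purge()    { Console.WriteLine("purging"); }
    static void Help()     { Console.WriteLine("help"); }
}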

A little naff and repetitive, but it’ll do. You could perhaps refactor this to a switch statement

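(The same Main method as above, with the if chain swapped for a switch.)

switch (command)
{
    case "import":   Import();   break;
    case "export":   Export();   break;
    case "validate": Validate(); break;
    case "purge":    Purge();    break;
    case "help":     Help();     break;
}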

This looks a little tighter, but it’s still quite verbose – even for such a small example. With the help of a dictionary and some Actions, you can convert this to a registry of methods – one that you could even modify at runtime.

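(A sketch – the dictionary maps each command name straight to an Action.)

using System;
using System.Collections.Generic;

class Program
{
    // a registry of commands - entries could be added or replaced at runtime
    static readonly Dictionary<string, Action> Commands = new Dictionary<string, Action>
    {
        { "import",   () => Console.WriteLine("importing") },
        { "export",   () => Console.WriteLine("exporting") },
        { "validate", () => Console.WriteLine("validating") },
        { "purge",    () => Console.WriteLine("purging") },
        { "help",     () => Console.WriteLine("help") }
    };

    static void Main(string[] args)
    {
        Commands[args[0]]();   // look the method up and invoke it
    }
}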

This is a visibly more concise way of expressing the same control flow, and one that’s mutable at runtime.

This is just a taste of some of the things you can do when you leverage deferred execution, anonymous methods and the Action and Func types in .NET. There are plenty of open source codebases that make use of these kinds of patterns, so do dig deeper!