Building software that’s easy to monitor and administer

09/23/2013 20:12:04

When you’re building software that’s central to what your company does, it’s very important that it’s easy to support.

This might sound like obvious advice, but if something goes wrong, you’re likely to be the person called to support the live system, so make sure the system you build exposes the information you’ll need to troubleshoot it. When times are good, it’s just as important to be able to see the status of a system that’s currently in flight. People will ask, so you may as well arm yourself with the information you need.

Here’s some simple, practical advice for making systems monitoring-friendly.

Make sure you have logging

Again, seemingly obvious advice, but make sure you’re logging important information. Logging is a solved problem across many languages and frameworks, so don’t reinvent it. Use a Log4X library (log4net for .NET, log4j for Java, or whatever your platform offers), and make sure the logs roll and are easily available to everyone who needs them. There are some great services out there that support syslog searching and indexing – check out papertrailapp.com for my personal favourite.
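As a rough illustration of “logs that roll”, here’s a minimal sketch using Python’s standard logging module as a stand-in for whichever Log4X flavour you use. The logger name, file name and size limits are all made up for the example, and the limits are deliberately tiny so the rollover is visible.

```python
import logging
import logging.handlers
import os
import tempfile


def build_logger(log_path, max_bytes=1_000_000, backup_count=5):
    """Return a logger that writes to log_path and rolls the file by size."""
    logger = logging.getLogger("payments")  # hypothetical application name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the example's output in the file only
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backup_count
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    logger.addHandler(handler)
    return logger


# Tiny limits so the rollover happens after a handful of messages.
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")
logger = build_logger(log_path, max_bytes=200, backup_count=2)
for i in range(20):
    logger.info("processed order %d", i)
```

After the loop, the directory holds the live app.log plus the two most recent rolled files; anything older has been discarded. In a real system you’d pick limits that suit your disk and ship the files somewhere searchable.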

Track and aggregate errors and exceptions

Understanding what constitutes a “normal” number of errors in your application is very important. There are plenty of reasons for websites to generate errors under traffic: web crawlers requesting invalid URIs, or poor or malicious user input. A single error in a payment processing system, however, is often critical. You should spend time understanding the error profile of your application, and fix the bugs that cause the “expected” errors so that “real” errors don’t get mistaken for noise. There are plenty of services out there to help you track and fold errors; I particularly like raygun.io for .NET and JavaScript projects (though their support is much wider). You want to watch general trends of errors over time, along with newly introduced errors, to understand how to respond to errors in your software after launch.
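To make “folding” errors concrete, here’s a small sketch of the idea, not how any particular service does it. It groups exceptions by a fingerprint (exception type plus the function that raised it, which is an assumption of mine) and flags kinds that haven’t been seen before. The class and the two failing functions are hypothetical.

```python
import collections
import traceback


class ErrorAggregator:
    """Fold repeated errors into one bucket per fingerprint and count them."""

    def __init__(self, known_fingerprints=()):
        self.counts = collections.Counter()
        self.known = set(known_fingerprints)  # the "expected" error profile

    @staticmethod
    def fingerprint(exc):
        # Group by exception type plus the function where it was raised.
        frames = traceback.extract_tb(exc.__traceback__)
        where = frames[-1].name if frames else "<unknown>"
        return f"{type(exc).__name__}@{where}"

    def record(self, exc):
        """Count the error; return True if this kind has not been seen before."""
        key = self.fingerprint(exc)
        is_new = key not in self.known and self.counts[key] == 0
        self.counts[key] += 1
        return is_new


def parse_uri(raw):  # hypothetical source of "expected" noise
    if " " in raw:
        raise ValueError("invalid uri")
    return raw


def charge_card(amount):  # hypothetical source of a "real" error
    raise RuntimeError("gateway timeout")


aggregator = ErrorAggregator(known_fingerprints={"ValueError@parse_uri"})
new_kinds = []
for call, arg in [(parse_uri, "a b"), (parse_uri, "c d"), (charge_card, 10)]:
    try:
        call(arg)
    except Exception as exc:
        if aggregator.record(exc):
            new_kinds.append(aggregator.fingerprint(exc))
```

Here the two crawler-style errors fold into a single known bucket, while the payment failure is the only one reported as new. That’s the signal you’d want to alert on.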

Windows software? Use the event log!

Log files are great, but some solid event log messaging and some custom performance counters in your application will make that special sysadmin in your life very happy. There are plenty of tools that can monitor these logs for messages, status codes and spikes in performance counters (including Microsoft’s own SCOM, along with lots of popular third-party tools).

Building system services? Don’t hide them!

System services are common and just as easily forgotten. If you’re writing “invisible” software, it’s important to force it into the limelight so people don’t forget it’s there, and especially so they notice if it’s not running. As good practice, I always recommend running monitoring dashboards from inside the system service to ensure people know it’s there. I’m a big fan of embedding a web server in every system service that would otherwise be invisible, and using it to serve monitoring dashboards with the kind of statistics that you’d need during troubleshooting. Your applications will know what’s important to them, so measure stats in real time and expose them over HTTP. Everyone knows how to use a browser, and the presence of the status page is a great way to monitor availability. If you want to do a great job, you can use graphing libraries and expose the data as JSON for other systems to query. Consider surfacing things like “average time to process a request”, “number of failures since launch”, “throughput” and other metrics that’ll help you if you’re investigating live issues. If you’re working in C# / .NET, I highly recommend using NancyFx as an embedded web server in your system services.
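Here’s a sketch of the embedded status page idea, using Python’s built-in http.server rather than NancyFx, purely to show the shape of it. The /status path, the ServiceStats class and the field names are all my own inventions for the example; a real service would pick the metrics that matter to it.

```python
import http.client
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ServiceStats:
    """In-memory counters the service updates as it does its work."""

    def __init__(self):
        self._lock = threading.Lock()
        self.started = time.time()
        self.processed = 0
        self.failures = 0
        self.total_seconds = 0.0

    def record(self, seconds, ok=True):
        with self._lock:
            self.processed += 1
            self.total_seconds += seconds
            if not ok:
                self.failures += 1

    def snapshot(self):
        with self._lock:
            avg = self.total_seconds / self.processed if self.processed else 0.0
            return {
                "uptime_seconds": round(time.time() - self.started, 1),
                "processed": self.processed,
                "failures_since_launch": self.failures,
                "average_seconds_per_request": avg,
            }


def start_status_server(stats, port=0):
    """Serve stats as JSON on /status from a background thread."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/status":
                self.send_error(404)
                return
            body = json.dumps(stats.snapshot()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # silence per-request console logging for the example

    server = ThreadingHTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


stats = ServiceStats()
stats.record(0.2)
stats.record(0.4, ok=False)
server = start_status_server(stats)  # port 0 lets the OS pick a free port

conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
conn.request("GET", "/status")
status = json.loads(conn.getresponse().read())
conn.close()
server.shutdown()
server.server_close()
```

Because the server runs on a daemon thread inside the service process, the page disappears the moment the service does, which is exactly what makes it useful as an availability check.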

Building APIs? Measure performance and make use of response headers to message information

The performance of your APIs will help the apps that depend on them flourish or fail, and there’s nothing more frustrating than a poor feedback cycle as an API developer. You should measure, in memory, in real time, per node, the number of requests you’re serving, the average requests per second, per method, the rate of errors, and the overall percentage of errors in calls. You should return the time taken on the server as a response header (something like “X-ServerTime”) to help the caller debug any weird latency issues they’re encountering, and you should offer this information over the API itself, either via a reporting or status API call, or through a web dashboard. When I was working at JustGiving, we put a lot of effort into the end developer experience, serving both the API docs and single node statistics to the public per node, and it saved us weeks of debugging and messaging. You can check out an example of what we did here: JustGiving single node stats page. Not only did it help us diagnose problems, but it helped people coding against our APIs verify error behaviour if they experienced it.
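The per-method measurement and the timing header can be sketched with a simple wrapper. This isn’t what we ran at JustGiving; it’s a framework-free illustration in which handlers return a (status, headers, body) tuple, and the MethodStats class, the timed decorator, the route names and the header’s millisecond format are all assumptions. It also isn’t thread-safe as written, so a real version would need a lock around the counters.

```python
import collections
import functools
import time


class MethodStats:
    """Per-method call counts, error counts and total time, held in memory."""

    def __init__(self):
        self.calls = collections.Counter()
        self.errors = collections.Counter()
        self.seconds = collections.defaultdict(float)

    def error_percentage(self, method):
        calls = self.calls[method]
        return 100.0 * self.errors[method] / calls if calls else 0.0


def timed(stats, method):
    """Wrap a handler that returns (status, headers, body).

    Records the call against `method` and adds an X-ServerTime header
    with the time spent on the server.
    """

    def decorate(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                status, headers, body = handler(*args, **kwargs)
            except Exception:
                # Unhandled exceptions become a 500 so they are still counted.
                status, headers, body = 500, {}, "internal error"
            elapsed = time.perf_counter() - start
            stats.calls[method] += 1
            stats.seconds[method] += elapsed
            if status >= 500:
                stats.errors[method] += 1
            headers = dict(headers)
            headers["X-ServerTime"] = f"{elapsed * 1000:.3f}ms"
            return status, headers, body

        return wrapper

    return decorate


stats = MethodStats()


@timed(stats, "GET /donations")  # hypothetical route
def list_donations():
    return 200, {"Content-Type": "application/json"}, "[]"


@timed(stats, "POST /donations")  # hypothetical route
def create_donation():
    raise RuntimeError("database unavailable")


ok_status, ok_headers, _ = list_donations()
bad_status, bad_headers, _ = create_donation()
```

Both responses carry the X-ServerTime header, including the failed one, so a caller seeing slow errors can tell whether the time was spent on the server or on the wire. The same stats object is what you’d serialise for a status call or dashboard.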

Whenever you’re building anything, remember that you, or someone you work with, is going to be the person that has to fix it if it fails. So be nice to that person.