
Thoughts about delivering platforms from Nick Denton’s leaked memo

December 11th, 2014

There’s an interesting piece here about Nick Denton stepping down yesterday as company president of Gawker Media, the first “big web media company”.

The interesting, tech-relevant passage from his pretty readable memo is below. For reference, Kinja is Gawker’s underlying editorial platform. Denton is big and brash in the memo (which is basically his Christmas note / resignation), and there are a few interesting observations about the common trials and tribulations of software development in there.

“The problems I’m going to identify are common. Excellence in software development is elusive; no online publisher has yet succeeded in transforming itself into a platform…

The current principles of software product development hold that candid conversations–with developers, designers and users–lead to a better web experience. We lacked that necessary candor. We left too many opportunities on the table, too many known problems unresolved. And in our external communications, in our stories, we sometimes shied away from controversy, fearful of online critics. We weren’t ourselves.

We all understand how this works. Editorial traffic was lifted but often by viral stories that we would rather mock. We — the freest journalists on the planet — were slaves to the Facebook algorithm. The story of the year — the one story where we were truly at the epicenter — was one that caused dangerous internal dissension. We were nowhere on the Edward Snowden affair. We wrote nothing particularly memorable about NSA surveillance. Gadgets felt unexciting. Celebrity gossip was emptier than usual.

We pushed for conversations in Kinja, but forgot that every good conversation begins with a story. Getting the stories should have come first, because without them we have nothing to talk about…

…And the development of Kinja itself was a challenge. Our Tech department proclaimed a new era of multi-disciplinary cross-functional teamwork and collaboration. The reality: the best tech teams in online media in both New York and Budapest, with too many developers grinding away at re-factoring (thankful though we’ll be next year for that prep work). And a product manager on the 2015 design refresh had barely talked to the consultant who had driven the other major new project of the forthcoming year. Open collaboration in theory; the opposite in practice.
Who raised the alarm? Would I have even heard it? For a good 12 months from the summer of 2013 I was variously betrothed, distracted, obsessed by Kinja, off on honeymoon, obsessed by Kinja, off on sabbatical. I’m not sorry for that. For ten years, I’ve danced with this octopus. That’s what one person on Twitter calls Gawker: an octopus armed with chainsaws. I deserved a break.

When I was disengaged, I didn’t leave any real authority in place. In my absence, the company ticks along nicely; with the challenge of Buzzfeed and Vox, ticking along nicely is no longer enough. Even when I’m here, if I’m obsessed by something, other parts of our common project can spin off in unpredictable directions, causing me to overlook developing risks and opportunities. As Joel said, I am the company’s greatest asset — and its greatest liability. To be saved from myself, like many of us, I need partners in the fullest sense of the word, to take up the slack or keep me on focus. And I didn’t have them.

During this period I made a mistake in Editorial, hiring a talented guy whose voice and vibe I loved, who represented nerd values, and whom I thrust into a job which changed under his feet: he was competing with Lockhart Steele of Vox and Ben Smith of Buzzfeed, two of the most effective editorial managers in the business, each with the funding to go after the very best talent.

I was so obsessed with the design of Kinja discussions, I didn’t even think to warn that Gawker is always first about the story. I took that for granted. I was in so much of a hurry that I didn’t even look at other candidates, a cardinal sin. I made a mistake, and I’m sorry to Joel, and I’m sorry to those to whom he is a friend.

And during this time too, we embarked piecemeal on a software project whose eventual scope we barely imagined. Tom told me years ago he did not want to run the department beyond 30 people, that he wanted to get back to coding. Tech is now at 55 people. Tom didn’t push me. I didn’t want to mess with what was comfortable, the best relationship with a CTO by far that I’ve ever had in my career. And no other views were solicited.

So we attracted impressive technical talent — with our culture, audience testbed, and idea — and then we let those people down. We embarked on the Kinja expansion before we’d recruited the management; each major hire was reactive, each to fix a problem created by the last. Hire engineers. Now manage engineers. Oh no, we need product people. Lean, what’s that? I had to learn fast. It wasn’t quite that bad; but not that far off.”

Food for thought.

NuGet 101 – A Bootcamp

December 9th, 2014

I’ve been running and recording a lot of workshops over the last couple of months – here is one on NuGet packaging for beginners – starting out as a slide deck then moving into a practical demo.

Slide deck 

  • History
  • What’s a package
  • So it’s a zip file right?
  • Why should I use them?
  • An open source mentality
  • Disadvantages
  • Realities

Demo

  • Directory topologies of a library package
  • Adding packages to your solutions
  • Pairing nuspec files with csproj’s
  • Replacement tokens for metadata
  • The NuGet docs
  • Package dependencies
  • Dependency discovery and bundling
  • Versioning and SemVer
  • Package sources
  • NuGet.config

Mentions

Slide Deck: https://docs.google.com/presentation/d/1Zo_-MpO9XRHZMpC0UTvd05EfFGBf3Y5Lj_ZWG8vokvM/pub?start=false&loop=false&delayms=10000

Deployments on Windows / .NET in 2014

November 13th, 2014

Deploying software in the Microsoft ecosystem has long been one of the more unloved and challenging aspects of .NET development. Over the last 3 years, there have been several improvements to deployment practices around Windows, but no single obvious approach has emerged. With recent announcements of new deployment partnerships for Windows Azure, I want to take a step back and look at what our environments look like, and what tools we’re using to deploy to them.

The results of this survey will be collated and circulated, and I’ll try to express some trends from them if I can get enough responses.

Take the survey here

Garbage Collection in .NET – Workstation vs. Server GC

November 6th, 2014

So, I’m reading the largely excellent “Writing High-Performance .NET Code” by @benmwatson at the moment, and I wanted to share something that it expresses especially clearly, which I find ambiguous in many of the official docs.

The garbage collector in .NET is treated by the vast majority of devs as a thing of mystery – and one of those mysterious options is “Do you want to run the GC in ‘Workstation’ or ‘Server’ mode?”

Workstation GC

  • All GCs happen on the same thread that triggers the collection
  • They’re all run at the same priority
  • A full suspension occurs before collection

Server GC

  • Creates a dedicated thread for each logical processor or core
  • Each of these threads runs at the highest priority
  • These threads sleep between collections
  • A memory heap is created for each GC thread
  • Garbage collections happen in parallel due to multiple heaps

For both, from .NET 4.5

  • A background thread exists for generation 2 garbage collection
  • This dedicated thread can be disabled

As a result

  • Server GC has the lowest latency
  • Choose Server GC if you have a dedicated box for your application
  • If you have many managed processes, be more wary – the abundance of high-priority background threads could cause a performance problem.
  • You can mitigate this contention by setting process affinity to specific logical processors – at that point the CLR will only create GC threads and managed heaps for the logical processors your process is affinitized to.

TL;DR: Choose server, unless you have *lots* of managed applications on the box. Always choose server in a dedicated one-machine-per-app environment.
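
Switching modes is a config change rather than a code change. Roughly, it looks like this – the gcServer and gcConcurrent runtime elements are the standard settings, and GCSettings lets you confirm what you actually ended up with:

// app.config / web.config:
//   <configuration>
//     <runtime>
//       <gcServer enabled="true" />      <!-- defaults to false (workstation GC) -->
//       <gcConcurrent enabled="true" />  <!-- background GC; set to false to disable it -->
//     </runtime>
//   </configuration>

using System;
using System.Runtime;

class GcModeCheck
{
    static void Main()
    {
        // Confirm at runtime which mode the process actually started under.
        Console.WriteLine("Server GC: " + GCSettings.IsServerGC);
        Console.WriteLine("Latency mode: " + GCSettings.LatencyMode);
    }
}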

Also, read the book!

Code dojos as a learning tool

November 5th, 2014

As part of the “software consultant” gig, I often get involved with mentoring developers and teams. Mentoring developers one on one works pretty well, but after a point, it’s quite hard to scale the face time you can give to teams or departments of 50-60 people.

The truth is that people learn by doing, and people learn from each other – so one of the most important things to do in your technology department is to bake that culture of learning and development into the day-to-day. I’ve had good success using code dojos as a vehicle to introduce pair programming and learning into departments at a couple of different companies.

So what’s a code dojo?

A code dojo is a session where you perform “coding katas” to practice, improve and learn. The concept comes from Japanese martial arts, and I’m going to liberally borrow from the wikipedia page on the subject.

Kata, a Japanese word, are the detailed choreographed patterns of movements practised either solo or in pairs. The term form is used for the corresponding concept in non-Japanese martial arts in general.

Kata are used in many traditional Japanese arts such as theatre forms like kabuki and schools of tea ceremony (chado), but are most commonly known for their presence in the martial arts. Kata are used by most Japanese and Okinawan martial arts, such as aikido, judo, kendo and karate.

The basic goal of kata is to preserve and transmit proven techniques and to practice self-defence. By practicing in a repetitive manner the learner develops the ability to execute those techniques and movements in a natural, reflex-like manner. Systematic practice does not mean permanently rigid. The goal is to internalize the movements and techniques of a kata so they can be executed and adapted under different circumstances, without thought or hesitation. A novice’s actions will look uneven and difficult, while a master’s appear simple and smooth.

The way that we interpret katas in software development is to practice coding by solving synthetic or interesting problems that are similar to a variety of real world problems. Generally, katas are performed in pairs, with two people sharing a single computer and alternating participation in the task. This is called pair programming, and it’s a great way for people to learn from one another – it’s a common extreme programming / agile practice.

Pair programming is an agile software development technique in which two programmers work together at one workstation.

One, the driver, writes code while the other, the observer, pointer or navigator, reviews each line of code as it is typed in.

The two programmers switch roles frequently.

Why do we do katas?

We learn archetypal solutions to common types of problems – commonly, programming tasks of similar shapes share similar solutions – so practising building quick solutions to these problems helps us in our practical work.

We learn alternative solutions to things we thought we knew well – one of the joys of a kata is the inverse of the first point – because the kata is practice, and explicitly not real code, the risk of failure is removed and we can learn and experiment with alternate approaches to categories of problems.

We learn how our peers approach problems – the way individuals think about solving problems is drastically different, and the best way to improve your skills is to learn from other talented people. A code dojo gives you the opportunity to pair program with someone who you might not normally work with, or who has a distinctly different skill-set than you do.

We get to work on some fun, interesting problems – over multiple sessions, you’ll get exposed to different categories of problems that should be approached in varying ways – it’s one of the easiest ways to expose yourself to new kinds of development.

It’s safe to try out new languages – katas are one of the best ways of learning new languages – they’re a safe, failure-free environment to play and learn in.

The first kata

I’m working with a new client at the moment, and this week we ran the first code dojo.

The first kata involved building a “zero player videogame” in an hour by implementing Conway’s Game of Life. It’s a really fun first dojo, because it’s out of the comfort zone of the kinds of things most “developers who work on business software” normally get a chance to program.

Some of this is paraphrased from the excellent Wikipedia article, and I was inspired to pick this particular kata by a blog post I recently read by Jeremy Bytes about implementing the game with TDD, which highlights what a natural fit it is for a kata.

The Game of Life is a cellular automaton devised by the mathematician John Horton Conway in 1970.

The “game” is a zero-player game, meaning that its evolution is determined by its initial state, requiring no further input. One interacts with the Game of Life by creating an initial configuration and observing how it evolves or, for advanced players, by creating patterns with particular properties. The universe of the Game of Life is an infinite two-dimensional orthogonal grid of square cells, each of which is in one of two possible states, alive or dead.

Every cell interacts with its eight neighbours, which are the cells that are horizontally, vertically, or diagonally adjacent. At each step in time, cells evaluate their state and transition between being alive and dead depending on their neighbours.

The initial pattern constitutes the seed of the system. The first generation is created by applying a set of rules simultaneously to every cell in the seed. Births and deaths occur simultaneously, and the discrete moment at which this happens is called a tick.
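
The transition rules themselves are tiny. Here’s a minimal sketch of a single cell transition, using the standard rules (a live cell survives with two or three live neighbours; a dead cell becomes alive with exactly three):

public static bool NextState(bool currentlyAlive, int liveNeighbours)
{
    if (currentlyAlive)
    {
        // Survival: under-population (<2) and over-population (>3) both kill the cell.
        return liveNeighbours == 2 || liveNeighbours == 3;
    }

    // Reproduction: a dead cell with exactly three live neighbours comes to life.
    return liveNeighbours == 3;
}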

We ping-pong-paired (one person writes a test, the other makes it pass and writes the next) six different solutions to the Game of Life, in the room, in just over an hour.

The full outline of the first code kata is available on my github account if you want to have a go yourself, along with my quick (and sub-optimal!) implementation in C# here. If you’re going to have a go, I’d suggest you don’t look at any solutions first.

Here’s one we made earlier:


Consider running some code dojos

The great thing about code dojos is they’re easy to get off the ground.

You just need a few people, some snacks, and a problem. If you want to get started, there are lots of ideas on http://rubyquiz.com/ that you can base katas on – though it might take a little bit of prep work on the part of the organiser to reformat some of their questions.

Over the next couple of weeks I’m going to try to dig up some of the katas I ran with my last client and make them available – mostly reformats of old rubyquiz quizzes – but with structured user stories and examples.

Code dojos and katas help you prevent premature specialisation, and they’re good fun.

Performance Tuning and the Importance of Metrics

October 26th, 2014

Last week I was helping a client with some performance problems in one of their subsystems. Performance profiling is often a tricky subject where there’s no one clear preventative step, but I want to highlight a few positive qualities that it encourages in your codebase.

A wild performance problem appears!

The system in question started exhibiting performance problems. What was more interesting, though, was the nature of the problem: calls to save data were bottlenecking for minutes at a time – all with a relatively small number of users.

If you’ve ever done any performance tuning in the past, this sounds like a classic resource contention issue – where a scarce resource locks and users are rate limited in their access to it. Conspicuously, there hadn’t been any significant code changes to the portion of the system that saved the data in question.

Reproducing Performance Problems

Like any kind of issue in software development, you can’t do anything to solve a problem unless you can see it, and until you can see it, you can’t start to identify what kind of fixes you could use. Discerning the "what" from a chorus of people frustrated with a system is pretty difficult, and we both benefited and suffered from the fact that the system in question is part of a small ecosystem of distributed systems.

We were lucky in this case – the system was a JavaScript app that posted data over HTTP to web services hosted inside itself, which meant we had access to IIS logs of the requests and could aggregate them to identify the slow calls that users were experiencing. This was the "canary in the coal mine", highlighting some API methods that were taking a long time to execute.

To make matters worse, the methods that were taking a huge amount of time to execute were methods that interacted both with a third party component and with other systems in the distributed architecture.

Perceptions of performance in distributed systems

Performance is frequently hidden behind smoke and mirrors – there’s a big difference between actual performance and perceived performance. Actual performance is concrete and measurable, whereas perceived performance is an ephemeral feeling that a user has towards your software.

There’s a really great example of this that everyone recognises. Facebook uses an eventual consistency driven data store for much of their user data.

“Eventual consistency is a consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.” – Wikipedia

When a Facebook user posts some new content, that content is immediately added into their browser’s DOM. Basically, what they’ve posted is put on the screen right in front of them. Other users of the site will see the content seconds later, when it eventually replicates across all the data stores. This can take seconds to minutes if the system is running slowly, but the data is sent and queued, rendered to the user, and the perception is that Facebook is performing very quickly.

The key takeaway is that the way users feel about performance directly correlates to their experiences and what they can see, and rarely to the performance of the system as a whole.

The perception of poor performance in a distributed system will always fall on the user-facing components of that system. To compound the problem, reproducing “production like” performance problems is often much more difficult, with the strain felt by various components in the system becoming very difficult to isolate and identify.

Performance is a feature of your system

Performance problems are notoriously hard to solve because “performance” is often a blanket term used to describe a wide variety of problems. To combat this, mature systems often have expected performance requirements as a feature – benchmarkable, verifiable performance characteristics that they can be measured and tested against.

I’m not a proponent of performance first design. Performance first design frequently leads to micro-optimisations that ruin the clarity and intent of a codebase, where a higher level macro-optimisation would yield far greater performance improvements. I am, however, a big fan of known, executable, performance verification tests.

Performance verification tests provide a baseline that you can test your system against – they’re often high level (perhaps a subset of your BDD or feature tests), and they’re run frequently, in parallel with development. These tests are important for establishing a baseline for conversations about "performance increasing" or "performance decreasing", because they’ll give you tangible, real-world numbers to talk about. The value of these tests varies throughout development – but they’re easiest to add at the start of a project and to evolve with it.

Performance is a feature, even if it isn’t necessarily the highest priority one.

Measurement vs. Testing

While performance tests are a great way to understand the performance of your system as you build it, real systems in a real production environment will always have a more diverse execution profile. Users use your system in ways that you’re not designing for. It’s ok. We all accept it.

Measurement on the other hand, is the instrumentation and observation of actual running code. Performance tests will help you understand what you expect of your system, but quality measurement and instrumentation will help you understand what’s going on right now.

Measuring the actual performance of your system is vital if you’re investigating performance problems, and luckily, there are great tools out there to do it. We used the excellent New Relic to verify some of our performance related suspicions.

New Relic is not the only tool that does hardware and software measurement and instrumentation, but it’s certainly one of the best, and it’s part of a maturing industry of software-as-a-service offerings that support application monitoring, logging, and statistical reporting over servers and apps.

Code reading and profiling

Given that we had a suspicious looking hot-spot that we’d identified from IIS logs, we were also able to lean on old-fashioned code review and profiling tools.

Profiling tools are a bit of a headache. People struggle with them because they often deal with concepts that you’re not exposed to at any time other than when you’re thrashing against performance issues. We’re lucky in the .NET ecosystem that we’ve got a couple of sophisticated profiling options to turn to, with both JetBrains’ dotTrace and RedGate’s ANTS Performance Profiler being excellent, mature products.

We profiled and read our way through the code, and doing so highlighted a few issues.

Firstly, we found some long running calls to another system. These were multi-second HTTP requests that were difficult to identify and isolate without deep code reading because there were no performance metrics or instrumented measurement around them.

Secondly, and much more significantly, there was a fundamental design problem in a third party library. Due to some poor design in the internals of this library, it wasn’t able to cope with the volume of data that we were storing in it. After some investigation, we established a work-around for this third party library problem, and prepared a fix.

How do we prevent this happening?

There are some useful takeaways from this performance journey. The first is a set of principles that should be considered whenever you’re building software.

Monitoring, instrumentation and alerting need to be first-class principles in our systems.

This means that we should be recording timings for every single HTTP call we make. We should have alerting set on acceptable performance thresholds, and this should all be built into our software from day one.
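
Even something as crude as a stopwatch wrapped around every outbound call, shipped off to your logging or metrics tooling, would have surfaced our slow calls far earlier. A minimal sketch:

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

public static class TimedHttp
{
    public static async Task<HttpResponseMessage> GetWithTiming(HttpClient client, string url)
    {
        var timer = Stopwatch.StartNew();
        try
        {
            return await client.GetAsync(url);
        }
        finally
        {
            timer.Stop();
            // In real code, ship this to your metrics/alerting pipeline rather than the console.
            Console.WriteLine("GET " + url + " took " + timer.ElapsedMilliseconds + "ms");
        }
    }
}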

To get the visibility we need into our software, we need great tooling.

New Relic was instrumental in helping us record the changes in performance while testing our solution. Further monitoring, instrumentation and aggregation of exceptions and stats would have made our lives much simpler – letting us identify potentially long-running API calls much more quickly.

There are tools on the market (StatsD, LogStash, Papertrail, Kibana, Raygun) that you can buy and implement trivially that’ll vastly increase the visibility of these kinds of problems – they’re essential to reliably operate world-class software in production, and they’re much cheaper to buy and outsource than to build and operate. If they save a few developer days a month, they pay for themselves.

Poor design ruins systems. In this case, the poor design was in a third party library, but it’s worth reiterating regardless. A design that can’t cope with an order of magnitude increase in load needs to be evaluated and replaced.

Fit for purpose is very load-dependent – we should consider whether we can catch these potential problems while evaluating third party libraries that can’t be easily replaced – going to the effort of scripting and importing load, rather than discovering these issues when scaling to the point of failure.

Luckily, much of this is stuff we already know – instrumentation is vital, and monitoring and performance metrics help us build great software – but these are some nice, practical and easy wins that can be implemented in your software today.

Lessons learnt running a public API from #dddnorth

October 19th, 2014

Yesterday I gave a talk at #DDDNorth (a free, community-led conference in the "Developer! Developer! Developer!" series) about running public-facing RESTful APIs. Many thanks for all the kind feedback I’ve had about the talk on social media – I’m thrilled that so many people enjoyed it. It was a variant on previous talks I’ve given at a couple of user groups on the topic – so here are the updated slides.

Google presentation link

Deferred execution in C# – fun with Funcs

October 16th, 2014

I want to talk a little about deferred execution in C#.

I use deferred execution a lot in my code – frequently using it to configure libraries and build for extensibility – but I’ve met lots of people that understand the concept of deferred execution (“this happens later”) but have never really brushed up against it in C#.

Deferred execution is where code is declared but not immediately run – instead it is invoked later.

Deferred execution is formally described by MSDN as meaning “the evaluation of an expression is delayed until its realized value is actually required”.

It’s common in JavaScript to supply an anonymous function as a callback – you declare a new function at the point you call something like doSomething(), and that function is executed later, as a callback, by doSomething(), rather than at the point you call it. Because JavaScript is executed linearly, it makes extensive use of callbacks in just about every part of the language.

By contrast, deferred execution is less obvious in C#, even though keywords and types that leverage it have been available for years. There are two common ways that deferred execution is implemented in C#:

The Yield Keyword

The yield keyword was introduced in C#2 as some sweet syntactic sugar to help people implement iterators and enumerators without boilerplate code. Using the yield keyword generates a state machine at compile time and actually does a surprising amount. There’s a really great (and ancient) post by Raymond Chen about how yield is implemented – but the short version is, you can "yield return" in methods that return an IEnumerable<T>, and the compiler will generate a class with a whole bunch of goto statements in it. It’s an elegant compiler trick that a lot of you will have used, even if you didn’t realise it at the time.
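
For example, a minimal sketch – nothing in the method body runs until something enumerates the sequence:

using System.Collections.Generic;

public static class Evens
{
    public static IEnumerable<int> FirstEvenNumbers(int count)
    {
        for (var i = 0; i < count; i++)
        {
            // Execution pauses here after each item and resumes when the caller asks for the next one.
            yield return i * 2;
        }
    }
}

// var evens = Evens.FirstEvenNumbers(5);   // nothing has executed yet
// foreach (var n in evens) { }             // the method body runs lazily, one item at a time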

The Action and Func Types

Now we get to the fun stuff. The Action delegate arrived back in .NET 2.0, and the Func family followed in .NET 3.5, but both became much more common from C#3 onwards, when lambdas were introduced to the language. They look like this:

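Something along these lines (a minimal sketch – the delegate bodies are just illustrative):

// inside any method (using System):
Action<string> log = message => Console.WriteLine("LOG: " + message);   // returns nothing
Func<int, int, int> add = (x, y) => x + y;                              // returns a value

// Neither lambda has executed yet - they only run when invoked:
log("Starting up");          // prints "LOG: Starting up"
var result = add(2, 3);      // result == 5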

Anonymous delegates and lambdas in C# are collectively known as "anonymous functions", and (sometimes) act as closures. What this means is that when an anonymous function is declared, it can capture "outer variables" so that you can use them later. There’s a good response by Eric Lippert explaining the exact semantics of anonymous methods in a StackOverflow post here, and there are numerous examples around the web. This is interesting because you can use a closure to capture some context in one place in the application, and invoke it somewhere else.

There are lots of fun use cases for this, and I want to highlight a couple.

Templating methods

Sometimes referred to as the “hole in the middle” pattern – there are plenty of scenarios where you have a block of repetitive identical code, and four or five tiny variations. This is one of the most frequent sources of copy-paste code, and a great refactor for lots of older codebases. You could create an abstract base class, and a whole bunch of inheritance to solve this problem – or you could do something simpler, where this

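(A sketch with illustrative names – _customers and _orders are hypothetical list fields, and the timing code is the repeated boilerplate:)

// inside some class - using System; using System.Diagnostics;
public void SaveCustomer(string customer)
{
    var timer = Stopwatch.StartNew();            // boilerplate
    try
    {
        _customers.Add(customer);                // the only line that differs
    }
    finally
    {
        timer.Stop();                            // boilerplate
        Console.WriteLine("SaveCustomer took " + timer.ElapsedMilliseconds + "ms");
    }
}

public void SaveOrder(string order)
{
    var timer = Stopwatch.StartNew();            // the same boilerplate again
    try
    {
        _orders.Add(order);                      // the only line that differs
    }
    finally
    {
        timer.Stop();
        Console.WriteLine("SaveOrder took " + timer.ElapsedMilliseconds + "ms");
    }
}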

can trivially become this

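(Again just a sketch – the boilerplate now lives in one place, and the varying middle is passed in as an Action:)

public void SaveCustomer(string customer)
{
    Timed("SaveCustomer", () => _customers.Add(customer));
}

public void SaveOrder(string order)
{
    Timed("SaveOrder", () => _orders.Add(order));
}

private void Timed(string name, Action body)
{
    var timer = Stopwatch.StartNew();
    try
    {
        body();                                  // the deferred "hole in the middle"
    }
    finally
    {
        timer.Stop();
        Console.WriteLine(name + " took " + timer.ElapsedMilliseconds + "ms");
    }
}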

This entirely removes the repetition from the codebase without having to build a bunch of abstract cruft. The actual bodies of the methods are entirely different, but with some creative use of deferred execution, we can make sure we only include the boilerplate code once. There are more compelling examples though – consider this pattern if you’re boilerplating HTTP requests, or doing repetitive serialization.

Method Interception

If you’re authoring libraries and want to give your users a way to "meddle with the default behaviour", providing them optional Funcs or Actions is a really neat way to do it without compromising your design. Imagine a library call that by design swallows exceptions – while that’s not a great idea, we’ll run with it. Users might rightfully want to know when this happens, so you can leverage optional parameters and Action callbacks to give them a hook without compromising your library call.

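A sketch of the idea – the library method and everything it calls here are hypothetical:

public void TryDoWork(Action<Exception> onError = null)
{
    try
    {
        DoWorkThatMightThrow();                  // hypothetical internal call
    }
    catch (Exception ex)
    {
        // Swallowed by design - but callers who care get a hook into the failure.
        if (onError != null)
        {
            onError(ex);
        }
    }
}

// library.TryDoWork();                                             // default behaviour, exceptions swallowed
// library.TryDoWork(ex => logger.Warn("Failed: " + ex.Message));   // opt-in notification (hypothetical logger)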

This is a very simplistic example – there are whole frameworks built on the concept of wiring up large collections of Funcs and Actions that are chained together. The Nancy web framework’s Before and After hooks are nothing more than a chain of Funcs that get executed in sequence. In fact, the whole of the OWIN work-in-progress spec for the next generation of .NET webservers revolves around the use of a single “AppFunc”.

Mutator Actions and Funcs that leverage current context

I use Funcs for configuration in just about every library that I own. They’re a great way to allow people to configure the future behaviour of a library, in a repeatable way. In the following example I’m going to create a factory class that will store an Action that it’ll use to modify the type it creates each time Create is called.

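Something like this – Thing and the tenant lookup are hypothetical, and the point is that the Action is captured once and executed on every Create call:

public class Thing
{
    public string Tenant { get; set; }           // hypothetical type being created
}

public class ThingFactory
{
    private readonly Action<Thing> _mutator;

    public ThingFactory(Action<Thing> mutator)
    {
        _mutator = mutator;                      // captured at configuration time
    }

    public Thing Create()
    {
        var thing = new Thing();
        _mutator(thing);                         // deferred configuration runs on every call
        return thing;
    }
}

// At start-up:
//   var factory = new ThingFactory(t => t.Tenant = GetCurrentTenantFromContext());   // hypothetical lookup
// Each Create() call re-evaluates the lambda, so it picks up the *current* tenant.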

This is especially useful if you want to do something like "get a value out of the current request", or some other thing that changes each time the factory is called – if you design your Actions and Funcs right, passing in the "current executing context", you can declaratively define behaviours at start-up that evaluate differently on each execution.

I’ve commonly used these kinds of configuration overrides to do things like “fetch the current tenant from context” or “get the ISession from a session scoped IoC container” – the configuration model of ReallySimpleEventing is a small example of using deferred execution and Actions to override the default behaviour when the library encounters an unhandled exception. A default “Throw all” implementation is provided, with the ability to override by configuration.

Working around problems with classes you can’t control the creation of

I recently had to dip into doing some WCF code – and one of the less desirable parts about working with WCF is that it creates your services for you. What this means is that if you want to use some kind of DI container across your codebase based around constructor injection, you’re out of luck. It’s annoying, and it can ruin your testing day, but with a little bit of legwork you can use Funcs to plug that hole.

Given a class that looks like this, created by some framework magic:

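Something like this, say (all of the type names here are hypothetical):

public class QuoteService : IQuoteService        // hypothetical WCF service
{
    // WCF news this class up itself, so there's nowhere to constructor-inject MyDependency.
    private readonly MyDependency _dependency = new MyDependency();

    public decimal GetQuote(string productCode)
    {
        return _dependency.CalculateFor(productCode);
    }
}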

You can make a publicly settable static Func that’ll act as a proxy to your container, and bind it up at bootstrapping time. That’s a bit of a mouthful, so let me illustrate it:

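A sketch of the same hypothetical service, with a static Func acting as a proxy to the container:

public class QuoteService : IQuoteService
{
    // Settable from bootstrapping (or test) code - a tiny proxy to the container.
    public static Func<MyDependency> DependencyFactory = () => new MyDependency();

    public decimal GetQuote(string productCode)
    {
        var dependency = DependencyFactory();    // resolved at call time, not construction time
        return dependency.CalculateFor(productCode);
    }
}

// At bootstrapping time, point the Func at your container (whatever its resolve call looks like):
//   QuoteService.DependencyFactory = () => container.Resolve<MyDependency>();
// In a test, swap in a fake without touching a container at all:
//   QuoteService.DependencyFactory = () => new FakeDependency();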

In the above example, you can wire up a public static Func<MyDependency> to your container in your bootstrapping code. This means that even if you don’t control the lifecycle of your class, you have a hook to call back to the container and grab a current, valid instance of your dependency, without relying on deep static class usages or service locators. It’s preferable because you can override this behaviour in test classes, giving you a way to test this previously hard-to-test code. This is especially useful if you want to excise references to HttpContext or some other framework-provided static from your code.

Dictionaries of methods for control flow

Here’s a fun example. Let’s say you write a command line app that takes one of five parameters, and you want to execute a different method based on the parameter passed. Simple enough – let’s write an if statement!

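Something like this – the command names and their handlers (Import(), Export() and friends) are hypothetical static methods:

public static void Run(string command)
{
    if (command == "import") Import();
    else if (command == "export") Export();
    else if (command == "purge") Purge();
    else if (command == "archive") Archive();
    else if (command == "help") Help();
}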

A little naff and repetitive, but it’ll do. You could perhaps refactor this to a switch statement:

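The same hypothetical handlers, rearranged into a switch:

public static void Run(string command)
{
    switch (command)
    {
        case "import":  Import();  break;
        case "export":  Export();  break;
        case "purge":   Purge();   break;
        case "archive": Archive(); break;
        case "help":    Help();    break;
    }
}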

This looks a little tighter, but it’s still quite verbose – even for such a small example. With the help of a dictionary and some Actions, you can convert this to a registry of methods – one that you could even modify at runtime.

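And the same handlers again, registered in a dictionary of Actions (this needs using System and System.Collections.Generic):

private static readonly Dictionary<string, Action> Commands = new Dictionary<string, Action>
{
    { "import",  Import  },
    { "export",  Export  },
    { "purge",   Purge   },
    { "archive", Archive },
    { "help",    Help    }
};

public static void Run(string command)
{
    Action handler;
    if (Commands.TryGetValue(command, out handler))
    {
        handler();                               // invoke whichever method was registered
    }
}

// Because it's just a dictionary, entries can be added or swapped at runtime:
//   Commands["import"] = () => Console.WriteLine("imports are disabled today");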

This is a more concise way of expressing the same control flow, and because the registry is just a dictionary, it’s mutable at runtime.

This is just a taste of some of the things you can do when you leverage deferred execution, anonymous methods and the Action and Func types in .NET.  There are plenty of open source codebases that make use of these kinds of patterns in code, so do dig deeper!

How to make it harder for people to steal your users accounts (and data!)

September 3rd, 2014

The internet has been a-buzz with the recent high-profile theft of a lot of salacious celebrity photos. As soon as it transpired that iCloud was allegedly the source of these leaks, many, myself included, feared the worst – some kind of zero-day exploit in Apple’s cloud storage solution.

As it turns out, the vulnerability wasn’t really anything new, just a well coordinated and lengthy exercise in social engineering and password theft. It appears the data that was stolen and subsequently leaked online had been collected and traded over several months, obtained by gaming the password recovery and signup processes of iCloud and, presumably, several other online sites.

So how does it work?

It’s actually fairly simple – the hackers recover the passwords for the accounts of their targets, and then restore any stored files or device images. This is mostly achieved by identifying the valid email address of the target, and then gaming the password recovery systems.

The hackers first need to verify that the email they’ve been given is a valid login credential – they do this by brute forcing signup and password recovery pages on the target website – paying close attention to the login failure messages returned. If a failure message indicates that the login credential is correct, but the password is wrong, the hacker then has 50% of the information required to break into that account.

They’ll then use a social-media-driven attack on password recovery questions, based on publicly available (or privately sourced) data on the target, attempting to guess the recovery answers. If the site they’re targeting isn’t particularly security conscious, as a last-ditch effort they’ll attempt to script a brute force attack to crack the account. This can be very obvious, so it’s not a preferred attack vector.

Any data stolen can subsequently be used to break into other digital accounts owned by the target – people often use the same passwords everywhere, and if hackers break into an email account, they can simply password-reset everything else. Often any contacts or addresses stolen can be used to attack other targets – especially in scenarios where important people have been targeted and the hackers can learn contact information for further potential targets.

Avoid being part of the problem

Here are a few things that you can change in your software to make it less likely that your accounts are hijacked, or that your site is used to exploit somebody else:

Return generic “username or password invalid” responses to authentication failures

Your UX department will probably hate you for this, but it’s a very simple compromise – never tell a user which part of their login credentials failed. Simply returning "Invalid Password" instead of "Invalid Credentials" confirms to a potential hacker that the email address they attempted to use is already registered.

Use usernames for authentication instead of email addresses

Using email addresses during authentication allows a potential hacker to exploit your signup process to verify a valid account. If your site returns a message like "Email already registered" you are confirming the existence of a user account that’s open to exploitation. You can potentially mitigate this by returning a more generic "Email not allowed", but often even that is enough of a positive signal to a potential hacker.

Ensure authentication attempts are rate limited

This is really anti-brute-forcing 101 – don’t let people repeatedly submit a login form without slowing them down after a few failed attempts. Once you’re in the region of 5 or so failed attempts, it’s perfectly acceptable to start making users complete CAPTCHAs – Google offers reCAPTCHA for free – use it. This is a UX compromise, but so long as you only flag suspicious traffic you’re not going to inconvenience many real users. Make sure that your CAPTCHA implementation doesn’t rely on cookies or other client-side state – people writing tools to brute force logins will simply not respect them.

Ensure API authentication attempts are rate limited

Similar to using CAPTCHAs on your website – make sure any API authentication methods are rate limited too. The simplest implementation is to start scaling the time a login attempt takes based on the number of failures. For the first 1-10 attempts, allow the user to log in as fast as they can; after that, start artificially slowing your response times after every failed request from a given IP, client or key – preventing brute forcing of your APIs.
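
A very rough sketch of that idea – in-memory and per-key, so a real implementation would want a shared store, expiry, and some tuning of the numbers:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class LoginThrottle
{
    private readonly ConcurrentDictionary<string, int> _failures =
        new ConcurrentDictionary<string, int>();

    // Await this before processing a login attempt for a given IP, client or API key.
    public Task DelayFor(string key)
    {
        int failures;
        _failures.TryGetValue(key, out failures);

        if (failures <= 10)
        {
            return Task.FromResult(0);                            // first few attempts: full speed
        }

        var delayMs = Math.Min((failures - 10) * 500, 30000);     // then back off, capped at 30s
        return Task.Delay(delayMs);
    }

    public void RecordFailure(string key)
    {
        _failures.AddOrUpdate(key, 1, (_, count) => count + 1);
    }

    public void RecordSuccess(string key)
    {
        int removed;
        _failures.TryRemove(key, out removed);
    }
}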

Ditch all those “password recovery” questions

There aren’t many good ways to deal with password recovery – but the obvious one is to use SMS verification based on real phone numbers. Allowing people to recover accounts based on personal details that are easy to Google is crazy. If the answers to your password recovery questions are available on your users’ public Facebook profiles (date of birth, mother’s name, place of birth, first school, etc.) then you’re just letting your users fall into the pit of failure by accepting them as security questions. Prefer verification by phone or SMS – there are great services out there that make this easy, so prefer them over perpetuating the "password recovery" problem. Alternatives to this are accepting "recovery email addresses" like Gmail does, and verifying a small 0-99p charge made to a registered credit card.

If your users refuse to give you contact information to recover their account, don’t, under any circumstances, hand the account over.

Don’t enforce restrictive password requirements

Enforcing silly "0-9 characters, alphanumeric and at least one special character" style password requirements absolutely reduces the attack space for a password hack and opens your site up to simplistic brute force attacks. Encourage users to set long passwords, or jumbled passphrases. We know passwords aren’t great, but they’re what we have – so the longer the better.

Don’t store passwords in either clear text or a reversible hash

Whilst not directly related, please remember that any kind of clear text or reversible password hash is a terrible idea – if you’re compromised, there’s a good chance that your users’ passwords will be stolen and used to attack more important online properties they own. Play nice, use something like bcrypt.
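
With a library like BCrypt.Net, for example, usage is roughly this (a sketch – check the package’s own documentation for the exact API):

// using the BCrypt.Net package; plainTextPassword is whatever the user typed
string hash = BCrypt.Net.BCrypt.HashPassword(plainTextPassword);    // store only the hash
bool valid = BCrypt.Net.BCrypt.Verify(plainTextPassword, hash);     // compare at login time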

In the vast majority of cases of "hacking a user account", data made publicly available is used to exploit a third party site – so let’s all play nicely and protect our users together, lest the data stolen be something even more valuable than humiliating or salacious photos.

Inverting the FN keys on a Microsoft Wedge Keyboard

August 11th, 2014

Just picked up one of the pretty nice Microsoft Wedge Bluetooth keyboards, and was reasonably incensed when I discovered there was no FN-Lock anywhere on the thing – it’d been hardware-fixed to have my precious function keys used as media player keys. Not very Visual Studio friendly!

Apparently, no official toggle exists, but here’s an AutoHotKey script that works:

Media_Play_Pause::F1
Volume_Mute::F2
Volume_Down::F3
Volume_Up::F4
<+#F21::F5
<!<#F21::F6
<^<#F21::F7
<#F21::F8
PrintScreen::F9
Home::F10
End::F11
PgUp::F12
F1::Media_Play_Pause
F2::Volume_Mute
F3::Volume_Down
F4::Volume_Up
F5::<+#F21
F6::<!<#F21
F7::<^<#F21
F8::<#F21
F9::PrintScreen
F10::Home
F11::End
F12::PgUp

Stolen from http://stackoverflow.com/questions/14996902/microsoft-wedge-mobile-keyboard – which took me an age to find in the Googles, so hopefully this’ll help someone.