
Software Reliability Overview

by Walt Stoneburner

When you buy a car, the dealership will often ask if you're interested in purchasing a warranty. However, a warranty simply means the car is affordable to fix, not that the vehicle is reliable.

When we discuss the reliability of software, we can do so from different roles. As a manager, I might want to know whether the software is good enough to roll out to production and put in the hands of users. As a consumer, I may be more interested in how long the software is going to work correctly until the next failure.

Typically it works like this:
  Step 1. Keep a log of past failures.
  Step 2. Plot the failures.
  Step 3. Determine a curve that best fits the observations.
  Step 4. Measure how accurate your curve model is.
  Step 5. Predict the future using the model.
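
As an illustration of steps 3 through 5, here is a minimal sketch that fits a simple Goel-Okumoto style growth curve, m(t) = a * (1 - exp(-b*t)), to cumulative failure counts. The weekly numbers, the starting guesses, and the choice of model are assumptions made for the example, not something dictated by SMERFS.

    # Fit a Goel-Okumoto style reliability growth curve to a failure log.
    # The weekly cumulative counts below are made up for illustration.
    import numpy as np
    from scipy.optimize import curve_fit

    weeks  = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    faults = np.array([5, 9, 12, 14, 15, 16, 16, 17], dtype=float)  # cumulative

    def goel_okumoto(t, a, b):
        # a = total expected failures, b = fault detection rate
        return a * (1.0 - np.exp(-b * t))

    # Step 3: find the curve that best fits the observations.
    (a, b), _ = curve_fit(goel_okumoto, weeks, faults, p0=[20.0, 0.3])

    # Step 4: measure how far the curve is from the raw data.
    residuals = faults - goel_okumoto(weeks, a, b)
    print("RMS fit error:", np.sqrt(np.mean(residuals ** 2)))

    # Step 5: predict how many failures we expect to have seen by week 12.
    print("predicted failures by week 12:", goel_okumoto(12.0, a, b))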

Estimation [of the model, not future behavior] is finding the right curve for historical data. Prediction is just that: an intelligent guess about what will happen in the future.

Complex software should be less reliable than simpler software. High-level languages should be more reliable than low-level languages. Experienced staff should produce more reliable code than less experienced staff. Proven technologies and processes should be more reliable than untested ones. All these factors, and many more, go into twiddling the shape of the curve.

Software and hardware don't behave the same way. The longer software is used and maintained, the higher the degree of confidence it obtains as defects are found and weeded out. Hardware, on the other hand, ages. Heat, oxidation, and wear take their tolls on hardware, and even a perfectly functioning device (like your old VCR) will eventually fail, despite how well it's treated.

A software reliability curve flattens out over time. A hardware reliability curve is more U-shaped. Since a system consists of software running on hardware, both need to be taken into account concurrently, and for that we have the system reliability model.
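
As a rough example of combining the two: if the mission needs both the software and the hardware to survive, and their failures are assumed independent (a series configuration), the system reliability is just the product of the two. The numbers below are made up.

    # System reliability of a series configuration: the mission fails if either
    # the software or the hardware fails. Assumes independent failures and
    # illustrative numbers for the same mission time.
    r_software = 0.98   # probability the software runs the mission without failure
    r_hardware = 0.95   # probability the hardware survives the mission
    r_system = r_software * r_hardware
    print("system reliability:", r_system)   # 0.931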

Measuring software reliability is more of an art than a hard science. The tricky part is finding a realistic curve. This isn't a matter of making a spline that goes through all the points, because what's really being mapped is a general trend over time. Luckily for us, the field of statistics provides many tools for generating curves and measuring how well they fit a collection of raw data points.

Luckily for people who dislike statistics, are terrible at math, and/or have a phobia of Greek letters, there's a wonderful tool called SMERFS. SMERFS will take collected raw data, analyze it, try various models, tell us which fits best, and predict future behavior based on past trends.

Note that while a model allows for incremental improvements (such as weeding out defects), it typically does not handle major rewrites of code or process. Curves are therefore much better at capturing evolutionary changes than revolutionary ones.

However, the real world does have revolutionary steps such as massive code changes, new versions, and new platforms. In those cases, a model needs to take data from the latest version and weight it more heavily than the historical data of prior versions. Sure, the historical trends can provide an indication of what may happen, but on their own they shouldn't be treated as a prediction.
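
One way to do that weighting, continuing the earlier curve-fitting sketch, is to tell the fitter to trust the newest release's points more; with scipy's curve_fit, smaller sigma values mark more trusted observations. The cut-over week and the weights are assumptions for the example.

    # Re-fit the same illustrative data, but weight the newest release's
    # observations (weeks 6 and later, by assumption) more heavily.
    import numpy as np
    from scipy.optimize import curve_fit

    weeks  = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    faults = np.array([5, 9, 12, 14, 15, 16, 16, 17], dtype=float)

    def goel_okumoto(t, a, b):
        return a * (1.0 - np.exp(-b * t))

    # Smaller sigma = more trusted point; recent data dominates the fit.
    sigma = np.where(weeks >= 6, 0.5, 2.0)
    (a, b), _ = curve_fit(goel_okumoto, weeks, faults, p0=[20.0, 0.3], sigma=sigma)
    print("weighted fit: a=%.1f, b=%.2f" % (a, b))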

The real world can be even harsher. Sometimes there is no data, or it's inaccurate, or it's just plain missing. For this we have to fall back to the original factors that went into the design and make scientific wild-ass guesses from there.

There are lots of models to choose from, and what this really means is that there are lots of equations that express different shaped trend curves. The different models take different things into account with various weightings.

Commonly, it's useful to collect things like the time of failure (or the interval between failures) and the intensity of a failure. Failure intensity is a measure of how badly things went wrong. This can be expressed through some arbitrary ranking, or it can be the aggregate number of failures over a given time period.
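
For example, a raw failure log of timestamps can be rolled up into failures per week, one common way of expressing intensity. The dates below are invented for the sketch.

    # Roll a raw failure log (made-up dates) up into failures per ISO week,
    # one common way to express failure intensity.
    from collections import Counter
    from datetime import date

    failure_log = [
        date(2015, 6, 1), date(2015, 6, 3), date(2015, 6, 4),   # a rough first week
        date(2015, 6, 10),
        date(2015, 6, 22), date(2015, 6, 24),
    ]

    per_week = Counter(d.isocalendar()[1] for d in failure_log)
    for week, count in sorted(per_week.items()):
        print("ISO week %d: %d failures" % (week, count))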

A failure is when the software and/or hardware doesn't do what it should when it should. This is due to a fault.

There's a difference between a failure and a fault. A failure is the car crashing into a tree. The fault is that the steering wheel wasn't screwed on at the factory, that the driver fell asleep while driving, or that the roads were covered in ice. Faults can be product defects, user related, or environment related. A failure is what happens when the fault shows itself during operation.

Note that we're trying to track new, unexpected failures. If we notice the steering wheel isn't screwed on tightly, that doesn't count. If we noticed it before and opt not to fix it right now, we don't count the failure again. Because software shops tend to fix things in batches and then release them, the actual raw data appears as a stair-step. The model's curve, however, demonstrates what an instantaneous fix would look like the moment a fault is discovered.

By measuring the whats and whens of a failure, it's possible to calculate things like the mean time to failure, the number of failures expected by a given point in time, or the point in time at which some specific number of failures will be left to occur.
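
With a fitted curve in hand, those quantities drop out directly: the slope of the curve is the current failure intensity, its reciprocal approximates the mean time to the next failure, and the gap between the curve's ceiling and the current count is the expected number of failures left. The parameter values below are hypothetical, carried over from the Goel-Okumoto form used earlier.

    # Derive common figures from a fitted curve m(t) = a * (1 - exp(-b*t)).
    # The parameter values and current time are illustrative, not real data.
    import math

    a, b = 18.0, 0.35      # hypothetical fitted parameters
    t = 8.0                # current time, in weeks

    seen_so_far = a * (1.0 - math.exp(-b * t))   # expected failures so far
    remaining   = a - seen_so_far                # expected failures left to occur
    intensity   = a * b * math.exp(-b * t)       # current failures per week
    mttf        = 1.0 / intensity                # approx. weeks to next failure

    print("failures remaining: %.1f" % remaining)
    print("mean time to next failure: about %.1f weeks" % mttf)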

It's doubtful any model or methodology is going to be 100% accurate; however, it is possible to get a statistically strong confidence level by which to make a sound decision.
