Achieving Quality when Something is Always Broken

In the quality profession, we are accustomed to thinking about product and component quality in terms of compliance (whether specifications are met), performance (e.g. whether requirements for reliability or availability are met), or other factors (like whether we are controlling for variation effectively, or “being lean” which is realized in the costs to users and consumers). So this morning I attended Ed Seidel’s keynote talk at TeraGrid 09, and was struck by one of his passing statements on the quality issues associated with a large supercomputer or grid of supercomputers.

He said (and I paraphrase, so this might be slightly off):

We are used to thinking about the reliability of one processor, or a small group of processors. But in some of these new facilities, there are hundreds of thousands of processors. Fault tolerance takes on a new meaning because there will be a failure somewhere in the system at all times.
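A rough calculation shows why scale changes the picture. The sketch below assumes every processor is independently unavailable for the same small fraction of the time, which is a simplification (real failures are often correlated). The function names and the unavailability figure are illustrative assumptions, not numbers from the talk.

```python
import math


def prob_any_failed(n_processors: int, unavailability: float) -> float:
    """Probability that at least one of n independent processors is down
    at a given instant: 1 - (1 - q)^n, computed in a numerically stable way."""
    return -math.expm1(n_processors * math.log1p(-unavailability))


def expected_failed(n_processors: int, unavailability: float) -> float:
    """Expected number of processors that are down at a given instant."""
    return n_processors * unavailability


# Assumed for illustration only: each processor is down 0.001% of the time.
ASSUMED_UNAVAILABILITY = 1e-5

for n in (1, 1_000, 100_000, 500_000):
    p = prob_any_failed(n, ASSUMED_UNAVAILABILITY)
    e = expected_failed(n, ASSUMED_UNAVAILABILITY)
    print(f"{n:>7} processors: P(at least one down) = {p:.4f}, expected down = {e:.2f}")
```

Under these assumptions a single processor is almost never down, but with hundreds of thousands of them the chance that something is broken at any given moment becomes large.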

This immediately made me think of society: no matter how much “fault tolerance” a nation or society builds into its social systems and institutions, at the level of the individual there will always be someone at any given time who is dealing with a problem (in technical terms, “in a failure state”). Our programs that aim for quality on the scale of society should take this into account, and learn some lessons from how today’s researchers will deal with fault tolerance in hugely complex technological systems.

It also makes me wonder whether there is any potential in exploring the idea of quality holography. In large-scale systems built of closely related components, is the quality of the whole system embodied in the quality of each individual part? And is there a way to measure or assess this or otherwise relate these two concepts operationally? Food for thought.

5 replies »

  1. Great post Dr. R

    We might characterize this as “planned failure”. I remember reading an article about Google’s data center years ago. They took the “pizza box” approach – thin, rack-mounted systems with modified Linux kernels optimized for search.

    The approach was counter to the current data center design of large scale systems in one monster server (think Sun E10K).

    One of the Google engineers said when a failure occurs, “the systems are so cheap, we pull one out of the rack and replace it with a spare.”

    Another support-related item on the grid topic is how to manage/maintain/troubleshoot failures in a hugely complex technological system.

    Thanks for posting –

  2. I linked to your post here:

    I’m trying to think about differing perspectives of the quality field vs. Enterprise Risk Management. A lot of the ERM people are coming from finance/insurance originally – we expect bad things to happen, and we reserve [i.e. set up money aside] against expected losses and hold risk capital [i.e. set aside even more money] or buy insurance/reinsurance to protect against catastrophic faults.

    However, they’ve extended some of these thoughts beyond finance, though the concept is still that one can put a money value on things going wrong [it takes money to fix things going wrong, after all]. I’m wondering how some of these concepts might be applied to the large systems you’re talking about here.

    One of the biggest problems in ERM is trying to measure/capture operational risk, which has to do with internal processes failing, human error, IT not working right, etc. It seems to me the quality professionals have a lot of experience in this area and can show us some ways to quantify failure costs, or at least find a place to start on this.

    • I had a couple of thoughts reading this reply and your post on actuarialoutpost.com. First off, let me establish that I definitely don’t spend much time doing risk management myself – I know the basics, I can step organizations through the exercise of basically outlining the issues, quantifying the risk exposure financially (kindergarten style, using approaches like Severity x Opportunity x Detection), and outlining mitigation strategies. Last weekend, I was wondering why all of our risk management deals with the bad things we think could happen. Granted, we have to protect ourselves and our projects from these events and understand our contingencies so we don’t have to “think” too much if unfortunate things happen – we execute to our risk management plan.

      But why don’t you ever see anyone doing “wild opportunity management”? Use the same process, but brainstorm all the possible ways something amazingly good might happen, and how you might respond! And then be prepared to capture the financial (social, technological, political, etc.) opportunity if any of these scenarios play out.

      Second: regarding money value for things gone wrong. In terms of risk exposure, that’s pretty interesting. In a system of a million processors, could each of those processors be weighted equally, or would there be ones that are “more important”? I’m not too familiar with the internal topology of a giant machine like that, but I’d guess that if there are nodes with higher centrality, there would be more of a financial value if something went wrong. Would be analogous to assessing money value of a risk on a social network: the hubs would have to be weighted more strongly if the network was scale-free.
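The weighting idea can be sketched in a few lines. This is only an illustration under stated assumptions: node degree stands in for centrality, each node's exposure is scaled by its degree relative to the average degree, and the topology, failure probability, and cost figure are made up. The function names are hypothetical.

```python
from collections import defaultdict


def node_degrees(edges):
    """Count the links attached to each node in an undirected network."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)


def weighted_exposure(edges, failure_prob, base_cost):
    """Expected loss per node, scaled by degree relative to the mean degree.

    A node with average connectivity gets failure_prob * base_cost;
    hubs get proportionally more, leaves proportionally less.
    """
    degree = node_degrees(edges)
    mean_degree = sum(degree.values()) / len(degree)
    return {
        node: failure_prob * base_cost * (d / mean_degree)
        for node, d in degree.items()
    }


# Hypothetical star topology: one hub linked to four leaf nodes.
edges = [("hub", f"leaf{i}") for i in range(4)]
exposure = weighted_exposure(edges, failure_prob=0.01, base_cost=1000.0)

for node, value in sorted(exposure.items()):
    print(f"{node}: expected loss {value:.2f}")
```

With this scaling the total exposure across the network is the same as in the equally weighted case; it is only redistributed toward the hubs.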

      Third: are you aware of the quality costs model? The sense here is that you partition your costs in terms of prevention, appraisal, internal failure and external failure (instead of just relying on external failure). There are several studies that relate the maturity of your operational processes to your distribution of quality costs over these categories. The bible on quality costs is http://www.asq.org/quality-press/display-item/index.pl?item=H1013 – let me know if you want me to ship it to you (I could pick it up next time we head to CT, e.g. late July/early Aug).
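As a minimal sketch of the partition described above, the snippet below takes costs already sorted into the four categories and reports each category's share of the total. The category keys, function name, and dollar figures are assumptions for illustration, not data from any study.

```python
CATEGORIES = ("prevention", "appraisal", "internal_failure", "external_failure")


def cost_distribution(costs):
    """Return each quality cost category's share of total quality cost."""
    missing = [c for c in CATEGORIES if c not in costs]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    total = sum(costs[c] for c in CATEGORIES)
    if total <= 0:
        raise ValueError("total quality cost must be positive")
    return {c: costs[c] / total for c in CATEGORIES}


# Made-up annual figures, in thousands of dollars.
example_costs = {
    "prevention": 10,
    "appraisal": 20,
    "internal_failure": 30,
    "external_failure": 40,
}

shares = cost_distribution(example_costs)
failure_share = shares["internal_failure"] + shares["external_failure"]

for category in CATEGORIES:
    print(f"{category}: {shares[category]:.0%}")
print(f"failure costs combined: {failure_share:.0%}")
```

Tracking how these shares shift over time is one simple way to see whether spending is moving from failure toward prevention and appraisal.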
