
Achieving Quality when Something is Always Broken

In the quality profession, we are accustomed to thinking about product and component quality in terms of compliance (whether specifications are met), performance (e.g., whether requirements for reliability or availability are met), or other factors (such as whether we are controlling variation effectively, or “being lean,” which shows up in the costs to users and consumers). This morning I attended Ed Seidel’s keynote talk at TeraGrid 09 and was struck by one of his passing statements about the quality issues associated with a large supercomputer or a grid of supercomputers.

He said (and I paraphrase, so this might be slightly off):

We are used to thinking about the reliability of one processor, or a small group of processors. But in some of these new facilities, there are hundreds of thousands of processors. Fault tolerance takes on a new meaning because there will be a failure somewhere in the system at all times.
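To put a rough number on that observation, here is a minimal sketch. It assumes that processors fail independently and that each one is up with the same probability; both are simplifying assumptions of mine, not something from the talk. The function name `prob_any_failed` and the 99.99% availability figure are hypothetical and chosen only for illustration.

```python
import math


def prob_any_failed(n_processors: int, availability: float) -> float:
    """Probability that at least one of n independent processors is down.

    Assumes every processor is up with the same probability `availability`
    (0 < availability <= 1) and that failures are independent.
    """
    # P(all up) = availability ** n, so P(at least one down) = 1 - availability ** n.
    # expm1/log keeps the result accurate when availability is very close to 1.
    return -math.expm1(n_processors * math.log(availability))


if __name__ == "__main__":
    # Hypothetical per-processor availability of 99.99%.
    for n in (1, 100, 10_000, 100_000):
        print(f"{n:>7} processors: P(at least one down) = {prob_any_failed(n, 0.9999):.6f}")
```

Under these assumptions, a single processor is almost never down, but with a hundred thousand of them the chance that something is broken at any given moment is very close to one, which is the shift in perspective the quote describes.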

This immediately made me think of society: no matter how much “fault tolerance” a nation or society builds into its social systems and institutions, at any given time some individual will be dealing with a problem (in technical terms, will be “in a failure state”). Our programs that aim for quality on the scale of society should take this into account and learn from how today’s researchers deal with fault tolerance in hugely complex technological systems.

It also makes me wonder whether there is any potential in exploring the idea of quality holography. In large-scale systems built of closely related components, is the quality of the whole system embodied in the quality of each individual part? And is there a way to measure or assess this or otherwise relate these two concepts operationally? Food for thought.