One chapter stands out in the book Reliability Characterization of Electrical and Electronic Systems, edited by Jonathan Swingler. That chapter is Reliability and Stupidity: Mistakes in Reliability Engineering and How to Avoid Them, by contributing author R.W.A. Barnard of Lambda Consulting, Pretoria, South Africa. Following are excerpts from that chapter:
“Unfortunately, the development of quality and reliability engineering has been afflicted with more nonsense than any other branch of engineering. This has been the result of the development of methods and systems for analysis and control that contravene the deductive logic that quality and reliability are achieved by knowledge, attention to detail, and continuous improvement on the part of the people involved.”--Patrick O’Connor
These statements, written by a well-known reliability engineering author, provide an excellent introduction to stupidity in reliability engineering. At the same time, they explain why executing industry-standard reliability engineering activities is no guarantee that high product reliability will be achieved in operations. In order to understand this paradox in reliability engineering, let us consider a few fundamental concepts. We need to understand that all product failures are caused and that all product failures can be prevented. Crosby stated, “All non-conformances are caused. Anything that is caused can be prevented.” Based on this statement and applying common sense to real-life experience, reliability may simply be defined as the absence of failures, and reliability engineering as the management and engineering discipline that prevents the creation of failures.
These simple definitions imply that a product is reliable if it does not fail and that this failure-free state can only be achieved if failure is prevented from occurring. What is required to prevent failure? Firstly, engineering knowledge to understand the applicable failure mechanisms and, secondly, management commitment to mitigate or eliminate them. Proactive prevention of failure should be the primary focus of reliability engineering, and never reactive failure management or failure correction.
Figure 1 shows part of a typical product development process, with an emphasis on design and production verification. The iterative nature of product development is evident. It can also be seen that reliability engineering changes from “proactive” during design (i.e., failure prevention) to “reactive” during production and especially during operations (i.e., failure management). Reactive reliability engineering should be avoided due to the very high cost of the corrective actions that may be required (e.g., redesign and product recalls).
It is important to understand that reliability is a nonfunctional requirement during design and that it becomes a characteristic of a product during operations. “Analysis” and “test” are two primary verification methods used in engineering. Reliability engineering activities to perform reliability analyses and tests are well documented in various textbooks on reliability engineering. Product reliability is the result of many management and technical decisions taken during all product development stages (i.e., concept, definition, design, and production).
The worst mistake in reliability engineering that a company can make is to ignore reliability during product development. For any product of reasonable complexity, the outcome almost certainly will be unacceptable reliability, resulting in an inferior product. In theory, a formal reliability engineering program is not a prerequisite for successful product development, but in practice, it is highly recommended.
Stupidity in reliability engineering implies that some activities can be considered as nonsense, ineffective, inefficient, incorrect, wasteful, etc. To decide which reliability engineering activities fall into this category is not trivial, since it depends on many factors such as product complexity, product costs, technology maturity, and failure consequences.
This chapter discusses various aspects of reliability engineering that should be considered during product development, since many industry-standard practices may be misleading or simply a waste of time and money. An attempt is made to relate these aspects to the ultimate goal of reliability engineering, namely, the prevention or avoidance of product failure. It will become evident that “stupidities” experienced with reliability engineering are related to:
(1) The correct selection of reliability engineering activities
(2) The correct execution of those activities, or
(3) The correct timing for execution of selected activities.
Common mistakes in reliability engineering
The following aspects, discussed in no particular order, are common mistakes that should be avoided in reliability engineering. One mistake is inadequate integration of reliability engineering with product development.
“What is the goal of reliability engineering? We need to distinguish between tasks that are often useful way stations and the ultimate goal. The ultimate goal is to have the product meet the reliability needs of the user insofar as technical and economic constraints allow. The ultimate goal surely is not: To generate an accurate reliability number for the item.”--Ralph Evans
Reliability engineering activities are often neglected during product development, resulting in a substantial increase in the risk of project failure or customer dissatisfaction. In recent years, the concept of design for reliability has been gaining popularity, and this trend is expected to continue for years to come. Reliability engineering activities should be formally integrated with other product development processes. A practical way to achieve integration is to develop a reliability program plan at the beginning of the project.
Appropriate reliability engineering activities should be selected and tailored according to the objectives of the specific project and should be documented in the reliability program plan. The plan should indicate which activities will be performed, the timing of these activities, the level of detail required for the activities, and the persons responsible for executing the activities. Raheja wrote, “Reliability is a process. If the right process is followed, results are likely to be right. The opposite is also true in the absence of the right process. There is a saying: ‘If we don’t know where we are going, that’s where we will go.’”
ANSI/GEIA-STD-0009-2008, Reliability Program Standard for Systems Design, Development, and Manufacturing, can be referenced to develop a reliability program plan. This standard addresses not only hardware and software failures but also other failure causes during manufacturing, operations, maintenance, and training.
ANSI/GEIA-STD-0009-2008 supports a life cycle approach to reliability engineering, with activities logically grouped under the following objectives:
- Understand customer/user requirements and constraints
- Design and redesign for reliability
- Produce reliable systems/products
- Monitor and assess user reliability
The first objective (i.e., understanding of requirements and constraints) is of significant importance. Yet, design teams rarely pay enough attention to formal requirements analysis. Inadequate understanding of requirements frequently results in design modifications and production rework, or, even worse, premature product failure. Complete and correct requirements are therefore necessary at the beginning of product development. These requirements should include all intended environmental and operational conditions. It is not possible to design and produce a reliable product if you do not know where and how the product will be used. It is also bad practice to “copy and paste” requirements from one development specification to another. Rather, spend more time upfront, and avoid costly redesign and rework later.
Given the multitude of reliability engineering activities available, inexperienced engineers may find it difficult to develop an efficient and effective reliability program plan. For example, should failure mode and effects analysis (FMEA) be performed, or fault tree analysis (FTA), or perhaps both? Figure 2 indicates a few relevant questions that may be used to guide the development of a reliability program plan for a specific project.
It may not be immediately evident, but a company’s approach to reliability engineering will depend to a large extent on which definition of reliability it supports. The conventional definition requires much more emphasis on, for example, failure data analysis (using various statistical methods), while the other definitions place more emphasis on in-depth understanding and knowledge of possible failure mechanisms.
Companies supporting the common sense definition of zero failures will invest in technical and management processes to design and produce failure-free products. The following “reliability equation” clearly illustrates the difference between these approaches to reliability engineering:
While it may be useful to calculate, measure, or monitor reliability (i.e., the left side of the equation), it does little or nothing to achieve or improve reliability. Rather, focus on defining and controlling the factors that directly influence reliability (i.e., the right side of the equation) (derived from Carlson).
“Quantification of reliability is in effect a distraction to the goals of reliability.”--Ted Kalal
As mentioned, the conventional definition of reliability requires a major focus on probability of failure. Although useful information can be obtained from proper statistical analyses (e.g., analysis of part failures), in many cases, the reliability effort simply degrades to “playing the numbers game.” This is almost always the case where a quantitative requirement is set by the customer, and the developer has to show compliance with this requirement. Reliability prediction and reliability demonstration are prime examples of activities where quantification of reliability should be avoided during product development.
It is not possible to determine product reliability by merely adding part failure rates obtained from any document (such as MIL-HDBK-217). If you claim to be able to predict reliability, you imply that you know what will fail and when it will fail. If this were possible, why not use that knowledge to prevent the failure from occurring? The only valid reliability prediction is for wear-out failure modes, where the specific failure mechanisms are well understood.
Many reliability engineering activities therefore become part of a historical documentation process, which cannot possibly contribute to higher product reliability. Assumptions are often “adjusted” by the reliability engineer to achieve the desired “result.”
“MTBF is the worst four letter acronym in our profession.”--Fred Schenkelberg
Mean time between failures (or MTBF) is probably the most widely used reliability parameter. It is frequently specified as a requirement in product development specifications, it is published in both marketing and technical product (and even part) brochures, it is discussed by engineers in technical meetings, etc. Yet, when engineers are asked to explain the meaning of MTBF, only a small minority of them actually understand it correctly. Most people, including many engineers, think that MTBF is the same as “average life” or “useful life.” This is of course totally incorrect and frequently leads to inferior design decisions.
Theoretically, MTBF is the mean value of the exponential failure distribution. For time t as an independent variable, the probability density function f(t) is written as:

$$f(t) = \lambda e^{-\lambda t} \qquad (2)$$
The probability of no failures occurring before time t (i.e., reliability) is obtained by:

$$R(t) = \int_t^{\infty} f(x)\,dx = e^{-\lambda t} \qquad (3)$$
For items that are repaired, λ is called the failure rate, and 1/λ is called the MTBF. Using Equation (3), reliability at time t = MTBF can be calculated as only 36.8%. In other words, 63.2% of items will have failed by time t = MTBF. This characteristic of the exponential failure distribution suggests that MTBF may not be a very useful reliability parameter.
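The 36.8% figure can be checked numerically. The short Python sketch below assumes a constant failure rate (the exponential model described above) and evaluates reliability at t = MTBF. The MTBF of 10,000 hours and the names `reliability`, `mtbf_hours`, and `median_life` are hypothetical, chosen only for illustration; they do not come from the chapter. The result does not depend on the MTBF value chosen.

```python
import math


def reliability(t: float, failure_rate: float) -> float:
    """Probability of surviving to time t under a constant failure rate."""
    return math.exp(-failure_rate * t)


# Hypothetical MTBF, for illustration only (not a value from the chapter).
mtbf_hours = 10_000.0
failure_rate = 1.0 / mtbf_hours  # lambda = 1 / MTBF

# Reliability at t = MTBF is exp(-1), whatever the MTBF is.
r_at_mtbf = reliability(mtbf_hours, failure_rate)

# Time by which half of the items have failed: solve exp(-lambda * t) = 0.5.
median_life = math.log(2) / failure_rate

print(f"R(MTBF)                 = {r_at_mtbf:.3f}")
print(f"Fraction failed by MTBF = {1 - r_at_mtbf:.3f}")
print(f"Median life             = {median_life:.0f} hours")
```

Under these assumptions, half of the items have already failed at roughly 69% of the MTBF, which is one way to see why MTBF should not be read as “average life” or “useful life.”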
To efficiently meet any reliability objective requires comprehensive knowledge of the relationships between failure modes, failure mechanisms and mission profile, according to the German Electrical and Electronic Manufacturers’ Association.
Barnard concludes that a product is not reliable because reliability engineering activities have been performed. It is reliable if it was designed and manufactured to be reliable. This requires that the correct reliability engineering activities are correctly executed at the correct time during product development. It is reliable when management provides the necessary resources to enable engineering to achieve the goal of zero failures. Reliability is simply a consequence of good engineering and good management.
Swingler’s 274-page book covers other subjects:
- Physics-of-failure methodology for electronic reliability
- Modern instruments for characterizing degradation in electrical and electronic equipment
- Reliability building of discrete electronic components
- Reliability of optoelectronics
- Reliability of silicon integrated circuits
- Reliability of emerging nanodevices
- Design considerations for reliable embedded systems
- Reliability approaches for automotive electronic systems
- Reliability modeling and accelerated life testing for solar power generation systems
More information on Reliability Characterization of Electrical and Electronic Systems may be found at http://store.elsevier.com/product.jspisbn=9781782422211&pagename=search.