The Reliability Leadership Connection

This book is intended to give a short and concise explanation of Dr. Deming’s principles with practical examples and the principles of John Moubray’s reliability-centered maintenance (RCM2) and draw a clear connection of interdependence between the two. The book’s focus is to show how Deming’s principles support RCM and how RCM supports Deming’s principles.

Furthermore, the book stakes out the position that Moubray’s RCM2 is not just another tool to pick from, but rather establishes the leadership requirements needed for organizational success. RCM2 is an organizational leadership discipline that can bring positive, transformative effects on organizations willing to embrace it.

This book gives organizational leaders and anyone from the boardroom to the shop floor an understanding of what asset reliability and maintenance is truly all about. It also outlines what steps must be taken to transform an organization to succeed in the 21st century.

To Order: Go To Amazon – The Reliability Leadership Connection

Joel Levitt said – “This is a great review of the work of W.E. Deming and its relevance to maintenance today. It also weaves in the work of John Moubray and RCM II. The 14 principles of Deming are well worth restudying because it looks to me that we have not learned much since 1950 when he first published them. Shellogg has a great handle on RCM and introduced a number of ideas that go well beyond the 7 questions and were new to me. Easy reading and clear examples.”

Joyce Nilsson Orsini, PhD, said – “In this book, Jay Shellogg has taken each of Deming’s fourteen points and deadly diseases and, in his own words, interprets what the point means in the world of reliability-centered maintenance. The author cites examples of each point from his own experience in the field of reliability and offers suggestions for improvement in his field.”

Posted in Reliability Centered Maintenance, reliability leadership |

Where RCM Fits in the Maintenance Program

Learn about the rapid deployment of Aladon Reliability Centered Maintenance (RCM2 & RCM3) and where RCM fits in the operational & maintenance process.

Article appeared in the May/June 2018 issue of TAPPI Paper360

Posted in management, RCM, RCM2, Reliability Centered Maintenance, reliability leadership, strategy |

Failure Isn’t Just an Option— It’s Unavoidable Part 2

In Part 2 of this 2-part series, we look at how to use the P-F Curve to determine the point of functional failure. For the full article, see the May/June issue of Paper360°.

In Part 1 (which appeared in the March/April issue of Paper360°) we looked at changing our worldview on what defines failure. I explained the concept of “Functional Failure,” and introduced the P-F Curve (see Fig. 1). So, how does the P-F Curve help us?

We use the P-F Curve to identify what can be done to detect a pending functional failure and to determine what the inspection frequency should be. We base the inspection frequency on one-half the length of the P-F interval. We do this to give us enough time to proactively plan and mitigate the pending functional failure. If half the P-F interval will not give us enough time to proactively plan for corrective maintenance, then we may consider an alternate detection method with a longer P-F interval, or we may choose an inspection interval shorter than half the P-F interval so we have ample time to plan.
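The rule above can be sketched in a few lines of Python. This is only an illustration: the function name, the use of days as the unit, and the worst-case reasoning (a defect becomes detectable just after an inspection and is found one full interval later) are assumptions added here, not part of the original article.

```python
def inspection_interval(pf_interval_days, planning_time_days):
    """Suggest an inspection interval (days) from a P-F interval.

    Hypothetical helper: starts from half the P-F interval and shortens it
    only if the remaining warning time would not cover the planning time.
    """
    if pf_interval_days <= 0:
        raise ValueError("P-F interval must be positive")
    if planning_time_days >= pf_interval_days:
        # No inspection interval can help; the text suggests looking for a
        # detection method with a longer P-F interval instead.
        raise ValueError("planning time is not shorter than the P-F interval")

    half = pf_interval_days / 2
    # Worst case: the defect is found one full interval after point P,
    # leaving (P-F interval - inspection interval) to plan and act.
    longest_allowed = pf_interval_days - planning_time_days
    return min(half, longest_allowed)


if __name__ == "__main__":
    print(inspection_interval(60, 20))  # half the P-F interval is enough
    print(inspection_interval(60, 45))  # interval shortened to leave time to plan
```

Note that criticality does not appear as an input anywhere in this sketch, which mirrors the argument that follows.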

It is important to note that in no way does the concept of criticality influence the frequency of inspection. Criticality’s only role is in determining if the inspection is worth doing. Those two previous sentences are sometimes hard to swallow, because this flies in the face of traditional thinking and probably 95% of the maintenance and reliability industry. But, no matter how critical a system is considered to be, if its Failure Modes that lead to Functional Failure have identifiable P-F intervals, why would we inspect at a rate more frequent than half the P-F interval? Criticality plays no role in inspection frequency.

If half the P-F interval is not long enough to prevent functional failure, we may decide to shorten the inspection interval; however, we commonly find that frequencies are set at intervals much shorter than half of the P-F interval. This is done for a host of reasons: to make us feel better, in response to pressure to “never have another failure”, because it is “best practice”, or because “it’s what everyone else is doing.” The list goes on and on, but none of these reasons are based on a logical, systematic approach. The P-F interval must set the inspection frequency—not criticality, not emotions, not best practices.

Electrical components

So the P-F Curve is fine for mechanical devices, but what about electrical components? They either work or they don’t, right? Well, maybe not.

Let’s go back to that idea of the Functional Failure and instead of a pump, let’s consider a finished products line. It doesn’t matter if it is fine paper converting, tissue, a winder, etc. because I bet everyone who reads this article at one time or another has cycled power on a computer, PLC, or drive to clear a fault. It is also very common to clear that fault and still meet the daily/weekly/monthly production standard. In this example, functional failure is not defined at the drive reset, but when the drive fault disrupts production to such a degree that we do not meet our performance standard. Even if we define Functional Failure at a loss of our production standard, we still must define a P-F Interval.

Back to our earlier question. Electrical systems either work or they don’t, right? I will bet again that most of you have experienced a drive reset only to see the drive fault return: first within weeks or months, then within days or weeks, then within hours to days, and finally not to clear at all. There you go: a P-F Curve. From the first time the drive faults, you enter the P-F Curve at point P. Then, each time you reset the drive you progress down the curve until you reach Functional Failure.

Want more proof? How many times have you had a drive fault that you have experienced in the past, and at the first reset you go to the storeroom and make sure you have a spare drive in stock? I’ll bet many times. Without even knowing what the P-F Curve was, you were still acting on the P-F concept and your experience.

In this discussion, I have not mentioned criticality; however, there is a clear point of relevance. Criticality will play a role in the storeroom stocking decision.

One last point, and I’m very interested in reader feedback on this. Have you ever heard a manager state, “I don’t want to have another drive fail!”? I have heard this pronouncement for many mechanical components, but never electrical. I wonder if it’s because electrical components lend themselves more easily to the concept of random failures, while mechanical failures seem as if they should be time-based, even though both mechanical and electrical failures are most likely random in nature.

The causes of failure

There is another concept that supervisors/planners/engineers, as well as maintainers, know very well—though it is often overlooked by operations managers. This is the concept of how the initiating cause of a functional failure develops.

A bearing defect caused by normal fatigue of a race will develop over a period of time (OAPOT); a bearing defect caused by human error (lubrication, overload, installation, damage, etc.) will occur suddenly. Understanding how a cause of failure develops (OAPOT or suddenly) will determine which strategy we should employ to mitigate it. For causes that develop OAPOT, we will look for a P-F Curve and a usable P-F Interval; but for the causes that occur suddenly, most of the mitigations we should employ will be policy- or procedure-based.

As an example: for the cause of failure related to fatigue we may use vibration analysis, but for a lubrication error we should institute training/standardization/controls. Yet it is very common to determine that a failure has been caused by sudden human error, and then react by implementing a strategy that is appropriate for an OAPOT-type cause.

See if this sounds familiar: A bearing failure occurs. The maintainer reviews previous vibration analysis, which shows no signs of a defect. The vibration team is adamant that their program would catch any normal OAPOT causes of failure, so this failure must be SUDDEN in nature. Yet the management team increases the frequency of vibration analysis instead of digging into what caused the SUDDEN (human error) failure. Now we are wasting time, taking additional vibration readings that are not needed. We degrade the confidence of the vibration team, and we will not prevent this sudden functional failure from recurring.

There is one additional way a cause of functional failure can occur and it has to do with protective devices: I call it “this only matters if” (TOMI). The idea is that a failure of the protective device only matters if the protected function fails. For example, if you have a duty/standby pump, the standby pump only matters if the duty pump fails. Under normal circumstances, you will not know if the standby pump works unless you need it. The same holds true for other protective devices such as smoke detectors, spare tires, alarms, or light curtains. For TOMI-type functional failures, we must understand and determine how often the duty system is failing, how often the protective device is failing, and how often we are willing for both devices to be functionally failed (remember, “never” is unattainable). With these three pieces of information, we can calculate how often we must check our protective devices.
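The article does not give the calculation itself. As a rough sketch, the Python below uses a linear approximation often used in RCM practice for a single protective device: check interval = 2 × (mean time between failures of the protective device) × (mean time between failures of the protected function) ÷ (tolerable mean time between multiple failures). The function name, parameter names, and example figures are assumptions for illustration, and the approximation is only reasonable when the resulting unavailability is small (a few percent).

```python
def failure_finding_interval(mtbf_protective, mtbf_protected, mt_between_multiple):
    """Approximate how often a protective device should be checked.

    All three inputs must share one time unit (e.g. years):
      mtbf_protective     - how often the protective device fails
      mtbf_protected      - how often the duty (protected) system fails
      mt_between_multiple - how often we tolerate both being failed at once
    Assumed linear approximation, not a formula quoted from the article.
    """
    for value in (mtbf_protective, mtbf_protected, mt_between_multiple):
        if value <= 0:
            raise ValueError("all inputs must be positive")
    return 2 * mtbf_protective * mtbf_protected / mt_between_multiple


if __name__ == "__main__":
    # Made-up figures: standby pump fails about once in 8 years, duty pump
    # about once in 10 years, and losing both is tolerated once in 1000 years.
    years = failure_finding_interval(8, 10, 1000)
    print(f"check the standby pump roughly every {years * 365:.0f} days")
```

If the result is impractically short, the inputs show the other levers: a more reliable protective device, a less failure-prone duty system, or a deliberate decision about what multiple-failure rate is tolerable.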

So there you have the concepts every supervisor/planner/engineer must understand for real reliability to begin:

  1. The P-F interval
  2. That criticality plays no role in inspection frequency
  3. The new definition of failure: Functional Failure
  4. The different types of causes of failure: OAPOT, Suddenly, or TOMI

If you have any questions or comments about the concepts I have outlined in this article I would really like to hear from you. Feel free to contact me directly by e-mail or through the editor.


Posted in Uncategorized |

Failure isn’t just an option – Failure is unavoidable

Failure isn’t only an Option – It’s unavoidable

In this article, we explore a new definition of failure and how to mitigate its consequences.

In Part 1 of this 2-part series, we’ll look at changing our worldview about what defines failure.

JAY SHELLOGG

Many years ago, while completing my degree in engineering, I was forced to take a class in philosophy. To this day I can’t give a good reason for why this was required; most likely it was to provide some needed funding to that department. Two concepts have stuck with me from that class, though they never seemed of any value until I began working in the realm of reliability.

The first concept is that our “view of the world” influences our perceptions and actions. Basically, our professor taught that we are all biased (in the statistical sense) by our experiences. A great example of this is in the debates between Albert Einstein and Niels Bohr. Bohr was a physicist who worked in quantum mechanics where events seem strange and random. Einstein’s work was in belief and pursuit of order. Einstein believed that the entire universe was ordered, patterned, and predictable; he once stated, “the Almighty does not play dice with the universe.” Bohr’s reply was basically “who are you to tell the Almighty how to run the universe?” Each man’s “view of the world” undoubtedly influenced his actions. Einstein’s view was order, and Bohr’s view was probability.

The second concept I took away from that philosophy class was that if something had a beginning, it must have an ending. If there is a creation, there must be a destruction; if there is a start, there is a finish.

That brings us back to the title of this article: failure is unavoidable. Assets and systems are designed, built, installed, operated/maintained, and fail. The only thing we can do is prolong the life cycle and/or avoid the consequence of failure. Anyone working in the realm of reliability in the pursuit of preventing failure is on a fool’s errand. I know this flies in the face of much reliability industry thinking, because most folks are trying to stop things from failing, but consider your personal health. Everyone who reads this article will die; the question is whether it will be in minutes, days, or decades. All we can do is take actions that prolong our lives, and put things in place to mitigate the consequences when our time comes. That is what true reliability is all about: prolonging asset life and mitigating the consequence of the inevitable failure.

Functional Failure

Over my years in the paper industry, I have heard machine managers demand, “I don’t want another bearing to fail.” This sets up an impossible demand on the organization. Every bearing will fail no matter how loud the pronouncement against failure or how many threats are made against the operating/maintenance crews. I bet many of you reading this article have lived through rash demands made by frustrated managers. So how do we deal with these pronouncements?

It’s not easy. These types of demands come from an emotional reaction typically based in a lack of understanding of the degradation process, and from stress applied on managers from upper managers who are just as uninformed. Unfortunately, the task of educating these folks flows to the supervisors/planners/engineers. These folks are the “lowest common denominator” in any mill because they’re sandwiched between the hourly work force and management. As a former mill manager I knew once put it, these people make up “that thin line that is between a rock and a hard place.”

The task I set before them is to be vocal in the face of direct management hostility. This is a tough thing to ask, but if we don’t stand for what is right, who will? All it takes is guts, confidence, knowledge, and a little bit of a cavalier attitude. Every supervisor I have ever known possesses all of these traits. Hopefully, what I have said to this point sets the scene for where I’m going: we need to give the supervisor/planner/engineer a little more knowledge, confidence, and guts about addressing the inevitable failure that will come.

Where does that leave us, if failure is something we cannot prevent? We must develop a new definition of failure. We need to define a type of failure that can be prevented. We call this “functional failure.” This is a failure we can avoid.

Simply stated, Functional Failure is a point at which the user/owner has drawn a line that they do not want the degradation to cross. For example, let’s consider pump performance. If we have a pump that can pump 400 gpm, but all we want is 300 gpm, then our functional failure is when the pump can no longer pump 300 gpm. Maybe it can pump 299 gpm, but that is less than the performance standard we have set. So at 299 gpm, we say the pump is functionally failed. Once functional failure is defined, we look for indications that the pump is heading towards functional failure. We might look for a decrease in flow rate below 400 gpm, but above 300 gpm. This way we have time to prevent the functional failure of falling below 300 gpm.
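The pump example can be written down as a small check. The sketch below is illustrative only; the function name and the three state labels are assumptions, while the 400 gpm capability and 300 gpm performance standard come from the example above.

```python
def pump_state(flow_gpm, required_gpm=300.0, capability_gpm=400.0):
    """Classify a measured flow against the user's performance standard.

    required_gpm is the line the user/owner has drawn (functional failure
    lies below it); capability_gpm is what the pump can deliver when new.
    """
    if flow_gpm < required_gpm:
        return "functionally failed"   # e.g. 299 gpm against a 300 gpm standard
    if flow_gpm < capability_gpm:
        return "degrading"             # early warning: time to plan a repair
    return "healthy"


if __name__ == "__main__":
    for flow in (400, 350, 300, 299):
        print(flow, "gpm ->", pump_state(flow))
```

The "degrading" band between 300 and 400 gpm is where an inspection can give warning before the functional failure occurs.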

As for the paper machine bearing discussed earlier—when the manager pronounces, “I don’t want another bearing to fail” we must ask that manager how we define when the bearing has failed. The most common response I’ve heard is, “when it won’t turn!”—but that’s too late.

Okay, I know that everyone has jumped ahead of me and is thinking about all the technology/techniques we apply to bearings to detect their degradation (vibration, ultrasound, broomsticks, etc.). The most important take-away at this point is to see that most components will follow a degradation curve. This degradation curve is known as the P-F Curve.

In Part 2 of this column (scheduled for the May/June issue of Paper360°), I’ll discuss the P-F Curve and how to use it to determine the point of functional failure. I’ll also look at different causes of failure and how they develop. If you have any questions or comments about the concepts I have outlined in this article I would really like to hear from you.


Posted in Inspections, management, RCM, RCM2, Reliability Centered Maintenance, reliability leadership, strategy |

Reliability & Safety Inspections

Getting the most from safety inspections

TAPPI Ahead of the Curve


I enjoyed reading Graeme Rodden’s article “Know the State of Your Emergency Showers” in the Jan/Feb issue of Paper360°. I read it with a sense of urgency, and reached out to a colleague of mine, Carlo Odoardi, to help me write this response.

The first step in correcting any problem is recognizing that you have a problem. In the article, Rodden quotes a presentation made by Larry Kilian of Haws at the recent PPSA conference. The article lays out the problem perfectly: “on average, only 25 percent of emergency shower/eye wash stations work properly and can provide proper first aid – despite weekly or monthly checks.”

In his presentation, Kilian calls out many possible reasons for non-conformance of emergency shower/eye wash stations, including such things as obstructed access, too much or too little flow, or no dust covers. To be sure, there are many more causes of failure, but determining those is better left to the Subject Matter Experts (SMEs), the users of the systems.

In this article, I instead want to address two points: 1) How to determine what we must do to ensure our system does what we want it to do; 2) What tolerable availability we are willing to accept, and what actions get us there.

Tolerable availability
Let’s discuss the second point first. An emergency shower is a very special type of industrial device, because it is a protective device. What it protects is our most precious resource: people. The thing about protective devices is that their failure only matters if the function they’re protecting also fails. That is, the emergency shower only needs to work in the event that someone gets contaminated. This is no different than a smoke detector, which only needs to work in the event of a fire. This sets up a unique situation: these devices must be periodically tested with a failure-finding task to ensure they will work.

One way we can arrive at the testing frequency is by understanding what availability we require and what the failure rate is. Let us start with what I believe is a straightforward idea: The more often we check if something is working, the higher we drive its availability.

As an example, you cannot know if an E-Stop works unless you check it. If you check once a month, all you can be assured of is that it was working (or not) the last time you checked. The more often you check (say daily), and repair any failures, the less time it will be in a failed state. Said another way, if you check it monthly, the longest it can be failed is 30 days; if you check it daily, the longest it can be failed is 24 hours. The more checks, the more availability.
Now think back to Rodden’s article, which tells us that only 25% of all showers work properly, even with weekly or monthly checks. That means that, checking every week or month, we find 3 out of 4 showers that are NOT working. With a little bit of applied statistics we can determine that, based on a failure rate of 75% with weekly or monthly inspection, we will achieve a maximum availability of only 63%. This means that our emergency shower will work only 63% of the time.
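The article does not show the arithmetic behind the 63% figure. One simple approximation that lands close to it assumes that a device found failed at a check broke at a random moment since the previous check, so on average it was down for half the interval. The Python sketch below applies that assumption; the function name is hypothetical and the method is a plausible reading, not necessarily the calculation the author used.

```python
def average_availability(fraction_found_failed):
    """Estimate average availability from the share of devices found failed.

    Assumption (not from the article): failures occur at uniformly random
    times between checks, so a device found failed has been unavailable for
    half the check interval on average.
    """
    if not 0.0 <= fraction_found_failed <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return 1.0 - fraction_found_failed / 2.0


if __name__ == "__main__":
    # 3 out of 4 showers found not working at each check
    print(f"{average_availability(0.75):.1%}")
```

With 75% of showers found failed, this gives 0.625, in line with the roughly 63% quoted above. The same expression shows why lowering the fraction found failed (better inspections) raises availability without checking more often.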

The question is, “IS THIS GOOD ENOUGH?” Maybe, but not likely. We know that the more often we check something, the higher we drive its availability, but what can we do if increasing the rate of inspection is not practical? We change the failure rate, as the article suggests, with better inspections. Not necessarily more, but better. There is something else we can do—decrease the demand rate on the shower. This means that we train our employees and maintain our process to diminish the need for the shower.

Getting the system to do what we want
This brings us back to point 1): How to ensure our system does what we want it to do. This is how we improve failure rates.
On any given site, most emergency showers are of a similar age and construction; as a plant expands, new showers are typically added in batches. So while the new showers may be different from the old ones, they are similar to their contemporaries.

We tend to think that similar equipment can be managed the same way; however, causes of shower failure in the woodyard, bleach plant, recaust, or paper machines are all different. Have you ever seen a shower in the woodyard caked with lime dust, or a shower in the basement of a paper machine buried in fines, or a shower in the bleach plant obstructed with core buggies? Even though showers are very similar, when they are operated in different parts of the mill the maintenance strategy must be tailor-fitted for each area. In my experience, mills inspect all of the site’s identical showers with the same frequency, using the same parameters. We need to be prescriptive to each shower. This will improve our availability.

This sounds complicated and tedious, but it’s really not. Here is how we do it:

1. We gather a team of the potential users of the shower and anyone else who has something legitimate to say about its management. These are the SMEs.
2. The team builds an understanding of the operating context where the shower is located.
3. The team defines the functions the shower must perform.
4. The team defines the point at which the shower does not meet its required functions (functional failure).
5. The team identifies what causes the shower to fail (failure modes).
6. The team describes the effect of the failure (failure effect).
7. The team determines an appropriate proactive maintenance strategy, or
8. The team determines if a default maintenance strategy is appropriate.

Let’s look back at the reasons cited in the article—no dust covers—and run it through the steps above. One possible solution might be as follows:

• Function: The primary function of an eye wash is to provide wash water of a flow rate greater than 0.4 gal/min and less than 0.5 gal/min.
• Functional Failure: The eye wash does not provide wash water at all.
• Failure Mode: Dust covers left off following flow testing.
• Failure Effect: Following periodic flow testing, dust covers are not reinstalled. Airborne dust settles and accumulates in nozzles, eventually plugging the discharge. This only matters if an employee needs their eyes washed. Time to repair, 1.25 hours; cost of repair, $87.50.
• Proactive Maintenance: No scheduled maintenance.
• Default Action: Add “dust cover reinstallation” to the flow check test sheet.

It is important to recognize that the solution above is for example only and can in no way be taken as prescriptive. With each step there are several possible solutions depending on the operating context of the eye wash, and this article cannot possibly predict operating context. For example, if the failure mode had been “dust covers missing due to degradation of the plastic”, under “failure effect” we might have noted that, over time the plastic degrades, cracks propagate through the cover, and the cover falls off. In this case, with the help of the team, we have a proactive maintenance task: we look for the cracks beginning to develop. We would set our inspection frequency for half the time it takes between recognizing the cracks forming and the cover dropping off.

Finally, only the subject matter experts—the users/maintainers/owners of the showers—can determine this strategy. Consultants, contractors, and engineers can help—but only the mill’s SMEs are the true experts.

This process at first does seem overwhelming, but a team learns how to process the steps quickly. Ultimately, the output is the best maintenance strategy money can buy.

Thanks to Graeme Rodden and Larry Kilian for the interesting article and presentation that prompted this reply. I also want to thank Carlo Odoardi of COCO NET Inc. for helping me with this article, performing a peer review, and editing. If you have any questions or would like any details around this process please feel free to contact either Carlo or Jay.

Posted in Inspections, management, RCM, RCM2, Reliability Centered Maintenance, reliability leadership, Safety, strategy, Uncategorized |

Criticality’s use in setting inspection frequencies…

Below is a response to a question I once received about the relationship between criticality and inspection frequency:

(Editor’s Note: In the July/August issue of Paper360° magazine, Jay Shellogg presented an article on the necessity to understand culture change management, and proposed his 9 principles of reliability, which resulted in a number of questions from readers. The following is his response to a question from a member of senior leadership within a large pulp and paper company.)

Q: In your sixth bullet point you state, “It is vital to take into account how a failure’s consequence affects safety, environment, quality, and/or production, and under no circumstance allow consequence of failure to determine frequency of inspection.” We could not rationalize this one. Can you elaborate? It seems reasonable to focus inspection resources (time, frequency, etc) more so on the critical assets where failure is a high cost.

A: I have heard this particular topic discussed several times this year. As it relates to inspections, failure consequence or criticality should only be used to determine if a route/inspection is worth doing. The frequency of inspection is determined by a different set of principles, specifically the P-F interval (bullet point 2), the Failure Finding Interval (bullet point 7), and the 20% of failures that are time based (this is the reciprocal of bullet point 1).

As an example, over the years I have interacted with several heavy industries. In assessing their understanding of reliability, I would ask them about their vibration analysis program; specifically, how often they ran their “typical” vibration routes. In every case, I received basically the same reply: we run most of our routes monthly, although there are a few things we look at more frequently as time allows.

Okay, this is where it gets tricky. I would then ask them how they decided on measuring frequencies monthly, to which they usually responded in one of the following ways:

  • That’s what the vendor suggested when we bought the vibration analyzer.
  • That’s what our sister mills were doing when we started our vibration program.
  • We checked with other industries and it seemed like most were doing it monthly.
  • We started out quarterly but had some failures, so we increased the frequency.

Ironically, I never had anyone give me a response based on engineering or scientifically-based principles. Most frequencies were based on a rule of thumb, someone else’s idea, or a directive.


Posted in management, reliability leadership, strategy |

The Maintenance Department vs the Maintenance Function

How much time is your maintenance department spending on the maintenance function? Maybe not as much as you think. Over my years working and talking with maintenance and mill organizations, I have often heard that they don’t have enough people to get all the work done. When I have investigated this complaint, what I have found is that people in the maintenance department are only spending a portion of their time on the maintenance function.

July/August issue TAPPI Paper360


Posted in management, reliability leadership, strategy |

Stop Swapping Your Pumps – TAPPI Paper360

Does anyone out there run two pumps in parallel so that if one fails you can swap to the other pump with little or no consequence? The reliability of duty/standby pumping systems is explored in the January/February issue of Paper360.
TAPPI Paper360

Posted in management, reliability leadership, strategy |

ROI: Reliability Orchestrated Improvement

Many years ago, we were introduced to the world of continuous business improvement methodologies like: LEAN, Kaizen, SMED / TAKT, 5S, 6-Sigma / SPC / TPM / TQM, 5-Whys / RCFA / RCA, FMEA / FMECA / Risk Analysis / Pareto / PMO / MTA, Fishbone Diagrams, Kanban / PdM/CBM, Poka-Yoke, Just in Time (JIT), MRO / Critical Spares Analysis, Culture Change Management / KPIs / Balanced Scorecard, and probably several more that we have not mentioned.

Our work with all these methods was in the realm of trying to get more production from a piece of equipment or process with no additional cost, and ideally less. These methods were taught and applied in a ‘tool box’ fashion, meaning that each methodology was just one approach in the grand scheme of approaches, to be selected and applied where and when appropriate. However, this is where there was always a struggle: which method to pick, and why?

Now, depending on the problem, the selection of the first method was generally straightforward. For example, say we want to increase the production output of a product packaging line. Using LEAN and measuring output over a period of time is going to be critical to us, which leads us to TAKT and/or SMED analysis.

The problem was that we would always run into some other challenges to be overcome such as ‘waste’.  But what kind of waste? …Overproduction? …Underproduction? …Off-quality? …Process waste? …and so on.  Now, each of these wastes would drive us to the other methodologies…

  • In the case of overproduction, we would look into JIT to reduce inventory and Work In Process/Progress (WIP) in the manufacturing process.
  • For underproduction, we would have to find out the reasons ‘why?’ we were under-producing, which tended to be many. Is it because of delays in the production process? Or is it because of the unnecessary activities of workers, assets, or materials in the production process? That would drive us to brainstorming and fishbone diagramming. We often found that people made mistakes or the machinery was not running well, and that would lead to 5-Whys/Root Cause Failure Analysis (RCFA) with an FMEA / FMECA to capture the results.
  • If we had high Process waste or a high reject rate, we might look at Kaizen and TPM coupled with a 5S program. If we were making off-quality product, then we would enter the realm of 6 Sigma and TQM in an effort to eliminate defects.

The problem with the work we did for many years in continuous improvement was that there was not a fundamental guiding strategy of how these very different (yet useful) methodologies all fit together.  To say it another way – All of these methodologies are just ‘Tactical’ tools in the proverbial toolbox.

Now, there’s an old expression that says: “If all you have is a hammer, everything looks to you like a nail!” So, suppose you have a screw that needs to be driven in. You could hit it like a nail, but most of us know that is not a good solution. So, what if you use the tip of the claw end of a claw hammer as a flat screwdriver?

You probably aren’t using the best tool for the job, even though it may work somewhat.  So, what would cause a person to use a hammer for a screw driver?  Here are some possible answers:

  1. A lack of a proper screw driver and the hammer was ‘handy’, or
  2. The lack of a principled understanding of how each tool in the toolbox fulfills a specific purpose and how each tool is interconnected.

What we’re getting at is this: most of us were never taught, and had never seen, how to ‘strategically’ apply the continuous improvement methods.  The continuous improvement methods were always applied ‘tactically’ at individual problems. They were not applied ‘holistically’ in a disciplined and principled strategic plan.

We need to tackle not just a single problem, but an array of issues that are uniquely related to an entire process, system or subsystem.  It was not until our introduction to the Aladon RCM2 methodology that we found a strategic model that formally orchestrates all of the continuous business improvement methods together into a comprehensive strategic framework.

In an RCM2 analysis, the RCM2 facilitator behaves like a conductor leading an orchestra through a complicated musical score.  Depending on what’s required at the time, the RCM2 facilitator will focus on a particular continuous business improvement method over all the others.  Then, as the analysis Team proceeds and a different need arises, the RCM2 facilitator may shift to another continuous business improvement method.

Best of all, RCM2’s usage of these continuous business improvement methods is done ‘organically’ within the RCM2 process, so that individuals not formally trained in LEAN, TQM, 5S, Kaizen, 6 Sigma, etc. will learn how to use these methods naturally as part of the RCM2 process.  As long as you are being led through an RCM2 analysis by a Certified Aladon RCM2 facilitator trained by a Certified Aladon RCM2 Practitioner, no additional training is required for the analysts beyond the Aladon 3-day RCM2 Introductory course.

The remainder of this discussion will be to generally describe why and where some of these continuous improvement methods are utilized in the process of the RCM2 analysis.

Kaizen is a compound Japanese word of ‘Kai’ & ‘Zen’.  Kai having many meanings but one being: “to change or restore”.  Zen means “good” or “better”.  So, loosely translated, Kaizen means, “Better Change” or “Good Restoration”.  Generally, the principles of Kaizen include the following:

  • The employees working closest to a problem are the subject matter experts, so they are the best equipped to solve any problems that arise.
  • Act now, even if the improvement is only small. Some improvement is better than the status quo.
  • A multidisciplinary team approach to problem solving will result in the best solutions.
  • Management must support and empower the team to take action, and give them a clear mandate.

These Kaizen principles are inherently supported in RCM2.  The RCM2 analysis is always performed by those closest to the production process. The RCM2 Group is made up of a multidiscipline team and fundamentally needs management support for activities.

However, RCM2 takes the process a few steps further.  For instance, RCM2 establishes at least nine principles of reliability that must be understood before any process or system can be improved with sustainable results.

Furthermore, RCM2 takes the idea of employee empowerment and involvement to a higher level of sophistication.  Kaizen efforts that we were involved in over the years always struggled to sustain the initial results.  This was primarily due to the outcome of the Kaizen being based on an incorrect understanding about the nature of the equipment and how operators and maintainers interact and behave with it.

Insight about equipment nature and our behavior with it is given to the user through the RCM2 process. RCM2 provides an asset-focused context, which helps us reduce or eliminate the consequences of failure to a safe minimum.

SMED & TAKT are time-related measures or ‘indicators’ used to define the ‘pulse’ of a company’s operations.  These two measures are founded on production task completion and production cycle time.

SMED stands for Single Minute Exchange of Die and is a measure used to determine the minimum time it takes to ‘change’ a machine over to run a different product.  TAKT comes from the German word for a musical beat or measure, the tempo a conductor’s baton keeps for an orchestra.  So the TAKT sets out the tempo of the needed rates of production.

Both SMED & TAKT are very much alive within the body of RCM2.  They are found at the first level of the RCM2 analysis – the establishment of the function performance standards.  Any asset or system is purchased and put into operation to fulfill a function.

For example:  The primary function of a milling machine might be: ‘To finish mill a work piece to a depth of 0.500 inch ± 0.050 inch’.  A secondary function of the milling machine might be: ‘To retool the milling machine in not more than 3 minutes by a normally skilled & trained operator.’

At this point in RCM2, we would establish the minimum and maximum performance standards that are associated with SMED and TAKT measures.  So SMED & TAKT analyses are natural ingredients of the RCM2 process.
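To make the idea of a performance standard concrete, here is a minimal sketch in Python that encodes the two milling machine functions above as limits and reports which ones a set of measurements fails to meet. The class and function names are our own illustrative choices and are not part of RCM2 or any Aladon tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceStandard:
    """A function's performance standard as lower/upper acceptable limits."""
    name: str
    lower: float
    upper: float
    unit: str

    def is_met(self, measured: float) -> bool:
        return self.lower <= measured <= self.upper


# Primary function: finish mill to 0.500 inch +/- 0.050 inch.
MILL_DEPTH = PerformanceStandard("finish-mill depth", 0.450, 0.550, "inch")
# Secondary function: retool in not more than 3 minutes.
RETOOL_TIME = PerformanceStandard("retool changeover", 0.0, 3.0, "minute")


def functional_failures(measurements: dict) -> list:
    """Return the names of the standards that the measured values fail to meet."""
    standards = {s.name: s for s in (MILL_DEPTH, RETOOL_TIME)}
    return [name for name, value in measurements.items()
            if not standards[name].is_met(value)]


# A good cut, but a four-minute changeover, is a functional failure
# of the secondary function only.
print(functional_failures({"finish-mill depth": 0.52, "retool changeover": 4.0}))
```

Writing the limits down this way is what lets a later step say, without argument, whether the asset has ‘functionally failed’.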

The Five S’s often show up in RCM2 during the function, functional failure, failure mode or failure effect development stages in the analysis…

  1. Sort,
  2. Scrub / Shine / Sweep,
  3. Systematize / Set / Straighten,
  4. Standardize and
  5. Sustain

More precisely, the lack of a workplace that is sorted, straightened, scrubbed, systematized, etc. is often identified during RCM2 failure mode or failure effect development.  Function development will define the all-important performance standards for the equipment sub-system.

Also, since RCM2 is a living program, a reference task is added in the CMMS System to regroup the RCM2 analysis team at least once annually to review the asset’s analysis for any changes. This routine work order will ensure continuous improvement, sustainability of the RCM2 process and, ultimately, the reliability of the physical asset sub-system.

Finally, 5S shows up in the RCM2 default actions where our review group has identified a credible failure mode, but no PM task to address it.  Take our milling machine, where the retool changeover takes four minutes instead of three because operators are missing steps in the setup process.  No preventative maintenance can address this failure mode, but an RCM2 default action calling for a standardized check sheet or standard operating procedure will.

6-Sigma / SPC / TQM / TPM
When too much rework or a high scrap rate exists, it may be warranted to consider 6-Sigma within an SPC (Statistical Process Control) context.  The overarching objective would be to use these in a larger TQM / TPM (Total Quality Management / Total Productive Maintenance) program to reduce or eliminate defects and off-quality product.  (See Kanban below, as it may be used to trigger signals using N.W.A.C. as a status indication that product quality is drifting.)

RCM2 entirely satisfies 6-Sigma and SPC as it inherently supports the definition of an SPC system (see page 27 of RCMII). This is done by using the P-F curve as the means to communicate the Potential failure to Functional failure relationship. (More about P-F curves another time.) See below for a simple example of a number of production runs and how the normal distribution can ‘drift’ off-spec:

[Figure: 6-Sigma / SPC / TQM / TPM. Production runs showing the normal distribution drifting off-spec]
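As a rough illustration of how drift can be spotted in the numbers, the sketch below sets control limits from a baseline of in-control measurements and flags later measurements that fall outside them. It is a simplified assumption on our part: real SPC charts usually estimate the spread from subgroups or moving ranges, and the function names and the three-sigma default are illustrative only.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Lower and upper limits at k standard deviations around the baseline mean."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - k * spread, centre + k * spread


def out_of_control(samples, limits):
    """Indices of the samples that fall outside the control limits."""
    lower, upper = limits
    return [i for i, x in enumerate(samples) if x < lower or x > upper]


# Baseline milling depths (inch) from runs known to be in control.
baseline = [0.50, 0.51, 0.49, 0.50, 0.52, 0.48, 0.50]
# Later runs drifting upward.
later_runs = [0.50, 0.52, 0.53, 0.55, 0.56]

limits = control_limits(baseline)
print(out_of_control(later_runs, limits))
```

The last two runs are flagged even though the product may still be within tolerance, which is the point: the warning comes before the functional failure.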

RCM2 also supports TQM and TPM since Aladon RCM2 Practitioners train and certify RCM2 facilitators to write Operating Contexts. The Operating Context delineates many things, not the least of which includes the following:

  • Company management commitment to not only the RCM2 pilot initiative, but a long-term reliability improvement organizational strategy.
  • The new Culture Change mindset using MoC (Management of Change) and other change management techniques to measure and track results,
  • The RCM2 Team empowered to lead and implement the proactive reliability program, (Especially the rites and rituals – see Culture Change),
  • Goals to be achieved: more production/uptime, less cost/safety/spills, build morale, knowledge harvesting for future trades, etc.,
  • A thorough description of the physical asset sub-system: …batch or flow? …redundancy? …quality, environmental or safety standards? …shift schedules? …inventory? …labor repair time/costs? …critical spares? …market demand? …material supply? …process documentation? …etc.

5-Whys / RCFA / RCA
5-Whys and RCFA (Root Cause – Failure – Analysis) are related methods used throughout industry.  5-Whys repeatedly asks “Why?” to explore the reasons behind a defect / failure we are interested in resolving. The objective is to keep asking until we find the Root Cause of the problem. It is generally observed that you need to ask “Why?” five times to arrive at the root-cause of the problem.

However, practically speaking, what if we found that asking “Why?” only three times sometimes arrives at the failure’s root-cause? And yet, the next time, it takes seven questions of “Whys?”

Most folks know that asking “Why?” only a couple of times may lead to superficial and sometimes dangerous results. However, we have found that the main challenge of RCFA is the avoidance of ‘Analysis Paralysis’. Why? …because if you ask why enough times, you will always arrive at ‘Creation’!

We have seen 5-Whys being used prescriptively. That is, purposely and blindly asking “Why?” five times, whether or not it is warranted because they think they should. We have also known others to just ask enough times until they are satisfied the root-cause they have found will adequately resolve the problem within some reasonable conditions.  The key is to know when to stop.

Aladon RCM2 facilitators are trained to know when to stop, and the same goes for RCFA.  There is, however, a distinct difference between RCM2 and traditional RCFA.

In a traditional RCFA approach, one or maybe a few, likely root-causes of failure are sought for a single piece of equipment.  In contrast, RCM2 is zero-based and seeks to find ALL credible failure modes (root-causes) for a process or sub-system, not just a few associated with one piece of equipment.

FMEA / FMECA / Risk Analysis / Pareto / PMO
The FMEA / FMECA (Failure Modes and Effects – Criticality – Analysis) are the means to capture the results of the RCFA / RCA. RCM2 innately includes this and extends its usefulness by also including the asset sub-system’s Functions and Functional Failures in the Information Worksheet. A genuine wealth of optimal trades’ knowledge of the asset / sub-system in a very compact ‘one-stop-shopping’ format!

Furthermore, the failure modes on the FMEA / FMECA are tied back to the original functions’ performance standards, as detailed earlier in SMED and TAKT measures. This is done by identifying, prior to failure mode development, when a process or sub-system has ‘Functionally Failed’ to meet its user’s demands (performance standards).

Now, Risk Analysis is the process of using the corporate Risk Matrix (see below, Consequence vs. Probability) to determine which assets / sub-systems are critical to the organization and which are less so. Failure of critical assets / sub-systems usually leads to very serious failure consequences like safety incidents, environmental breaches, releases to the air or nearby rivers, streams, lakes, etc., or may cost us an exorbitant amount of money to keep the plant running. The goal is to perform a reliability improvement initiative on the critical assets / sub-systems to reduce their Risk to the organization.

In the RCM2 process, a Pareto analysis (the 80/20 rule or Asset Prioritization) is always performed on the plant’s physical asset Performance Report. Our guideline is to select focus candidates from the top 20% of physical assets / sub-systems, which account for 80% of our annual spend. These so-called Bad Actors are ‘eating our lunch’, so-to-speak.

Commonly, RCM2 analyses are performed on the top 20% of all the assets / sub-systems because of RCM2’s thoroughness in finding all the likely Failure Modes. In other words, these are the critical physical assets in the organization that, should they fail, will likely put us out of business or severely cripple us.
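The Pareto selection described above can be sketched in a few lines of Python. This is a minimal example under our own assumptions: the asset names and annual costs are invented for illustration, and the 80% share is simply the guideline figure from the text.

```python
def bad_actors(annual_cost_by_asset, spend_share=0.80):
    """Return assets, costliest first, until they cover `spend_share` of total spend."""
    total = sum(annual_cost_by_asset.values())
    ranked = sorted(annual_cost_by_asset.items(),
                    key=lambda item: item[1], reverse=True)
    selected, running = [], 0.0
    for asset, cost in ranked:
        if running >= spend_share * total:
            break  # the assets chosen so far already cover the target share
        selected.append(asset)
        running += cost
    return selected


# Hypothetical annual maintenance spend per asset (thousands of dollars).
costs = {"pump A": 500, "gearbox B": 350, "fan C": 80,
         "valve D": 40, "motor E": 30}
print(bad_actors(costs))
```

Here two of the five assets account for 85% of the spend, so they become the focus candidates for an RCM2 analysis.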

[Figure: corporate Risk Matrix, Consequence vs. Probability]

The other 80%, the non-critical assets, will have a PMO or MTA (Preventive Maintenance Optimization, Maintenance Task Analysis) performed on them, which is less rigorous than RCM2. PMO or MTA evaluates an existing preventive maintenance program, evaluates its effectiveness, looks for critical omissions, synergies, opportunities and then repackages the results into a more effective program. (i.e. PM routes)

Nevertheless, it is worthwhile noting that world-class organizations choose to perform RCM2 analyses on ALL their physical assets!  This way, there’s minimal risk of missing any failure modes that result in failure consequences.

Fishbone (Ishikawa) Diagrams
Developed by Dr. Kaoru Ishikawa, this is another popular technique used to identify possible causes for a problem or defect.  This diagramming method groups possible causes of failure into the 6-Ms (categories) of production – Manpower, Methods, Measurement, Material, Machinery, Milieu (Environment).  Or, into the 4 categories of administration – Personnel, Policies, Procedures, Plant.

Fishbone diagramming is sometimes used in RCM2 when identifying likely failure modes to ensure that none of the above categories are missed. However, RCM2 offers a slightly different approach to the identification of the categories.  RCM2 looks for reasonably likely (i.e. credible) failure modes that:

  • Are currently being prevented by a preventative maintenance program
  • Have occurred on the same (or similar) equipment
  • Have not occurred yet but are considered real possibilities
  • May not be likely to occur, but have consequences that affect safety or the environment.

The philosophy of the RCM2 process is fundamentally different from Fishbone diagramming in another important way.  In RCM2, it is not necessary to list all the intermediate failure modes (i.e. the bones of the fishbone diagram). Some consider drawing the other failure causes ‘in-between’ as insightful or interesting. However, in RCM2, we are concerned with productivity. As such, we document only the root causes that can lead to failure consequences.

In this way, documenting root-causes in RCM2 can be faster than Fishbone Diagramming.

Kanban / PdM / CBM
Kanbans are simple signals or status indicators that are typically used to start a supply chain replenishment or manufacturing process.  However, Kanbans can also be used to alert operators and maintainers that an action is required based on the condition of the asset. This is the thrust of CBM – Condition-Based Maintenance / Monitoring. (See also MRO section)

PdM (Predictive Maintenance) is the technological means by which we gather the condition of the asset for CBM and Kanban reporting. PdM tools are used to identify patterns in collected data. The goal is to detect the start of asset failure with enough advance notice to mitigate safety, environmental or economic consequences. This can include vibration analysis, infrared thermography, ultrasound, lube analysis, NDE / NDT (Non-Destructive Examination / Testing), human senses inspections, and so on. These may use a variety of readily available condition monitoring techniques: Dynamic, Particle, Chemical, Physical, Temperature, and Electrical. These methods are so relevant to RCM2 that John Moubray included over 100 in his world-class, best-selling textbook, “RCMII” (see Appendix 4).

An example might be lubricant min / max fill lines associated with a sight glass on a gearbox.  N.W.A.C. offers a simple, effective and sustainable way to implement Condition-Based Maintenance (CBM) tasks from Work IDs and Trades’ knowledge. Define asset health indicators, each with Normal, Warning, Alarm and Critical (N.W.A.C.) states, according to the following:

N – Normal (FULL – No action)
W – Warning (¾ FULL – Record & continue to monitor fill condition)
A – Alarm (½ FULL – Schedule work order to refill at next available downturn)
C – Critical (¾ EMPTY – Contact Maintenance for immediate refilling)
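The N.W.A.C. states above lend themselves to a simple lookup. The sketch below is one possible reading, not a prescribed rule: the text gives fill marks rather than ranges, so we assume that a level at or below a mark triggers that mark’s state, and we express the level as a fraction of full (¾ empty being 0.25). All names are illustrative.

```python
from enum import Enum


class State(Enum):
    NORMAL = "N"
    WARNING = "W"
    ALARM = "A"
    CRITICAL = "C"


# Actions taken from the N.W.A.C. list for the gearbox sight glass.
ACTIONS = {
    State.NORMAL: "No action",
    State.WARNING: "Record & continue to monitor fill condition",
    State.ALARM: "Schedule work order to refill at next available downturn",
    State.CRITICAL: "Contact Maintenance for immediate refilling",
}


def classify_fill(level: float) -> State:
    """Map a fill level (0.0 empty to 1.0 full) to an N.W.A.C. state.

    Assumption: a level at or below a mark triggers that mark's state.
    """
    if not 0.0 <= level <= 1.0:
        raise ValueError("fill level must be between 0.0 and 1.0")
    if level <= 0.25:
        return State.CRITICAL
    if level <= 0.50:
        return State.ALARM
    if level <= 0.75:
        return State.WARNING
    return State.NORMAL


state = classify_fill(0.5)
print(state.value, "-", ACTIONS[state])
```

An operator on rounds only needs to read the sight glass; the state and the action that follows are already agreed.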

Another example might be to look for loose mounting base bolts on the gearbox and if any of the washers are seen ‘dancing’, action is taken to correct the problem.  In these examples, the use of a Kanban finds its way into RCM2 at the action plan level and possibly at the default action as well.

A Poka-Yoke is an error-proofing tool that minimizes or prevents failure consequences.  Here is a common example:  At a filling station, the diameter of the diesel pump nozzle is larger than the unleaded fueling receptacle in standard vehicles.  This prevents someone from inadvertently fueling up with diesel in an unleaded gas car.  Also, a 120 VAC electrical outlet will not accept a 240 VAC plug style.

In RCM2, Poka-Yoke shows up in Action Plans.  In an action plan, an operator or maintainer may be given a tool or gauge used to check the wear of a component like a belt sheave.  This is where Poka-Yoke finds its way into RCM2.

Before the 1980s, typical manufacturing processes kept their production lines moving by using WIP, to ensure any problems with upstream processes do not affect the downstream activities.  These temporary stockpiles of partially finished / assembled products to draw from kept the production line going while the upstream problem was being fixed by maintenance.

With JIT, WIP is eliminated to reduce inventory – a capital cost to the business.  Often, finished product is no longer stored at the manufacturing site.  The idea is to produce goods, which get immediately shipped to the customer, continuously, without intermediary warehouse storage.

RCM2 fully supports the JIT model because it defines the performance standards in the function statements articulated by the users in the manufacturing plant.  Furthermore, during failure management strategy development, RCM2 identifies the necessary skills to maintain those performance standards using a proactive PM task that is technically feasible and worth doing.

Ultimately, this leads to much improved asset utilization and lower capital costs (since less inventory exists), which results in improved financial performance.

MRO / Critical Spares Analysis
MRO (Maintenance Repair and Operation/Overhaul) and Critical Spares Analysis are used to determine appropriate spare parts inventory levels. However, most of this work is based on historical failure rates and risk tolerance of a related failure.

RCM2 brings the concept of MRO and Critical Spares Analysis to a pinnacle.  It offers the user a result that is process-driven through knowledge and logic that is wholly defensible.

To do this, we must understand how an asset fails at the failure mode level.  With an asset’s Failure Modes formally documented, it then becomes possible to apply one of two major categories of maintenance strategies to the asset:

  • A Condition-Based Maintenance (CBM) strategy for assets such as bearings, which fail randomly but are eligible for a proactive task for the detection (i.e. it gives warning signs) of the given Failure Mode (via the plant’s choice of proactive options such as inspections, PdM technologies, PM, etc.). The chosen proactive task results in an understanding of the state of the given Failure Mode at the point in time of the inspection.
  • A No-Scheduled Maintenance (NSM) strategy for assets such as electronic devices and other complex kit like pneumatics and hydraulics, which fail randomly with little or no notice.

Each of these maintenance strategies requires a different approach to the storeroom decision of whether or not to stock a spare.

  • CBM Strategies (See also the Kanban / PdM/CBM section)

In order to ensure the CBM collection frequency is correct, RCM2 uses a time horizon at the individual failure mode level called the P to F Curve (see below). P is the point of Potential failure and F is the point of Functional Failure.  If an asset inspection discovers the state of P, such that a corrective task is required, then the remaining time to point F (called the ‘Nett’ P-F time) must provide an adequate corrective task time horizon.  This P to F time horizon is formally documented during an RCM2 analysis.  Some simple math then determines the Nett P-F time, which is the remaining time available to the Maintenance Planner once a given P point is detected.

With the remaining time to functional failure clearly understood and available for comparison against the vendor’s spare part lead time, an informed stock vs. no-stock decision can be made.  For example, a 4 to 5 day vendor lead time on the spare part required for a failure mode with a 2 year P-F interval leads to a clear no-stock decision on the spare.
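The ‘simple math’ can be sketched as follows. We assume the usual definition, in which the Nett P-F time is the P-F interval less the inspection interval (the worst case being that the potential failure appears just after an inspection). The one-year inspection interval in the example is our own assumption, as are the function names.

```python
def nett_pf_days(pf_interval_days, inspection_interval_days):
    """Worst-case time left to act once a potential failure is detected."""
    if inspection_interval_days >= pf_interval_days:
        raise ValueError("inspect more often than the P-F interval, "
                         "or the potential failure may be missed")
    return pf_interval_days - inspection_interval_days


def should_stock(pf_interval_days, inspection_interval_days, lead_time_days):
    """Stock the spare only if the vendor cannot deliver within the Nett P-F time."""
    nett = nett_pf_days(pf_interval_days, inspection_interval_days)
    return lead_time_days >= nett


# The example from the text: 2-year P-F interval, 5-day vendor lead time,
# with an assumed yearly inspection.
print(nett_pf_days(730, 365))     # days available to the Maintenance Planner
print(should_stock(730, 365, 5))  # False: a clear no-stock decision
```

With roughly a year in hand and a five-day lead time, the part can be ordered when the warning appears rather than sitting on a shelf.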

[Figure: MRO / Critical Spares Analysis. The P to F Curve]

  • NSM Strategies

If the failure mode is random (as upwards of 80% of failure modes are) and the P-F curve is of no practical use, such as in the failure of many electronic devices, then on what technical basis can a no-stock spare decision be validated?  The answer is found in the RCM2 analyses with two additional pieces of data captured in the Failure Modes:

  • The Consequence of the failure.
  • The Statistical Probability of the Failure (SPF) occurring derived from MTBF data.

Once determined, the SPF and MTBF are compared against the vendor lead time, cost of downtime/repair, cost of the part including expediting, and the number of asset locations that the part will spare.   Based on this information a probability cost model is calculated, which determines when (in number of years), for a population of installed identical parts, the cumulative probability of failure will equal or exceed 50%.  It is this number of years to reach the 50% probability of failure of the population that is used to make a cost decision of whether stocking the part is cheaper than the downtime cost of not stocking it.

Once developed, the cost model requires only the following data inputs to provide the stock/no-stock answer:

  1. Unit cost to purchase
  2. MTBF in number of years
  3. Number of identical running units
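The time-to-50% part of such a model can be sketched under two assumptions of ours that the text does not spell out: that random failures follow an exponential distribution with rate 1/MTBF, and that ‘failure of the population’ means at least one of the identical units has failed. The cost comparison itself is not specified above, so it is left out; function names are illustrative.

```python
import math


def population_failure_probability(mtbf_years, units, years):
    """Probability that at least one of `units` identical parts fails within `years`.

    Assumes independent, exponentially distributed (random) failures.
    """
    return 1.0 - math.exp(-units * years / mtbf_years)


def years_to_cumulative_probability(mtbf_years, units, probability=0.5):
    """Years until the population's cumulative failure probability reaches `probability`."""
    if mtbf_years <= 0 or units < 1 or not 0.0 < probability < 1.0:
        raise ValueError("need mtbf_years > 0, units >= 1 and 0 < probability < 1")
    return -mtbf_years * math.log(1.0 - probability) / units


# Hypothetical case: 4 identical units, each with a 20-year MTBF.
t50 = years_to_cumulative_probability(mtbf_years=20, units=4)
print(round(t50, 2), "years to a 50% chance of needing the spare")
```

Note how the horizon shrinks as the number of installed units grows: one spare covering many locations is far more likely to be needed soon than the MTBF of a single unit suggests.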

RCM2 brings process, logic, and knowledge to bear on the problem of MRO and Critical Spares Analysis, which has historically been driven by ‘emotional spares’ (spares stashed away in toolboxes and lockers ‘Just in case’, due to lack of confidence in the Storeroom, Stores personnel, the work process, CMMS, etc.), ‘gut feelings’ or because “That’s not the way things are done around here!”. (See Culture Change next)

Culture Change Management / KPIs / Balanced Scorecard
Culture change management has only recently come into its own as a standalone discipline.  Most of the popular beliefs around managing culture change are to:

  • Simply state the new order of things,
  • Put process in place to check that the new order is being adhered to, and
  • Manage anyone who deviates from the plan.

Unfortunately, it is not quite that simple, because culture change deals with people and people are very complex.  A heavy-handed, top-down ‘positional authority’ approach will only meet resistance to the change.  Now, the resistance may be hidden from the measures, but believe this: the resistance will be there!

People need to have a ‘compelling reason’ to change or at least a good reason to give the new order a chance.  RCM2 wholly satisfies this requirement to give its participants that need for change. What’s more, RCM2 sets up a new cultural infrastructure of the new order of things:

  • Providing maintainers, operators, supervisors, etc. with a common set of personal and organizational values with a common ‘language’
  • Teamwork builds morale through collegial sharing/learning work issues, i.e. greater safety and environmental integrity and operating performance
  • Widespread ‘pride of ownership’ since the Team devised the PM tasks
  • A clear view of resources needed: time, trades, spares, tools, materials…
  • Real empowerment as they execute THEIR resulting PM routes/programs
  • Establishment of a new set of rules guiding correct behavior (Rituals) and events/activities that reinforce the correct behavior (Rites)

Physical asset / sub-system performance and PM task completion must be measured as a general rule in business management. The objective is to identify gaps between current performance and expected / desired performance. Ultimately, we use this ‘delta indication’ as a measure of progress towards closing the gaps. Well-chosen KPIs (Key Performance Indicators) highlight what areas of the business operation (production, maintenance, technical, etc.) need action to improve business performance.

Once these KPIs are implemented, measured, analyzed & optimized, then an organization has an important opportunity to integrate them into a Balanced Scorecard format. This crucial method enables us to compare the value of a company’s Financial performance with its Customer Satisfaction performance, Learning & Growth performance and Internal Business Processes performance. Using RCM2 fully supports the Balanced Scorecard’s goal to align business activities with the company’s vision, mission and strategy.

It is important to note that a company’s cultural behaviour can be modified through the Goal Achievement Model. The model puts in place KPI measures for day to day activities that support organizational initiatives, which realize corporate goals that are aligned with the company vision / mission.

Note that RCM2 is critical from the Culture Change Management / KPIs / Balanced Scorecard perspective because it helps identify what KPIs must be constructed from Function performance standards, Kanban signals, PdM/CBM indicators, etc.

At the end of the day, the ‘tools in the toolbox analogy’ is valid, once you realize that RCM2 IS the toolbox.


Written By:
Carlo Odoardi  and Jay Shellogg
Principal Members, The Aladon Network

Tel: (905) 536-0865
Strategic Maintenance Reliability LLC
Tel: (903) 293-3539

More information on Aladon and the Aladon Network can be found at: .


[1] Stephen J. Thomas, “Improving Maintenance & Reliability Through Cultural Change”, © 2005, Industrial Press Inc., New York, USA

[2] Ricky Smith, Bruce Hawkins, “Lean Maintenance – Reduce Costs, Improve Quality and Increase Market Share”, © 2004, Elsevier Butterworth-Heinemann, Burlington, MA, USA

[3] Ramesh Gulati, “Maintenance & Reliability Best Practices”, © 2009, Industrial Press Inc., New York, USA

[4] John Moubray, “Reliability-Centered Maintenance”, 2nd Ed., © 1999, Butterworth-Heinemann, New York, USA

[5] John D. Campbell & James Reyes-Picknell, “Uptime: Strategies for Excellence in Maintenance Management”, 3rd Ed., © 2015, CRC / Productivity Press, New York, USA


From Reactive to Proactive

My experience has taught me that most folks working in the pulp & paper industry don’t have an understanding of what it takes to change a reactive maintenance culture to a proactive and reliable one.  Two (2) key areas must be addressed for reliability to take root and be sustainable:  1. Mill Culture, and 2. The understanding of the principles of reliability.  To read more, see the July/August issue of TAPPI Paper 360.
