Preventing software-related root causes in control systems from being overlooked or misdiagnosed streamlines operations.

Given that 57% of lost-time injuries occur on the rig floor and drilling control system software manages 75% to 95% of the movement and operation of drill floor equipment, it is clear that software needs to be addressed on today’s high-specification assets.

Due to the industry-wide shortage of software-experienced personnel, however, many drilling contractors and operators lack the in-house knowledge to identify all potential software-related failure modes or conduct thorough root cause analyses. Consequently, when equipment failures occur on high-specification drilling assets, control systems-related issues can be overlooked as potential root causes.

The safety and reliability of systems can be improved through incident investigations that identify and address both software- and hardware-related issues.

Focusing on ‘why’

Often, offshore drilling incident analysis is focused on what happened and how it happened. For example, a pipe drops because the racker malfunctioned. But in order to determine the changes needed to prevent an incident from happening again, operators must also ask why the failure occurred. Was there a mechanical issue with the racker? Was there a problem with the system the driller uses to control pipe handling operations? Was the failure related to both hardware and software?

The most effective incident investigation plans evaluate failures holistically using root cause analysis that accurately assesses both software- and hardware-related factors.


Causal factors flowcharting is a useful method for analyzing failures that occur during drilling operations. (Images courtesy of Athens Group)

From theory to practice

Effective root cause analysis includes four major steps: data collection/observation, data analysis, root cause identification, and recommendation/implementation.

A typical scenario involved an offshore drilling asset where three load control cycles were completed normally. During the fourth, the motor reached the set point speed but continued to accelerate. The operator attempted to stop the load via the control interface. When this attempt was unsuccessful, he activated the motor emergency stop (eStop). Despite the fact that the operator activated the eStop, the load continued to accelerate, eventually colliding with an object in its path.

Data collection/observation. Data collection/observation can be accomplished using either statistical methods or timeline/sequence of events methods. For incidents that occur on offshore drilling assets, observation and timeline or sequence of events methods are generally used because there usually is not enough historical process data available to support statistical methods. To assess this example scenario, data was collected from operator and witness interviews, equipment log data, and the equipment post-mortem investigation.
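A timeline/sequence of events can be assembled by merging records from each data source and ordering them by time. The following is a minimal Python sketch of that idea, not the method used in the investigation. The `Event` type, the `build_timeline` helper, the source labels, and all timestamps are hypothetical; only the order of events follows the scenario described above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One observation from a single data source (hypothetical structure)."""
    timestamp: datetime
    source: str        # e.g. "equipment_log" or "operator_interview"
    description: str


def build_timeline(*sources):
    """Merge events from several sources into one list ordered by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


# Illustrative data only: the timestamps are invented for this sketch.
log_events = [
    Event(datetime(2000, 1, 1, 10, 0, 0), "equipment_log",
          "Motor reaches set point speed"),
    Event(datetime(2000, 1, 1, 10, 0, 2), "equipment_log",
          "Motor continues to accelerate past set point"),
]
interview_events = [
    Event(datetime(2000, 1, 1, 10, 0, 4), "operator_interview",
          "Operator attempts stop via control interface"),
    Event(datetime(2000, 1, 1, 10, 0, 6), "operator_interview",
          "Operator activates eStop"),
    Event(datetime(2000, 1, 1, 10, 0, 9), "witness_interview",
          "Load collides with object"),
]

timeline = build_timeline(interview_events, log_events)
for event in timeline:
    print(event.timestamp.time(), event.source, event.description)
```

Keeping the source label on every event makes it easy to see where log data and interview accounts agree or conflict once they are placed on a single timeline.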

Data analysis. The data analysis method must correlate with the data collection method. For example, it is not useful to perform a statistical analysis on small amounts of timeline data. Likewise, it would not be useful to analyze a large amount of historical data using a timeline method. Causal factors flowcharting generally is the most useful method for analyzing failures that occur during drilling operations.

Root cause identification. From the causal factors flowchart, three primary causes can be identified:

1. The motor continued to accelerate past the tripping rpm set point;

2. The variable frequency drive (VFD) tripped, zeroing the torque and regenerative braking; and

3. The emergency friction brakes failed.

The single linear path nature of the causal factors flowchart identifies the load continuing to accelerate past the programmed set point as the single triggering event for the incident. Ishikawa (fishbone) analysis of potential root causes for the motor not holding the programmed set point indicates that the example scenario’s root cause was software-related: the control systems software did not include a set point calculation that accounted for the motor’s ability to produce enough torque to control the load at a given rpm value.
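Because the flowchart forms a single linear path, the triggering event can be found by walking back from the outcome until no earlier cause remains. The Python sketch below illustrates that walk under stated assumptions: the `caused_by` mapping and the `triggering_event` function are hypothetical, and the ordering of the links is one illustrative reading of the three primary causes, not a reproduction of the actual chart.

```python
# Each event maps to the event that directly preceded it in the chain.
# None marks the start of the chain. The wording and ordering are
# illustrative assumptions based on the three primary causes in the text.
caused_by = {
    "Load collided with object":
        "Emergency friction brakes failed",
    "Emergency friction brakes failed":
        "VFD tripped, zeroing torque and regenerative braking",
    "VFD tripped, zeroing torque and regenerative braking":
        "Motor accelerated past tripping rpm set point",
    "Motor accelerated past tripping rpm set point":
        None,
}


def triggering_event(chain, outcome):
    """Walk a linear cause chain back from the outcome to its first event."""
    seen = set()
    current = outcome
    while chain[current] is not None:
        if current in seen:
            raise ValueError("cause chain contains a cycle")
        seen.add(current)
        current = chain[current]
    return current


print(triggering_event(caused_by, "Load collided with object"))
```

The triggering event is where the fishbone analysis then starts: it asks why the motor did not hold the programmed set point.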

Even though the root cause was determined to be software-related, there were still hardware-related factors. The VFD tripped, and the emergency friction brakes failed.

Recommendation/implementation. Once the root causes of the incident are identified, appropriate software- and hardware-specific recommendations can be provided and implemented.

Root cause analysis recommendations for this example scenario include:

Establishing a set point calculation that takes into account the motor’s ability to produce enough torque to control the load at a given rpm value and incorporating this set point calculation into the control systems software; and

Installing new emergency friction brakes that have the stopping power required to stop the load in an emergency acceleration.
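The first recommendation can be illustrated with a short Python sketch. It is not the calculation implemented on the asset. It assumes an idealized motor whose available torque is constant up to a base speed and falls in proportion to 1/speed above it, a known load torque, and a safety margin. All function names, parameters, and numbers are hypothetical.

```python
def available_torque(rpm, rated_torque_nm, base_rpm):
    """Idealized torque limit: constant up to base speed, then ~1/speed."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    if rpm <= base_rpm:
        return rated_torque_nm
    return rated_torque_nm * base_rpm / rpm


def max_safe_setpoint(load_torque_nm, rated_torque_nm, base_rpm, max_rpm,
                      margin=1.25):
    """Highest rpm at which the motor can still supply margin * load torque."""
    required = load_torque_nm * margin
    if required > rated_torque_nm:
        return 0.0  # the motor cannot control this load at any speed
    # Above base speed: rated * base / rpm >= required
    # which rearranges to rpm <= rated * base / required.
    limit = rated_torque_nm * base_rpm / required
    return min(limit, max_rpm)


def clamp_setpoint(requested_rpm, load_torque_nm, rated_torque_nm, base_rpm,
                   max_rpm, margin=1.25):
    """Limit a requested speed set point to what the motor can control."""
    safe = max_safe_setpoint(load_torque_nm, rated_torque_nm, base_rpm,
                             max_rpm, margin)
    return min(requested_rpm, safe)


# Hypothetical values: 1,000 Nm rated torque, 1,000 rpm base speed,
# 3,000 rpm mechanical limit, 400 Nm load torque.
print(clamp_setpoint(2500.0, 400.0, 1000.0, 1000.0, 3000.0))
print(clamp_setpoint(1500.0, 400.0, 1000.0, 1000.0, 3000.0))
```

With these example values, a requested set point of 2,500 rpm is limited to 2,000 rpm, the speed at which the available torque equals the load torque plus margin, while a request of 1,500 rpm passes through unchanged. A real calculation would use the motor and drive manufacturer's torque-speed data rather than this simplified curve.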

Ishikawa (fishbone) analysis of potential root causes indicates a software-related cause for the incident described in the scenario.

Benefits of holistic incident investigation

In the example scenario, the incident was determined to have both hardware- and software-related causes. Without a comprehensive incident investigation, replacing the brakes and the VFD might have appeared to be the only appropriate correction. But when the incident is assessed using comprehensive root cause analysis methods, the software-related root cause becomes visible and can be addressed.

The most immediate use for root cause analysis is to identify the drilling operation components that are determined to be at the root of the failure. Implementing corrections to specific drilling components is certainly necessary, but comprehensive incident investigation also enables operators to dig deeper to mitigate future failures across their fleets. The root cause(s) of incidents that occur onboard drilling assets can typically be traced back to one of two factors:

Poor design, testing, and installation; or

Weak ongoing maintenance processes/procedures.

Prevention is still the best cure

Once a company has identified the true root cause, it can revisit the asset lifecycle to determine where corrections and risk mitigation initiatives should be implemented during newbuild/refurb projects or on other assets that are currently in operation. For example, many equipment failures can be traced back to software-related issues that were not addressed during the design phase (e.g., in the concept of operations and/or contractual language). Future incidents and subsequent root cause analyses often can be prevented by implementing lessons learned early in the lifecycle of other assets.