Status Report

NASA Return to Flight Task Group Final Report: Annex A.2 Individual Member Observations

By SpaceRef Editor
August 17, 2005

Return to Flight Task Group Final Report

A.2 Observations by Dr. Dan L. Crippen, Dr. Charles C. Daniel, Dr. Amy K. Donahue, Col. Susan J. Helms, Ms. Susan Morrisey Livingstone, Dr. Rosemary O’Leary, and Mr. William Wegner

Taken one-at-a-time, the RTF TG assessments of the NASA implementation of the CAIB return-to-flight recommendations may leave an impression of accomplishment that we believe does not present a comprehensive picture of NASA’s return-to-flight effort. Without a doubt, we share with NASA the same fervent desire to see the Space Shuttle Program successfully continue as a healthy, vibrant tribute to the achievements of human spaceflight. To this end, although it was not within the explicit charter of the Return To Flight Task Group, we have documented additional observations relevant to the post-Columbia environment that we believe are important to share with NASA leadership to help them address what we perceive to be continuing challenges. This is not a set of conclusions, but is a detailed summary of persistent cultural symptoms we observed throughout the assessment process.

We agree that the improvements to the Space Shuttle and its organization are real, and often significant. This is a tribute to the dedicated efforts of many people working hard at all levels and in all parts of the Agency. At the same time, we believe that the leadership and management climate that governed NASA’s return-to-flight effort was weak in some important ways that bear discussion. While we explicitly address the Space Shuttle return-to-flight effort, we believe these organizational and behavioral concerns are still pervasive throughout the human spaceflight programs. These observations are not intended as criticism of the entire NASA workforce. We have stated several times – in this report and elsewhere – that within the “working levels,” much of the NASA and contractor workforce “got it” and we believe at least some have always gotten it. And, indeed, there are some capable leaders at NASA who also “get it.”

Our observations also are not meant to diminish the achievements made in addressing the individual CAIB recommendations. The workforce performed to the best of its ability, often with little direction. We commend their efforts and recognize their accomplishments. We also believe, however, that leadership and managerial shortfalls generally made the return-to-flight effort more complicated, more costly, and lengthier than it needed to be.

The Rogers Commission and the Columbia Accident Investigation Board (CAIB) reports are both rich in explanation of factors that have weakened NASA’s ability to effectively manage a high-risk program. Yet while NASA leadership was focused on the 15 CAIB return-to-flight recommendations, they missed opportunities to address the enduring themes of dysfunctional organizational behavior that the CAIB and other external evaluators have repeatedly found. As a result, in our view, many fundamental concerns persist. Our intent here is to present some of the most prominent of these that we observed. The advantage of hindsight, and the opportunity to second-guess decisions made since February 2003, permeates these observations. All of them were, however, written prior to the launch of STS-114. It is also important to recognize that the behaviors and attitudes described here were not chance occurrences that were observed only once or twice, but that emerged numerous times throughout the Task Group’s interactions with NASA. The intent of these observations is to help NASA leadership identify and rectify these concerns. We will address four main areas: rigor, risk, requirements, and leadership. At the conclusion of our discussion, we cite specific examples to support and clarify our observations.

Rigor

“Rigor” refers to the scrupulous adherence to established standards for the conduct of work. In NASA’s context, the safe and reliable execution of high-risk, complex technical endeavors requires the rigorous and consistent understanding of, and adherence to, standard process. These processes should be enforced across all projects and elements, and preferably even across programs. Implementing standard processes across programs allows more consistent evaluations of the programs, and eases the transition of personnel moving from one program to another. As we observed them, the return-to-flight activities often demonstrated a lack of standard processes, and, in some cases, simply a lack of any process at all.

One dilemma the Agency faced in this regard was how to communicate about its goals and standards of achievement. Once the Agency is on record as committed to a specific achievement, it becomes unpalatable to back off of that target for fear of appearing to fail. Instead, the adjustment of performance standards to allow a “best-effort” provides the appearance that the goal has been met, but without the rigor and discipline necessary to do so safely or completely. Before making commitments to specific achievements, NASA should fully consider how much progress is feasible, and motivate public and private expectations accordingly. When achievements are mandatory at first but become “goals” when the going gets tough, it sends a strong message to everyone that nothing is mandatory.

With the benefit of hindsight, the Agency’s unquestioned endorsement of, and commitment to comply with, the CAIB return-to-flight recommendations may have been laudable and reasonable – and perhaps even necessary under the circumstances the Agency faced at the time – but it may also have been a mistake. The endorsement of the CAIB recommendations, before conducting a thorough engineering and programmatic assessment of their implications, short-circuited a more traditional and rigorous process. NASA has long maintained a list of the hardware risks to the vehicle, and had an upgrades program in place before the Columbia accident. Ideally, NASA should have determined the importance of the CAIB recommendations in relation to the risks and upgrades it was already tracking. Then leaders should have prioritized the implementation of the CAIB recommendations against other desired risk mitigation efforts to determine the best expenditure of limited program resources to provide the largest reduction in overall risk.

The change in National Policy dictating the Space Shuttle be retired in 2010 presented the Agency with an opportunity to re-evaluate the decision to fully implement all of the CAIB recommendations and to curtail actions that were proving to be unproductive or inefficient; NASA did not. In our view, NASA leadership should not have forgone their traditional process of conducting detailed assessments of proposed changes. The CAIB recommendations were important, but the accident board fully acknowledged that they had not considered their recommendations within the larger context of the Space Shuttle Program. In addition, before committing to a short-term launch date – that ultimately drove any number of important implementation decisions – NASA should have conducted detailed engineering assessments of the CAIB recommendations, traded them against other risk mitigation efforts, developed a clear understanding of the physics of foam loss, and devoted serious consideration to alternatives to “fix the foam;” e.g., Orbiter hardening or a redesigned External Tank. This would have allowed the program to determine how long a stand-down was necessary to implement a reasonable set of requirements to reduce the risk of flying the vehicle.

As we reviewed the return-to-flight effort, it was apparent that there were numerous instances when an opportunity was missed to implement the best solution because of this false schedule pressure. As early as September 2003 the RTF TG was told that specific technical activities were not being performed because they could not meet the schedule. Too often we heard the lament: “If only we’d known we were down for two years we would have approached this very differently…” This overall lack of integrated planning resulted in ad hoc and redundant efforts. Even the NASA Implementation Plan disappoints: it has no document number, no change history, and no clear place in the program’s effort. Its subtitle – “A periodically updated document demonstrating our progress …” – makes clear that it is not an executable “plan” but, instead, a status report. Many of the lower-level “plans” that were presented in formal meetings were developed after implementation was initiated, instead of setting clear objectives and acceptance criteria before the work was begun. Activities were undertaken without an understanding of how they contributed to the overall return-to-flight effort and without any sense of budgetary or other limits. As a result, at the end of 2-½ years and $1.5 billion or more, it is not clear what has been accomplished.

While solving the technical problems associated with return to flight was always seen as the highest priority, the cost associated with accomplishing this was for the most part neither effectively monitored nor managed. In fact, many of the return-to-flight efforts were initiated at mid- to lower-levels with little visibility or traceability to the Space Shuttle Program level (Level II). These factors have combined to allow for uncontrolled cost growth and an overall lack of cost management. If the return-to-flight effort had been better managed to control costs, it is possible that funding would exist to upgrade the Orbiter with newer systems and eliminate risks posed by hardware not involved in the Columbia accident.

We also observed that instead of concise engineering reports, decisions and their associated rationale are often contained solely within Microsoft PowerPoint® charts or emails. The CAIB report (Vol. I, pp. 182 and 191) criticized the use of PowerPoint as an engineering tool, and other professional organizations have also noted the increased use of this presentation software as a substitute for technical reports and other meaningful documentation. PowerPoint (and similar products by other vendors), as a method to provide talking points and present limited data to assembled groups, has its place in the engineering community; however, these presentations should never be allowed to replace, or even supplement, formal documentation.

Several members of the Task Group noted, as had CAIB before them, that many of the engineering packages brought before formal control boards were documented only in PowerPoint presentations. In some instances, requirements are defined in presentations, approved with a cover letter, and never transferred to formal documentation. Similarly, in many instances when data was requested by the Task Group, a PowerPoint presentation would be delivered without supporting engineering documentation. It appears that many young engineers do not understand the need for, or know how to prepare, formal engineering documents such as reports, white papers, or analyses.

Another disturbing trait that we observed was that personalities were allowed to dominate over strict process – examples exist of strong personalities attempting to avoid process and others allowing avoidance to occur. Many in senior leadership observed these lapses in process, but did little to correct the situation. For example, during the System Design Certification Review (DCR) II on February 23, 2005, a senior program manager commented that, “It is no longer an important question as to whether or not any given item is certified. Some things won’t be certified … Items don’t have to be certified to fly, and we can even get waivers for the safety cert if need be.” It was astounding that there was no rebuttal to this statement, even though the individual was not the most senior person at the table. This mocking of rigor sends a message to junior staff that it is acceptable to modify or avoid established processes. As a result, both organizational and individual accountability fell by the wayside. Senior leadership should not trivialize established processes since their attitudes can be infectious, either to the benefit or detriment of the Space Shuttle Program and the Agency.

Risk

The CAIB report (Vol. I, p. 193, F7.4-3) states: “Over the last two decades, little to no progress has been made toward attaining integrated, independent, and detailed analyses of risk to the Space Shuttle System.” In terms of the propensity to accept cumulative risk, the CAIB noted (Vol. I, p. 139): “These little pieces of risk add up until managers are no longer aware of the total program risk, and are, in fact, gambling.” Throughout the return-to-flight effort, we observed these propensities still exist.
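The arithmetic behind “little pieces of risk add up” can be sketched as follows. This is an illustration only: it assumes n independent risks with per-flight probabilities p_i, and the example figures are hypothetical, not drawn from the CAIB report or from NASA data.

```latex
\[
P_{\text{total}} = 1 - \prod_{i=1}^{n} \left(1 - p_i\right)
\]
% Hypothetical example: n = 20 separately accepted risks, each with p_i = 0.01
\[
P_{\text{total}} = 1 - (0.99)^{20} \approx 0.18
\]
```

Under these assumed numbers, twenty risks that each look negligible at 1 percent combine to roughly an 18 percent chance that at least one of them occurs, which is the kind of total that can go unseen when each risk is accepted in isolation.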

Very few human endeavors, particularly related to high energy activities involving advanced technologies, are completely free of risk. Spaceflight in general, and human spaceflight in particular, is such that it is impossible to drive the risk to zero. Most who have led high risk, technical organizations will readily admit one of the greatest threats resides in unknown, unrecognized, or unacknowledged risks. Ultimately, all three of NASA’s human spaceflight mishaps resulting in crew loss fell prey to one or more of these. To eliminate these threats, successful risk management approaches mandate thorough, ongoing, and critical assessments of potential individual and systemic vulnerabilities. While the return to flight efforts may have reduced some known risks, Space Shuttle missions will always be “accepted risk” operations. NASA must be vigilant to prevent the development of a false sense of security by accepting faulty assumptions, or otherwise inappropriate analyses, to justify continued Space Shuttle missions. The vehicle is not inherently unsafe, but it demands a high degree of vigilance to fly safely.

Unfortunately, we do not believe the risk management processes in place within the Space Shuttle Program are sufficiently robust. One telling sign is the program’s development of a document entitled, The Integrated Risk Acceptance Approach For Return To Flight, which was revised several times during early 2005. This narrative has little substance regarding classical risk management. It is more a brief status report on a list of known and significant risks, noting where risk has been “accepted,” with no rationale or explanation. The document exhibits the very lack of accountability we referenced previously: it does not have an official document number, has no change history, appears not to be under configuration management, lists no authors, and has no approval signatures. The Task Group was informed that this document did not reflect the complete integrated risk acceptance for the return to flight, but to our knowledge, a total integrated risk acceptance rationale was never provided to the Task Group.

We note that NASA managers also tend to confuse the exhaustive and laudable Integrated Hazard Report system with integrated risk management. The Space Shuttle Program has executed a thorough review of all Integrated Hazard Reports on its own initiative and at a considerable cost in hours and funds. As commendable as this effort has been, the review of thousands of Integrated Hazards does not constitute, nor should it be a substitute for, a comprehensive integrated risk management approach. Throughout the return-to-flight effort, there has been a reluctance to appropriately characterize the risks inherent in the Space Shuttle Program. As an example, it has proven irresistible for some officials to characterize the modified External Tank as “safer,” the “safest ever,” or even “fixed,” when neither the baseline of the “old” tanks nor the quantitative improvement of the “new” design has been established. The tank may well be safer, but without adequate risk assessment based on objective evidence it is impossible to know.

The CAIB noted (Vol. I, pp. 118, 189-190, 200) that as the Space Shuttle became “operational,” NASA did not sustain the rigorous risk identification, assessment, management capabilities, or mindset required for what in reality was a developmental vehicle operating in a high-risk environment. Prior flight history became, incorrectly, an accepted risk rationale. In the end, few human spaceflight activities are more important than identifying and assessing the residual risk to flight and determining if it is acceptable from both a cumulative and integrated perspective. It is axiomatic that a fundamental capability of a “high-risk” agency is the ability to analyze risk, and failure to do so rigorously is a failure of leadership.

It is ironic that the Space Shuttle will need to be treated as a developmental vehicle even as the program is winding down toward retirement; the risks of the last flight will be every bit as great as the risks for each of the flights before it. History has shown that leadership has occasionally, but boldly, made the wrong choices and has been too easily convinced that the risk is acceptable. For the future of manned spaceflight, NASA leadership must protect against such tendencies.

Requirements

The Space Shuttle Program does not seem to have a basic understanding of what requirements are, what they can do for the program, and what they can do to the program. In many cases during the return-to-flight effort, hardware was built or modified, and models or analysis tools were coded and used, before any requirements were generated. This was explained by one program official as, “… if this were ‘business as normal’ … we would follow the classic approach of defining requirements first. Return to flight is not in that mode, if we want to fly anytime soon.” The fact was, they didn’t fly anytime soon, partly because they did not have adequate requirements. The same program official continued, “We are pushing for answers on RCC vulnerability, test results on debris allowables, best available resolution for imagery – best effort across the board – without really knowing what the requirements are.” We are not convinced that implementing changes to man-rated systems without first defining requirements is a desirable approach. The lack of requirements also partially explains the difficulties the program has in determining how to verify, validate, and certify the new capabilities, and how to adequately determine how much remaining risk needs to be accepted.

The discipline of defining integrated requirements before embarking on implementation allows an overall picture of work to be done, including associated interdependencies. This in turn facilitates prioritization of those requirements and therefore also prioritization of tasks to be done. Had this been accomplished, NASA would have been in better position to determine which tasks should have been constraints to the return to flight and which should not. This would also have allowed the development of proper schedules and plans, the generation of reasonable budget and resource estimates, and their allocation as established by priorities. As it was, it seemed that when it became apparent that a particular function would not be completed before return to flight (e.g., TPS repair), the program simply decreed that it was no longer mandatory for STS-114. Because of this lack of discipline, the Space Shuttle Program experienced instances where flight hardware was manufactured, accepted, and manifested prior to the completion of design reviews and the release of approved engineering documentation. Major testing and design activities were undertaken without specific requirements or success criteria. In some cases, the program simply refused to write down requirements, citing the “work” as more important than documentation. Lacking specific direction from the program, working-level personnel proceeded to perform test, design, and analysis activities based on their best guess of what was required. This resulted in designs that failed to meet the requirements that were ultimately written, tests that did not apply to the actual environments, models based on flawed assumptions, and a general expenditure of resources in an uncoordinated manner.

It is recognized that even with correctly-written requirements, non-conformances will exist on either a temporary or permanent basis. These non-conformances need to be documented, completely assessed, and formally presented to management for a determination if the requirement should be changed, waived, or if it should be met as-stated and the nonconformance eliminated. Although a process exists to manage this, it is not rigorously followed in all instances.

The Space Shuttle Program has been repeatedly cited for having too many waivers, and has become reluctant to add additional waivers, choosing instead to “beat” the system by using other means. Evidence of this involved open work on the External Tank despite its generally rigorous process. Numerous open items came out of the External Tank DCR. Instead of capturing each one of these as a separate piece of open work, the ET Project announced at the February 24-25, 2005, DCR pre-board that it would document them in a “Verification Limitations Document.” While it is laudable that the project at least captured the deficiencies in the certification (unlike some others), the stated rationale for this approach was that the Verification Limitations Document would negate the need for any waivers. This, in effect, clouds the number of requirements that are not being met and diminishes the certification of the External Tank.

The Use of Models

As part of the return-to-flight effort, NASA initiated the development of a suite of more than 20 new models to assist in assessing both pre- and post-launch risk. Standard engineering practice calls for objectives (requirements and interface definitions) to be established prior to development for any model or system of models, and processes and criteria defined for validating and verifying the model’s results. Also, it is not unusual for a peer review by outside experts to be employed, especially to evaluate systems of complex models that are by necessity inter-related but do not naturally resolve themselves to systemic specification. Initially, we did not observe these normal processes being followed during the development of these models, and a formal request by Ralph Roe of the NESC for a stand-down to evaluate the completed works was ignored. Later the NESC and other organizations did undertake limited peer reviews.

In the case of debris analysis, models for: 1) debris liberation; 2) aerodynamic characteristics of the debris; 3) transport analysis of debris; 4) impact tolerance of the thermal protection system; and, 5) the resultant thermal and structural models of the effects of damage, are all necessary to assess risk. The uncertainty in one model (or system) inherently feeds into and compounds the uncertainty in the second model (or system), and so on. It appears, however, that NASA largely designed these five classes of models without the attention to the interdependencies between the models necessary for a complete understanding of the end-to-end result. Understanding the characteristics of, and validating and verifying, one type of model without examining the implications for the end-to-end result is not sufficient.
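A rough worked illustration of this compounding follows, under a strong simplifying assumption that is ours and not the Task Group’s: treat the end-to-end result Y as a product of five independent stage outputs X_1 through X_5, each with nonzero mean and relative standard deviation c_i. The actual debris models are not a simple product, so this indicates only the direction and approximate size of the effect.

```latex
\[
Y = \prod_{i=1}^{5} X_i, \qquad c_i = \frac{\sigma_{X_i}}{\mu_{X_i}}
\quad\Longrightarrow\quad
1 + c_Y^2 = \prod_{i=1}^{5} \left(1 + c_i^2\right)
\]
% Hypothetical example: every stage has c_i = 0.20
\[
1 + c_Y^2 = (1.04)^5 \approx 1.217, \qquad c_Y \approx \sqrt{0.217} \approx 0.47
\]
```

With these assumed figures, five stages that are each uncertain by 20 percent give an end-to-end result uncertain by roughly 47 percent, so a model that looks well characterized in isolation says little about the reliability of the final answer.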

Further compounding the modeling challenge is the fact that the models most often used for debris assessment are deterministic, yielding point estimates, without incorporating any measure of uncertainty in the result. Methods exist to add probabilistic qualities to the deterministic results, but they require knowledge of the statistical distribution of the many variables affecting the outcome. Typically, the distributions of the “independent” variables would be derived from empirical observation. In the case of spaceflight, however, empirical evidence is often limited or non-existent, so theoretical or engineering distributions must be substituted. The probabilistic analysis therefore is very dependent on the quality of the assumptions made by the developers. Although they evaluated some of the assumptions used by the model developers, the NESC end-to-end “peer review” primarily analyzed whether the output of one model could be incorporated into the next, not the joint probability associated with any given output … without which it is difficult to know the reliability of the result.
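The difference between a deterministic point estimate and a probabilistic result can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the kinetic-energy function is not one of NASA’s debris models, and the nominal values, the damage threshold, and the input distributions are assumed purely for illustration.

```python
import math
import random

# All numbers below are hypothetical placeholders, not NASA data.
NOMINAL_MASS_KG = 0.1
NOMINAL_VELOCITY_M_S = 100.0
DAMAGE_THRESHOLD_J = 800.0


def impact_energy(mass_kg: float, velocity_m_s: float) -> float:
    """Deterministic stand-in model: kinetic energy of a debris piece, in joules."""
    return 0.5 * mass_kg * velocity_m_s ** 2


def exceedance_probability(n_samples: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(impact energy > threshold).

    Assumed input distributions (engineering guesses, since empirical data
    is often unavailable): mass is lognormal about the nominal value and
    velocity is normal about the nominal value.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_samples):
        mass = rng.lognormvariate(math.log(NOMINAL_MASS_KG), 0.3)
        velocity = rng.gauss(NOMINAL_VELOCITY_M_S, 15.0)
        if impact_energy(mass, velocity) > DAMAGE_THRESHOLD_J:
            exceed += 1
    return exceed / n_samples


if __name__ == "__main__":
    point = impact_energy(NOMINAL_MASS_KG, NOMINAL_VELOCITY_M_S)
    print(f"point estimate: {point:.0f} J (threshold {DAMAGE_THRESHOLD_J:.0f} J)")
    print(f"estimated exceedance probability: {exceedance_probability():.3f}")
```

The point estimate sits comfortably below the assumed threshold, yet a non-negligible fraction of the sampled cases exceeds it. The size of that fraction depends entirely on the assumed distributions, which is why the quality of those assumptions matters so much.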

Probability distributions are analytic methods necessary when assessing risk. Without an understanding of the likelihood of an outcome, risk acceptance is a judgment based on instinct and experience. But, as the Columbia accident showed, in a high risk environment that involves many unknowns like human space flight, experience and instinct are poor substitutes for careful analysis of uncertainty. This requires that analytical models be used appropriately to inform decisions within a rigorous engineering process.

Leadership

Leadership is critical to the success of any organization of the size and complexity of NASA. Without leadership the organization lacks cohesiveness and its goals lack coherence, resulting in wasted resources and, potentially, compromised products. A true leader is one who creates/coerces/compels/attracts/demands a responsive organization. It is never enough for a leader to say: “I made that decision, what more do you want me to do?” Instead, at NASA, leaders must follow through to ensure decisions are executed with the rigor and discipline necessary for safe human spaceflight.

Nonetheless, what our concerns about rigor, risk, and requirements point to is a lack of focused, consistent leadership and management. What we observed, during the return-to-flight effort, was that NASA leadership often did not set the proper tone, establish achievable expectations, or hold people accountable for meeting them. On many occasions, we observed weak understanding of basic program management and systems engineering principles, an abandonment of traditional processes, and a lack of rigor in execution. Many of the leaders and managers that we observed did not have a solid foundation in either the theory or practice of these basic principles. As the CAIB noted (Vol. I, p. 223, O10.12-1), “Unlike other sectors of the Federal Government and the military, NASA does not have a standard agency-wide career planning process to prepare its junior and mid-level managers for advanced roles.” In fact, NASA’s early successes are rooted in program management techniques and disciplines that few current managers in the human spaceflight arena have been willing to study. As a result, they lack the crucial ability to accurately evaluate how much or how little risk is associated with their decisions, particularly decisions to sidestep or abbreviate any given procedure or process.

It is essential that senior managers have previously-demonstrated program management and systems engineering skills and a dedication to well-established, rigorous principles as they apply to complex, geographically and organizationally dispersed programs. More to the point, we remain concerned that NASA senior leadership did not recognize or correct this, and indeed sent contrary signals that the rigor and discipline of a sound program management process was not required.

The Role of Accountability

A crucial factor in creating a responsive and responsible organization is accountability. Within the human spaceflight programs, the lack of accountability appears to be pervasive, from the failure to establish responsibility for the loss of Columbia, up to and including a failure to require an adequate risk assessment of the next flight. While accountability takes many forms, to inculcate an organization and its culture with accountability requires, at a minimum, the consistent setting of expectations, as well as appropriate consequences for not meeting them. This is an important role of a leader. If no one, or no part of the organization, is held accountable for failing to meet those expectations, performance becomes simply a case of “best effort” – a term that became common during many return-to-flight discussions.

A general attitude within the Space Shuttle Program seems to be that best-effort is a satisfactory substitute for meeting specific technical requirements; often requirements were not even documented to avoid the chance they could not be met. However, best-effort is a very poor substitute for a thorough understanding of the technical situation. Parts of the Agency seem to have forgone their traditional engineering rigor in favor of “when you have done your best effort, you are good to go.” This is not an appropriate philosophy for a high-performance organization that routinely puts the lives of its employees into high-risk situations. As Richard Feynman pointed out in his appendix to the Rogers Commission report, “… reality must take precedence over public relations, for nature cannot be fooled.”

Although not described as such, the CAIB noted many of the symptoms of an organization operating with a best-effort attitude. The accident board wrote, “… traits and organizational practices detrimental to safety were allowed to develop, including: reliance on past success as a substitute for sound engineering practices (such as testing to understand why systems were not performing in accordance with requirements); organizational barriers that prevented effective communication of critical safety information and stifled professional differences of opinion; lack of integrated management across program elements; and the evolution of an informal chain of command and decision-making processes that operated outside the organization’s rules” (CAIB, Vol. I, p. 9). Yet we witnessed the best-effort approach during the return-to-flight effort; we saw it in the NASA responses to Task Group requests for information (RFI), observed it during briefings, and experienced it while processing the closure packages sent to us by the Space Shuttle Program.

Since NASA leadership had few rigorous requirements or expectations for CAIB compliance, the closure packages, which should have represented the auditable, documented status of the NASA implementation of the CAIB recommendations, tended to rely on mass, rather than accuracy, as proof of closure. The closure packages showed an organization that apparently still believes PowerPoint® presentations adequately explain work and document accomplishments. Our frustration with these packages drew the response that the engineering teams able to provide the detail were too busy preparing for launch and “doing real work” to properly document their actions. The inadequate and disorganized closure packages frequently required significant effort to obtain even minimally essential documentation. The packages themselves were often provided prematurely, presumably (and sometimes with direct request) to seek guidance on “what it would take” to get the Task Group to “pass the recommendation.”

Individual accountability – what the Agency is now calling “technical conscience” – can overcome the best-effort malaise if accompanied with sufficient positive and negative consequences. Part of being accountable, providing more than best-effort, includes having a well thought-out, focused plan prior to beginning implementation. Technical conscience provides the impetus to carry out those plans with rigorous adherence to engineering discipline. We feel significant progress can be made if this new technical conscience can be spread throughout the Space Shuttle Program and the rest of NASA.

Attitude and Learning

The CAIB noted an air of “arrogance” within NASA that led leaders and managers to be dismissive of the views of others, both within the organization and, especially, from outside the Agency. A less critical way to describe the phenomenon is one of “comfort” – comfort with existing beliefs, comfort with past experience, and comfort with information developed inside NASA. As an excuse for not listening, especially to criticism from outside the agency, NASA often proclaims itself to be unique. We readily admit that few organizations of any type – governmental, academic, or commercial – do the kind of work NASA does. Although the end product may be different, however, many of the processes are not different from those found in many large organizations. Whatever the source of this apparent insularity, it is inappropriate for an agency that routinely operates in a high-risk environment. The recurrence of apparently preventable accidents and the seeming unwillingness to learn should be sufficient to instill some humility to temper what often looks like arrogance. During the past two years, we have not witnessed very much of such humility.

During the return to flight effort, even while NASA was systematically encouraging everyone to speak up and many processes were opened to more participation, the result was still very much the same as before the accident – roles, positions, and strength of personality often determined critical outcomes more than facts and analysis. More people were talking, but not many more were listening. Not listening manifests itself in other ways. It appears to us that NASA, unlike most high-performance organizations, rarely studies its own, or anybody else’s, mistakes; the CAIB also commented on this trait (CAIB Vol. I, p.11). It is widely believed that organizations that study and learn from small mistakes can often avoid larger ones. Conversely, those who do not learn as they go have no experience base to help avoid the big mistakes – such as the Challenger and Columbia accidents. An organization that places little value on sustained improvement from prior mistakes will tend to repeat them and certainly will not effectively carry the necessary lessons forward to other programs. We have seen little evidence of renewed commitment to learning lessons from past mistakes at NASA.

For instance, while many academic and government entities use the Challenger accident as a case study, ironically the human spaceflight programs do not. Similarly, NASA scarcely considers lessons from other organizations involved in high-risk endeavors, such as the Navy’s courses on the Scorpion and Thresher submarine accidents and its SUBSAFE program. As stated in the CAIB Report, Chapter 7, “The submarine Navy has a strong safety culture that emphasizes understanding and learning from past failures. NASA emphasizes safety as well, but training programs are not robust and methods of learning from past failures are informal.” Although NASA has maintained a “lessons learned” system since 1992, the human spaceflight activities appear not to have embraced it.

In addition to not being willing to learn from mistakes, many NASA managers are not willing to learn from success, either. NASA’s early successes, as well as many in DoD, are rooted in program management techniques and disciplines that few managers in the human spaceflight arena have been willing to study. Having apparently not done so, they lack the ability to accurately evaluate how much or how little risk is associated with sidestepping or abbreviating any given procedure or process.

Summary

It is difficult to be objective based on hindsight, but it appears to us that lessons that should have been learned have not been. Perhaps we expected or hoped for too much. The CAIB report should have served NASA as a “wake-up” call. As the CAIB noted (Vol. I, p. 208), “The recognition of human spaceflight as a developmental activity requires a shift in focus from operations and meeting schedules to a concern for the risks involved. Necessary measures include … Barring unwarranted departures from design standards, and adjusting standards only under the most rigorous, safety-driven process.”

We expected that NASA leadership would set high standards for post-Columbia work. We expected upfront standards of validation, verification and certification. We expected rigorous and integrated risk management processes. We expected involved and insightful leadership from NASA Headquarters. We were, overall, disappointed.

There certainly are capable leaders to be found in the Space Shuttle Program and throughout NASA. In our view, though, the return-to-flight effort, when taken as a whole, was not effectively led or managed. The absence of accountability, of having managers dedicated to program management processes, and of managers being assigned to programs only after demonstrating these skills are what we believe to be the causes of the surface-level symptoms we saw so often. In particular, leadership and managerial failures to set expectations and requirements, and to hold people accountable, promoted a lack of engineering rigor, discipline, and integrated risk assessment. Ultimately, this cost the program significant time and money while producing, in some areas, suspect, disappointing and/or inadequate results. Learning the lessons of these failures is important to NASA’s future.

Conclusion

Among the most damning observations CAIB made of NASA was the sense of complacency toward the problem of the External Tank shedding of foam. Despite program requirements that no debris should be shed, there were over 15,000 instances of damage to the Orbiter, most of which came from debris from the Space Shuttle elements. As has been widely reported, two flights before Columbia, a large piece of foam was shed and caused minor damage to one of the Solid Rocket Boosters. Photographic documentation was available of major foam shedding from the External Tanks on at least seven previous flights (CAIB Vol. I, p. 85). Despite all this evidence, foam had never destroyed an Orbiter and the program relied on this “flight history” to justify inactivity before and during the flight of Columbia. This “We’ve seen this before” mentality is still present, and it appeared on more than one occasion during MMT simulations. In addition, leading up to the return-to-flight, the program justified not pursuing potential ice damage to the Orbiter umbilical doors because there had not been substantial damage on previous flights. Despite the evidence of impacts all around the area, the official rationale for accepting the risk was listed as “flight history;” i.e., we’ve never had critical damage there before.

NASA’s leaders and managers must break this cycle of smugness substituting for knowledge. NASA must be able to quantify risk, even if imperfectly, set requirements and expectations, and hold organizations and individuals accountable. Analytical models – while valuable tools – cannot substitute for engineering judgment and conscience. Rigor must be reestablished throughout the Agency. Opinion, no matter how well informed, cannot replace objective evidence. Flight history, while critical for informed judgment, cannot substitute for it. “We’ve been lucky” is a statement that should never be associated with the human spaceflight programs. Perhaps most disturbing is the engineering legacy that seems to be developing within NASA. As with many professions, the basics of engineering are learned in school. However, good engineering practices – such as rigor in process and documentation – are learned outside the classroom in an apprentice-like environment. These practices are passed on to future generations as part of the “culture” of an organization. However, when an organization loses focus on its core values, the effects stretch far beyond the present because those principles are no longer passed on to future generations. Senior leaders do not appear to be concerned with following defined processes and are passing this legacy on to future leaders.

In order to properly prepare the Agency for the future, including the return to the Moon and journey to Mars, we offer the following suggested actions, all of which must start at the top and flow down to the programs, projects, and workforce:

  1. Clearly set achievable expectations and hold people accountable; in addition to positive consequences, this includes negative consequences for not performing to expectations;
  2. Return to classic program management and systems engineering principles and practices (including integrated risk management), and execute these with rigor;
  3. Ensure managers at all levels have a solid foundation in these attributes before appointing them to such responsibilities; this requires not only training, but successful demonstration of these skills at a lower level;
  4. Eliminate the prejudices and barriers that prevent the Agency, and especially the human spaceflight programs, from learning from their own and others’ mistakes. NASA needs to learn the lessons of its past … lessons provided at the cost of the lives of seventeen astronauts.

Specific Examples

The examples that follow this narrative are just that – examples. However, the behaviors and attitudes were not random events that were seen merely once or twice, but numerous times throughout the Task Group’s interactions with NASA. Many of the examples presented are not intended as detailed case studies, but are meant to provide evidence demonstrating behaviors of concern. We offer these observations for consideration and future improvement.

Rigor Example 1

From our vantage point, the process for selecting a launch date was flawed, if indeed there was a process. We understand these were not normal circumstances, and the usual processes used to establish launch dates – hardware processing templates and payload readiness, to name two – were not applicable. However, we feel that the establishment of launch dates that seemingly did not take into account the full ramifications of the analysis and development efforts being conducted ultimately proved detrimental to the program.

As discussed in the narrative of this observation, we feel the program should have begun the return-to-flight effort with a process that determined what work needed to be accomplished to return to flight and what the interdependencies were among that work, then developed schedules that supported the execution of the work. This process would have helped determine which efforts needed to be accomplished first since their results were required by other efforts. For example, determining the damage tolerance of the Orbiter before giving the ET Project their debris allowables requirements would have helped ensure the tank modifications would eliminate the appropriate debris. Establishing the RCC damage thresholds early would have provided the OBSS effort with their inspection criteria. Instead, it appears to us that senior management selected launch dates based on non-technical concerns, ultimately placing unnecessary and unrecoverable restrictions on teams working return-to-flight hardware development. In addition, several important requirements – such as the critical damage and debris size – were scheduled to be finalized at FRR, far too late to influence the products being provided by the External Tank Project, OBSS, and other systems. Finally, the constant setting of a launch date only a few months away never allowed the development efforts to take full advantage of the ultimate two-year stand-down; we heard several times that different solutions to various problems would have been selected if launch had not been 90 days away.

Scheduled Return-to-Flight Launch Dates

Meeting Date          STS-114 Launch Date   Days Until Launch   Months Until Launch
Jan 29, 2003          Mar 01, 2003          31                  1.0
Feb 10, 2003          Mar 01, 2003          19                  0.6
Feb 11, 2003          Apr 03, 2003          51                  1.7
Feb 24, 2003          Apr 03, 2003          38                  1.3
Feb 25, 2003          Jul 21, 2003          146                 4.9
Apr 16, 2003          Jul 21, 2003          96                  3.2
Apr 17, 2003          Oct 01, 2003          167                 5.6
May 21, 2003          Oct 01, 2003          133                 4.4
May 22, 2003          Dec 18, 2003          210                 7.0
Jul 28, 2003          Dec 18, 2003          143                 4.8
Jul 29, 2003          Mar 11, 2004          226                 7.5
Oct 05, 2003          Mar 11, 2004          158                 5.3
Oct 06, 2003          Sep 12, 2004          342                 11.4
Mar 08, 2004          Sep 12, 2004          188                 6.3
Mar 09, 2004          Mar 06, 2005          362                 12.1
Oct 28, 2004          Mar 06, 2005          129                 4.3
Oct 29, 2004          May 12, 2005          195                 6.5
Feb 15, 2005          May 12, 2005          86                  2.9
Feb 16, 2005          May 14, 2005          87                  2.9
Apr 06, 2005 (SFLC)   May 15, 2005          39                  1.3
Apr 19, 2005 (SFLC)   May 22, 2005          33                  1.1
Apr 29, 2005 (HQ)     Jul 13, 2005          75                  2.5
Jul 13, 2005 (scrub)  Jul 16, 2005          3                   0.1
Jul 16, 2005 (HQ)     Jul 26, 2005          10                  0.3
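
The "Days Until Launch" column is simple calendar arithmetic and can be spot-checked. The sketch below is illustrative only and is not part of the Task Group's analysis. The function names are hypothetical, and the 30-day month used for the last column is an assumption: the table does not say how months were derived, although that convention reproduces the three rows used here.

```python
from datetime import date

def days_until_launch(meeting: date, launch: date) -> int:
    """Calendar days from the meeting date to the planned launch date."""
    return (launch - meeting).days

def months_until_launch(days: int) -> float:
    """Assumed convention: a 30-day month, rounded to one decimal place."""
    return round(days / 30, 1)

# Three rows taken from the table: (meeting date, planned launch date)
rows = [
    (date(2003, 1, 29), date(2003, 3, 1)),
    (date(2003, 7, 29), date(2004, 3, 11)),  # interval spans Feb 29, 2004
    (date(2005, 7, 16), date(2005, 7, 26)),
]

for meeting, launch in rows:
    days = days_until_launch(meeting, launch)
    print(meeting, launch, days, months_until_launch(days))
```

For these rows the sketch gives 31, 226, and 10 days (1.0, 7.5, and 0.3 months), matching the table.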

When a revised launch date was proposed to the Spaceflight Leadership Council for consideration in February 2005, the briefing leading up to the decision identified only KSC processing timelines. No questions were asked about whether the elements could complete their work, including the ongoing activities to meet the CAIB recommendations, with adequate rigor in time to support this date. Additionally, the debris/flight rationale requirements were discussed after the launch date was set, thereby never entering into the launch rationale.

As we reviewed the path that NASA has taken to prepare for STS-114, it became apparent that there were numerous instances when an opportunity was missed to implement the best solution because of this false schedule pressure. Many times technical-level personnel indicated that if they had known that they were going to be grounded for 2 years, the solutions chosen would have been much different. The following examples illustrate how an unrealistic schedule for return-to-flight compromised standard processes:

  1. A decision was made not to install LO2 feedline bellows heaters on ET-120 (the first STS-114 tank) and ET-121 (the initial STS-121 tank, ultimately used on STS-114) despite evidence that one might be required. Instead, only a relatively easy, but questionably effective, “drip lip” was installed on the first two tanks. Continued questions about its effectiveness eventually drove the program to roll STS-114 back from the launch pad to install the heater.
  2. The implementation of the OBSS sensor package was selected before knowing the size of the damage that needed to be detected. In fact, as of July 11, 2005, NSTS 60517, the PRD for the Shuttle On-Orbit TPS Inspection System, still had numerous TBDs for critical requirements regarding the required resolution capability. On several occasions, members of the NASA workforce said that methods other than the OBSS would have been preferable and that the OBSS was chosen because of the short time before the targeted launch date.
  3. The decision to stay with the STA-54 tile repair material was made on the apparent near-term availability of this material and not because anybody believed it was the best possible choice.
  4. Both Shuttle and ISS teams reworked flight manifests, schedules, and analysis many more times than should have been necessary due to this lack of an integrated approach to resolving the real issues and planning a realistic timeline to launch. This also resulted in repeated coordination with the international partners.

True research and development (R&D) efforts – such as TPS repair – should not have been a constraint to the launch of STS-114 unless the Agency felt the capabilities to be provided by these R&D efforts were so important they could not risk lives without them. Additionally, NASA should have evaluated its return-to-flight activities and determined which efforts were not progressing as originally intended, then been completely honest with itself, higher authority, and the Task Group that it would not be able to meet those recommendations within the funding and schedule constraints imposed on the program. Schedules for R&D activities are difficult to predict, which is perhaps reason enough not to include them as return-to-flight criteria.

Rigor Example 2

On August 27, 2004, one year after the release of the CAIB report, the Space Shuttle Program signed PRCBD S062246 approving the Post-STS-107 Return-to-Flight Design Certification Review Plan and Procedures document, NSTS 60524. The proclaimed purpose of this document was to “define the activities and procedures for accomplishing the Space Shuttle Program (SSP) Design Certification Review (DCR) process for Return-to-Flight after the Columbia accident. This plan establishes the requirements, responsibilities, preliminary schedule, general implementation guidelines, and success criteria required to complete the post-STS-107 RTF DCR process and document the review results.” This document directed a tiered DCR process to formally demonstrate that new or modified systems, software, supporting processes, and operations meet the program design, safety, performance, and operational requirements levied upon the item in question. The process also required demonstration that “appropriate certifications have been performed” at lower levels. The document specifically recognized that the “tiered DCR process being conducted for RTF is not classical in nature as more content than simple certification of hardware is reviewed. The SSP is utilizing this process to cover other major topics, such as standard operational and process changes, which would otherwise be discussed at a Flight Readiness Review (FRR).” The design to be certified during the DCR was to include all changes occurring after the STS-107 Certificate of Flight Readiness (CoFR) was signed on January 9, 2003.

Ironically, the effect seems to have been largely the opposite. Instead of pulling FRR material forward into the DCRs, many of the projects/elements, and the program itself, stated during their DCRs that reviews of several activities would be deferred until the FRR. This seems too late in the process to be making critical decisions.

In all, the 60-page document was an attempt to instill some discipline into the return-to-flight process. However, during fact finding, we noted that while a program-wide process for DCRs existed, it was not imposed on the various projects – each project and the Space Shuttle Program executed DCRs in different manners with wide variances in the scope, execution and rigor for the various project and system-level DCRs. In response to a question from a Task Group member on this wide variability, the Systems Engineering and Integration Office indicated they didn’t set any standard processes because MSFC and JSC operate differently – this from the organization that put together the DCR process originally.

Additionally, a senior Space Shuttle Program official at one point denied the existence of a document governing the DCR process, despite the fact that he approved NSTS 60524. It is a concern that processes put in place specifically for the return-to-flight effort can be ignored so cavalierly without consequence.

During early 2005, the program decided that since they would likely not be able to “certify” the debris aspects of the Space Shuttle system, the term Design Certification Review was no longer appropriate. Instead, a series of newly termed Design Verification Reviews (DVR) were held. These seemed to suffer a rough start; during the first DVR, when asked about the availability of data and documentation to support the review, the program responded that none was available. When asked about the success criteria for the review, the response from the program was that none had been established. Interestingly, NSTS 60524 was never updated to reflect the newly-coined DVR process. On one hand, the rigorous DCR process was ignored by many; on the other hand, there were too many of these reviews. Various parts of the program did not have all the necessary work completed in time for scheduled DCRs, so there ended up being multiple DCRs for each project/element to cover all the work. Rather than 12 System Reviews (6 DCRs and 6 DVRs – and none of these covered TPS repair), it likely would have been a better use of resources (particularly the reviewers’) to delay the System DCR/DVR until all the work was complete.

The degree of rigor employed during the return to flight effort has varied with the individual projects. At the one extreme are activities like the SRB Bolt Catcher redesign and, to a lesser extent, the modifications to the External Tank thermal protection system. Both of these projects exhibited a formal and documented approach to the establishment of requirements and execution of their design review and certification processes. Both the SRB Bolt Catcher and the ET had formal plans for their various reviews, formal data packages, a formal issue review process, formal pre-boards and boards, and well-documented formal findings.

The other extreme includes activities such as Orbiter TPS repair, which have been extremely convoluted. While presentation material for the repair efforts was developed for the Orbiter DCR II held in February 2005, the material was not covered at that meeting. At the System DCR II later that month, it was stated that repair techniques would not be addressed in the DCR/DVR process, because there were no Level II requirements to have a repair capability. This was even though the DCR process was supposed to cover all changes since the STS-107 CoFR, criteria that certainly applied to the repair techniques.

Perhaps the most revealing behavior observed during the design reviews was at the Program DCR at KSC on April 19, 2005. This “review,” like many witnessed during the return-to-flight effort, was not so much a review as it was a status briefing. No technical questions were asked by the Board; no technical responses given. With the single exception of the SSME Project, each project and element simply presented a high-level summary of their current status, including open work; SSME attempted to describe a technical problem and request help in resolving it, without much success. The final certification was conditional on the “satisfactory completion of identified open work,” but nobody before, during, or after the meeting kept track of the open work presented by the projects/elements. This meeting validated the CAIB observation of engineering and decisions via PowerPoint presentation rather than technical detail and rigor.

Risk Example 1

The Space Shuttle Program has, in the past, too often accepted risks that should have been mitigated; this trend appeared to continue during the return-to-flight effort. It appears to us that what the CAIB wrote (Vol. I, p. 193, F7.4-5) is still applicable today: “Risk information and data from hazard analyses are not communicated effectively to the risk assessment and mission assurance processes. The Board could not find adequate application of a process, database, or metric analysis tool that took an integrated, systemic view of the entire Space Shuttle system.”

Ultimately, few programmatic responsibilities are more important than identifying and assessing risk and determining if it is acceptable from both a cumulative and integrated perspective. As the Space Shuttle became “operational,” NASA did not sustain the risk identification, assessment, management capabilities, or mindset required for what in reality was a developmental vehicle operating in a high-risk environment. Prior flight history became an accepted risk rationale. The perceived risk level during the launch of STS-107 was not aligned with the facts regarding the actual debris environment, just as the perceived risk for Challenger had not been aligned with the true state of the o-rings. Nevertheless, the issues were considered accepted risks that had potentially catastrophic consequences, but with a remote likelihood of occurrence. Despite this perception, in reality the risks should have been considered unacceptable – potentially catastrophic consequences with a good likelihood of occurrence.

This should have initiated a design change, either to eliminate the debris environment or to modify the Orbiter to withstand the resulting debris environment, in accordance with the Space Shuttle Hazard Reduction Precedence Sequence (NSTS 5300.4[1D-2] Section 1D201, Item 6, based on MIL-STD-882D, Section 4.4). This program-wide policy has as its first step, design action to eliminate the hazard:

Hazard Reduction Precedence Sequence. To eliminate or control hazards, the contractor shall use as a minimum the following sequence or combination of items:

a. Design for Minimum Hazard. The major goal throughout the design phase shall be to ensure inherent safety through the selection of appropriate design features as fail operational/fail safe combinations and appropriate safety factors. Hazards shall be eliminated by design where possible. Damage control, containment and isolation of potential hazards shall be included in design considerations.

b. Safety Devices. Known hazards which cannot be eliminated through design selection shall be reduced to an acceptable level through the use of appropriate safety devices as part of the system, subsystem, or equipment.

c. Warning Devices. Where it is not possible to preclude the existence or occurrence of a known hazard, devices shall be employed for the timely detection of the condition and the generation of an adequate warning signal. Warning signals and their application shall be designed to minimize the probability of wrong signals or of improper personnel reaction to the signal.

d. Special Procedures. Where it is not possible to reduce the magnitude of existing or potential hazard through design, or the use of safety and warning devices, special procedures shall be developed to counter hazardous conditions for enhancement of ground and flight crew safety. Precautionary notations shall be standardized.
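
The ordering in items a through d can be read as a strict preference: choose the highest-precedence control that is feasible before falling back to the next. The sketch below illustrates that reading only. The enum and function names are hypothetical and do not come from NSTS 5300.4 or MIL-STD-882D, and because the policy also permits a combination of items, the function returns only the control to pursue first.

```python
from enum import IntEnum
from typing import Optional, Set

class HazardControl(IntEnum):
    """Items a through d of the sequence; a lower value means higher precedence."""
    DESIGN_FOR_MINIMUM_HAZARD = 1
    SAFETY_DEVICES = 2
    WARNING_DEVICES = 3
    SPECIAL_PROCEDURES = 4

def preferred_control(feasible: Set[HazardControl]) -> Optional[HazardControl]:
    """Return the highest-precedence feasible control, or None if none is feasible."""
    return min(feasible) if feasible else None

# Hypothetical case: no design fix or safety device is available yet.
choice = preferred_control(
    {HazardControl.SPECIAL_PROCEDURES, HazardControl.WARNING_DEVICES}
)
print(choice.name)
```

Under this reading, "accepted risk" is not an entry in the sequence at all; it is what remains only after every feasible control has been considered.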

It is recognized that any design change takes time to develop, implement, and certify; however, the specific design action could be underway while the program assesses the technical risk of continuing operations and maintains a focused awareness of the risk in each area. The program should not have the option of short-circuiting the process by skipping to “accepted risk” as was done before both Challenger and Columbia.

The goal is to change the design to completely eliminate the risk. As with all design actions – especially when dealing with high technology programs such as spaceflight – it is recognized that there will be limitations driven by the laws of physics and program resources. The Space Shuttle has a finite life (scheduled to be retired in 2010) and no program has, or will have, infinite resources. The best available technical solution should be sought without regard to schedule and resources limitations; these will come into play when the proposal is formally brought before program management (i.e., the PRCB). The modification should be installed at the earliest opportunity to remove the risk; however, in the interim, procedural mitigations could be used to minimize the risk of continuing to fly if an acceptable-risk rationale can be developed. This is the approach we expected to see in the Integrated Risk Acceptance Approach For Return To Flight, but did not. Every risk (non-conformance) should be documented, have a documented rationale for limited acceptance, and a documented risk retirement plan with the objective of completely eliminating the risk. Again, it may not be feasible to retire all risk, but it is important for NASA to develop an understanding of what is involved in the resolution of non-conformances and the retirement of risk. We do not feel that the program is currently using this process to mitigate or accept risks. For example, it took the current NASA Administrator’s personal intervention during a technical review held shortly after his appointment to force appropriate recognition by program management that the well known and recognized ice shedding from the External Tank was, in fact, potential critical debris and should be treated as such. His further direction finally forced the slip of STS-114 to the July 2005 launch window in order to incorporate necessary technical control measures (i.e., the forward LO2 feedline bellows heater). 
Absent the Administrator’s direct action, STS-114 might very well have launched with the physical cause of Columbia’s loss (ET bipod ramp foam) fixed, but with an identified, yet unacknowledged, risk to vehicle and crew.

NASA should return to compliance with its long-established procedures for addressing risks. There are enough risks in the “unknown-unknowns” without unnecessarily increasing risk by not promptly and rigorously resolving the “known-knowns” and “known-unknowns.”
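
The documentation expected of each accepted risk (the risk itself, a rationale for limited acceptance, and a retirement plan) lends itself to a simple completeness check. The following is a minimal sketch under stated assumptions: the `RiskRecord` class and its field names are hypothetical, and the sample entry is illustrative wording rather than text from any NASA document.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RiskRecord:
    """Hypothetical record for one accepted risk (non-conformance)."""
    description: str
    acceptance_rationale: str = ""
    retirement_plan: str = ""

    def missing_items(self) -> List[str]:
        """List the documentation items that are still blank."""
        fields = {
            "description": self.description,
            "acceptance rationale": self.acceptance_rationale,
            "risk retirement plan": self.retirement_plan,
        }
        return [name for name, text in fields.items() if not text.strip()]

# Illustrative entry only; the rationale is deliberately left blank.
record = RiskRecord(
    description="Ice shed from the External Tank may strike the Orbiter",
    acceptance_rationale="",
    retirement_plan="Install the forward LO2 feedline bellows heater",
)
print(record.missing_items())
```

A check like this confirms only that each item exists; whether a rationale is technically sound still requires engineering review.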

Risk Example 2

We do not believe the risk assessment processes in place within the Space Shuttle Program are sufficiently robust. One telling sign is the program’s development of a document entitled The Integrated Risk Acceptance Approach For Return To Flight, which the Space Shuttle Program points to as a response to inquiries regarding risk assessment and risk management. The document appears to be intended for the uninitiated reader rather than being a technical document for use by the program. The main text of the May 2005 version consists of 41 pages that are essentially a chronology of events leading to the current state, written more as a general primer than as a serious treatise on the institutional process and rigor necessary for consistent, successful risk management. The Residual Risk Matrix contained in an additional 18 pages of Appendix A lists remaining tasks or “Objectives” rather than identified areas of risk. The remaining three columns, delineating “Evidence of Objective Completion,” “Remaining Risk,” and “Acceptance Rationale,” in that order, are populated by items which are frequently vague and require considerable suspension of disbelief to conclude that a particular risk is acceptable. As an example, under “Remaining Risk” on page A-3 the last item states:

“Although these efforts will in all likelihood reduce the potential for flow of liquid nitrogen through the flange and reduce the potential for foam loss in flight, there is no quantitative means to demonstrate this as fact. Previous foam divot formation in the flange area produced foam debris below the current allowable. NASA has considered and accepted this risk.”

The corresponding “Acceptance Rationale” states simply, and somewhat glibly, “Acceptable Risk.” Unfortunately, this raises more questions than it answers, such as:

How does one conclude reduction, “…in all likelihood…”?

What “…previous foam divot formation…”? Flight? Ground Test?

What is “…the current allowable…”?

How was “…allowable.” determined?

Finally, how did NASA consider and accept the risk? As a follow on, what is the plan to reassess or correct if existing risks are accepted?

Requirements Example 1

The NSTS 07700 (the top-level specification for the program) requirements are substandard in a number of areas: they are not individually numbered to facilitate referencing an individual requirement (i.e., there are multiple “shalls” per paragraph); they are often stated in an ambiguous and untestable fashion; and there is inconsistent use of terminology such as “shall,” “will,” and “must.” Given the 2010 retirement of the Space Shuttle, it does not make sense to go back and correct existing requirements; however, those requirements modified or added as part of the return to flight effort and any subsequent requirements changes should adhere to industry-standard requirements practices. This includes documenting the verification and validation criteria at the same time as the requirement (before implementation begins), a practice not in evidence in the requirements documents made available to the RTF TG. Additionally, on multiple occasions the Task Group asked about the ability to have an auditable trail from Level II (NSTS 07700) requirements and directives down to the implementing actions on the floor. The Task Group was informed that the Program Configuration Management System did not allow for such an auditable trail.

Nor does the program seem to know how to change requirements. For example, there is a requirement in NSTS 07700 for zero debris “that would jeopardize the flight crew, vehicle, mission success, or would adversely impact turnaround operations” (Book X, 3.2.1.2.14). After it was determined that the External Tank would continue shedding debris of a potentially critical size, the program decided – after 113 flights – they needed to change or waive the requirement. The first change was to add a requirement that the External Tank could not shed debris that generated impacts larger than 1,500 ft-lbs (NSTS 07700, Book X, 3.2.1.2.14.4.1). This requirement later turned out to be inadequate since the Orbiter could not be certified to withstand impacts that large. In response, a permanent waiver to this Level II requirement was proposed stating “This requirement is waived for the External Tank.” However, this generated controversy within the program and an alternate proposal was brought forward to eliminate the need for any debris waivers by adding an “exception” (see Requirements Example 2) to the top-level NSTS 07700 requirement. The Task Group does not know the status of either proposal since the PRCB does not publish minutes of their meetings. As late as the second Program DCR (June 2005), program management was attempting to establish the mechanism for documenting requirements and exceptions; by this time the hardware was on the pad.
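
Two of the requirement-quality problems noted at the start of this example (multiple “shalls” in one paragraph and inconsistent use of “shall,” “will,” and “must”) are mechanical enough to screen for automatically. The sketch below is an illustration only: the function names are hypothetical, the sample paragraph is invented rather than quoted from NSTS 07700, and a real check would need to follow whatever terminology convention the program adopts.

```python
import re
from typing import Dict, List

KEYWORDS = ("shall", "will", "must")

def count_keywords(paragraph: str) -> Dict[str, int]:
    """Count whole-word occurrences of each requirement keyword."""
    words = re.findall(r"[a-z]+", paragraph.lower())
    return {keyword: words.count(keyword) for keyword in KEYWORDS}

def findings(paragraph: str) -> List[str]:
    """Flag paragraphs that bundle requirements or mix keywords."""
    counts = count_keywords(paragraph)
    issues = []
    if counts["shall"] > 1:
        issues.append("multiple 'shall' statements in one paragraph")
    if sum(1 for keyword in KEYWORDS if counts[keyword] > 0) > 1:
        issues.append("mixed use of 'shall', 'will', and 'must'")
    return issues

# Invented sample text, not a quotation from any requirements document.
sample = ("The element shall not shed debris. The element will be "
          "inspected before flight and shall be certified.")
for issue in findings(sample):
    print(issue)
```

A screen of this kind cannot judge whether a requirement is ambiguous or untestable; it only points reviewers at paragraphs that deserve a closer look.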

Requirements Example 2

How do you meaningfully track requirements when you do not understand the definitions of programmatic terms? For instance, at the STS-114 Flight Readiness Review, the Space Shuttle Program attempted to define the terms “Waiver,” “Deviation” and “Exception.” Within the documentation listed, there were 11 definitions for “Waiver,” 7 for “Deviation,” and 5 for “Exception;” some definitions were combinations of the terms. Sometimes there were multiple definitions for a single term within one volume of NSTS 07700 – and even worse, sometimes within the same paragraph of NSTS 08171! Standard definitions for many engineering terms exist in industry and academia; NASA should adopt these standard definitions wherever possible and use them consistently. For instance, like the CAIB, the Task Group found the “in-family/out-of-family” designators to be a continuing source of confusion. Unfortunately, NASA seems to place a low priority on maintaining standard terms and definitions. The following entry was found in a list of NASA Handbooks at Headquarters: “NHB 5300.4(1G) NASA Assurance Terms and Definitions – Has been deleted – Long term plans call for development of a NASA-Standard for definitions.” We have no idea when these “long term plans” will come to fruition.

Models Example 1

NASA has in the past maintained certain models in formal requirements documents (e.g., NSTS 07700) and employed well-recognized processes for developing and using analytical models. However, during the return-to-flight effort, there has been an enormous expenditure of time and resources – amounting to tens of millions of dollars – without the discipline of a formal development plan, clear objectives, explicit plans for verification and validation, thorough outside review, documented ICDs between models, or a good understanding of the limitations of analytical systems employing multiple, linked deterministic models. Validation and verification planning has been left to the end of the process rather than the beginning. Early peer reviews were limited to the question of appropriateness for the proposed task and never reassessed or reconstructed post-development. Even the belated efforts by the NESC are not classic peer reviews. Outside peer reviews would highlight, for example, the extreme difficulty, if not impossibility, of forming an end-to-end conclusion on the confidence interval inherent in any particular result. Even more troubling, in many instances historical flight data was not used during the initial stages of model development.

On several occasions, members from the NESC and RTF TG expressed concern regarding the development and use of debris models. We observed that development test data was being used rather than verification test data in attempts to verify models. It should be noted that the development test data was obtained over widely varying test conditions. Analytical models have essentially driven the return-to-flight effort; however, industry and academic standards and methods for developing, verifying, and validating the models have not been used. In addition, no sensitivity analyses had been conducted and no empirical data from flight history had been incorporated in the models or their validation. Suggestions to use flight history, probabilistic techniques, and sensitivity analysis were disregarded. A formal request for a stand-down to evaluate the completed works was ignored.

All the while, the External Tank was being modified to meet requirements established by preliminary and interim model outputs. In December 2004, a modified External Tank meeting these interim requirements was shipped to the Kennedy Space Center with the understanding that if the final requirements determined by the modeling effort resulted in smaller debris allowables, the next tank in line would be modified to meet the more stringent requirements, a so-called “trailing tank” concept. For various reasons, the program decided to abandon the trailing tank concept before the second tank was shipped to the Kennedy Space Center.

Models Example 2

Progress has been made by the ET Project to reduce the risk of critical debris during ascent. Many of these changes were made on the basis of debris-flow modeling and transport analysis. Initial analysis and simulation of the Orbiter showed that the RCC could withstand impacts up to 1,500 foot-pounds, a figure that was turned into a requirement for the ET Project. The tools that produced these initial estimates had not been verified or validated, yet their output was used to develop and build flight hardware. Further impact testing of actual RCC, however, showed that 1,500 foot-pounds is far greater than the RCC can actually withstand reliably. This knowledge came too late, and the ET Project had already modified External Tanks based on the original 1,500-foot-pound number.

In an attempt to justify both numbers – the larger number given to the ET Project and the lower number for the Orbiter Project – a complex effort was undertaken to develop a Capability over Environment (C/E) analysis, using several of the models already being developed. In this case, the “capability” is the size and speed of impact the Orbiter can withstand, while the “environment” is the amount of debris coming off the External Tank (and other sources). Numerically, a value of “1.0” implies the hardware can withstand the environment – but just. Normally a factor of safety, often “1.4” in the Space Shuttle Program, is required for additional margin.
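
The arithmetic behind a C/E value can be sketched as follows. This is only a minimal illustration, not the program's actual analysis: the function names and the sample capability and environment values are hypothetical, and it assumes both quantities are expressed in one common measure. Only the meaning of “1.0” and the “1.4” factor of safety are taken from the text above.

```python
def capability_over_environment(capability, environment):
    """Return the C/E ratio.

    Assumption: capability and environment are expressed in the same
    measure (for example, impact energy), so the ratio is unitless.
    """
    if environment <= 0:
        raise ValueError("environment must be a positive number")
    return capability / environment


def withstands_environment(ce):
    """A C/E of 1.0 means the hardware withstands the environment, but just."""
    return ce >= 1.0


def meets_factor_of_safety(ce, factor_of_safety=1.4):
    """Check C/E against a required margin (1.4 is the value cited in the text)."""
    return ce >= factor_of_safety


# Hypothetical inputs chosen only to show the boundary case.
ce = capability_over_environment(capability=1000.0, environment=1000.0)
print(ce, withstands_environment(ce), meets_factor_of_safety(ce))
```

In this sketch a C/E of exactly 1.0 passes the bare comparison but fails the margin check, which is the distinction the paragraph above draws.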

The C/E approach was first introduced to the RTF Task Group during the September 2004 Plenary. The program provided the results of the initial assessment that indicated critical debris was a particular size, but admitted that the uncertainties were several orders of magnitude on either side. On February 17, 2005, the then-current C/E analysis was presented to the program; in almost all cases the C/E was less than 1.0, meaning that the capability of the Orbiter to withstand damage was less than the amount of debris in its expected flight environment. This analysis had been done using “worst-on-worst” conditions corresponding to certification levels, something everybody agreed was unlikely to occur in real life, but in accordance with the ground rules laid out when the C/E analysis began. The next day, at the Spaceflight Leadership Council meeting, the approach presented was that if the worst-on-worst C/E was less than 1.0, the program would look at best-estimate C/E.

It should be noted that determining exactly what factor of safety was provided by the models is a bit challenging. For instance, each of the component models in the C/E worst-on-worst analysis had a factor of safety of 1.4 built into it; therefore, a combined C/E of 1.0 in reality still had an adequate factor of safety. This became more problematic with the “best-estimate” analysis, where estimates of more likely performance (for both the Orbiter and the External Tank) were used. Keeping track of the ever-changing differences in inputs between worst-on-worst C/E and best-estimate C/E was difficult; there was little agreement in the community on what factors should be included in the best-estimate version.

Standard “4 by 3” risk chart used by the Space Shuttle Program.

At the beginning of the Delta Design Verification Review (April 26-27, 2005), the rules stated that if the “best-estimate” of Orbiter capability relative to debris environment (C/E) was less than 1.5, then the likelihood on the standard program 4×3 risk matrix would be “infrequent,” and if 1.5 or greater it would be categorized as “remote.” For example, for the LO2 intertank flange closeout, the best-estimate C/E was 1.1, which should have classified the risk as “infrequent.” However, many argued that the problem was sufficiently well understood that it could be put in the “remote” category despite its low best-estimate C/E. Others said it should be “infrequent” since that is what the process dictated, and that it involved “old” foam and therefore contained a greater uncertainty. At that point, Space Shuttle Program management pronounced that the best-estimate C/E value of 1.5 determining an “infrequent” vs. “remote” rating was only a guideline, not a rule. The middle of a design review does not seem an appropriate time to be changing the rules.
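
The rule as stated at the start of the review is simple enough to write down, which makes the later reclassification easier to see. The following is a minimal sketch of that stated rule only; the function name is hypothetical and nothing here represents an actual program tool.

```python
def likelihood_category(best_estimate_ce, threshold=1.5):
    """Map a best-estimate C/E to a likelihood on the 4x3 risk matrix.

    Per the rule stated at the start of the review: below the threshold
    is "infrequent"; at or above the threshold is "remote".
    """
    return "remote" if best_estimate_ce >= threshold else "infrequent"


# Best-estimate C/E cited in the text for the LO2 intertank flange closeout.
print(likelihood_category(1.1))  # infrequent
```

Under the rule as written, a best-estimate C/E of 1.1 can only yield “infrequent”; placing it in “remote” requires treating the threshold as a guideline rather than a rule.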

The arbitrary nature of the requirements/fulfillment process is demonstrated by the February 2005 changes in approach (from “worst-on-worst” to “best estimate”) and reduction in factors of safety to make the numbers come out right. When asked during the RTF TG March 2005 Plenary whether the tank would change when results from the modeling were finally available, the answer was “… no, that’s why we’re changing the models so we don’t have to change the tank.”

The program continued to develop and make decisions on analysis techniques, such as best-estimate C/E, which used non-standard approaches. This history of the C/E logic raises questions regarding the management of the return-to-flight effort. Fortunately, this approach was abandoned when the Program reluctantly initiated probabilistic analyses on additional critical areas at the direction of the new Administrator. Analytical models have essentially driven (and delayed) the return-to-flight effort even though industry and academic standards and methods for developing, verifying, and validating the models have not been used.

Models Example 3

During the reviews of the probabilistic analysis efforts, requests were made by several people, including Task Group members, to clearly state the assumptions going into these models. It was not until the end of the 6-week review effort that the assumptions for some of the models used for this activity were recorded; for some of the models, no comprehensive set of assumptions was ever documented. By their nature, these models are complex and sophisticated analysis tools. Therefore, the quality of the original assumptions is important; they should be written down and consistently applied.

Leadership Example 1

A clear example of a lack of budgetary restraint was demonstrated by the management of the TPS repair projects. For an extended period of time, there were few constraints placed on the development teams, and money was spent on the development of almost any idea that was proposed. Only when the large-scale RCC rigid overwrap repair effort was finally deemed untenable was it removed from the list of options being considered for STS-114. At the direction of the Space Shuttle Program, during late 2004 the Orbiter Project Office initiated a study to eliminate some of the repair options in an attempt to focus manpower and budget on the most promising techniques. This resulted in resources being expended to generate proposals, and an evaluation team worked over the winter holidays to develop a recommendation on which options to select. Ultimately the Orbiter Project Office brought forward two tile repair options and a single RCC repair option to Space Shuttle Program management. Nevertheless, six options (two for tile repair and four for RCC repair) were actually considered by the program, and only one RCC option was ultimately dropped. Surprisingly, although a stated reason for performing the down-select was to reduce expenditures, cost estimates were explicitly excluded from the factors used in the decision.

Several months later, when two cost Change Requests (CR) for continued tile and RCC repair development were brought to the PRCB (asking for nearly $100M for the last 5 months of FY05), the CR sponsor objected to suggestions that the program needed to consider whether this was how limited resources should be spent. In the end, the only criterion that determined how much would be spent on the two CRs was whether all the money could be spent by the end of the fiscal year, not whether this was a wise use of limited program resources.
