In science, results are obtained through a combination of theorizing and experimentation. A theory is tested through the collection and interpretation of facts. This is usually an iterative process: the experiment generates new facts that lead to changes in the theory, which in turn call for a new experiment.
In the IT world, such an iterative approach can also be beneficial. This Whitebook describes a successful project in which the Lean Six Sigma DMAIC method was used to improve results by generating and analyzing data.
Case: data migration project
A client needed a new financial application due to changed legislation. In addition, all data (20,000 customer cases) from the old application had to be migrated to the new application. The project had a tight go-live date, and the quality requirement was that at least 92% of the customer cases would be migrated to the new application without faults.
This project had the following challenges:
- The new table structure was different from the old structure, so the data had to be converted, an error-prone process.
- The new application was still under development.
- The calculated results from both applications could not easily be compared because of the functional changes.
Given all these challenges, how could we determine whether the data migration was successful, and how could we find root causes when problems arose?
Because of the many uncertainties, an iterative approach was chosen, with an automated process for comparing predicted and calculated outcomes. The approach was based on the Lean Six Sigma DMAIC methodology (Define, Measure, Analyze, Improve, Control).
Lean Six Sigma DMAIC
Six Sigma is a quality improvement process that focuses mainly on reducing variability in production quality.
Lean is a method (almost a philosophy) focused on creating value in a process and removing unnecessary process steps. Lean is largely derived from the Toyota Production System, is now in use around the world in various companies (including Dell and Porsche), and has been applied to other types of processes (e.g. product, service, and IT processes such as software development and project management).
In recent years, Lean and Six Sigma have increasingly been combined into an integrated approach: Lean is used first for the large process improvements, and Six Sigma to dot the i's.
The core process within Six Sigma is the DMAIC process. This is a variation on the well-known Deming Plan-Do-Check-Act cycle (though the process is usually followed only once and takes between one and five months).
| Step | Description |
|---|---|
| Define | Define the problem and determine how the process can be measured. |
| Measure | Measure relevant data and build knowledge. |
| Analyze | Analyze the facts and find the causes of errors. |
| Improve | Fix the problem. |
| Control | Implement the solution within the organization. |
In the Lean Six Sigma DMAIC process, every step is supported by various 'tools', such as statistical calculations and problem analysis techniques. In our data migration project we used only a limited set of Lean Six Sigma tools. Through automation we sped up the DMAIC process so that one DMAIC cycle could be completed within a single day.
Together with the client we determined how the quality of the data migration could be assessed. Testing the data migration alone turned out to be insufficient to guarantee the required outcome; the client, understandably, had no knowledge of table structures and data types. So although our project scope was only to convert data, the project's outcome could only be assessed by calculating the customer cases within the new application.
The client selected 200 customer cases that were sufficiently representative of the entire contents of the database. These 200 cases were then calculated manually by the client. The results were delivered in a spreadsheet, and we stored this data in the database as 'predicted outcomes'.
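As an illustration, storing the client's spreadsheet as 'predicted outcomes' might look like the sketch below. This is a minimal sketch, assuming a CSV export of the spreadsheet and an SQLite database; the table and column names are invented for illustration and were not part of the project.

```python
# Minimal sketch: load the client's manually calculated predictions
# (a CSV export of the spreadsheet) into a 'predicted outcomes' table.
# Table and column names are illustrative, not taken from the project.
import csv
import io
import sqlite3

# Two illustrative cases; the real project had 200 selected cases.
csv_data = io.StringIO("case_id,amount\n1001,250.00\n1002,119.50\n")

conn = sqlite3.connect(":memory:")  # the real project used its own database
conn.execute("CREATE TABLE predicted_outcomes (case_id INTEGER PRIMARY KEY, amount REAL)")
rows = [(int(r["case_id"]), float(r["amount"])) for r in csv.DictReader(csv_data)]
conn.executemany("INSERT INTO predicted_outcomes VALUES (?, ?)", rows)
conn.commit()
```

Keeping the predictions in the same database as the migrated data is what makes an automated, repeatable comparison possible later on.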
Manually comparing results would have been too much work; we would not have been able to meet the deadline. In addition, there were too many changes in functionality and data to rely on a single test iteration.
We needed an automated process to repeatedly compare the calculated and predicted outcomes: a migration framework that we could run over and over. The figure below shows this migration framework in green, the newly developed application in blue, and the database from the old application in grey.
First a copy of the old production database was created. With a push of a button we could start the migration process:
- The data in the old structure was translated to the new structure.
- The new application was started.
- The calculated results were then automatically compared with the results predicted by the client and the differences were reported.
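The automated comparison step above can be sketched as follows. This is a simplified sketch that assumes each outcome is a single amount per case; in the real project the comparison covered complete customer cases, and all names and values here are illustrative.

```python
# Hypothetical sketch of the comparison step: results calculated by the new
# application are matched against the outcomes predicted by the client, and
# differences are reported. Names and values are illustrative.

def compare_outcomes(predicted, calculated, tolerance=0.01):
    """Compare predicted and calculated amounts per case id."""
    differences = []
    for case_id, expected in predicted.items():
        actual = calculated.get(case_id)
        if actual is None:
            differences.append((case_id, expected, None, "missing in new application"))
        elif abs(actual - expected) > tolerance:
            differences.append((case_id, expected, actual, "value mismatch"))
    return differences

predicted = {1001: 250.00, 1002: 119.50, 1003: 80.00}
calculated = {1001: 250.00, 1002: 120.75}

for case_id, expected, actual, reason in compare_outcomes(predicted, calculated):
    print(f"case {case_id}: expected {expected}, got {actual} ({reason})")
```

Because the comparison is automated, it costs nothing to rerun it after every change, which is what made a full DMAIC cycle per day feasible.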
For the differences the root cause was investigated. Lean Six Sigma has many tools in the ‘Analyze’ step to interpret data and identify causes of problems. We made use of:
- Five times "why"
The "five times why" technique literally means that you ask the question "why did this go wrong?" five times. Experience shows that by the fifth time you will usually have found the real root cause.
- Pareto chart
Classifying data into subgroups in order to exclude irrelevant data and focus on problem areas. This helps to determine which 20% of the root causes are responsible for 80% of the problems.
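A Pareto classification can be sketched as follows; the error categories and counts below are invented for illustration and are not the project's actual data.

```python
# Minimal sketch of a Pareto analysis: rank error categories by frequency
# and find the 'vital few' categories that cover roughly 80% of the problems.
# Categories and counts are invented for illustration.
from collections import Counter

errors = [
    "conversion", "conversion", "conversion", "conversion", "conversion",
    "old-data", "old-data", "old-data",
    "new-app-bug", "new-app-bug",
    "export", "report",
]

counts = Counter(errors).most_common()  # sorted from most to least frequent
total = sum(n for _, n in counts)

cumulative = 0
vital_few = []
for category, n in counts:
    cumulative += n
    vital_few.append(category)
    print(f"{category:12s} {n:3d}  {100 * cumulative / total:5.1f}% cumulative")
    if cumulative / total >= 0.8:
        break  # focus the root cause analysis on these categories first
```

Sorting the categories and cutting off at the cumulative 80% mark is exactly the "vital few versus trivial many" idea behind the Pareto chart.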
The following root causes were found during the project:
- GIGO (Garbage In, Garbage Out): the data in the old application was wrong (a previously undiscovered bug in the old application);
- an error occurred during the export of data from the old database to the copy database;
- the conversion of data from the old to the new structure went wrong;
- there was a bug in the new application;
- the new application worked as specified, but the specifications were wrong;
- the manually calculated predictions were wrong;
- the differences report itself was wrong.
In other words, no part of the data migration process was free of errors. From our experience with migration projects, we consider it normal for errors to occur in all parts of the process; what we consider exceptional is that all of these errors were found and dealt with.
For each problem identified, a (possible) root cause and a possible solution were determined. In some cases we had a clear-cut problem and solution; in other cases we had nothing more than a hypothesis. Using the migration framework, we could quickly perform experiments to test our hypotheses.
Solutions to the problems found were implemented. Changes to the software and to the predicted outcomes were secured through configuration management, and the migration framework was run again.
In the first two weeks of the project we designed the migration and set up the migration framework. After that, we executed the migration framework several times a day.
The first migration results were far below the target of 92% error-free records: less than 10% of the calculated results were correct (see chart below).
By iterating the DMAIC process we could improve results very quickly. The data generated by the migration framework helped us search for problems and root causes. We could also afford to perform experiments: rather than intensive and lengthy analysis, we could quickly try out a small change by running the framework.
Ultimately the migration project was a great success:
- Completed within schedule;
- Quality better than required (99.2% instead of the required 92%; on a total of 20,000 cases that is a gain of 1,440 cases);
- During the migration project more than 30 bugs in the new application were found and resolved;
- We found errors in both the old application and database;
- Incorrect assumptions in the design of the new application were found and amended.
Other examples of DMAIC
DMAIC is a widely applicable approach for process improvement. It has also demonstrated its value in other types of processes.
A helpdesk organization notices that one specific client reports far more problems than other clients. After a year, a structured process is implemented for root cause analysis and for recording the findings. Over time it becomes apparent that more than half of the reported problems are related to a lack of technical knowledge at the customer. Regular consultancy sessions are introduced in which the reported problems and their causes are discussed and specific steps are taken to build knowledge. As a result of this cycle of assessments and improvements, the number of reported problems decreases significantly.
A client complains about the availability of the system, claiming that 10% of the transactions fail. A lot of time is spent searching for the root cause, but it is very difficult to find specific cases to analyze because no data is recorded.
Gradually the application is modified to measure and record its own behavior. It turns out that 3% of the transactions go wrong, of which 0.1% have a root cause within the application and 2.9% within the client's infrastructure.
A financial organization provides a monthly export of financial data to another party. Month after month, that export contains errors. The underlying financial calculations are very complex, and changes to the software design lead to new errors each time. Then the approach is changed: the software is amended to store all intermediate results in the database to facilitate analysis. Root cause analysis of the problems, assisted by this additional data, helps determine the software changes that are needed. The next data export is fault-free.
Examples of the DMAIC methodology can be found in various aspects of software development. Scrum has its roots in the Lean philosophy, where the process of designing, building, and testing software is reduced to only those steps that create value (in this case, working software).
Also, many best practices of "good software development" are based on the same underlying principles. One example is the incorporation of logging and tracing code into applications to measure their behavior; another is test-driven development, which takes a data-driven approach to software quality.
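As a sketch of that principle, building measurement into an application can be as simple as the snippet below, which uses Python's standard logging module; the function and message names are invented for illustration.

```python
# Minimal sketch of instrumenting an application to measure its own
# behavior, using Python's standard logging module. The function name
# and log messages are illustrative.
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("transactions")

def process_transaction(case_id):
    start = time.perf_counter()
    try:
        # ... actual business logic would go here ...
        log.info("case %s processed in %.3f s", case_id, time.perf_counter() - start)
    except Exception:
        # Recording failures with context is what makes later
        # root cause analysis possible.
        log.exception("case %s failed", case_id)
        raise

process_transaction(1001)
```

Recording successes as well as failures means the data needed for a Measure and Analyze step already exists when the application misbehaves.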
When do you stop analyzing problems, and when do you start proving your ideas and assumptions by measuring? Often we choose analysis because we have confidence in our ability to create solutions using our 'brainpower'. We also often think that analysis is more time-efficient than measurement and experimentation.
My experience is that in complex environments too much faith is put in analysis alone. When there are many uncertainties, our mental models have very little predictive power. It is wiser to iteratively test theory against facts.
Building in measurements during the initial design of the system can greatly reduce time and costs. This also applies to applications or processes that are seemingly stable and predictable. The time and effort spent on building in monitoring will pay for itself when the application or process does not perform as planned (which is usually the case).