Test data management risks and solutions
Providing quality test data is a challenge across the software testing life cycle. A significant amount of resources is spent on creating, maintaining, and archiving test data using manual and semi-automated processes. Using direct copies of production data without de-risking it may result in the exposure of sensitive customer data and financial data, thereby violating regulatory and compliance directives.
Acquiring de-risked high-quality test data faces the following challenges:
1. Distributed Environment – In the current outsourcing model of development, different stages of the software development lifecycle are executed across multiple locations, yet the data has to be seen and worked on in its entirety at each of them. Banks therefore have to deal with their data leaving their premises and resolve the resulting confidentiality and security problems. Also, in an Agile approach, where time to market is critical, there is very little additional time available for data generation.
2. Data Complexity – Often, testing teams have to work with different types of data stored in multiple environments in the legacy back-end. For assurance purposes, this data needs to be unified in a central repository. This is a big challenge, often made harder by the lack of documentation on how the databases relate to each other and how to connect them.
3. Differences in Types of Testing – Different types of testing, such as user acceptance testing (UAT), system integration testing (SIT), and performance testing, require different types of data. The effort testers spend preparing test data must therefore be minimized, while the resulting data still delivers maximum coverage and sufficient volume, in the correct format, wherever needed.
4. Data Security and Confidentiality – Moving confidential data to the test environment is a risky proposition. With regulators imposing strict compliance norms on banks, the need for effective data masking becomes all the more important.
A solution that addresses these challenges needs to do the following: unify data from multiple sources, provide copies of production data, generate data for code coverage, de-risk production data based on regulatory and compliance requirements, de-duplicate and reuse data across multiple test environments, and effectively categorize and establish relationships between databases. More importantly, all this has to be done without any loss in data quality or integrity. The solution must also allow data to be provisioned to multiple locations across the globe if needed.
Data Profiling & Categorization
This involves identifying relationships between different types of data at the source level. As a best practice, production data should not be used directly for the data discovery process; a disaster recovery database or a data backup can serve as the source for data profiling and categorization instead. This process can be done manually or with tools such as IBM Discovery, Oracle EM, and CA Test Data Manager. In addition to data profiling, it is ideal to categorize the data based on its business nature and usage, for example as transaction data, financial data, or master data. This prepares the data for the next steps in test data management.
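As a rough illustration of manual profiling, the sketch below looks for candidate relationships between tables by checking whether every value in one column also appears in a column of another table. The function name, the table layout (a dict of row lists), and the sample tables are assumptions made for this example; they are not taken from any of the tools named above, which use far richer heuristics.

```python
from itertools import permutations


def candidate_relationships(tables):
    """Return (child_table, child_col, parent_table, parent_col) tuples where
    every non-null value of the child column also occurs in the parent column.

    `tables` is a hypothetical in-memory stand-in for a backup copy:
    a dict mapping table name -> list of row dicts.
    """
    # Collect the distinct non-null values of every column.
    columns = {}
    for table_name, rows in tables.items():
        for row in rows:
            for col, value in row.items():
                if value is not None:
                    columns.setdefault((table_name, col), set()).add(value)

    found = []
    for child, parent in permutations(columns, 2):
        if child[0] == parent[0]:
            continue  # only look for relationships across tables
        if columns[child].issubset(columns[parent]):
            found.append((*child, *parent))
    return sorted(found)


# Illustrative sample data (invented for this sketch).
customers = [
    {"customer_id": 1, "name": "A"},
    {"customer_id": 2, "name": "B"},
    {"customer_id": 3, "name": "C"},
]
accounts = [
    {"account_id": 10, "owner": 1},
    {"account_id": 11, "owner": 3},
]

relationships = candidate_relationships(
    {"customers": customers, "accounts": accounts}
)
```

Here the only candidate found is that accounts.owner refers to customers.customer_id. A containment check like this only suggests relationships; each candidate still needs to be confirmed against business knowledge, since small tables can produce coincidental matches.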
For data masking, it is best to first consider which type of masking suits the current goals. There are different techniques for effective data masking, such as substitution, shuffling, and user-defined functions. For example, name columns can be masked using the substitution technique, which replaces each value with another meaningful name. For functionally significant data such as credit card numbers, where simple substitution or scrambling will fail in functional tests, it is better to use the Luhn algorithm to generate functionally valid credit card numbers.
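The three techniques mentioned above can be sketched in a few lines. This is a minimal illustration, not a production masking routine: the function names, the substitute name list, and the choice to keep the first six digits of a card number are all assumptions made for this example.

```python
import hashlib
import random

# Hypothetical lookup list of replacement names.
SUBSTITUTE_NAMES = ["Alex Morgan", "Sam Taylor", "Chris Jordan", "Pat Casey"]


def substitute_name(name, names=SUBSTITUTE_NAMES):
    """Substitution: map a real name to a meaningful replacement.
    Hashing makes the mapping repeatable, so the same input always
    yields the same masked name."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return names[int.from_bytes(digest[:4], "big") % len(names)]


def shuffle_column(values, rng):
    """Shuffling: keep the same set of values but break the link to rows."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def luhn_check_digit(partial):
    """Return the Luhn check digit for a digit string that lacks one."""
    total = 0
    # Walk right to left; the rightmost digit of `partial` is doubled first.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number):
    """True if the last digit of `number` is the correct Luhn check digit."""
    return luhn_check_digit(number[:-1]) == number[-1]


def mask_card_number(original, rng):
    """Keep the length and first six digits (an assumption of this sketch),
    randomize the middle digits, and recompute the check digit so the
    masked number still passes Luhn validation."""
    middle = "".join(rng.choice("0123456789") for _ in range(len(original) - 7))
    body = original[:6] + middle
    return body + luhn_check_digit(body)


rng = random.Random(0)  # seeded so the masking run is repeatable
masked_card = mask_card_number("4111111111111111", rng)
masked_name = substitute_name("Jane Doe")
```

The masked card number differs from the original in its middle digits but still passes the same checksum validation that an application under test would apply, which is why it survives functional tests where plain scrambling would not.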
There are scenarios – such as performance testing, testing an enhancement that is absent in production, testing negative cases, etc. – where data pulled from production databases may not be sufficient for testing. In such cases, data generation tools such as CA Test Data Manager can be used to synthetically create data that satisfies the requirements (in terms of volumes/business rules) and can substitute production data. Also, since it is synthetically generated, the data can be used in different environments without violating regulatory or compliance guidelines.
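To make the idea concrete, here is a small sketch of rule-based synthetic data generation using only the Python standard library. The record layout, the value ranges, and the negative_ratio parameter are invented for illustration; a commercial tool would derive such rules from the schema and business requirements.

```python
import random
from datetime import date, timedelta


def generate_transactions(count, seed=42, negative_ratio=0.1):
    """Generate `count` synthetic transaction rows.

    A share of the rows (roughly `negative_ratio`) deliberately violates
    the hypothetical rule "amount must be positive" so that negative
    test cases are covered. A fixed seed makes the data set reproducible.
    """
    rng = random.Random(seed)
    start = date(2020, 1, 1)
    rows = []
    for i in range(count):
        is_negative_case = rng.random() < negative_ratio
        amount = round(rng.uniform(0.01, 5000.00), 2)
        if is_negative_case:
            amount = -amount  # deliberately invalid value
        rows.append({
            "txn_id": f"TXN{i:08d}",
            "account_id": f"ACC{rng.randrange(1, 1001):06d}",
            "amount": amount,
            "txn_date": (start + timedelta(days=rng.randrange(365))).isoformat(),
            "expect_rejection": is_negative_case,
        })
    return rows


# Volume for a performance test is just a matter of raising `count`.
sample = generate_transactions(1000)
```

Because no row originates from a real customer, a data set like this can be copied freely between environments, and the expect_rejection flag tells the test which rows the system under test should refuse.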
Copy Data Virtualization
Data virtualization is extremely useful when multiple test environments need the same set of data: it lets us provision data to those environments through virtual data environments. With tools like Actifio and Delphix, virtual data can be provisioned so that any change a user makes is stored only in that user's local copy (which holds the changes alone) and doesn't affect the main copy.
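The behaviour described above, where each user sees the full data set but stores only their own changes, can be illustrated with a toy in-memory analogy using Python's collections.ChainMap. This is only a conceptual sketch: it says nothing about how Actifio or Delphix are implemented, and the names golden_copy and provision_virtual_copy are invented for the example.

```python
from collections import ChainMap

# The shared main copy (for example, masked production data).
golden_copy = {"cust-1": 100, "cust-2": 250, "cust-3": 75}


def provision_virtual_copy(base):
    """Return a view over `base` whose writes land in a private,
    initially empty layer. Reads fall through to `base` for any key
    the user has not changed."""
    return ChainMap({}, base)


tester_a = provision_virtual_copy(golden_copy)
tester_b = provision_virtual_copy(golden_copy)

# Tester A changes one record; only the delta is stored locally.
tester_a["cust-1"] = 0

local_changes_a = tester_a.maps[0]      # the private layer of tester A
balance_seen_by_b = tester_b["cust-1"]  # tester B still sees the main copy
```

After the update, tester A's private layer contains a single entry, while tester B and the main copy still hold the original value. Provisioning another environment costs only an empty dictionary rather than a full copy, which is the essence of the storage savings that copy data virtualization offers.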
In addition to methods and tools that address the challenges of test data management, there are processes that enable a functional and effective test data management solution. Defining the test data request workflow, the documents to be used during requests, SLAs, metrics, etc. will facilitate a seamless test data service.
*This article was previously published in Software Testing Magazine on 26 June, 2017