The Power of a Test Data Lake in Agile Environments
A recent survey of more than 250 professional QA engineers worldwide found that over a third of testing time is spent managing test data. When a business process under test spans multiple systems, finding suitable test data becomes close to impossible for test engineers.
For most test engineers, finding a golden record that can be used to test a business process end to end means emailing a DBA or test data architect with a profile of the data required. That person, in turn, runs a series of SQL queries to find the right set of test data for the QA engineer. The process can take anywhere from 15 minutes (for a repeat request where the data already exists) to multiple days (if data that fits the profile has to be provisioned). Very often, the result is missed test cases, and test cases that fail for lack of proper test data.
A Test Data Lake allows test data architects to load the test data for an environment into a data lake. The data can be drawn from a production environment after sensitive information has been masked, with generated data filling in tables that do not exist in the production system. Test Data Lake systems built on top of Hadoop take advantage of Hadoop's inexpensive storage.
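As an illustration of the masking step, here is a minimal sketch in Python. It assumes rows arrive as dictionaries and that the set of sensitive columns is known in advance; the field names, the salt, and the helper names `mask_value` and `mask_row` are hypothetical and not part of any particular Test Data Lake product. Hashing with a salt is only one possible masking technique, chosen here because the same input always produces the same masked value, which keeps joins across tables intact.

```python
import hashlib

# Hypothetical list of columns treated as sensitive; a real system would
# take this from a data classification catalog.
SENSITIVE_FIELDS = {"name", "email", "ssn"}


def mask_value(value: str, salt: str = "test-lake") -> str:
    """Return a deterministic 12-character pseudonym for a sensitive value."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def mask_row(row: dict) -> dict:
    """Mask sensitive fields of one row; other fields pass through unchanged."""
    return {
        key: (mask_value(str(value))
              if key in SENSITIVE_FIELDS and value is not None
              else value)
        for key, value in row.items()
    }


if __name__ == "__main__":
    production_row = {"id": 1, "name": "Alice Example", "email": "alice@example.com"}
    print(mask_row(production_row))
```

Truncated hashes are not a substitute for a reviewed anonymization policy; the sketch only shows where masking sits in the load path.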
A self-service interface on top of the test data lake gives test engineers direct access to the test data. They can visualize the data to confirm complete test coverage and generate test data where none exists. Test engineers can tag rows of data with test cases, both for easy reference and to mark those records as reserved. Reserving rows lets test engineers work on their test cases without fear of stepping on another test engineer's test data.
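The tagging and reservation idea can be sketched as a small in-memory registry. This is an assumption-laden illustration, not the interface of any real tool: the class `RowReservations`, its methods, and the row and test case identifiers are all hypothetical, and a real system would persist reservations in the lake rather than in a Python dictionary.

```python
class ReservationError(Exception):
    """Raised when a row is already reserved by a different engineer."""


class RowReservations:
    """Hypothetical registry mapping row ids to (engineer, test case) tags."""

    def __init__(self):
        self._owners = {}  # row_id -> (engineer, test_case)

    def reserve(self, row_id, engineer, test_case):
        """Tag a row with a test case and reserve it for one engineer."""
        owner = self._owners.get(row_id)
        if owner is not None and owner[0] != engineer:
            raise ReservationError(f"row {row_id} is reserved by {owner[0]}")
        self._owners[row_id] = (engineer, test_case)

    def release(self, row_id, engineer):
        """Release a row; only the engineer holding it may do so."""
        owner = self._owners.get(row_id)
        if owner is None or owner[0] != engineer:
            raise ReservationError(f"row {row_id} is not reserved by {engineer}")
        del self._owners[row_id]

    def rows_for(self, test_case):
        """Return the sorted row ids tagged with the given test case."""
        return sorted(
            row_id
            for row_id, (_, tagged_case) in self._owners.items()
            if tagged_case == test_case
        )
```

A second engineer who calls `reserve` on a row that is already held gets a `ReservationError` instead of silently sharing the record, which is the behavior the reservation feature is meant to guarantee.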
Once test engineers are satisfied with the test data in the test data lake, they can save a copy of just the data they need for their test cases in the lake. Before every test automation run, the test environment can be provisioned with only the test data required for the test cases. This provisioning does not affect other test engineers and can be added to the DevOps environment provisioning strategy.
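A provisioning step of this kind might look like the following sketch, which uses two SQLite connections from Python's standard library as stand-ins for the lake and the test environment. The `customers` table, its columns, and the function `provision_customers` are hypothetical; a Hadoop-based lake would use its own query engine, but the shape of the step (clear the target, copy only the selected rows) would be similar.

```python
import sqlite3

# Hypothetical schema shared by the lake and the test environment.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS customers "
    "(id INTEGER PRIMARY KEY, name TEXT, status TEXT)"
)


def provision_customers(lake, env, row_ids):
    """Reset the environment table and load only the selected lake rows.

    Returns the number of rows copied.
    """
    env.execute(SCHEMA)
    env.execute("DELETE FROM customers")  # start every run from a clean table
    row_ids = list(row_ids)
    rows = []
    if row_ids:
        placeholders = ",".join("?" for _ in row_ids)
        rows = lake.execute(
            f"SELECT id, name, status FROM customers WHERE id IN ({placeholders})",
            row_ids,
        ).fetchall()
        env.executemany(
            "INSERT INTO customers (id, name, status) VALUES (?, ?, ?)", rows
        )
    env.commit()
    return len(rows)


if __name__ == "__main__":
    lake = sqlite3.connect(":memory:")
    lake.execute(SCHEMA)
    lake.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "masked-a", "active"), (2, "masked-b", "closed"), (3, "masked-c", "active")],
    )
    env = sqlite3.connect(":memory:")
    print(provision_customers(lake, env, [1, 3]))
```

Because the function deletes and reloads the target table on every call, running it before each automation run gives the same starting state each time, which is what makes it suitable as a pipeline step.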
By managing test data in a Test Data Lake and providing a self-service UI on top of it, test engineers are empowered to manage their own test data and become more efficient. Powerful analysis and visualization techniques allow for better test data coverage, which in turn improves quality. Making test data provisioning part of DevOps makes repeatable testing possible in an Agile environment.