Software Quality with Engineered test data

Nov 7, 2021

engineered-test-data
test-data
software-quality

What is one important factor in a project’s success? If you are a QA, your answer might be testing — but what makes testing good? Answers vary, yet one thing most of us relate to is good test data.

Bad data does not always mean “wrong values.” It often means too much of the right values, locked in a shared database nobody dares touch.

Test data as foundation for quality

Everyone in software agrees test data matters. So why do we ignore it until a batch job fails at 2 a.m. or automation times out on a million-row join?

For production-alike data in test environments, see ThoughtWorks technology radar blip — useful context, not a mandate to copy prod wholesale.

Common test data problems

Common test data pain points

Data aggregation / batch runs

Prod-equivalent volume in lower environments causes:

Longer batch and aggregation time
More storage cost
Uncertainty on corner and edge cases
Slow validation queries
Automated tests that timeout or run unacceptably long

Sound familiar? YOUR UI E2E waits on a warehouse sync that prod needs — but QA only needs ten rows to prove the rule.

Data complexity

Production-equivalent data is rarely simple:

New terminologies, patterns, and standards to learn
Data entangled across sources, tables, schemas
Scenarios impossible in test env without engineered subsets

Data accessibility

Multiple teams share environments — YOU cannot freely mutate data
Time spent learning DB names, relations, columns, and constraints before changing one row

Inefficient processes

Challenge the current process. Introduce a model where testers get accurate data faster — engineered data in version control, reviewed like code.

What we faced on one project

Project challenges overview

Prod-equivalent volume on lower environments
Data size > 500 GB (> 40 interrelated tables, millions of rows)
Batch processing > 8 hours — one failed run cost a full day of feedback
Cloud storage cost climbing
Multiple teams on shared envs — ad hoc changes broke each other’s scenarios
Views and procedures for UI validation — heavy maintenance, multi-minute queries

We could not test microservices or components fast when the data layer was the bottleneck.

The solution — engineer, do not clone

Solution approach

Goals:

Faster aggregation / batch (faster feedback)
Faster SQL for manual and automated checks
Lower cloud storage on Dev, QA, Training

Iterative approach made reduction manageable and built confidence in reduced data reliability.

Iterative data reduction approach

Step 1 — Data analysis

Consider a patient health management DB — row counts in the millions. Do we need millions of patient records to validate the system?

Production-like volume example

Analyze existing data, types, and application impact
Map joins and interconnectivity
Set target reduced size (KB/MB or row counts per table)
Remember source control limits — e.g. Git diff limits (~1 MB / 20k lines for readable diffs)

Step 2 — Data preparation

List data scenarios and corner cases
Data skimming chain — ~10 patients across gender, age, geography may suffice
Follow the chain for those patients through related tables
Skim dependent tables consistently

Data skimming chain

Alternatively, TABLESAMPLE can pick random rows — use with care on join keys.

TABLESAMPLE flow

Preserve join keys across skimmed tables
Note duplicate business rows you may collapse or update for edge cases

Step 3 — Data validation

Dump to a trial test DB
Local or Docker app against test DB
Manual sanity on coverage and edge cases
Run automation for integrity and compatibility
Be patient — iterative; this step takes time
Watch Git tracking limits on large SQL dumps

Step 4 — Data PR

After satisfactory validation:

SQL dump for review

Track changes in Git; keep files reviewable
Export as .sql insert scripts

If files exceed diff limits, batch inserts with awk — example pattern:

awk 'BEGIN{match_count=0;} $0~/^INSERT.*VALUES \(/{match_count++;if (match_count%1000!=1) gsub("INSERT INTO .* VALUES \\(","(", $0); if (match_count%1000!=0) gsub(";$",",", $0); print $0}' input.sql > output.sql

Before awk batching After awk batching Size reduction result

Raise a Data PR — stakeholders review data changes like schema changes.

Step 5 — Data deployment

Deploy via scripts — e.g. createSchemaAndData.sh, DropAndCreateTable.sql:

Deploy script Drop and create tables

Repository layout example:

Data repository structure

Major steps summary:

Data reduction major steps

Iteration cycle:

Data reduction iteration

Final outcome

Reduction outcomes

DB size < 20 MB vs 500 GB (~500,000 MB)
Batch processing < 15 minutes vs 8 hours — multiple feedback cycles per day
Cloud storage cost down for Dev, QA, Training
Teams work in parallel with change management and approvals, zero-downtime deploys
Complex views and procedures run in under a second vs minutes

Those numbers unlocked the test pyramid we wanted — API and component tests could run without apologizing for data.

Steps for the future

Documenting: Quick fact sheets — glossaries, key queries, procedures — for onboarding without reading the entire wiki.

Data management: As teams grow, formalize ownership: who approves data PRs, how often you refresh subsets, how automation seeds fixtures.

Test data preparation is critical to quality assurance. Forethought, creativity, and industry practices beat copying prod and hoping.

YOUR next move

Pick one table that dominates storage in YOUR lower env. What is the smallest row set that still tells the truth about one business rule?

Happy Testing :)

Comments

Comments are hosted by Disqus. The free tier may show sponsored content.