Data Masking 101

Ronald Bradford
October 20, 2022

I continue to dig up and share this simple approach for production data masking via SQL to create testing data sets. Time to codify it into a post.

Rather than generating a set of names and data from tools such as Mockaroo , it is more practical to use actual data for a variety of testing reasons.

The SQL below is a self-explanatory approach of removing Personal Identifiable Information (PII) , but keeping data relevant. I use this approach for a number of reasons.

We are using production data rather than synthetic data. Data volume, distribution, and additional column values are realistic. This is a subset of an example, but dates and locations are therefore realistic
Indexes (and unique indexes) still work, and distribution across the index is adequate for searching. Technically the index will be a little larger in disk footprint.
You cannot reverse engineer the masked value into a real value with just this data set. An engineer in a test environment cannot obtain the underlying information.
If you identify an issue with data quality for any row of data, there is a way to present the uniqueness of that row. This enables a person with production access to match the underlying row. Of course, any unique identifier (auto increment or UUID) should also be modified to mask real data.

`
SELECT CONCAT(SUBSTR(first_name,1,2),REPEAT(’’,LENGTH(first_name)-2)) AS first_name,
CONCAT(SUBSTR(last_name,1,3),REPEAT(’’,LENGTH(last_name)-3),’ ‘, SUBSTRING(MD5(CONCAT(first_name,last_name)),1,6)) AS last_name,
CONCAT(SUBSTR(organization,1,3),REPEAT(’*’,LENGTH(organization)-3),’ ‘, SUBSTRING(MD5(CONCAT(organization)),1,6)) AS organization,
created, country
FROM customer
LIMIT 10;

+------------+--------------------+------------------+---------------------+---------+
| first_name | last_name | organization | created | country |
+------------+--------------------+------------------+---------------------+---------+
| Sa**** | Cor**** 4c23cd | Ski*** d21420 | 2022-09-20 03:30:14 | PH |
| Fu**** | Wat*** 8b97de | Jax***** e629c2 | 2022-04-08 03:20:22 | BY |
| Mo**** | Zis***** b11d94 | Rhy**** b4073a | 2022-10-06 15:58:38 | IR |
| So**** | Bad** 232cc2 | Rhy*** 1734bd | 2022-02-01 07:35:39 | ID |
| Ni***** | Ter***** d9ffb5 | Wor****** 6e476c | 2021-11-08 17:07:34 | IL |
| Ka****** | Scr***** 9201db | Jax**** 481fd8 | 2022-08-18 19:17:54 | BR |
| Li*** | Coz** 0447f6 | Nlo**** 11da59 | 2022-07-29 06:47:56 | HR |
| Ch***** | Hal******** f5d9c8 | Zoo**** c6e07d | 2022-09-28 04:54:30 | UA |
| Er****** | Ste******* d005f2 | Eid** ffc305 | 2022-04-28 18:50:11 | PT |
| Fo** | O'S***** b35c44 | Buz**** 2c8598 | 2022-09-11 02:05:55 | RU |
+------------+--------------------+------------------+---------------------+---------+
`

Tagged with: Data Data Best Practices Data Security MySQL

MySQL and Heatwave Summit Presentation

Ronald Bradford
April 30, 2025

Last week I had the opportunity to speak at the MySQL and Heatwave Summit in San Francisco. I discussed the impact of the new MySQL 8.0 default caching_sha2_password authentication, replacing the mysql_native_password authentication that was the default for approximately 20 of the 30 years that MySQL has existed.

Readyset QueryPilot Announcement

Ronald Bradford
April 22, 2025

At the MySQL and Heatwave Summit 2025 today, Readyset announced a new data systems architecture pattern named Readyset QueryPilot . This architecture which can front a MySQL or PostgreSQL database infrastructure, combines the enterprise-grade ProxySQL and Readyset caching with intelligent query monitoring and routing to help support applications scale and produce more predictable results with varied workloads.

More CPUs or Newer CPUs

Ronald Bradford
April 2, 2025

In a CPU-bound database workload, regardless of price, would you scale-up or scale-new? What if price was the driving factor, would you scale-up or scale-new? I am using as a baseline the first available AWS Graviton2 processor for RDS (r6g).

Data Masking 101

Related Posts

MySQL and Heatwave Summit Presentation

Readyset QueryPilot Announcement

More CPUs or Newer CPUs