Good Test Data

Over the years you collect datasets you have created for various types of testing, seeding databases etc. I have always thought one needs to better manage this for future re-use. Recently I wanted to do some “Big Data” playing and again that question of what datasets can I use let me to review the past collated list at Seeking public data for benchmarks .

The types of things I was wanting to do lead me to realize a lot of content is “public domain” and Project Gutenberg is just one great source of text in multiple languages. This was just one aspect of my wish list but text based data is used from blogs, comments, articles, microblogs etc, and multiple languages was important from some text analysis.

With a bit of thinking about the building blocks, I created Good Test Data . A way for me to have core data, IP’s, people’s names, User Agents strings, text for articles, comments and a lot more. And importantly the ability to generate large randomized amounts of this data quickly and easily.

Now I can build a list of 1 million random names with unique usernames and emails with ease. I can generate millions of varying articles, from a short microblog, a comment, a blog to a multi page article. Then be able to produce HTML/PDF/PNG versions giving me file attachments. I’ve been playing more with image generation, creating banner images with varying text, and now I’m generating MP4 video to simulate the various standard sizes for advertising and just to see what people need.

I’m not sure of the potential use and benefit for others and that wasn’t the primary goal, however I would like to know how these building blocks could be used. The data is relatively agnostic, being able to easily load into MySQL tables. Depending on demand, being able to create pre-configured open source product data for e-commence products, CRM or blogging are all possible options.

Tagged with: Databases MySQL

Related Posts

Creating a More Realistic Benchmark

Common benchmark approaches fall into two general categories, synthetic testing and realistic testing. You have the most generic operations from a synthetic test, starting with the most simple example using a single table, a single column, and for a single DML operation.

Read more

Testing, Benchmarking, Evaluating

Testing and benchmarking are widely used terms in software technology, each serving a distinct purpose and goal. With the increasing adoption of AI in software development, the term evaluating has become significant and with this the re-emergence of what is quality assurance.

Read more

Your Attack Vector Extends Beyond Production Systems

A common data security issue is the unprotected copying of production data to non-production environments without any redaction, masking, or filtering. This practice poses a serious risk. A malicious actor will target the weakest link in your infrastructure, including non-production accounts and the developer systems accessing them.

Read more