Testing and benchmarking are widely used terms in software technology, each serving a distinct purpose. With the increasing adoption of AI in software development, a third term, evaluating, has gained prominence, and with it a re-emergence of quality assurance.
Testing
Testing is the process of evaluating software to ensure it meets expectations and functions as intended. It involves analyzing, examining, and observing the software to find errors and defects.
Why test?
- To improve software quality, accuracy, and usability
- To reduce the risk of software failure
- To build customer trust and satisfaction
- To detect security vulnerabilities
Testing comes in numerous forms, including but not limited to unit testing, functional testing, integration testing, smoke testing, performance testing, and regression testing.
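To make this concrete, here is a minimal sketch of a unit test using Python's built-in unittest module. The discount_price function is a hypothetical example, not taken from any particular product; the point is simply that each test checks one expectation in isolation.

```python
import unittest

def discount_price(price: float, percent: float) -> float:
    """Apply a percentage discount to a price (hypothetical function under test)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

class DiscountPriceTest(unittest.TestCase):
    def test_applies_discount(self):
        self.assertEqual(discount_price(100.0, 25), 75.0)

    def test_zero_discount_returns_original_price(self):
        self.assertEqual(discount_price(80.0, 0), 80.0)

    def test_rejects_out_of_range_percent(self):
        with self.assertRaises(ValueError):
            discount_price(100.0, 150)

if __name__ == "__main__":
    unittest.main()
```

Functional, integration, and regression tests follow the same pattern at a larger scope: exercise the software, then compare observed behavior against expected behavior.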
Benchmarking
Benchmarking is a type of software testing that compares a product's performance against established standards rather than against the software's own requirements. It helps identify strengths and weaknesses in the product at the center of the benchmark. Benchmark testing typically measures:
- Response time
- Throughput
- Resource usage
- Read/write speeds
- Other capacities
A benchmark is generally skewed towards the product being benchmarked, and it is often difficult to obtain a realistic benchmark that simulates a production workload.
Benchmarking does not evaluate whether software meets expectations and functions as intended, nor does it aim to find errors in functionality.
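As a rough illustration of the first two measures above, the sketch below times a function with Python's time.perf_counter and reports latency and throughput. handle_request is a hypothetical stand-in for the operation under test; as noted, a realistic benchmark would replay a production-like workload rather than a synthetic loop.

```python
import statistics
import time

def handle_request() -> None:
    """Hypothetical operation under test; replace with a real workload."""
    sum(i * i for i in range(10_000))

def benchmark(fn, iterations: int = 1_000) -> None:
    latencies = []
    start = time.perf_counter()
    for _ in range(iterations):
        t0 = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    # Report per-call response time and overall throughput.
    print(f"mean latency: {statistics.mean(latencies) * 1e3:.3f} ms")
    print(f"p95 latency:  {statistics.quantiles(latencies, n=100)[94] * 1e3:.3f} ms")
    print(f"throughput:   {iterations / elapsed:.1f} ops/s")

benchmark(handle_request)
```

Comparing these numbers against an established standard, or against a competing product run on the same workload, is what turns a measurement into a benchmark.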
Evaluating
With the rise of AI models in all aspects of technology, the term evaluation, or AI evaluation, has emerged as an important way to compare different AI models, effectively combining testing and benchmarking. AI evaluation is the process of assessing the performance and accuracy of a model.
Why evaluate AI?
- To ensure that AI models perform as expected
- To ensure that AI models produce reliable and accurate results
- To ensure that AI models meet desired objectives
- To ensure that AI models don’t create harmful or unethical content
How do you evaluate AI?
- Use metrics to measure aspects like relevance, truthfulness, coherence, and completeness (see the sketch after this list)
- Assess data quality and model performance
- Evaluate how the model handles different sources of error, such as bias and drift
- Evaluate how the model explains its output and reasoning
- Evaluate how the model responds to different situations and inputs
- Evaluate how the model represents the real-world problem and data
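To make the metrics point concrete, here is a minimal sketch of an evaluation harness that scores a model against a small labeled set using exact-match accuracy. The model_answer function and the eval data are hypothetical placeholders; real evaluations combine many metrics (relevance, coherence, safety, and so on) over far larger datasets.

```python
# Hypothetical eval set: (prompt, expected answer) pairs.
EVAL_SET = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
    ("Who wrote Hamlet?", "William Shakespeare"),
]

def model_answer(prompt: str) -> str:
    """Hypothetical stand-in for a call to the AI model under evaluation."""
    canned = {
        "What is the capital of France?": "Paris",
        "What is 2 + 2?": "4",
        "Who wrote Hamlet?": "Christopher Marlowe",  # deliberate miss
    }
    return canned.get(prompt, "")

def exact_match_accuracy(eval_set) -> float:
    """Fraction of prompts where the model output matches the expected answer."""
    hits = sum(
        model_answer(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in eval_set
    )
    return hits / len(eval_set)

print(f"exact-match accuracy: {exact_match_accuracy(EVAL_SET):.0%}")
```

The same harness shape extends naturally to bias, drift, and robustness checks: hold the inputs fixed, vary the conditions, and track the scores over time.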
Effectively, this all comes back to the well-known industry practice of quality assurance.
Quality Assurance
The ISO 9000 family of standards for quality assurance, particularly the ISO 9001 standard, established key certification requirements for companies and products. Many software engineers with only a decade of experience have never encountered an ISO 9001 implementation, nor its evolution through the 1994 and 2000 revisions, which significantly impacted the software industry. Some viewed the introduction of quality assurance as a hindrance to the system development lifecycle, perceiving it as a constraint on productivity.
Ironically, quality assurance is now re-emerging as a critical requirement for the release and adoption of AI-assisted products.
The Seven Quality Management Principles
It is important to remember that ISO 9000 is built on a set of quality management principles. These principles are worth considering as we explore AI evaluation.
- QMP 1 Customer focus – Organizations depend on their customers and therefore should understand current and future customer needs, should meet customer requirements and strive to exceed customer expectations.
- QMP 2 Leadership – Leaders establish unity of purpose and direction of the organization. They should create and maintain the internal environment in which people can become fully involved in achieving the organization’s objectives.
- QMP 3 Engagement of people – People at all levels are the essence of an organization and their full involvement enables their abilities to be used for the organization’s benefit.
- QMP 4 Process approach – A desired result is achieved more efficiently when activities and related resources are managed as a process.
- QMP 5 Improvement – Improvement of the organization’s overall performance should be a permanent objective of the organization.
- QMP 6 Evidence-based decision making – Effective decisions are based on the analysis of data and information.
- QMP 7 Relationship management – An organization and its external providers (suppliers, contractors, service providers) are interdependent and a mutually beneficial relationship enhances the ability of both to create value.
Conclusion
As software development continues to evolve, the need for rigorous evaluation has become more critical. While testing ensures that software functions as intended and benchmarking provides performance comparisons, AI evaluation extends these concepts to measure model reliability, accuracy, and ethical considerations.
At its core, AI evaluation is a return to fundamental quality assurance principles. The ISO 9000 standards, once seen as a burden to productivity, now find renewed relevance in ensuring AI-driven technologies meet high standards of trust, performance, and accountability.
The recent Thoughtworks podcast, AI testing, benchmarks and evals, also discusses the renewed interest in quality assurance through various AI evaluation techniques, all aimed at accuracy and reliability.