Testing and benchmarking are widely used terms in software technology, each serving a distinct purpose. With the increasing adoption of AI in software development, a third term, evaluating, has gained prominence, and with it a re-emergence of quality assurance.
Testing
Testing is the process of evaluating software to ensure it meets expectations and functions as intended. It involves analyzing, examining, and observing the software to find errors and defects.
Why test?
- To improve software quality, accuracy, and usability
- To reduce the risk of software failure
- To build customer trust and satisfaction
- To detect security vulnerabilities
Testing comes in numerous forms, including but not limited to unit testing, functional testing, integration testing, smoke testing, performance testing, and regression testing.
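To make this concrete, here is a minimal sketch of a unit test using Python's built-in unittest module. The discount_price function is a hypothetical example, not taken from any particular product; the point is simply that each test checks one expectation in isolation.

```python
import unittest

def discount_price(price: float, percent: float) -> float:
    """Apply a percentage discount to a price (hypothetical function under test)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

class DiscountPriceTest(unittest.TestCase):
    def test_applies_discount(self):
        self.assertEqual(discount_price(100.0, 25), 75.0)

    def test_zero_discount_returns_original_price(self):
        self.assertEqual(discount_price(80.0, 0), 80.0)

    def test_rejects_out_of_range_percent(self):
        with self.assertRaises(ValueError):
            discount_price(100.0, 150)

if __name__ == "__main__":
    unittest.main()
```

Functional, integration, and regression tests follow the same pattern at a larger scope: exercise the software, then compare observed behavior against expected behavior.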
Benchmarking
Benchmarking is a type of software testing that compares a product's performance against established standards rather than against the software's own requirements. It helps identify strengths and weaknesses in the product at the center of the benchmark. Benchmark testing typically measures:
- Response time
- Throughput
- Resource usage
- Read/write speeds
- Other capacities
A benchmark is generally skewed towards the product being benchmarked, and it is often difficult to obtain a realistic benchmark that simulates a production workload.
Benchmarking does not evaluate whether software meets expectations and functions as intended, nor does it aim to find errors in functionality.
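As a rough illustration of the first two measures above, the sketch below times a function with Python's time.perf_counter and reports latency and throughput. handle_request is a hypothetical stand-in for the operation under test; as noted, a realistic benchmark would replay a production-like workload rather than a synthetic loop.

```python
import statistics
import time

def handle_request() -> None:
    """Hypothetical operation under test; replace with a real workload."""
    sum(i * i for i in range(10_000))

def benchmark(fn, iterations: int = 1_000) -> None:
    latencies = []
    start = time.perf_counter()
    for _ in range(iterations):
        t0 = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    # Report per-call response time and overall throughput.
    print(f"mean latency: {statistics.mean(latencies) * 1e3:.3f} ms")
    print(f"p95 latency:  {statistics.quantiles(latencies, n=100)[94] * 1e3:.3f} ms")
    print(f"throughput:   {iterations / elapsed:.1f} ops/s")

benchmark(handle_request)
```

Comparing these numbers against an established standard, or against a competing product run on the same workload, is what turns a measurement into a benchmark.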
Evaluating
With the rise of AI models in all aspects of technology, the term evaluation, or AI evaluation, has emerged as an important way to compare different AI models, effectively combining testing and benchmarking. AI evaluation is the process of assessing the performance and accuracy of a model.
Why evaluate AI?
- To ensure that AI models perform as expected
- To ensure that AI models produce reliable and accurate results
- To ensure that AI models meet desired objectives
- To ensure that AI models don’t create harmful or unethical content
How do you evaluate AI?
- Use metrics to measure aspects like relevance, truthfulness, coherence, and completeness (see the sketch after this list)
- Assess data quality and model performance
- Evaluate how the model handles different sources of error, such as bias and drift
- Evaluate how the model explains its output and reasoning
- Evaluate how the model responds to different situations and inputs
- Evaluate how the model represents the real-world problem and data
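To make the metrics point concrete, here is a minimal sketch of an evaluation harness that scores a model against a small labeled set using exact-match accuracy. The model_answer function and the eval data are hypothetical placeholders; real evaluations combine many metrics (relevance, coherence, safety, and so on) over far larger datasets.

```python
# Hypothetical eval set: (prompt, expected answer) pairs.
EVAL_SET = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
    ("Who wrote Hamlet?", "William Shakespeare"),
]

def model_answer(prompt: str) -> str:
    """Hypothetical stand-in for a call to the AI model under evaluation."""
    canned = {
        "What is the capital of France?": "Paris",
        "What is 2 + 2?": "4",
        "Who wrote Hamlet?": "Christopher Marlowe",  # deliberate miss
    }
    return canned.get(prompt, "")

def exact_match_accuracy(eval_set) -> float:
    """Fraction of prompts where the model output matches the expected answer."""
    hits = sum(
        model_answer(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in eval_set
    )
    return hits / len(eval_set)

print(f"exact-match accuracy: {exact_match_accuracy(EVAL_SET):.0%}")
```

The same harness shape extends naturally to bias, drift, and robustness checks: hold the inputs fixed, vary the conditions, and track the scores over time.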
Effectively, this all comes back to the well-known industry practice of quality assurance.
Quality Assurance
The ISO 9000 family of standards for quality assurance, particularly the ISO 9001 standard, established key certification requirements for companies and products. Many software engineers with only a decade of experience have never encountered an ISO 9001 implementation, nor its evolution through the 1994 and 2000 revisions, which significantly impacted the software industry. Some viewed the introduction of quality assurance as a hindrance to the system development lifecycle, perceiving it as a constraint on productivity.
Ironically, quality assurance is now re-emerging as a critical requirement for the release and adoption of AI-assisted products.
The Seven Quality Management Principles
It is important to remember that ISO 9000 is built on a set of quality management principles. These principles are worth considering as we explore AI evaluation.
- QMP 1 Customer focus – Organizations depend on their customers and therefore should understand current and future customer needs, should meet customer requirements and strive to exceed customer expectations.
- QMP 2 Leadership – Leaders establish unity of purpose and direction of the organization. They should create and maintain the internal environment in which people can become fully involved in achieving the organization’s objectives.
- QMP 3 Engagement of people – People at all levels are the essence of an organization and their full involvement enables their abilities to be used for the organization’s benefit.
- QMP 4 Process approach – A desired result is achieved more efficiently when activities and related resources are managed as a process.
- QMP 5 Improvement – Improvement of the organization’s overall performance should be a permanent objective of the organization.
- QMP 6 Evidence-based decision making – Effective decisions are based on the analysis of data and information.
- QMP 7 Relationship management – An organization and its external providers (suppliers, contractors, service providers) are interdependent and a mutually beneficial relationship enhances the ability of both to create value.
Conclusion
As software development continues to evolve, the need for rigorous evaluation has become more critical. While testing ensures that software functions as intended and benchmarking provides performance comparisons, AI evaluation extends these concepts to measure model reliability, accuracy, and ethical considerations.
At its core, AI evaluation is a return to fundamental quality assurance principles. The ISO 9000 standards, once seen as a burden to productivity, now find renewed relevance in ensuring AI-driven technologies meet high standards of trust, performance, and accountability.
The recent Thoughtworks podcast, AI testing, benchmarks and evals, also discusses the renewed interest in quality assurance through various AI evaluation techniques, all aimed at accuracy and reliability.