Producing Mode statistics with SQL

Producing Mode statistics with SQL

The mode is the value or values that appear most frequently in a dataset. Unlike the mean or median, it applies naturally to categorical and ordinal data — star ratings, product codes, survey responses — and reveals what is most common, not what is average. A distribution can be unimodal (one dominant value), bimodal (two values tied for highest frequency), or have no mode at all when every value appears exactly once.

MySQL has no built-in mode function. Leveraging the VillageSQL Extension Bundle (VEB) , STATS_MODE(col) counts frequencies in a single aggregate pass and returns all tied modes as JSON. VillageSQL is a drop-in MySQL 8.4 replacement — no Python or R required.

Returns JSON: { "values": [...], "min": ..., "max": ... }
Field Description
values Sorted array of every value tied for highest frequency
min Smallest mode (first element of values)
max Largest mode (last element of values)

STATS_MODE returns NULL when no value appears more than once. Use CAST(STATS_MODE(...) AS CHAR) to read results in the mysql CLI, and CAST(col AS DOUBLE) on integer columns to avoid rounded JSON values.

Product Review Ratings — Unimodal, Bimodal, and No Mode

Mode SQL Example — Product Reviews

Three products on an e-commerce platform illustrate the three most common mode outcomes. Product A is a well-liked item where most customers rate it 4 stars. Product B is divisive — customers either love it (5 stars) or hate it (1 star), with each rating appearing three times. Product C has one review at each star level, so no value repeats and the mode is undefined.

WITH mode_data AS (
    SELECT
        product,
        COUNT(*)                                   AS n,
        ROUND(AVG(rating), 2)                      AS mean_rating,
        CAST(STATS_MODE(rating) AS CHAR(200))      AS json
    FROM product_reviews
    GROUP BY product
)
SELECT
    product,
    n,
    mean_rating,
    JSON_EXTRACT(json, '$.values')                 AS mode_values,
    JSON_EXTRACT(json, '$.min')                    AS mode_min,
    JSON_EXTRACT(json, '$.max')                    AS mode_max,
    JSON_LENGTH(JSON_EXTRACT(json, '$.values'))    AS mode_count,
    CASE
        WHEN json IS NULL                                        THEN 'No mode — all ratings equally rare'
        WHEN JSON_LENGTH(JSON_EXTRACT(json, '$.values')) = 1    THEN 'Unimodal — one dominant rating'
        WHEN JSON_LENGTH(JSON_EXTRACT(json, '$.values')) = 2    THEN 'Bimodal  — two equally frequent ratings'
        ELSE                                                          'Multimodal — multiple dominant ratings'
    END                                            AS distribution_shape
FROM mode_data
ORDER BY product;

Results

product  n   mean_rating  mode_values  mode_min  mode_max  mode_count  distribution_shape
A        10  3.8          [4]          4         4         1           Unimodal — one dominant rating
B        10  3            [1, 5]       1         5         2           Bimodal  — two equally frequent ratings
C        5   3            NULL         NULL      NULL      NULL        No mode — all ratings equally rare

Product A’s mean (3.8) and mode (4) tell a consistent story — most customers are satisfied. Product B is more interesting: the mean of 3.0 sits squarely in the middle, masking the polarised 1-star and 5-star camps. The bimodal result {1, 5} reveals what the average cannot — this product divides opinion. Product C shows why NULL is the correct answer when every rating is equally rare; reporting a single “most common” value would be misleading.

Speed Camera Observations by Road Type

Mode SQL Example — Speed Observations

The second example uses 150 vehicle speed readings (50 per road type) to show how mode complements mean and median for traffic analysis. Mode reveals the speed(s) drivers most commonly travel — useful for understanding compliance with posted limits.

WITH stats AS (
    SELECT
        road_type,
        COUNT(*)                               AS n,
        ROUND(MIN(speed_mph), 0)               AS min_speed,
        ROUND(AVG(speed_mph), 2)               AS mean_speed,
        ROUND(MAX(speed_mph), 0)               AS max_speed,
        CAST(STATS_MODE(speed_mph) AS CHAR(200))  AS mode_json,
        CAST(STATS_IQR(speed_mph)  AS CHAR(300))  AS iqr_json
    FROM speed_observations
    GROUP BY road_type
)
SELECT
    road_type,
    n,
    min_speed,
    mean_speed,
    max_speed,
    ROUND(JSON_EXTRACT(iqr_json, '$.median'), 1)       AS median_speed,
    ROUND(JSON_EXTRACT(iqr_json, '$.iqr'),    1)       AS iqr,
    JSON_EXTRACT(mode_json, '$.values')                AS mode_values,
    JSON_EXTRACT(mode_json, '$.min')                   AS mode_min,
    JSON_EXTRACT(mode_json, '$.max')                   AS mode_max,
    JSON_LENGTH(JSON_EXTRACT(mode_json, '$.values'))   AS mode_count,
    CASE
        WHEN mode_json IS NULL                                             THEN 'No mode'
        WHEN JSON_LENGTH(JSON_EXTRACT(mode_json, '$.values')) = 1         THEN 'Unimodal'
        ELSE CONCAT('Multimodal (', JSON_LENGTH(JSON_EXTRACT(mode_json, '$.values')), ' modes)')
    END                                                AS shape
FROM stats
ORDER BY road_type;

Results

road_type  n   min_speed  mean_speed  max_speed  median_speed  iqr  mode_values  mode_min  mode_max  mode_count  shape
motorway   50  65         69.9        75         70            2    [70]         70        70        1           Unimodal
rural      50  50         58.14       62         59            3    [60]         60        60        1           Unimodal
urban      50  18         25.72       35         28            10   [20, 30]     20        30        2           Multimodal (2 modes)

On the motorway (70 mph limit), the mode of 70 mph confirms that most drivers observe the speed limit — mean (69.9) and median (70) agree. On rural roads (60 mph limit), the mode of 60 mph shows drivers cluster at the limit even though the mean (58.1) is pulled down by slower traffic.

The urban 30 zone is the instructive case. The mean of 25.7 mph and median of 28 mph suggest typical speeds well below the limit, but the bimodal result {20, 30} reveals two distinct driving behaviours — cautious drivers at ~20 mph and limit-observing drivers at 30 mph — with equal frequency. The wide IQR of 10 mph confirms the spread. For traffic planning, the mode tells you where drivers actually cluster; the mean alone would miss the bimodal split entirely.

When to Use the Mode

Question Use mode when
What star rating do most customers give? One dominant value is the headline metric
Is opinion polarised? Bimodal result reveals two camps the mean hides
What speed do most drivers travel? Mode shows the peak of the frequency distribution
Are all values equally rare? NULL is the correct answer — do not force a mode

STATS_MODE works with GROUP BY, so product-level rating summaries, road-type speed profiles, and survey response breakdowns all compute in a single query pass. The min and max fields provide convenient scalar access when a single mode value is expected; values preserves the full set when ties exist.

Summary

Both examples use STATS_MODE(), which returns a JSON object listing every value tied for highest frequency. The function handles unimodal, bimodal, and no-mode cases without special syntax.

Example n Mean Mode Shape
Product A (popular) 10 3.8 {4} Unimodal
Product B (polarising) 10 3.0 {1, 5} Bimodal
Product C (mixed) 5 3.0 NULL No mode
Motorway speeds 50 69.9 mph {70} Unimodal
Rural speeds 50 58.1 mph {60} Unimodal
Urban 30 zone speeds 50 25.7 mph {20, 30} Bimodal

The extension is available at github.com/ronaldbradford/vsql-statistics and installs with a single statement:

INSTALL EXTENSION vsql_statistics;
Tagged with: MySQL VillageSQL Extensions

Producing Alternative Means statistics with SQL

MySQL’s built-in AVG() computes the arithmetic mean — the sum divided by the count. That is the right default for many questions, but it is not always the right measure of central tendency.

Producing One-Sample Z-Test statistics with SQL

The one-sample Z-test determines whether a sample mean differs significantly from a known population mean when the population standard deviation is also known. It is the appropriate test when the population parameters are established — quality control benchmarks, national averages, long-run process measurements — and you want to evaluate whether a new sample is consistent with them.

Producing Skewness statistics with SQL

Skewness measures the asymmetry of a distribution. A perfectly symmetric distribution has a skewness of zero. A positive skew (right-skewed) means the tail extends to the right — a small number of high values pull the mean above the median.