As part of using public datasets with my own Benchmarking Suite I wanted upsize a dataset for larger volume testing.
I have always used the INFORMATION_SCHEMA.TABLES data_length
and index_length
columns as a sufficiently accurate measurement for actual disk space used. FWIW table_rows
and avg_row
can be wildly inaccurate.
As you can see here, you can compare with a self-hosted database the underlying InnoDB Tablespaces align with the provided values (27GB
and 3.5G
).
Comparing INFORMATION_SCHEMA with individual InnoDB Tablespaces
+----------------------+--------+---------+------------+---------+----------+----------+----------+
| table_name | ENGINE | format | TABLE_ROWS | avg_row | total_mb | data_mb | index_mb |
+----------------------+--------+---------+------------+---------+----------+----------+----------+
| title_principal | InnoDB | Dynamic | 363705547 | 49 | 27181.00 | 17009.00 | 10172.00 |
| title_episode | InnoDB | Dynamic | 32860408 | 51 | 3327.00 | 1626.00 | 1701.00 |
+----------------------+--------+---------+------------+---------+----------+----------+----------+
$ ls -lh title_principal.ibd title_episode.ibd
-rw-r----- 1 mysql mysql 3.5G Mar 27 22:00 title_episode.ibd
-rw-r----- 1 mysql mysql 27G Mar 27 23:38 title_principal.ibd
With this verification, I upsized the original IMDb Dataset which was ~20GB, to 4x or ~80GB. In MySQL the results matched my expectations.
IMDb 4x Dataset In MySQL 8.4
This is from a clean load of data and is comparable with a 84GB dataset built up in MySQL 8.0.
+--------------+----------------+----------------+----------------+--------+------------+
| TABLE_SCHEMA | total_mb | data_mb | index_mb | tables | today |
+--------------+----------------+----------------+----------------+--------+------------+
| imdb | 83996.53125000 | 48402.57812500 | 35593.95312500 | 13 | 2025-03-28 |
+--------------+----------------+----------------+----------------+--------+------------+
+----------------------+--------+---------+------------+---------+----------+----------+----------+
| table_name | ENGINE | format | TABLE_ROWS | avg_row | total_mb | data_mb | index_mb |
+----------------------+--------+---------+------------+---------+----------+----------+----------+
| title_principal | InnoDB | Dynamic | 363705547 | 49 | 27181.00 | 17009.00 | 10172.00 |
| title_name_character | InnoDB | Dynamic | 169484096 | 47 | 20769.97 | 7719.00 | 13050.97 |
| title | InnoDB | Dynamic | 42646882 | 105 | 7706.00 | 4299.00 | 3407.00 |
| name | InnoDB | Dynamic | 53712993 | 58 | 6734.98 | 3013.00 | 3721.98 |
| name_profession | InnoDB | Dynamic | 36270238 | 151 | 6515.00 | 5248.00 | 1267.00 |
| name_known_for | InnoDB | Dynamic | 104256630 | 48 | 5985.00 | 4853.00 | 1132.00 |
| title_genre | InnoDB | Dynamic | 63167541 | 74 | 5599.88 | 4457.88 | 1142.00 |
| title_episode | InnoDB | Dynamic | 32860408 | 51 | 3327.00 | 1626.00 | 1701.00 |
| title_rating | InnoDB | Dynamic | 6050616 | 30 | 177.67 | 177.67 | 0.00 |
| credit | InnoDB | Dynamic | 13 | 1260 | 0.02 | 0.02 | 0.00 |
| genre | InnoDB | Dynamic | 28 | 585 | 0.02 | 0.02 | 0.00 |
| characters | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| popular_titles | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
+----------------------+--------+---------+------------+---------+----------+----------+----------+
+----------------------+-------------+
| tbl | row_count |
+----------------------+-------------+
| credit | 13 |
| genre | 28 |
| name | 56,870,364 |
| name_known_for | 91,517,992 |
| name_profession | 63,344,864 |
| title | 44,458,976 |
| title_episode | 34,345,816 |
| title_genre | 69,519,892 |
| title_name_character | 174,171,912 |
| title_principal | 358,389,012 |
| title_rating | 6,061,472 |
+----------------------+-------------+
Aurora Storage
However, with AWS RDS Aurora MySQL 8.0 I was surprised to discover the reported disk space is only 60% in comparison. As with a managed service, I have no way to independently verify these numbers as accurate. Aurora is an independent highly-available shared storage layer for RDS Aurora.
This is an interesting fact. If you have a 1TB database in RDS and when you are evaluating migrating to RDS Aurora you would use the same size for calculating storage in RDS Aurora, however this could be an overestimate.
IMDb 4x Dataset in RDS Aurora MySQL 8.0
+--------------+----------------+----------------+----------------+--------+------------+
| TABLE_SCHEMA | total_mb | data_mb | index_mb | tables | today |
+--------------+----------------+----------------+----------------+--------+------------+
| imdb | 49865.37500000 | 29458.53125000 | 20406.84375000 | 13 | 2025-03-28 |
+--------------+----------------+----------------+----------------+--------+------------+
1 row in set (0.01 sec)
+----------------------+--------+---------+------------+---------+----------+---------+----------+
| table_name | ENGINE | format | TABLE_ROWS | avg_row | total_mb | data_mb | index_mb |
+----------------------+--------+---------+------------+---------+----------+---------+----------+
| title_principal | InnoDB | Dynamic | 180667894 | 46 | 12864.98 | 8045.00 | 4819.98 |
| title_name_character | InnoDB | Dynamic | 85801233 | 44 | 9892.97 | 3635.00 | 6257.97 |
| name | InnoDB | Dynamic | 56778447 | 56 | 7048.98 | 3075.00 | 3973.98 |
| name_profession | InnoDB | Dynamic | 58811983 | 87 | 6138.98 | 4901.98 | 1237.00 |
| name_known_for | InnoDB | Dynamic | 90662783 | 53 | 5675.00 | 4599.00 | 1076.00 |
| title | InnoDB | Dynamic | 22336453 | 93 | 3645.00 | 1996.00 | 1649.00 |
| title_genre | InnoDB | Dynamic | 29932810 | 72 | 2625.92 | 2055.92 | 570.00 |
| title_episode | InnoDB | Dynamic | 18534956 | 60 | 1892.91 | 1070.00 | 822.91 |
| title_rating | InnoDB | Dynamic | 3033301 | 27 | 80.59 | 80.59 | 0.00 |
| credit | InnoDB | Dynamic | 13 | 1260 | 0.02 | 0.02 | 0.00 |
| genre | InnoDB | Dynamic | 28 | 585 | 0.02 | 0.02 | 0.00 |
| characters | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| popular_titles | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
+----------------------+--------+---------+------------+---------+----------+---------+----------+
13 rows in set (0.00 sec)
+----------------------+-------------+
| tbl | row_count |
+----------------------+-------------+
| credit | 13 |
| genre | 28 |
| name | 56,942,588 |
| name_known_for | 91,636,620 |
| name_profession | 63,419,680 |
| title | 44,524,244 |
| title_episode | 34,398,208 |
| title_genre | 69,632,760 |
| title_name_character | 174,422,448 |
| title_principal | 358,944,016 |
| title_rating | 6,075,560 |
+----------------------+-------------+