MySQL – Wikipedia

Ronald Bradford
June 14, 2007

I was reading only last week the notes from [Wikipedia: Site Internals, Configuration and Code Examples, and Management Issues][1] Tutorial by Domas Mituzas at the recent [2007 MySQL Conference][2]. I didn’t attend this session, like a lot of sessions too much good stuff at the same time.

It’s obviously taken a while to catch up on my reading, but with the present MySQL 12 days of Scale-Out I thought I’ll complete my notes for all to see.

If you have never used Wikipedia well, why are you reading this, you should spend an hour there now. Alexa places Wikipedia in one of the top 10 visited sites on the Internet.

Wikipedia runs on the LAMP stack, powered by the MySQL database. Nothing new here, but how Wikipedia scales is. Some of the interesting points involved how a “Content Delivery Network” was build with components including Squid, Lighttpd, Memcached, LVS to improve caching. Appropriate caching is an important component to a successful scale-out infrastructure. An interesting quote however:

_The common mistake is to believe that database is too slow and everything in it has to becached somewhere else. In scaled out environments reads are very efficient, and difference of time between efficient MySQL query and memcached request is negligible – both may execute in less than 1ms usually).
Sometimes memcached request may be more expensive than database query simply because it has to establish connection to a server, whereas database connection is usually
open.
_

Wikipedia has a developed an application Load Balancer. This offers a flexibility in efficient database use and is critical to any scale-out infrastructure. Combined with a good Database API and items such as the Pager class, allows you to write efficient index-based offsets pager (instead of ugly LIMIT 50 OFFSET 10000) syntax for example.

The main ideology in operating database servers is RAIS: – Redundant Array of Inexpensive/Independent/Instable[sic] Servers

RAID0. Seems to provide additional performance/space. Having half of that capacity with an idea that a failure will degrade performance even more doesn’t sound like an efﬁcient idea. Also, we do notice disk problems earlier. This disk conﬁguration should be probably called AID, instead of RAID0.
innodb_ﬂush_log_at_trx_commit=0, tempted to do innodb_ﬂush_method=nosync. If a server crashes for any reason, there’s not much need to look at its data consistency. Usually that server will need hardware repairs anyway. InnoDB transaction recovery would take half an hour or more. Failing over to another server will usually be much faster. In case of master failure, last properly replicated event is last event for whole environment. No ‘last transaction contains millions’ worries makes the management of such environment much easier – an idea often forgotten by web applications people.

The thing I found interesting in the RAIS mysql-node conﬁguration was slave-skip-errors=0,1213,1158,1053,1007,1062

However, the greatest tip is “All database interaction is optimized around MySQL’s methods of reading the data.” This includes:

Every query must have appropriate index for reads…
Every query result should be sorted by index, not by ﬁlesorts. This means strict and predictable path of data access…
Some fat-big-tables have extended covering indexing just on particular nodes…
Queries prone to hit multiversioning troubles have to be rewritten accordingly…

Wikipedia is also clever in it’s sharding. A means to implement vertical and horizontal partition of data via the application for optimal scale-out. This comes down to designing your application correct from the start. Wikipedia considers it’s partition via:

data segments
tasks
time

HiveDB (not used by wikipedia) is open source framework for horizontally partitioning MySQL systems. Well worth reviewing.

Wikipedia also makes use of compression. This works when your data can be compressed well like text. This improves performance, however analysis on other projects have shown this does place a CPU impact on the server so it is important to monitor and use appropriately.

Another clever approach is to move searching to tools more appropriate for this task, in this case Lucene. As with any scale-out it is important to leverage the power of appropriate tools for maximum benefit.

I have only summarized Domas’ notes. It’s well worth a detailed read.

Tagged with: Databases MySQL

MySQL and Heatwave Summit Presentation

Ronald Bradford
April 30, 2025

Last week I had the opportunity to speak at the MySQL and Heatwave Summit in San Francisco. I discussed the impact of the new MySQL 8.0 default caching_sha2_password authentication, replacing the mysql_native_password authentication that was the default for approximately 20 of the 30 years that MySQL has existed.

Readyset QueryPilot Announcement

Ronald Bradford
April 22, 2025

At the MySQL and Heatwave Summit 2025 today, Readyset announced a new data systems architecture pattern named Readyset QueryPilot . This architecture which can front a MySQL or PostgreSQL database infrastructure, combines the enterprise-grade ProxySQL and Readyset caching with intelligent query monitoring and routing to help support applications scale and produce more predictable results with varied workloads.

More CPUs or Newer CPUs

Ronald Bradford
April 2, 2025

In a CPU-bound database workload, regardless of price, would you scale-up or scale-new? What if price was the driving factor, would you scale-up or scale-new? I am using as a baseline the first available AWS Graviton2 processor for RDS (r6g).

MySQL – Wikipedia

Related Posts

MySQL and Heatwave Summit Presentation

Readyset QueryPilot Announcement

More CPUs or Newer CPUs