Smarter indexing for column LIKE ‘%string%’

Ronald Bradford
March 6, 2007

With my very heavy travel load and skilling load I’ve not had time to scratch myself. It hasn’t stopped the brain working overtime on various issues including the classic find a pattern in a string starting with a wildcard character. On a recent gig I saw the true classic.

`
SELECT columns
FROM users
WHERE username LIKE ‘%str%’
OR firstname LIKE ‘%str%’
OR lastname LIKE ‘%str%’

I went through the various options and comments on leading ‘%’, OR’s, combined columns, FULLTEXT (which doesn’t work in this case), merge indexing etc, however it perplexed me that nobody has really solved this problem, or at least shared their solutions.

I have an idea, a theory and while I’d love to prove/disprove it, I simply just don’t have the time. So here are my notes, hopefully somebody can comment positively/negatively, do some research, or encourage me to pursue it more.

The Problem

The problem is quite simply the leading wildcard, Der!. So how do you eliminate the wildcard from the search?

My Idea

Working with the earlier example, and having a concatenated field of the three components (username,firstname,lastname), my idea is to encode into 27 bits, a bit for each alphabetic character (A-Z) found, and any non-alphabetic character except whitespace in bit 27.

This column is then indexed and searched for bitwise matches. Here is a pictorial description.

<td>
  a
</td>

<td>
  b
</td>

<td>
  c
</td>

<td>
  d
</td>

<td>
  e
</td>

<td>
  f
</td>

<td>
  g
</td>

<td>
  h
</td>

<td>
  i
</td>

<td>
  j
</td>

<td>
  k
</td>

<td>
  l
</td>

<td>
  m
</td>

<td>
  n
</td>

<td>
  o
</td>

<td>
  p
</td>

<td>
  q
</td>

<td>
  r
</td>

<td>
  s
</td>

<td>
  t
</td>

<td>
  u
</td>

<td>
  v
</td>

<td>
  w
</td>

<td>
  x
</td>

<td>
  y
</td>

<td>
  z
</td>

<td>
  1
</td>

<td>
  1
</td>

<td>
  &nbsp;
</td>

<td>
  1
</td>

<td>
  &nbsp;
</td>

<td>
  1
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  1
</td>

<td>
  &nbsp;
</td>

<td>
  1
</td>

<td>
  1
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  1
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  1
</td>

<td>
  1
</td>

<td>
  &nbsp;
</td>

<td>
  1
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  1
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  *** MATCH ***
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  1
</td>

<td>
  1
</td>

<td>
  1
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  1
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  &nbsp;
</td>

<td>
  NO MATCH
</td>

String

Ronald Bradford

Search: ‘brad’

Search: ‘fred’

My idea is very simple, the question is, will it actually work.

The Questions

The goal is quite obviously to get an Index utilization, to a maximum say of 10%-25% of matching rows. Factors that would affect this include:

If search string is one character, like a vowel I could this not useful, so an implied criteria for optimal work is at least 2 or 3 characters.
I see the effectiveness lost on large column values, it could be that only up to say 30 characters is optimal, I see strings of 100+ characters would be ineffective.
It doesn’t support case-sensitive searching or non ASCII searching

The Tests

Create good sized table and distribution of first/last names (A IMDB Actors table would work)
Create a concatenated column of searchable fields (e.g. details = CONCAT_WS(‘ ‘,firstname,lastname)
Create function to return bitwised search string (e.g. bitize(str))
Create Indexed bitwise column and pre-populate accordingly (e.g. bitfulldetails)
Create BEFORE INSERT|BEFORE UPDATE Triggers to populate Indexed bitwised column
Test the sucker (of course you will need to include the actual LIKE command as well in the WHERE clause)

A sample query would then be:

 SELECT columns FROM users WHERE bitize('str') & bitfulldetails AND fulldetails like '%str%' 

So the challenge to all those budding MySQL Gurus, does it seem plausible?

Tagged with: Databases General MySQL

MySQL and Heatwave Summit Presentation

Ronald Bradford
April 30, 2025

Last week I had the opportunity to speak at the MySQL and Heatwave Summit in San Francisco. I discussed the impact of the new MySQL 8.0 default caching_sha2_password authentication, replacing the mysql_native_password authentication that was the default for approximately 20 of the 30 years that MySQL has existed.

Readyset QueryPilot Announcement

Ronald Bradford
April 22, 2025

At the MySQL and Heatwave Summit 2025 today, Readyset announced a new data systems architecture pattern named Readyset QueryPilot . This architecture which can front a MySQL or PostgreSQL database infrastructure, combines the enterprise-grade ProxySQL and Readyset caching with intelligent query monitoring and routing to help support applications scale and produce more predictable results with varied workloads.

More CPUs or Newer CPUs

Ronald Bradford
April 2, 2025

In a CPU-bound database workload, regardless of price, would you scale-up or scale-new? What if price was the driving factor, would you scale-up or scale-new? I am using as a baseline the first available AWS Graviton2 processor for RDS (r6g).