Smarter indexing for column LIKE ‘%string%’

With my very heavy travel and skilling load I’ve not had time to scratch myself. It hasn’t stopped the brain working overtime on various issues, including the classic: find a pattern in a string starting with a wildcard character. On a recent gig I saw the true classic:


SELECT columns
FROM users
WHERE username LIKE '%str%'
OR firstname LIKE '%str%'
OR lastname LIKE '%str%'

I went through the various options and comments on the leading ‘%’, ORs, combined columns, FULLTEXT (which doesn’t work in this case), merge indexes, etc. However, it perplexed me that nobody has really solved this problem, or at least shared their solution.

I have an idea, a theory and while I’d love to prove/disprove it, I simply just don’t have the time. So here are my notes, hopefully somebody can comment positively/negatively, do some research, or encourage me to pursue it more.

The Problem

The problem is quite simply the leading wildcard, Der!. So how do you eliminate the wildcard from the search?

My Idea

Working with the earlier example, and having a concatenated field of the three components (username, firstname, lastname), my idea is to encode the string into 27 bits: one bit for each alphabetic character (A-Z) found, and bit 27 for any non-alphabetic, non-whitespace character.

This column is then indexed and searched for bitwise matches. Here is a pictorial description.

String           a b c d e f g h i j k l m n o p q r s t u v w x y z
Ronald Bradford  1 1 . 1 . 1 . . . . . 1 . 1 1 . . 1 . . . . . . . .
Search: ‘brad’   1 1 . 1 . . . . . . . . . . . . . 1 . . . . . . . .  *** MATCH ***
Search: ‘fred’   . . . 1 1 1 . . . . . . . . . . . 1 . . . . . . . .  NO MATCH
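The encoding and the subset test can be sanity-checked with a minimal Python sketch (`bitize` mirrors the function name proposed below; `candidate` is an illustrative helper, not part of the MySQL implementation):

```python
def bitize(s):
    """27-bit encoding: bits 0-25 for letters a-z present,
    bit 26 for any non-alphabetic, non-whitespace character."""
    bits = 0
    for ch in s.lower():
        if 'a' <= ch <= 'z':
            bits |= 1 << (ord(ch) - ord('a'))
        elif not ch.isspace():
            bits |= 1 << 26
    return bits

def candidate(search, row_bits):
    # A row can only match if every letter of the search string is present.
    b = bitize(search)
    return (b & row_bits) == b

row = bitize('Ronald Bradford')
print(candidate('brad', row))  # True  -> row still needs the LIKE check
print(candidate('fred', row))  # False -> row skipped without scanning it
```

The subset comparison `(b & row_bits) == b` is the crucial detail: a plain bitwise AND would pass any row sharing even a single letter with the search string.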

My idea is very simple; the question is, will it actually work?

The Questions

The goal is quite obviously to get index utilization that narrows the candidate set to a maximum of, say, 10%-25% of rows. Factors that would affect this include:

  • If the search string is a single character, like a vowel, I could see this being of little use, so an implied criterion for optimal results is a search string of at least 2 or 3 characters.
  • I see the effectiveness being lost on large column values; it could be that only up to, say, 30 characters is optimal, and I suspect strings of 100+ characters would be ineffective.
  • It doesn’t support case-sensitive searching or non-ASCII searching.
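The second point is easy to demonstrate: with only 27 bits, a long value saturates the bitmask and the filter stops pruning anything. A quick Python sketch of the encoding (a simulation of the idea, not the MySQL function):

```python
def bitize(s):
    # bits 0-25 for letters a-z present; bit 26 for other non-whitespace chars
    bits = 0
    for ch in s.lower():
        if 'a' <= ch <= 'z':
            bits |= 1 << (ord(ch) - ord('a'))
        elif not ch.isspace():
            bits |= 1 << 26
    return bits

short = bitize('Ronald Bradford')
long_ = bitize('The quick brown fox jumps over the lazy dog!')
print(bin(short).count('1'))  # 8 of 27 bits set -> still selective
print(bin(long_).count('1'))  # 27 of 27 bits set -> every search "matches"
```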

The Tests

  • Create a good-sized table and distribution of first/last names (an IMDB Actors table would work)
  • Create a concatenated column of searchable fields (e.g. details = CONCAT_WS(' ', firstname, lastname))
  • Create function to return bitwised search string (e.g. bitize(str))
  • Create Indexed bitwise column and pre-populate accordingly (e.g. bitfulldetails)
  • Create BEFORE INSERT|BEFORE UPDATE Triggers to populate Indexed bitwised column
  • Test the sucker (of course you will need to include the actual LIKE command as well in the WHERE clause)

A sample query would then be:


SELECT columns
FROM users
WHERE (bitize('str') & bitfulldetails) = bitize('str')
AND fulldetails LIKE '%str%'

So the challenge to all those budding MySQL Gurus, does it seem plausible?

9 Responses to “Smarter indexing for column LIKE ‘%string%’”

  1. Dave Shrewsbury says:

    Well, you are basically matching letters, not strings. In your example, a search for ‘foal’ would also match since all four letters have a bit set. Certainly a user would not want that to match ‘Ronald Bradford’. Or did I misunderstand something?

  2. Ronald says:

    Dave,

    I’ve added some more details to my blog entry to hopefully explain more. Effectively you still restrict by the actual search; the goal however is to minimize the rows requiring the scan (i.e. less than a full table scan).
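Dave’s ‘foal’ example can be checked directly; in this small Python sketch of the encoding, the bitmask passes the row while the string search still rejects it, which is exactly the two-stage behaviour described above:

```python
def bitize(s):
    # bits 0-25 for letters a-z present; bit 26 for other non-whitespace chars
    bits = 0
    for ch in s.lower():
        if 'a' <= ch <= 'z':
            bits |= 1 << (ord(ch) - ord('a'))
        elif not ch.isspace():
            bits |= 1 << 26
    return bits

b = bitize('foal')
print((b & bitize('Ronald Bradford')) == b)  # True: f, o, a, l all occur in the name
print('foal' in 'ronald bradford')           # False: the LIKE check still filters it
```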

  3. Ants Aasma says:

    One quite common way to do substring lookup indexing is by using trigrams. That is, for each entry you store a set of all the different three-letter sequences in it, and when searching for a substring you can narrow the search set by only looking at the entries where all the trigrams of the substring are present. Actually it’s quite similar to your method, but using 3 characters (a trigram) instead of 1 (a monogram? :) ).

    Unfortunately the regular btree index doesn’t work very well with this approach (both the trigrams and the bitsets). PostgreSQL has an intarray extension that uses the GiST framework to facilitate index lookups for finding superset entries for a given subset, and it’s trivial to convert the set of trigrams into a set of integers. For MySQL I’m not so sure. It may be possible to coax the fulltext index to support the trigram matching, possibly by creating an extra fulltext indexed field with trigrams separated by spaces; you probably would need to encode special chars in trigrams somehow though, and possibly disable or work around fulltext stopwords.

    Whether this approach is usable of course depends on your specific conditions. For instance the overhead for storing all the trigrams may be too heavy, or the selectivity may not be good enough.
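The trigram filtering Ants describes can be sketched in a few lines of Python (plain sets stand in for the actual index structure; `trigram_candidate` is an illustrative name):

```python
def trigrams(s):
    # All distinct three-character windows of the lowercased string.
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_candidate(search, entry):
    # Every trigram of the search string must occur somewhere in the entry.
    return trigrams(search) <= trigrams(entry)

print(trigram_candidate('brad', 'Ronald Bradford'))  # True: 'bra' and 'rad' both present
print(trigram_candidate('foal', 'Ronald Bradford'))  # False
```

Note that ‘foal’ — which passes the single-letter bitmask — fails the trigram test, which is exactly the selectivity gain over the monogram scheme.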

  4. Jay Pipes says:

    Well, in your specific example, you’d still be left with a full table scan; the reason being that AFAIK, bitwise operators won’t use an index on the field… :( So, in effect, it would just be a better post-access filter in the optimizer (you know, when it says “Using WHERE” in the Extra column output of EXPLAIN)…

    Perhaps a more rounded solution would be the basics of implementing function indexes in MySQL. But that’s a whole different story :)

  5. Ants Aasma says:

    Unfortunately function indexes won’t help; b-tree indexes only support <, >, = comparisons on fully ordered values. What you need is pluggable indexes. For substring searches the best way AFAIK is string b-trees (don’t confuse them with regular b-trees on strings; they’re quite different, despite the confusingly similar name).

  6. Ants Aasma says:

    BTW, you might want to fix your commenting system. It currently eats smaller-than and bigger-than symbols. I was saying that b-trees only support less-than, greater-than and equals comparisons.

  7. Vance Rodriguez says:

    Here’s a stored function if you want to try this. It seems to run faster in some cases, slower in others. Granted, my test was only with a smaller table, about 100K rows and used 6 text fields. As well, I used an external table for hints that would have slowed it down further.

    ———-

    DELIMITER $$

    DROP FUNCTION IF EXISTS `test`.`make_hint_bitvalue` $$
    CREATE FUNCTION `test`.`make_hint_bitvalue` ( hint_word MEDIUMTEXT ) RETURNS INT
    NO SQL
    BEGIN
    DECLARE char1 CHAR DEFAULT '';
    DECLARE len INT DEFAULT 0;
    DECLARE pos INT DEFAULT 1;
    DECLARE bit_data INT UNSIGNED DEFAULT 0;
    DECLARE subtract_ord_alpha INT DEFAULT 0;

    SET len = LENGTH( hint_word );
    SET subtract_ord_alpha = ORD( LOWER( 'a' ) );

    WHILE ( pos <= len ) DO
      SET char1 = LOWER( SUBSTRING( hint_word, pos, 1 ) );
      IF ( char1 >= 'a' AND char1 <= 'z' ) THEN
        SET bit_data = bit_data | ( 1 << ( ORD( char1 ) - subtract_ord_alpha ) );
      ELSEIF ( char1 <> ' ' ) THEN
        SET bit_data = bit_data | ( 1 << 26 );
      END IF;
      SET pos = pos + 1;
    END WHILE;

    RETURN bit_data;
    END $$

    DELIMITER ;

  8. Vance Rodriguez says:

    Well, guess the comments aren’t long enough for the full data. Email me and I’ll send the remainder to you if you’d like.

  9. Ruturaj says:

    Drupal (http://drupal.org) uses a very different way: it basically splits the content into words and creates an entry for each word in a ‘search_index’ table, where every word is stored against its blog/node id. So it’s a different kind of index.
    But it works well!
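The word-level approach Ruturaj describes amounts to an inverted index; a minimal Python sketch (the dict stands in for Drupal’s ‘search_index’ table; the names here are illustrative, not Drupal’s actual schema):

```python
from collections import defaultdict

# word -> set of node ids containing that word
search_index = defaultdict(set)

def index_node(node_id, text):
    for word in text.lower().split():
        search_index[word].add(node_id)

index_node(1, 'Ronald Bradford')
index_node(2, 'Jay Pipes')

print(search_index['bradford'])  # {1}
```

Note this answers whole-word queries only; unlike the bitmask or trigram schemes it cannot narrow an arbitrary ‘%str%’ substring search.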