With my very heavy travel load and skilling load I’ve not had time to scratch myself. It hasn’t stopped the brain working overtime on various issues including the classic find a pattern in a string starting with a wildcard character. On a recent gig I saw the true classic.
SELECT columns
FROM users
WHERE username LIKE '%str%'
OR firstname LIKE '%str%'
OR lastname LIKE '%str%'
I went through the various options and comments on leading ‘%’, OR’s, combined columns, FULLTEXT (which doesn’t work in this case), merge indexing etc, however it perplexed me that nobody has really solved this problem, or at least shared their solutions.
I have an idea, a theory and while I’d love to prove/disprove it, I simply just don’t have the time. So here are my notes, hopefully somebody can comment positively/negatively, do some research, or encourage me to pursue it more.
The Problem
The problem is quite simply the leading wildcard, Der!. So how do you eliminate the wildcard from the search?
My Idea
Working with the earlier example, and having a concatenated field of the three components (username,firstname,lastname), my idea is to encode into 27 bits, a bit for each alphabetic character (A-Z) found, and any non-alphabetic character except whitespace in bit 27.
This column is then indexed and searched for bitwise matches. Here is a pictorial description.
String | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z | |
Ronald Bradford | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||||||||||||||||||
Search: ‘brad’ | 1 | 1 | 1 | 1 | *** MATCH *** | ||||||||||||||||||||||
Search: ‘fred’ | 1 | 1 | 1 | 1 | NO MATCH |
My idea is very simple, the question is, will it actually work.
The Questions
The goal is quite obviously to get an Index utilization, to a maximum say of 10%-25% of matching rows. Factors that would affect this include:
- If search string is one character, like a vowel I could this not useful, so an implied criteria for optimal work is at least 2 or 3 characters.
- I see the effectiveness lost on large column values, it could be that only up to say 30 characters is optimal, I see strings of 100+ characters would be ineffective.
- It doesn’t support case-sensitive searching or non ASCII searching
The Tests
- Create good sized table and distribution of first/last names (A IMDB Actors table would work)
- Create a concatenated column of searchable fields (e.g. details = CONCAT_WS(‘ ‘,firstname,lastname)
- Create function to return bitwised search string (e.g. bitize(str))
- Create Indexed bitwise column and pre-populate accordingly (e.g. bitfulldetails)
- Create BEFORE INSERT|BEFORE UPDATE Triggers to populate Indexed bitwised column
- Test the sucker (of course you will need to include the actual LIKE command as well in the WHERE clause)
A sample query would then be:
SELECT columns
FROM users
WHERE bitize('str') & bitfulldetails
AND fulldetails like '%str%'
So the challenge to all those budding MySQL Gurus, does it seem plausible?
Dave Shrewsbury says
Well, you are basically matching letters, not strings. In your example, a search for ‘foal’ would also match since all four letters are have a bit set. Certainly a user would not want that to match ‘Ronald Bradford’. Or did I misunderstand something?
Ronald says
Dave,
I’ve added some more details to my blog entry to hopefully explain more. Effectively you still restrict by actual search, the goal however is to minimize the rows requiring the scan (e.g. less then a full table scan).
Ants Aasma says
One quite common way to to substring lookup indexing is by using trigrams. That is for each entry store a set of all different three letter sequences in it and when searching for a subtring you can narrow the search set by only looking at the entries where all the trigrams of the substring are present. Actually its quite similar to your method, but using 3 characters (a trigram) instead of 1 (a monogram? :). Unfortunately the regular btree index doesn’t work very well with this approach (both the trigrams and the bitsets). Postgresql has an intarray extension that uses the GiST framework to facilitate index lookups for finding superset entries for a given subset. It’s trivial to convert the set of trigrams into a set of integers. For MySQL I’m not so sure. It may be possible to coax the fulltext index to support the trigram matching. Possibly by creating an extra fulltext indexed field with trigrams separated by space, you probably would need to encode special chars in trigrams somehow though, and possibly disable or work around fulltext stopwords.
Whether this approach is usable of course depends your specific conditions. For instance the overhead for storing all the trigrams may be too heavy, or the selectivity may not be good enough.
Jay Pipes says
Well, in your specific example, you’d still be left with a full table scan; the reason being that AFAIK, bitwise operators won’t use an index on the field… So, in effect, it would just be a better post-access filter in the optimizer (you know, when it says “Using WHERE” in the Extra column output of EXPLAIN)…
Perhaps a more rounded solution would be the basics of implementing function indexes in MySQL. But that’s a whole different story
Ants Aasma says
Unfortunately function indexes won’t help, b-tree indexes only support ,= comparisons on fully ordered values. What you need is pluggable indexes. For substring searches the best way AFAIK is string b-trees (don’t confuse with regular b-trees on strings, theire quite different, despite the confusingly similar name).
Ants Aasma says
BTW. you might want to fix your commenting system. It currently eats smaller-than, bigger-than symbols. I was say that b-trees only support less-than,greater-than and equals comparisons.
Vance Rodriguez says
Here’s a stored function if you want to try this. It seems to run faster in some cases, slower in others. Granted, my test was only with a smaller table, about 100K rows and used 6 text fields. As well, I used an external table for hints that would have slowed it down further.
———-
DELIMITER $$
DROP FUNCTION IF EXISTS `test`.`make_hint_bitvalue` $$
CREATE FUNCTION `test`.`make_hint_bitvalue` ( hint_word MEDIUMTEXT ) RETURNS INT
NO SQL
BEGIN
DECLARE char1 CHAR DEFAULT ”;
DECLARE len INT DEFAULT 0;
DECLARE pos INT DEFAULT 1;
DECLARE bit_data INT UNSIGNED DEFAULT 0;
DECLARE subtract_ord_alpha INT DEFAULT 0;
SET len = LENGTH( hint_word );
SET subtract_ord_alpha = ORD( LOWER( ‘a’ ) );
WHILE ( pos = ‘a’ AND char1
Vance Rodriguez says
Well, guess the comments aren’t long enough for the full data. Email me and I’ll send the remainder to you if you’d like.
Ruturaj says
Drupal (http://drupal.org) , uses a very different way, what it does is basically splits the content into words, and creates an entry of words into a ‘search_index’ table, where every word is stored against its blog/node id entry. So its kind of different index.
But it works well !