In a text like the KJV, certain words have special significance, such
as “God” and
“sin.” However, if you perform
FULLTEXT
searches on the kjv
table for those words using a MySQL 3.23 server,
you’ll observe a curious phenomenon—both words
appear to be missing from the text entirely:
mysql>SELECT COUNT(*) FROM kjv WHERE MATCH(vtext) AGAINST('God');
+----------+ | COUNT(*) | +----------+ | 0 | +----------+ mysql>SELECT COUNT(*) FROM kjv WHERE MATCH(vtext) AGAINST('sin');
+----------+ | COUNT(*) | +----------+ | 0 | +----------+
One property of the indexing engine is that it ignores words that are “too common” (that is, words that occur in more than half the records). This eliminates words such as “the” or “and” from the index, but that’s not what is going on here. You can verify that by counting the total number of records, and by using SQL pattern matches to count the number of records containing each word:[27]
mysql>SELECT COUNT(*) AS 'total verses',
->COUNT(IF(vtext LIKE '%God%',1,NULL)) AS 'verses containing "God"',
->COUNT(IF(vtext LIKE '%sin%',1,NULL)) AS 'verses containing "sin"'
->FROM kjv;
+--------------+-------------------------+-------------------------+ | total verses | verses containing "God" | verses containing "sin" | +--------------+-------------------------+-------------------------+ | 31102 | 4118 | 1292 | +--------------+-------------------------+-------------------------+
Neither word is present in more than half the verses, so sheer
frequency of occurrence doesn’t account for the
failure of a FULLTEXT
search to
find them. What’s really happening is that by
default, the indexing engine doesn’t include words
less than four characters long. On a MySQL 3.23 server,
there’s nothing you can do about that (at least,
nothing short of messing around with the MySQL source code and
recompiling). As of MySQL 4.0, the minimum word length is a
configurable parameter, which you can change by setting the
ft_min_word_len
server variable. For example, to
tell the indexing engine to include words containing three or more
characters, add a set-variable
line to the
[mysqld]
group of the
/etc/my.cnf
file (or whatever option file you
put server settings in):
[mysqld] set-variable = ft_min_word_len=3
After making this change and restarting the server, rebuild the
FULLTEXT
index to take advantage of the new
setting:
mysql>ALTER TABLE kjv DROP INDEX vtext;
mysql>ALTER TABLE kjv ADD FULLTEXT (vtext);
Then try out the new index to verify that it includes shorter words:
mysql>SELECT COUNT(*) FROM kjv WHERE MATCH(vtext) AGAINST('God');
+----------+ | COUNT(*) | +----------+ | 3878 | +----------+ mysql>SELECT COUNT(*) FROM kjv WHERE MATCH(vtext) AGAINST('sin');
+----------+ | COUNT(*) | +----------+ | 389 | +----------+
That’s better!
But why do the MATCH( )
queries find 3878 and 389
records, whereas the earlier LIKE
queries find
4118 and 1292 records? That’s because the
LIKE
patterns match substrings and the
FULLTEXT
search performed by MATCH( )
matches whole words.
[27] The use of COUNT( )
to
produce multiple counts from the same set of values is described in
Recipe 7.2.
Get MySQL Cookbook now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.