WEB Advent 2008 / Full-Text Searching with SQLite

I’ve evangelized SQLite a lot this year, speaking at various conferences and user groups to try encourage use and support within many web applications. So, it should be no surprise that this is the topic of my article for PHP Advent. I’m going to be covering a lesser known feature: full-text searching.

Most users are spoiled by Google and its ability to index just about everything it can get its hands on. The end result is that any popular application that wants to remain competitive has to have some way to search. There are many solutions available to application developers, such as using the native implementations within MySQL and PostgreSQL, or standalone software such as Sphinx and Lucene. These, however, can either be tricky to set up, or overkill for the purpose.

Luckily, Google have contributed some resources to SQLite in order to help. The first full-text searching implementation is available in version 3.3.8. This version provides the functionality to create a virtual table that relies on an external extension: in this case, a full-text searching algorithm for any of the text columns within the virtual table. Support for this was only enabled by default—with PDO and SQLite3—in PHP 5.3.0. Earlier versions of PHP can use the SQLite3 extension in PECL.

Creating a search index usually happens as follows:

  • Split the text into tokens.
  • Convert to lowercase.
  • Determine the root word.
  • Build the index.

Setting It All Up

SQLite provides two basic tokenizers by default, SIMPLE and PORTER. These control the way words are separated. The simple tokenizer will split text into tokens based upon spaces and punctuation. The porter tokenizer is designed for use with English and will collapse the extensions of many words to their base form. Words such as consolidate, consolidated, and consolidating will be changed to consolid.

Unfortunately, there is currently no removal of stop words within SQLite, so common words such as the, of, and to will be populated within the index. This can greatly increase the size of the index and slow down searching. The simplest solution to this is to strip out the stop words manually prior to inserting the entry into the table.

Now for some code to show you how to create your first full text index. SQLite does this by creating a virtual table using the FTS3 extension. Only the text columns within this virtual table will be searchable, and the last column is used to identify which tokenizer to use.

CREATE VIRTUAL TABLE example
USING FTS3(title TEXT, TOKENIZE SIMPLE)

After creating the table, you can query it using SELECT, INSERT, UPDATE, and DELETE statements. There is one caveat: no further indices can be created on the table, so simple queries will result in a full table scan.

Once you insert some data, you can try it out. For the rest of the examples in this article, I’m going to utilize all the titles from the English version of Wikipedia. There are 5,453,838 rows in my database, and with the fulltext index, it is 233 MB in size compared to 146 MB without the index.

Searching

Searching the index is done with the MATCH operator. The query can contain multiple terms, in which case it will return only rows containing all of the terms. There is also support for OR queries, which exclude terms, exact phrase matching, and prefix searching.

SELECT rowid, title FROM example WHERE title MATCH 'tea bag'
SELECT rowid, title FROM example WHERE title MATCH 'tea OR bag'
SELECT rowid, title FROM example WHERE title MATCH 'tea -bag'
SELECT rowid, title FROM example WHERE title MATCH '"tea bag"'
SELECT rowid, title FROM example WHERE title MATCH 'tea*'

Note that the OR is case sensitive, and only a single MATCH operator is allowed in a query.

Creating Snippets

In order to provide context to the matches within a search result, a snippet() function is available. This function will highlight any of the search words that were within the column in which a result was found.

SELECT title, snippet(example)
FROM   example
WHERE  title
MATCH  'Advent'
ADVENT|ADVENT
Advent|Advent
Advent,_Cornwall|Advent,_Cornwall
Advent,_West_Virginia|Advent,_West_Virginia
Advent:_One-Winged_Angel|Advent:_One-Winged_Angel
Advent_(band)|Advent_(band)

Several additional arguments are available to indicate the markup used to surround the matched term, and any ellipsis for cutting the result short. These are the second, third, and fourth arguments respectively, the first being used to indicate the table name.

Non-English Tokenizing

For those wishing to perform full-text searching for non-English text, there is some basic functionality provided via the ICU library. This library is a core part of PHP 6 and is also available in PHP 5.3 with the intl extension. In either of these cases, SQLite will have an additional tokenizer.

The ICU tokenizer will break words on boundaries as identified by the rules for the language and locale specified. This will work for some locales, but not all. The syntax is as follows:

CREATE VIRTUAL TABLE example
USING FTS3(title TEXT, TOKENIZE icu en_GB)

The second argument here is composed of language, country, and variant information.

With that I’d like to wish everyone a happy holiday and happy coding.

Other posts