An open source tokenizer that enables fast substring search with SQLite FTS (full-text search). If you find LIKE '%text%' too slow, this is the right solution for you.
- register the "character_tokenizer" module
- create an FTS table with the character tokenizer. For example:
  CREATE VIRTUAL TABLE Book USING fts3(name TEXT NOT NULL, author TEXT, tokenize=character);
- to search for a substring, use a phrase query. For example, to match strings such as "Adrenalines", "Linux", or "Penicillin", use:
  SELECT * FROM Book WHERE Book MATCH '"lin"';
See SqliteSubstringSearchDemo for a complete example.
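Putting the steps together, a minimal SQL session might look like this (the rows are hypothetical, added here only for illustration):

INSERT INTO Book (name) VALUES ('Linux');
INSERT INTO Book (name) VALUES ('Penicillin');
INSERT INTO Book (name) VALUES ('SQL Basics');

-- Returns 'Linux' and 'Penicillin'; 'SQL Basics' does not contain "lin".
SELECT name FROM Book WHERE Book MATCH '"lin"';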
If you want to open a database indexed with the character tokenizer (using FMDB in this example), register the tokenizer after opening the connection:
#import <FMDB/FMDatabase.h>
#import "character_tokenizer.h"

FMDatabase* database = [[FMDatabase alloc] initWithPath:@"my_database.db"];
if ([database open]) {
    // Add FTS support: register the character tokenizer
    // with the underlying sqlite3 connection.
    const sqlite3_tokenizer_module *ptr;
    get_character_tokenizer_module(&ptr);
    registerTokenizer(database.sqliteHandle, "character", ptr);
}
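A helper like registerTokenizer is not an FMDB or SQLite built-in. Here is a minimal sketch of one, adapted from the fts3_tokenizer() registration example in the SQLite documentation; it assumes the sqlite3_tokenizer_module declaration is visible (e.g. through character_tokenizer.h):

static int registerTokenizer(sqlite3 *db, const char *zName,
                             const sqlite3_tokenizer_module *pModule) {
    sqlite3_stmt *pStmt;
    // Custom FTS3 tokenizers are registered by selecting the
    // two-argument fts3_tokenizer() function.
    int rc = sqlite3_prepare_v2(db, "SELECT fts3_tokenizer(?, ?)", -1, &pStmt, 0);
    if (rc != SQLITE_OK) {
        return rc;
    }
    sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
    // The address of the module pointer is passed to SQLite as a blob.
    sqlite3_bind_blob(pStmt, 2, &pModule, sizeof(pModule), SQLITE_STATIC);
    sqlite3_step(pStmt);
    return sqlite3_finalize(pStmt);
}

Note that recent SQLite versions disable the two-argument form of fts3_tokenizer() by default; it may need to be re-enabled with sqlite3_db_config(db, SQLITE_DBCONFIG_ENABLE_FTS3_TOKENIZER, 1, 0) before registration.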
English uses spaces to separate words, but Chinese and Japanese do not. Since the built-in FTS tokenizers rely on spaces to separate words, they treat an entire Chinese or Japanese sentence as a single word, which makes FTS useless for these languages.
Third-party Chinese and Japanese tokenizers (mmseg for Chinese; MeCab and ChaSen for Japanese) use sophisticated, memory-intensive approaches to find the ambiguous boundaries between words. For simple applications such as searching for people's names, a plain substring search is a more reasonable choice than these sophisticated tokenizers.
The character tokenizer instead splits the text into single-character tokens. Searching for a substring is then equivalent to finding consecutive tokens in the document, which is exactly what FTS phrase queries provide.
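In other words, with the character tokenizer a phrase query plays the role of a LIKE '%...%' scan, but is answered from the FTS index. A hedged comparison, reusing the hypothetical Book table from above (the two queries may differ in details such as case folding):

-- Answered from the FTS index: find the consecutive tokens l, i, n.
SELECT name FROM Book WHERE Book MATCH '"lin"';

-- Asks roughly the same question, but must scan every row.
SELECT name FROM Book WHERE name LIKE '%lin%';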