This project provides custom FTS5 tokenizers for SQLite that use the International Components for Unicode (ICU) library for robust word segmentation across many languages.
The project supports both FTS5 API v1 (legacy) and API v2 (current) implementations, with the ability to build either version based on your needs. The target locale is configurable at build time, with support for both universal and locale-specific tokenizers.
- API v2: Current implementation with full FTS5 capabilities and enhanced features (default)
- API v1: Legacy implementation for older SQLite versions without FTS5 API v2 support
- Both: Written in C for maximum stability and performance in high-availability systems
- CMake (version 3.10 or higher)
- C Compiler (GCC, Clang, or MSVC)
- SQLite3 development libraries (`libsqlite3-dev` on Debian/Ubuntu, `sqlite-devel` on RHEL/Fedora)
- ICU development libraries (`libicu-dev` on Debian/Ubuntu, `libicu-devel` on RHEL/Fedora)
For RHEL-based distributions (RHEL, CentOS, Rocky Linux, AlmaLinux, etc.) and other systems with older SQLite versions, use the legacy API v1 as detailed below.
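A quick way to decide which API to build is to look at the installed SQLite version. The sketch below is only an illustration: `choose_api_version` is a hypothetical helper, not part of this project, and it assumes that the FTS5 v2 tokenizer interface (`fts5_tokenizer_v2`) first appeared in SQLite 3.47.0. The project's own build scripts may use a different check.

```shell
#!/bin/sh
# Hypothetical helper (not part of this project): print "v2" or "v1"
# for a SQLite version string such as "3.26.0".
# Assumption: the FTS5 v2 tokenizer API first shipped in SQLite 3.47.0.
choose_api_version() {
    # Split "MAJOR.MINOR.PATCH" on dots into positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    major=${1:-0}
    minor=${2:-0}
    if [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 47 ]; }; then
        echo v2
    else
        echo v1
    fi
}

# Example (needs the sqlite3 command-line shell on PATH):
#   choose_api_version "$(sqlite3 -version | cut -d' ' -f1)"
```

If the helper prints `v1`, the legacy scripts and the `-DAPI_VERSION=v1` option are the ones to use.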
    # Build all tokenizers (API v2 by default)
    ./scripts/build_all.sh
    # Test all tokenizers
    ./scripts/test_all.sh

For older SQLite versions that don't support FTS5 API v2:

    # Build all legacy API v1 tokenizers
    ./scripts/build_all_legacy.sh
    # Test all legacy API v1 tokenizers
    ./scripts/test_all_legacy.sh

    # Build all tokenizers for both API versions
    ./scripts/build_all_with_legacy.sh

To build a single tokenizer manually with the default API v2:

    mkdir build && cd build
    cmake .. -DLOCALE=ja # e.g., Japanese
    make

To build the same tokenizer with the legacy API v1:

    mkdir build && cd build
    cmake .. -DAPI_VERSION=v1 -DLOCALE=ja # e.g., Japanese
    make

The resulting library will have a `_legacy` suffix (e.g., `libfts5_icu_ja_legacy.so`).
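The naming pattern can be summarized in a few lines of shell. This is a sketch based only on the file names that appear in this README (`libfts5_icu.so`, `libfts5_icu_th.so`, `libfts5_icu_ja_legacy.so`). `lib_name` is a hypothetical helper, the name it produces for a legacy universal build is an assumption, and the `.so` extension applies to Linux only.

```shell
#!/bin/sh
# Hypothetical helper (not part of this project): print the expected
# library file name for a locale and API version.
#   $1 = locale code (e.g. ja, th), or empty for the universal tokenizer
#   $2 = API version, v1 or v2 (defaults to v2)
lib_name() {
    locale=$1
    api=${2:-v2}
    name=libfts5_icu
    if [ -n "$locale" ]; then
        name="${name}_${locale}"
    fi
    # Assumption: a legacy universal build is named libfts5_icu_legacy.so.
    if [ "$api" = v1 ]; then
        name="${name}_legacy"
    fi
    echo "${name}.so"
}

lib_name ja v1
lib_name th
```

The two calls at the end print the legacy Japanese name and the API v2 Thai name used elsewhere in this README.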
With a legacy API v1 build:

    -- Note the "_legacy" suffix in the file name
    .load ./build/libfts5_icu_th_legacy.so
    CREATE VIRTUAL TABLE documents_th USING fts5(
        content,
        tokenize = 'icu_th'
    );

With an API v2 build:

    .load ./build/libfts5_icu_th.so
    CREATE VIRTUAL TABLE documents_th USING fts5(
        content,
        tokenize = 'icu_th'
    );

Thai example:

    -- Load the appropriate library
    .load ./build/libfts5_icu_th.so
    -- Create table and search
    CREATE VIRTUAL TABLE documents_th USING fts5(content, tokenize = 'icu_th');
    INSERT INTO documents_th(content) VALUES ('การทดสอบภาษาไทยในระบบค้นหา');
    SELECT * FROM documents_th WHERE documents_th MATCH 'ภาษา';

Universal tokenizer example:

    .load ./build/libfts5_icu.so
    CREATE VIRTUAL TABLE documents USING fts5(content, tokenize = 'icu');
    INSERT INTO documents(content) VALUES ('甜蜜蜜,你笑得甜蜜蜜-หวานปานน้ำผึ้ง,ยิ้มของคุณช่างหวานปานน้ำผึ้ง');
    SELECT * FROM documents WHERE documents MATCH 'หวาน';

| Locale | Language | Test File |
|---|---|---|
| `ar` | Arabic | tests/test_ar_tokenizer.sql |
| `el` | Greek | tests/test_el_tokenizer.sql |
| `he` | Hebrew | tests/test_he_tokenizer.sql |
| `ja` | Japanese | tests/test_ja_tokenizer.sql |
| `ko` | Korean | tests/test_ko_tokenizer.sql |
| `ru` | Russian | tests/test_ru_tokenizer.sql |
| `th` | Thai | tests/test_th_tokenizer.sql |
| `zh` | Chinese | tests/test_zh_tokenizer.sql |
| - | Universal | tests/test_universal_tokenizer.sql |
- `cn` → `zh` (Chinese, with warning)
- `jp` → `ja` (Japanese, with warning)
- `kr` ↔ `ko` (Korean, both supported)
- `iw` ↔ `he` (Hebrew, both supported)
- `gr` ↔ `el` (Greek, both supported)
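The alias handling can be pictured as a small lookup. The sketch below is not the project's implementation: `normalize_locale` is a hypothetical helper, the warning text is invented for illustration, and it assumes that the paired codes (`kr`/`ko`, `iw`/`he`, `gr`/`el`) resolve to the code listed in the locale table.

```shell
#!/bin/sh
# Hypothetical helper (not part of this project): map a locale alias to
# the code used in the locale table. Warnings go to stderr so the
# resolved code can be captured from stdout.
normalize_locale() {
    case $1 in
        cn) echo "warning: 'cn' is an alias; using 'zh'" >&2; echo zh ;;
        jp) echo "warning: 'jp' is an alias; using 'ja'" >&2; echo ja ;;
        # Assumption: both spellings are accepted and mean the same locale.
        kr) echo ko ;;
        iw) echo he ;;
        gr) echo el ;;
        *)  echo "$1" ;;
    esac
}

normalize_locale jp
normalize_locale ko
```

Any code that is not a known alias is passed through unchanged, so already-canonical values such as `ko` or `th` are left alone.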
For Windows using Visual Studio:
    mkdir build
    cd build
    cmake -G "Visual Studio 17 2022" -T host=x64 -A x64 .. -DICU_ROOT="C:\icu" -DSQLite3_INCLUDE_DIR="C:\sqlite\include" -DSQLite3_LIBRARY="C:\sqlite\sqlite3.lib" -DAPI_VERSION=v1 -DLOCALE=th
    cmake --build . --config Release

Locale-specific tokenizers use optimized ICU rules for each language:
- Japanese (`ja`): `NFKD; Katakana-Hiragana; Lower; NFKC`
- Chinese (`zh`): `NFKD; Traditional-Simplified; Lower; NFKC`
- Thai (`th`): `NFKD; Lower; NFKC`
- Korean (`ko`): `NFKD; Lower; NFKC`
- Arabic (`ar`): `NFKD; Arabic-Latin; Lower; NFKC`
- Russian (`ru`): `NFKD; Cyrillic-Latin; Lower; NFKC`
- Hebrew (`he`): `NFKD; Hebrew-Latin; Lower; NFKC`
- Greek (`el`): `NFKD; Greek-Latin; Lower; NFKC`
Universal tokenizer rule: NFKD; Arabic-Latin; Cyrillic-Latin; Hebrew-Latin; Greek-Latin; Latin-ASCII; Lower; NFKC; Traditional-Simplified; Katakana-Hiragana
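These rule strings look like ICU compound transform IDs, so their effect on sample text can be explored outside SQLite with ICU's `uconv` command-line tool. The sketch below assumes `uconv` is installed (it ships with the ICU tools on many systems). `icu_transform` is a hypothetical wrapper, and its output is only an approximation of what the tokenizers do internally, since it performs no word segmentation.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of this project): apply an ICU transform
# rule string to UTF-8 text read from stdin.
# Assumption: ICU's uconv tool is available on PATH.
icu_transform() {
    uconv -f utf-8 -t utf-8 -x "$1"
}

# Example: try the Japanese rule on mixed full-width Latin and Katakana.
#   printf '%s\n' 'ＴＯＫＹＯ カタカナ' | icu_transform 'NFKD; Katakana-Hiragana; Lower; NFKC'
```

Running different rule strings over the same input is a simple way to compare a locale-specific rule with the universal one before choosing which tokenizer to build.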
- Locale-specific: When you know the primary language and performance is important
- Universal: For mixed-language content or unknown language at build time
    # Format all source files
    ./scripts/code-format.sh
    # Run static analysis
    ./scripts/lint-check.sh

- Script Reference - Complete list of available scripts
- Build & Test Guide - Detailed building and testing information
- API Implementation Details - Technical implementation documentation
- High-Performance Text Search: Optimized for various languages using ICU
- Cross-Platform Compatibility: Works on Linux, Windows, and macOS
- RHEL Support: Backwards compatibility for older SQLite versions
- Memory Safe: Includes buffer overflow prevention and secure coding practices
- Modular Design: Clean, well-documented code structure