FTS5 ICU Tokenizer for SQLite

This project provides custom FTS5 tokenizers for SQLite that use the International Components for Unicode (ICU) library for robust word segmentation across a wide range of languages.

The project supports both FTS5 API v1 (legacy) and API v2 (current) implementations, with the ability to build either version based on your needs. The target locale is configurable at build time, with support for both universal and locale-specific tokenizers.

  • API v2: Current implementation with full FTS5 capabilities and enhanced features (default)
  • API v1: Legacy implementation for older SQLite versions without FTS5 API v2 support
  • Both: Written in C for maximum stability and performance in high-availability systems

Quick Start

Prerequisites

  • CMake (version 3.10 or higher)
  • C Compiler (GCC, Clang, or MSVC)
  • SQLite3 development libraries (libsqlite3-dev on Debian/Ubuntu, sqlite-devel on RHEL/Fedora)
  • ICU development libraries (libicu-dev on Debian/Ubuntu, libicu-devel on RHEL/Fedora)

RHEL Compatibility Note

For RHEL-based distributions (RHEL, CentOS, Rocky Linux, AlmaLinux, etc.) and other systems with older SQLite versions, use the legacy API v1 as detailed below.


Building & Testing (API v2 - Default)

# Build all tokenizers (API v2 by default)
./scripts/build_all.sh

# Test all tokenizers
./scripts/test_all.sh

Building & Testing (API v1 - Legacy)

For older SQLite versions that don't support FTS5 API v2:

# Build all legacy API v1 tokenizers
./scripts/build_all_legacy.sh

# Test all legacy API v1 tokenizers
./scripts/test_all_legacy.sh

# Build all tokenizers for both API versions
./scripts/build_all_with_legacy.sh

Building Individual Locales

API v2 (Default)

mkdir build && cd build
cmake .. -DLOCALE=ja  # e.g., Japanese
make

API v1 (Legacy - for RHEL & older SQLite)

mkdir build && cd build
cmake .. -DAPI_VERSION=v1 -DLOCALE=ja  # e.g., Japanese
make

The resulting library will have a _legacy suffix (e.g., libfts5_icu_ja_legacy.so).


Usage Examples

Loading API v1 (Legacy) Tokenizers

-- Note the "_legacy" suffix
.load ./build/libfts5_icu_th_legacy.so

CREATE VIRTUAL TABLE documents_th USING fts5(
    content,
    tokenize = 'icu_th'
);

Loading API v2 (Current) Tokenizers

.load ./build/libfts5_icu_th.so

CREATE VIRTUAL TABLE documents_th USING fts5(
    content,
    tokenize = 'icu_th'
);

Example: Thai Text Search

-- Load the appropriate library
.load ./build/libfts5_icu_th.so

-- Create table and search
CREATE VIRTUAL TABLE documents_th USING fts5(content, tokenize = 'icu_th');
INSERT INTO documents_th(content) VALUES ('การทดสอบภาษาไทยในระบบค้นหา');
SELECT * FROM documents_th WHERE documents_th MATCH 'ภาษา';
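
As an optional check (a sketch, not part of the project's test files), FTS5's built-in highlight() auxiliary function can show where the ICU word boundaries fall inside the unsegmented Thai string:

-- Bracket the matched token to make the detected word boundary visible
SELECT highlight(documents_th, 0, '[', ']') FROM documents_th
WHERE documents_th MATCH 'ภาษา';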

Example: Universal Multi-Language Support

.load ./build/libfts5_icu.so

CREATE VIRTUAL TABLE documents USING fts5(content, tokenize = 'icu');
INSERT INTO documents(content) VALUES ('甜蜜蜜,你笑得甜蜜蜜-หวานปานน้ำผึ้ง,ยิ้มของคุณช่างหวานปานน้ำผึ้ง');
SELECT * FROM documents WHERE documents MATCH 'หวาน';
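
Because the universal tokenizer's rule chain (listed under "Locale-Specific Performance Optimizations" below) includes Latin-ASCII and Lower, an unaccented lowercase query should also match accented Latin-script text. A small follow-on sketch, assuming the same folding is applied to indexed text and query terms:

INSERT INTO documents(content) VALUES ('Crème Brûlée recipe');
SELECT * FROM documents WHERE documents MATCH 'brulee';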

Supported Locales

Locale   Language    Test File
ar       Arabic      tests/test_ar_tokenizer.sql
el       Greek       tests/test_el_tokenizer.sql
he       Hebrew      tests/test_he_tokenizer.sql
ja       Japanese    tests/test_ja_tokenizer.sql
ko       Korean      tests/test_ko_tokenizer.sql
ru       Russian     tests/test_ru_tokenizer.sql
th       Thai        tests/test_th_tokenizer.sql
zh       Chinese     tests/test_zh_tokenizer.sql
-        Universal   tests/test_universal_tokenizer.sql

Locale Mappings

The build also accepts some alternate locale codes, which are mapped to the supported locales above:

  • cn → zh (Chinese, with warning)
  • jp → ja (Japanese, with warning)
  • kr → ko (Korean, both supported)
  • iw → he (Hebrew, both supported)
  • gr → el (Greek, both supported)

Advanced Configuration

Windows Build

For Windows using Visual Studio:

mkdir build
cd build
cmake -G "Visual Studio 17 2022" -T host=x64 -A x64 .. -DICU_ROOT="C:\icu" -DSQLite3_INCLUDE_DIR="C:\sqlite\include" -DSQLite3_LIBRARY="C:\sqlite\sqlite3.lib" -DAPI_VERSION=v1 -DLOCALE=th
cmake --build . --config Release

Locale-Specific Performance Optimizations

Locale-specific tokenizers use optimized ICU rules for each language:

  • Japanese (ja): NFKD; Katakana-Hiragana; Lower; NFKC
  • Chinese (zh): NFKD; Traditional-Simplified; Lower; NFKC
  • Thai (th): NFKD; Lower; NFKC
  • Korean (ko): NFKD; Lower; NFKC
  • Arabic (ar): NFKD; Arabic-Latin; Lower; NFKC
  • Russian (ru): NFKD; Cyrillic-Latin; Lower; NFKC
  • Hebrew (he): NFKD; Hebrew-Latin; Lower; NFKC
  • Greek (el): NFKD; Greek-Latin; Lower; NFKC

Universal tokenizer rule: NFKD; Arabic-Latin; Cyrillic-Latin; Hebrew-Latin; Greek-Latin; Latin-ASCII; Lower; NFKC; Traditional-Simplified; Katakana-Hiragana
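
For example, the Traditional-Simplified step in the Chinese rule should let a query written in simplified characters match text stored in traditional characters. A minimal sketch, assuming the transliteration is applied to both indexed tokens and query terms:

.load ./build/libfts5_icu_zh.so

CREATE VIRTUAL TABLE documents_zh USING fts5(content, tokenize = 'icu_zh');
INSERT INTO documents_zh(content) VALUES ('繁體字測試');        -- traditional characters
SELECT * FROM documents_zh WHERE documents_zh MATCH '繁体字';   -- simplified query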

When to Use Each Approach

  • Locale-specific: When you know the primary language and performance is important
  • Universal: For mixed-language content or unknown language at build time
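
The two approaches can also be combined. A hedged sketch, assuming both libraries are loaded into the same connection (each registers a distinct tokenizer name, per the examples above), so each table can use whichever tokenizer suits its content:

.load ./build/libfts5_icu_ja.so
.load ./build/libfts5_icu.so

CREATE VIRTUAL TABLE notes_ja USING fts5(content, tokenize = 'icu_ja');   -- known-Japanese content
CREATE VIRTUAL TABLE notes_any USING fts5(content, tokenize = 'icu');     -- mixed or unknown languages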

Code Quality & Maintenance

Formatting & Linting

# Format all source files
./scripts/code-format.sh

# Run static analysis
./scripts/lint-check.sh

Key Benefits

  • High-Performance Text Search: Optimized for various languages using ICU
  • Cross-Platform Compatibility: Works on Linux, Windows, and macOS
  • RHEL Support: Backwards compatibility for older SQLite versions
  • Memory Safe: Includes buffer overflow prevention and secure coding practices
  • Modular Design: Clean, well-documented code structure
