| 1 | +# Corpus Integrity Tests |
| 2 | + |
| 3 | +This directory contains tests that verify the integrity, format, and parseability of corpus files in PyThaiNLP. |
| 4 | + |
| 5 | +## Purpose |
| 6 | + |
| 7 | +These tests are separate from regular unit tests because: |
| 8 | +1. They test actual file loading and parsing (not mocked) |
| 9 | +2. Downloadable corpus tests require network access and can be slow |
| 10 | +3. They verify corpus data format and structure |
| 11 | +4. They should only run when corpus files or corpus code changes |
| 12 | + |
| 13 | +## Test Categories |
| 14 | + |
| 15 | +### Built-in Corpus Tests (`test_builtin_corpus.py`) |
| 16 | + |
| 17 | +Tests corpus files that are included in the package: |
| 18 | +- Text word lists (negations, stopwords, syllables, words, etc.) |
| 19 | +- CSV files (provinces) |
| 20 | +- Frequency data (TNC, TTC) |
| 21 | +- Name lists (family names, person names) |
| 22 | + |
| 23 | +**Run time:** < 1 second |
| 24 | + |
| 25 | +### Downloadable Corpus Tests (`test_downloadable_corpus.py`) |
| 26 | + |
| 27 | +Tests corpus files that need to be downloaded: |
| 28 | +- OSCAR word frequencies (96 MB) |
| 29 | +- TNC bigram/trigram frequencies (41 MB + 145 MB) |
| 30 | + |
| 31 | +**Run time:** ~17 seconds (includes download time) |
| 32 | + |
| 33 | +## Running Tests |
| 34 | + |
| 35 | +### Run all corpus integrity tests: |
| 36 | +```bash |
| 37 | +python -m unittest discover -s tests/corpus_integrity -v |
| 38 | +``` |
| 39 | + |
| 40 | +### Run only built-in corpus tests: |
| 41 | +```bash |
| 42 | +python -m unittest tests.corpus_integrity.test_builtin_corpus -v |
| 43 | +``` |
| 44 | + |
| 45 | +### Run only downloadable corpus tests: |
| 46 | +```bash |
| 47 | +python -m unittest tests.corpus_integrity.test_downloadable_corpus -v |
| 48 | +``` |
| 49 | + |
| 50 | +## CI Integration |
| 51 | + |
| 52 | +The corpus integrity tests run automatically via a GitHub Actions workflow (`.github/workflows/corpus-integrity.yml`) when: |
| 53 | +- Changes are made to `pythainlp/corpus/**` |
| 54 | +- Changes are made to `tests/corpus_integrity/**` |
| 55 | +- The workflow file itself is modified |
| 56 | + |
| 57 | +## What is Tested |
| 58 | + |
| 59 | +Each test verifies: |
| 60 | +1. **Loadability**: File can be loaded without errors |
| 61 | +2. **Type correctness**: Returns expected data type (frozenset, list, dict) |
| 62 | +3. **Non-empty**: Contains actual data |
| 63 | +4. **Format validity**: Data structure matches expected format |
| 64 | +5. **Content validity**: Contains expected content (e.g., Thai characters) |
| 65 | + |
| 66 | +## Adding New Tests |
| 67 | + |
| 68 | +When adding a new corpus file or function to `pythainlp.corpus`: |
| 69 | +1. Add a test to `test_builtin_corpus.py` if it's included in the package |
| 70 | +2. Add a test to `test_downloadable_corpus.py` if it requires download |
| 71 | +3. Verify the test catches format errors by temporarily breaking a local copy of the corpus file and confirming that the test fails |
| 72 | + |
| 73 | +## Relationship to Unit Tests |
| 74 | + |
| 75 | +- **Unit tests** (`tests/core/test_corpus.py`): Use mocks for speed, test code logic |
| 76 | +- **Corpus integrity tests** (this directory): Use real data, test file integrity |
| 77 | + |
| 78 | +Both test suites are important and complementary. |
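
The five checks listed under "What is Tested" could be sketched as a single `unittest` case. This is a minimal illustration, not the code in `test_builtin_corpus.py`: `load_word_list`, `has_thai_char`, and the temporary `words_th.txt` file are hypothetical stand-ins, used so the example runs without PyThaiNLP installed. A real test would presumably call the corresponding `pythainlp.corpus` function instead.

```python
import tempfile
import unittest
from pathlib import Path


def load_word_list(path: Path) -> frozenset:
    """Hypothetical loader: one entry per line, blank lines skipped."""
    with open(path, encoding="utf-8") as f:
        return frozenset(line.strip() for line in f if line.strip())


def has_thai_char(text: str) -> bool:
    """Return True if any character is in the Thai Unicode block (U+0E00-U+0E7F)."""
    return any("\u0e00" <= ch <= "\u0e7f" for ch in text)


class TestWordListIntegrity(unittest.TestCase):
    def setUp(self):
        # Stand-in for a real corpus file (assumption for this sketch).
        self.tmp = tempfile.TemporaryDirectory()
        self.path = Path(self.tmp.name) / "words_th.txt"
        self.path.write_text("ไม่\nและ\nหรือ\n", encoding="utf-8")

    def tearDown(self):
        self.tmp.cleanup()

    def test_word_list(self):
        # 1. Loadability: loading raises no exception.
        words = load_word_list(self.path)
        # 2. Type correctness: the loader returns a frozenset.
        self.assertIsInstance(words, frozenset)
        # 3. Non-empty: the corpus contains data.
        self.assertGreater(len(words), 0)
        # 4. Format validity: every entry is a stripped, non-empty string.
        for word in words:
            self.assertIsInstance(word, str)
            self.assertEqual(word, word.strip())
            self.assertTrue(word)
        # 5. Content validity: every entry contains Thai characters.
        self.assertTrue(all(has_thai_char(word) for word in words))
```

Saved as a module under `tests/corpus_integrity/`, a test like this would be picked up by the `unittest discover` command shown under "Running Tests". Checking the entries one by one keeps a failure message close to the offending line of the corpus file.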