Commit fc1aaec

Copilot and bact committed
Add README for corpus integrity tests
Co-authored-by: bact <128572+bact@users.noreply.github.com>
1 parent fa9f3a7 commit fc1aaec

1 file changed

+78
-0
lines changed

tests/corpus_integrity/README.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# Corpus Integrity Tests

This directory contains tests that verify the integrity, format, and parseability of corpus files in PyThaiNLP.

## Purpose

These tests are separate from regular unit tests because:
1. They test actual file loading and parsing (not mocked)
2. Downloadable corpus tests require network access and can be slow
3. They verify corpus data format and structure
4. They should only run when corpus files or corpus code changes

## Test Categories

### Built-in Corpus Tests (`test_builtin_corpus.py`)

Tests corpus files that are included in the package:
- Text word lists (negations, stopwords, syllables, words, etc.)
- CSV files (provinces)
- Frequency data (TNC, TTC)
- Name lists (family names, person names)

**Run time:** < 1 second

### Downloadable Corpus Tests (`test_downloadable_corpus.py`)

Tests corpus files that need to be downloaded:
- OSCAR word frequencies (96 MB)
- TNC bigram/trigram frequencies (41 MB + 145 MB)

**Run time:** ~17 seconds (includes download time)

## Running Tests

### Run all corpus integrity tests:
```bash
python -m unittest discover -s tests/corpus_integrity -v
```

### Run only built-in corpus tests:
```bash
python -m unittest tests.corpus_integrity.test_builtin_corpus -v
```

### Run only downloadable corpus tests:
```bash
python -m unittest tests.corpus_integrity.test_downloadable_corpus -v
```

## CI Integration

The corpus integrity tests run automatically via the GitHub Actions workflow (`.github/workflows/corpus-integrity.yml`) when:
- Changes are made to `pythainlp/corpus/**`
- Changes are made to `tests/corpus_integrity/**`
- The workflow file itself is modified

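To decide locally whether a change would trigger the workflow, the three conditions above can be approximated in a few lines. This is only a sketch: the function name `should_run_corpus_tests` is hypothetical, and the `**` path filter is approximated by a plain prefix match rather than GitHub's actual matching rules.

```python
# Path prefixes taken from the trigger list above; "**" in the workflow
# filter is approximated here by a plain prefix match (an assumption).
TRIGGER_PREFIXES = (
    "pythainlp/corpus/",
    "tests/corpus_integrity/",
)
WORKFLOW_FILE = ".github/workflows/corpus-integrity.yml"


def should_run_corpus_tests(changed_files):
    """Return True if any changed path would trigger the workflow."""
    return any(
        path == WORKFLOW_FILE or path.startswith(TRIGGER_PREFIXES)
        for path in changed_files
    )
```

The list of changed paths could come, for example, from the output of `git diff --name-only`.
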
## What is Tested

Each test verifies:
1. **Loadability**: File can be loaded without errors
2. **Type correctness**: Returns the expected data type (frozenset, list, dict)
3. **Non-empty**: Contains actual data
4. **Format validity**: Data structure matches the expected format
5. **Content validity**: Contains expected content (e.g., Thai characters)

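The five checks above might look roughly like this for a word list. This is an illustrative sketch, not the code in this directory: `SAMPLE_WORDS` is a hypothetical stand-in for the value returned by a corpus loader, and `contains_thai` is a helper invented for the example.

```python
import unittest

# Hypothetical stand-in for a word list returned by a corpus loader;
# a real test would call the loader in pythainlp.corpus instead.
SAMPLE_WORDS = frozenset({"ไม่", "และ", "หรือ"})


def contains_thai(text):
    """True if text has at least one character in the Thai Unicode block."""
    return any("\u0e00" <= ch <= "\u0e7f" for ch in text)


class TestSampleWordList(unittest.TestCase):
    def test_word_list(self):
        words = SAMPLE_WORDS  # 1. Loadability: obtaining the data did not raise
        self.assertIsInstance(words, frozenset)  # 2. Type correctness
        self.assertGreater(len(words), 0)  # 3. Non-empty
        for word in words:
            self.assertIsInstance(word, str)  # 4. Format validity
            self.assertEqual(word, word.strip())  # no stray whitespace
        # 5. Content validity: at least one entry contains Thai script
        self.assertTrue(any(contains_thai(w) for w in words))
```
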
## Adding New Tests

When adding a new corpus file or function to `pythainlp.corpus`:
1. Add a test to `test_builtin_corpus.py` if it's included in the package
2. Add a test to `test_downloadable_corpus.py` if it requires download
3. Verify the test catches format errors by temporarily breaking the corpus

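Step 3 can also be rehearsed on a throwaway copy instead of editing the real corpus file. The sketch below assumes a simple CSV check on column count; `check_csv_columns` and `check_rejects_broken_copy` are hypothetical helpers, and the rows passed in are made-up data rather than the real corpus format.

```python
import csv
import tempfile
from pathlib import Path


def check_csv_columns(path, expected_columns):
    """Raise ValueError if any row does not have expected_columns fields."""
    with open(path, encoding="utf-8", newline="") as f:
        for line_no, row in enumerate(csv.reader(f), start=1):
            if len(row) != expected_columns:
                raise ValueError(
                    f"{path}: line {line_no} has {len(row)} fields, "
                    f"expected {expected_columns}"
                )


def check_rejects_broken_copy(good_text, broken_text, expected_columns):
    """Return True if the check passes on good_text and fails on broken_text."""
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "good.csv"
        broken = Path(tmp) / "broken.csv"
        good.write_text(good_text, encoding="utf-8")
        broken.write_text(broken_text, encoding="utf-8")
        check_csv_columns(good, expected_columns)  # must not raise
        try:
            check_csv_columns(broken, expected_columns)
        except ValueError:
            return True  # the deliberate breakage was detected
        return False  # the check is too weak to notice the breakage
```

A check that returns `False` here would not have caught the format error, which suggests the corresponding test needs a stricter assertion.
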
## Relationship to Unit Tests

- **Unit tests** (`tests/core/test_corpus.py`): Use mocks for speed, test code logic
- **Corpus integrity tests** (this directory): Use real data, test file integrity

Both test suites are important and complementary.
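
The difference between the two styles can be sketched with a toy loader. Everything here is hypothetical: `load_words` is not a PyThaiNLP function, and the temporary file merely stands in for a real corpus file on disk.

```python
import tempfile
import unittest
from pathlib import Path
from unittest import mock


def load_words(path):
    """Hypothetical loader: one word per line, returned as a frozenset."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    return frozenset(line.strip() for line in lines if line.strip())


class UnitStyleTest(unittest.TestCase):
    def test_logic_with_mock(self):
        # No file is touched: open() is replaced, so only the parsing
        # logic (blank-line skipping, stripping) is exercised.
        fake_file = mock.mock_open(read_data="ก\n\nข\n")
        with mock.patch("builtins.open", fake_file):
            self.assertEqual(load_words("ignored.txt"), frozenset({"ก", "ข"}))


class IntegrityStyleTest(unittest.TestCase):
    def test_real_file(self):
        # A file is actually written and read back; in this directory the
        # equivalent test would read the shipped or downloaded corpus file.
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / "words.txt"
            path.write_text("ก\nข\n", encoding="utf-8")
            words = load_words(path)
        self.assertIsInstance(words, frozenset)
        self.assertGreater(len(words), 0)
```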
