This directory contains comprehensive documentation of all Temporal QA datasets, organized by collection type and characteristics.
- Diachronic Corpora - Time-stamped historical documents
- Synchronic Corpora - Wikipedia snapshots at specific points in time
- Annotated Temporal Corpora - Explicitly annotated with TimeML/temporal relations
- Web & Real-Time - Live search and periodically updated datasets
- Synthetic & Specialized - Generated datasets for specific reasoning tasks
- Knowledge Graph-Based - Structured temporal KG datasets
- Domain-Specific - Medical, Legal, Financial
| Dataset | Year | #Q | Source Type | Time Coverage | Creation | Multi-Hop | Metadata | Paper | Data |
|---|---|---|---|---|---|---|---|---|---|
| DIACHRONIC (Primary Historical Sources) | | | | | | | | | |
| ArchivalQA | 2022 | 532K | NYT | 1987-2007 | AG | ✗ | ✓ | 📄 | 💾 |
| ChroniclingAmericaQA | 2024 | 485K | Historical News | 1800-1920 | AG | ✗ | ✓ | 📄 | 💾 |
| StreamingQA | 2022 | 147K | News | 2007-2020 | CS | ✓ | ✓ | 📄 | 💾 |
| NewsQA | 2017 | 119K | CNN/DM | 2007-2015 | CS | ✗ | ✗ | 📄 | 💾 |
| TempLAMA | 2022 | 50K | News | 2010-2020 | CS | ✗ | ✓ | 📄 | 💾 |
| TORQUE | 2020 | 21K | News | - | CS | ✗ | ✗ | 📄 | 💾 |
| ForecastQA | 2021 | 10.3K | News | 2015-2019 | CS | ✓ | ✓ | 📄 | 💾 |
| TDDiscourse | 2019 | 6.1K | News | Unspecified | CS | ✗ | ✗ | 📄 | 💾 |
| TemporalQuestions | 2021 | 1K | NYT | 1987-2007 | CS | ✗ | ✓ | 📄 | Contact authors |
| SYNCHRONIC (Wikipedia Snapshots) | | | | | | | | | |
| ComplexTempQA | 2024 | 100.2K | Wikipedia | 1987-2023 | AG | ✓ | ✓ | 📄 | 💾 |
| TEMPREASON | 2023 | 52.8K | Wiki/Wikidata | 634-2023 | SC | ✗ | ✗ | 📄 | 💾 |
| TimeQA | 2021 | 41.2K | Wikipedia | 1367-2018 | AG | ✗ | ✗ | 📄 | 💾 |
| TemporalAlignmentQA | 2024 | 20K | Wikipedia | 2000-2023 | AG | ✗ | ✗ | 📄 | Contact authors |
| SituatedQA | 2021 | 12.2K | Wikipedia | ≤ 2021 | CS | ✗ | ✗ | 📄 | 💾 |
| TempTabQA | 2023 | 11.4K | Wiki Infoboxes | - | CS | ✗ | ✗ | 📄 | 💾 |
| TiQ | 2024 | 10K | Wikipedia | Unspecified | AG | ✗ | ✗ | 📄 | Contact authors |
| PAT-Questions | 2024 | 6.1K | Wikipedia | Present | CS | ✓ | ✗ | 📄 | 💾 |
| TRACIE | 2021 | 5.4K | Wikipedia | ≤ 2020 | CS | ✗ | ✗ | 📄 | 💾 |
| MenatQA | 2023 | 2.8K | Wikipedia | 1367-2018 | AG | ✗ | ✗ | 📄 | 💾 |
| WEB & REAL-TIME | | | | | | | | | |
| ReaLTimeQA | 2023 | 5.1K | Web Search | 2020-2024 | CS | ✗ | ✗ | 📄 | 💾 |
| FreshQA | 2024 | 600 | Google Search | Dynamic | CS | ✓ | ✗ | 📄 | 💾 |
| SYNTHETIC & SPECIALIZED | | | | | | | | | |
| COTEMPQA | 2024 | 4.7K | Wikidata | ≤ 2023 | CS | ✓ | ✗ | 📄 | 💾 |
| UnSeenTimeQA | 2024 | 3.6K | Synthetic | - | AG | ✓ | ✗ | 📄 | 💾 |
| Test of Time | 2024 | 1.8K | Synthetic | - | AG | ✓ | ✗ | 📄 | 💾 |
| TIMEDIAL | 2021 | 1.1K | DailyDialog | - | CS | ✗ | ✗ | 📄 | 💾 |
Legend:
- #Q: Number of questions
- Creation: AG = Auto-Generated, CS = Crowdsourced, SC = Synthetic
- Multi-Hop: ✓ if questions require reasoning across multiple temporal hops, ✗ otherwise
- Metadata: ✓ if explicit temporal metadata is available, ✗ otherwise
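
As a rough illustration of how the legend columns can be used to shortlist datasets, the sketch below encodes a handful of rows from the table above and filters them. The `DatasetRow` class and `select` helper are hypothetical names introduced here; they are not part of any released tooling, and only a subset of rows is included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRow:
    """One row of the overview table (hypothetical helper type)."""
    name: str
    year: int
    creation: str    # "AG", "CS", or "SC", as defined in the legend
    multi_hop: bool  # True for a check mark in the Multi-Hop column
    metadata: bool   # True for a check mark in the Metadata column


# A few rows copied from the table above, not the full catalog.
ROWS = [
    DatasetRow("ArchivalQA", 2022, "AG", False, True),
    DatasetRow("StreamingQA", 2022, "CS", True, True),
    DatasetRow("ForecastQA", 2021, "CS", True, True),
    DatasetRow("ComplexTempQA", 2024, "AG", True, True),
    DatasetRow("TimeQA", 2021, "AG", False, False),
    DatasetRow("FreshQA", 2024, "CS", True, False),
]


def select(rows, *, multi_hop=None, metadata=None, creation=None):
    """Return the names of rows matching every filter that is not None."""
    names = []
    for row in rows:
        if multi_hop is not None and row.multi_hop != multi_hop:
            continue
        if metadata is not None and row.metadata != metadata:
            continue
        if creation is not None and row.creation != creation:
            continue
        names.append(row.name)
    return names


# Datasets that are multi-hop and ship explicit temporal metadata.
print(select(ROWS, multi_hop=True, metadata=True))
# ['StreamingQA', 'ForecastQA', 'ComplexTempQA']
```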
| Complexity | Datasets | Key Features |
|---|---|---|
| Simple | NewsQA, TimeQA, TempLAMA, ArchivalQA | Direct temporal lookups, explicit dates |
| Complex | ComplexTempQA, TEMPREASON, MenatQA, StreamingQA | Multi-hop reasoning, temporal filtering |
| Reasoning-Focused | Test of Time, UnSeenTimeQA, COTEMPQA | Synthetic temporal logic, beyond memorization |
| Orientation | Datasets | Description |
|---|---|---|
| Historical | ChroniclingAmericaQA (1800-1920), TimeQA (1367-2018) | Past events |
| Recent Past | NewsQA, ArchivalQA, StreamingQA | Modern history (1987-2020) |
| Present/Future | FreshQA, ReaLTimeQA, ForecastQA | Current/predictive |
| Type | Examples |
|---|---|
| Extractive | ArchivalQA, TimeQA, NewsQA |
| Abstractive | TEMPREASON, TemporalAlignmentQA, TRACIE |
| Multiple Choice | ForecastQA, ReaLTimeQA, TIMEDIAL |
| Entities/Freeform | TiQ, NewsQA |
Problem: Questions like "Who is the president?" lack temporal context, so the correct answer depends on when the question is asked
Datasets addressing this: SituatedQA, PAT-Questions
Solution: Explicit temporal anchoring or temporal disambiguation
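
A minimal sketch of explicit temporal anchoring, assuming answers are stored with validity intervals. The function names and the small fact list are illustrative choices made here, not the method of SituatedQA or PAT-Questions, and the question is narrowed to the US president so the example has a single well-defined answer per date.

```python
from datetime import date

# Illustrative time-scoped facts: (answer, valid_from, valid_until).
# The end date is exclusive.
US_PRESIDENT = [
    ("Barack Obama", date(2009, 1, 20), date(2017, 1, 20)),
    ("Donald Trump", date(2017, 1, 20), date(2021, 1, 20)),
    ("Joe Biden", date(2021, 1, 20), date(2025, 1, 20)),
]


def anchor_question(question: str, as_of: date) -> str:
    """Make the temporal context explicit by appending a reference date."""
    return f"{question.rstrip('?')} as of {as_of.isoformat()}?"


def answer_as_of(facts, as_of: date):
    """Return the answer whose validity interval contains as_of, else None."""
    for answer, start, end in facts:
        if start <= as_of < end:
            return answer
    return None


when = date(2015, 6, 1)
print(anchor_question("Who is the US president?", when))
# Who is the US president as of 2015-06-01?
print(answer_as_of(US_PRESIDENT, when))
# Barack Obama
```

The same unanchored question resolves to a different answer for a different reference date, which is the ambiguity these datasets are designed to expose.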
Problem: Correct answers change over time
Datasets addressing this: FreshQA, ReaLTimeQA
Solution: Periodic dataset updates, versioning
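
One way to picture versioning is to compare gold answers between two dataset releases. The sketch below is an assumption-laden illustration: the release keys, question IDs, and answers are placeholders invented here, and neither FreshQA nor ReaLTimeQA is claimed to use this layout.

```python
# Hypothetical releases: release label -> {question_id: gold_answer}.
RELEASES = {
    "2024-01": {"q1": "A", "q2": "B", "q3": "C"},
    "2024-02": {"q1": "A", "q2": "D", "q3": "C", "q4": "E"},
}


def changed_answers(old: dict, new: dict) -> dict:
    """Map question_id -> (old_answer, new_answer) for shared questions
    whose gold answer differs between the two releases."""
    shared = old.keys() & new.keys()
    return {q: (old[q], new[q]) for q in shared if old[q] != new[q]}


def volatility(old: dict, new: dict) -> float:
    """Fraction of shared questions whose gold answer changed."""
    shared = old.keys() & new.keys()
    if not shared:
        return 0.0
    return len(changed_answers(old, new)) / len(shared)


old, new = RELEASES["2024-01"], RELEASES["2024-02"]
print(changed_answers(old, new))  # {'q2': ('B', 'D')}
print(round(volatility(old, new), 3))  # 0.333
```

Questions that appear only in the newer release (here `q4`) are ignored, since there is no earlier answer to compare against.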
Problem: Crowdsourced annotations may reflect inconsistent temporal understanding across annotators
Mitigation: Multiple annotators, expert validation (e.g., ArchivalQA with journalistic expertise)
Problem: Many datasets focus on simple lookup rather than multi-step temporal reasoning
Datasets addressing this: ComplexTempQA, TEMPREASON, Complex-TR
Solution: Synthetic generation with explicit reasoning templates
When comparing datasets, consider these dimensions:
- Temporal Scope: Historical range covered
- Temporal Granularity: Day, month, year, era
- Question Complexity: Simple lookup vs. multi-hop reasoning
- Temporal Explicitness: Explicit dates vs. implicit temporal references
- Answer Volatility: How quickly answers become outdated
- Evaluation Protocol: Static test set vs. periodic updates
- Annotation Quality: Crowdsourced vs. automatic vs. expert
- Metadata Richness: Availability of document timestamps, temporal expressions
| Dataset | Last Updated | Update Frequency | Notes |
|---|---|---|---|
| ReaLTimeQA | 2024-06 | Weekly | Continuous updates |
| FreshQA | 2024-03 | Periodic | Manual updates |
| PAT-Questions | 2024 | Self-updating mechanism | Automated |
| Others | - | None (one-time release) | Static |
We welcome contributions to keep this survey comprehensive and up-to-date!
If we've missed your work or you know of a relevant paper/dataset that should be included, please send us an email at:
Please include:
- Paper title and authors
- Link to paper and code/data (if available)
- Brief description of the contribution
You can also open an issue on GitHub.