[Summary] GraphGen Roadmap

# Background
To establish GraphGen as an essential tool for training and evaluation data synthesis, its development roadmap focuses on two core pillars: implementing a robust, multi-dimensional **data quality assessment** and filtering system to ensure the reliability of generated knowledge graphs, and expanding its architecture to support **multi-modal** and **multi-omics** data inputs.

> If you'd like to work on one of these tasks, please comment below to claim it and create an issue for the feature you'll be implementing.

# Features
## 1 GraphGen Framework
- [x] ♻️ Refactor the pipeline around base classes: `Reader` → `KG_Builder` → `Partitioner` → `Generator`. Data flow: Raw corpus → Reader → Splitter → KG_Builder → Partitioner → Generator → training / evaluation data: https://github.com/open-sciencelab/GraphGen/pull/59, https://github.com/open-sciencelab/GraphGen/pull/58, https://github.com/open-sciencelab/GraphGen/pull/52
- [ ] 🔍 Data provenance: ensure every record in the final training/evaluation set can be traced back to its original raw corpus through the full pipeline.
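
The stage order and the provenance goal above could be sketched roughly as follows. This is an illustration only, not GraphGen's actual API: `Record`, `Stage`, `SentenceSplitter`, and `run_pipeline` are hypothetical names, and the real base classes may use different signatures.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Record:
    """A unit of data flowing through the pipeline (hypothetical type)."""
    content: str
    source_id: str  # ID of the raw document this record came from
    lineage: list[str] = field(default_factory=list)  # stages applied so far


class Stage(ABC):
    """Hypothetical common base for Reader/Splitter/KG_Builder/... stages."""

    @abstractmethod
    def process(self, records: list[Record]) -> list[Record]:
        ...

    def __call__(self, records: list[Record]) -> list[Record]:
        out = self.process(records)
        for rec in out:
            rec.lineage.append(type(self).__name__)  # log this provenance step
        return out


class SentenceSplitter(Stage):
    """Toy splitter: one output record per sentence, keeping the source ID."""

    def process(self, records: list[Record]) -> list[Record]:
        return [
            Record(part.strip(), rec.source_id, list(rec.lineage))
            for rec in records
            for part in rec.content.split(".")
            if part.strip()
        ]


def run_pipeline(stages: list[Stage], records: list[Record]) -> list[Record]:
    """Apply the stages in order; each stage appends itself to the lineage."""
    for stage in stages:
        records = stage(records)
    return records


docs = [Record("Graphs have nodes. Nodes have edges.", source_id="doc-1")]
chunks = run_pipeline([SentenceSplitter()], docs)
```

Because every output record carries `source_id` and `lineage`, a final training sample can be traced back to its raw document and the stages it passed through.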
## 2 Multi-Modal & Multi-Omics
- [x] 🧬 Define ImageNode, AudioNode, ProteinNode, etc.
- [x] 👁️‍🗨️ Vision–language fusion extraction: use open VLMs to generate "image–caption–entity" triples and write them into the graph: https://github.com/open-sciencelab/GraphGen/pull/69
- [ ] 🧪 Multi-omics extraction: process genomics/transcriptomics/proteomics with automatic node-property alignment
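
As a rough illustration of typed nodes and of writing an image-derived triple into a graph, consider the sketch below. The field names, the `"depicts"` relation, and the `add_image_triple` helper are assumptions made for this example. The node classes already merged into GraphGen may look different.

```python
from dataclasses import dataclass, field


@dataclass
class BaseNode:
    """Hypothetical common base for all node types."""
    node_id: str
    properties: dict = field(default_factory=dict)
    modality: str = "text"


@dataclass
class ImageNode(BaseNode):
    image_path: str = ""
    caption: str = ""
    modality: str = "image"


@dataclass
class ProteinNode(BaseNode):
    sequence: str = ""
    modality: str = "protein"


def add_image_triple(graph: dict, image: ImageNode, entity: str) -> None:
    """Store the image node and link it to an entity named in its caption."""
    graph["nodes"][image.node_id] = image
    graph["nodes"].setdefault(entity, BaseNode(entity))
    graph["edges"].append((image.node_id, "depicts", entity))


graph = {"nodes": {}, "edges": []}
img = ImageNode("img-1", image_path="cell.png", caption="A mitochondrion in a cell")
add_image_triple(graph, img, "mitochondrion")
```

Keeping `modality` on the base class lets downstream stages dispatch on node type without `isinstance` checks.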
## 3 Data Quality & Curation
- [ ] 📊 Multi-dimensional quality metrics with a unified scoring API
- [ ] 💓 Graph-quality assessment similar to KGHeartBeat
- [ ] 🎯 One-click export of high-quality sub-graphs and high-quality data
- [ ] ⚙️ Configurable pipeline: entity disambiguation, fact verification, redundancy removal, schema validation
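
One possible shape for a unified scoring API with threshold-based filtering is sketched below. The metric functions, the weights, and the `[0, 1]` score convention are all assumptions chosen for illustration, not a design the project has settled on.

```python
from typing import Callable

# A metric maps one QA record to a score in [0, 1] (assumed convention).
Metric = Callable[[dict], float]


def length_score(record: dict) -> float:
    """Toy metric: answers of 20 or more words get full marks."""
    return min(len(record["answer"].split()) / 20, 1.0)


def has_source(record: dict) -> float:
    """Toy metric: 1.0 if the record cites a source, else 0.0."""
    return 1.0 if record.get("source") else 0.0


def score(record: dict, metrics: dict[str, tuple[Metric, float]]) -> float:
    """Weighted mean over all metrics; weights need not sum to 1."""
    total_weight = sum(weight for _, weight in metrics.values())
    return sum(fn(record) * weight for fn, weight in metrics.values()) / total_weight


def filter_records(records: list[dict], metrics, threshold: float = 0.5) -> list[dict]:
    """Keep only the records whose combined score reaches the threshold."""
    return [r for r in records if score(r, metrics) >= threshold]


METRICS = {"length": (length_score, 1.0), "provenance": (has_source, 3.0)}
records = [
    {"answer": "A short answer.", "source": "doc-1"},
    {"answer": "Another short answer.", "source": None},
]
kept = filter_records(records, METRICS)
```

Registering metrics in a plain dictionary keeps the pipeline configurable: adding a new quality dimension means adding one entry rather than changing the scoring code.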
## 4 Graph Construction
- [ ] 🚀 Incremental & resumable construction
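
A simple way to make construction resumable is to checkpoint the IDs of chunks that have already been processed. The sketch below assumes a JSON checkpoint file and a placeholder extraction step. `build_incrementally` is a hypothetical helper, not an existing GraphGen function.

```python
import json
import tempfile
from pathlib import Path


def build_incrementally(chunks: dict[str, str], checkpoint: Path) -> list[str]:
    """Process only chunks not yet recorded in the checkpoint file.

    `chunks` maps chunk ID to text. Returns the IDs processed in this call.
    """
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    processed = []
    for chunk_id, text in chunks.items():
        if chunk_id in done:
            continue  # already handled by an earlier run
        _ = text.split()  # placeholder for real entity/relation extraction
        done.add(chunk_id)
        processed.append(chunk_id)
        # Persist after every chunk so an interrupted run can resume.
        checkpoint.write_text(json.dumps(sorted(done)))
    return processed


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "done.json"
    first = build_incrementally({"c1": "alpha beta", "c2": "gamma"}, ckpt)
    second = build_incrementally(
        {"c1": "alpha beta", "c2": "gamma", "c3": "delta"}, ckpt
    )
```

The second call only touches the new chunk `c3`, which is the behavior incremental construction needs when the corpus grows.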
## 5 Engineering
- [x] 📂 Support csv, json, jsonl, txt, pickle, parquet, pdf, and various triple input formats: https://github.com/open-sciencelab/GraphGen/pull/50, https://github.com/open-sciencelab/GraphGen/pull/65, https://github.com/open-sciencelab/GraphGen/pull/82, https://github.com/open-sciencelab/GraphGen/pull/81
- [ ] 🗄️ Unified hybrid storage layer: graph DB + object store + Redis cache, switchable with one click
- [ ] 💨 Optional RedisGraph or hash storage for real-time read/write on large graphs
- [ ] ✅ Data validation powered by Pydantic
- [ ] 👓 More inference servers and clients such as Azure and Ollama: https://github.com/open-sciencelab/GraphGen/pull/74
- [ ] 🔍 Test cases
## 6 Community Detection & Data Synthesis
- [ ] 🔎 Apply multiple community-detection algorithms; generate data from communities and provide typical samples plus visualizations
- [ ] 🧠 Community summary → CoT data: use community summaries as few-shot examples to synthesize high-quality chain-of-thought data
- [ ] 💬 Multi-turn dialogue synthesis: random-walk sampling → multi-turn Q&A while maintaining context consistency
- [ ] 📈 Complexity grading for curriculum learning
- [ ] 🕵️‍♂️ Support comparison with baselines
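
The random-walk idea for multi-turn dialogue can be sketched as follows. The toy graph, the question template, and the function names are assumptions for illustration. A real implementation would presumably hand the sampled path to an LLM rather than fill a fixed template.

```python
import random

# Toy knowledge graph as an adjacency list: node -> [(relation, neighbor)].
GRAPH = {
    "aspirin": [("inhibits", "COX-1")],
    "COX-1": [("produces", "prostaglandin")],
    "prostaglandin": [("mediates", "inflammation")],
    "inflammation": [],
}


def random_walk(graph, start, max_steps, rng):
    """Return the (head, relation, tail) edges along one random walk."""
    path, node, visited = [], start, {start}
    for _ in range(max_steps):
        options = [(r, t) for r, t in graph.get(node, []) if t not in visited]
        if not options:
            break  # dead end: stop early
        relation, nxt = rng.choice(options)
        path.append((node, relation, nxt))
        visited.add(nxt)
        node = nxt
    return path


def walk_to_turns(path):
    """One Q&A turn per edge; each question starts from the previous answer."""
    return [
        {
            "question": f"Which entity is linked to {head} via '{relation}'?",
            "answer": tail,
        }
        for head, relation, tail in path
    ]


rng = random.Random(0)  # seeded for reproducible sampling
turns = walk_to_turns(random_walk(GRAPH, "aspirin", max_steps=3, rng=rng))
```

Since each edge starts where the previous one ended, every question refers to the entity that answered the previous turn, which is one way to keep the dialogue context consistent.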
## 7 UX, Docs & Community
- [ ] 📦 Streamlined pip install and usage
- [ ] 📓 Jupyter tutorial suite
- [ ] 📚 Comprehensive documentation
- [ ] 🗃️ Data & use-case library
- [x] 🤝 Contributor guide & roadmap: clear labels, branching strategy, PR template, code of conduct
- [ ] 🌐 More user-friendly web interface
## 8 Others
- [ ] 📝 More standardized prompt & post-processing management; post-processing should be bound to prompts
- [ ] 🌐 Improve online connectivity
- [ ] 🔗 Enhanced coreference resolution during chunking

Further feature ideas are welcome. Feel free to suggest them and join the effort!
