Background
To establish GraphGen as an essential tool for synthesizing training and evaluation data, the development roadmap focuses on two core pillars: (1) a robust, multi-dimensional data quality assessment and filtering system that ensures the reliability of generated knowledge graphs, and (2) an extended architecture that supports multi-modal and multi-omics data inputs.
If you'd like to work on one of these tasks, please comment below to claim it and create an issue for the feature you'll be implementing.
Features
1 GraphGen Framework
- ♻️ Refactor the pipeline around base classes: Reader → KG_Builder → Partitioner → Generator. Data flow: Raw corpus → Reader → Splitter → KG_Builder → Partitioner → Generator → training / evaluation data: refactor: Partitioner & Generator #59, Kg builder #58, Refactor KG builder #52
- 🔍 Data provenance: ensure every record in the final training/evaluation set can be traced back to its original raw corpus through the full pipeline.
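To make the base-class and provenance ideas concrete, here is a minimal sketch. It is not GraphGen's actual API: `Record`, `Stage`, `SentenceSplitter`, `QuestionGenerator`, and `run_pipeline` are hypothetical names, and the two toy stages only stand in for the real Reader/Splitter/KG_Builder/Partitioner/Generator classes. The point is that every record carries a `source_id` and a `lineage`, so a final sample can be traced back to its raw document.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical unit of data flowing through the pipeline."""
    content: str
    source_id: str  # ID of the raw document this record derives from
    lineage: list[str] = field(default_factory=list)  # stages applied so far


class Stage(ABC):
    """Hypothetical common base class for all pipeline stages."""

    @abstractmethod
    def process(self, records: list[Record]) -> list[Record]:
        ...

    def __call__(self, records: list[Record]) -> list[Record]:
        out = self.process(records)
        for record in out:
            # Stamp each output record with the stage that produced it.
            record.lineage.append(type(self).__name__)
        return out


class SentenceSplitter(Stage):
    """Toy stand-in for a Splitter: one record per sentence."""

    def process(self, records):
        return [
            Record(sentence.strip(), r.source_id, list(r.lineage))
            for r in records
            for sentence in r.content.split(".")
            if sentence.strip()
        ]


class QuestionGenerator(Stage):
    """Toy stand-in for a Generator: wraps each record in a question."""

    def process(self, records):
        return [
            Record(f"Is it true that {r.content}?", r.source_id, list(r.lineage))
            for r in records
        ]


def run_pipeline(stages: list[Stage], records: list[Record]) -> list[Record]:
    for stage in stages:
        records = stage(records)
    return records


raw = [Record("Insulin lowers blood glucose. Glucagon raises it.", source_id="doc-001")]
samples = run_pipeline([SentenceSplitter(), QuestionGenerator()], raw)
for sample in samples:
    print(sample.source_id, sample.lineage, sample.content)
```

Because each stage copies `source_id` forward and appends its own name, provenance comes for free from the base class rather than being re-implemented per stage.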
2 Multi-Modal & Multi-Omics
- 🧬 Define ImageNode, AudioNode, ProteinNode, etc.
- 👁️‍🗨️ Vision–language fusion extraction: use open VLMs to generate "image–caption–entity" triples and write them into the graph: feat: add vqa pipeline #69
- 🧪 Multi-omics extraction: process genomics/transcriptomics/proteomics with automatic node-property alignment
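One possible shape for the typed nodes, as a sketch only. The field names (`uri`, `caption`, `sequence`), the `modality` tag, and the `to_attrs` helper are assumptions for illustration; the roadmap does not specify a schema.

```python
from dataclasses import dataclass, field, asdict
from typing import ClassVar


@dataclass
class BaseNode:
    """Hypothetical base node; subclasses add modality-specific fields."""
    node_id: str
    properties: dict = field(default_factory=dict)
    modality: ClassVar[str] = "text"  # class-level tag, not an instance field

    def to_attrs(self) -> dict:
        """Flatten the node into a plain attribute dict for graph storage."""
        attrs = asdict(self)
        attrs["modality"] = self.modality
        return attrs


@dataclass
class ImageNode(BaseNode):
    modality: ClassVar[str] = "image"
    uri: str = ""      # assumed field: where the image lives
    caption: str = ""  # assumed field: caption, e.g. produced by a VLM


@dataclass
class ProteinNode(BaseNode):
    modality: ClassVar[str] = "protein"
    sequence: str = ""  # assumed field: amino-acid sequence


img = ImageNode("img-1", uri="figures/cell.png", caption="A stained cell nucleus")
# An "image-caption-entity" triple could then be a plain edge between node IDs.
triple = (img.node_id, "depicts", "cell nucleus")
print(img.to_attrs())
print(triple)
```

Keeping `modality` as a class-level tag means downstream code can dispatch on it without inspecting concrete types.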
3 Data Quality & Curation
- 📊 Multi-dimensional quality metrics with a unified scoring API
- 💓 Graph-quality assessment similar to KGHeartBeat
- 🎯 One-click export of high-quality sub-graphs and high-quality data
- ⚙️ Configurable pipeline: entity disambiguation, fact verification, redundancy removal, schema validation
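A unified scoring API could look roughly like the sketch below. Everything here is hypothetical: `QualityScorer`, the two toy metrics, and the sample layout (`answer` and `source` keys) are placeholders, and real metrics would be far more involved. The idea shown is simply that each metric maps a sample to a score in [0, 1], and one object combines them into a weighted overall score that a filter can threshold.

```python
from typing import Callable

Metric = Callable[[dict], float]  # each metric returns a score in [0, 1]


def length_score(sample: dict) -> float:
    """Toy metric: answers of 20+ characters score 1.0, shorter ones less."""
    return min(len(sample["answer"]) / 20, 1.0)


def grounding_score(sample: dict) -> float:
    """Toy metric: fraction of answer words that also occur in the source."""
    answer = sample["answer"].lower().split()
    source = set(sample["source"].lower().split())
    return sum(word in source for word in answer) / len(answer) if answer else 0.0


class QualityScorer:
    """Hypothetical unified entry point for multi-dimensional scoring."""

    def __init__(self) -> None:
        self._metrics: dict[str, tuple[Metric, float]] = {}

    def register(self, name: str, metric: Metric, weight: float = 1.0) -> None:
        if weight <= 0:
            raise ValueError("weight must be positive")
        self._metrics[name] = (metric, weight)

    def score(self, sample: dict) -> dict[str, float]:
        if not self._metrics:
            raise ValueError("no metrics registered")
        scores = {name: metric(sample) for name, (metric, _) in self._metrics.items()}
        total_weight = sum(weight for _, weight in self._metrics.values())
        # Weighted mean of the per-metric scores.
        scores["overall"] = (
            sum(scores[name] * weight for name, (_, weight) in self._metrics.items())
            / total_weight
        )
        return scores

    def filter(self, samples: list[dict], threshold: float) -> list[dict]:
        return [s for s in samples if self.score(s)["overall"] >= threshold]


scorer = QualityScorer()
scorer.register("length", length_score)
scorer.register("grounding", grounding_score)

source = "insulin lowers blood glucose in humans"
good = {"answer": "insulin lowers blood glucose", "source": source}
bad = {"answer": "yes", "source": source}
kept = scorer.filter([good, bad], threshold=0.5)
print(scorer.score(good), scorer.score(bad), len(kept))
```

Returning the per-metric scores alongside `overall` keeps the filter explainable: a rejected sample shows which dimension failed.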
4 Graph Construction
- 🚀 Incremental & resumable construction
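One way to get incremental and resumable behaviour is to checkpoint which chunks have already been processed and skip them on the next run. The sketch below assumes a JSON checkpoint file and content-hash chunk IDs; `build_incrementally`, `chunk_id`, and `toy_extract` are hypothetical helpers, not part of GraphGen.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def chunk_id(text: str) -> str:
    """Stable ID derived from the chunk content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def build_incrementally(chunks, checkpoint: Path, extract):
    """Process only chunks not yet recorded in the checkpoint file.

    Returns (all edges so far, number of chunks processed in this run).
    """
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text(encoding="utf-8"))
    else:
        state = {"done": [], "edges": []}
    done = set(state["done"])
    processed = 0
    for text in chunks:
        cid = chunk_id(text)
        if cid in done:
            continue  # already in the graph: skip on resume or re-run
        state["edges"].extend(extract(text))
        state["done"].append(cid)
        done.add(cid)
        # Persist after every chunk so an interrupted run loses at most one.
        checkpoint.write_text(json.dumps(state), encoding="utf-8")
        processed += 1
    return state["edges"], processed


def toy_extract(text: str):
    """Toy extractor: treats 'head relation tail' as a single triple."""
    return [text.split(" ", 2)]


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "kg_state.json"
    corpus = ["insulin lowers glucose", "glucagon raises glucose"]
    _, first_run = build_incrementally(corpus, ckpt, toy_extract)
    # Second run with one new chunk: only the new chunk is processed.
    edges, second_run = build_incrementally(
        corpus + ["metformin lowers glucose"], ckpt, toy_extract
    )

print(first_run, second_run, len(edges))
```

A production version would likely write the checkpoint atomically (write to a temporary file, then rename) and store the graph in a real backend rather than in the checkpoint itself.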
5 Engineering
- 📂 Support csv, json, jsonl, txt, pickle, parquet, pdf, and various triple input formats: feat: support Reader classes #50, feat: add pdf_reader & tests for MinerUParser #65, feat: add ParquetReader #82, feat: add PickleReader #81
- 🗄️ Unified hybrid storage layer: graph DB + object store + Redis cache, switchable with one click
- 💨 Optional RedisGraph or hash storage for real-time read/write on large graphs
- ✅ Data validation powered by Pydantic
- 👓 More inference servers and clients such as Azure and Ollama: feat: add inference backends #74
- 🔍 Test cases
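For the multi-format input item, a small extension-based registry is one common pattern. The sketch below uses only the standard library and covers txt, json, jsonl, and csv; `READERS`, `register`, and `read_any` are hypothetical names and do not reflect the Reader classes added in the PRs above.

```python
import csv
import json
import tempfile
from pathlib import Path

READERS = {}  # maps a file extension to a reader function


def register(*extensions):
    """Decorator that registers a reader for one or more extensions."""
    def decorator(fn):
        for ext in extensions:
            READERS[ext] = fn
        return fn
    return decorator


@register(".txt")
def read_txt(path: Path) -> list[dict]:
    return [{"content": path.read_text(encoding="utf-8")}]


@register(".jsonl")
def read_jsonl(path: Path) -> list[dict]:
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


@register(".json")
def read_json(path: Path) -> list[dict]:
    data = json.loads(path.read_text(encoding="utf-8"))
    return data if isinstance(data, list) else [data]


@register(".csv")
def read_csv(path: Path) -> list[dict]:
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def read_any(path) -> list[dict]:
    """Dispatch to the reader registered for the file's extension."""
    path = Path(path)
    try:
        reader = READERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported input format: {path.suffix!r}") from None
    return reader(path)


with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "docs.jsonl"
    jsonl_path.write_text('{"content": "a"}\n{"content": "b"}\n', encoding="utf-8")
    csv_path = Path(tmp) / "docs.csv"
    csv_path.write_text("content\nhello\n", encoding="utf-8")
    jsonl_rows = read_any(jsonl_path)
    csv_rows = read_any(csv_path)

print(jsonl_rows, csv_rows)
```

Formats that need third-party parsers (parquet, pdf) would plug into the same registry; they are left out here to keep the sketch dependency-free.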
6 Community Detection & Data Synthesis
- 🔎 Apply multiple community-detection algorithms; generate data from communities and provide typical samples plus visualizations
- 🧠 Community summary → CoT data: use community summaries as few-shot examples to synthesize high-quality chain-of-thought data
- 💬 Multi-turn dialogue synthesis: random-walk sampling → multi-turn Q&A while maintaining context consistency
- 📈 Complexity grading for curriculum learning
- 🕵️‍♂️ Support comparison with baselines
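The random-walk idea for multi-turn dialogue can be illustrated on a toy graph. This is a sketch under simplifying assumptions: the graph is a plain dict mapping each head entity to `{tail: relation}`, the question text is a fixed template rather than LLM output, and `random_walk` and `walk_to_dialogue` are hypothetical helpers. Each turn carries the entities visited so far, which is one simple way to keep context consistent across turns.

```python
import random


def random_walk(graph: dict, start: str, steps: int, rng: random.Random) -> list[str]:
    """Walk up to `steps` edges from `start` without revisiting a node."""
    path = [start]
    for _ in range(steps):
        candidates = [n for n in graph.get(path[-1], {}) if n not in path]
        if not candidates:
            break  # dead end: stop early
        path.append(rng.choice(candidates))
    return path


def walk_to_dialogue(graph: dict, path: list[str]) -> list[dict]:
    """Turn each edge on the path into one Q&A turn."""
    turns = []
    for i, (head, tail) in enumerate(zip(path, path[1:])):
        relation = graph[head][tail]
        turns.append({
            "question": f"Which entity is linked to '{head}' by the relation '{relation}'?",
            "answer": tail,
            "context": path[: i + 1],  # entities already mentioned in the dialogue
        })
    return turns


kg = {
    "insulin": {"glucose uptake": "promotes"},
    "glucose uptake": {"blood glucose": "lowers"},
    "blood glucose": {},
}
path = random_walk(kg, "insulin", steps=2, rng=random.Random(0))
dialogue = walk_to_dialogue(kg, path)
for turn in dialogue:
    print(turn)
```

In a real pipeline the template question would be replaced by an LLM call that receives the `context` list, so later turns can refer back to earlier entities.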
7 UX, Docs & Community
- 📦 Streamlined pip install and usage
- 📓 Jupyter tutorial suite
- 📚 Comprehensive documentation
- 🗃️ Data & use case library
- 🤝 Contributor guide & roadmap: clear labels, branching strategy, PR template, code of conduct
- 🌐 More user-friendly web interface
8 Others
- 📝 More standardized prompt & post-processing management; post-processing should be bound to prompts
- 🌐 Improve online connectivity
- 🔗 Enhanced coreference resolution during chunking
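Binding post-processing to prompts could be as simple as storing the parser next to the template. The sketch below is one possible design, not the project's implementation: `PromptSpec`, `parse_triples`, and the pipe-separated output format are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptSpec:
    """Hypothetical pairing of a prompt template with its output parser."""
    template: str
    postprocess: Callable[[str], object]

    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)

    def parse(self, raw_output: str):
        return self.postprocess(raw_output)


def parse_triples(raw: str) -> list[tuple[str, ...]]:
    """Keep only lines of the form 'head | relation | tail'."""
    triples = []
    for line in raw.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples


EXTRACT_TRIPLES = PromptSpec(
    template="Extract (head | relation | tail) triples, one per line, from:\n{text}",
    postprocess=parse_triples,
)

prompt = EXTRACT_TRIPLES.render(text="Insulin lowers blood glucose.")
# Stand-in for a model response; malformed lines are dropped by the parser.
triples = EXTRACT_TRIPLES.parse("insulin | lowers | blood glucose\nnot a triple")
print(prompt)
print(triples)
```

With this pairing, changing the output format of a prompt forces the matching parser to change in the same place, which is the main benefit of binding the two.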
Further feature ideas are welcome—feel free to suggest and join the plan!