Add Support for Ollama and OpenAI LLM Providers #69
base: main
Conversation
- Add table of contents for better navigation
- Fix typos and broken links
- Improve overall structure and organization
- Consolidate repeated information
- Add badges and improve formatting
- Enhance examples and troubleshooting sections
Hi @AhmadHakami! Thank you for your pull request and welcome to our community.

Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
- Add _ollama_batch_completion() method for proper Ollama API support (see the sketch below)
- Fix batch completion routing for LLM providers (OpenAI, Ollama, vLLM, api-endpoint)
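A minimal sketch of what such a batch helper for Ollama could look like, assuming the `ollama` Python client and a local Ollama server on the default port. The PR's method presumably lives on the kit's LLM client class, so the standalone function, parameter names, and defaults here are illustrative only.

```python
import ollama  # assumes the `ollama` package and a local Ollama server


def ollama_batch_completion(message_batches, model="llama3.2:3b",
                            temperature=0.7, max_tokens=1024):
    """Illustrative sketch: Ollama has no native batch endpoint, so the
    "batch" is a simple sequential loop over message lists."""
    responses = []
    for messages in message_batches:
        result = ollama.chat(
            model=model,
            messages=messages,  # [{"role": "user", "content": "..."}, ...]
            options={"temperature": temperature, "num_predict": max_tokens},
        )
        responses.append(result["message"]["content"])
    return responses
```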
New _openai_include_sampling_params(): skips temperature/top_p for gpt-5-*. Applied in both sync and async chat.completions code paths.
Use max_completion_tokens instead of max_tokens for o1/o2/o3/o4 models. Omit temperature/top_p for O-series models that only support defaults. Updated cli.py system-check to avoid sending temperature to restricted models.
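As a rough illustration of the two commits above, a parameter-building helper might look like the following; the model-prefix checks and function names are assumptions based on the commit text, not the repository's actual code.

```python
def openai_include_sampling_params(model: str) -> bool:
    # Assumption: gpt-5-* and O-series (o1/o2/o3/o4) models only accept the
    # default temperature/top_p, so custom values are omitted for them.
    return not (model.startswith("gpt-5-") or model[:2] in {"o1", "o2", "o3", "o4"})


def build_openai_kwargs(model: str, temperature: float, top_p: float, max_tokens: int) -> dict:
    kwargs = {"model": model}
    if model[:2] in {"o1", "o2", "o3", "o4"}:
        # O-series models expect max_completion_tokens rather than max_tokens.
        kwargs["max_completion_tokens"] = max_tokens
    else:
        kwargs["max_tokens"] = max_tokens
    if openai_include_sampling_params(model):
        kwargs["temperature"] = temperature
        kwargs["top_p"] = top_p
    return kwargs
```

Both the sync and async chat.completions calls would then unpack these keyword arguments.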
- Pass difficulty through the directory create flow and core.create.process_file
- Apply difficulty in the QA, CoT, and multimodal-qa generators’ prompts (see the sketch below)
- Preserve backward compatibility and keep tests green
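A hypothetical sketch of how a difficulty setting could be injected into a generator prompt; the function name, prompt wording, and parameters are illustrative, not the project's actual templates.

```python
def build_qa_prompt(chunk, num_pairs, difficulty=None):
    # Hypothetical prompt builder: the difficulty rule is only added when a
    # difficulty was passed down from the create flow.
    difficulty_rule = (
        f"Target difficulty: {difficulty}. Match the reasoning depth to this level.\n"
        if difficulty else ""
    )
    return (
        f"Generate {num_pairs} question-answer pairs from the SOURCE_TEXT below.\n"
        f"{difficulty_rule}"
        "Return a JSON array of objects with 'question' and 'answer' keys.\n\n"
        f"SOURCE_TEXT:\n{chunk}"
    )
```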
…ogs with --verbose
- CoT: print chunk/batch progress and totals; stop requesting once the target is reached (see the sketch below)
- Perf: cap CoT tokens (cot_max_tokens default 768) and summary to 256 for faster replies
- Ollama: use batch_size=1 and pass max_tokens to reduce latency/stalls
- CLI: with --verbose, print a plain “Generating …” line instead of a spinner for live logs
- Honors --chunk-size/--chunk-overlap overrides; non-verbose still shows batch progress
- Smoke-tested: ollama qwen3:4b on sample text, output saved and progress visible
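A sketch of the progress/early-stop loop described above, assuming a generate_batch callable that turns a list of chunks into generated examples; names and the exact logging format are illustrative.

```python
def generate_with_progress(chunks, generate_batch, target, batch_size=1, verbose=False):
    """Illustrative loop: process chunks in batches, report progress, and stop
    issuing requests once the target number of examples has been reached."""
    results = []
    total_batches = (len(chunks) + batch_size - 1) // batch_size
    for i in range(0, len(chunks), batch_size):
        if len(results) >= target:
            break  # target reached; skip the remaining LLM calls
        results.extend(generate_batch(chunks[i:i + batch_size]))
        if verbose:
            print(f"Batch {i // batch_size + 1}/{total_batches}: "
                  f"{len(results)}/{target} examples so far")
    return results[:target]
```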
… create; forwarded to processing. Implemented page_range support in the PDF parser (sketched below) and threaded it through the ingest/create and directory flows previously.
Added tests:
- tests/unit/test_pdf_page_range.py: checks pdfminer page_numbers mapping and invalid ranges.
- tests/functional/test_page_range_cli.py: validates CLI forwarding for ingest and create (QA and CoT).
- tests/functional/test_real_pdf_path.py: uses
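For reference, pdfminer.six exposes a page_numbers argument on extract_text, which is the mechanism the test names above point to; the range-parsing helper below is an illustrative sketch, not the parser's actual code.

```python
from pdfminer.high_level import extract_text


def parse_page_range(page_range, page_count):
    """Turn a '3-7' style string (1-based, inclusive) into 0-based page indices."""
    start_str, _, end_str = page_range.partition("-")
    start, end = int(start_str), int(end_str or start_str)
    if start < 1 or end > page_count or start > end:
        raise ValueError(f"Invalid page range: {page_range}")
    return list(range(start - 1, end))


def extract_pdf_pages(path, page_range, page_count):
    # pdfminer.six accepts an iterable of zero-indexed page numbers; None means all pages.
    page_numbers = parse_page_range(page_range, page_count) if page_range else None
    return extract_text(path, page_numbers=page_numbers)
```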
…ified pages, with randomized coverage to avoid summary-only questions (addressing the issue you observed in the attached CoT file)
Restored correct indentation and structure to remove syntax errors.
Added a structured prompt with explicit fences and rules:
- Use only SOURCE_TEXT.
- Forbid page numbers, headers/footers, formatting, or difficulty echo.
- Prefer content-focused questions; advanced requires multi-step reasoning.
Slightly over-generates per chunk and applies a post-filter (sketched below) to drop trivial/meta items: filters references to page/صفحة/heading/difficulty, numeric-only answers, “years/page numbers/section title” questions, and difficulty word echoes.
Randomizes chunk order for coverage within the selected range.
Injects the language instruction into the system message so functional language tests can detect it.
CLI ingest fix: guarded --page-range parsing to avoid attempting .strip() on Typer’s OptionInfo when called directly. Now safely casts and strips: rng = str(page_range).strip().
CoT generator reliability: kept single-call generation simple and removed the extra pre-call to match tests that assert one chat_completion. Added deterministic fallback examples when parsing fails, matching unit test expectations and request count.
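An illustrative version of the post-filter described above; the pattern list and heuristics are assumptions based on the commit text rather than the repository's exact rules.

```python
import re

# Matches meta references the filter is meant to drop (page/صفحة, headings,
# footers, difficulty echoes); extend as needed.
META_PATTERN = re.compile(
    r"\b(page|صفحة|heading|header|footer|section title|difficulty|easy|medium|advanced)\b",
    re.IGNORECASE,
)


def filter_trivial_pairs(pairs):
    kept = []
    for pair in pairs:
        question = pair.get("question", "")
        answer = pair.get("answer", "")
        if META_PATTERN.search(question) or META_PATTERN.search(answer):
            continue  # drop page-number / heading / difficulty-echo items
        if answer.strip().isdigit():
            continue  # drop numeric-only answers (bare years or page numbers)
        kept.append(pair)
    return kept
```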
- Extract arrays from code fences or free text safely (see the sketch below).
- Normalize keys: support question/answer in Arabic and aliases (q/a, response/reply, answers[]).
- Clean trailing commas and whitespace reliably.
Adjusted QA filters in qa_generator.py:
- Numeric-heavy heuristic relaxed for Arabic (allow brief date-like answers if some letters appear).
- Still blocks page numbers, headers/footers, difficulty echoes, and trivial patterns.
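The parsing hardening above might look roughly like the sketch below, which pulls a JSON array out of a fenced block or free text, strips trailing commas, and maps key aliases onto question/answer. The alias table is illustrative and only covers scalar keys; the answers[] form mentioned above would need extra handling.

```python
import json
import re

# Illustrative alias table; the real generator may accept more variants.
KEY_ALIASES = {
    "q": "question", "question": "question", "سؤال": "question",
    "a": "answer", "answer": "answer", "response": "answer",
    "reply": "answer", "جواب": "answer",
}


def extract_qa_array(text):
    fenced = re.search(r"```(?:json)?\s*(\[.*?\])\s*```", text, re.DOTALL)
    if fenced:
        raw = fenced.group(1)
    else:
        start, end = text.find("["), text.rfind("]")
        if start == -1 or end == -1:
            raise ValueError("No JSON array found in model output")
        raw = text[start:end + 1]
    raw = re.sub(r",\s*([\]}])", r"\1", raw)  # drop trailing commas
    normalized = []
    for item in json.loads(raw):
        pair = {}
        for key, value in item.items():
            canonical = KEY_ALIASES.get(str(key).strip().lower())
            if canonical:
                pair[canonical] = str(value).strip()
        if "question" in pair and "answer" in pair:
            normalized.append(pair)
    return normalized
```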
Apply similar enforcement to related generators (multimodal).
…nerator returns exactly the requested number of pairs. Extended test_qa_generator.py with test_generate_qa_pairs_exact_count_and_dedup to verify deduplication and exact count
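The exact-count behavior could be enforced with something like the sketch below (dedupe by normalized question text, then trim); the helper name is illustrative, and the real generator may over-generate first to compensate for dropped duplicates.

```python
def enforce_exact_count(pairs, num_pairs):
    """Deduplicate by normalized question text, then trim to the requested count."""
    seen, unique = set(), []
    for pair in pairs:
        key = pair["question"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique[:num_pairs]
```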
This pull request introduces comprehensive support for Ollama and OpenAI LLM providers, significantly enhancing the flexibility of the synthetic-data-kit for both local and cloud-based inference workflows.

Key Changes

Provider Integration
- Ollama for local inference (e.g., llama3.2:3b)
- OpenAI for cloud inference (e.g., gpt-4o)
- --provider flag for seamless model selection (see the sketch below)
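As a rough illustration of what provider selection could look like behind that flag, assuming the openai (>=1.0) and ollama Python clients and an OPENAI_API_KEY in the environment; the dispatch function here is a sketch, not the kit's actual LLM client.

```python
import ollama
from openai import OpenAI


def chat_completion(provider, model, messages):
    # Illustrative dispatch only; the project's real routing lives in its LLM client.
    if provider == "openai":
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(model=model, messages=messages)
        return response.choices[0].message.content
    if provider == "ollama":
        response = ollama.chat(model=model, messages=messages)
        return response["message"]["content"]
    raise ValueError(f"Unknown provider: {provider}")
```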
Configuration Updates

Testing Enhancements
- Resolved PYTHONPATH issues for reliable module imports

Dependencies
- Updated pyproject.toml with: ollama, pytest

Documentation
- Updated README.md

Benefits
These changes allow users to leverage both local and cloud-based LLMs throughout the synthetic data pipeline (ingest → create → curate → save-as). This improves accessibility, performance, and deployment flexibility without disrupting existing workflows.

Related Issue
Fixes: N/A (New feature implementation)
Type of Change