Add Codegen embedding provider with OpenAI and DeepSeek support #120
base: develop
# Motivation

The **Codegen on OSS** package provides a pipeline that:

- **Collects repository URLs** from different sources (e.g., CSV files or GitHub searches).
- **Parses repositories** using the codegen tool.
- **Profiles performance** and logs metrics for each parsing run.
- **Logs errors** to help pinpoint parsing failures or performance bottlenecks.

# Content

see [codegen-on-oss/README.md](https://github.com/codegen-sh/codegen-sdk/blob/acfe3dc07b65670af33b977fa1e7bc8627fd714e/codegen-on-oss/README.md)

# Testing

`uv run modal run modal_run.py`

No unit tests yet 😿

# Please check the following before marking your PR as ready for review

- [ ] I have added tests for my changes
- [x] I have updated the documentation or added new documentation as needed
Original commit by Tawsif Kamal: Revert "Revert "Adding Schema for Tool Outputs"" (codegen-sh#894) Reverts codegen-sh#892 --------- Co-authored-by: Rushil Patel <rpatel@codegen.com> Co-authored-by: rushilpatel0 <171610820+rushilpatel0@users.noreply.github.com>
Original commit by Ellen Agarwal: fix: Workaround for relace not adding newlines (codegen-sh#907)
Reviewer's Guide

This PR adds a pluggable embedding provider system to Codegen’s FileIndex by defining a provider abstraction (OpenAI and DeepSeek), a manager to select providers, and a patching mechanism to override the built-in embedding method, along with documentation and an example script.

Class Diagram: New Embedding Provider System

classDiagram
class EmbeddingProvider {
+api_key: str
+__init__(api_key: str)
+get_embeddings(texts: List[str], model: str) List~List~float~~*
}
note for EmbeddingProvider "Base class for embedding providers"
class OpenAIEmbeddingProvider {
+base_url: str
+__init__(api_key: str, base_url: str)
+get_embeddings(texts: List[str], model: str) List~List~float~~
}
class DeepSeekEmbeddingProvider {
+base_url: str
-openai_fallback: OpenAIEmbeddingProvider
+__init__(api_key: str, base_url: str)
+get_embeddings(texts: List[str], model: str) List~List~float~~
}
note for DeepSeekEmbeddingProvider "Uses OpenAIEmbeddingProvider as fallback"
class CodegenEmbeddingManager {
-provider: EmbeddingProvider
+PROVIDERS: dict$
+__init__(provider: str, api_key: str, base_url: Optional[str])
+get_embeddings(texts: List[str], model: Optional[str]) List~List~float~~
}
note for CodegenEmbeddingManager "Manages and uses an EmbeddingProvider instance"
EmbeddingProvider <|-- OpenAIEmbeddingProvider : Inheritance
EmbeddingProvider <|-- DeepSeekEmbeddingProvider : Inheritance
DeepSeekEmbeddingProvider *-- "1" OpenAIEmbeddingProvider : openai_fallback
CodegenEmbeddingManager *-- "1" EmbeddingProvider : provider
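
The class diagram above could translate into Python roughly as follows. This is a minimal sketch, not the PR's actual code: the class, attribute, and method names come from the diagram, while the default base URLs, the default model name, the `_call_deepseek` hook, the rule for when the fallback triggers, and the placeholder vectors are all assumptions made so the example runs offline.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Type


class EmbeddingProvider(ABC):
    """Base class for embedding providers (abstract, as in the diagram)."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    @abstractmethod
    def get_embeddings(self, texts: List[str], model: str) -> List[List[float]]:
        ...


class OpenAIEmbeddingProvider(EmbeddingProvider):
    # Assumed default base URL; the diagram only shows that a base_url exists.
    def __init__(self, api_key: str, base_url: str = "https://api.openai.com/v1") -> None:
        super().__init__(api_key)
        self.base_url = base_url

    def get_embeddings(self, texts: List[str], model: str) -> List[List[float]]:
        # Placeholder: a real provider would call an embeddings endpoint here.
        # One-dimensional fake vectors keep the sketch runnable without a network.
        return [[float(len(text))] for text in texts]


class DeepSeekEmbeddingProvider(EmbeddingProvider):
    # Assumed default base URL, for illustration only.
    def __init__(self, api_key: str, base_url: str = "https://api.deepseek.com/v1") -> None:
        super().__init__(api_key)
        self.base_url = base_url
        self.openai_fallback = OpenAIEmbeddingProvider(api_key)

    def _call_deepseek(self, texts: List[str], model: str) -> List[List[float]]:
        # Hypothetical hook: this sketch has no DeepSeek transport, so it always fails.
        raise NotImplementedError("no DeepSeek transport in this sketch")

    def get_embeddings(self, texts: List[str], model: str) -> List[List[float]]:
        # Assumption: any failure of the primary call triggers the OpenAI fallback.
        try:
            return self._call_deepseek(texts, model)
        except Exception:
            return self.openai_fallback.get_embeddings(texts, model)


class CodegenEmbeddingManager:
    """Selects a provider by name and delegates embedding requests to it."""

    PROVIDERS: Dict[str, Type[EmbeddingProvider]] = {
        "openai": OpenAIEmbeddingProvider,
        "deepseek": DeepSeekEmbeddingProvider,
    }
    DEFAULT_MODEL = "text-embedding-3-small"  # assumed default, not from the PR

    def __init__(self, provider: str, api_key: str, base_url: Optional[str] = None) -> None:
        if provider not in self.PROVIDERS:
            raise ValueError(f"Unknown provider: {provider!r}")
        # Pass base_url only when given, so each provider keeps its own default.
        kwargs = {"base_url": base_url} if base_url else {}
        self.provider = self.PROVIDERS[provider](api_key, **kwargs)

    def get_embeddings(self, texts: List[str], model: Optional[str] = None) -> List[List[float]]:
        return self.provider.get_embeddings(texts, model or self.DEFAULT_MODEL)
```

Keeping the name-to-class mapping in `PROVIDERS` means a new backend only needs a subclass and one dictionary entry, which matches the "pluggable" intent described in the guide.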
I see a check failed - I'm on it! 🫡
✅ Fixed the failing checks in PR #120 by addressing linting and type checking issues in the embedding provider files.
The changes have been pushed to the original PR branch. You can view the fix in this commit.
I see a check failed - I'm on it! 🫡
✅ Fixed the failing checks in PR #120 by addressing formatting issues in the README_CODEGEN.md file.
The changes have been pushed to the original PR branch. You can view the fix in this commit.
This PR adds a flexible embedding provider system that can be used with Codegen's FileIndex for semantic code search.
Features
`FileIndex` class

Files Added
- `codegen_embedding_provider.py`: The main implementation with provider classes and patching mechanism
- `example_codegen_usage.py`: Example script showing how to use the tool with Codegen
- `README_CODEGEN.md`: Documentation on usage and extension

How It Works
The tool uses a simple patching mechanism to replace Codegen's `_get_embeddings` method in the `FileIndex` class with our own implementation that routes requests through the specified provider. This approach allows you to switch providers without modifying Codegen's source code.

Usage Example
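
A minimal sketch of how such a per-instance patch could look. It does not import Codegen: `StandInFileIndex` is a hypothetical stand-in for `FileIndex`, `patch_file_index` is an illustrative helper name, and the `_get_embeddings(texts)` signature is an assumption about the method being replaced.

```python
import types
from typing import Callable, List

Embedder = Callable[[List[str]], List[List[float]]]


class StandInFileIndex:
    """Hypothetical stand-in for Codegen's FileIndex; only the patched method is modeled."""

    def _get_embeddings(self, texts: List[str]) -> List[List[float]]:
        raise RuntimeError("built-in embedding call is not available in this sketch")


def patch_file_index(index: StandInFileIndex, embed: Embedder) -> StandInFileIndex:
    """Route `_get_embeddings` on this one instance through `embed`."""

    def _patched(self: StandInFileIndex, texts: List[str]) -> List[List[float]]:
        return embed(texts)

    # Bind to the instance only: the class and all other instances stay unchanged.
    index._get_embeddings = types.MethodType(_patched, index)
    return index


def fake_embed(texts: List[str]) -> List[List[float]]:
    # Placeholder embedder; a real one would delegate to a provider's get_embeddings.
    return [[float(len(text))] for text in texts]


patched_index = patch_file_index(StandInFileIndex(), fake_embed)
vectors = patched_index._get_embeddings(["def foo(): pass"])
```

Binding the replacement with `types.MethodType` on the instance, rather than assigning to the class, limits the override to the index you patched, so other instances keep the built-in behavior.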
Notes
`FileIndex` instance you patch
`FileIndex` instance
Summary by Sourcery
Add a flexible embedding provider system to plug alternative backends into Codegen’s FileIndex for semantic code search, including OpenAI and DeepSeek support, a non-invasive patching mechanism, example usage script, and accompanying documentation.