feat: Add GoogleAITextEmbedder and GoogleAIDocumentEmbedder components #1783

Merged
merged 21 commits into from
Jun 10, 2025

Conversation

garybadwal
Contributor

@garybadwal garybadwal commented May 24, 2025

Related Issues

Proposed Changes:

  • feat: Added support for Google AI embeddings through two new components:
    • GoogleAITextEmbedder: For embedding individual strings.
    • GoogleAIDocumentEmbedder: For embedding Haystack Document objects in batches, with optional metadata handling.
  • Both components use the Google Generative AI SDK (google.genai) and are compatible with Haystack’s embedding interface.
  • Components support configuration via environment variables and customizable embedding settings (model, task type, etc.).
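As a rough illustration of how the two components divide the work, here is a minimal sketch. It assumes the usual Haystack embedder shape, where a text embedder returns `{"embedding": [...]}` and a document embedder returns `{"documents": [...]}`. The names `Doc`, `SketchTextEmbedder`, `SketchDocumentEmbedder`, `fake_embed`, and `meta_fields_to_embed` are stand-ins for illustration, not the code in this PR, and no Google API is called.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    embedding: Optional[List[float]] = None


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Placeholder for the real API call: one tiny deterministic vector per text."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


class SketchTextEmbedder:
    """Embeds a single string, for example a query."""

    def run(self, text: str) -> Dict[str, Any]:
        if not isinstance(text, str):
            raise TypeError("Expected a string; use the document embedder for documents.")
        return {"embedding": fake_embed([text])[0]}


class SketchDocumentEmbedder:
    """Embeds documents, optionally prepending selected metadata fields."""

    def __init__(self, meta_fields_to_embed: Optional[List[str]] = None, separator: str = "\n"):
        self.meta_fields_to_embed = meta_fields_to_embed or []
        self.separator = separator

    def _prepare(self, doc: Doc) -> str:
        # Join the chosen metadata values with the content into one string to embed.
        meta_values = [str(doc.meta[k]) for k in self.meta_fields_to_embed if doc.meta.get(k) is not None]
        return self.separator.join(meta_values + [doc.content])

    def run(self, documents: List[Doc]) -> Dict[str, Any]:
        texts = [self._prepare(d) for d in documents]
        for doc, emb in zip(documents, fake_embed(texts)):
            doc.embedding = emb
        return {"documents": documents}
```

The text embedder rejects non-string input so that a list of documents passed by mistake fails loudly instead of being embedded as a single string.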

How did you test it?

  • Manual tests with sample queries and documents.
  • Verified API response handling and embedding correctness.
  • Checked serialization/deserialization via to_dict() and from_dict().
  • Confirmed compatibility in a basic Haystack pipeline.
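The serialization check can be pictured as a round trip: build a component, dump it to a dictionary, rebuild it, and compare. The sketch below is plain Python and only mirrors the `to_dict()` / `from_dict()` pattern. `SketchEmbedder`, the model name `example-embedding-model`, and the environment variable name `GOOGLE_API_KEY` are assumptions for illustration, not values confirmed by this PR.

```python
from typing import Any, Dict


class SketchEmbedder:
    """Hypothetical component; only the serialization pattern is the point."""

    def __init__(self, model: str = "example-embedding-model", api_key_env: str = "GOOGLE_API_KEY"):
        self.model = model
        # Store the NAME of the environment variable, never the key itself,
        # so that the serialized form contains no secret.
        self.api_key_env = api_key_env

    def to_dict(self) -> Dict[str, Any]:
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {"model": self.model, "api_key_env": self.api_key_env},
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchEmbedder":
        return cls(**data["init_parameters"])


def check_round_trip(component: SketchEmbedder) -> bool:
    """Return True if serializing and deserializing preserves the configuration."""
    restored = SketchEmbedder.from_dict(component.to_dict())
    return restored.to_dict() == component.to_dict()
```

A check such as `check_round_trip(SketchEmbedder(model="some-model"))` is the kind of assertion a unit test for serialization would make.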

Notes for the reviewer

  • You may want to double-check how batch embedding is handled in _embed_batch() for performance and error resilience.
  • I’ve followed patterns used in other embedding components for consistency.
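For reviewers weighing the batching concern, one possible shape for a resilient batch loop is sketched below. This is not the `_embed_batch()` from the PR: `embed_in_batches`, `embed_fn`, and the choice to record `None` for failed batches are assumptions made for illustration.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

Vector = List[float]


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[Vector]],
    batch_size: int = 32,
) -> Tuple[List[Optional[Vector]], List[str]]:
    """Embed texts batch by batch; a failing batch yields None placeholders
    instead of aborting the whole run, so output stays aligned with input."""
    embeddings: List[Optional[Vector]] = []
    errors: List[str] = []
    for index, batch in enumerate(batched(texts, batch_size)):
        try:
            vectors = embed_fn(batch)
            if len(vectors) != len(batch):
                raise ValueError(f"expected {len(batch)} vectors, got {len(vectors)}")
            embeddings.extend(vectors)
        except Exception as exc:  # broad on purpose: any API failure is recorded
            embeddings.extend([None] * len(batch))
            errors.append(f"batch {index}: {exc}")
    return embeddings, errors
```

Whether a failed batch should be skipped, retried, or raised is a design decision for the component; the sketch only shows that the position of each result can be preserved either way.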

Checklist

@garybadwal garybadwal requested a review from a team as a code owner May 24, 2025 10:58
@garybadwal garybadwal requested review from Amnah199 and removed request for a team May 24, 2025 10:58
@CLAassistant

CLAassistant commented May 24, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the integration:google-ai and type:documentation labels May 24, 2025
@anakin87
Member

This might be related: #1694

@garybadwal
Contributor Author

Hi @anakin87, sorry, I hadn't checked that. Could you review this and let me know whether my PR would be considered?

@sjrl
Contributor

sjrl commented Jun 2, 2025

Hey @garybadwal, thanks for your work on this! We actually just created a new integration for Google in this PR. It uses the new google-genai library, which your code also appears to depend on.

Could I ask you to do the following with this PR:

  • Add your changes to the folder google_genai
  • Create a separate file for the TextEmbedder and the DocumentEmbedder, using the same folder structure as the Cohere integration (see here). In practice, make an embedders folder containing a text_embedder.py and a document_embedder.py file.
  • Make sure to create corresponding test files and tests for each of the embedders. You can find examples of that here for our Cohere integration.

Once you've done these, we'll be happy to give a more in depth review!

@sjrl sjrl requested review from sjrl and removed request for Amnah199 June 2, 2025 13:56
@sjrl sjrl self-assigned this Jun 2, 2025
@sjrl sjrl added the information-needed label Jun 2, 2025
@garybadwal
Contributor Author

Thank you so much @sjrl for your review on this.

Sure, I'll make all the changes ASAP and let you know. It would also help if you could give me some guidance on the tests. If there is a format or standard that integrations are expected to follow, please point me to it so I can write mine to the same standard.

And thanks once again for your guidance in this.

@sjrl
Contributor

sjrl commented Jun 3, 2025

Great! For the tests I suggest you follow the ones we have for our OpenAI Text and Document Embedders. They can be found here:

@garybadwal
Contributor Author

Thank you so much man @sjrl, I'll complete this over the weekend and you can then give it a review.

Thank you so much for your help.

@sjrl
Contributor

sjrl commented Jun 5, 2025

Sounds good! For the tests make sure to add them to the tests folder in google_genai and you can name them test_text_embedder.py and test_document_embedder.py. Also you can always copy from GitHub the code I linked in this comment so no need to fork or git clone the main haystack repo, but up to you.
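As a rough idea of what a unit test in test_text_embedder.py could look like without calling the real API, here is a minimal sketch. `TinyTextEmbedder` and its `client.embed(...)` method are made up for illustration and are not the google-genai API or the component in this PR; the point is only the pattern of injecting a mocked client and asserting on the returned dictionary.

```python
from typing import Any, Dict, List
from unittest.mock import MagicMock


class TinyTextEmbedder:
    """Hypothetical embedder that delegates to an injected client."""

    def __init__(self, client: Any, model: str = "example-embedding-model"):
        self.client = client
        self.model = model

    def run(self, text: str) -> Dict[str, List[float]]:
        if not isinstance(text, str):
            raise TypeError("expected a string")
        # `embed` is an invented method name on the injected client.
        return {"embedding": self.client.embed(model=self.model, text=text)}


def test_run_returns_embedding():
    client = MagicMock()
    client.embed.return_value = [0.1, 0.2, 0.3]
    result = TinyTextEmbedder(client).run("hello")
    assert result == {"embedding": [0.1, 0.2, 0.3]}
    client.embed.assert_called_once_with(model="example-embedding-model", text="hello")


def test_run_rejects_non_string():
    embedder = TinyTextEmbedder(MagicMock())
    try:
        embedder.run(["not", "a", "string"])
    except TypeError:
        return
    raise AssertionError("expected TypeError")
```

Functions named `test_*` in a file under the tests folder are the form pytest collects by default, so tests written this way run without network access or an API key.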

@garybadwal
Contributor Author

Hi @sjrl, I have added the test cases and run them once; they are working fine.

@sjrl
Contributor

sjrl commented Jun 5, 2025

Hey @garybadwal could you take a look at the failing tests in the CI? It looks like a few of the unit tests are failing. E.g. here and also the linting is failing here.

You can run the linting locally by running hatch run lint:all in your terminal when in the google_genai folder.

@garybadwal
Contributor Author

Ok sure @sjrl I'll check.

@garybadwal
Contributor Author

garybadwal commented Jun 5, 2025

Hi @sjrl, I fixed the failing test case and the linting problems, but one lint check is still failing. Should I fix that too, or is this good enough?

@sjrl
Contributor

sjrl commented Jun 6, 2025

Hi @sjrl, I have done all the changes, and 1 test is failing, it says that src/haystack_integrations/components/embedders/google_genai/document_embedder.py:11: error: Cannot find implementation or library stub for module named "more_itertools" [import-not-found]. Can you guide me in this? I checked online, and they say to make this change in pyproject.toml file:

dependencies = ["pip", "black>=23.1.0", "mypy>=1.0.0", "ruff>=0.0.243", "more-itertools"]

under [tool.hatch.envs.lint], but I don't want to make any change in this file without your approval. Let me know what I have to do.

Thanks for checking! Go ahead and make the change to see if that helps fix the tests

@sjrl
Contributor

sjrl commented Jun 6, 2025

@garybadwal looking very good and almost there!

@garybadwal
Contributor Author

Hi @sjrl,
I’ve implemented all the changes you suggested, and I’m happy to report that all checks passed this time — phew, big relief! 😄
Let me know if there's anything else you'd like me to tweak or revisit.

@garybadwal
Contributor Author

@sjrl, this is my first open-source contribution. I've often come across tweets and LinkedIn posts from other developers sharing their open-source experiences. One thing I noticed is that they usually include a comment at the top of the file like this:

# Author Name: <Name of the contributor>
# Author Email: <Email of the author>
# Author GitHub Username: <GitHub username of the author>

Is this considered a valid practice? If so, do we allow it in this project? And would it be okay if I do the same?

@sjrl
Contributor

sjrl commented Jun 6, 2025

@garybadwal great question! I checked internally and this is how we attribute authorship.

  • First you should add your info in the pyproject.toml inside of google_genai here

  • Second, once your PR is merged, you can add your info to this open PR in this section. (The PR linked here is for publishing the Google GenAI integration on our website here, and you would be able to link to that page to show your name. See the Ollama integration here for an example.)

@garybadwal
Contributor Author

Thank you @sjrl!
I’ve added the author information in the pyproject.toml file. Once this is merged, I’ll proceed with the next steps as well.

@garybadwal
Contributor Author

Hey @sjrl,
Thanks a lot for guiding me through everything so smoothly—I truly appreciate your support.

If you have any feedback on the work I’ve done in this PR, I’d love to hear it. It’ll really help me improve and refine my approach for future contributions and ensure my work reflects the standards of a good developer.

Thanks again! 🙌

@garybadwal
Contributor Author

Hi @sjrl, sorry to bother you — just wanted to check in and ask if you’ll be reviewing this again or planning to merge it today?

@garybadwal garybadwal requested a review from sjrl June 7, 2025 14:44
@garybadwal
Contributor Author

Hey @sjrl, just checking in—anything pending for me on this?

Contributor

@sjrl sjrl left a comment

@garybadwal thanks for the contribution!

@sjrl sjrl merged commit 9bd9134 into deepset-ai:main Jun 10, 2025
11 checks passed
Labels
integration:google-genai, type:documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gemini embedder models
4 participants