feat(microsoft_sharepoint): Add Microsoft SharePoint retriever integration by bogdankostic · Pull Request #3429 · deepset-ai/haystack-core-integrations

bogdankostic · 2026-06-11T14:45:16Z

Related Issues

part of https://github.com/deepset-ai/haystack-private/issues/346

Proposed Changes:

Adds a new integration, microsoft-sharepoint-haystack, providing the MSSharePointRetriever component.

Given a query, the retriever calls the Microsoft Graph Search API (POST /search/query) and returns matching SharePoint and OneDrive content as Haystack Documents. Each hit becomes a Document whose content is the search snippet (hit-highlight markup stripped, entities unescaped) and whose meta carries file_name, web_url, entity_type, created_date_time, last_modified_date_time, created_by, last_modified_by, mime_type, and file_extension (keys with no value are omitted).
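
The hit-to-Document mapping described above can be sketched roughly as follows. This is an illustration, not the code from this PR: it assumes the Search API marks highlights with `<c0>...</c0>` tags and truncation with `<ddd/>`, and the helper names (`clean_snippet`, `build_meta`) and the resource keys read here are hypothetical.

```python
import html
import re

# Assumption: highlights arrive as <c0>...</c0> tags and <ddd/> truncation markers.
_HIGHLIGHT_MARKUP = re.compile(r"</?c\d+>|<ddd\s*/>")


def clean_snippet(summary: str) -> str:
    """Strip hit-highlight markup, unescape HTML entities, and collapse whitespace."""
    without_markup = _HIGHLIGHT_MARKUP.sub("", summary)
    return " ".join(html.unescape(without_markup).split())


def build_meta(resource: dict) -> dict:
    """Collect metadata from a hit's resource, omitting keys that have no value."""
    # Assumption: these resource keys are illustrative; only a subset of the
    # meta fields listed in the PR description is shown.
    candidates = {
        "file_name": resource.get("name"),
        "web_url": resource.get("webUrl"),
        "created_date_time": resource.get("createdDateTime"),
        "last_modified_date_time": resource.get("lastModifiedDateTime"),
    }
    return {key: value for key, value in candidates.items() if value is not None}
```

Stripping the markup before unescaping means an escaped `&lt;c0&gt;` in the underlying text survives as literal characters instead of being mistaken for a highlight tag.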

How did you test it?

Unit tests covering: init validation, serialization round-trip, hit-to-Document mapping, request-body and auth-header construction, KQL/fields omission, pagination with offset/size, top_k handling, error handling (401/403/other 4xx, 429 retry-then-succeed, give-up after max retries), running inside a Pipeline, and the async path.
Manual live verification against a real Microsoft 365 tenant, including a two-user per-user-scoping check: a user with access to a restricted site receives its content, while a user without access does not.

Notes for the reviewer

The access_token is provided at run time. It can be obtained from the OAuthResolver, which ships in a separate PR: feat(oauth): Add OAuth integration #3419
The component does not fetch the files themselves. It returns the summary provided by the Search API in the Document's content field, with the link to the file in the meta (web_url). Fetching the actual file contents requires additional downstream components.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.

github-actions · 2026-06-11T14:46:34Z

Coverage report (microsoft_sharepoint)

File	Statements	Missing	Coverage	Coverage (new stmts)	Lines missing
integrations/microsoft_sharepoint/src/haystack_integrations/components/retrievers/microsoft_sharepoint
errors.py
retriever.py					191, 227, 283-286, 301
Project Total

This report was generated by python-coverage-comment-action

sjrl · 2026-06-16T12:36:21Z

+        self.max_retries = max_retries
+
+    @component.output_types(documents=list[Document])
+    def run(self, query: str, access_token: str, top_k: int | None = None) -> dict[str, list[Document]]:


Minor suggestion: would it be worth also accepting a Secret for access_token here, in addition to a plain string?

Added support for Secret in e9ad7bc
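
For readers following along, accepting either a plain string or a secret object could look roughly like the sketch below. How commit e9ad7bc actually implements this is not shown in this thread. Haystack's `Secret` exposes `resolve_value()`; the `SupportsResolveValue` protocol, the `StaticSecret` stand-in, and both helper functions are hypothetical names used only to keep the example self-contained.

```python
from typing import Dict, Optional, Protocol, Union


class SupportsResolveValue(Protocol):
    """Hypothetical protocol for any secret type that exposes `resolve_value()`."""

    def resolve_value(self) -> Optional[str]: ...


class StaticSecret:
    """Hypothetical stand-in for a secret object, used only for illustration."""

    def __init__(self, value: Optional[str]) -> None:
        self._value = value

    def resolve_value(self) -> Optional[str]:
        return self._value


def resolve_access_token(access_token: Union[str, SupportsResolveValue]) -> str:
    """Return the raw token, resolving secret objects at call time."""
    token = access_token if isinstance(access_token, str) else access_token.resolve_value()
    if not token:
        raise ValueError("access_token resolved to an empty value")
    return token


def build_auth_headers(access_token: Union[str, SupportsResolveValue]) -> Dict[str, str]:
    """Build the bearer Authorization header for a request."""
    return {"Authorization": f"Bearer {resolve_access_token(access_token)}"}
```

Resolving the secret inside the request path, not at init, keeps the raw token out of serialized component state.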

sjrl · 2026-06-17T13:00:19Z

+        if status == _HTTP_UNAUTHORIZED:
+            msg = (
+                "Microsoft Graph rejected the access token (401 Unauthorized). The token may be expired, invalid, "
+                "or missing the required delegated scopes (for example Files.Read.All / Sites.Read.All)."
+            )
+        elif status == _HTTP_FORBIDDEN:


Minor comment: I feel like these status code values are pretty common. Do we need a global variable for them?
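
One possible alternative to module-level constants, sketched here as a suggestion and not as something the PR does, is the standard library's `http.HTTPStatus` enum, whose members compare equal to plain integers. The function name and the 403 message text below are illustrative assumptions.

```python
from http import HTTPStatus
from typing import Optional


def auth_error_message(status: int) -> Optional[str]:
    """Map an auth-related status code to a message; return None for other codes."""
    if status == HTTPStatus.UNAUTHORIZED:
        return "Microsoft Graph rejected the access token (401 Unauthorized)."
    if status == HTTPStatus.FORBIDDEN:
        # Illustrative wording; the PR's actual 403 message is not shown in this thread.
        return "Microsoft Graph denied access to the requested resource (403 Forbidden)."
    return None
```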

sjrl · 2026-06-17T13:38:45Z

+    ```python
+    from haystack import Pipeline
+    from haystack.utils import Secret
+    from haystack_integrations.components.connectors.oauth import OAuthResolver, RefreshTokenSource


Maybe we could update this example so it does not depend on another integration? On the docs page for this component we can then add a full example and explain that users also need to install the other Haystack integration.

sjrl · 2026-06-17T13:41:51Z

+        :param entity_types: The Microsoft Search entity types to query. Defaults to `["driveItem", "listItem"]`,
+            which covers files, folders, SharePoint pages and news, and list items. Other valid values are
+            `"list"` and `"site"`.
+        :param top_k: The maximum number of documents to return. Maps to the Search API `size` and is paginated
+            when it exceeds a single page.
+        :param fields: Optional list of resource properties to request via the Search API `fields` selection
+            (only honored for `listItem` and `driveItem` entity types).
+        :param query_template: Optional KQL query template used to scope the search, for example
+            `'{searchTerms} path:"https://contoso.sharepoint.com/sites/Team"'`. The literal `{searchTerms}`
+            placeholder is replaced by the run-time query.


For entity_types, fields, and query_template, are there Microsoft docs links we could provide?
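
As an aside on the `top_k` docstring quoted above, which says results are paginated when `top_k` exceeds a single page: the windowing could be sketched as below. The function name and the `page_size` parameter are hypothetical, and the real per-page limit should be taken from the Microsoft documentation, not from this example.

```python
from typing import Iterator, Tuple


def page_windows(top_k: int, page_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, size) pairs that together cover the first `top_k` results."""
    if top_k < 0 or page_size <= 0:
        raise ValueError("top_k must be >= 0 and page_size must be > 0")
    offset = 0
    while offset < top_k:
        # The last page is shortened so the total never exceeds top_k.
        size = min(page_size, top_k - offset)
        yield offset, size
        offset += size
```

For `top_k=60` and a page size of 25, this produces three requests: two full pages and a final page of 10.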

sjrl · 2026-06-17T13:45:35Z

+    def _build_request_body(self, query: str, offset: int, size: int) -> dict[str, Any]:
+        """Build the Microsoft Search `POST /search/query` request body for one page."""


I realize this method is internal, but if possible, a Microsoft docs link showing the valid parameters for this endpoint would be great for future devs.
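
To make the shape of the request concrete, here is a sketch of a body for one page. It is written as a standalone function, not the PR's method, and the key names (`entityTypes`, `queryString`, `queryTemplate`, `from`, `size`, `fields`) reflect my understanding of the Microsoft Search request format; treat them as assumptions to verify against the official reference.

```python
from typing import Any, Dict, List, Optional


def build_request_body(
    query: str,
    offset: int,
    size: int,
    entity_types: List[str],
    fields: Optional[List[str]] = None,
    query_template: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a `POST /search/query` body for one page of results."""
    search_query: Dict[str, Any] = {"queryString": query}
    if query_template:
        # Omitted entirely when no template is configured.
        search_query["queryTemplate"] = query_template

    request: Dict[str, Any] = {
        "entityTypes": list(entity_types),
        "query": search_query,
        "from": offset,
        "size": size,
    }
    if fields:
        # Omitted entirely when no field selection is configured.
        request["fields"] = list(fields)
    return {"requests": [request]}
```

Leaving out unset keys, instead of sending null values, matches the "KQL/fields omission" behavior listed in the unit tests.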

sjrl · 2026-06-17T13:54:35Z

+            if response.status_code in _RETRYABLE_STATUS and attempt < self.max_retries:
+                attempt += 1
+                delay = self._retry_delay(response, attempt)


Maybe worth using tenacity instead of rolling our own retry mechanism?

We may even be able to reuse request_with_retry from haystack which uses tenacity and httpx under the hood.

We also have an async version async_request_with_retry available in haystack in haystack/utils/request_utils.py
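
For context, the hand-rolled pattern in the diff above can be sketched with the standard library alone, as below. This is neither the PR's code nor the tenacity-based helper suggested here. The retryable status set, the backoff constants, and the `send` callable (returning a status, headers, and body) are assumptions made to keep the example self-contained.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Assumption: which statuses are retryable is illustrative, not taken from the PR.
RETRYABLE_STATUS = frozenset({429, 503})

Response = Tuple[int, Dict[str, str], str]  # (status, headers, body)


def retry_delay(retry_after: Optional[str], attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Prefer a numeric Retry-After header; otherwise back off exponentially."""
    if retry_after is not None:
        try:
            return min(max(float(retry_after), 0.0), cap)
        except ValueError:
            pass  # HTTP-date values are not handled in this sketch.
    return min(base * 2 ** (attempt - 1), cap)


def send_with_retry(
    send: Callable[[], Response],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send` until it returns a non-retryable status or retries run out."""
    attempt = 0
    while True:
        status, headers, body = send()
        if status in RETRYABLE_STATUS and attempt < max_retries:
            attempt += 1
            sleep(retry_delay(headers.get("Retry-After"), attempt))
            continue
        return status, body
```

Injecting `sleep` keeps the loop testable without real waiting, which is one of the conveniences a library-based retry would otherwise provide.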

sjrl · 2026-06-17T14:07:28Z

+    @component.output_types(documents=list[Document])
+    def run(self, query: str, access_token: str | Secret, top_k: int | None = None) -> dict[str, list[Document]]:


Out of curiosity, is there a concept of filters we could also add here, just to make it work more like our document store retrievers?

sjrl · 2026-06-17T14:25:01Z

@bogdankostic one thing I noticed: ideally we could use this new SharePoint retriever inside the MultiRetriever, since that is our recommended retriever in the platform and it removes the need to manually wire many retrievers to a document joiner.

To do this, though, we will need two sets of changes. The main one I want to propose for this integration (I think) is number 2, where we create a wrapper component that conforms to the TextRetriever protocol. WDYT?

1. MultiRetriever — add a per-retriever run-inputs map (haystack)

A new optional run param (run + run_async) that maps retriever name → extra kwargs, merged only into that retriever's call. Generic, explicit routing, no signature introspection, protocol-safe (the protocol allows extra optional params), and not serialized (it's a run input, not init).

def run(self, query, filters=None, top_k=None, *, active_retrievers=None,
        retriever_inputs: dict[str, dict[str, Any]] | None = None):
    retriever_inputs = retriever_inputs or {}
    unknown = set(retriever_inputs) - self.retrievers.keys()
    if unknown:
        raise ValueError(f"Unknown retriever name(s) in retriever_inputs: {sorted(unknown)}")
    ...
    executor.submit(
        retriever.run,
        query=query, filters=resolved_filters, top_k=resolved_top_k,
        **retriever_inputs.get(name, {}),   # only this retriever gets the extras
    )
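
To illustrate the routing idea in isolation, here is a self-contained sketch with plain functions standing in for retriever components. It runs sequentially, unlike the executor-based snippet above, and every name in it (`run_all`, `keyword_retriever`, `sharepoint_retriever`) is hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional


def keyword_retriever(query: str, top_k: Optional[int] = None) -> List[str]:
    """Stand-in retriever that takes no extra run inputs."""
    return [f"keyword:{query}"]


def sharepoint_retriever(query: str, top_k: Optional[int] = None, access_token: str = "") -> List[str]:
    """Stand-in retriever that needs an extra per-run input."""
    return [f"sharepoint:{query}:{access_token}"]


def run_all(
    retrievers: Dict[str, Callable[..., List[str]]],
    query: str,
    top_k: Optional[int] = None,
    retriever_inputs: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, List[str]]:
    """Call every retriever, merging extras only into the retriever they are keyed to."""
    retriever_inputs = retriever_inputs or {}
    unknown = set(retriever_inputs) - set(retrievers)
    if unknown:
        raise ValueError(f"Unknown retriever name(s) in retriever_inputs: {sorted(unknown)}")
    return {
        name: retrieve(query=query, top_k=top_k, **retriever_inputs.get(name, {}))
        for name, retrieve in retrievers.items()
    }
```

Validating the names up front means a typo in `retriever_inputs` fails loudly instead of silently dropping the access token.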

2. Create a new SharePointTextRetriever adapter

Wraps an OAuthResolver + MSSharePointRetriever, conforms to our TextRetriever protocol, and resolves the token per run. If the resolver's token source needs a per-request subject_token, the adapter declares it as a mandatory input too — mirroring the resolver's contract.

@component
class SharePointTextRetriever:
    def __init__(self, *, retriever: MSSharePointRetriever, resolver: OAuthResolver):
        self.retriever = retriever
        self.resolver = resolver
        # public token-source attribute, not the resolver's private flag
        self._requires_subject_token = bool(getattr(resolver.token_source, "requires_subject_token", False))
        if self._requires_subject_token:
            component.set_input_type(self, "subject_token", str)  # no default => mandatory socket

    @component.output_types(documents=list[Document])
    def run(self, query: str, filters=None, top_k=None, **kwargs):
        if filters is not None:
            logger.warning("SharePoint retrieval ignores `filters`; scope via `query_template` instead.")
        access_token = self.resolver.run(**self._resolver_kwargs(kwargs))["access_token"]
        return self.retriever.run(query=query, access_token=access_token, top_k=top_k)

    @component.output_types(documents=list[Document])
    async def run_async(self, query: str, filters=None, top_k=None, **kwargs):
        # signature must match run() — enforced by @component
        if filters is not None:
            logger.warning("SharePoint retrieval ignores `filters`; scope via `query_template` instead.")
        result = await self.resolver.run_async(**self._resolver_kwargs(kwargs))
        return await self.retriever.run_async(query=query, access_token=result["access_token"], top_k=top_k)

    def _resolver_kwargs(self, kwargs):
        if not self._requires_subject_token:
            return {}
        return {"subject_token": kwargs.get("subject_token", "")}

    # also add to_dict / from_dict

feat: Add Microsoft SharePoint retriever integration

b5a7015

github-actions Bot added the topic:CI and type:documentation labels Jun 11, 2026

Correct filename for Microsoft SharePoint documentation configuration

7791250

bogdankostic changed the title ~~feat: Add Microsoft SharePoint retriever integration~~ feat(microsoft_sharepoint): Add Microsoft SharePoint retriever integration Jun 11, 2026

bogdankostic added the integration:microsoft-sharepoint label Jun 15, 2026

bogdankostic marked this pull request as ready for review June 16, 2026 10:55

bogdankostic requested a review from a team as a code owner June 16, 2026 10:55

bogdankostic requested review from sjrl and removed request for a team June 16, 2026 10:55

sjrl reviewed Jun 16, 2026

Comment thread .github/workflows/microsoft_sharepoint.yml Outdated

sjrl reviewed Jun 16, 2026

sjrl reviewed Jun 17, 2026

Comment thread ...sharepoint/src/haystack_integrations/components/retrievers/microsoft_sharepoint/retriever.py Outdated

bogdankostic added 3 commits June 17, 2026 13:24

Adapt concurrency group naming in Microsoft SharePoint workflow

2357d22

Support Secret type

e9ad7bc

Extend metadata fields in docstring

4f76cb8

sjrl reviewed Jun 17, 2026

bogdankostic wants to merge 5 commits into main from ms_sharepoint
