-
Notifications
You must be signed in to change notification settings - Fork 2k
Add YouTube connector #4857
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add YouTube connector #4857
Conversation
Add Youtube connector core implementation Add README for Youtube connector Add unit tests for Youtube connector
@Cristian-Scobioala is attempting to deploy a commit to the Danswer Team on Vercel. A member of the Team first needs to authorize it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Summary
The YouTube connector implementation needs significant improvements to meet core requirements and align with Onyx's best practices. While the basic structure is in place, several critical features and robustness improvements are needed.
- Core transcription functionality is missing from
backend/onyx/connectors/youtube/connector.py
, which is essential for timestamp-based answers as specified in issue #799 - Duplicate logic exists between
load_from_checkpoint
andpoll_source
methods inconnector.py
, violating the DRY principle - The OAuth implementation in
backend/onyx/connectors/youtube/__init__.py
incorrectly attempts to access non-existent credentials during the flow - The checkpoint implementation in
connector.py
has potential memory issues with unboundedprocessed_videos
list growth - Double registration attempt in
backend/onyx/connectors/factory.py
could cause runtime errors
6 files reviewed, 5 comments
Edit PR Review Bot Settings | Greptile
# Register YouTube connector in the factory | ||
register_connector(YouTubeConnector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: register_connector() function is not defined. Remove this line since YouTubeConnector is already registered in the connector_map.
# Register YouTube connector in the factory | |
register_connector(YouTubeConnector) | |
# Register YouTube connector in the factory |
@pytest.fixture | ||
def youtube_credentials_provider(): | ||
"""Create a mock credentials provider.""" | ||
provider = MagicMock(spec=CredentialsProviderInterface) | ||
provider.get_credentials.return_value = '{"token": "test_token", "refresh_token": "test_refresh", "token_uri": "https://oauth2.googleapis.com/token", "client_id": "test_client_id", "client_secret": "test_client_secret", "scopes": ["https://www.googleapis.com/auth/youtube.readonly"]}' | ||
return provider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Credentials fixture is defined but never used in tests. Either remove or add credential loading tests
def load_credentials(self) -> Optional[Credentials]: | ||
"""Load credentials from the provider. | ||
Returns: | ||
Optional[Credentials]: The loaded credentials if available, None otherwise. | ||
""" | ||
if not self._credentials_provider: | ||
return None | ||
creds_str = self._credentials_provider.get_credentials() | ||
if not creds_str: | ||
return None | ||
creds_dict = json.loads(creds_str) | ||
return Credentials( | ||
token=creds_dict["token"], | ||
refresh_token=creds_dict["refresh_token"], | ||
token_uri=creds_dict["token_uri"], | ||
client_id=creds_dict["client_id"], | ||
client_secret=creds_dict["client_secret"], | ||
scopes=creds_dict["scopes"] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Duplicate code with set_credentials_provider. Extract credential loading logic into a private method to avoid repetition.
def load_credentials(self) -> Optional[Credentials]: | |
"""Load credentials from the provider. | |
Returns: | |
Optional[Credentials]: The loaded credentials if available, None otherwise. | |
""" | |
if not self._credentials_provider: | |
return None | |
creds_str = self._credentials_provider.get_credentials() | |
if not creds_str: | |
return None | |
creds_dict = json.loads(creds_str) | |
return Credentials( | |
token=creds_dict["token"], | |
refresh_token=creds_dict["refresh_token"], | |
token_uri=creds_dict["token_uri"], | |
client_id=creds_dict["client_id"], | |
client_secret=creds_dict["client_secret"], | |
scopes=creds_dict["scopes"] | |
) | |
def _create_credentials_from_dict(self, creds_dict: Dict[str, Any]) -> Credentials: | |
"""Create Credentials object from dictionary. | |
Args: | |
creds_dict: Dictionary containing credential data. | |
Returns: | |
Credentials: The created credentials object. | |
""" | |
return Credentials( | |
token=creds_dict["token"], | |
refresh_token=creds_dict["refresh_token"], | |
token_uri=creds_dict["token_uri"], | |
client_id=creds_dict["client_id"], | |
client_secret=creds_dict["client_secret"], | |
scopes=creds_dict["scopes"] | |
) | |
def load_credentials(self) -> Optional[Credentials]: | |
"""Load credentials from the provider. | |
Returns: | |
Optional[Credentials]: The loaded credentials if available, None otherwise. | |
""" | |
if not self._credentials_provider: | |
return None | |
creds_str = self._credentials_provider.get_credentials() | |
if not creds_str: | |
return None | |
creds_dict = json.loads(creds_str) | |
return self._create_credentials_from_dict(creds_dict) |
def oauth_code_to_token(self, code: str) -> Dict[str, Any]: | ||
"""Exchange OAuth code for tokens. | ||
Args: | ||
code: The OAuth code to exchange. | ||
Returns: | ||
Dict[str, Any]: The token response. | ||
Raises: | ||
ValueError: If credentials provider is not set or no credentials are available. | ||
""" | ||
if not self._credentials_provider: | ||
raise ValueError("Credentials provider not set") | ||
|
||
credentials = self._credentials_provider.get_credentials() | ||
if not credentials: | ||
raise ValueError("No credentials available") | ||
|
||
return { | ||
"access_token": credentials.token, | ||
"refresh_token": credentials.refresh_token, | ||
"token_uri": credentials.token_uri, | ||
"client_id": credentials.client_id, | ||
"client_secret": credentials.client_secret, | ||
"scopes": credentials.scopes | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: oauth_code_to_token implementation is incorrect. It's returning existing credentials instead of exchanging the code for new tokens using the OAuth flow.
) | ||
|
||
checkpoint.processed_videos.append(video_id) | ||
checkpoint.has_more = len(search_response.get("items", [])) > len(checkpoint.processed_videos) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Incorrect has_more logic. Should be based on presence of nextPageToken in search_response, not number of processed videos.
Add Youtube connector core implementation
Add README for Youtube connector
Add unit tests for Youtube connector
Description
This PR adds a YouTube connector to Onyx, allowing users to index and search YouTube videos. The connector fetches video metadata, including titles, descriptions, and statistics, and integrates with Onyx's search and UI.
Changes:
Related Issue: #799
How Has This Been Tested?
Unit Tests: Added tests for credential loading, video polling, and error handling.
Manual Testing: Verified the connector works with the YouTube API and correctly indexes videos.
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.