Skip to content

Conversation

scobioala
Copy link

@scobioala scobioala commented Jun 9, 2025

Add Youtube connector core implementation

Add README for Youtube connector

Add unit tests for Youtube connector

Description

This PR adds a YouTube connector to Onyx, allowing users to index and search YouTube videos. The connector fetches video metadata, including titles, descriptions, and statistics, and integrates with Onyx's search and UI.

Changes:

  • Register YouTube connector in the factory
  • Add YouTube connector core implementation
  • Add README and documentation
  • Add unit tests

Related Issue: #799

How Has This Been Tested?

Unit Tests: Added tests for credential loading, video polling, and error handling.
Manual Testing: Verified the connector works with the YouTube API and correctly indexes videos.

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Add Youtube connector core implementation

Add README for Youtube connector

Add unit tests for Youtube connector
@scobioala scobioala requested a review from a team as a code owner June 9, 2025 22:40
Copy link

vercel bot commented Jun 9, 2025

@Cristian-Scobioala is attempting to deploy a commit to the Danswer Team on Vercel.

A member of the Team first needs to authorize it.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Summary

The YouTube connector implementation needs significant improvements to meet core requirements and align with Onyx's best practices. While the basic structure is in place, several critical features and robustness improvements are needed.

  • Core transcription functionality is missing from backend/onyx/connectors/youtube/connector.py, which is essential for timestamp-based answers as specified in issue #799
  • Duplicate logic exists between load_from_checkpoint and poll_source methods in connector.py, violating the DRY principle
  • The OAuth implementation in backend/onyx/connectors/youtube/__init__.py incorrectly attempts to access non-existent credentials during the flow
  • The checkpoint implementation in connector.py has potential memory issues with unbounded processed_videos list growth
  • Double registration attempt in backend/onyx/connectors/factory.py could cause runtime errors

6 files reviewed, 5 comments
Edit PR Review Bot Settings | Greptile

Comment on lines +243 to +244
# Register YouTube connector in the factory
register_connector(YouTubeConnector)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: register_connector() function is not defined. Remove this line since YouTubeConnector is already registered in the connector_map.

Suggested change
# Register YouTube connector in the factory
register_connector(YouTubeConnector)
# Register YouTube connector in the factory

Comment on lines +72 to +77
@pytest.fixture
def youtube_credentials_provider():
"""Create a mock credentials provider."""
provider = MagicMock(spec=CredentialsProviderInterface)
provider.get_credentials.return_value = '{"token": "test_token", "refresh_token": "test_refresh", "token_uri": "https://oauth2.googleapis.com/token", "client_id": "test_client_id", "client_secret": "test_client_secret", "scopes": ["https://www.googleapis.com/auth/youtube.readonly"]}'
return provider
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Credentials fixture is defined but never used in tests. Either remove or add credential loading tests

Comment on lines 77 to 96
def load_credentials(self) -> Optional[Credentials]:
"""Load credentials from the provider.
Returns:
Optional[Credentials]: The loaded credentials if available, None otherwise.
"""
if not self._credentials_provider:
return None
creds_str = self._credentials_provider.get_credentials()
if not creds_str:
return None
creds_dict = json.loads(creds_str)
return Credentials(
token=creds_dict["token"],
refresh_token=creds_dict["refresh_token"],
token_uri=creds_dict["token_uri"],
client_id=creds_dict["client_id"],
client_secret=creds_dict["client_secret"],
scopes=creds_dict["scopes"]
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Duplicate code with set_credentials_provider. Extract credential loading logic into a private method to avoid repetition.

Suggested change
def load_credentials(self) -> Optional[Credentials]:
"""Load credentials from the provider.
Returns:
Optional[Credentials]: The loaded credentials if available, None otherwise.
"""
if not self._credentials_provider:
return None
creds_str = self._credentials_provider.get_credentials()
if not creds_str:
return None
creds_dict = json.loads(creds_str)
return Credentials(
token=creds_dict["token"],
refresh_token=creds_dict["refresh_token"],
token_uri=creds_dict["token_uri"],
client_id=creds_dict["client_id"],
client_secret=creds_dict["client_secret"],
scopes=creds_dict["scopes"]
)
def _create_credentials_from_dict(self, creds_dict: Dict[str, Any]) -> Credentials:
"""Create Credentials object from dictionary.
Args:
creds_dict: Dictionary containing credential data.
Returns:
Credentials: The created credentials object.
"""
return Credentials(
token=creds_dict["token"],
refresh_token=creds_dict["refresh_token"],
token_uri=creds_dict["token_uri"],
client_id=creds_dict["client_id"],
client_secret=creds_dict["client_secret"],
scopes=creds_dict["scopes"]
)
def load_credentials(self) -> Optional[Credentials]:
"""Load credentials from the provider.
Returns:
Optional[Credentials]: The loaded credentials if available, None otherwise.
"""
if not self._credentials_provider:
return None
creds_str = self._credentials_provider.get_credentials()
if not creds_str:
return None
creds_dict = json.loads(creds_str)
return self._create_credentials_from_dict(creds_dict)

Comment on lines 137 to 163
def oauth_code_to_token(self, code: str) -> Dict[str, Any]:
"""Exchange OAuth code for tokens.
Args:
code: The OAuth code to exchange.
Returns:
Dict[str, Any]: The token response.
Raises:
ValueError: If credentials provider is not set or no credentials are available.
"""
if not self._credentials_provider:
raise ValueError("Credentials provider not set")

credentials = self._credentials_provider.get_credentials()
if not credentials:
raise ValueError("No credentials available")

return {
"access_token": credentials.token,
"refresh_token": credentials.refresh_token,
"token_uri": credentials.token_uri,
"client_id": credentials.client_id,
"client_secret": credentials.client_secret,
"scopes": credentials.scopes
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: oauth_code_to_token implementation is incorrect. It's returning existing credentials instead of exchanging the code for new tokens using the OAuth flow.

)

checkpoint.processed_videos.append(video_id)
checkpoint.has_more = len(search_response.get("items", [])) > len(checkpoint.processed_videos)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Incorrect has_more logic. Should be based on presence of nextPageToken in search_response, not number of processed videos.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant