@EugeneJinXin EugeneJinXin commented Apr 23, 2025

Bug and Reproduce

When using client.clone_public_dataset(), only the first 100 examples of a large dataset are cloned instead of the full dataset. This causes missing data and user confusion.

Reproduction steps (a script is also attached at the end of this PR):

  1. Find a public dataset with >100 examples
  2. Run client.clone_public_dataset(...)
  3. Count examples in the cloned dataset
  4. Result: Only 100 examples are copied

Solution

Refactored the method to use the existing helper function _get_paginated_list, which correctly handles paginated API responses.

This change fixes the incomplete example extraction and was confirmed by cloning across two LangSmith instances; see the Test section.
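
For readers unfamiliar with the pattern, the offset-based pagination that a helper such as _get_paginated_list relies on can be sketched roughly as follows. This is a minimal illustration and not the actual langsmith implementation: `paginate`, `fetch_page`, `fake_fetch`, and the page size of 100 are all assumptions made up for the example.

```python
from typing import Any, Callable, Dict, Iterator, List

PAGE_SIZE = 100  # assumed page size, chosen to match the 100-example cutoff


def paginate(
    fetch_page: Callable[[int, int], List[Dict[str, Any]]],
    page_size: int = PAGE_SIZE,
) -> Iterator[Dict[str, Any]]:
    """Yield every item from an offset/limit style endpoint.

    fetch_page(offset, limit) is a hypothetical callable standing in for
    one HTTP request; it returns at most `limit` items.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        yield from page
        if len(page) < page_size:
            # A short page means the server has nothing more to return.
            break
        offset += len(page)


# Stand-in for a remote dataset with 214 examples.
_FAKE_EXAMPLES = [{"id": i} for i in range(214)]


def fake_fetch(offset: int, limit: int) -> List[Dict[str, Any]]:
    """Serve one slice of the fake dataset, like a single API response."""
    return _FAKE_EXAMPLES[offset : offset + limit]


single_request = fake_fetch(0, PAGE_SIZE)  # what a non-paginated call sees
all_examples = list(paginate(fake_fetch))  # what a paginated call sees
```

In this sketch a single request stops at the first page, which mirrors the reported bug, while the paginated loop keeps requesting pages until one comes back short or empty.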

Test

With this PR's change, I cloned a dataset from a personal LangSmith cloud account to a LangSmith instance running locally via docker-compose and verified that all examples were cloned. The script below reproduces the check.

eujin@eujin-mn1 langsmith-project % python3 /Users/eujin/langsmith-project/main.py -v
Passed! All 214 rows cloned

from langsmith import Client


def reproduce_issue():
    """Clone a dataset with 200+ examples and verify that every example is copied."""

    # Target: a LangSmith instance running locally via docker-compose
    ls_client = Client(api_url='http://localhost:1980/api/v1')
    dataset_name = "eujin_test_200_rows"

    # Source: a public dataset containing 214 examples
    dataset_public_url = (
        "https://smith.langchain.com/public/0dfe83c3-079e-4ee3-b6a5-01a6508066ea/d"
    )
    ls_client.clone_public_dataset(dataset_public_url)
    cloned_dataset = ls_client.read_dataset(dataset_name=dataset_name)

    assert cloned_dataset.example_count == 214, f"Expected 214 examples, got {cloned_dataset.example_count}"

    print("Passed! All 214 rows cloned")


if __name__ == "__main__":
    reproduce_issue()

In addition, I added a unit test in test_client.py to make sure list_shared_examples handles pagination correctly.
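
As a rough idea of what a pagination test of this kind can look like, here is a self-contained sketch built on unittest.mock. It does not reproduce the test added in test_client.py: `list_all`, `make_fake_session`, the placeholder URL, and the offset/limit parameter names are hypothetical stand-ins rather than langsmith APIs.

```python
from unittest import mock


def list_all(session, url, page_size=100):
    """Hypothetical paginated lister: requests pages until one comes back short."""
    items, offset = [], 0
    while True:
        resp = session.get(url, params={"offset": offset, "limit": page_size})
        page = resp.json()
        items.extend(page)
        if len(page) < page_size:
            return items
        offset += page_size


def make_fake_session(total):
    """Build a mock session whose get() serves slices of `total` fake examples."""
    data = [{"id": i} for i in range(total)]

    def fake_get(url, params):
        resp = mock.Mock()
        start = params["offset"]
        resp.json.return_value = data[start : start + params["limit"]]
        return resp

    session = mock.Mock()
    session.get.side_effect = fake_get
    return session


def test_lister_fetches_every_page():
    session = make_fake_session(total=214)
    # The URL is a placeholder; only the paging behaviour is under test.
    examples = list_all(session, "/public/<share_token>/examples")
    assert len(examples) == 214
    # Three requests are expected, at offsets 0, 100, and 200.
    offsets = [c.kwargs["params"]["offset"] for c in session.get.call_args_list]
    assert offsets == [0, 100, 200]


test_lister_fetches_every_page()
```

Asserting on the requested offsets as well as the total count guards against a lister that returns the right number of items by accident, for example by requesting one oversized page.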

More Considerations
Feel free to let me know if you'd like more coverage in integration testing. I looked at all methods that use request_with_retries() in client.py, and they all seem safe (they are either POST requests or single-ID lookups with no pagination risk). However, I'm not sure about this one since I haven't used the splits feature; I will verify offline with the owner.

@EugeneJinXin EugeneJinXin marked this pull request as ready for review April 24, 2025 00:04
share_token (Union[UUID, str]): The share token or URL of the shared dataset.
example_ids (Optional[List[UUID, str]], optional): The IDs of the examples to filter by. Defaults to None.
limit (Optional[int]): Maximum number of examples to return, by default None.
Collaborator

limit (Optional[int]): Maximum number of examples to return, by default None. -> limit (Optional[int]): Maximum number of examples to return. Defaults to no limit.
