Skip to content

Added functions to support IO for Parquet files. #562

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

ShigrafS
Copy link
Contributor

@ShigrafS ShigrafS commented Apr 27, 2025

Closes #307

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Summary

This PR adds Parquet file support to the movement package, enabling it to read and write pose tracking data in the tidy DataFrame format used by the [animovement](https://github.com/roaldarbol/animovement) R package. It enhances interoperability, supports efficient data storage, and simplifies integration with modern data analysis tools.


🧩 Related Issue

#307

Support tidy dataframe and Parquet I/O to facilitate data exchange with animovement


✨ What's New

✅ Load Functions (movement/io/load_poses.py)

  • Added from_tidy_df: Converts a tidy pandas DataFrame into an xarray.Dataset.
  • Added from_animovement_file: Reads a .parquet file and converts it using from_tidy_df.
  • Updated from_file to support source_software="animovement".

✅ Save Functions (movement/io/save_poses.py)

  • Added to_tidy_df: Converts an xarray.Dataset to a tidy DataFrame with optional confidence values.
  • Added to_animovement_file: Saves a dataset to a .parquet file via to_tidy_df.

✅ Dependency Update

  • Added pyarrow to pyproject.toml to support Parquet I/O via pandas.

✅ Tests (tests/test_parquet_io.py)

  • Added a new test suite covering:
    • Conversion between tidy DataFrames and datasets
    • Round-trip accuracy (DataFrame → dataset → DataFrame, and Parquet file round-trips)
    • Edge cases like missing data, no confidence, and invalid inputs

💡 Why This Matters

  • Interoperability: Enables seamless exchange with the animovement package.
  • Performance: Parquet provides efficient columnar storage and compression.
  • Usability: Tidy format is ideal for plotting, statistics, and tabular exploration.
  • Reliability: Comprehensive test coverage ensures stable, correct behavior.
  • Modernization: Brings movement closer to data science best practices.

How has this PR been tested?

Local pytest and CI tests.

Is this a breaking change?

If this PR breaks any existing functionality, please explain how and why.

Does this PR require an update to the documentation?

If any features have changed, or have been added. Please explain how the
documentation has been updated.

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

@ShigrafS ShigrafS marked this pull request as ready for review April 27, 2025 17:40
@ShigrafS
Copy link
Contributor Author

@niksirbi @sfmig This PR is ready to be merged.
Kindly review it.

Copy link

codecov bot commented Apr 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (fe28a5f) to head (4c8f835).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #562   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           32        32           
  Lines         1786      1856   +70     
=========================================
+ Hits          1786      1856   +70     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ShigrafS
Copy link
Contributor Author

@niksirbi @sfmig I've added a few tests to increase the coverage to 100%.
This should solve the Codecov issue.

@niksirbi niksirbi self-requested a review May 2, 2025 07:34
@ShigrafS
Copy link
Contributor Author

@niksirbi Can you please review this?

@niksirbi
Copy link
Member

@niksirbi Can you please review this?

Thanks for your work on this @ShigrafS. Sorry we didn't have time to get to this earlier, as we are busy with other development priorities (which are more urgent) and, in my case, being away attending conferences. I will give you feedback on this when I manage to review it in detail.

Copy link

sonarqubecloud bot commented Jun 8, 2025

Copy link
Member

@niksirbi niksirbi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, @ShigrafS.

This is a solid start, but I propose narrowing the scope of this PR to make it more manageable.

Specifically, let’s first focus solely on the from_tidy_df and to_tidy_df functions. I envisage these functions as providing a one-to-one mapping between our xarray dataset format and a “tidy” pandas DataFrame.

The columns I’d expect in such a tidy DataFrame—closely following animovement’s format—are:

  • time: derived directly from the time dimension of the xarray dataset, in whatever units are used there (you’re currently using frame). This means that in the from_tidy_df function, the fps parameter will no longer be used to populate the time column. Instead, it will only be stored in the dataset’s attrs if provided.
  • individual: this is missing from the animovement docs, but we definitely need it to handle multi-animal data. You currently call this track_id, but it would be good to match the xarray dimension name.
  • keypoint
  • x
  • y
  • z (only if the data is 3D)
  • confidence

For now, please omit the from_animovement_file and to_animovement_file functions. I’d like to discuss some details with animovement’s developer, who is currently on extended leave. We can easily add those functions later once we have the tidy DataFrame functions sorted.

Additionally, please remove any unrelated changes that have crept into other functions (such as the loaders for SLEAP, Anipose, etc.). Perhaps these were inadvertently introduced when merging from the main branch?

The easiest approach might be to close this PR and open a fresh one—starting from the latest main branch—implementing only the necessary tidy DataFrame functions and their corresponding tests.

Thanks again for your effort on this!

@ShigrafS
Copy link
Contributor Author

ShigrafS commented Jul 6, 2025

Thank you for your review @niksirbi
I had been a little preoccupied with other commitments.
I greatly appreciate your review.

I'll follow up on this with a new PR incorporating your suggestions.
Thanks again.

@niksirbi
Copy link
Member

niksirbi commented Jul 7, 2025

Thanks for the update @ShigrafS. In that case, I'll convert this PR to draft for now. Make sure to reference this when you open your new PR.

@niksirbi niksirbi marked this pull request as draft July 7, 2025 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement I/O for parquet files
2 participants