-
Notifications
You must be signed in to change notification settings - Fork 65
Added functions to support IO for Parquet files. #562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Added test_parquet_io.py.
Fixed minor issue.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #562 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 32 32
Lines 1786 1856 +70
=========================================
+ Hits 1786 1856 +70 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@niksirbi Can you please review this? |
Thanks for your work on this @ShigrafS. Sorry we didn't have time to get to this earlier, as we are busy with other development priorities (which are more urgent) and, in my case, being away attending conferences. I will give you feedback on this when I manage to review it in detail. |
for more information, see https://pre-commit.ci
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this, @ShigrafS.
This is a solid start, but I propose narrowing the scope of this PR to make it more manageable.
Specifically, let’s first focus solely on the from_tidy_df
and to_tidy_df
functions. I envisage these functions as providing a one-to-one mapping between our xarray
dataset format and a “tidy” pandas DataFrame.
The columns I’d expect in such a tidy DataFrame—closely following animovement’s format—are:
time
: derived directly from thetime
dimension of thexarray
dataset, in whatever units are used there (you’re currently usingframe
). This means that in thefrom_tidy_df
function, thefps
parameter will no longer be used to populate thetime
column. Instead, it will only be stored in the dataset’sattrs
if provided.individual
: this is missing from the animovement docs, but we definitely need it to handle multi-animal data. You currently call thistrack_id
, but it would be good to match thexarray
dimension name.keypoint
x
y
z
(only if the data is 3D)confidence
For now, please omit the from_animovement_file
and to_animovement_file
functions. I’d like to discuss some details with animovement’s developer, who is currently on extended leave. We can easily add those functions later once we have the tidy DataFrame functions sorted.
Additionally, please remove any unrelated changes that have crept into other functions (such as the loaders for SLEAP, Anipose, etc.). Perhaps these were inadvertently introduced when merging from the main
branch?
The easiest approach might be to close this PR and open a fresh one—starting from the latest main
branch—implementing only the necessary tidy DataFrame functions and their corresponding tests.
Thanks again for your effort on this!
Thank you for your review @niksirbi I'll follow up on this with a new PR incorporating your suggestions. |
Thanks for the update @ShigrafS. In that case, I'll convert this PR to draft for now. Make sure to reference this when you open your new PR. |
Closes #307
Description
What is this PR
Summary
This PR adds Parquet file support to the
movement
package, enabling it to read and write pose tracking data in the tidy DataFrame format used by the[animovement](https://github.com/roaldarbol/animovement)
R package. It enhances interoperability, supports efficient data storage, and simplifies integration with modern data analysis tools.🧩 Related Issue
#307
✨ What's New
✅ Load Functions (
movement/io/load_poses.py
)from_tidy_df
: Converts a tidy pandas DataFrame into anxarray.Dataset
.from_animovement_file
: Reads a.parquet
file and converts it usingfrom_tidy_df
.from_file
to supportsource_software="animovement"
.✅ Save Functions (
movement/io/save_poses.py
)to_tidy_df
: Converts anxarray.Dataset
to a tidy DataFrame with optional confidence values.to_animovement_file
: Saves a dataset to a.parquet
file viato_tidy_df
.✅ Dependency Update
pyarrow
topyproject.toml
to support Parquet I/O via pandas.✅ Tests (
tests/test_parquet_io.py
)💡 Why This Matters
animovement
package.movement
closer to data science best practices.How has this PR been tested?
Local pytest and CI tests.
Is this a breaking change?
If this PR breaks any existing functionality, please explain how and why.
Does this PR require an update to the documentation?
If any features have changed, or have been added. Please explain how the
documentation has been updated.
Checklist: