-
Notifications
You must be signed in to change notification settings - Fork 1
fix datetime issue in installers data #108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
fix datetime issue in installers data #108
Conversation
installer_data["effective_to"] = pd.to_datetime(installer_data["effective_to"]) | ||
installer_data["commissioning_date"] = pd.to_datetime( | ||
installer_data["commissioning_date"] | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a welome addition. I wonder if it would've been better to convert in earlier processing steps of the respective datasets (i.e. on reading in the datasets)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did previously encounter issues writing the parquet file as a column of mixed types cannot be written to parquet. So I did add a conversion bit after calling this function in asf_daps process_mcs_flow.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sqr00t does it make mixed types to convert to datetime if there are NA values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should now convert non datetime to pd.NA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I meant, does pd.Na and datetime values in the same column mean it would have mixed types and thus would not be able to write to parquet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be no issue. It's fine to have pd.NA and datetime in the same column.
On write, I pass in a PyArrow schema object that checks and can optionally force conversion. pd.NA and datetime in the same column is convertible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome thanks for confirming!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me in terms of code :)
The only thing I guess to bear in mind is that there might be some other scripts and functions in other places that expect the date columns to be in string format so they might break with this change and need updating. That being said, unless these columns eventually get renamed to something else, I don't think we use them too much?
When running the MCS-EPC pipeline, I was getting a an error due to variables not being of type datetime. This PR fixed it.
Fixes #107
Checklist:
notebooks/
pre-commit
and addressed any issues not automatically fixeddev
README
soutput/reports/