-
Notifications
You must be signed in to change notification settings - Fork 0
Adding Database side analysis functions #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@JBorrow, this is an ongoing work to address the analysis function issue on lightserve. It took some time to understand and resolve the primary key and foreign key requirements of the tables when creating hypertables. Once I’ve added the analysis functions, I’ll request a review. |
Cool! So this actually works with timescsaledb like this? |
Yes, seems like working. Had to add 'time' as a primary key in FluxMeasurement Table as it was a requirement for creating hypertables chunked in time. Then had to carry the changes forward. |
Very interesting that it requires time as a primary key... As long as it's ok with a composite primary key with the id field we're ok, otherwise we might need to rethink things. |
Yes I checked that works. |
Very interesting. Keep exploring this direction... Do we need a separate table for each time range? Or does timescaledb have some functionality for arbrtitrary time ranges too? |
As far as I have read, we don't need to worry about separate tables for different time ranges. timescaledb uses one table and automatically partitions data into time-based chunks. When we query any time range, it automatically finds the relevant chunks. |
Ah I understand, so the chunk_time_interval => INTERVAL '6 months', is more of an optimization thing? |
Yes. Best practice is to set chunk_time_interval so that one chunk of data takes up 25% of RAM. Most examples show 7-14 days as starting points, with TimescaleDB's default being 7 days. |
Right now I have set the aggregate bucketing at 1 month. This way every row in the table corresponds to the month start, so any query range includes entire months whose buckets start inside that range. For example, asking for 1 Mar–2 Sep returns the September bucket that starts on 1 Sep, and because that bucket covers the whole month, the results run through 30 Sep. If you don’t anticipate any issues, we can combine information from the Aggregate Table and Raw Table for accurate results. Alternatively, we can set shorter buckets, but this requires optimisation unless we know what users are comfortable with. |
… database while testing
@JBorrow I've implemented performance tests for the aggregate statistics endpoints to compare continuous aggregates vs raw queries:
Is it the right way to approach the tests? |
Seems like a reasonable way to do this. I guess in the future we'd probably not want to support two types of endpoint though. |
Yes, definitely. I had the 'without continuous aggregate' endpoint for testing purposes and will drop it once this is cleanly implemented. |
Hi @JBorrow! I have implemented both time series and aggregate endpoints. I have verified this against the raw table statistics. Would you like to test the implementation, or should I proceed with dropping the endpoint serving data from the raw flux measurement tables? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments before the refactor to two templated setups. You're close to that already.
return f""" | ||
SELECT add_continuous_aggregate_policy('{self.config.view_name}', | ||
start_offset => INTERVAL '{self.config.refresh_start_offset}', | ||
end_offset => INTERVAL '{self.config.refresh_end_offset}', | ||
schedule_interval => INTERVAL '{self.config.refresh_schedule_interval}' | ||
); | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not use sql parameters here? I don't like building queries with random strings...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when I do select(func.function_name(...)) I get
LINE 1: SELECT add_retention_policy('band_statistics_daily', INTERVA...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to be expecting a REGCLASS type variable.
lightcurvedb/analysis/statistics.py
Outdated
if delta_start <= 30: | ||
view_name = "band_statistics_daily" | ||
time_resolution = "daily" | ||
elif delta_start <= 180: | ||
view_name = "band_statistics_weekly" | ||
time_resolution = "weekly" | ||
else: | ||
view_name = "band_statistics_monthly" | ||
time_resolution = "monthly" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't very configurable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactored and set this up to use the AggregateConfigurations dictionary
session.execute(text(""" | ||
SELECT create_hypertable( | ||
'flux_measurements', | ||
'time', | ||
chunk_time_interval => INTERVAL '7 days', | ||
if_not_exists => TRUE | ||
); | ||
""")) | ||
session.commit() | ||
create_continuous_aggregates(session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is hypertable
the table that the CAs get loaded from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes the flux_measurements a hypertable
…ructure as aggregate requests
Addressing simonsobs/lightserve#11