Description
This issue is closely tied to this discussion, so please read the linked content before continuing.
Examining the data from this query:
SELECT * FROM public.gamit_subnets where "DOY"='180' and "Year"='2022'
Shows interesting behavior:
df = pd.read_csv('/Users/espg/Downloads/gamit_subnets_180_2022.csv')
df.iloc[1].stations
Which outputs the following (note the color highlight)
All of the red entries above are duplicates of stations already listed in the blue highlighting.
For `public.gamit_subnets` on DOY 180 of 2022, there are 17 listed clusters in the data table, with the first cluster (labeled subnet 0) being the backbone network. That leaves 16 clusters, which correspond to the 16 clusters that `make_clusters` produces. Since index zero in the postgres data table corresponds to the backbone, the indexing is off by 1; i.e., `df.iloc[1].stations` compares to `a['stations'][0]` and `b[0]` from `a, b = make_clusters(points.T, stations)`, with `a` and `b` being the `clusters` dictionary and the `cluster_ties` list respectively.
This is the zeroth entry for cluster stations from the `clusters` dictionary. Note that it is identical to the blue-highlighted text from the `public.gamit_subnets` table for DOY 180 in 2022:
>>> a['stations'][0]
[array(['igs', 'badg'], dtype='<U4'),
array(['igs', 'cas1'], dtype='<U4'),
array(['igs', 'coco'], dtype='<U4'),
array(['igs', 'daej'], dtype='<U4'),
array(['igs', 'darw'], dtype='<U4'),
array(['igs', 'dumg'], dtype='<U4'),
array(['igs', 'guam'], dtype='<U4'),
array(['igs', 'hob2'], dtype='<U4'),
array(['igs', 'hrao'], dtype='<U4'),
array(['igs', 'iisc'], dtype='<U4'),
array(['igs', 'kiru'], dtype='<U4'),
array(['igs', 'mal2'], dtype='<U4'),
array(['igs', 'mcil'], dtype='<U4'),
array(['igs', 'mobs'], dtype='<U4'),
array(['igs', 'nklg'], dtype='<U4'),
array(['igs', 'pohn'], dtype='<U4'),
array(['igs', 'pol2'], dtype='<U4'),
array(['igs', 'reun'], dtype='<U4')]
Now, this is the output from the `cluster_ties` list, which is identical to the red-highlighted text from the `public.gamit_subnets` table for DOY 180 in 2022:
>>> b[0]
[array(['igs', 'cas1'], dtype='<U4'),
array(['igs', 'darw'], dtype='<U4'),
array(['igs', 'dumg'], dtype='<U4'),
array(['igs', 'hob2'], dtype='<U4'),
array(['igs', 'hrao'], dtype='<U4'),
array(['igs', 'kiru'], dtype='<U4'),
array(['igs', 'mal2'], dtype='<U4'),
array(['igs', 'mobs'], dtype='<U4'),
array(['igs', 'nklg'], dtype='<U4'),
array(['igs', 'pol2'], dtype='<U4'),
array(['igs', 'reun'], dtype='<U4')]
Looking at two additional entries from `public.gamit_subnets`, the `clusters` dictionary, and the `cluster_ties` list confirms the pattern.
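The overlap can be reproduced offline with a small sketch. It assumes, as the highlighting suggests, that the table row is the cluster list followed by the tie list. The station codes are copied from `a['stations'][0]` and `b[0]` above with the `igs` network prefix dropped; `find_repeats` is a hypothetical helper, not an existing function in the codebase.

```python
from collections import Counter

# Station codes copied from a['stations'][0] and b[0] above
# (the 'igs' network prefix is dropped for brevity).
cluster_0 = ["badg", "cas1", "coco", "daej", "darw", "dumg", "guam", "hob2",
             "hrao", "iisc", "kiru", "mal2", "mcil", "mobs", "nklg", "pohn",
             "pol2", "reun"]
ties_0 = ["cas1", "darw", "dumg", "hob2", "hrao", "kiru", "mal2", "mobs",
          "nklg", "pol2", "reun"]


def find_repeats(stations):
    """Return the sorted station codes that occur more than once (hypothetical helper)."""
    counts = Counter(stations)
    return sorted(code for code, n in counts.items() if n > 1)


# Assumption: the database row is the cluster stations followed by the ties.
merged = cluster_0 + ties_0
repeats = find_repeats(merged)

print(f"{len(merged)} entries, {len(set(merged))} unique")
print("repeated:", repeats)
```

For this subnet, every tie station is already present in the cluster list, so the merged row carries 11 redundant entries.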
Questions
- Was this the case with earlier runs that @eckendrick was doing, such as `public.gamit_soln` 2022 days 001-008?
  - If not, this might be a bug with these lines that check for tie point repeats on load from the database.
  - If the tie points and stations are getting added together inside of `GamitSession`, we can fix the issue with the code from the previous bullet or similar.
- What is the default and preferred behavior for handling stations, and should subnetwork `stations` include the tie stations?
  - Reading this comment, it looks like `GamitSession` currently wants these two data objects (tie points and station clusters) not to overlap.
  - Regardless of what the current default behavior is, we should intentionally determine what behavior makes sense, and whether we want to change it.
  - Having the clusters include the tie stations (or not) will impact other downstream code, such as how subnetwork plots are currently handled.
  - Having the clusters include the tie stations (or not) will also impact the 'check' that's run when determining how large the subnetworks are (should it be the 'base' size of the clusters, or the 'expanded' size that includes the tie points?).
  - @demiangomez my intuition is that it will make more sense to change the behavior in `GamitSession` than what is set up in `pyNetwork`.
- Regardless of what the default behavior is or where the tie stations and subnetworks are being double merged, we should be testing for repeats:
  - With unit tests that tell us (and fail submitted PRs) if the control logic needlessly duplicates entries.
  - With runtime checks that can detect, fix, and remove duplicate stations before time-consuming numerics.
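
As a starting point for discussion, the two kinds of checks could look roughly like the sketch below. All names here (`find_repeats`, `drop_repeated_ties`, the test function) are hypothetical and do not exist in `GamitSession` or `pyNetwork`. The input is toy data with made-up station codes. Stations are assumed to be hashable `(network, station)` tuples, so the NumPy arrays shown above would need converting first, e.g. with `tuple(arr)`.

```python
def find_repeats(stations):
    """Return each station that appears more than once, in first-repeat order."""
    seen, repeats = set(), []
    for station in stations:
        if station in seen and station not in repeats:
            repeats.append(station)
        seen.add(station)
    return repeats


def drop_repeated_ties(cluster, ties):
    """Runtime fix: keep only ties that are not in the cluster or earlier in ties."""
    seen = set(cluster)
    kept = []
    for station in ties:
        if station not in seen:
            seen.add(station)
            kept.append(station)
    return kept


def test_merged_subnet_has_no_repeats():
    """Unit-test sketch: merging a cluster with cleaned ties must not duplicate entries."""
    # Toy input with made-up station codes, not real clustering output.
    cluster = [("net", "aaaa"), ("net", "bbbb"), ("net", "cccc")]
    ties = [("net", "bbbb"), ("net", "dddd"), ("net", "dddd")]

    # The naive merge repeats 'bbbb' (cluster and ties) and 'dddd' (twice in ties).
    assert find_repeats(cluster + ties) == [("net", "bbbb"), ("net", "dddd")]

    # After the runtime fix, the merged list is free of repeats.
    merged = cluster + drop_repeated_ties(cluster, ties)
    assert find_repeats(merged) == []
    assert merged == cluster + [("net", "dddd")]


test_merged_subnet_has_no_repeats()
```

Whether the fix belongs where the ties are generated or where they are merged depends on the answer to the second question above. The sketch only shows that both the detection and the cleanup are cheap compared to the numerics that follow.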