Description
Hi,
I am using dedupliFHIR to find duplicate records in a generated CSV:
[cage@.today cli]$ cat /tmp/dedup.csv
"unique_id","family_name","given_name","gender","birth_date"
"23","morrison","elizabeth","F","12/05/1953"
"24","morrison","elizabeth","F","12/05/1953"
I run:
[cage@.today cli]$ python3.9 ecqm_dedupe.py dedupe-data --fmt CSV /tmp/dedup.csv /tmp/ddd.csv
(I've cut some names here)
Stats for nerds:
blocking_rule row_count cumulative_rows cartesian match_key start
0 l."birth_date" = r."birth_date" 187 187 125751 0 0
1 (l."ssn" = r."ssn") AND (l."birth_date" = r."birth_date") 0 187 125751 1 187
2 l."phone" = r."phone" 48079 48266 125751 2 187
----- Estimating u probabilities using random sampling -----
u probability not trained for postal_code0 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code0 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code0 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
Estimated u probabilities using random sampling
Your model is not yet fully trained. Missing estimates for:
- street_address0 (no m values are trained).
- postal_code0 (some u values are not trained, no m values are trained).
- street_address1 (no m values are trained).
- postal_code1 (some u values are not trained, no m values are trained).
- phone (no m values are trained).
- given_name (no m values are trained).
- family_name (no m values are trained).
- birth_date (no m values are trained).
/home/cage/public_html/mdinteractive-00/scripts/dedupliFHIR/cli
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."ssn" = r."ssn"
Parameter estimates will be made for the following comparison(s):
- street_address0
- postal_code0
- street_address1
- postal_code1
- phone
- given_name
- family_name
- birth_date
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
WARNING:
Level Exact match on street_address0 on comparison street_address0 not observed in dataset, unable to train m value
WARNING:
Level Exact match on sector on comparison postal_code0 not observed in dataset, unable to train m value
WARNING:
Level Exact match on district on comparison postal_code0 not observed in dataset, unable to train m value
WARNING:
Level Exact match on area on comparison postal_code0 not observed in dataset, unable to train m value
WARNING:
Level All other comparisons on comparison postal_code0 not observed in dataset, unable to train m value
WARNING:
Level Exact match on street_address1 on comparison street_address1 not observed in dataset, unable to train m value
WARNING:
Level Exact match on sector on comparison postal_code1 not observed in dataset, unable to train m value
WARNING:
Level Exact match on district on comparison postal_code1 not observed in dataset, unable to train m value
WARNING:
Level Exact match on area on comparison postal_code1 not observed in dataset, unable to train m value
WARNING:
Level All other comparisons on comparison postal_code1 not observed in dataset, unable to train m value
WARNING:
Level Jaro-Winkler distance of given_name >= 0.7 on comparison given_name not observed in dataset, unable to train m value
WARNING:
Level All other comparisons on comparison given_name not observed in dataset, unable to train m value
WARNING:
Level All other comparisons on comparison family_name not observed in dataset, unable to train m value
WARNING:
Level DamerauLevenshtein distance <= 1 on comparison birth_date not observed in dataset, unable to train m value
WARNING:
Level Abs date difference <= 1 month on comparison birth_date not observed in dataset, unable to train m value
WARNING:
Level Abs date difference <= 1 year on comparison birth_date not observed in dataset, unable to train m value
WARNING:
Level Abs date difference <= 10 year on comparison birth_date not observed in dataset, unable to train m value
WARNING:
Level All other comparisons on comparison birth_date not observed in dataset, unable to train m value
Iteration 1: Largest change in params was 0.996 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.00392 in probability_two_random_records_match
EM converged after 2 iterations
m probability not trained for street_address0 - Exact match on street_address0 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for street_address1 - Exact match on street_address1 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - Jaro-Winkler distance of given_name >= 0.7 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - DamerauLevenshtein distance <= 1 (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 month (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 year (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 10 year (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
Your model is not yet fully trained. Missing estimates for:
- street_address0 (some m values are not trained).
- postal_code0 (some u values are not trained, some m values are not trained).
- street_address1 (some m values are not trained).
- postal_code1 (some u values are not trained, some m values are not trained).
- given_name (some m values are not trained).
- family_name (some m values are not trained).
- birth_date (some m values are not trained).
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."birth_date" = r."birth_date"
Parameter estimates will be made for the following comparison(s):
- street_address0
- postal_code0
- street_address1
- postal_code1
- phone
- given_name
- family_name
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- birth_date
WARNING:
Level Exact match on sector on comparison postal_code0 not observed in dataset, unable to train m value
WARNING:
Level Exact match on district on comparison postal_code0 not observed in dataset, unable to train m value
WARNING:
Level Exact match on area on comparison postal_code0 not observed in dataset, unable to train m value
WARNING:
Level Exact match on sector on comparison postal_code1 not observed in dataset, unable to train m value
WARNING:
Level Exact match on district on comparison postal_code1 not observed in dataset, unable to train m value
WARNING:
Level Exact match on area on comparison postal_code1 not observed in dataset, unable to train m value
Iteration 1: Largest change in params was -0.922 in the m_probability of street_address1, level Exact match on street_address1
Iteration 2: Largest change in params was 0.00042 in probability_two_random_records_match
EM converged after 2 iterations
m probability not trained for postal_code0 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
Your model is not yet fully trained. Missing estimates for:
- postal_code0 (some u values are not trained, some m values are not trained).
- postal_code1 (some u values are not trained, some m values are not trained).
- birth_date (some m values are not trained).
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
(l."street_address0" = r."street_address0") AND (l."postal_code0" = r."postal_code0")
Parameter estimates will be made for the following comparison(s):
- street_address1
- postal_code1
- phone
- given_name
- family_name
- birth_date
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- street_address0
- postal_code0
WARNING:
Level Exact match on street_address1 on comparison street_address1 not observed in dataset, unable to train m value
WARNING:
Level Exact match on sector on comparison postal_code1 not observed in dataset, unable to train m value
WARNING:
Level Exact match on district on comparison postal_code1 not observed in dataset, unable to train m value
WARNING:
Level Exact match on area on comparison postal_code1 not observed in dataset, unable to train m value
WARNING:
Level All other comparisons on comparison postal_code1 not observed in dataset, unable to train m value
WARNING:
Level Exact match on given_name on comparison given_name not observed in dataset, unable to train m value
WARNING:
Level Jaro-Winkler distance of given_name >= 0.88 on comparison given_name not observed in dataset, unable to train m value
WARNING:
Level All other comparisons on comparison given_name not observed in dataset, unable to train m value
WARNING:
Level Exact match on family_name on comparison family_name not observed in dataset, unable to train m value
WARNING:
Level Jaro-Winkler distance of family_name >= 0.88 on comparison family_name not observed in dataset, unable to train m value
WARNING:
Level All other comparisons on comparison family_name not observed in dataset, unable to train m value
WARNING:
Level DamerauLevenshtein distance <= 1 on comparison birth_date not observed in dataset, unable to train m value
WARNING:
Level Abs date difference <= 1 month on comparison birth_date not observed in dataset, unable to train m value
WARNING:
Level Abs date difference <= 1 year on comparison birth_date not observed in dataset, unable to train m value
WARNING:
Level Abs date difference <= 10 year on comparison birth_date not observed in dataset, unable to train m value
WARNING:
Level All other comparisons on comparison birth_date not observed in dataset, unable to train m value
Iteration 1: Largest change in params was 0.187 in the m_probability of phone, level Exact match on phone
Iteration 2: Largest change in params was 7.74e-10 in probability_two_random_records_match
EM converged after 2 iterations
m probability not trained for street_address1 - Exact match on street_address1 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - Exact match on given_name (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - Jaro-Winkler distance of given_name >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - Exact match on family_name (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - Jaro-Winkler distance of family_name >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - DamerauLevenshtein distance <= 1 (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 month (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 year (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 10 year (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
Your model is not yet fully trained. Missing estimates for:
- postal_code0 (some u values are not trained, some m values are not trained).
- postal_code1 (some u values are not trained, some m values are not trained).
- birth_date (some m values are not trained).
Blocking time: 0.03 seconds
Predict time: 0.49 seconds
-- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary. To produce predictions the following untrained trained parameters will use default values.
Comparison: 'postal_code0':
m values not fully trained
Comparison: 'postal_code0':
u values not fully trained
Comparison: 'postal_code1':
m values not fully trained
Comparison: 'postal_code1':
u values not fully trained
Comparison: 'birth_date':
m values not fully trained
The 'probability_two_random_records_match' setting has been set to the default value (0.0001).
If this is not the desired behaviour, either:
- assign a value for
probability_two_random_records_match
in your settings dictionary, or - estimate with the
linker.estimate_probability_two_random_records_match
function.
Completed iteration 1, num representatives needing updating: 0
The result is:
[cage@.today cli]$ cat /tmp/ddd.csv
,cluster_id,unique_id,path,family_name,given_name,gender,birth_date,id,truth_value,phone,street_address0,city0,state0,postal_code0,street_address1,city1,state1,postal_code1,ssn
394,23,23,,morrison,elizabeth,F,1953-12-05,,,,,,,,,,,,
458,24,24,,morrison,elizabeth,F,1953-12-05,,,,,,,,,,,,
The two records are not placed in the same cluster, even though they are identical apart from unique_id. Deduplication seems to work for other records, but this case is not detected as a duplicate, and I'm not sure why.
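In case it helps with triage, here is a minimal sketch of how the output could be checked for other cases like this one. It is my own helper, not part of dedupliFHIR: the function name `find_split_duplicates` and the choice of key fields are assumptions, and the sample is the output above trimmed to the populated columns. It groups rows by the identifying fields and reports any group whose rows landed in more than one cluster.

```python
import csv
import io
from collections import defaultdict

# Fields treated as the record's identity (an assumption for this check).
KEY_FIELDS = ("family_name", "given_name", "gender", "birth_date")


def find_split_duplicates(csv_file):
    """Return {key: sorted cluster ids} for identical records split across clusters."""
    clusters_by_key = defaultdict(set)
    for row in csv.DictReader(csv_file):
        # Normalise case and whitespace so trivially different rows still group together.
        key = tuple(row[field].strip().lower() for field in KEY_FIELDS)
        clusters_by_key[key].add(row["cluster_id"])
    # Keep only the keys whose rows were assigned to more than one cluster.
    return {key: sorted(ids) for key, ids in clusters_by_key.items() if len(ids) > 1}


# The two output rows from this report, with the empty trailing columns dropped.
sample = """,cluster_id,unique_id,path,family_name,given_name,gender,birth_date
394,23,23,,morrison,elizabeth,F,1953-12-05
458,24,24,,morrison,elizabeth,F,1953-12-05
"""

split = find_split_duplicates(io.StringIO(sample))
for key, cluster_ids in split.items():
    print(key, "->", cluster_ids)
```

To run it against the real output, pass an open file instead of the sample, for example `open("/tmp/ddd.csv", newline="")`. Any key it prints is a set of identical records that were not clustered together.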