Not matching 2 obvious records

Hi,
I am using dedupliFHIR to find duplicate records in a generated CSV:

[cage@.today cli]$ cat /tmp/dedup.csv 
"unique_id","family_name","given_name","gender","birth_date"
"23","morrison","elizabeth","F","12/05/1953"
"24","morrison","elizabeth","F","12/05/1953"

I run:
[cage@.today cli]$ python3.9 ecqm_dedupe.py dedupe-data --fmt CSV /tmp/dedup.csv /tmp/ddd.csv
(I've cut some names here)
Stats for nerds:
                                               blocking_rule  row_count  cumulative_rows  cartesian match_key  start
0                            l."birth_date" = r."birth_date"        187              187     125751         0      0
1  (l."ssn" = r."ssn") AND (l."birth_date" = r."birth_date")          0              187     125751         1    187
2                                      l."phone" = r."phone"      48079            48266     125751         2    187
----- Estimating u probabilities using random sampling -----
u probability not trained for postal_code0 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code0 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code0 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - street_address0 (no m values are trained).
    - postal_code0 (some u values are not trained, no m values are trained).
    - street_address1 (no m values are trained).
    - postal_code1 (some u values are not trained, no m values are trained).
    - phone (no m values are trained).
    - given_name (no m values are trained).
    - family_name (no m values are trained).
    - birth_date (no m values are trained).
/home/cage/public_html/mdinteractive-00/scripts/dedupliFHIR/cli

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."ssn" = r."ssn"

Parameter estimates will be made for the following comparison(s):
    - street_address0
    - postal_code0
    - street_address1
    - postal_code1
    - phone
    - given_name
    - family_name
    - birth_date

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 

WARNING:
Level Exact match on street_address0 on comparison street_address0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on sector on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on street_address1 on comparison street_address1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on sector on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Jaro-Winkler distance of given_name >= 0.7 on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison family_name not observed in dataset, unable to train m value

WARNING:
Level DamerauLevenshtein distance <= 1 on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 1 month on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 1 year on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 10 year on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison birth_date not observed in dataset, unable to train m value

Iteration 1: Largest change in params was 0.996 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.00392 in probability_two_random_records_match

EM converged after 2 iterations
m probability not trained for street_address0 - Exact match on street_address0 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for street_address1 - Exact match on street_address1 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - Jaro-Winkler distance of given_name >= 0.7 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - DamerauLevenshtein distance <= 1 (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 month (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 year (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 10 year (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - street_address0 (some m values are not trained).
    - postal_code0 (some u values are not trained, some m values are not trained).
    - street_address1 (some m values are not trained).
    - postal_code1 (some u values are not trained, some m values are not trained).
    - given_name (some m values are not trained).
    - family_name (some m values are not trained).
    - birth_date (some m values are not trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."birth_date" = r."birth_date"

Parameter estimates will be made for the following comparison(s):
    - street_address0
    - postal_code0
    - street_address1
    - postal_code1
    - phone
    - given_name
    - family_name

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - birth_date

WARNING:
Level Exact match on sector on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on sector on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code1 not observed in dataset, unable to train m value

Iteration 1: Largest change in params was -0.922 in the m_probability of street_address1, level `Exact match on street_address1`
Iteration 2: Largest change in params was 0.00042 in probability_two_random_records_match

EM converged after 2 iterations
m probability not trained for postal_code0 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - postal_code0 (some u values are not trained, some m values are not trained).
    - postal_code1 (some u values are not trained, some m values are not trained).
    - birth_date (some m values are not trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
(l."street_address0" = r."street_address0") AND (l."postal_code0" = r."postal_code0")

Parameter estimates will be made for the following comparison(s):
    - street_address1
    - postal_code1
    - phone
    - given_name
    - family_name
    - birth_date

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - street_address0
    - postal_code0

WARNING:
Level Exact match on street_address1 on comparison street_address1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on sector on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on given_name on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level Jaro-Winkler distance of given_name >= 0.88 on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level Exact match on family_name on comparison family_name not observed in dataset, unable to train m value

WARNING:
Level Jaro-Winkler distance of family_name >= 0.88 on comparison family_name not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison family_name not observed in dataset, unable to train m value

WARNING:
Level DamerauLevenshtein distance <= 1 on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 1 month on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 1 year on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 10 year on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison birth_date not observed in dataset, unable to train m value

Iteration 1: Largest change in params was 0.187 in the m_probability of phone, level `Exact match on phone`
Iteration 2: Largest change in params was 7.74e-10 in probability_two_random_records_match

EM converged after 2 iterations
m probability not trained for street_address1 - Exact match on street_address1 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - Exact match on given_name (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - Jaro-Winkler distance of given_name >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - Exact match on family_name (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - Jaro-Winkler distance of family_name >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - DamerauLevenshtein distance <= 1 (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 month (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 year (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 10 year (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - postal_code0 (some u values are not trained, some m values are not trained).
    - postal_code1 (some u values are not trained, some m values are not trained).
    - birth_date (some m values are not trained).
Blocking time: 0.03 seconds
Predict time: 0.49 seconds

 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'postal_code0':
    m values not fully trained
Comparison: 'postal_code0':
    u values not fully trained
Comparison: 'postal_code1':
    m values not fully trained
Comparison: 'postal_code1':
    u values not fully trained
Comparison: 'birth_date':
    m values not fully trained
The 'probability_two_random_records_match' setting has been set to the default value (0.0001). 
If this is not the desired behaviour, either: 
 - assign a value for `probability_two_random_records_match` in your settings dictionary, or 
 - estimate with the `linker.estimate_probability_two_random_records_match` function.
Completed iteration 1, num representatives needing updating: 0

The result is:
cage@.today cli] $ cat /tmp/ddd.csv 
,cluster_id,unique_id,path,family_name,given_name,gender,birth_date,id,truth_value,phone,street_address0,city0,state0,postal_code0,street_address1,city1,state1,postal_code1,ssn
394,23,23,,morrison,elizabeth,F,1953-12-05,,,,,,,,,,,,
458,24,24,,morrison,elizabeth,F,1953-12-05,,,,,,,,,,,,
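To check the output for shared clusters, a minimal sketch like the following can group `unique_id` values by `cluster_id`. It relies only on the `cluster_id` and `unique_id` columns visible in the output above; the helper name `clusters_with_duplicates` is made up for illustration and is not part of dedupliFHIR, and the sample rows are trimmed to a few columns.

```python
import csv
import io
from collections import defaultdict


def clusters_with_duplicates(csv_file):
    """Return {cluster_id: [unique_id, ...]} for clusters holding more than one record."""
    clusters = defaultdict(list)
    for row in csv.DictReader(csv_file):
        clusters[row["cluster_id"]].append(row["unique_id"])
    return {cid: ids for cid, ids in clusters.items() if len(ids) > 1}


# The two rows from the output above, trimmed to the relevant columns.
sample = io.StringIO(
    ",cluster_id,unique_id,family_name,given_name,gender,birth_date\n"
    "394,23,23,morrison,elizabeth,F,1953-12-05\n"
    "458,24,24,morrison,elizabeth,F,1953-12-05\n"
)

# An empty dict means no cluster contains more than one record.
print(clusters_with_duplicates(sample))
```

For the real file, the same function can be called with `open("/tmp/ddd.csv", newline="")` instead of the in-memory sample.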


It does not put these 2 records in the same cluster as duplicates. Other records seem to be matched correctly, but this pair is not detected as a duplicate, and I am not sure why.
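One detail in the log that may or may not be relevant: `probability_two_random_records_match` fell back to its default of 0.0001, and several m and u values were left untrained. Assuming the standard Fellegi-Sunter formulation (prior odds multiplied by one m/u ratio per comparison), the sketch below shows how much combined evidence such a low prior needs before a pair is scored as a likely match. The m/u ratios are hypothetical numbers chosen only for illustration; they are not values from this run, and the function is not dedupliFHIR's code.

```python
import math


def match_probability(prior, bayes_factors):
    """Posterior match probability: prior odds times the product of m/u ratios."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * math.prod(bayes_factors)
    return posterior_odds / (1.0 + posterior_odds)


prior = 0.0001  # default value reported in the log above

# Hypothetical m/u ratios for three agreeing comparisons (illustration only).
weak = match_probability(prior, [10.0, 10.0, 10.0])
strong = match_probability(prior, [100.0, 100.0, 100.0])

print(f"weak evidence:   {weak:.3f}")
print(f"strong evidence: {strong:.3f}")
```

With a prior this small, the combined ratio has to exceed roughly 10,000 before the probability passes 0.5, so comparisons that contribute little evidence could leave even identical-looking records below a clustering threshold.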


