
Commit d9c21c4

First Pull Request (#1)
Added alternative software section in README.
1 parent: 9ef59d8

2 files changed (+78, −19 lines)

README.md

Lines changed: 63 additions & 16 deletions
@@ -163,7 +163,25 @@ was generated with the R package
 
 # Alternative Software
 
-TODO
+Several state-of-the-art entity matching (EM) systems have been developed in recent years, each taking a different methodological route to the core challenges of EM. Below we highlight some of the most recent, best-performing, and most widely recognized systems:
+
+- [**HierGAT**](https://github.com/CGCL-codes/HierGAT) ([Yao et al., 2022](https://dl.acm.org/doi/10.1145/3514221.3517872)): HierGAT introduces a Hierarchical Graph Attention Transformer Network to model and exploit the interdependence between EM decisions and attributes. A graph attention mechanism identifies discriminative words and attributes, and contextual embeddings enrich the word representations, yielding a more nuanced, interconnected approach to EM.
+
+- [**Ditto**](https://github.com/megagonlabs/ditto) ([Li et al., 2020](https://dl.acm.org/doi/10.14778/3421424.3421431)): Ditto leverages pre-trained Transformer-based language models to cast EM as a sequence-pair classification task, enhancing matching quality through fine-tuning; a minimal sketch of this formulation follows the diff below. It incorporates optimizations such as domain-specific highlighting, string summarization that retains only essential information, and data augmentation during training, making it both efficient and effective for large-scale EM tasks.
+
+- **CorDEL** ([Wang et al., 2020](https://ieeexplore.ieee.org/document/9338287)): CorDEL employs a contrastive deep learning framework that moves beyond twin-network architectures by capturing both syntactic and semantic matching signals while emphasizing subtle but critical differences. The approach includes a simple yet effective variant, CorDEL-Sum, that further improves the model's ability to discern nuanced relationships in data.
+
+- [**DAEM**](https://github.com/nju-websoft/DAEM) ([Huang et al., 2023](https://dl.acm.org/doi/abs/10.1007/s00778-022-00745-1)): DAEM combines a deep neural network for EM with adversarial active learning, automatically completing missing textual values and modeling both the similarities and the differences between records. Active learning curates high-quality labeled examples, adversarial learning adds training stability, and a dynamic blocking method keeps the approach scalable to large databases.
+
+- [**AdaMEL**](https://github.com/DerekDiJin/AdaMEL-supplementary) ([Jin et al., 2021](https://dl.acm.org/doi/10.14778/3494124.3494131)): AdaMEL is a deep transfer learning framework for multi-source entity linkage that addresses incremental data and source variability by learning high-level, generic matching knowledge. An attribute-level self-attention mechanism models attribute importance, and domain adaptation exploits unlabeled data from new sources, enabling source-agnostic EM while still accommodating additional labeled data for higher accuracy.
+
+- [**DeepMatcher**](https://github.com/anhaidgroup/deepmatcher) ([Mudgal et al., 2018](https://dl.acm.org/doi/10.1145/3183713.3196926)): One of the first frameworks to bring deep learning (DL) to entity matching, DeepMatcher categorizes learning approaches into SIF, RNN, Attention, and Hybrid models according to their representational power. It demonstrates DL's strengths on textual and dirty EM tasks, identifies its limitations on structured EM, and offers valuable guidance for researchers and practitioners alike.
+
+- **SETEM** ([Ding et al., 2024](https://dl.acm.org/doi/10.1016/j.knosys.2024.111708)): SETEM is a self-ensemble training method for EM aimed at real-world conditions, such as small datasets, hard negatives, and unseen entities, where traditional pre-trained language model (PLM)-based methods struggle because they rely on large labeled datasets and overlapping benchmarks. It leverages the stability and generalization of ensemble models while keeping memory consumption low and label efficiency high, and it includes a faster training mode designed for low-resource applications.
+
+- **AttendEM** ([Low et al., 2024](https://www.sciencedirect.com/science/article/pii/S0950705124003137?dgcid=rss_sd_all)): AttendEM enhances transformer architectures for EM through intra-transformer ensembling, distinct text rearrangements, additional aggregator tokens, and extra self-attention layers. Rather than focusing on text cleaning and data augmentation as existing solutions do, it innovates within the base model design, offering a distinct approach to pairwise duplicate identification across databases.
+
+These systems represent significant advances in the EM field, covering approaches such as graph neural networks, attention mechanisms, transformers, and data augmentation. Depending on your project's requirements and data characteristics, they can serve as effective alternatives.
 
 # Contributors
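Most of the PLM-based systems above, Ditto in particular, reduce EM to sequence-pair classification over serialized records. The following is a minimal sketch of that formulation, assuming the `[COL]`/`[VAL]` tagging scheme described by Li et al. (2020); the `serialize` helper and the example records are invented for illustration and are not part of the Ditto code base.

```python
# Minimal sketch: entity matching as sequence-pair classification.
# The records, attribute names, and serialize() helper are illustrative only.

def serialize(record: dict) -> str:
    """Flatten a record into a tagged attribute-value string."""
    return " ".join(f"[COL] {attr} [VAL] {val}" for attr, val in record.items())

left = {"title": "iPhone 12 64GB black", "price": "729"}
right = {"title": "Apple iPhone 12 (64 GB) - black", "price": "729.00"}

# A fine-tuned Transformer classifier scores the serialized pair as
# match (1) or non-match (0).
pair = serialize(left) + " [SEP] " + serialize(right)
print(pair)
```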

@@ -184,35 +202,64 @@ The package is distributed under the [MIT license](LICENSE.txt).
 <div id="refs" class="references csl-bib-body hanging-indent"
 entry-spacing="0">
 
-<div id="ref-tensorflow2015" class="csl-entry">
-
-Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen,
-Craig Citro, Greg S. Corrado, et al. 2015. “TensorFlow: Large-Scale
-Machine Learning on Heterogeneous Systems.”
-<https://www.tensorflow.org/>.
-
-</div>
-
-<div id="ref-badreddine2022" class="csl-entry">
-
-Badreddine, Samy, Artur d’Avila Garcez, Luciano Serafini, and Michael
-Spranger. 2022. “Logic Tensor Networks.” *Artificial Intelligence* 303:
-103649. <https://doi.org/10.1016/j.artint.2021.103649>.
-
-</div>
-
-<div id="ref-keras2015" class="csl-entry">
-
-Chollet, François et al. 2015. “Keras.” <https://keras.io>.
-
-</div>
-
-<div id="ref-karapanagiotis2023" class="csl-entry">
-
-Karapanagiotis, Pantelis, and Marius Liebald. 2023. “Entity Matching
-with Similarity Encoding: A Supervised Learning Recommendation Framework
-for Linking (Big) Data.” <http://dx.doi.org/10.2139/ssrn.4541376>.
-
-</div>
+<div id="ref-abadi2015" class="csl-entry">
+Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. 2015. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.” <https://www.tensorflow.org/>.
+</div>
+<br>
+
+<div id="ref-badreddine2022" class="csl-entry">
+Badreddine, Samy, Artur d’Avila Garcez, Luciano Serafini, and Michael Spranger. 2022. “Logic Tensor Networks.” *Artificial Intelligence* 303: 103649. <https://doi.org/10.1016/j.artint.2021.103649>.
+</div>
+<br>
+
+<div id="ref-chollet2015" class="csl-entry">
+Chollet, François et al. 2015. “Keras.” <https://keras.io>.
+</div>
+<br>
+
+<div id="ref-ding2024" class="csl-entry">
+Ding, Huahua, Chaofan Dai, Yahui Wu, Wubin Ma, and Haohao Zhou. 2024. “SETEM: Self-Ensemble Training with Pre-Trained Language Models for Entity Matching.” <https://dl.acm.org/doi/10.1016/j.knosys.2024.111708>.
+</div>
+<br>
+
+<div id="ref-huang2022" class="csl-entry">
+Huang, Jiacheng, Wei Hu, Zhifeng Bao, Qijin Chen, and Yuzhong Qu. 2023. “Deep Entity Matching with Adversarial Active Learning.” <https://dl.acm.org/doi/abs/10.1007/s00778-022-00745-1>.
+</div>
+<br>
+
+<div id="ref-jin2021" class="csl-entry">
+Jin, Di, Bunyamin Sisman, Hao Wei, Xin Luna Dong, and Danai Koutra. 2021. “Deep Transfer Learning for Multi-Source Entity Linkage via Domain Adaptation.” <https://dl.acm.org/doi/10.14778/3494124.3494131>.
+</div>
+<br>
+
+<div id="ref-karapanagiotis2023" class="csl-entry">
+Karapanagiotis, Pantelis, and Marius Liebald. 2023. “Entity Matching with Similarity Encoding: A Supervised Learning Recommendation Framework for Linking (Big) Data.” <http://dx.doi.org/10.2139/ssrn.4541376>.
+</div>
+<br>
+
+<div id="ref-li2020" class="csl-entry">
+Li, Yuliang, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. “Deep Entity Matching with Pre-Trained Language Models.” <https://dl.acm.org/doi/10.14778/3421424.3421431>.
+</div>
+<br>
+
+<div id="ref-low2024" class="csl-entry">
+Low, Jwen Fai, Benjamin C. M. Fung, and Pulei Xiong. 2024. “Better Entity Matching with Transformers Through Ensembles.” <https://www.sciencedirect.com/science/article/pii/S0950705124003137?dgcid=rss_sd_all>.
+</div>
+<br>
+
+<div id="ref-mudgal2018" class="csl-entry">
+Mudgal, Sidharth, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. “Deep Learning for Entity Matching: A Design Space Exploration.” <https://dl.acm.org/doi/10.1145/3183713.3196926>.
+</div>
+<br>
+
+<div id="ref-wang2020" class="csl-entry">
+Wang, Zhengyang, Bunyamin Sisman, Hao Wei, Xin Luna Dong, and Shuiwang Ji. 2020. “CorDEL: A Contrastive Deep Learning Approach for Entity Linkage.” <https://ieeexplore.ieee.org/document/9338287>.
+</div>
+<br>
+
+<div id="ref-yao2022" class="csl-entry">
+Yao, Dezhong, Yuhong Gu, Gao Cong, Hai Jin, and Xinqiao Lv. 2022. “Entity Resolution with Hierarchical Graph Attention Networks.” <https://dl.acm.org/doi/10.1145/3514221.3517872>.
 </div>
+<br>
 
 </div>

src/neer_match/matching_model.py

Lines changed: 15 additions & 3 deletions
@@ -432,9 +432,21 @@ def _training_loop_log_header(self) -> str:
         return "| " + " | ".join([f"{x:<10}" for x in headers]) + " |"
 
     def _training_loop_log_row(self, epoch: int, logs: dict) -> str:
-        recall = logs["TP"] / (logs["TP"] + logs["FN"])
-        precision = logs["TP"] / (logs["TP"] + logs["FP"])
-        f1 = 2.0 * precision * recall / (precision + recall)
+        # Define each metric as 0 when its denominator is 0 to avoid
+        # a ZeroDivisionError on degenerate batches.
+        if (logs["TP"] + logs["FN"]) != 0:
+            recall = logs["TP"] / (logs["TP"] + logs["FN"])
+        else:
+            recall = 0
+
+        if (logs["TP"] + logs["FP"]) != 0:
+            precision = logs["TP"] / (logs["TP"] + logs["FP"])
+        else:
+            precision = 0
+
+        if (precision + recall) != 0:
+            f1 = 2.0 * precision * recall / (precision + recall)
+        else:
+            f1 = 0
 
         values = [logs["BCE"].numpy(), recall, precision, f1, logs["Sat"].numpy()]
         row = f"| {epoch:<10} | " + " | ".join([f"{x:<10.4f}" for x in values]) + " |"
         return row