
Commit 0ae217f

Moved references to bibliography.bib.
1 parent 4c80e8d commit 0ae217f

File tree

3 files changed: +355 -72 lines changed


README.md

Lines changed: 161 additions & 14 deletions
@@ -42,8 +42,8 @@ the package’s functionality (without neural-symbolic components) are
given by (Karapanagiotis and Liebald 2023).

The training loops for both deep and symbolic learning models are
-implemented in [tensorflow](https://www.tensorflow.org) (Abadi et al.
-2015). The pure deep learning model inherits from the
+implemented in [tensorflow](https://www.tensorflow.org) (Martín Abadi et
+al. 2015). The pure deep learning model inherits from the
[keras](https://keras.io) model class (Chollet et al. 2015). The
neural-symbolic model is implemented using the logic tensor network
([LTN](https://pypi.org/project/ltn/)) framework (Badreddine et al.
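
The paragraph in this hunk describes the package's two training paths: a pure deep learning matcher that subclasses the keras model class, and a neural-symbolic variant built on the LTN framework. The sketch below only illustrates that general pattern; the class name `PairMatcher`, the input layout, and the wiring into `ltn` are assumptions for illustration, not the package's actual API.

```python
import tensorflow as tf
import ltn  # logic tensor networks, https://pypi.org/project/ltn/

class PairMatcher(tf.keras.Model):
    """Illustrative pairwise matcher: scores two encoded records in [0, 1]."""

    def __init__(self, hidden_units=64):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(hidden_units, activation="relu")
        self.score = tf.keras.layers.Dense(1, activation="sigmoid")

    def call(self, inputs):
        left, right = inputs  # two encoded record representations
        features = tf.concat([left, right, tf.abs(left - right)], axis=-1)
        return self.score(self.hidden(features))

matcher = PairMatcher()

# Pure deep learning path: the usual keras training loop.
matcher.compile(optimizer="adam", loss="binary_crossentropy")
# matcher.fit([left_encodings, right_encodings], labels, epochs=10)

# Neural-symbolic path: the same network can be wrapped as an LTN predicate,
# so that fuzzy-logic axioms (e.g. symmetry of the match relation) can enter
# the loss. The axioms actually used by the package are not shown here.
Match = ltn.Predicate(matcher)
```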
@@ -163,7 +163,80 @@ was generated with the R package

# Alternative Software

-TODO
+Several state-of-the-art entity matching (EM) systems have been
+developed in recent years, utilizing different methodologies to address
+the challenges of EM tasks. Below, we highlight some of the most recent,
+best-performing and/or most recognized EM systems:
+
+- [**HierGAT**](https://github.com/CGCL-codes/HierGAT): HierGAT
+  introduces a Hierarchical Graph Attention Transformer Network to model
+  and leverage interdependence between EM decisions and attributes. It
+  uses a graph attention mechanism to identify discriminative words and
+  attributes, combined with contextual embeddings to enrich word
+  representations, enabling a more nuanced and interconnected approach
+  to EM (Yao et al. 2022).
+
+- [**Ditto**](https://github.com/megagonlabs/ditto): Ditto leverages
+  pre-trained Transformer-based language models to cast EM as a
+  sequence-pair classification task, enhancing matching quality through
+  fine-tuning. It incorporates optimizations such as domain-specific
+  highlighting, string summarization to retain essential information,
+  and advanced data augmentation to improve training, making it both
+  efficient and effective for large-scale EM tasks (Li et al. 2020).
+
+- **CorDEL**: CorDEL employs a contrastive deep learning framework that
+  moves beyond twin-network architectures by focusing on both syntactic
+  and semantic matching signals while emphasizing critical subtle
+  differences. The approach includes a simple yet effective variant,
+  CorDEL-Sum, to enhance the model’s ability to discern nuanced
+  relationships in data (Wang et al. 2020).
+
+- [**DAEM**](https://github.com/nju-websoft/DAEM): This approach
+  combines a deep neural network for EM with adversarial active
+  learning, enabling the automatic completion of missing textual values
+  and the modeling of both similarities and differences between records.
+  It integrates active learning to curate high-quality labeled examples,
+  adversarial learning for augmented stability, and a dynamic blocking
+  method for scalable database handling, ensuring efficient and robust
+  EM performance (Huang et al. 2023).
+
+- [**AdaMEL**](https://github.com/DerekDiJin/AdaMEL-supplementary):
+  AdaMEL introduces a deep transfer learning framework for multi-source
+  entity linkage, addressing challenges of incremental data and source
+  variability by learning high-level generic knowledge. It employs an
+  attribute-level self-attention mechanism to model attribute importance
+  and leverages domain adaptation to utilize unlabeled data from new
+  sources, enabling source-agnostic EM while accommodating additional
+  labeled data for enhanced accuracy (Jin et al. 2021).
+
+- [**DeepMatcher**](https://github.com/anhaidgroup/deepmatcher): This
+  framework is one of the first to introduce deep learning (DL) to
+  entity matching, categorizing learning approaches into SIF, RNN,
+  Attention, and Hybrid models based on their representational power. It
+  highlights DL’s strengths in handling textual and dirty EM tasks while
+  identifying its limitations in structured EM, offering valuable
+  insights for both researchers and practitioners (Mudgal et al. 2018).
+
+- **SETEM**: SETEM introduces a self-ensemble training method for EM to
+  overcome challenges in real-world scenarios, such as small datasets,
+  hard negatives, and unseen entities, where traditional Pre-trained
+  Language Model (PLM)-based methods often struggle due to their
+  reliance on large labeled datasets and overlapping benchmarks. By
+  leveraging the stability and generalization of ensemble models, SETEM
+  effectively addresses these limitations while maintaining low memory
+  consumption and high label efficiency. Additionally, it incorporates a
+  faster training method designed for low-resource applications,
+  ensuring adaptability and scalability for practical EM tasks (Ding et
+  al. 2024).
+
+- **AttendEM**: AttendEM introduces a novel framework for entity
+  matching (EM) that enhances transformer architectures through
+  intra-transformer ensembling, distinct text rearrangements, additional
+  aggregator tokens, and extra self-attention layers. Departing from the
+  focus on text cleaning and data augmentation in existing solutions,
+  AttendEM innovates within the base model design, offering a distinct
+  approach to pairwise duplicate identification across databases (Low,
+  Fung, and Xiong 2024).

# Contributors
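
As a concrete illustration of the sequence-pair formulation that several systems in the list above rely on (Ditto in particular, Li et al. 2020), the snippet below sketches how a record pair could be serialized into one text sequence for a fine-tuned Transformer classifier. The `COL`/`VAL` markers follow the serialization scheme described in the Ditto paper; the helper names and the example records are illustrative assumptions, not code from any of the listed systems.

```python
def serialize(record: dict) -> str:
    """Flatten one record into Ditto-style 'COL <attribute> VAL <value>' text."""
    return " ".join(f"COL {attribute} VAL {value}" for attribute, value in record.items())

def serialize_pair(left: dict, right: dict) -> str:
    """Join two serialized records; a pre-trained language model then
    classifies the resulting sequence as match or non-match."""
    return f"{serialize(left)} [SEP] {serialize(right)}"

left = {"title": "iPhone 13 128GB", "brand": "Apple"}
right = {"title": "Apple iPhone 13 (128 GB)", "brand": "Apple"}
print(serialize_pair(left, right))
# COL title VAL iPhone 13 128GB COL brand VAL Apple [SEP]
# COL title VAL Apple iPhone 13 (128 GB) COL brand VAL Apple
```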

@@ -184,15 +257,6 @@ The package is distributed under the [MIT license](LICENSE.txt).
<div id="refs" class="references csl-bib-body hanging-indent"
entry-spacing="0">

-<div id="ref-tensorflow2015" class="csl-entry">
-
-Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen,
-Craig Citro, Greg S. Corrado, et al. 2015. “TensorFlow: Large-Scale
-Machine Learning on Heterogeneous Systems.”
-<https://www.tensorflow.org/>.
-
-</div>
-
<div id="ref-badreddine2022" class="csl-entry">

Badreddine, Samy, Artur d’Avila Garcez, Luciano Serafini, and Michael
@@ -203,15 +267,98 @@ Spranger. 2022. “Logic Tensor Networks.” *Artificial Intelligence* 303:

<div id="ref-keras2015" class="csl-entry">

-Chollet, François et al. 2015. “Keras.” <https://keras.io>.
+Chollet, François et al. 2015. “Keras.”
+
+</div>
+
+<div id="ref-ding2024" class="csl-entry">
+
+Ding, Huahua, Chaofan Dai, Yahui Wu, Wubin Ma, and Haohao Zhou. 2024.
+“SETEM: <span class="nocase">Self-ensemble</span> Training with
+<span class="nocase">Pre-trained Language Models</span> for Entity
+Matching.” *Knowledge-Based Systems* 293 (June): 111708.
+<https://doi.org/10.1016/j.knosys.2024.111708>.
+
+</div>
+
+<div id="ref-huang2023" class="csl-entry">
+
+Huang, Jiacheng, Wei Hu, Zhifeng Bao, Qijin Chen, and Yuzhong Qu. 2023.
+“Deep Entity Matching with Adversarial Active Learning.” *The VLDB
+Journal* 32 (1): 229–55. <https://doi.org/10.1007/s00778-022-00745-1>.
+
+</div>
+
+<div id="ref-jin2021" class="csl-entry">
+
+Jin, Di, Bunyamin Sisman, Hao Wei, Xin Luna Dong, and Danai Koutra.
+2021. “Deep Transfer Learning for Multi-Source Entity Linkage via Domain
+Adaptation.” In *Proceedings of the VLDB Endowment*, 15:465–77.
+<https://doi.org/10.14778/3494124.3494131>.

</div>

<div id="ref-karapanagiotis2023" class="csl-entry">

Karapanagiotis, Pantelis, and Marius Liebald. 2023. “Entity Matching
with Similarity Encoding: A Supervised Learning Recommendation Framework
-for Linking (Big) Data.” <http://dx.doi.org/10.2139/ssrn.4541376>.
+for Linking (Big) Data.”
+
+</div>
+
+<div id="ref-li2020" class="csl-entry">
+
+Li, Yuliang, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew
+Tan. 2020. “Deep Entity Matching with Pre-Trained Language Models.”
+*Proceedings of the VLDB Endowment* 14 (1): 50–60.
+<https://doi.org/10.14778/3421424.3421431>.
+
+</div>
+
+<div id="ref-low2024" class="csl-entry">
+
+Low, Jwen Fai, Benjamin C. M. Fung, and Pulei Xiong. 2024. “Better
+Entity Matching with Transformers Through Ensembles.” *Knowledge-Based
+Systems* 293 (June): 111678.
+<https://doi.org/10.1016/j.knosys.2024.111678>.
+
+</div>
+
+<div id="ref-tensorflow2015" class="csl-entry">
+
+Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen,
+Craig Citro, Greg S. Corrado, et al. 2015. “TensorFlow:
+<span class="nocase">Large-scale</span> Machine Learning on
+Heterogeneous Systems.”
+
+</div>
+
+<div id="ref-mudgal2018" class="csl-entry">
+
+Mudgal, Sidharth, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon
+Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay
+Raghavendra. 2018. “Deep Learning for Entity Matching: A Design Space
+Exploration.” In *Proceedings of the 2018 International Conference on
+Management of Data*, 19–34. <https://doi.org/10.1145/3183713.3196926>.
+
+</div>
+
+<div id="ref-wang2020" class="csl-entry">
+
+Wang, Zhengyang, Bunyamin Sisman, Hao Wei, Xin Luna Dong, and Shuiwang
+Ji. 2020. “CorDEL: A Contrastive Deep Learning Approach for Entity
+Linkage.” In *2020 IEEE International Conference on Data Mining (ICDM)*,
+1322–27. IEEE. <https://doi.org/10.1109/ICDM50108.2020.00171>.
+
+</div>
+
+<div id="ref-yao2022" class="csl-entry">
+
+Yao, Dezhong, Yuhong Gu, Gao Cong, Hai Jin, and Xinqiao Lv. 2022.
+“Entity Resolution with Hierarchical Graph Attention Networks.” In
+*Proceedings of the 2022 International Conference on Management of
+Data*, 429–42. Philadelphia PA USA: ACM.
+<https://doi.org/10.1145/3514221.3517872>.

</div>

docs/source/_static/README.qmd

Lines changed: 17 additions & 1 deletion
@@ -123,7 +123,23 @@ The logo was designed using [Microsoft Designer](https://designer.microsoft.com/

# Alternative Software

-TODO
+Several state-of-the-art entity matching (EM) systems have been developed in recent years, utilizing different methodologies to address the challenges of EM tasks. Below, we highlight some of the most recent, best-performing and/or most recognized EM systems:
+
+- [**HierGAT**](https://github.com/CGCL-codes/HierGAT): HierGAT introduces a Hierarchical Graph Attention Transformer Network to model and leverage interdependence between EM decisions and attributes. It uses a graph attention mechanism to identify discriminative words and attributes, combined with contextual embeddings to enrich word representations, enabling a more nuanced and interconnected approach to EM [@yao2022].
+
+- [**Ditto**](https://github.com/megagonlabs/ditto): Ditto leverages pre-trained Transformer-based language models to cast EM as a sequence-pair classification task, enhancing matching quality through fine-tuning. It incorporates optimizations such as domain-specific highlighting, string summarization to retain essential information, and advanced data augmentation to improve training, making it both efficient and effective for large-scale EM tasks [@li2020].
+
+- **CorDEL**: CorDEL employs a contrastive deep learning framework that moves beyond twin-network architectures by focusing on both syntactic and semantic matching signals while emphasizing critical subtle differences. The approach includes a simple yet effective variant, CorDEL-Sum, to enhance the model's ability to discern nuanced relationships in data [@wang2020].
+
+- [**DAEM**](https://github.com/nju-websoft/DAEM): This approach combines a deep neural network for EM with adversarial active learning, enabling the automatic completion of missing textual values and the modeling of both similarities and differences between records. It integrates active learning to curate high-quality labeled examples, adversarial learning for augmented stability, and a dynamic blocking method for scalable database handling, ensuring efficient and robust EM performance [@huang2023].
+
+- [**AdaMEL**](https://github.com/DerekDiJin/AdaMEL-supplementary): AdaMEL introduces a deep transfer learning framework for multi-source entity linkage, addressing challenges of incremental data and source variability by learning high-level generic knowledge. It employs an attribute-level self-attention mechanism to model attribute importance and leverages domain adaptation to utilize unlabeled data from new sources, enabling source-agnostic EM while accommodating additional labeled data for enhanced accuracy [@jin2021].
+
+- [**DeepMatcher**](https://github.com/anhaidgroup/deepmatcher): This framework is one of the first to introduce deep learning (DL) to entity matching, categorizing learning approaches into SIF, RNN, Attention, and Hybrid models based on their representational power. It highlights DL's strengths in handling textual and dirty EM tasks while identifying its limitations in structured EM, offering valuable insights for both researchers and practitioners [@mudgal2018].
+
+- **SETEM**: SETEM introduces a self-ensemble training method for EM to overcome challenges in real-world scenarios, such as small datasets, hard negatives, and unseen entities, where traditional Pre-trained Language Model (PLM)-based methods often struggle due to their reliance on large labeled datasets and overlapping benchmarks. By leveraging the stability and generalization of ensemble models, SETEM effectively addresses these limitations while maintaining low memory consumption and high label efficiency. Additionally, it incorporates a faster training method designed for low-resource applications, ensuring adaptability and scalability for practical EM tasks [@ding2024].
+
+- **AttendEM**: AttendEM introduces a novel framework for entity matching (EM) that enhances transformer architectures through intra-transformer ensembling, distinct text rearrangements, additional aggregator tokens, and extra self-attention layers. Departing from the focus on text cleaning and data augmentation in existing solutions, AttendEM innovates within the base model design, offering a distinct approach to pairwise duplicate identification across databases [@low2024].

# Contributors
