Conference Articles

Word embedding based semantic cross-lingual document alignment in comparable corpora

Debasis Ganguly, IBM Research Europe, Ireland
Haithem Afli, The ADAPT Centre
Dwaipayan Roy, Indian Statistical Institute, Kolkata

Document Type

Conference Article

Publication Title

ACM International Conference Proceeding Series

Abstract

Cross-lingual information retrieval (CLIR) can be used to align documents across comparable corpora. However, because of the term independence assumption, traditional CLIR cannot account for the semantic similarity between the constituent words of candidate document pairs in two different languages. Moreover, traditional CLIR models score a document by aggregating only the weights of the terms that match those of the query, while the non-matching terms of the document contribute little to the similarity function. Word vector embeddings make it possible to model the semantic distances between terms by applying standard distance metrics to their corresponding real-valued vectors. This paper develops a word vector embedding based CLIR model that uses the average distances between the embedded word vectors of the source and target language documents to rank candidate document pairs. Our experiments with the WMT bilingual document alignment dataset show that the word vector based similarity significantly improves the recall of cross-lingual document alignment compared with classical language modeling based CLIR.
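The ranking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that source and target words already live in one shared bilingual embedding space, that cosine distance is the metric, and that the average is taken over all source-target word pairs; the abstract specifies none of these details. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat a zero vector as maximally uninformative
    return 1.0 - dot / (norm_u * norm_v)


def average_pairwise_distance(src_tokens, tgt_tokens, embeddings):
    """Mean distance over all (source word, target word) pairs.

    Words missing from the embedding table are skipped. Unlike exact
    term matching, every embedded word contributes to the score.
    """
    src_vecs = [embeddings[t] for t in src_tokens if t in embeddings]
    tgt_vecs = [embeddings[t] for t in tgt_tokens if t in embeddings]
    if not src_vecs or not tgt_vecs:
        return float("inf")  # nothing to compare, rank last
    total = sum(cosine_distance(u, v) for u in src_vecs for v in tgt_vecs)
    return total / (len(src_vecs) * len(tgt_vecs))


def rank_candidates(src_tokens, candidates, embeddings):
    """Sort candidate target documents by ascending average distance."""
    scored = [
        (doc_id, average_pairwise_distance(src_tokens, tokens, embeddings))
        for doc_id, tokens in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Toy shared English-French embedding space (hypothetical values).
toy_embeddings = {
    "dog": [1.0, 0.0],
    "chien": [0.9, 0.1],
    "bank": [0.0, 1.0],
    "banque": [0.1, 0.9],
}

source_doc = ["dog"]
target_docs = {"fr_1": ["chien"], "fr_2": ["banque"]}
ranking = rank_candidates(source_doc, target_docs, toy_embeddings)
print(ranking[0][0])  # the closest candidate under these toy vectors
```

In this toy setup the French document containing "chien" ranks ahead of the one containing "banque" for the English source "dog", even though no surface terms match. Averaging over all word pairs is quadratic in document length, so a real system would likely need a cheaper aggregation or a candidate pre-filter.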

DOI

10.1145/3293339.3293346

Publication Date

12-6-2018

Recommended Citation

Ganguly, Debasis; Afli, Haithem; and Roy, Dwaipayan, "Word embedding based semantic cross-lingual document alignment in comparable corpora" (2018). Conference Articles. 23.
https://digitalcommons.isical.ac.in/conf-articles/23
