
Comparative Analysis of Semantic Similarity Word Embedding Techniques for Paraphrase Detection

EasyChair Preprint no. 5833

14 pages. Date: June 16, 2021


A paraphrase is typically a restatement or rephrasing of a text or passage that conveys its meaning in different words. Paraphrasing has applications in fields such as text summarization, plagiarism detection, question answering, machine translation, text grouping, and sentiment analysis. Most current state-of-the-art plagiarism detection tools focus on verbatim reproduction of a document and do not account for its semantic properties, so paraphrase plagiarism goes undetected in many cases. This paper gives an overview and comparison of the performance of five word embedding models for semantic similarity (TF-IDF, Word2Vec, Doc2Vec, FastText, and BERT) on two publicly available corpora: Quora Question Pairs (QQP) and Plagiarized Short Answers (PSA). Based on an extensive literature review and experiments, the most appropriate text preprocessing approaches, distance measures, and thresholds are selected for detecting semantic similarity and paraphrasing. The paper concludes that FastText is the most efficient of the five models, both in terms of evaluation metrics (accuracy, precision, recall, and F1-score) and in terms of resource consumption. It also compares all five models with each other on these metrics.
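The general pipeline described in the abstract (embed each text, compute a distance measure, compare against a threshold) can be sketched as follows. This is a minimal illustration using TF-IDF with cosine similarity in plain Python, not the authors' implementation: the tokenizer, the smoothed IDF formula, the function names, and the threshold value of 0.5 are all assumptions chosen for the example, and the abstract does not state which preprocessing steps or thresholds the paper settled on.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on non-alphanumeric characters (assumed preprocessing)."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed IDF (an assumption for this sketch): log((1 + N) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_paraphrase(u, v, threshold=0.5):
    """Flag a pair as a paraphrase when similarity reaches the (hypothetical) threshold."""
    return cosine(u, v) >= threshold


if __name__ == "__main__":
    docs = [
        "How do I learn Python quickly?",
        "How can I learn Python fast?",
        "What is the capital of France?",
    ]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]), is_paraphrase(vecs[0], vecs[1]))
    print(cosine(vecs[0], vecs[2]), is_paraphrase(vecs[0], vecs[2]))
```

TF-IDF only rewards shared surface terms, which is why the comparison in the paper also covers dense embeddings such as Word2Vec, FastText, and BERT. In a sketch like this, swapping the model would mean replacing `tfidf_vectors` while keeping the similarity and threshold steps unchanged.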

Keyphrases: deep learning model, paraphrase detection, Paraphrase Identification, paraphrasing, Plagiarism, plagiarism detection, Semantic Similarity Detection, word embedding

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@booklet{EasyChair:5833,
  author = {Shrutika Chawla and Preeti Aggarwal and Ravreet Kaur},
  title = {Comparative Analysis of Semantic Similarity Word Embedding Techniques for Paraphrase Detection},
  howpublished = {EasyChair Preprint no. 5833},
  year = {EasyChair, 2021}}