HSE-School of linguistics at Russian Paraphrase Detection shared task
The task of paraphrase detection is to decide whether two sentences are semantically equivalent, or to assign a score to their degree of equivalence. Doing this effectively requires methods of meaning representation. For representing the meaning of words, vector space models such as word2vec and GloVe have proved effective. A common approach for sequences is to take the mean of all the word vectors; this is, however, a rough approximation. To capture sequence meaning more accurately, we explore other suggested techniques, including a modification of the BM25 algorithm with IDF weighting, binning of per-dimension similarities, and binning of max similarities. We try several pre-trained word2vec models with different parameters, as well as their combination. We also experiment with the recently published syntactic parser SyntaxNet, which determines the syntactic relationships between words in a sentence and represents them as a dependency parse tree. We compute the tree edit distance between the two dependency trees and use it as another feature. Combining these semantic features with simple surface features such as precision, recall, and BLEU score lets us achieve the best results in Tasks 1 and 2 (non-standard runs).
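As an illustration of the baseline described above, the following is a minimal sketch of two kinds of pair features: the cosine similarity of mean word vectors and unigram precision and recall. It is not the system's actual implementation. The tiny two-dimensional `TOY_VECTORS` table stands in for a trained word2vec model, and all function names here are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy embeddings; a real system would load trained word2vec vectors.
TOY_VECTORS = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "sleeps": [0.0, 1.0],
    "naps": [0.1, 0.9],
}


def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token of this sentence is in the vocabulary
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def overlap_precision_recall(candidate, reference):
    """Unigram precision and recall of candidate tokens against reference tokens."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    precision = overlap / len(candidate) if candidate else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


def pair_features(tokens_a, tokens_b, vectors):
    """Combine one semantic feature and two surface features for a sentence pair."""
    mean_a = mean_vector(tokens_a, vectors)
    mean_b = mean_vector(tokens_b, vectors)
    similarity = 0.0
    if mean_a is not None and mean_b is not None:
        similarity = cosine(mean_a, mean_b)
    precision, recall = overlap_precision_recall(tokens_a, tokens_b)
    return {
        "mean_vec_cosine": similarity,
        "precision": precision,
        "recall": recall,
    }


if __name__ == "__main__":
    print(pair_features(["cat", "sleeps"], ["kitten", "naps"], TOY_VECTORS))
```

In the toy pair above, the two sentences share no tokens, so the surface features are zero while the mean-vector similarity is high. This is the kind of complementary signal that motivates combining both feature families in a classifier.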