Automated Comparison of Texts and Plagiarism Search
Based on Frequency Analysis
Marek Kowalski¹, Dorota Narojczyk², Marek Szczepański¹
¹ Cardinal Stefan Wyszyński University in Warsaw
² University of Finance and Management in Warsaw
This presentation aims to show methods of overcoming the most important difficulties in automated text
comparison based on frequency analysis. The main difficulty is that by using the standard cosine
similarity measure we “consider the global similarity of documents which may not lead to detecting
plagiarism”, see [1, p. 9]. We will deal with a refined mathematical model for information gathering,
processing and similarity evaluation involving the term frequency (TF) and inverse document frequency
(IDF) vectors. We will focus on the following issues:
1. Copyright and other legal limitations;
2. Safe indexing of the reference texts database (RTD);
3. Cascade clustering of the RTD;
4. Automated fragmenting of input texts;
5. Fast detection of most relevant clusters (MRC) based on their centroids;
6. Fast preselection (i.e., similarity search in the MRC) involving modified cosine measures;
7. Eliminating floating point operations from the preselection;
8. Eliminating false-positive similarities.
In practical implementations the initial clustering of the RTD has to be created according to appropriate
standards, e.g., the International Standard Classification of Education, see [5]. Once the initial clustering
is made, the process of clustering can be automated by using the Rocchio algorithm, see [3, 4].
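As an illustration only, the automated step can be pictured as nearest-centroid assignment in the spirit of Rocchio classification: each initial cluster is summarized by the mean of its term-weight vectors, and a new text joins the cluster whose centroid it is closest to in the cosine sense. The sketch below is a minimal Python version under that assumption; the cluster labels, the sample weights and the helper names `centroid`, `cosine` and `assign` are hypothetical, and the exact variant used in [3, 4] may differ.

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Mean of sparse term-weight vectors (dicts mapping term -> weight)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def cosine(u, v):
    """Standard cosine similarity of two sparse vectors (0.0 if either is zero)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign(vec, centroids):
    """Label of the centroid most similar to vec."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Hypothetical initial clustering (e.g. derived from a subject classification).
seed = {
    "law": [{"court": 2.0, "statute": 1.0}, {"court": 1.0, "contract": 2.0}],
    "biology": [{"cell": 2.0, "protein": 1.0}, {"cell": 1.0, "gene": 2.0}],
}
centroids = {label: centroid(vecs) for label, vecs in seed.items()}

new_text = {"gene": 1.0, "cell": 1.0, "court": 0.5}
label = assign(new_text, centroids)
print(label)
```

In a full system the centroids would be recomputed after each batch of assignments; that update loop is omitted here.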
Given an input text $T$ we define the MRC as the union of tokenized and lemmatized clusters
$B_{i_1}, \dots, B_{i_k}$ of $RTD = B_1 \cup \dots \cup B_n$ such that the values of the maxima
$$\max_{1 \le i \le n} \frac{\sum_{s \in t} w_t(s)\, c_i(s)}{\sqrt{\sum_{s \in t} (c_i(s))^2}\, \sqrt{\sum_{s \in t} (w_t(s))^2}}$$
are bigger than given numbers and attained for $i = i_1, \dots, i_k$ when $t$ ranges over the fragments of $T$.
Here $s$ is a lemmatized word in $t$, the numbers $\{c_i(s)\}_{s \in t}$ form the centroid of $B_i$, see [2, 4], and
$$w_t(s) = tf(s, t) \cdot \log(\#RTD / \#RTD(s)),$$
where $tf(s, t)$ is the number of appearances of $s$ in $t$, $RTD(s)$ is the subset of RTD consisting of the
texts containing $s$, and $\#X$ denotes the number of elements in $X$.
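The selection of the MRC can be sketched in Python as follows. This is a minimal illustration, not the implementation used in our system: the function names, the single `threshold` standing in for the "given numbers", and the choice to skip words unknown to the RTD are all assumptions made for the example, as are the sample document frequencies and centroids.

```python
import math

def fragment_weights(tokens, doc_freq, n_docs):
    """w_t(s) = tf(s, t) * log(#RTD / #RTD(s)); words absent from the RTD are skipped."""
    tf = {}
    for s in tokens:
        tf[s] = tf.get(s, 0) + 1
    return {s: n * math.log(n_docs / doc_freq[s])
            for s, n in tf.items() if doc_freq.get(s)}

def centroid_score(w_t, c_i):
    """Cosine-type score in which every sum runs only over the words s of t."""
    dot = sum(w * c_i.get(s, 0.0) for s, w in w_t.items())
    norm_c = math.sqrt(sum(c_i.get(s, 0.0) ** 2 for s in w_t))
    norm_w = math.sqrt(sum(w * w for w in w_t.values()))
    return dot / (norm_c * norm_w) if norm_c and norm_w else 0.0

def most_relevant_clusters(fragments, centroids, doc_freq, n_docs, threshold):
    """Indices i whose centroid attains the maximum for some fragment t,
    provided that maximum exceeds the threshold."""
    selected = set()
    for tokens in fragments:
        w_t = fragment_weights(tokens, doc_freq, n_docs)
        scores = {i: centroid_score(w_t, c) for i, c in centroids.items()}
        best = max(scores, key=scores.get)
        if scores[best] > threshold:
            selected.add(best)
    return selected

# Hypothetical data: #RTD = 100 texts, two clusters with given centroids.
doc_freq = {"court": 10, "cell": 20, "gene": 5, "the": 100}
centroids = {1: {"court": 2.0}, 2: {"cell": 1.5, "gene": 1.0}}
fragments = [["the", "court", "court"], ["gene", "cell", "cell"], ["the", "the"]]

mrc = most_relevant_clusters(fragments, centroids, doc_freq, 100, threshold=0.5)
print(sorted(mrc))
```

The third fragment consists only of a word present in every text, so all its weights vanish and it selects no cluster.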
After qualifying a tokenized and lemmatized text $t$ to a cluster $B$ we search for similar (tokenized and
lemmatized) elements $y$ in $B$ employing the quantities $I(t, y)$, $C(t, y)$, $R(t, y)$ given below.
$$I(t, y) = \frac{\sum_{s \in t \cap y} w_t(s)\, w_y(s)}{\sqrt{\sum_{s \in t \cap y} (w_t(s))^2 \, \sum_{s \in t \cap y} (w_y(s))^2}},$$
$$C(t, y) = \frac{\sum_{s \in t \cap y} \min(w_t(s), w_y(s))}{\min\left(\sum_{s \in t \cap y} w_t(s),\, \sum_{s \in t \cap y} w_y(s)\right)}.$$
Here $t \cap y$ stands for the set of those words which appear in both $t$ and $y$, and for $x \in \{t, y\}$ the
weight $w_x(s)$ is given by the formula $w_x(s) = tf(s, x) \cdot r(s)$, where $r(s)$ is the rank of $s$ in $B$.
In the standard formulation
$$r(s) = \log(IDF(s)), \qquad IDF(s) = \frac{\#B}{\#B(s)},$$
where $B(s)$ is the subset of $B$ consisting of the texts containing $s$. Alternatively one may use discrete ranks
assuming the values $0, 2^0, 2^{-1}, \dots, 2^{-k}$ with a fixed $k \in \mathbb{N}$, which leads to a serious
reduction of computational costs.
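For concreteness, the quantities I and C with the standard ranks can be computed as in the Python sketch below. It is a hedged illustration: the function names, the toy cluster, and the convention of returning 0.0 when a denominator vanishes are assumptions of the example and are not fixed by the formulas.

```python
import math
from collections import Counter

def ranks(cluster):
    """r(s) = log(#B / #B(s)) for every word s occurring in the cluster B."""
    df = Counter(s for text in cluster for s in set(text))
    return {s: math.log(len(cluster) / n) for s, n in df.items()}

def weights(text, r):
    """w_x(s) = tf(s, x) * r(s); words unknown to the cluster get rank 0 (assumption)."""
    return {s: n * r.get(s, 0.0) for s, n in Counter(text).items()}

def measure_I(w_t, w_y):
    """Cosine measure with all sums restricted to the common words of t and y."""
    common = w_t.keys() & w_y.keys()
    num = sum(w_t[s] * w_y[s] for s in common)
    den = math.sqrt(sum(w_t[s] ** 2 for s in common) *
                    sum(w_y[s] ** 2 for s in common))
    return num / den if den else 0.0

def measure_C(w_t, w_y):
    """Sum of pointwise minima divided by the smaller of the two weight sums."""
    common = w_t.keys() & w_y.keys()
    num = sum(min(w_t[s], w_y[s]) for s in common)
    den = min(sum(w_t[s] for s in common), sum(w_y[s] for s in common))
    return num / den if den else 0.0

# Toy cluster B of four tokenized and lemmatized texts (hypothetical data).
B = [["b", "b", "c"], ["b", "d"], ["c", "e"], ["d", "e"]]
r = ranks(B)
w_t = weights(["b", "c", "c"], r)   # input fragment t
w_y = weights(B[0], r)              # candidate y from B

print(measure_I(w_t, w_y), measure_C(w_t, w_y))
```

In this toy cluster every word occurs in two of the four texts, so all ranks equal log 2 and the two measures depend only on the term counts.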
To define $R(t, y)$ we assume that $t \cap y = \{s_1, s_2, \dots, s_m\}$ and we consider the text
$$ind(t, y) = \{i(s_1), i(s_2), \dots, i(s_m)\}$$
consisting of the words 0, 1, 2 formed according to the rule
$$i(s) = \begin{cases} 0, & \text{if } w_t(s) = w_y(s), \\ 1, & \text{if } w_t(s) \neq w_y(s) \text{ and } w_t(s) = \min\{w_t(s), w_y(s)\}, \\ 2, & \text{if } w_t(s) \neq w_y(s) \text{ and } w_y(s) = \min\{w_t(s), w_y(s)\}. \end{cases}$$
For $e \in \{1, 2\}$ we consider the text $ind(t, y, e)$ created from $ind(t, y)$ by eliminating all appearances
of $e$. We now set
$$R(t, y) = \frac{2 \max\{\mathrm{length}(ind(t, y, 1)),\, \mathrm{length}(ind(t, y, 2))\}}{\mathrm{length}(t \cap y)} - 1.$$
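A small Python sketch of this construction is given below. It assumes that the two arguments of the maximum are the lengths of ind(t, y, 1) and ind(t, y, 2), respectively, and that R is set to 0.0 when t and y share no words; the function names and sample weights are hypothetical.

```python
def index_text(w_t, w_y):
    """ind(t, y): one symbol 0, 1 or 2 per common word s, following the rule for i(s)."""
    symbols = []
    for s in sorted(w_t.keys() & w_y.keys()):
        if w_t[s] == w_y[s]:
            symbols.append(0)       # equal weights
        elif w_t[s] < w_y[s]:
            symbols.append(1)       # the smaller weight is w_t(s)
        else:
            symbols.append(2)       # the smaller weight is w_y(s)
    return symbols

def measure_R(w_t, w_y):
    """R(t, y) = 2 * max(len(ind without 1s), len(ind without 2s)) / len(ind) - 1."""
    ind = index_text(w_t, w_y)
    if not ind:
        return 0.0                  # assumption: no common words means no evidence
    without_1 = len(ind) - ind.count(1)
    without_2 = len(ind) - ind.count(2)
    return 2 * max(without_1, without_2) / len(ind) - 1

# Hypothetical integer weights; "e" occurs only in y and is ignored.
w_t = {"a": 2, "b": 1, "c": 3, "d": 1}
w_y = {"a": 2, "b": 4, "c": 1, "d": 5, "e": 7}

print(index_text(w_t, w_y), measure_R(w_t, w_y))
```

Under this reading R lies in [0, 1]: it equals 1 when all unequal weights deviate in the same direction (for instance when one text repeats the other's vocabulary with uniformly higher counts) and 0 when the symbols 1 and 2 split the common words evenly. Since both weights of a word share the same rank r(s), the equality test reduces to comparing term counts whenever r(s) is nonzero.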
To measure similarity between $t$ and $y$ we can use any mapping $\varphi: [0,1]^3 \to [0,1]$ which is an
increasing function of each argument when the two other arguments are fixed. The texts $t$ and $y$ are
considered to be similar if
$$\varphi(I(t, y), C(t, y), R(t, y)) > 0.5.$$
In extensive tests and simulations we obtained very good results for
$$\varphi(x, y, z) = g(\max\{x, yz\}),$$
where
$$g(u) = 1 - \left(1 - \left(1 - \frac{2 \arccos(u)}{\pi}\right)^{q}\right)^{1/q}, \quad \text{with } q \approx 2.$$
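The decision rule can be written down directly; the Python sketch below does so with q = 2 as the default. The function names and the clamping of arguments to [0, 1] (a guard against rounding errors) are additions made for the example.

```python
import math

def g(u, q=2.0):
    """g(u) = 1 - (1 - (1 - 2*arccos(u)/pi)^q)^(1/q), increasing from g(0)=0 to g(1)=1."""
    u = min(1.0, max(0.0, u))                       # guard against rounding outside [0, 1]
    v = max(0.0, 1.0 - 2.0 * math.acos(u) / math.pi)
    return 1.0 - (1.0 - v ** q) ** (1.0 / q)

def phi(x, y, z, q=2.0):
    """phi(x, y, z) = g(max{x, y*z})."""
    return g(max(x, y * z), q)

def similar(i_val, c_val, r_val, threshold=0.5):
    """Texts are considered similar if phi(I, C, R) exceeds the threshold."""
    return phi(i_val, c_val, r_val) > threshold

print(g(0.0), g(1.0))
print(similar(0.99, 0.0, 0.0), similar(0.9, 0.5, 0.5))
```

With q = 2 the condition g(u) > 0.5 is met only for u above roughly 0.978, so the rule is strict: either I alone or the product C·R has to be very close to 1 for two texts to be flagged.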
Bibliography
[1] S. Alzahrani, N. Salim, A. Abraham, Understanding plagiarism linguistic patterns, textual features,
and detection methods, IEEE Transactions on Systems, Man, and Cybernetics - Part C:
Applications and Reviews, Vol. XX, No. XX, pp. 1-17, 2011.
[2] C. Buckley, G. Salton, J. Allan, The effect of adding relevance information in a relevance feedback
environment, International ACM SIGIR Conference, pp. 292-300, 1994.
[3] J. Rocchio, Relevance feedback in information retrieval, in: The SMART Retrieval System:
Experiments in Automatic Document Processing, G. Salton, ed., Prentice-Hall, pp. 313-323, 1971.
[4] M. Szczepański, Algorytmy klasyfikacji tekstów i ich wykorzystanie w systemie wykrywania
plagiatów, Oficyna Wydawnicza Politechniki Warszawskiej, ISBN 978-83-7814-189-1, 2014.
[5] http://www.uis.unesco.org/Education/Pages/international-standard-classification-of-education.aspx
(opened March 20, 2015).