NAME

plagiat - A Perl script to detect plagiarism between files.

 plagiat.pl [-v -v ... ] --path='*/*/*/file' [other options]

Download the package from http://www-spi.lip6.fr/~queinnec/Miscellaneous/plagiat.pl

DESCRIPTION

Plagiat is a Perl script that tries to detect plagiarism between files. When given a shell expression yielding many files, it compares them two by two and signals the pairs that are very close.

The comparison algorithm is very simple. Every file is compressed on its own; then every pair of files is concatenated and the concatenation is compressed. If the compressed pair is far smaller than the sum of the sizes of the two individually compressed files, then the two files share common fragments, which suggests potential plagiarism.
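The comparison described above can be sketched in a few lines of Perl. This is only an illustration, not the script's actual code: it compresses strings in memory with the core IO::Compress::Gzip module instead of writing .gz files, and the helper names compressed_size and shared_fraction, as well as the closeness measure itself, are assumptions made for this example.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Return the size in bytes of $data once gzip-compressed in memory.
sub compressed_size {
    my ($data) = @_;
    my $compressed;
    gzip(\$data => \$compressed)
        or die "gzip failed: $GzipError\n";
    return length $compressed;
}

# Hypothetical closeness measure (not necessarily the formula used by
# plagiat.pl): the fraction of the two individually compressed sizes
# that is saved when both texts are compressed together.
sub shared_fraction {
    my ($first, $second) = @_;
    my $separate = compressed_size($first) + compressed_size($second);
    my $together = compressed_size($first . $second);
    return ($separate - $together) / $separate;
}

# Small in-memory demonstration with made-up texts.
my $original  = join "\n", map { sprintf("line %d: %d", $_, $_ * 7919 % 1013) } 1 .. 200;
my $copy      = $original;
my $unrelated = join "\n", map { sprintf("row %d -> %d", $_, $_ * 104729 % 9973) } 1 .. 200;

printf "copy:      %.2f\n", shared_fraction($original, $copy);
printf "unrelated: %.2f\n", shared_fraction($original, $unrelated);
```

With this measure, an exact copy should score close to one half (the second copy costs almost nothing to compress), while unrelated files should score close to zero.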

Beware: this is a quadratic algorithm, so it may take a very long time! Also, the files you want to compare should not be too small: about fifty lines seems to be the minimum size for meaningful comparisons. It might be useful to pretty-print and obfuscate the files before comparing them (but I have yet to test that method).
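To get a feeling for the quadratic cost, here is a minimal sketch that enumerates the pairs to compare. It assumes that each unordered pair is compared exactly once, which gives n(n-1)/2 comparisons for n files; the helper name unordered_pairs and the file names are hypothetical.

```perl
use strict;
use warnings;

# Build every unordered pair (i < j) from a list of file names.
sub unordered_pairs {
    my @files = @_;
    my @pairs;
    for my $i (0 .. $#files - 1) {
        for my $j ($i + 1 .. $#files) {
            push @pairs, [ $files[$i], $files[$j] ];
        }
    }
    return @pairs;
}

# Four files give 4 * 3 / 2 = 6 pairs.
my @pairs = unordered_pairs(qw(a/f.c b/f.c c/f.c d/f.c));
printf "%d pairs to compare\n", scalar @pairs;
```

Under the same assumption, 100 files already require 4950 pairwise compressions, in addition to the 100 individual ones.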

This script generates compressed files next to the original files, that is, in the same directory and with the usual extension for compressed files (.gz here). It also generates a .NEAR file that contains the names of the files that are near this one. Note that if f1 and f2 are close, then either f1.NEAR will mention f2 or f2.NEAR will mention f1.

THE .NEAR FILES

Here is an example taken from a real .NEAR file. Every line of a .NEAR file has the structure of the following line:

 1796 1796 1861 /tmp/testplagiat/1/912423947/words.c     [correlation=0.91%]

 ^ size of the compressed file
      ^ size of the other compressed file
           ^ size of the two files compressed together
                ^ name of the other file
                                                          ^ correlation

The first number is the size of the compressed file words.c to which this .NEAR file belongs. The second number is the size of the compressed file to which it is compared (here, /tmp/testplagiat/1/912423947/words.c). The third number is the size of the two files concatenated and compressed together. The name of the other file comes next, and the final number is the correlation between the two files. These two files are probably very close.
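For readers who want to post-process .NEAR files, the following sketch parses one line with the layout shown above. It is not part of plagiat.pl: the function name parse_near_line and the hash keys are hypothetical, and the pattern assumes that every line follows the example exactly.

```perl
use strict;
use warnings;

# Parse one .NEAR line into a hash reference. Returns nothing if the
# line does not match the layout of the example line.
sub parse_near_line {
    my ($line) = @_;
    return unless $line =~ m{
        ^\s* (\d+) \s+ (\d+) \s+ (\d+) \s+   # the three compressed sizes
        (.+?) \s+                            # name of the other file
        \[correlation=([\d.]+)%\] \s*\z      # correlation value
    }x;
    return {
        self_size   => $1,   # size of the compressed file
        other_size  => $2,   # size of the other compressed file
        pair_size   => $3,   # size of the two files compressed together
        other_file  => $4,
        correlation => $5,
    };
}

my $entry = parse_near_line(
    ' 1796 1796 1861 /tmp/testplagiat/1/912423947/words.c     [correlation=0.91%]'
);
print "$entry->{other_file}: $entry->{correlation}\n" if $entry;
```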

OPTIONS

Here are the options of the plagiat script.

--path

 plagiat.pl --path='*/f.c'

This compares all the files named f.c in the immediate subdirectories of the current directory.

-v

--threshold

--help

--factor

--clean

--clean-gz

--clean-near

-k

BUGS

Should use the opt package for options! Should allow incremental usage (which requires recording the compressed pairs!).

AUTHOR

Christian Queinnec <Christian.Queinnec@lip6.fr>