CS610 Master's Project: Plagiarism Detection


Andrea Hunt

Spring 2005


Project Idea

    Under the direction of Dr. Finkel, I will modify the algorithms of existing plagiarism detection software to increase accuracy and sophistication. 

       Update:  

          Currently I am modifying the software to evaluate natural language texts, using sentence chunking with removal of "noise words".  Another expansion may be to implement some form of  Knuth's soundex algorithm to aid in the evaluation of older texts.



Proposed Schedule (revised):

February -- Implement effective chunking algorithm for natural language texts
March -- Modify or explore alternatives to existing hashing methods
April -- Examine, extend, or implement alternate method for storage
Mid April to May -- Complete write up of project


As allowed by time: further extend my own chunking, hashing, or storage methods; modify user interface; extend my contributions to cover plagiarism detection in programs (source code) for a given language.

Updates

January 13 -- Meeting with Dr. Finkel to discuss possible directions this project could take: 1) design a front end for the existing plagiarism software; 2) modify the existing algorithms to improve precision and accuracy; 3) explore a different route entirely (e.g.,  develop software to characterize programming styles or determine authorship of a given text).   After consideration, I have ruled out option three; although it did sound most exciting, it also sounded most involved and unfocused.  Since I am trying to contain this project within one semester, I need to be realistic about what I can accomplish in that time frame.  Going in a different direction entirely seems overwhelming, especially starting from scratch.  I have decided to go with option one, with a slight venture into option two.  This decision may change, of course, should I get to tinkering with the existing code and become engrossed so that my attention is focused more on the goals presented in option two.

My next step is to create a proposed schedule and outline of what I hope to accomplish this semester.  The results of this can be seen in the section above.

January 20 -- Meeting with Dr. Finkel to review proposed schedule and discuss the chosen direction of the project.  Several toolkits and libraries were discussed as possible tools, including Glade (to design GTK widgets), TK for simpler items, and the hashing options found in perl.  (Related note: Extensible hashing)  Although I would normally jump in and start hacking at the code, we have decided to use this project as an experiment in software engineering and attempt to follow software engineering protocols and practices (as evidenced by the tasks set forth in the proposed schedule),

February 1 -- Already I am deviating from the schedule.  A week of illness has put me behind, though I have played catch-up as best I can the past day and a half.  I should have had the initial draft of the software requirements done today, but as of yet I am still wrestling with that.  To better understand the software as is stands, I have been reading various papers.  The work presented for MatchDetectReveal at Monash University, Australia provides a wealth of information, as Dr. Finkel was involved with that project.  Reading and reviewing the papers and presentations on that site has proven a good starting point, though I am still having difficulty knowing where to begin.  I now believe my time would be best served concentrating on understanding the existing code and algorithms as they stand.  The code is written in perl and while I am completely unfamiliar with perl, this should not be a great hurdle to overcome -- it is merely a matter of learning the syntax and nuances of a new programming language.  The greater challenge lies in thoroughly understanding exactly and precisely how each of the dozen or so files of code work together.  Analyzing and understanding the algorithms contained in the code will no doubt present a challenge as well, though the files themselves seem to be well documented.  "Signature extraction for overlap detection in documents" by Finkel, et al. (which can be found on the MDR link given above) also helps greatly.  As soon as I can dive into the code and truly begin dissecting it, I will finally feel I have my bearings and a clear direction.

February 3 -- I have begun to plan out a new chunking method to add to the existing ones in the code.  I have yet to decide whether it will be for text or programs.  Either way, the method needs to be self aligning and needs to correspond to natural divisions in the text.  Once I have a basic chunking method up and running, I will plug it into the existing code and see what happens.

February 11 -- I have integrated my chunking method and it is up and running.  I have decided on a text based method for natural languages, based on sentences.  The method divides the text according to specified end punctuation.  I have been experimenting with the method to see how it behaves when insignificant words are removed -- words such as articles, prepositions, and conjunctions.  Care must be taken, however, not to remove too much of the original text, since the content much be preserved to accurately detect plagiarism (this is not such a pressing issue with program code, since the structure, instead of the content, can reveal plagiarism).
    Other ideas for extensions Dr. Finkel and I have discussed:
I next plan to improve the chunking method, develop a test harness, and work on other aspects of the software (such as hashing, storage, etc.)

February 14 -- Dr. Finkel and I met and further discussed the use of something such as WordNet to modify the chunking algorithm.  Although this may prove to be useful, for the time being we have decided to table that route.  We also discussed a method of removing noise words.  I will continue to work on the chunking method and begin to collect viable sets of test cases.

February 18 --  I have collected the complete set of Shakespeare's sonnets as well as two complete versions of the Bible, no longer under copyright protection, with which I will test my chunking method.  I have chosen to use the King James version of the Bible as well as the Douay-Rheims version.  The sources of my test cases can be found in the links section below.  After several "false starts" of my Linux box freezing during program execution, it finally seems to be chugging away productively.  However, I have a feeling it may take a while.
On a related note, I have been exploring including a possible extension of the chunking method based on Knuth's proposed soundex algorithm.  Although it is used extensively in genealogy to reconcile the various spellings of surnames, I believe it may prove useful when comparing new texts to old texts, or even American English texts to British English texts, to allow for changes or differences in spelling.


Links

Knuth's Soundex Algorithm  -- Web site giving brief explanation of the algorithm and how to use existing perl module to evaluate the soundex value of a name.
King James Version of the Bible -- from Project Gutenburg
Douay-Rheims Version of the Bible -- from Project Gutenburg
Shakespeare's Complete Sonnets -- Again, from Project Gutenburg
WEKA -- Data Mining Software in Java