CS610 Master's Project: Plagiarism
Detection
Andrea Hunt
Spring 2005
Project Idea
Under the direction of Dr. Finkel, I will modify the
algorithms of existing plagiarism detection software to increase
accuracy and sophistication.
Update:
Currently I am modifying
the software to evaluate natural language texts, using sentence
chunking with removal of "noise words". Another expansion may be
to implement some form of Knuth's soundex algorithm to aid in the
evaluation of older texts.
Proposed Schedule (revised):
February -- Implement effective chunking algorithm for natural language
texts
March -- Modify or explore alternatives to existing hashing methods
April -- Examine, extend, or implement alternate method for storage
Mid April to May -- Complete write up of project
As allowed by time: further extend my own chunking, hashing, or storage
methods; modify user interface; extend my contributions to cover
plagiarism detection in programs (source code) for a given language.
Updates
January 13
-- Meeting with Dr. Finkel to discuss possible directions this project
could take: 1) design a front end for the existing plagiarism software;
2) modify the existing algorithms to improve precision and accuracy; 3)
explore a different route entirely (e.g., develop software to
characterize programming styles or determine authorship of a given
text). After consideration, I have ruled out option three;
although it did sound most exciting, it also sounded most involved and
unfocused. Since I am trying to contain this project within one
semester, I need to be realistic about what I can accomplish in that
time frame. Going in a different direction entirely seems
overwhelming, especially starting from scratch. I have decided to
go with option one, with a slight venture into option two. This
decision may change, of course, should I get to tinkering with the
existing code and become engrossed so that my attention is focused more
on the goals presented in option two.
My next step is to create a proposed schedule and outline of what I
hope to accomplish this semester. The results of this can be seen
in the section above.
January 20 -- Meeting with Dr.
Finkel to review proposed schedule and discuss the chosen direction of
the project. Several toolkits and libraries were discussed as
possible tools, including Glade (to design GTK widgets), TK for simpler
items, and the hashing options found in perl. (Related note: Extensible
hashing) Although I would normally jump in and start hacking
at the code, we have decided to use this project as an experiment in
software engineering and attempt to follow software engineering
protocols and practices (as evidenced by the tasks set forth in the
proposed schedule),
February 1 -- Already I am
deviating from the schedule. A week of illness has put me behind,
though I have played catch-up as best I can the past day and a
half. I should have had the initial draft of the software
requirements done today, but as of yet I am still wrestling with
that. To better understand the software as is stands, I have been
reading various papers. The work presented for MatchDetectReveal at Monash University,
Australia provides a wealth of information, as Dr. Finkel was involved
with that project. Reading and reviewing the papers and
presentations on that site has proven a good starting point, though I
am still having difficulty knowing where to begin. I now believe
my time would be best served concentrating on understanding the
existing code and algorithms as they stand. The code is written
in perl and while I am completely unfamiliar with perl, this should not
be a great hurdle to overcome -- it is merely a matter of learning the
syntax and nuances of a new programming language. The greater
challenge lies in thoroughly understanding exactly and precisely how
each of the dozen or so files of code work together. Analyzing
and understanding the algorithms contained in the code will no doubt
present a challenge as well, though the files themselves seem to be
well documented. "Signature extraction for overlap detection in
documents" by Finkel, et al. (which can be found on the MDR link given
above) also helps greatly. As soon as I can dive into the code
and truly begin dissecting it, I will finally feel I have my bearings
and a clear direction.
February 3 -- I have begun to
plan out a new chunking method to add to the existing ones in the
code. I have yet to decide whether it will be for text or
programs. Either way, the method needs to be self aligning and
needs to correspond to natural divisions in the text. Once I have
a basic chunking method up and running, I will plug it into the
existing code and see what happens.
February 11 -- I have
integrated my chunking method and it is up and running. I have
decided on a text based method for natural languages, based on
sentences. The method divides the text according to specified end
punctuation. I have been experimenting with the method to see how
it behaves when insignificant words are removed -- words such as
articles, prepositions, and conjunctions. Care must be taken,
however, not to remove too much of the original text, since the content
much be preserved to accurately detect plagiarism (this is not such a
pressing issue with program code, since the structure, instead of the
content, can reveal plagiarism).
Other ideas for extensions Dr. Finkel and I have
discussed:
- Use classifiers as opposed to straight text (similar to
WordNet). This would involve extra parsing, but there are
existing tools that could aid with that.
- Use categorizes, as in Java -- along the lines of WEKA, machine
learning algorithms for data mining tasks, but modify into perl
- Remove all trivial chunks (those that are common or unimportant
to the overall content). These could be discovered through corpus
based analysis.
I next plan to improve the chunking method, develop a test harness, and
work on other aspects of the software (such as hashing, storage, etc.)
February 14 -- Dr. Finkel and
I met and further discussed the use of something such as WordNet to
modify the chunking algorithm. Although this may prove to be
useful, for the time being we have decided to table that route.
We also discussed a method of removing noise words. I will
continue to work on the chunking method and begin to collect viable
sets of test cases.
February 18 -- I have
collected the complete set of Shakespeare's sonnets as well as two
complete versions of the Bible, no longer under copyright protection,
with which I will test my chunking method. I have chosen to use
the King James version of the Bible as well as the Douay-Rheims
version. The sources of my test cases can be found in the links
section below. After several "false starts" of my Linux box
freezing during program execution, it finally seems to be chugging away
productively. However, I have a feeling it may take a while.
On a related note, I have been exploring including a possible extension
of the chunking method based on Knuth's proposed soundex
algorithm. Although it is used extensively in genealogy to
reconcile the various spellings of surnames, I believe it may prove
useful when comparing new texts to old texts, or even American English
texts to British English texts, to allow for changes or differences in
spelling.
Links
Knuth's
Soundex Algorithm -- Web site giving brief explanation of the
algorithm and how to use existing perl module to evaluate the soundex
value of a name.
King James Version of the
Bible -- from
Project Gutenburg
Douay-Rheims Version of
the Bible -- from Project Gutenburg
Shakespeare's Complete
Sonnets -- Again, from Project Gutenburg
WEKA -- Data Mining
Software in Java