Programming Assignment 05: Similarity Detector
Estimated reading time: 10 minutes
Estimated time to complete: 90-150 minutes (plus debugging time)
Prerequisites: Assignment 03
Starter code: similarity-detector-student.zip
Collaboration: not permitted
Many classes at UMass and other colleges and universities use one or more automated tools to alert instructors to possible plagiarism. These tools, such as Turnitin and MossPlus, are used to evaluate the similarity of texts to one another and/or to large corpora of source material. Turnitin, for example, doesn’t just check student assignments against one another – it can compare against assignments submitted in previous years (thus helping deter paper banks), or against all of Wikipedia, or against other digitized texts.
At their core, these services have to measure the similarity between texts (or program code). How do they work? In this assignment, you’ll build a simplified similarity detection system and find out. Your system will convert input texts into sets representing those texts, and will compare those sets using the Jaccard index, a measure of the similarity between sets.
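For reference, the Jaccard index of two sets is conventionally defined as the size of their intersection divided by the size of their union. The worked example uses two made-up word sets. How the case of two empty sets should be handled is not stated here, so check the starter code's documentation and tests for that.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\qquad
A = \{\text{the}, \text{cat}, \text{sat}\},\quad
B = \{\text{the}, \text{cat}, \text{ran}\}
\;\Rightarrow\;
J(A, B) = \frac{|\{\text{the}, \text{cat}\}|}{|\{\text{the}, \text{cat}, \text{sat}, \text{ran}\}|} = \frac{2}{4} = 0.5
```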
We’ve provided a large set of unit tests to help with automated testing, though you might also want to write a class with a method for interactive testing. The Gradescope autograder includes a few more tests, but they exist primarily to verify you’re not gaming the autograder. If your code can pass the tests we’ve provided, it is likely correct.
Note that if you run into trouble with the Eclipse debugger mysteriously quitting during unit tests, it’s due to the timeout rule that we use to catch infinite loops:
Comment out the above two lines in all test files, and the debugger will no longer exit. Be aware that, without the timeout, any test that hits an infinite loop will hang instead of failing.
Goals
- Translate written descriptions of behavior into code.
- Practice writing static methods.
- Practice interacting with the and abstractions.
- Test code using unit tests.
Downloading and importing the starter code
As in previous assignments, download and save (but do not decompress) the provided archive file containing the starter code. Then import it into Eclipse in the same way; you should end up with a project in the “Project Explorer”.
What to do
Complete the code in both and .
has four methods for you to complete. Reading the type signatures:
you should understand that these are generic methods, that is, they are parameterized on type . will work on any two of same-typed s.
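To illustrate what "parameterized on a type" means, here is a minimal sketch of a generic static method. The class name `GenericSetDemo` and the method `union` are hypothetical and chosen only for illustration; they are not necessarily the names or signatures used in the starter code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class GenericSetDemo {
    // <E> declares a type parameter: both arguments must be sets of the same E,
    // and the result is a set of that same E. (Hypothetical helper.)
    public static <E> Set<E> union(Set<E> a, Set<E> b) {
        Set<E> result = new HashSet<>(a); // copy, so the inputs are not modified
        result.addAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<String> words = union(
                new HashSet<>(Arrays.asList("a", "b")),
                new HashSet<>(Arrays.asList("b", "c")));
        Set<Integer> numbers = union(
                new HashSet<>(Arrays.asList(1, 2)),
                new HashSet<>(Arrays.asList(2, 3)));
        System.out.println(words.size());   // prints 3
        System.out.println(numbers.size()); // prints 3
    }
}
```

The same method body serves both `Set<String>` and `Set<Integer>`; the compiler rejects a call that mixes the two.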
These four methods should be straightforward to implement. Get them done early, because you’ll need them for the next part of the assignment.
There are six methods to complete here.
As in previous assignments, you’ll find helpful. But I don’t expect you to know “regular expressions” at this point, so here are two you’ll need:
- Passing to will split the original string on newlines (s). It will return an array of strings representing the split lines. You may still need to handle empty lines yourself, though.
- Passing (note: uppercase ) to will split the original string on non-word characters. ( means “a non-word character” and means “one or more times in a row”). You may still need to handle empty words yourself, as above.
You may find other instance methods of , such as and , helpful.
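The sketch below shows how the two splits described above typically behave. It assumes the method in question is `String.split` and that the two patterns are `"\\n"` and `"\\W+"` (the exact literals did not survive in this copy of the handout). The class and method names are hypothetical, and trimming each line is an assumption based on the later mention of "trimmed lines".

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    // Splits on newlines, trims each line, and skips lines that end up empty.
    public static List<String> lines(String text) {
        List<String> result = new ArrayList<>();
        for (String line : text.split("\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Splits on runs of non-word characters and skips empty pieces.
    // split can return a leading "" when the text starts with a separator.
    public static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String piece : text.split("\\W+")) {
            if (!piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lines("a\n\n  b  \n"));    // prints [a, b]
        System.out.println(words(", hello, world!")); // prints [hello, world]
    }
}
```

Without the `isEmpty` checks, the first call would also yield an empty string for the blank line, and the second would yield a leading empty string.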
In the first method, your code should convert each text to a set using , then compare them using . There is no trick here!
In the second method, there is a argument. The is a string that you don’t want to consider when comparing the similarity of two texts. For example, if I were checking your programming assignments for similarity, I would include the starter code as the template. To “ignore” the template, convert a text to a set and a template to a set, then use set difference to remove the template set from the text set. Do so for both input texts, then find their Jaccard index.
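As a rough illustration of "removing the template set from the text set", the sketch below uses `Set.removeAll` on a copy. The class name, method name, and sample lines are hypothetical; your own code should go through the set methods you completed earlier rather than this helper.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TemplateDemo {
    // Returns the elements of text that do not appear in template.
    // Works on a copy so neither argument is modified. (Hypothetical helper.)
    public static <E> Set<E> withoutTemplate(Set<E> text, Set<E> template) {
        Set<E> result = new HashSet<>(text);
        result.removeAll(template);
        return result;
    }

    public static void main(String[] args) {
        Set<String> submission = new HashSet<>(
                Arrays.asList("import java.util.*;", "int x = 1;"));
        Set<String> template = new HashSet<>(
                Arrays.asList("import java.util.*;"));
        System.out.println(withoutTemplate(submission, template)); // prints [int x = 1;]
    }
}
```

Only what remains after this step in each text is compared, so lines shared solely because of the template no longer inflate the similarity score.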
Then comes the method. Shingling is a venerable technique for finding partial duplicates (see, for example, http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html) that creates a set of multiple overlapping pieces of a text for further comparison. The Javadoc has an example, and the JUnit test cases have further examples; it should be straightforward to generalize from those examples. You’re likely to need a pair of nested loops, and it should (I hope) be less confusing for some students than the method was in last week’s assignment.
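To make the idea of overlapping pieces concrete, here is a minimal sketch of word shingling with a pair of nested loops. Everything about its shape is an assumption: the class and method names, the `List<String>` input, joining words with a single space, and `length` being at least 1. The Javadoc and JUnit tests in the starter code define the format that is actually required.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ShingleDemo {
    // Returns every run of `length` consecutive words, joined by single spaces.
    // Assumes length >= 1. (Hypothetical signature and output format.)
    public static Set<String> shingle(List<String> words, int length) {
        Set<String> shingles = new HashSet<>();
        // Outer loop: every start position that leaves room for a full shingle.
        for (int start = 0; start + length <= words.size(); start++) {
            StringBuilder sb = new StringBuilder();
            // Inner loop: collect the `length` words beginning at `start`.
            for (int i = start; i < start + length; i++) {
                if (i > start) {
                    sb.append(' ');
                }
                sb.append(words.get(i));
            }
            shingles.add(sb.toString());
        }
        return shingles;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "rose", "is", "a", "rose");
        // Four windows of length 2, but "a rose" occurs twice, so the set has 3 elements.
        System.out.println(shingle(words, 2).size()); // prints 3
    }
}
```

Because the result is a set, repeated shingles collapse into one entry, and a text with fewer words than `length` produces an empty set.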
Finally, implement the method, which parallels the second method but requires shingling the words of the text, rather than the trimmed lines.
Submitting the assignment
When you have completed the changes to your code, you should export an archive file containing the entire Java project. To do this, follow the same steps as in Assignment 01 to produce a file, and upload it to Gradescope.
Remember, you can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.