Wednesday, April 16, 2008

Detecting near duplicates in big data

I finally got to a WWW 2007 paper out of Google that I have been meaning to read, "Detecting Near-Duplicates for Web Crawling" (PDF) by Gurmeet Manku, Arvind Jain, and Anish Das Sarma.

The paper takes Moses Charikar's more theoretical 2002 work, "Similarity Estimation Techniques from Rounding Algorithms" (ACM site), which describes a form of locality-sensitive hashing, and applies it at very large scale (8B documents), dealing with all the practical issues along the way.

In that sense, this Google WWW 2007 paper has a lot of similarities with Monika Henzinger's excellent SIGIR 2006 paper, "Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms" (ACM site). That paper evaluated the shingles method of creating fingerprints of substrings from a document against Charikar's technique on a 1.6B document data set and found the latter to be superior.

Some excerpts from the Google WWW 2007 paper:
Documents that are exact duplicates of each other ... are easy to identify ... A more difficult problem is the identification of near-duplicate documents ... [that] differ in a small portion of the document such as advertisements, counters and timestamps.

Charikar's simhash ... is a fingerprinting technique that enjoys the property that fingerprints of near-duplicates differ in a small number of bit positions.

We maintain an f-dimensional vector V, each of whose dimensions is initialized to zero. A feature is hashed into an f-bit hash value. These f bits (unique to the feature) increment/decrement the f components of the vector by the weight of that feature as follows: if the i-th bit of the hash value is 1, the i-th component of V is incremented by the weight of that feature; if the i-th bit of the hash value is 0, the i-th component of V is decremented by the weight of that feature. When all features have been processed, some components of V are positive while others are negative. The signs of components determine the corresponding bits of the final fingerprint.

Simhash possesses two conflicting properties: (A) The fingerprint of a document is a "hash" of its features, and (B) Similar documents have similar hash values.
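
To make that construction concrete, here is a minimal sketch in Python of the fingerprinting step described above. The choice of features and weights is mine for illustration (lowercased tokens weighted by their frequency in the document); the paper leaves feature selection and weighting to the application.

```python
import hashlib
import re
from collections import Counter

F = 64  # fingerprint size in bits (the paper uses 64-bit fingerprints)

def feature_hash(feature):
    """Hash a feature to an F-bit integer (MD5 truncated to F bits)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:F // 8], "big")

def simhash(text):
    """Build an F-bit simhash fingerprint from weighted features.
    Here a feature is a token and its weight is its frequency."""
    weights = Counter(re.findall(r"\w+", text.lower()))
    v = [0] * F
    for feature, weight in weights.items():
        h = feature_hash(feature)
        for i in range(F):
            # Bit i of the feature's hash decides whether to add or
            # subtract the feature's weight from component i of V.
            v[i] += weight if (h >> i) & 1 else -weight
    # The sign of each component determines the corresponding fingerprint bit.
    fingerprint = 0
    for i in range(F):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

# Near-duplicates should differ in only a few bit positions.
doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
print(hamming_distance(simhash(doc1), simhash(doc2)))
```
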
The paper goes on to describe how they efficiently find nearby but different hash values that represent similar documents, how they rapidly build the simhash using MapReduce, and metrics on the effectiveness of the technique.
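
The neighbor search rests on a pigeonhole observation: if two f-bit fingerprints differ in at most k bits, then splitting each fingerprint into k+1 blocks guarantees that at least one block matches exactly, so candidates can be found by exact lookup on block values and then verified. The sketch below only illustrates that idea with a small in-memory index (the class and parameter names are mine); the paper's actual solution builds permuted, sorted tables of fingerprints to make the same guarantee workable over 8B documents, and reports that a Hamming distance threshold of k = 3 works well for 64-bit fingerprints.

```python
from collections import defaultdict

F = 64   # fingerprint size in bits
K = 3    # maximum Hamming distance treated as "near-duplicate"

def blocks(fingerprint, num_blocks=K + 1, f=F):
    """Split an f-bit fingerprint into num_blocks contiguous bit blocks."""
    size = f // num_blocks
    out = []
    for i in range(num_blocks):
        lo = i * size
        hi = f if i == num_blocks - 1 else lo + size
        mask = (1 << (hi - lo)) - 1
        out.append((i, (fingerprint >> lo) & mask))
    return out

class NearDuplicateIndex:
    """Index fingerprints by block value; any fingerprint within Hamming
    distance K of a query must agree with it exactly on at least one block."""

    def __init__(self):
        self.table = defaultdict(set)  # (block_id, block_value) -> fingerprints

    def add(self, fingerprint):
        for key in blocks(fingerprint):
            self.table[key].add(fingerprint)

    def query(self, fingerprint, k=K):
        candidates = set()
        for key in blocks(fingerprint):
            candidates |= self.table[key]
        # Verify candidates with an exact Hamming distance check.
        return [c for c in candidates if bin(c ^ fingerprint).count("1") <= k]
```

With k = 3 this means four 16-bit blocks, so every true near-duplicate is guaranteed to show up as a candidate; the cost is some extra candidates that the final Hamming check filters out.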

3 comments:

jeff.dalton said...

Very elegant technique for dimensionality reduction for document similarity.

Google also patented it: Methods and apparatus for estimating similarity (US Patent 7,158,961)

Dennis Gorelik said...

Greg,

Am I right that Google's paper is basically saying that the most efficient way to find near-duplicate documents is to
count the number of matching triplets in two documents?
(Triplet -- a 3-word phrase.)

JD said...

Interesting... I would like to know what the weight of a feature means and how to calculate it. If I understand correctly, a feature can be just a token from the document.