I am curious whether Lucene and/or Mahout can identify duplicate documents. I am
having trouble with many redundant docs in my corpus, which inflates scores and
forces users to process and reprocess much of the same material. Can the
redundancy be removed or managed in some sense by either Lucene at ingestion or
Mahout in post-processing? The Vector Space Model seems notionally similar to
PCA or Factor Analysis, which have similar ambitions. Thoughts?
Thank you in advance.