All,

I am curious whether Lucene and/or Mahout can identify duplicate documents. I
am having trouble with many redundant docs in my corpus, which inflate values
and force users to process and reprocess much of the same material. Can the
redundancy be removed or managed in some sense by either Lucene at ingestion
or Mahout at post-processing? The Vector Space Model seems to be notionally
similar to PCA or Factor Analysis, which both have similar ambitions.
Thoughts?
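
For concreteness, ingestion-time filtering of exact duplicates could look
roughly like the sketch below. It is plain JDK code, not a Lucene or Mahout
API. The class name DedupFilter and its methods are made up for illustration,
and it assumes two docs count as duplicates only when they match after case
and whitespace normalization. Near-duplicates would need a similarity measure
rather than a hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Lucene or Mahout): filters out exact
// duplicates before documents are handed to the indexer.
class DedupFilter {
    // Fingerprints of every document accepted so far.
    private final Set<String> seen = new HashSet<>();

    // Lowercase, trim, and collapse whitespace so trivial formatting
    // differences do not hide a duplicate.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
    }

    // SHA-256 of the normalized text, as a 64-character hex string.
    static String fingerprint(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(normalize(text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the document has not been seen and should be indexed.
    boolean accept(String text) {
        return seen.add(fingerprint(text));
    }
}
```

Calling accept() on each doc before adding it to the index would skip any
doc whose normalized text has already been seen.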

Thank you in advance....

Regards,
Rich Heimann

Discussion Overview
group: java-user
categories: lucene
posted: Jul 28, '11 at 3:50p
active: Jul 28, '11 at 3:50p
posts: 1
users: 1
website: lucene.apache.org
