Lucene + Hadoop
Hi,

I am trying to use Hadoop for Lucene index creation. I have to create multiple indexes based on the contents of the files (i.e., if the author is "hrishikesh", the file should be added to an index for "hrishikesh"; there has to be a separate index for every author). For this, I am keeping an IndexWriter open for every author and maintaining them in a HashMap in the map() function. I parse each incoming file, and if its author is one for which I have already opened an IndexWriter, I just add the file to that index; otherwise I create a new IndexWriter for the new author. As the number of authors might run into the thousands, I close the IndexWriters and clear the HashMap once it reaches a certain threshold, then start all over again. There is no reduce function.
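A minimal sketch of the bookkeeping described above, in plain Java. AuthorIndex is a hypothetical stand-in for Lucene's IndexWriter (so the example is self-contained), and all class, field, and method names here are illustrative, not from the original post:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a per-author Lucene IndexWriter: it just
// collects documents. A real implementation would wrap
// org.apache.lucene.index.IndexWriter instead.
class AuthorIndex {
    final List<String> docs = new ArrayList<>();
    boolean closed = false;
    void addDocument(String doc) { docs.add(doc); }
    void close() { closed = true; }  // flush to disk + release resources
}

public class PerAuthorIndexer {
    private final Map<String, AuthorIndex> writers = new HashMap<>();
    private final int maxOpenWriters;  // threshold before flushing everything
    int flushes = 0;                   // how many times we closed and cleared

    PerAuthorIndexer(int maxOpenWriters) { this.maxOpenWriters = maxOpenWriters; }

    // Called once per input file, like map(): route the document to the
    // writer for its author, opening one lazily if needed.
    void index(String author, String doc) {
        AuthorIndex w = writers.get(author);
        if (w == null) {
            // Too many open writers: close them all, clear the map,
            // and start over, as described in the thread.
            if (writers.size() >= maxOpenWriters) {
                closeAll();
                flushes++;
            }
            w = new AuthorIndex();
            writers.put(author, w);
        }
        w.addDocument(doc);
    }

    void closeAll() {
        for (AuthorIndex w : writers.values()) w.close();
        writers.clear();
    }

    int openWriters() { return writers.size(); }
}
```

One consequence of the close-and-clear step is that a single author's index may be opened and closed many times per map task, so append-mode reopening (and the cost of repeated flushes) is something to watch.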

Does this logic sound correct? Is there any other way of implementing this requirement?

--Hrishi

DISCLAIMER
==========
This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.


  • Otis Gospodnetic at Nov 10, 2009 at 11:55 pm
    I think that sounds right.
    I believe that's what I did when I implemented this type of functionality for http://simpy.com/

    I'm not sure why this is a Hadoop thing, though.

    Otis
    --
    Sematext is hiring -- http://sematext.com/about/jobs.html?mls
    Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR


  • Eason.Lee at Nov 11, 2009 at 1:03 am
I think you'd be better off using map() to group all the files belonging to the same
author together, and using reduce() to index the files.
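A rough sketch of that split, with plain Java standing in for the Hadoop API: map() emits (author, file) pairs, the framework's shuffle groups them by key, and reduce() then sees one author at a time with all of that author's files, so a single IndexWriter per reduce call would suffice. Record format and method names here are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupThenIndex {
    // map(): extract the author key from each input record and emit
    // (author, file). Here a record is assumed to be "author\tfilename".
    static Map.Entry<String, String> map(String record) {
        String[] parts = record.split("\t", 2);
        return Map.entry(parts[0], parts[1]);
    }

    // The grouping the Hadoop shuffle phase would perform: collect all
    // values for each key, so each reduce() call sees one author with
    // every file by that author.
    static Map<String, List<String>> shuffle(List<String> records) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String r : records) {
            Map.Entry<String, String> kv = map(r);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        return grouped;
    }

    // reduce(): with all of one author's files in hand, this is where a
    // single per-author IndexWriter would be opened, filled, and closed.
    // Here we just summarize the group.
    static String reduce(String author, List<String> files) {
        return author + ":" + files.size();
    }
}
```

The advantage over the map-side HashMap is that at most one index is open at a time per reduce call, so the thousands-of-authors threshold logic disappears.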

  • Sagar at Nov 11, 2009 at 6:28 pm
    Check out MultipleOutputFormat
    (it is similar to your implementation).

    Having a separate index for every author may not be a good idea.
    You can have one index for all authors and query it per author.
    But I'm not sure of your requirements.
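A sketch of that single-index alternative: every document carries an author field, and queries filter on it. A plain inverted map stands in for Lucene here to keep the example self-contained; with real Lucene this would be one shared IndexWriter plus a TermQuery on the author field. Names are illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One shared index with an "author" field per document, instead of an
// index per author. A HashMap models the author term -> postings list.
public class SingleIndex {
    private final Map<String, List<String>> byAuthor = new HashMap<>();

    // Add a document, tagging it with its author (the indexed field).
    void add(String author, String doc) {
        byAuthor.computeIfAbsent(author, a -> new ArrayList<>()).add(doc);
    }

    // Equivalent of querying the shared index restricted to one author.
    List<String> search(String author) {
        return byAuthor.getOrDefault(author, Collections.emptyList());
    }
}
```

This keeps exactly one writer open regardless of how many authors exist, at the cost of one larger index instead of many small ones.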

    -Sagar

Discussion Overview
group: common-user@hadoop.apache.org
categories: hadoop
posted: Nov 10, 2009 at 9:57 AM
active: Nov 11, 2009 at 6:28 PM
posts: 4
users: 4
website: hadoop.apache.org...
irc: #hadoop

site design / logo © 2022 Grokbase