Creating Lucene index in Hadoop
Hi,

How do I allow multiple nodes to write to the same index file in HDFS?

Thank you,
Mark


  • 王红宝 at Mar 13, 2009 at 5:35 am
    You can look at the Nutch code.
  • Ning Li at Mar 13, 2009 at 7:10 pm
    Or you can check out the index contrib. The difference between the two is:
    - In Nutch's indexing map/reduce job, indexes are built in the
    reduce phase. Afterwards, they are merged into a smaller number of
    shards if necessary. The last time I checked, the merge process does
    not use map/reduce.
    - In contrib/index, small indexes are built in the map phase. They
    are merged into the desired number of shards in the reduce phase. In
    addition, they can be merged into existing shards. (A sketch of this
    map-side approach follows this message.)

    Cheers,
    Ning
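
    A minimal, hedged sketch of the map-side approach described above: each map
    task builds a small Lucene index in memory, which is later merged into shards
    in the reduce phase. The class name SmallIndexBuilder is hypothetical, and the
    calls assume a Lucene 2.4-era API (names differ in other Lucene versions).

    // Hedged sketch: build a small per-map Lucene index in RAM, roughly as
    // contrib/index does in its map phase. Lucene 2.4-era API assumed.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;

    public class SmallIndexBuilder {
        public static RAMDirectory buildSmallIndex(String[] lines) throws Exception {
            RAMDirectory dir = new RAMDirectory();   // per-map-task, in-memory index
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
                    true, IndexWriter.MaxFieldLength.UNLIMITED);
            for (int i = 0; i < lines.length; i++) {
                Document doc = new Document();
                doc.add(new Field("id", Integer.toString(i),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.add(new Field("content", lines[i],
                        Field.Store.NO, Field.Index.ANALYZED));
                writer.addDocument(doc);
            }
            writer.close();   // the small index now lives entirely in 'dir'
            return dir;
        }
    }
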
  • Ian Soboroff at Mar 16, 2009 at 6:58 pm
    I understand why you would index in the reduce phase, because the anchor
    text gets shuffled to be next to the document. However, when you index
    in the map phase, don't you just have to reindex later?

    The main point to the OP is that HDFS is a bad FS for writing Lucene
    indexes because of how Lucene works. The simple approach is to write
    your index outside of HDFS in the reduce phase, and then merge the
    indexes from each reducer manually.

    Ian
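
    A minimal, hedged sketch of the manual merge Ian suggests, assuming each
    reducer wrote its index to a local (non-HDFS) directory. The directory paths
    and the class name MergeReducerIndexes are hypothetical; the calls assume a
    Lucene 2.4-era API.

    // Hedged sketch: merge per-reducer local indexes into one local index.
    // Usage (hypothetical paths):
    //   java MergeReducerIndexes /local/merged /local/index-part-00000 /local/index-part-00001
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeReducerIndexes {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(args[0]),
                    new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
            Directory[] parts = new Directory[args.length - 1];
            for (int i = 1; i < args.length; i++) {
                parts[i - 1] = FSDirectory.getDirectory(args[i]);
            }
            writer.addIndexesNoOptimize(parts);  // merge without forcing a full optimize
            writer.optimize();                   // optional: compact to a single segment
            writer.close();
        }
    }
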
  • Ning Li at Mar 16, 2009 at 8:48 pm
    I should have pointed out that the Nutch index build and contrib/index
    target different applications. The latter is for applications that
    simply want to build a Lucene index from a set of documents - e.g., with no
    link analysis.

    As to writing Lucene indexes, both work the same way - write the final
    results to the local file system and then copy to HDFS. In contrib/index,
    the intermediate results are kept in memory and not written to HDFS.

    Hope that clarifies things.

    Cheers,
    Ning
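
    A minimal, hedged sketch of that last step - copying a finished local index
    directory into HDFS with the standard Hadoop FileSystem API. The paths and the
    class name PublishIndexToHdfs are hypothetical.

    // Hedged sketch: publish a locally built index directory to HDFS.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PublishIndexToHdfs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // copy the finished local index directory (recursively) into HDFS
            fs.copyFromLocalFile(new Path("/local/index-shard-0"),
                                 new Path("/indexes/shard-0"));
            fs.close();
        }
    }

    The command-line equivalent would be something like
    hadoop fs -put /local/index-shard-0 /indexes/shard-0 (paths hypothetical).
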
  • Ian Soboroff at Mar 16, 2009 at 8:53 pm
    Does anyone have stats on how multiple readers on an optimized Lucene
    index in HDFS compare with a ParallelMultiReader (or whatever it's
    called) over RPC on a local filesystem?

    I'm missing why you would ever want the Lucene index in HDFS for
    reading.

    Ian

  • Ning Li at Mar 16, 2009 at 9:19 pm

    > I'm missing why you would ever want the Lucene index in HDFS for
    > reading.

    The Lucene indexes are written to HDFS, but that does not mean you
    search the indexes stored in HDFS directly. HDFS is not
    designed for random access. Usually the indexes are copied to the
    nodes where search will be served. With
    http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
    become feasible to search on HDFS directly.

    Cheers,
    Ning
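
    A minimal, hedged sketch of the deployment pattern described above: copy a
    shard out of HDFS onto a search node's local disk (a streaming copy, which
    HDFS handles fine), then open the local copy with a normal Lucene searcher.
    Paths and the class name DeployAndSearchShard are hypothetical; the calls
    assume a Lucene 2.4-era API.

    // Hedged sketch: pull an index shard from HDFS and search it locally.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class DeployAndSearchShard {
        public static void main(String[] args) throws Exception {
            // 1. Copy the shard from HDFS to local disk.
            FileSystem fs = FileSystem.get(new Configuration());
            fs.copyToLocalFile(new Path("/indexes/shard-0"),
                               new Path("/local/search/shard-0"));

            // 2. Search the local copy with plain Lucene (random access is local now).
            IndexSearcher searcher =
                    new IndexSearcher(FSDirectory.getDirectory("/local/search/shard-0"));
            TopDocs hits = searcher.search(
                    new TermQuery(new Term("content", "hadoop")), null, 10);
            System.out.println("hits: " + hits.totalHits);
            searcher.close();
        }
    }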

  • Doug Cutting at Mar 16, 2009 at 9:37 pm

    Ning Li wrote:
    > With http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
    > become feasible to search on HDFS directly.

    I don't think HADOOP-4801 is required. It would help, certainly, but
    it's so fraught with security and other issues that I doubt it will be
    committed anytime soon.

    What would probably help HDFS random access performance for Lucene
    significantly would be:
    1. A cache of connections to datanodes, so that each seek() does not
    require an open(). If we move HDFS data transfer to be RPC-based (see,
    e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will
    come for free, since RPC already caches connections. We hope to do this
    for Hadoop 1.0, so that we use a single transport for all Hadoop's core
    operations, to simplify security.
    2. A local cache of read-only HDFS data, equivalent to the kernel's buffer
    cache. This might be implemented as a Lucene Directory that keeps an
    LRU cache of buffers from a wrapped filesystem, perhaps a subclass of
    RAMDirectory.

    With these, performance would still be slower than a local drive, but
    perhaps not so dramatically.

    Doug
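
    A loose, hedged sketch of Doug's second suggestion. It is not a complete
    Lucene Directory implementation, just the LRU block cache that such a
    Directory (or IndexInput wrapper) could sit on top of; the block size, cache
    capacity and the class name HdfsBlockCache are assumptions.

    // Hedged sketch: an LRU cache of read-only blocks fetched from one HDFS
    // file, playing the role the kernel's buffer cache plays for a local index.
    import java.io.EOFException;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockCache {
        private static final int BLOCK_SIZE = 64 * 1024;   // 64 KB cache blocks
        private final FSDataInputStream in;
        private final long fileLength;
        // LRU map from block index to block bytes, bounded to maxBlocks entries.
        private final LinkedHashMap<Long, byte[]> cache;

        public HdfsBlockCache(FileSystem fs, Path path, final int maxBlocks)
                throws Exception {
            this.in = fs.open(path);
            this.fileLength = fs.getFileStatus(path).getLen();
            this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                    return size() > maxBlocks;   // evict the least-recently-used block
                }
            };
        }

        // Read 'len' bytes at 'pos' into 'buf', serving from cached blocks and
        // touching HDFS only on a cache miss.
        public synchronized void read(long pos, byte[] buf, int off, int len)
                throws Exception {
            if (pos < 0 || pos + len > fileLength) {
                throw new EOFException("read past end of file");
            }
            while (len > 0) {
                long blockIdx = pos / BLOCK_SIZE;
                byte[] block = cache.get(blockIdx);
                if (block == null) {
                    long blockStart = blockIdx * BLOCK_SIZE;
                    int blockLen = (int) Math.min(BLOCK_SIZE, fileLength - blockStart);
                    block = new byte[blockLen];
                    in.readFully(blockStart, block, 0, blockLen);  // one HDFS read per miss
                    cache.put(blockIdx, block);
                }
                int blockOff = (int) (pos % BLOCK_SIZE);
                int n = Math.min(len, block.length - blockOff);
                System.arraycopy(block, blockOff, buf, off, n);
                pos += n; off += n; len -= n;
            }
        }
    }

    A real Directory subclass would route its openInput() reads through a cache
    like this, so repeated seeks into the same region of a segment file are served
    from RAM instead of going back to a datanode each time.
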
  • Ning Li at Mar 16, 2009 at 11:22 pm
    1 is good. But for 2:
    - Won't it have a security concern as well? Or is this not a general
    local cache?
    - You are referring to caching in RAM, not caching in the local FS,
    right? In general, a Lucene index can be quite large. We may
    have to cache a lot of data to reach a reasonable hit ratio...

    Cheers,
    Ning

  • Doug Cutting at Mar 17, 2009 at 6:31 pm

    Ning Li wrote:
    > 1 is good. But for 2:
    > - Won't it have a security concern as well? Or is this not a general
    > local cache?

    A client-side RAM cache would be filled through the same security
    mechanisms as all other filesystem accesses.

    > - You are referring to caching in RAM, not caching in the local FS,
    > right? In general, a Lucene index can be quite large. We may
    > have to cache a lot of data to reach a reasonable hit ratio...

    Lucene on a local disk benefits significantly from the local
    filesystem's RAM cache (aka the kernel's buffer cache). HDFS has no
    such local RAM cache outside of the stream's buffer. The cache would
    need to be no larger than the kernel's buffer cache to get an equivalent
    hit ratio. And if you're accessing a remote index then you shouldn't
    also need a large buffer cache.

    Doug
  • Ning Li at Mar 17, 2009 at 11:03 pm

    > Lucene on a local disk benefits significantly from the local
    > filesystem's RAM cache (aka the kernel's buffer cache). HDFS has no
    > such local RAM cache outside of the stream's buffer. The cache would
    > need to be no larger than the kernel's buffer cache to get an equivalent
    > hit ratio.

    If the two cache sizes are the same, then yes. It's just that the local FS
    cache size is adjusted (more?) dynamically.


    Cheers,
    Ning
  • Ctam at Oct 7, 2009 at 4:31 am
    Hi Ning, I am also looking at different approaches to indexing with Hadoop.
    I could build indexes into HDFS using the contrib package, but since HDFS is
    not designed for random access, what are the recommended ways to move the
    indexes to the local file system?

    Also, what would be the best approach to begin with? Should we look into
    Katta or Solr integrations?

    Thanks in advance.


  • Jason Venner at Oct 7, 2009 at 4:10 pm
    Check out Katta, as it can pull indexes from HDFS and deploy them into your
    search cluster.
    Katta also handles index directories that have been packed into a zip file,
    and it can pull indexes from any file system that Hadoop supports: hdfs, s3,
    hftp, file, etc.

    We have been doing this with our Solr (SOLR-1301) indexes and getting an 80%
    reduction in size, which is a big gain for us.

    I need to feed a two-line change back into SOLR-1301, as the close method can
    currently fail to heartbeat while the optimize is happening in some
    situations.


Discussion Overview
group: common-user
categories: hadoop
posted: Mar 13, '09 at 4:38a
active: Oct 7, '09 at 4:10p
posts: 13
users: 7
website: hadoop.apache.org...
irc: #hadoop
