BloomMapFile - fail-fast version of MapFile for sparsely populated key space
----------------------------------------------------------------------------

Key: HADOOP-3063
URL: https://issues.apache.org/jira/browse/HADOOP-3063
Project: Hadoop Core
Issue Type: Improvement
Components: io
Affects Versions: 0.17.0
Reporter: Andrzej Bialecki
Fix For: 0.17.0


The need for this improvement arose when working with large ancillary MapFiles (essentially used as external dictionaries). For each invocation of map() / reduce() it was necessary to perform several look-ups in these MapFiles, and with a sparsely populated key space the cost of finding that a key is absent was too high.

This patch implements a subclass of MapFile that creates a Bloom filter from all keys, so that tests for the absence of a key can be performed quickly and without false negatives: if the filter reports a key as absent, it is definitely absent.

Writer.append() operations update a DynamicBloomFilter, which is then serialized when the Writer is closed. This filter is loaded into memory when a Reader is created. The Reader.get() operation first checks the filter for key membership, and if the key is absent it immediately returns null without doing any further IO.
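
The append/get flow described above can be illustrated with a minimal, self-contained sketch. It does not use the Hadoop classes: BloomGuardedMap, its fields, and the double-hashing scheme are hypothetical stand-ins chosen for illustration, with a TreeMap playing the role of the on-disk MapFile and a fixed-size BitSet in place of a DynamicBloomFilter.

```java
import java.util.BitSet;
import java.util.TreeMap;

/** Hypothetical stand-in for a MapFile fronted by a Bloom filter; not the Hadoop API. */
class BloomGuardedMap {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;
    // The TreeMap stands in for the sorted, on-disk MapFile data.
    private final TreeMap<String, String> store = new TreeMap<>();
    // Counts how often the "expensive" store lookup actually runs.
    int storeLookups = 0;

    BloomGuardedMap(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // odd step so successive probes differ
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Like Writer.append(): store the entry and record the key in the filter. */
    void append(String key, String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
        store.put(key, value);
    }

    /** True if the key may be present; false means it is definitely absent. */
    boolean probablyHasKey(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    /** Like Reader.get(): consult the filter first and skip the store on a miss. */
    String get(String key) {
        if (!probablyHasKey(key)) {
            return null; // fail fast, no store access
        }
        storeLookups++;
        return store.get(key);
    }
}
```

A false positive in the filter only costs one unnecessary store lookup; the result of get() is still correct, because the store itself is consulted whenever the filter answers "maybe".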

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

  • Andrzej Bialecki (JIRA) at Mar 21, 2008 at 11:12 am
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Andrzej Bialecki updated HADOOP-3063:
    --------------------------------------

    Attachment: bloommap.patch

    BloomMapFile implementation and JUnit test.

    NOTE 1: I wasn't sure how to approach the issue of the org.onelab.* classes that I borrowed from HBase (which originally were a part of Hadoop core ;) ). For now they are included verbatim here in this patch.

    NOTE 2: the BloomFilter and DynamicBloomFilter classes contained a few bugs related to their Writable (de)serialization, some of them related to specific assumptions about the environment, some others fatal under all conditions. This patch contains these fixes too.
  • Jim Kellerman (JIRA) at Mar 21, 2008 at 4:21 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581112#action_12581112 ]

    Jim Kellerman commented on HADOOP-3063:
    ---------------------------------------

    If you have fixes for BloomFilter and DynamicBloomFilter, please open a Jira for HBase https://issues.apache.org/jira/browse/HBASE and submit a patch there as well. Thanks.
  • Doug Cutting (JIRA) at Mar 21, 2008 at 4:35 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581114#action_12581114 ]

    Doug Cutting commented on HADOOP-3063:
    --------------------------------------

    We should avoid replicating source files. If files are copied from HBase to Core, they should ideally then be removed from HBase, since HBase relies on Core.

    As for the org.onelab classes: shouldn't we import these as a jar rather than as source?
  • Andrzej Bialecki (JIRA) at Mar 21, 2008 at 4:45 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581115#action_12581115 ]

    Andrzej Bialecki commented on HADOOP-3063:
    -------------------------------------------

    Re: replicating the source file - I agree, I just wasn't sure how to solve this.

    My experience with these classes is that they are not well-debugged, so often bug fixing is necessary. The original creator was an EU IST project, so it's unlikely we could expect any maintenance from that project. Therefore I don't think that packaging them as a third-party jar is a good option.

    I think it would be best to import these classes into the org.apache.hadoop.util.bloom package, keep the comment about the original authors in the javadoc, as we do now, and subsequently remove these classes from HBase.
  • Jim Kellerman (JIRA) at Mar 21, 2008 at 5:05 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581123#action_12581123 ]

    Jim Kellerman commented on HADOOP-3063:
    ---------------------------------------

    We can't import them as a jar because they implement Writable.
  • Doug Cutting (JIRA) at Mar 21, 2008 at 5:17 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581128#action_12581128 ]

    Doug Cutting commented on HADOOP-3063:
    --------------------------------------
    "I think it would be best to import these classes into org.apache.hadoop.util.bloom package [...]"
    +1

    The license should also be added to our top-level LICENSE.txt, in the style of

    http://svn.apache.org/repos/asf/httpd/httpd/trunk/LICENSE

  • Robert Chansler (JIRA) at Mar 25, 2008 at 3:08 am
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Chansler updated HADOOP-3063:
    ------------------------------------

    Fix Version/s: (was: 0.17.0)
  • Andrzej Bialecki (JIRA) at Mar 29, 2008 at 9:53 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Andrzej Bialecki updated HADOOP-3063:
    --------------------------------------

    Attachment: bloommap-v2.patch

    Updated patch. This patch imports the Bloom filter classes into org.apache.hadoop.util.bloom, and adds a notice to LICENSE.txt.
  • Andrzej Bialecki (JIRA) at Mar 29, 2008 at 9:55 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Andrzej Bialecki updated HADOOP-3063:
    --------------------------------------

    Status: Patch Available (was: Open)
  • Hadoop QA (JIRA) at Mar 29, 2008 at 11:07 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583379#action_12583379 ]

    Hadoop QA commented on HADOOP-3063:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12378873/bloommap-v2.patch
    against trunk revision 619744.

    @author +1. The patch does not contain any @author tags.

    tests included +1. The patch appears to include 4 new or modified tests.

    javadoc -1. The javadoc tool appears to have generated 1 warning messages.

    javac -1. The applied patch generated 579 javac compiler warnings (more than the trunk's current 568 warnings).

    release audit +1. The applied patch does not generate any new release audit warnings.

    findbugs -1. The patch appears to introduce 3 new Findbugs warnings.

    core tests +1. The patch passed core unit tests.

    contrib tests +1. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2098/console

    This message is automatically generated.
  • Andrzej Bialecki (JIRA) at Apr 10, 2008 at 4:55 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587703#action_12587703 ]

    Andrzej Bialecki commented on HADOOP-3063:
    -------------------------------------------

    I'm not sure what to do about the Findbugs warnings. There are two of them, and neither makes sense to me.

    * CN_IDIOM_NO_SUPER_CALL: clone method does not call super.clone() in BloomFilter.clone() and in DynamicBloomFilter.clone() - how am I supposed to use Object.clone() here?

    * IS2_INCONSISTENT_SYNC: Inconsistent synchronization in BloomMapFile.initBloomFilter() - this access is not synchronized because it's called from the constructor. Other accesses are synchronized because the MapFile methods require synchronization. Should I add "synchronized" to initBloomFilter(), even though it's not needed there?
  • Owen O'Malley (JIRA) at Apr 22, 2008 at 9:36 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Owen O'Malley updated HADOOP-3063:
    ----------------------------------

    Status: Open (was: Patch Available)

    It is complaining that your clone methods don't call super.clone(), which is generally considered good style. (So in your case, Filter needs a clone method that copies its fields and is called by the subtypes.)

    Yes, please add synchronization. It is far easier to add synchronization to a method than to deal with missing synchronization.

    You also need to address the javac and javadoc warnings.
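
The clone() advice above can be sketched as follows. This is only an illustration of the usual super.clone() idiom, not the code from the patch: FilterSketch, BloomFilterSketch, and their fields are hypothetical names loosely modelled on the Filter and BloomFilter classes mentioned in the thread.

```java
import java.util.BitSet;

/** Hypothetical base class, loosely modelled on the Filter type named in the thread. */
abstract class FilterSketch implements Cloneable {
    protected int vectorSize;
    protected int nbHash;

    FilterSketch(int vectorSize, int nbHash) {
        this.vectorSize = vectorSize;
        this.nbHash = nbHash;
    }

    @Override
    public FilterSketch clone() {
        try {
            // Object.clone() copies every field and preserves the runtime class,
            // so subclasses get an instance of their own type back.
            return (FilterSketch) super.clone();
        } catch (CloneNotSupportedException e) {
            // Cannot happen: this class implements Cloneable.
            throw new AssertionError("Cloneable is implemented", e);
        }
    }
}

/** Hypothetical subclass holding mutable state that needs a deep copy. */
class BloomFilterSketch extends FilterSketch {
    private BitSet bits;

    BloomFilterSketch(int vectorSize, int nbHash) {
        super(vectorSize, nbHash);
        this.bits = new BitSet(vectorSize);
    }

    void set(int index) {
        bits.set(index);
    }

    boolean get(int index) {
        return bits.get(index);
    }

    @Override
    public BloomFilterSketch clone() {
        // Delegate to the base class first, then replace shared mutable state.
        BloomFilterSketch copy = (BloomFilterSketch) super.clone();
        copy.bits = (BitSet) bits.clone();
        return copy;
    }
}
```

Because the chain ends in Object.clone(), the primitive fields of the base class are copied automatically; each subclass only has to deep-copy its own mutable members, here the BitSet.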
  • Edward Bruce Williams (JIRA) at Dec 2, 2008 at 10:23 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652545#action_12652545 ]

    Edward Bruce Williams commented on HADOOP-3063:
    -----------------------------------------------

    What is the status of this? I was assigned to fix some bugs in the HBase Bloom filter code, but moving the code back into Hadoop and removing it from HBase needs to be done first. Can I consider HBase 533 unblocked? There is also a new hashing mechanism, MurmurHash, that Andrzej added to HBase and that should also be moved back up into Hadoop.
  • Andrzej Bialecki (JIRA) at Dec 2, 2008 at 10:33 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652556#action_12652556 ]

    Andrzej Bialecki commented on HADOOP-3063:
    -------------------------------------------

    Sorry, I got involved in other stuff - I'd like to prepare a new patch in the next day or two, perhaps we could squeeze it into 0.20 ... For the purpose of this patch I'll take the latest BloomFilter-s as they are now in HBase (plus the MurmurHash).
  • Edward Bruce Williams (JIRA) at Dec 2, 2008 at 10:57 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652575#action_12652575 ]

    Edward Bruce Williams commented on HADOOP-3063:
    -----------------------------------------------

    Thanks, refactoring the onelab code into a Hadoop util.bloomfilters package would clean up some awkward code tangles in HBase.
  • Andrzej Bialecki (JIRA) at Dec 8, 2008 at 12:23 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Andrzej Bialecki updated HADOOP-3063:
    --------------------------------------

    Attachment: bloommap-v3.patch

    This patch uses the latest versions of Bloom filters and Hash implementations from HBase. It successfully passes the 'ant test-patch' command:

    [exec] +1 overall.
    [exec]
    [exec] +1 @author. The patch does not contain any @author tags.
    [exec]
    [exec] +1 tests included. The patch appears to include 4 new or modified tests.
    [exec]
    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
    [exec]
    [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    [exec]
    [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
    [exec]
    [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

  • Andrzej Bialecki (JIRA) at Dec 8, 2008 at 12:25 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Andrzej Bialecki updated HADOOP-3063:
    --------------------------------------

    Affects Version/s: 0.20.0 (was: 0.17.0)
    Fix Version/s: 0.20.0
  • Andrzej Bialecki (JIRA) at Dec 9, 2008 at 9:16 am
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Andrzej Bialecki updated HADOOP-3063:
    --------------------------------------

    Status: Patch Available (was: Open)

    Updated patch.
  • Andrzej Bialecki (JIRA) at Dec 10, 2008 at 2:25 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655223#action_12655223 ]

    Andrzej Bialecki commented on HADOOP-3063:
    -------------------------------------------

    Any chance that this issue makes it into 0.20?
  • Hadoop QA (JIRA) at Dec 11, 2008 at 8:37 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655768#action_12655768 ]

    Hadoop QA commented on HADOOP-3063:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12395548/bloommap-v3.patch
    against trunk revision 725729.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 4 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    -1 core tests. The patch failed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3709/console

    This message is automatically generated.
  • stack (JIRA) at Dec 11, 2008 at 9:25 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655778#action_12655778 ]

    stack commented on HADOOP-3063:
    -------------------------------

    Andrzej: It failed its own unit test up on Hudson, org.apache.hadoop.io.TestBloomMapFile.testMembershipTest:

    {code}
    Error Message

    hashType must be known

    Stacktrace

    java.lang.IllegalArgumentException: hashType must be known
    at org.apache.hadoop.util.bloom.HashFunction.(Filter.java:102)
    at org.apache.hadoop.util.bloom.DynamicBloomFilter.(BloomMapFile.java:150)
    at org.apache.hadoop.io.BloomMapFile$Writer.(TestBloomMapFile.java:37)
    {code}

    I ran it locally and got the same failure.


  • stack (JIRA) at Dec 11, 2008 at 10:35 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655793#action_12655793 ]

    stack commented on HADOOP-3063:
    -------------------------------

    Reviewing the patch, it looks great. The above failure would seem to be because the last two arguments need to be flipped:

    {code}
    + bloomFilter = new DynamicBloomFilter(vectorSize, HASH_COUNT, numKeys,
    + Hash.getHashType(conf));
    {code}

    ...but the DynamicBloomFilter (DBF) constructor looks like this:

    {code}
    + public DynamicBloomFilter(int vectorSize, int nbHash, int hashType, int nr) {
    {code}

    Test still fails with an assertion error so maybe the above is not right.

    There is still an HBase-ism in the code that you probably want to remove:

    {code}
    + String name = conf.get("hbase.hash.type", "murmur");
    {code}
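The argument-order problem described in this review can be reproduced in isolation. The sketch below is a minimal illustration, not the Hadoop code: `SketchFilter`, its hash-type constants, and `ArgumentOrderDemo` are hypothetical names, and only the parameter order (vectorSize, nbHash, hashType, nr) and the error message are taken from the thread. It shows why the compiler cannot catch the swap (all four parameters are `int`) and how a validity check on the hash type surfaces it at run time.

```java
// Hypothetical stand-in for the filter discussed above; NOT the Hadoop class.
final class SketchFilter {
    // Hypothetical hash-type codes, chosen only for this illustration.
    static final int JENKINS = 0;
    static final int MURMUR = 1;

    final int vectorSize;
    final int nbHash;
    final int hashType;
    final int nr;

    // Same parameter order as the constructor quoted in the review:
    // (vectorSize, nbHash, hashType, nr).
    SketchFilter(int vectorSize, int nbHash, int hashType, int nr) {
        if (hashType != JENKINS && hashType != MURMUR) {
            throw new IllegalArgumentException("hashType must be known");
        }
        this.vectorSize = vectorSize;
        this.nbHash = nbHash;
        this.hashType = hashType;
        this.nr = nr;
    }
}

final class ArgumentOrderDemo {
    // Returns "ok" if construction succeeds, otherwise the exception message.
    static String tryCreate(int vectorSize, int nbHash, int hashType, int nr) {
        try {
            new SketchFilter(vectorSize, nbHash, hashType, nr);
            return "ok";
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        int vectorSize = 1024;
        int hashCount = 5;
        int numKeys = 1000;
        // Swapped: numKeys lands in the hashType slot. This compiles because
        // every parameter is an int; only the run-time check rejects it.
        System.out.println(tryCreate(vectorSize, hashCount, numKeys, SketchFilter.MURMUR));
        // Intended order: hash type third, key count last.
        System.out.println(tryCreate(vectorSize, hashCount, SketchFilter.MURMUR, numKeys));
    }
}
```

Note that the check only catches the swap when the misplaced value is not itself a valid hash-type code; with a key count of 0 or 1 in this sketch the mistake would go unnoticed.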
  • Andrzej Bialecki (JIRA) at Dec 13, 2008 at 3:51 am
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Andrzej Bialecki updated HADOOP-3063:
    --------------------------------------

    Status: Open (was: Patch Available)
  • Andrzej Bialecki (JIRA) at Dec 13, 2008 at 3:55 am
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Andrzej Bialecki updated HADOOP-3063:
    --------------------------------------

    Attachment: bloommap-v4.patch

    Apart from fumbling the args passed to the constructor, the unit test failed for a valid reason - the calculation of BloomFilter size was wrong. I fixed the calculation and added a config knob to adjust the error rate.

    This time it passes the tests and gets +1's from ant test-patch.
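For readers unfamiliar with Bloom filter sizing, the relationship between key count, number of hash functions, and error rate can be sketched as follows. This uses the textbook approximation p = (1 - e^(-kn/m))^k for the false-positive rate and solves it for the vector size m. It is an assumption that the patch uses a calculation of this kind; the class and method names below are hypothetical and the thread does not show the patched formula.

```java
// Textbook Bloom filter sizing; a sketch, not the code from the patch.
final class BloomSizing {

    // Approximate false-positive rate for a filter of m bits holding n keys
    // with k hash functions: p = (1 - e^(-k*n/m))^k.
    static double falsePositiveRate(long m, int k, long n) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    // Smallest whole number of bits whose approximate false-positive rate
    // does not exceed p, obtained by solving the formula above for m:
    // m = ceil(-k*n / ln(1 - p^(1/k))).
    static long vectorSize(long n, int k, double p) {
        if (n <= 0 || k <= 0 || p <= 0.0 || p >= 1.0) {
            throw new IllegalArgumentException("need n > 0, k > 0 and 0 < p < 1");
        }
        double bits = -(double) k * n / Math.log(1.0 - Math.pow(p, 1.0 / k));
        return (long) Math.ceil(bits);
    }

    public static void main(String[] args) {
        long n = 1000;      // expected number of keys (illustrative value)
        int k = 5;          // number of hash functions (illustrative value)
        double p = 0.005;   // target error rate (illustrative value)
        long m = vectorSize(n, k, p);
        System.out.println("bits: " + m + ", estimated rate: " + falsePositiveRate(m, k, n));
    }
}
```

A lower target error rate requires a larger bit vector for the same number of keys, which is the trade-off an error-rate setting exposes.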
  • Andrzej Bialecki (JIRA) at Dec 13, 2008 at 3:55 am
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Andrzej Bialecki updated HADOOP-3063:
    --------------------------------------

    Status: Patch Available (was: Open)
  • stack (JIRA) at Dec 13, 2008 at 9:11 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656329#action_12656329 ]

    stack commented on HADOOP-3063:
    -------------------------------

    +1 Test now passes locally with v4. Took quick look at patch. Looks good to me.
  • Hadoop QA (JIRA) at Dec 14, 2008 at 12:27 am
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656352#action_12656352 ]

    Hadoop QA commented on HADOOP-3063:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12395994/bloommap-v4.patch
    against trunk revision 726129.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 4 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    -1 core tests. The patch failed core unit tests.

    -1 contrib tests. The patch failed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3739/console

    This message is automatically generated.
  • stack (JIRA) at Dec 14, 2008 at 3:43 am
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656360#action_12656360 ]

    stack commented on HADOOP-3063:
    -------------------------------

    The failing tests seem unrelated to this patch:

    {code}
    org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.TestFileTailingAdaptors.testLogRotate
    org.apache.hadoop.hdfs.TestFileAppend2.testComplexAppend
    {code}

    IMO, these failures shouldn't hold up committing this patch.
  • stack (JIRA) at Dec 15, 2008 at 9:02 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack updated HADOOP-3063:
    --------------------------

    Resolution: Fixed
    Fix Version/s: 0.20.0
    Release Note: Implements a subclass of MapFile that creates a Bloom filter from all keys, so that accurate tests for absence of keys can be performed quickly and with 100% accuracy.
    Hadoop Flags: [Reviewed]
    Status: Resolved (was: Patch Available)

    Committed on behalf of Andrzej. He's swamped today and he (and I) wanted to get this in before 0.20.0 feature-freeze.
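The fail-fast lookup described in the release note can be illustrated with a small self-contained sketch. It is not the committed BloomMapFile: `FailFastMap` is a hypothetical class, a `TreeMap` stands in for the on-disk MapFile, and the double-hashing scheme is an arbitrary choice for the example. The property it demonstrates is the one the feature relies on: a Bloom filter never reports a stored key as absent, so a negative answer can skip the expensive lookup, while a positive answer may be a false positive and still has to be confirmed against the real store.

```java
import java.util.BitSet;
import java.util.TreeMap;

// Toy model of a map with a Bloom filter in front of it; NOT Hadoop code.
final class FailFastMap {
    private final BitSet bits;
    private final int vectorSize;
    private final int hashCount;
    // Stands in for the on-disk MapFile whose lookups we want to avoid.
    private final TreeMap<String, String> store = new TreeMap<>();
    private int storeLookups = 0;

    FailFastMap(int vectorSize, int hashCount) {
        if (vectorSize <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("vectorSize and hashCount must be positive");
        }
        this.vectorSize = vectorSize;
        this.hashCount = hashCount;
        this.bits = new BitSet(vectorSize);
    }

    // i-th bit position for a key, using double hashing (h1 + i * h2).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, vectorSize);
    }

    // Writer side: every appended key is also recorded in the filter.
    void append(String key, String value) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
        store.put(key, value);
    }

    // False means the key is definitely absent; true means "maybe present".
    boolean probablyHasKey(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Reader side: consult the filter first and return null without
    // touching the store when the key is definitely absent.
    String get(String key) {
        if (!probablyHasKey(key)) {
            return null;
        }
        storeLookups++;
        return store.get(key);
    }

    int storeLookups() {
        return storeLookups;
    }
}
```

In a sparsely populated key space most probes are for absent keys, so most calls to `get` return before the backing store is consulted; the occasional false positive costs one ordinary lookup that then returns null.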
  • stack (JIRA) at Dec 15, 2008 at 9:02 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack reassigned HADOOP-3063:
    -----------------------------

    Assignee: Andrzej Bialecki
  • Robert Chansler (JIRA) at Mar 3, 2009 at 7:09 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Chansler updated HADOOP-3063:
    ------------------------------------

    Release Note: Introduced BloomMapFile subclass of MapFile that creates a Bloom filter from all keys. (was: Implements a subclass of MapFile that creates a Bloom filter from all keys, so that accurate tests for absence of keys can be performed quickly and with 100% accuracy.)

    Edit release note for publication.

Discussion Overview
group: common-dev @
categories: hadoop
posted: Mar 21, '08 at 11:06a
active: Mar 3, '09 at 7:09p
posts: 32
users: 1
website: hadoop.apache.org...
irc: #hadoop
