Am I right in thinking that the CompositeInputFormat is limited to joining
64 files?

I believe this comes about because TupleWritable uses a single long
instance field to maintain a bitset of the tuple slots that have been
written to. I'm guessing this is for performance reasons, but it also
implies that TupleWritable only has 64 bits to play with when joining.

If my assumptions above are true, would replacing this long with a
java.util.BitSet be appropriate as a way of making the map-side join
package more scalable?
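
To make the limit concrete, here is a minimal sketch of the two approaches. It is not the actual TupleWritable source: the class and method names (SlotTracking, setSlot, hasSlot) are made up for illustration, and it only assumes what is described above, namely one bit per tuple slot held in a single long.

```java
import java.util.BitSet;

// Hypothetical illustration only; not the real TupleWritable code.
class SlotTracking {

    // Marks slot i as written in a long used as a 64-bit mask.
    static long setSlot(long written, int i) {
        return written | (1L << i);
    }

    // Reports whether slot i is marked in the long mask.
    static boolean hasSlot(long written, int i) {
        return (written & (1L << i)) != 0;
    }

    public static void main(String[] args) {
        long written = setSlot(0L, 3);

        // Java uses only the low 6 bits of the shift distance for a long,
        // so slot 67 silently aliases slot 3 (67 & 63 == 3).
        System.out.println(hasSlot(written, 67)); // true

        // java.util.BitSet grows on demand, so slots 3 and 67 stay distinct.
        BitSet bits = new BitSet();
        bits.set(3);
        System.out.println(bits.get(67)); // false
        bits.set(67);
        System.out.println(bits.cardinality()); // 2
    }
}
```

The long version costs one primitive field and a couple of bit operations, which is presumably the performance motivation; the BitSet version trades an extra object and a variable-length serialized form for an unbounded slot count.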

  • Jason hadoop at Mar 25, 2009 at 3:16 pm
    That code is highly optimized and quite difficult to follow. We have always
    limited our joins to 31 members and ignored the problem.
    But I think filing your JIRA and fixing it are the correct choices.

    There is, in my opinion, a decent write-up on how to use map-side joins in
    chapter 8 of my book, so I suspect more people will use this soon, as the
    map-side join is an incredibly powerful tool.

    In one of our production applications it cut the run time from over 5 hours
    to about 12 minutes.

    --
    Alpha Chapters of my book on Hadoop are available
    http://www.apress.com/book/view/9781430219422
  • Jingkei Ly at Mar 25, 2009 at 5:30 pm
    Yes, we are leaning on the map-side join package quite heavily too. It is
    an excellent addition to the MapReduce model that's proving really useful.
    However, while HADOOP-5571 is an immediate problem for us, I can imagine
    that we will probably want to join more than 64 files soon as well,
    especially if we move on to larger clusters.

  • Jason hadoop at Mar 26, 2009 at 4:10 am
    Be aware that there is a job run-time cost for each additional data set in
    your join.

    On the clusters we were working with (2 GHz Xeon Dell 2950s), each
    additional data set in the join operator added roughly 30 seconds to the
    job run time.

    As a result, we would merge data sets in groups of 10 as a background
    operation. This had the side effect of keeping us out of the "> 32
    tables == badness" problem.

    That is part of the reason we never bothered to attempt to fix the
    table-count-in-join problem: our operational path took us away from the
    issue.


Discussion Overview
group: common-dev @
categories: hadoop
posted: Mar 25, '09 at 2:24p
active: Mar 26, '09 at 4:10a
posts: 4
users: 2
website: hadoop.apache.org...
irc: #hadoop
