Hi,

I want to broadcast some data to all nodes under Hadoop 0.20.2. I tested
the DistributedCache module. Unfortunately, it was time-consuming, and
runtime is important for my work.
I want to write an MR job so that a copy of the input data is generated in
the output of every reducer.
Is that possible? How?
I mean I want as many copies of the data as there are reducers.

Thanks,

Hamid Oliaei

oliaei@gmail.com
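
One way to get what the question asks for is a job whose mapper re-emits
every record once per reducer. The sketch below is a hypothetical Hadoop
Streaming mapper in Python, not something from this thread; NUM_REDUCERS
and the identity reducer are assumptions.

    #!/usr/bin/env python
    # Hypothetical Streaming mapper: emit each input line once per reducer,
    # keyed by a reducer index, so every reducer receives a full copy.
    import sys

    NUM_REDUCERS = 4  # assumption: must match -D mapred.reduce.tasks=4

    for line in sys.stdin:
        line = line.rstrip("\n")
        for i in range(NUM_REDUCERS):
            # With N distinct keys and N reducers, the default hash
            # partitioner spreads the keys roughly one per reducer, but a
            # custom partitioner is needed to guarantee exactly one each.
            sys.stdout.write("%d\t%s\n" % (i, line))

Run it with an identity reducer (e.g. -reducer /bin/cat); each reducer's
part file then holds a copy of the input, modulo the key prefix.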

  • Tim Robertson at Aug 23, 2012 at 8:45 am
    So you are trying to run a single reducer on each machine, with all
    input data, regardless of its location, streamed to every reducer?
  • Hamid Oliaei at Aug 23, 2012 at 8:47 am
    exactly!!
  • Tim Robertson at Aug 23, 2012 at 10:05 am
    Sorry to ask so many questions, but it will help the user list offer
    you the best advice, as this is not a typical MR use case.

    - Do you foresee the reducer storing the data on the machine's local
    file system?
    - Do you need to use specific input formats for the job, or is it really
    just text files?
    - Are the input files on HDFS, or are you (e.g.) reading from HBase or
    some other source?

    If your data is on HDFS, and it is just text files, have you considered
    a simple HDFS getmerge on each machine? Several tools (e.g. Fabric)
    could trigger a getmerge on every machine; a sketch follows at the end
    of this message.

    The problem with MR for this is that you would be circumventing (if it
    is at all possible) the job scheduling, which tries to balance the load
    across the cluster.

    Cheers,
    Tim
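
    A minimal sketch of the Fabric idea, assuming Fabric 1.x and
    hypothetical host names and HDFS paths:

        # fabfile.py
        from fabric.api import env, run

        env.hosts = ["node1", "node2", "node3"]  # assumption: cluster nodes

        def pull_copy():
            # merge every part file under the HDFS directory into a single
            # local file; "fab pull_copy" runs this once per host above
            run("hadoop fs -getmerge /data/input /tmp/input_copy.txt")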
  • Hamid Oliaei at Aug 23, 2012 at 11:09 am
    Hi,

    First of all, thank you Tim for giving your time.

    The answer to the first question is yes.
    My inputs are triples in the format (sub, pre, obj), and they are stored
    on HDFS.
    The problem is: after running some MR jobs, data is generated on all of
    the machines, and I want each machine to send part of it to the others
    in minimum time, for use in the next phase.
    I know this goes against the nature of MR, but it was the first solution
    that came to mind, and I would be glad to hear other suggestions.

    Regards,
    Hamid
  • Tim Robertson at Aug 23, 2012 at 1:14 pm
    Then I think you might be best off exploring running a getmerge on each
    client. How you trigger that is up to you, but something like Fabric [1]
    might help. Others might propose different solutions, but MR does not
    sound like a natural choice to me.

    I would expect this to be the fastest way of getting the data locally.

    There is one alternative you might consider: set the replication factor
    to the same value as the number of machines for whatever is producing
    the input files. That way every block will be local to every machine,
    although the data will likely still be split into multiple files
    (part-00000, part-00001, etc.); see the sketch after the footnote below.

    I hope this helps,
    Tim

    [1] http://docs.fabfile.org/en/1.4.3/index.html
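
    A minimal sketch of the replication-factor alternative, using the
    standard "hadoop fs -setrep" shell command; the machine count and the
    HDFS path are assumptions:

        import subprocess

        NUM_MACHINES = 4  # assumption: number of nodes in the cluster

        # -w waits until HDFS finishes re-replicating every block
        subprocess.check_call(
            ["hadoop", "fs", "-setrep", "-w", str(NUM_MACHINES),
             "/data/input"])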

  • Hamid Oliaei at Aug 23, 2012 at 1:21 pm
    Hi,

    I will take a look at that; I hope it will be useful for my purpose.

    Thank you so much.

    Hamid
  • Sonal Goyal at Aug 23, 2012 at 1:37 pm
    Hamid,

    I would recommend taking another look at your current algorithm and
    making sure you are utilizing the MR framework's strengths. You could
    evaluate making multiple passes in your MapReduce program, or doing a
    map-side join (a sketch follows below). You mention that runtime is
    important for your system, so make sure you preserve data locality in
    the generated tasks.

    HTH.

    Best Regards,
    Sonal
    Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
    Nube Technologies <http://www.nubetech.co>

    <http://in.linkedin.com/in/sonalgoyal>
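
    A hypothetical illustration of the map-side join suggestion, as a
    Hadoop Streaming mapper in Python: a small lookup file shipped to each
    task (e.g. via the streaming -cacheFile option) is loaded into memory
    and joined against the stream, so no reduce phase is needed. The file
    name and tab-separated layout are assumptions.

        #!/usr/bin/env python
        import sys

        # the shipped lookup file appears in the task's working directory
        lookup = {}
        with open("small_table.txt") as f:
            for entry in f:
                k, v = entry.rstrip("\n").split("\t", 1)
                lookup[k] = v

        for line in sys.stdin:
            k, rest = line.rstrip("\n").split("\t", 1)
            if k in lookup:
                # emit the joined record
                sys.stdout.write("%s\t%s\t%s\n" % (k, rest, lookup[k]))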
