Multi-threaded Reduce

  • Nguyen Manh Tien at Oct 4, 2007 at 1:50 am
    I know that in Hadoop we can implement multi-threaded, asynchronous mapping
    with the MapRunnable class, but there is no similar class for
    multi-threading in the reduce phase. Can we use multiple threads in the
    reduce phase? Does the following code work?

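    public void reduce(WritableComparable key, Iterator values,
        OutputCollector output, Reporter reporter) {
      // Hand the OutputCollector and the current pair to a worker thread.
      new SomeThread(output, key, (Writable) values.next()).start();
    }

    public class SomeThread extends Thread {
      private final OutputCollector output;
      private final WritableComparable key;
      private final Writable value;

      public SomeThread(OutputCollector output, WritableComparable key,
          Writable value) {
        this.output = output;
        this.key = key;
        this.value = value;
      }

      public void run() {
        try {
          output.collect(key, value);
        } catch (IOException e) { // collect() declares IOException
          throw new RuntimeException(e);
        }
      }
    }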
  • Ted Dunning at Oct 4, 2007 at 5:59 am
    You don't need to do all that work.

    Just set:

    <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
    <description>The default number of reduce tasks per job. Typically set
    to a prime close to the number of available hosts.
    Ignored when mapred.job.tracker is "local".
    </description>
    </property>



    Either in hadoop-site.xml or in your program using
    conf.setInt("mapred.reduce.tasks", 4)

    That will give you 4 reduce threads. You can have lots more than that if
    you like.

  • Arun C Murthy at Oct 4, 2007 at 8:36 am

    Ted Dunning wrote:
    > That will give you 4 reduce threads. You can have lots more than that if
    > you like.

    Err... no.

    *mapred.reduce.tasks* is the default number of reduces for a job.

    I think the config knob you want is *mapred.tasktracker.tasks.maximum*.
    (http://lucene.apache.org/hadoop/hadoop-default.html#mapred.tasktracker.tasks.maximum)

    That, btw, is the maximum number of tasks of a given kind (map or reduce)
    that can run simultaneously on a given tasktracker (in separate
    JVMs). This is a cluster-wide limit, and there are JIRA issues open to
    make it a per-tracker knob (HADOOP-1245 & HADOOP-1274).

    Arun
  • Ted Dunning at Oct 4, 2007 at 5:28 pm
    Arun,

    I think that you are strictly correct, but that the original questioner
    simply needed some parallelism for reduces, not necessarily parallelism on a
    single node.

    I could be very wrong. It is always difficult to determine what a question
    means, of course, since the person asking the question generally doesn't
    understand something about the system (hence the question).

  • Joydeep Sen Sarma at Oct 4, 2007 at 6:43 pm
    I got the impression that the original question was about
    MultithreadedMapRunner.java and how something like that could be
    implemented in the reduce phase as well.

    Nguyen - your code looks alright, but there's no limit on the number of
    threads you would end up spawning (something that the map runner avoids).

    -----Original Message-----
    From: Ted Dunning
    Sent: Thursday, October 04, 2007 10:27 AM
    To: hadoop-user@lucene.apache.org
    Subject: Re: Multi-threaded Reduce


    Arun,

    I think that you are strictly correct, but that the original questioner
    simply needed some parallelism for reduces, not necessarily parallelism on a
    single node.

    I could be very wrong. It is always difficult to determine what a question
    means, of course, since the person asking the question generally doesn't
    understand something about the system (hence the question).

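
Editor's note: the last reply points out that the posted code starts threads without any upper bound. A minimal sketch of one way to cap the thread count is a fixed-size pool from java.util.concurrent. Everything named here is an assumption for illustration: `Collector` and `ListCollector` are hypothetical stand-ins for Hadoop's OutputCollector so that the example runs without Hadoop on the classpath, and `toUpperCase()` stands in for whatever per-value work the real reducer does. This is not how MultithreadedMapRunner is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedReduceSketch {

    /** Hypothetical stand-in for Hadoop's OutputCollector. */
    interface Collector {
        void collect(String key, String value);
    }

    /** Stores emitted pairs in a list; synchronized because several workers call it. */
    static class ListCollector implements Collector {
        final List<String> pairs = new ArrayList<>();

        @Override
        public synchronized void collect(String key, String value) {
            pairs.add(key + "=" + value);
        }
    }

    /**
     * Processes the values of one key on at most maxThreads worker threads
     * and returns only after every submitted task has finished.
     */
    static void reduce(String key, Iterator<String> values, Collector output,
                       int maxThreads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        while (values.hasNext()) {
            final String value = values.next(); // each task keeps its own reference
            // toUpperCase() is a placeholder for the real per-value work.
            pool.execute(() -> output.collect(key, value.toUpperCase()));
        }
        pool.shutdown(); // accept no new tasks; queued tasks still run
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("workers did not finish in time");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ListCollector out = new ListCollector();
        reduce("k", Arrays.asList("a", "b", "c", "d", "e").iterator(), out, 2);

        // Workers finish in no fixed order, so sort before printing.
        List<String> sorted = new ArrayList<>(out.pairs);
        Collections.sort(sorted);
        System.out.println(sorted); // prints [k=A, k=B, k=C, k=D, k=E]
    }
}
```

Two design points are worth noting. The method waits for the pool to drain before returning, so no worker is still emitting output after the reduce call has ended. A pool per call keeps the sketch short; a real reducer would more likely create the pool once and shut it down when the task closes.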

Discussion Overview
group: common-user @
categories: hadoop
posted: Oct 4, '07 at 1:44a
active: Oct 4, '07 at 6:43p
posts: 6
users: 4
website: hadoop.apache.org...
irc: #hadoop
