Hi,

I have 3 datanodes/tasktrackers.

I'm running Nutch crawls for a bulk set of URLs (18k) on our Hadoop
cluster.

This runs many MapReduce jobs on each TaskTracker.

I want to turn replication off during the bulk processing to speed it
up, and then re-enable replication after all the jobs have finished.

Is this possible?

After I re-enable replication, will all the existing data in the
cluster be replicated according to the dfs.replication setting
(dfs.replication=3)?

Otherwise, could you suggest any way to make the bulk jobs run faster?

Note: if I run 15 URL crawls concurrently on each TaskTracker, the
average finishing time is 10 minutes. One URL crawl involves 7
MapReduce jobs!

Thanks.

  • Harsh J at Sep 8, 2012 at 12:26 pm
    Hi Bala,

    How have you determined that your issue is related to the
    replication of HDFS writes?

    You can run your jobs with dfs.replication set to 1 in the job
    configuration (or in your mapred-site.xml for a global effect), and
    at the end of the job call FileSystem#setReplication(…)
    http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#setReplication(org.apache.hadoop.fs.Path,%20short)
    on every file in the output directory.

    Calling setReplication bumps the replication factor of the touched
    files back up, but the re-replication happens in the background
    (asynchronously), so your job driver does not need to wait for it
    to complete.
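
    A minimal sketch of that final pass, assuming the crawl output sits
    under one directory (the path and class name are hypothetical):

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class RestoreReplication {

            // Recursively raise the replication factor of every file
            // under dir.
            static void setReplicationRecursive(FileSystem fs, Path dir,
                    short factor) throws IOException {
                for (FileStatus status : fs.listStatus(dir)) {
                    if (status.isDir()) {
                        setReplicationRecursive(fs, status.getPath(), factor);
                    } else {
                        // Only queues the work with the NameNode; the
                        // extra copies are created in the background,
                        // so this call returns quickly.
                        fs.setReplication(status.getPath(), factor);
                    }
                }
            }

            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                // Hypothetical path; point this at your crawl's output.
                setReplicationRecursive(fs, new Path("/user/nutch/crawl"),
                        (short) 3);
                fs.close();
            }
        }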


    --
    Harsh J
