Hi,

I have a question about map parallelism in Pig.

I am using Pig to stream a file through a Python script that performs some
computationally expensive transforms. The job is assigned a single
map task, which can take a very long time if it happens to run on one of
the weaker nodes in the cluster. I am wondering how I can split the work
into multiple map tasks spread across a number of nodes.
From reading
http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+PARALLEL+Clause, I
see that the parallelism of maps is "determined by the input file, one map
for each HDFS block."

The file I am operating on is 40 MB; the block size is 64 MB, so presumably
the file is stored in a single HDFS block. The replication factor for the
file is 3, and the DFS web UI verifies this.

My question is: Is there anything I can do to increase the parallelism of
the map task? Is it the case that the replication factor being 3 does not
influence how many map tasks can be performed simultaneously? Should I use a
smaller HDFS block size?

I am using Hadoop 0.20.2, Pig 0.7.0.

Thanks,
- Charles


  • Dmitriy Ryaboy at Dec 15, 2010 at 4:59 am
    Try

    set mapred.max.split.size $desired_split_size

    -D
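A minimal sketch of how this could look in a Pig script (the filenames, the streaming command, and the 4 MB split size are hypothetical; `mapred.max.split.size` caps the bytes per input split, and each split gets its own map task, so a 40 MB file would fan out across roughly ten maps at this setting):

```
-- cap each input split at ~4 MB so a 40 MB file is processed
-- by ~10 map tasks instead of 1 (paths and script are placeholders)
set mapred.max.split.size 4000000;
A = LOAD 'input.txt';
B = STREAM A THROUGH `transform.py`;
STORE B INTO 'output';
```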
  • Charles W at Dec 15, 2010 at 6:39 pm
    Excellent, that did the trick.

    For reference, I did:

    export PIG_OPTS="$PIG_OPTS -Dmapred.max.split.size=1000000"

    Thanks for your help.

    - Charles
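For a rough sanity check of what that setting buys: one map task is created per input split, so the expected task count is just the file size divided by the split size, rounded up. A quick shell calculation (the 41943040-byte figure is 40 MB, matching the file described above; the 1,000,000-byte split size is the value Charles set):

```shell
# One map task per input split: ceil(file_size / split_size).
FILE_SIZE=41943040   # 40 MB file
SPLIT_SIZE=1000000   # mapred.max.split.size as set above
echo $(( (FILE_SIZE + SPLIT_SIZE - 1) / SPLIT_SIZE ))   # prints 42
```

So instead of one map task, the job can use around forty, at the cost of some per-task startup overhead.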

Discussion Overview

group: user
categories: pig, hadoop
posted: Dec 15, '10 at 4:11a
active: Dec 15, '10 at 6:39p
posts: 3
users: 2 (Charles W: 2 posts, Dmitriy Ryaboy: 1 post)
website: pig.apache.org
