I have a question about map parallelism in Pig.
I am using Pig to stream a file through a Python script that performs some
computationally expensive transforms. This work is assigned to a single
map task, which can take a very long time if it happens to execute on one of
the weaker nodes in the cluster. I am wondering how I can force the work
to be spread across multiple map tasks on different nodes.
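For context, the relevant part of my script looks roughly like this (the alias and script names here are placeholders, not my actual names):

```pig
-- Placeholder sketch: stream each record through a Python transform.
DEFINE transform `transform.py` SHIP('transform.py');
raw = LOAD 'input/data.txt' AS (line:chararray);
out = STREAM raw THROUGH transform;
STORE out INTO 'output';
```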
From reading http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+PARALLEL+Clause, I
see that the parallelism of maps is "determined by the input file, one map
for each HDFS block."
The file I am operating on is 40 MB and the block size is 64 MB, so presumably
the file is stored in a single HDFS block. The replication factor for the
file is 3, which the DFS web UI confirms.
My question is: Is there anything I can do to increase the parallelism of
the map phase? Am I correct that a replication factor of 3 does not
influence how many map tasks can run simultaneously? Should I use a
smaller HDFS block size?
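On the last point, one thing I have considered (but not yet verified) is re-uploading the file with a smaller per-file block size, so that the input spans several blocks and Pig gets one map per block. The path and file name below are placeholders:

```shell
# dfs.block.size is set per file at write time via the generic -D option.
# With 8 MB (8388608-byte) blocks, a 40 MB file should occupy ~5 blocks,
# so Pig should get ~5 map tasks instead of 1.
hadoop fs -D dfs.block.size=8388608 -put data.txt /user/me/input/data.txt
```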
I am using Hadoop 0.20.2, Pig 0.7.0.