Grokbase Groups: Pig user, June 2011
Hi,

This is probably not directly a Pig question.

Is anyone running Pig on Amazon EC2 instances? Something isn't making sense to
me. I ran a Pig script with about 10 MapReduce jobs in it on a 16-node
cluster of m1.small instances, and it took *13 minutes*. The job reads input from S3
and writes output to S3, but from the logs the reading and writing
to/from S3 is pretty fast, and all the intermediate steps should happen on
HDFS.
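
(For reference, the S3 part of such a script is just LOAD/STORE with S3 URIs. A
minimal sketch, with a hypothetical bucket and paths, assuming the cluster's S3
filesystem credentials are configured:)

-- Hypothetical bucket/paths; gzipped text input is decompressed transparently on load.
raw = LOAD 's3n://my-bucket/events/2011-06-13/*.gz' USING PigStorage('\t') AS (line:chararray);
-- (the real script runs roughly 10 MapReduce jobs' worth of transformations here)
STORE raw INTO 's3n://my-bucket/output/2011-06-13/' USING PigStorage('\t');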

Running the same job on my MacBook Pro laptop took only *3 minutes*.

Amazon is using Pig 0.6 while I'm using Pig 0.8 on the laptop; I'll try Pig 0.6
on my laptop too. Some Hadoop config is probably also not ideal. I tried
m1.large instead of m1.small, but it doesn't seem to make a huge difference.
Is there anything you would suggest looking at to track down the slowness on EC2?

Dexin

  • Daniel Dai at Jun 14, 2011 at 5:43 pm
    Curious, a couple of questions:
    1. Are you running in local mode or mapreduce mode?
    2. If mapreduce mode, did you look into the Hadoop logs to see how much
    each mapreduce job slows down?
    3. What kind of query is it?

    Daniel
  • Dexin Wang at Jun 14, 2011 at 5:55 pm
    Thanks for your feedback. My comments below.
    On Tue, Jun 14, 2011 at 10:41 AM, Daniel Dai wrote:

    Curious, a couple of questions:
    1. Are you running in local mode or mapreduce mode?

    Local mode (-x local) when I ran it on my laptop, and mapreduce mode when I
    ran it on the EC2 cluster.

    2. If mapreduce mode, did you look into the Hadoop logs to see how much
    each mapreduce job slows down?

    I'm looking into that.

    3. What kind of query is it?

    The input is gzipped JSON files with one event per line. I do some hourly
    aggregation on the raw events, then a bunch of grouping, joining, and some
    metrics computation (like median and variance) on some fields.
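
    (To make the shape of that pipeline concrete, here is a minimal Pig Latin
    sketch; the input paths, field names, and hour bucketing are hypothetical
    stand-ins for the real script, which also computes medians and variances:)

    -- Hypothetical field layout standing in for the parsed JSON events.
    events  = LOAD 'raw_events' USING PigStorage('\t')
              AS (ts:long, user_id:chararray, metric:double);

    -- Bucket each event into an hour, then aggregate per (user, hour).
    by_hour = FOREACH events GENERATE user_id, ts / 3600 AS hour, metric;
    grouped = GROUP by_hour BY (user_id, hour);
    hourly  = FOREACH grouped GENERATE
                  FLATTEN(group) AS (user_id, hour),
                  COUNT(by_hour)      AS n_events,
                  AVG(by_hour.metric) AS avg_metric;

    -- Join the hourly rollup with a second ~1M-row input.
    users   = LOAD 'users' USING PigStorage('\t')
              AS (user_id:chararray, segment:chararray);
    joined  = JOIN hourly BY user_id, users BY user_id;

    STORE joined INTO 'hourly_joined';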

    Someone mentioned EC2's I/O performance. But I'm sure plenty of people
    run big MR jobs on EC2/EMR, so more likely I have some configuration
    issue? My jobs can be optimized a bit, but the fact that running on my
    laptop is faster tells me this is a separate issue.

    Thanks!


  • Daniel Dai at Jun 14, 2011 at 6:02 pm
    Local mode and mapreduce mode make a huge difference. For a small
    query, the mapreduce overhead will dominate. For a fair comparison, can
    you set up a single-node Hadoop cluster on your laptop and run Pig on it?

    Daniel

  • Dexin Wang at Jun 14, 2011 at 6:07 pm
    Good to know. Trying a single-node Hadoop cluster now. The main input is about
    1+ million lines of events. After some aggregation, it joins with another
    input source that also has about 1+ million rows. Is this considered a small
    query? Thanks.
  • Daniel Dai at Jun 14, 2011 at 7:16 pm
    If the job finishes in 3 minutes in local mode, I would think it is small.

  • Tomas Svarovsky at Jun 14, 2011 at 10:36 pm
    Hi Dexin,

    Since I am a Pig and MapReduce newbie, your post is very
    intriguing to me. I am coming from a Talend background and trying to
    assess whether map/reduce would bring any speedup and faster
    turnaround to my projects. My worry is that my data are too small, so
    that the MapReduce overhead will be prohibitive in certain cases.

    When using Talend, if the transformation was reasonable, it could
    process tens of thousands of rows per second. Processing 1 million rows
    could finish well under 1 minute, so I think your dataset is
    fairly small. Nevertheless, my data are growing, so soon it will be time
    for Pig.

    Could you provide some info on what worked well for you running your job on EC2?

    Thanks in advance,

    Tomas

  • Dexin Wang at Jun 15, 2011 at 6:15 pm
    Tomas,

    What worked well for me is still to be figured out. Right now it works, but
    it's too slow. I think one of the main problems is that my job has many
    JOIN/GROUP BY operations, so lots of intermediate steps end up writing to disk,
    which is slow.

    On that note, does anyone know how to tell whether LZO compression is turned on
    for intermediate jobs? See this

    http://pig.apache.org/docs/r0.8.0/cookbook.html#Compress+the+Results+of+Intermediate+Jobs

    and this

    http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

    I see I have this in my mapred-site.xml file:

    <property><name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value></property>

    Is that all I need to have map compression turned on? Thanks.
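
    (For what it's worth, the codec property alone usually isn't enough: Hadoop only
    compresses map output when mapred.compress.map.output is true, and Pig's own
    intermediate temp files are controlled by the separate pig.tmpfilecompression
    properties. A hedged sketch, assuming Pig 0.8+ where SET passes properties into
    the job configuration; the same keys can instead go in mapred-site.xml and
    pig.properties:)

    -- Compress Hadoop map output with LZO (values are passed through to the job conf).
    set mapred.compress.map.output 'true';
    set mapred.map.output.compression.codec 'com.hadoop.compression.lzo.LzoCodec';
    -- Compress Pig's intermediate (temp) files between MapReduce jobs.
    set pig.tmpfilecompression 'true';
    set pig.tmpfilecompression.codec 'lzo';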

    Dexin

  • Dmitriy Ryaboy at Jun 16, 2011 at 1:46 am
    you need to add this to your pig.properties:

    pig.tmpfilecompression=true
    pig.tmpfilecompression.codec=lzo

    Make sure that you are running Hadoop 0.20.2 or higher, Pig 0.8.1 or
    higher, and that all the LZO stuff is set up -- it's a bit involved.

    Use replicated joins where possible.
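
    (A minimal sketch of the syntax, with hypothetical aliases; the relation(s)
    listed after the first must be small enough to fit in memory:)

    big    = LOAD 'big_input'   USING PigStorage('\t') AS (user_id:chararray, metric:double);
    small  = LOAD 'small_input' USING PigStorage('\t') AS (user_id:chararray, segment:chararray);
    -- 'small' is loaded into memory and shipped to every map task, so the join
    -- runs map-side with no shuffle/reduce phase.
    joined = JOIN big BY user_id, small BY user_id USING 'replicated';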

    If you are doing a large number of small jobs, scheduling and
    provisioning are likely to dominate -- tune your job scheduler to
    schedule more tasks per heartbeat, and make sure your jar is as small
    as you can get it (there's a lot of unjarring going on in Hadoop).
    D

  • Dexin Wang at Jun 16, 2011 at 4:17 am
    Thanks a lot for the good advice.

    I'll see if I can get LZO set up. Currently I'm using EMR, which uses Pig 0.6.
    I'm looking into Whirr to start a Hadoop cluster on EC2.

    There is one place in my job where I can use a replicated join; I'm sure that
    will cut down some time.

    What I find interesting is that, without any optimization on the configuration
    or code side, I get a 2x to 4x speedup just by using the "Cluster Compute
    Quadruple Extra Large" instance (cc1.4xlarge) as opposed to the regular
    "Large" instance (m1.large), dollar for dollar. They do claim cc1.4xlarge's I/O is
    "very high". Since I suspect most of my job's time was spent
    reading/writing disk, this speedup makes sense.

Discussion Overview
group: user
categories: pig, hadoop
posted: Jun 13, '11 at 6:55p
active: Jun 16, '11 at 4:17a
posts: 10
users: 4
website: pig.apache.org
