Grokbase Groups Pig user July 2011
I have been trying to store data in HBase using the HBaseStorage class. While I
can store the originally read data, it fails when I try to store the processed
data, which means I might be messing up the datatypes somewhere.

My script is below:

REGISTER myudfs.jar;
A = load 'hbase://transaction' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('log:ref2', '-loadKey') AS (row:chararray, code:chararray);
grp = group A by myudfs.Parser(code);
ct = foreach grp generate group, COUNT(A.code) as count;

sorted = order ct by count desc;
result = foreach sorted generate $0 as row, (chararray)$1;
store result into 'pig_test' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('log:count');

The dump of "result" works, but the store to HBase fails.
When I try to store A, it works fine.

The datatypes of A and result are:
A: {row: chararray, code: chararray}
result: {row: chararray, count: chararray}


  • Bill Graham at Jul 15, 2011 at 8:16 pm
    What version of Pig are you using and what errors are you seeing?

    There was PIG-1870, related to projections, that might apply, but I can't say
    so for sure. If that's the case, it should work if you disable the new
    logical plan with -Dpig.usenewlogicalplan=false.

    Also, you might try specifying pig_test as 'hbase://pig_test'. I recall
    another JIRA about that as well.
  • Sulabh choudhury at Jul 15, 2011 at 9:31 pm
    Bill,

    There is no useful message in the logs (pasted below).
    I tried SET pig.usenewlogicalplan 'false', which did not help.
    I am using pig-0.8.0-cdh3u0. I have tried both with and without the
    'hbase://' prefix.

    2011-07-15 14:19:58,700 [main] INFO
    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
    - 100% complete
    2011-07-15 14:19:58,702 [main] ERROR
    org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
    2011-07-15 14:19:58,703 [main] INFO org.apache.pig.tools.pigstats.PigStats
    - Script Statistics:

    HadoopVersion PigVersion UserId StartedAt FinishedAt Features
    0.20.2-cdh3u0 0.8.0-cdh3u0 cxt 2011-07-15 14:18:11 2011-07-15 14:19:58
    GROUP_BY,ORDER_BY

    Some jobs have failed! Stop running all dependent jobs

    Job Stats (time in seconds):
    JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime
    MinReduceTime AvgReduceTime Alias Feature Outputs
    job_201106212025_0139 1 1 8 8 8 12 12 12 A,ct,grp GROUP_BY,COMBINER
    job_201106212025_0140 1 1 3 3 3 12 12 12 sorted SAMPLER

    Failed Jobs:
    JobId Alias Feature Message Outputs
    job_201106212025_0141 result,sorted ORDER_BY Message: Job failed! Error - NA
    pig_test,

    Input(s):
    Successfully read 2583 records (330 bytes) from: "hbase://transaction"

    Output(s):
    Failed to produce result in "pig_test"

  • Bill Graham at Jul 15, 2011 at 10:28 pm
    What do you see in the map and reduce task logs on the JT UI for that job?

    This job is failing for some reason, so there should be some hint in the
    task logs.
  • Sulabh choudhury at Jul 15, 2011 at 10:53 pm
    Yes, I see a few errors in the JT logs:
    java.lang.NoClassDefFoundError: com/google/common/collect/Lists
    ClassNotFoundException:
    org.apache.hadoop.hbase.filter.WritableByteArrayComparable

    I think it cannot find some dependent jars. How or where do I add these jars
    so that Pig can see them?

  • Bill Graham at Jul 15, 2011 at 10:58 pm
    That's because your Pig script probably doesn't register the Guava jar. Be
    sure to register the Guava, HBase, and ZooKeeper jars in your script.
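
    For example, adding something like this at the top of the script should
    resolve those NoClassDefFound/ClassNotFound errors (the jar names and paths
    below are only illustrative; use whatever your CDH3 installation actually
    ships):

    REGISTER /usr/lib/hbase/hbase-0.90.1-cdh3u0.jar;
    REGISTER /usr/lib/hbase/lib/guava-r06.jar;
    REGISTER /usr/lib/zookeeper/zookeeper-3.3.3-cdh3u0.jar;
    REGISTER myudfs.jar;

    -- the first field of each tuple becomes the HBase row key; the remaining
    -- fields map, in order, to the columns named in the HBaseStorage constructor
    store result into 'hbase://pig_test' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('log:count');

    If your Pig build supports the pig.additional.jars property, the same jars
    can also be passed on the command line instead of editing the script.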
  • Sulabh choudhury at Jul 15, 2011 at 11:14 pm
    Wow, adding the jars did the trick. Thank you very much.
  • Jagaran das at Jul 16, 2011 at 6:18 am
    Hi,

    Due to requirements in our current production CDH3 cluster, we need to copy
    around 11,520 small files (total size 12 GB) to the cluster for one
    application. We have 20 such applications that would run in parallel.

    So one set would have 11,520 files with a total size of 12 GB.
    Like this, we would have 15 sets in parallel.

    Our total SLA for the pipeline, from copy to Pig aggregation to
    copy-to-local and SQL load, is 15 minutes.

    What we do:

    1. Merge files, so that we get rid of small files. (This is a huge time hit; do we have any other option?)
    2. Copy to the cluster
    3. Execute the Pig job
    4. Copy to local
    5. SQL loader

    Can we perform the merge and the copy to the cluster from a host other than
    the NameNode? We want an out-of-cluster machine running a Java process that
    would:
    1. Run periodically
    2. Merge files
    3. Copy to the cluster

    Secondly, can we append to an existing file in the cluster?

    Please provide your thoughts, as maintaining the SLA is becoming tough.

    Regards,
    Jagaran
  • Jeremy Hanna at Jul 16, 2011 at 12:49 pm
    One thing that we use is filecrush to merge small files below a threshold. It works pretty well.
    http://www.jointhegrid.com/hadoop_filecrush/index.jsp
  • Dmitriy Ryaboy at Jul 16, 2011 at 2:58 pm
    Merging doesn't actually speed things up all that much; it reduces load
    on the NameNode and speeds up job initialization somewhat. You don't have
    to do it on the NameNode itself, nor do you have to do the copying on
    the NN. In fact, don't run anything but the NameNode process on the
    namenode machine.

    Pig jobs can transparently combine small input files into larger splits, so
    you won't be stuck with 11K mappers.
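
    A rough sketch of what that can look like in practice (property names as
    documented for Pig 0.8; the paths and the 128 MB figure are just
    placeholders):

    SET pig.splitCombination 'true';          -- usually on by default
    SET pig.maxCombinedSplitSize 134217728;   -- pack small inputs into ~128 MB splits
    raw = LOAD '/incoming/app1/*' USING PigStorage();
    STORE raw INTO '/merged/app1' USING PigStorage();

    A pass like this can also double as the merge step, since the combined
    splits come back out as a much smaller number of part files.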

    Don't copy to local and then run SQL loader. Use Sqoop export, and load
    directly from Hadoop.

    You cannot append to a file that already exists in the cluster. This
    will be available in one of the coming Hadoop releases. You can
    certainly create a new file in a directory, and load whole
    directories.

    -D
  • Jagaran das at Jul 16, 2011 at 6:01 pm
    OK then,

    1. Do we have to write a Pig job for the merging, or does Pig itself merge,
    so that fewer mappers are invoked?

    2. Can we copy to the cluster from a non-cluster machine, using the
    namespace URI of the NN? We could dedicate some well-configured boxes to do
    the merging and copying, and then copy the data to the cluster over the
    network.

    3. How is the performance of the filecrush tool?

    We found that copying the 12 GB of data for all 15 apps in parallel took 35
    minutes.

    We ran 15 copyFromLocal operations, each with 12 GB of data.

    Thanks
    JD


  • Jagaran das at Jul 16, 2011 at 6:07 pm
    Our Config:

    72 GB RAM, 4 quad-core processors, 1.8 TB local storage

    10-node CDH3 cluster


  • Dmitriy Ryaboy at Jul 17, 2011 at 3:38 am
    1) Correct.

    2) You can copy to the cluster from any machine; just have the config on
    the classpath, or specify the full path in your copy command
    (hdfs://my-nn/path/to/destination).
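
    For example, from an edge node that just has the Hadoop/Pig client and the
    cluster config installed, something like this should work (the grunt fs
    pass-through is assumed here; the host and paths are placeholders):

    grunt> fs -mkdir hdfs://my-nn/incoming/app1
    grunt> fs -copyFromLocal /local/staging/app1/* hdfs://my-nn/incoming/app1/

    The same fully qualified hdfs:// URIs work with plain hadoop fs as well;
    nothing about the merge or the copy needs to run on the NameNode itself.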


  • Jagaran das at Jul 17, 2011 at 4:25 am
    Thanks, Dmitriy.

    1. So I can write a Pig job that would merge the files.
    2. But again, that Pig job would itself be working over many small files;
    would that not affect performance?

    3. For the copy again: if we want to run the copy command, do we need Hadoop
    installed on that machine, or are you suggesting we use the Java API to
    invoke the copy?

    Thanks a lot

    Regards,
    JD

