I tried to process a large number of small files with Pig and ran into a
strange problem.

2011-02-27 00:00:58,746 [Thread-15] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
to process : *43458*
2011-02-27 00:00:58,755 [Thread-15] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
paths to process : *43458*
2011-02-27 00:01:14,173 [Thread-15] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
paths (combined) to process : *329*

When the script finishes, the result covers only a subset of the input
files.
These are logs from a whole month, but the results are only from day 21.


Maybe I'm missing something.
Any ideas?

--
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.: 55 31 34741485
Lab.: 55 31 34095840


  • Romain Rigaux at Feb 28, 2011 at 6:12 pm
    Normally Pig 0.8 is just combining the small files into bigger ones
    (http://pig.apache.org/docs/r0.8.0/cookbook.html#Combine+Small+Input+Files);
    you should not lose any records.

    You might be filtering out or limiting some records in your script. You
    can try just a LOAD and STORE and check that the output is the same as
    the input data.

    Romain
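
    A minimal sketch of that sanity check (the loader name MyLogLoader and
    the output path are placeholders; the thread's actual script and UDF
    were not posted):

        -- Identity job: LOAD then STORE with no transformations. The
        -- output should contain exactly the same records as the input.
        raw = LOAD '/user/cdh-hadoop/mscdata/201010_raw' USING MyLogLoader();
        STORE raw INTO '/tmp/loadstore_check';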
  • Daniel Dai at Feb 28, 2011 at 6:30 pm
    Not sure if I get your question. In 0.8, Pig combines small files into
    one map, so it is possible you get fewer output files. If that is your
    concern, you can try disabling split combination using
    "-Dpig.splitCombination=false".

    Daniel
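
    For reference, that flag goes on the pig command line; a sketch, with a
    placeholder script name:

        pig -Dpig.splitCombination=false myscript.pig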

  • Charles Gonçalves at Feb 28, 2011 at 6:58 pm
    I'm not using any filtering in the script.
    I just want to see the total traffic per day across all the logs.

    If I combine 1000 log files into one and run the script on that file, I
    get the correct answer for those logs.
    But when I run with all *43458* log files, I get incorrect output.
    The correct result would be a histogram for each day of 2010-10, but
    the result contains only data from 2010-10-21.
    And if I process all the logs with an awk script, I get the correct
    answer.
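
    A sketch of the shape of such a script (loader, schema, and output path
    are assumptions; the real script was not posted):

        -- Hypothetical per-day traffic aggregation.
        logs    = LOAD '/user/cdh-hadoop/mscdata/201010_raw'
                  USING MyLogLoader() AS (day:chararray, bytes:long);
        by_day  = GROUP logs BY day;
        traffic = FOREACH by_day GENERATE group AS day,
                  SUM(logs.bytes) AS total_bytes;
        STORE traffic INTO '/tmp/traffic_per_day';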

    On Mon, Feb 28, 2011 at 3:29 PM, Daniel Dai wrote:

    > Not sure if I get your question. In 0.8, Pig combines small files
    > into one map, so it is possible you get fewer output files.

    This is not the problem.
    But thanks anyway!


  • Thejas M Nair at Feb 28, 2011 at 10:40 pm
    Hi Charles,
    Which load function are you using? Is it the default (PigStorage)?
    In the Hadoop counters for the job in the JobTracker UI, do you see the
    expected number of input records being read?
    -Thejas



  • Charles Gonçalves at Feb 28, 2011 at 11:47 pm

    On Mon, Feb 28, 2011 at 7:39 PM, Thejas M Nair wrote:

    > Which load function are you using?

    I'm using a user-defined load function.

    > Is it the default (PigStorage)?

    Nope ...

    > In the Hadoop counters for the job in the JobTracker UI, do you see
    > the expected number of input records being read?

    Is it possible to see the counters in the history interface on the
    JobTracker?
    I will run the jobs again to compare the counters, but my guess is
    probably not!



  • Charles Gonçalves at Mar 1, 2011 at 1:41 am
    Guys,

    The amount of data in the source dir:
    hdfs://hydra1:57810/user/cdh-hadoop/mscdata/201010_raw 22567369111

    What I did:
    I ran with all 43458 logs, and the counters were (map / reduce / total):

    FILE_BYTES_READ      253,905,706     372,708,857     626,614,563
    HDFS_BYTES_READ      2,553,123,734   0               2,553,123,734
    FILE_BYTES_WRITTEN   619,877,917     372,708,857     992,586,774
    HDFS_BYTES_WRITTEN   0               535             535


    I did a manual merge of the files and ran again on the resulting 336
    files (the merge of all those files).
    The job hasn't finished yet, and the counters so far are:

    FILE_BYTES_READ      21,054,970,818  0               21,054,970,818
    HDFS_BYTES_READ      16,772,063,486  0               16,772,063,486
    FILE_BYTES_WRITTEN   39,797,038,008  10,404,287,551  50,201,325,55


    I think the problem could be in the combination of the input files.
    Is the combination class aware of compression? Because *all my files
    are compressed*.
    Maybe the class performs a concatenation and we hit the HDFS limitation
    on concatenated gzip files.
  • Daniel Dai at Mar 1, 2011 at 9:45 pm
    Combined input splits should be able to handle compressed files. Pig
    creates a separate RecordReader for each file within one input split,
    so gzip concatenation should not be the issue. I am not sure what is
    happening in your script. If possible, give us more information
    (script, UDF, data, version).

    Daniel

  • Charles Gonçalves at Mar 1, 2011 at 10:03 pm
    Ok ...

    I'm sending both.
    Versions:

    Apache Pig version 0.8.0 (r1043805)
    compiled Dec 08 2010, 17:26:09

    Hadoop 0.20.2


  • Dmitriy Ryaboy at Mar 1, 2011 at 10:07 pm
    FWIW, something similar happened with the HBase loader in 0.8 -- only
    the first of the combined splits was read (I worked around this by
    turning off split combination in the loader's setLocation; see
    PIG-1680).

    D
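
    One way to test whether the same combined-split bug is at work here:
    run a plain record count once normally and once with Daniel's flag, and
    compare. A sketch, with the placeholder loader and paths as above:

        -- Run as:   pig count_check.pig
        -- then as:  pig -Dpig.splitCombination=false count_check.pig
        -- If the second run counts many more records, combined splits are
        -- dropping input (as in PIG-1680).
        raw = LOAD '/user/cdh-hadoop/mscdata/201010_raw' USING MyLogLoader();
        grp = GROUP raw ALL;
        cnt = FOREACH grp GENERATE COUNT(raw);
        STORE cnt INTO '/tmp/record_count_check';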
