On 02/28/2011 05:40 PM, Charles Gonçalves wrote:
Guys,
The amount of data in the source dir:
hdfs://hydra1:57810/user/cdh-hadoop/mscdata/201010_raw 22567369111 (~22.5 GB)
What I did was:
I ran with all 43,458 logs, and the counters were:
Counter              Map            Reduce          Total
FILE_BYTES_READ      253,905,706    372,708,857     626,614,563
HDFS_BYTES_READ      2,553,123,734  0               2,553,123,734
FILE_BYTES_WRITTEN   619,877,917    372,708,857     992,586,774
HDFS_BYTES_WRITTEN   0              535             535
Note that HDFS_BYTES_READ (~2.5 GB) is only a fraction of the ~22.5 GB in the source dir, which fits the symptom that most of the input is never read.
I did a manual join of the files and ran again on the 336 files (the merge of all those files). The job hasn't finished yet; the counters so far are:
Counter              Map             Reduce          Total
FILE_BYTES_READ      21,054,970,818  0               21,054,970,818
HDFS_BYTES_READ      16,772,063,486  0               16,772,063,486
FILE_BYTES_WRITTEN   39,797,038,008  10,404,287,551  50,201,325,559
I think the problem could be in the combination of the input files.
Is the combination class aware of compression?
Because *all my files are gzip-compressed*.
Maybe the class performs a plain byte concatenation, and we hit Hadoop's
limitation with concatenated gzip files, where the reader stops after
the first gzip member.
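One way to check that suspicion outside of Hadoop is to write two gzip members into a single file and see whether a plain reader returns both. This is only a rough sketch against java.util.zip (Hadoop's GzipCodec uses its own decompressor, so the result here is just indicative); the file name is arbitrary:

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Writes two gzip members into one file, then reads the file back.
// If the reader stops at the first member, only "part one" comes out,
// which is the same failure mode suspected for the combined splits.
public class GzipConcatCheck {
    public static void main(String[] args) throws Exception {
        String path = "concat-check.gz";

        // Two independently gzipped members appended into one file,
        // just like `cat a.gz b.gz > ab.gz`.
        try (FileOutputStream file = new FileOutputStream(path)) {
            for (String part : new String[] {"part one\n", "part two\n"}) {
                // Deliberately not closed: close() would also close the
                // underlying file stream. finish() completes this member.
                GZIPOutputStream gz = new GZIPOutputStream(file);
                gz.write(part.getBytes("UTF-8"));
                gz.finish();
            }
        }

        // Read everything back through a single GZIPInputStream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                 new GZIPInputStream(new FileInputStream(path))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }

        String decoded = out.toString("UTF-8");
        System.out.println("decoded: " + decoded);
        System.out.println(decoded.contains("part two")
                ? "reader handled both members"
                : "reader stopped at the first gzip member");
    }
}

If the second member never shows up, any code path that byte-concatenates gzip files and decodes them as one stream will silently drop everything after the first file, which would explain seeing only part of the input.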
On Mon, Feb 28, 2011 at 8:47 PM, Charles Gonçalves
wrote:
On Mon, Feb 28, 2011 at 7:39 PM, Thejas M Nair
wrote:
Hi Charles,
Which load function are you using ?
I'm using a custom (user-defined) load function.
Is it the default (PigStorage)?
Nope ...
In the Hadoop counters for the job in the JobTracker UI, do
you see the expected number of input records being read?
Is it possible to see the counters in the history interface on
the JobTracker?
I will run the jobs again to compare the counters, but my guess is:
probably not!
-Thejas
On 2/28/11 10:57 AM, "Charles Gonçalves" wrote:
I'm not using any filtering in the script.
I just want to see the total traffic per day across all logs.
If I combine 1000 log files into one and run the script
on that combined file, I get the correct answer for those logs.
But when I run with all *43458* log files, I get
incorrect output.
The correct result would be a histogram for each day of
2010-10, but the result contains only data from 2010-10-21.
And if I process all the logs with an awk script, I get the
correct answer.
On Mon, Feb 28, 2011 at 3:29 PM, Daniel Dai
wrote:
Not sure if I get your question. In 0.8, Pig combines
small files into one map, so it is possible you get fewer output files.
This is not the problem.
But thanks anyway!
If that is your concern, you can try to disable split
combination using "-Dpig.splitCombination=false".
Daniel
Charles Gonçalves wrote:
I tried to process a large number of small files with Pig
and ran into a strange problem.
2011-02-27 00:00:58,746 [Thread-15] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : *43458*
2011-02-27 00:00:58,755 [Thread-15] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : *43458*
2011-02-27 00:01:14,173 [Thread-15] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : *329*
When the script finishes, the result covers only a subset of
the input files.
These are logs from a whole month, but the results are
only from day 21.
Maybe I'm missing something.
Any ideas?
--
*Charles Ferreira Gonçalves*
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - DCC
Cel.: 55 31 87741485
Tel.: 55 31 34741485
Lab.: 55 31 34095840