FAQ
For a Zebra unsorted table, the number of mappers is limited by the
number of tfiles per column group, because the input splits are
generated per tfile, not within tfiles. An improvement that generates
block-based splits within a tfile is expected soon
(https://issues.apache.org/jira/browse/PIG-1077).
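In practical terms, the number of tfiles per column group is set by the parallelism of the job that writes the table, so writing with more reducers yields more potential mappers downstream. A rough Pig Latin sketch, assuming an unsorted table (the paths, field names, and relation names here are illustrative, not from the thread):

```
-- Any reduce-side operation with PARALLEL forces the writer count;
-- storing with 20 reducers should produce 20 tfiles per column group,
-- so jobs reading this unsorted table can get up to 20 splits.
raw = load '/user/dwh/input' using PigStorage('\t')
    as (id: long, code: chararray);
grouped = group raw by code PARALLEL 20;
flat = foreach grouped generate FLATTEN(raw);
store flat into '/user/dwh/table.zebra'
    using org.apache.hadoop.zebra.pig.TableStorer('');
```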

Regarding compression: yes, you are right that Zebra does not support
"none" as a compression method.

Yan

------ Forwarded Message
From: Bennie Schut <bschut@ebuddy.com>
Reply-To: "pig-user@hadoop.apache.org" <pig-user@hadoop.apache.org>
Date: Wed, 18 Nov 2009 14:32:33 +0100
To: <pig-user@hadoop.apache.org>
Subject: Re: zebra: unknown compressor none

Just for anyone else who might run into this problem in the future:
I found a bit of a workaround. When you generate the Zebra file/dir,
make sure you use something like "order x by y parallel 20;" before
the store. The Zebra structure will then contain 20 files, and any
jobs reading it can use at least 20 mappers.
It's not perfect, though, so if someone finds a better way please let
me know.
Using Zebra this way gives me a 27% speed improvement over plain
tab-delimited text files, so even with the hack I'm happy :)
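Put together with the store syntax from later in this thread, the workaround might look like this (a sketch only; the input relation "screennames" and the sort key are assumptions, not from the original script):

```
-- Force 20 output files via ORDER ... PARALLEL before the store,
-- so downstream jobs reading the table can use 20 mappers.
outfile = order screennames by screenname_id PARALLEL 20;
store outfile into '/user/dwh/screenname2.zebra'
    using org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');
```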

Bennie Schut wrote:
Hi all,

I still can't get Pig to use multiple mappers when using Zebra. I
tried lzo hoping it would help, but sadly no. The file is 14G as
tab-delimited plain text, 7G as Zebra with gz, and 10G as Zebra with
lzo. With the tab-delimited file I get 216 mappers, but with Zebra
just 2, one of which finishes almost instantly while the other runs
for hours. Any idea why it's not using more mappers?

As an example of what I'm trying to do:
dim1258375560540 = load '/user/dwh/screenname2.zebra'
    using org.apache.hadoop.zebra.pig.TableLoader('screenname_id, code');
fact1258375560540 = load '/user/bennies/newvalues//chatsessions_1238624404177_small.csv'
    using PigStorage('\t')
    as (session_hash: chararray, email: chararray, screenname: chararray);
tmp1258375560540 = cogroup fact1258375560540 by screenname inner,
    dim1258375560540 by code outer PARALLEL 4;
dump tmp1258375560540;


Thanks,
Bennie

Bennie Schut wrote:
Another zebra related question.

I couldn't find much documentation on Zebra, but I figured out you can
change the compression codec with syntax like this:

store outfile into '/user/dwh/screenname2.zebra'
    using org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');

And, in theory, disable compression like this:

store outfile into '/user/dwh/screenname3.zebra'
    using org.apache.hadoop.zebra.pig.TableStorer('compress by none');

But it doesn't seem to understand "none" as a compressor:
java.io.IOException: ColumnGroup.Writer constructor failed : Partition
constructor failed : Encountered " <IDENTIFIER> "none "" at line 1, column 13.
Was expecting:
    <COMPRESSOR> ...
    at org.apache.hadoop.zebra.io.BasicTable$Writer.<init>(BasicTable.java:1116)
    at org.apache.hadoop.zebra.pig.TableOutputFormat.checkOutputSpecs(TableStorer.java:154)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
    at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
    at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
    at java.lang.Thread.run(Thread.java:619)



I actually tried this because when I use the Zebra result in further
processing, it only uses 2 mappers instead of the 230 mappers on the
original file. I remembered that Hadoop cannot split gz files, so I
figured compression might be why it uses so few mappers. Does anyone
know a different approach to this?

Thanks,
Bennie.


------ End of Forwarded Message

Discussion Overview
group: user @ pig.apache.org
categories: pig, hadoop
posted: Nov 2, '09 at 5:58a
active: Nov 20, '09 at 10:44a
posts: 23
users: 7
website: pig.apache.org
