Zebra, a contrib project under Pig, is one such loader; it builds its own indexes.


-----Original Message-----
From: Renato Marroquín Mogrovejo
Sent: Tuesday, October 05, 2010 7:52 AM
To: pig-user@hadoop.apache.org
Subject: Re: Pig indexing

Hey Dmitriy!
I've been trying to get to this email for the last couple of weeks, and
finally I am here.
You were talking about Pig's merge join, but there is one thing I didn't quite
understand from the wiki (http://wiki.apache.org/pig/PigMergeJoin): does it
sample records to create its index because we know that the files are
sorted, which would result in "clustered indexes"? And in this merge
join operator, is the intermediate index simply discarded afterwards?
Do you know where I could find a similar loader that is aware of indexes?
Maybe there is some source code I could look into. But, I will definitely
look into the MergeJoinIndexer code to try to get a grasp on it (:
One last thing, what do you mean by "splits for blocks"?
Thanks in advance.

Renato M.

2010/9/22 Dmitriy Ryaboy <dvryaboy@gmail.com>

Using indexes is "just" a matter of writing a loader that is aware of said
indexes. Merge join already builds an index and uses it as part of its
With filters being offered to loaders that claim to implement filter
push-down, there is no reason not to have a loader that can look up block
locations in some index, and only create splits for blocks that contain
unfiltered values, for example. One thing to note is that currently there
is no automatic index creation (since you can load arbitrary data), so you
need to code up a way to look up which of the resources you are trying to
load have been indexed.
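
To make the "only create splits for blocks that contain unfiltered values"
idea concrete, here is a minimal sketch in plain Python. It is not Pig's
loader API and not how merge join is implemented; the names
build_sparse_index and blocks_for_range are hypothetical, and the "offset"
is a record position rather than a real HDFS block location. It assumes the
data is sorted by key, samples the first key of each block into a sparse
index, and then keeps only the blocks that can overlap a pushed-down range
filter.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

# Hypothetical sparse index over data sorted by key: one entry per block,
# holding the first key in the block and the block's starting record offset.
BlockIndex = List[Tuple[int, int]]  # (first_key, offset)


def build_sparse_index(sorted_keys: List[int], block_size: int) -> BlockIndex:
    """Sample the first key of every block of `block_size` records."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), block_size)]


def blocks_for_range(index: BlockIndex, lo: int, hi: int) -> List[int]:
    """Return offsets of the blocks that may hold keys in [lo, hi]."""
    if not index or lo > hi:
        return []
    first_keys = [key for key, _ in index]
    # Start at the last block whose first key is strictly below lo: it may
    # still end with keys equal to lo. Clamp to the first block.
    start = max(bisect_left(first_keys, lo) - 1, 0)
    # Blocks whose first key is above hi cannot contain a match.
    end = bisect_right(first_keys, hi)
    return [offset for _, offset in index[start:end]]


index = build_sparse_index(list(range(100)), block_size=10)
wanted = blocks_for_range(index, 25, 42)  # only 3 of the 10 blocks survive
```

The pruning is deliberately conservative: the index stores only first keys,
so a block is kept whenever it could contain a match. A real loader would
turn the surviving offsets into input splits and would still need some way
to find out which inputs have an index at all.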


On Tue, Sep 21, 2010 at 6:32 PM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:
Hi everyone!

After reading Ed's email, I got really intrigued about Pig using
indexes; I thought those were just plans lol
But as commented here (https://issues.apache.org/jira/browse/PIG-209), we
could use indexing through Zebra, right? But that means that we would have
to preload our data into Zebra, "sort it" in a similar way to the sorted
table union example on the wiki, and then if we do a join using them, the
join is performed in a similar way to the work of Hung-chih Yang et al.?
Are there any published papers or a technical overview on Pig/Zebra or
Thanks in advance.

Renato M.
