I don't see why not.
If you have (start, end) pairs: (a, b), (b, c), (c, a)
And let's say there's something non-triangular also: (c, d), (d, e), (e, f)
You want to self join on rel.start == rel.end, which will produce
a, (a, b), (c, a)
b, (b, c), (a, b)
d, (d, e), (c, d)
You will now want to pull out start, middle, and end. That's easy -- it's
just ($2.$0, $0, $1.$1)
So now, after a simple join and a foreach-generate, we have
(c, a, b)
(a, b, c)
(c, d, e)
Now we want to match, all segments where the start is equal to the end of
another segment, and the middle is the beginning of that segment
(c, a), (c, a, b), (a, b, c)
Your triple is either $1 or $2 -- they are essentially the same.
You might be able to do a group instead of a join for the last step, if
certain conditions hold (for example, you must be guaranteed that this does
not exist: a->b->c->b->a)
In that case, you could sort the contents of the triples and group on the
result, saving only those results that have > 1 entry in the group. This
would be faster as you would need to shuffle only a single copy of the data.
On Sat, Jun 12, 2010 at 10:39 PM, Renato Marroquín Mogrovejo wrote:
Hi everybody, thanks again for the responses. I am really obtaining many
About the depth of my data, yeah I am only looking for connected components
of depth three which I think will be a great help because my BFS would be
limited in the number of iterations.
And about the example I must have these records x -> y, y -> z and z -> x
order to output the x, y, z result record. But there can probably be
elements who do not close the graph. For example, I might have a record
x -> b, and b doesn't references nobody, so I would have to ignore it. Is
that what you meant Tanton?
What about a self-join three times?Would that be a bad idea?
2010/6/11 Dmitriy Ryaboy <firstname.lastname@example.org>
hc, self-join should just work, but if it doesn't:
split table into table_2 if 1==1, table3 if 1==1;
table_2 = foreach table generate *;
table3 = foreach table generate *;
T = join table by id1, table_2 by id2, table_3 by id3
On Fri, Jun 11, 2010 at 10:59 AM, hc busy wrote:
Yeah, that IS hard in pig. I'm not even sure how to do a self-join in Pig.
Like you can't really say
T = join Table by id1, Table by id2, Table by id3;
I think PigLatin will complain that it's confused which Table is and which
id1 goes with which table.
I had been proposing that we allow PigLatin to allow recursive
that way we can do loops. But recursive functions doesn't fit in the
data-flow language paradigm.
But I think people have offered many alternatives in terms of scripting
language that can wrap PigLatin that has loop and flow control
On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs <email@example.com
It really depends on the depth of your graph. Are you only dealing
with connected components of depth 3 or could they be deeper.
For instance, can you have x -> y, y -> z, and z -> a? If so, do you
need your record to be x, y, z, a? Or, are you guaranteed that it
will always be x -> z and y -> z?
I really need more information about your data before I can offer too
On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
Hi everybody, thanks a lot for your responses.
I am actually not looking for a transitive closure, I am not trying
"infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy:
in short terms a transitive closure ^^) because I have the data
it. And yeah I am aware of first logic expressive power, but maybe
think about giving an inference engine a try some time in the
I am actually looking at a connected graph-like problem. The sample records
resemble a bidirectional triangle.
770011 ------- 770083
I tried using a smaller version of the problem by using SQL and got
horribly huge query which is a non-scalable possibility for me. I
500Gbs in structured files. I used a triple join but I had to
So that is why I thought on using something like Hc.busy mentioned.
Queries: idx1, idx2, sc, va, p1, p2
possibilities: possibs = foreach queries generate idx1, flatten(a
self join of the data)
and by using the flattening command, get all possible combinations,
not sure that would be the correct approach. Anyhow I thought maybe
performing some breadth first search but that would give me an
O(n2) so )=
Hey Tanton how would you implement and use an union-find structure?
think it is possible with PIG?
2010/6/10 Tanton Gibbs <firstname.lastname@example.org>
To be specific, he's looking for connected components in the
It's not overly easy to do in an ETL tool (or in pig), but if you
wrap the script in a loop it is possible.
There are actually some really interesting parallel algorithms for
finding connected components. If you know you are only going to
two keys, it is a bit simpler to code (but not any more
Essentially, you can implement your union-find algorithm as a
of sorts and merges, which on a large parallel system is actually
On Thu, Jun 10, 2010 at 3:33 PM, hc busy wrote:
What's a transitive closure?
On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <
I think what he wants is a transitive closure of the relation,
is not achievable in SQL-like languages alone (first order
I suppose Pig Latin falls in this category.
On Thu, Jun 10, 2010 at 19:54, hc busy <email@example.com>
Is this like a tricky interview question? I don't see the
those three numbers you listed and the sample of the table.
770011 770083 524 1e-120 89 12
770083 770011 494 1e-120 39 100
ahh, I guess these are related because id1=id2 an id2=id1...
pass at the problem. Project:
P1 = foreach table generate id1 as id1, id2 as id2, *;
P2 = foreach table generate id2 as id1, id1 as id2, *;
J = join P1 by (id1, id2), P2 by (id1,id2);
and now J contains pairs of rows from original table where
is this what you want?
On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
Hi everyone, today I came across with a particular query
how to model in PIG. Part of my data looks like this:
Id1 Id2 Sc Va P1 P2
--------- --------- ----- --------- ----- ----
770011 990201 401 1e-125 100 65
990201 770011 440 1e-125 100 42
770011 770083 524 1e-120 89 12
770083 770011 494 1e-120 39 100
990201 770083 341 1e-125 73 41
770083 990201 421 1e-125 90 85
what I would like to retrieve is something like
because they are records actually related.
Any kind of ideas are highly appreciated. Thanks in