I don't see why not.
If you have (start, end) pairs: (a, b), (b, c), (c, a)
And let's say there's something non-triangular also: (c, d), (d, e), (e, f)
You want to self join on rel.start == rel.end, which will produce
a, (a, b), (c, a)
b, (b, c), (a, b)
d, (d, e), (c, d)
You will now want to pull out start, middle, and end. That's easy -- it's
just ($2.$0, $0, $1.$1)
So now, after a simple join and a foreach-generate, we have
(c, a, b)
(a, b, c)
(c, d, e)
Now we want to match, all segments where the start is equal to the end of
another segment, and the middle is the beginning of that segment
(c, a), (c, a, b), (a, b, c)
Your triple is either $1 or $2 -- they are essentially the same.
You might be able to do a group instead of a join for the last step, if
certain conditions hold (for example, you must be guaranteed that this does
not exist: a->b->c->b->a)
In that case, you could sort the contents of the triples and group on the
result, saving only those results that have > 1 entry in the group. This
would be faster as you would need to shuffle only a single copy of the data.
-D
On Sat, Jun 12, 2010 at 10:39 PM, Renato Marroquín Mogrovejo wrote:Hi everybody, thanks again for the responses. I am really obtaining many
ideas (:
About the depth of my data, yeah I am only looking for connected components
of depth three which I think will be a great help because my BFS would be
limited in the number of iterations.
And about the example I must have these records x -> y, y -> z and z -> x
in
order to output the x, y, z result record. But there can probably be
elements who do not close the graph. For example, I might have a record
like
x -> b, and b doesn't references nobody, so I would have to ignore it. Is
that what you meant Tanton?
What about a self-join three times?Would that be a bad idea?
Thanks again.
Renato M.
2010/6/11 Dmitriy Ryaboy <dvryaboy@gmail.com>
hc, self-join should just work, but if it doesn't:
split table into table_2 if 1==1, table3 if 1==1;
OR
table_2 = foreach table generate *;
table3 = foreach table generate *;
AND THEN
T = join table by id1, table_2 by id2, table_3 by id3
-D
On Fri, Jun 11, 2010 at 10:59 AM, hc busy wrote:
Yeah, that IS hard in pig. I'm not even sure how to do a self-join in Pig.
Like you can't really say
T = join Table by id1, Table by id2, Table by id3;
I think PigLatin will complain that it's confused which Table is and which
id1 goes with which table.
I had been proposing that we allow PigLatin to allow recursive
functions,
that way we can do loops. But recursive functions doesn't fit in the
data-flow language paradigm.
But I think people have offered many alternatives in terms of scripting
language that can wrap PigLatin that has loop and flow control
constructs.
On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs <tanton.gibbs@gmail.com
wrote:
It really depends on the depth of your graph. Are you only dealing
with connected components of depth 3 or could they be deeper.
For instance, can you have x -> y, y -> z, and z -> a? If so, do you
need your record to be x, y, z, a? Or, are you guaranteed that it
will always be x -> z and y -> z?
I really need more information about your data before I can offer too
much advice.
On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
wrote:
Hi everybody, thanks a lot for your responses.
I am actually not looking for a transitive closure, I am not trying
to
"infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy:
that
is
in short terms a transitive closure ^^) because I have the data
that
proves
it. And yeah I am aware of first logic expressive power, but maybe
I
will
think about giving an inference engine a try some time in the
future.
I am actually looking at a connected graph-like problem. The sample records
resemble a bidirectional triangle.
990201
/ \
/ \
/ \
/ \
770011 ------- 770083
I tried using a smaller version of the problem by using SQL and got
a
horribly huge query which is a non-scalable possibility for me. I
have
over
500Gbs in structured files. I used a triple join but I had to
retrieve
all
existing possibilities.
So that is why I thought on using something like Hc.busy mentioned.
Queries: idx1, idx2, sc, va, p1, p2
possibilities: possibs = foreach queries generate idx1, flatten(a
triple
self join of the data)
and by using the flattening command, get all possible combinations,
but
I
am
not sure that would be the correct approach. Anyhow I thought maybe
performing some breadth first search but that would give me an
algorithm
O(n2) so )=
Hey Tanton how would you implement and use an union-find structure?
Do
you
think it is possible with PIG?
Thanks again.
2010/6/10 Tanton Gibbs <tanton.gibbs@gmail.com>
To be specific, he's looking for connected components in the
graph.
It's not overly easy to do in an ETL tool (or in pig), but if you
can
wrap the script in a loop it is possible.
There are actually some really interesting parallel algorithms for
finding connected components. If you know you are only going to
have
two keys, it is a bit simpler to code (but not any more
efficient).
Essentially, you can implement your union-find algorithm as a
series
of sorts and merges, which on a large parallel system is actually
quite fast.
Tanton
On Thu, Jun 10, 2010 at 3:33 PM, hc busy wrote:
What's a transitive closure?
On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <
gianmarco.dfm@gmail.com
wrote:
I think what he wants is a transitive closure of the relation,
which
is not achievable in SQL-like languages alone (first order
logic
expressive power).
I suppose Pig Latin falls in this category.
-- Gianmarco
On Thu, Jun 10, 2010 at 19:54, hc busy <hc.busy@gmail.com>
wrote:
Is this like a tricky interview question? I don't see the
pattern
between
those three numbers you listed and the sample of the table.
770011 770083 524 1e-120 89 12
770083 770011 494 1e-120 39 100
ahh, I guess these are related because id1=id2 an id2=id1...
Here's
a
first
pass at the problem. Project:
P1 = foreach table generate id1 as id1, id2 as id2, *;
P2 = foreach table generate id2 as id1, id1 as id2, *;
J = join P1 by (id1, id2), P2 by (id1,id2);
and now J contains pairs of rows from original table where
id1
and
id2
are
reversed.
is this what you want?
On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:
Hi everyone, today I came across with a particular query
that
I
don't
know
how to model in PIG. Part of my data looks like this:
Id1 Id2 Sc Va P1 P2
--------- --------- ----- --------- ----- ----
770011 990201 401 1e-125 100 65
990201 770011 440 1e-125 100 42
770011 770083 524 1e-120 89 12
770083 770011 494 1e-120 39 100
990201 770083 341 1e-125 73 41
770083 990201 421 1e-125 90 85
.
.
.
what I would like to retrieve is something like
this: 770011
990201
770083
because they are records actually related.
Any kind of ideas are highly appreciated. Thanks in
advanced.