Grokbase Groups Pig user June 2010
FAQ
Hi everyone, today I came across with a particular query that I don't know
how to model in PIG. Part of my data looks like this:

Id1 Id2 Sc Va P1 P2
--------- --------- ----- --------- ----- ----
770011 990201 401 1e-125 100 65
990201 770011 440 1e-125 100 42
770011 770083 524 1e-120 89 12
770083 770011 494 1e-120 39 100
990201 770083 341 1e-125 73 41
770083 990201 421 1e-125 90 85
.
.
.

what I would like to retrieve is something like
this: 770011 990201 770083
because they are records actually related.
Any kind of ideas are highly appreciated. Thanks in advanced.

Renato M.

Search Discussions

  • Hc busy at Jun 10, 2010 at 5:55 pm
    Is this like a tricky interview question? I don't see the pattern between
    those three numbers you listed and the sample of the table.

    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100

    ahh, I guess these are related because id1=id2 an id2=id1... Here's a first
    pass at the problem. Project:

    P1 = foreach table generate id1 as id1, id2 as id2, *;
    P2 = foreach table generate id2 as id1, id1 as id2, *;
    J = join P1 by (id1, id2), P2 by (id1,id2);

    and now J contains pairs of rows from original table where id1 and id2 are
    reversed.

    is this what you want?
    On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo wrote:

    Hi everyone, today I came across with a particular query that I don't know
    how to model in PIG. Part of my data looks like this:

    Id1 Id2 Sc Va P1 P2
    --------- --------- ----- --------- ----- ----
    770011 990201 401 1e-125 100 65
    990201 770011 440 1e-125 100 42
    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100
    990201 770083 341 1e-125 73 41
    770083 990201 421 1e-125 90 85
    .
    .
    .

    what I would like to retrieve is something like
    this: 770011 990201 770083
    because they are records actually related.
    Any kind of ideas are highly appreciated. Thanks in advanced.

    Renato M.
  • Gianmarco at Jun 10, 2010 at 10:06 pm
    I think what he wants is a transitive closure of the relation, which
    is not achievable in SQL-like languages alone (first order logic
    expressive power).
    I suppose Pig Latin falls in this category.


    -- Gianmarco
    On Thu, Jun 10, 2010 at 19:54, hc busy wrote:
    Is this like a tricky interview question? I don't see the pattern between
    those three numbers you listed and the sample of the table.

    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100

    ahh, I guess these are related because id1=id2 an id2=id1... Here's a first
    pass at the problem. Project:

    P1 = foreach table generate id1 as id1, id2 as id2, *;
    P2 = foreach table generate id2 as id1, id1 as id2, *;
    J = join P1 by (id1, id2), P2 by (id1,id2);

    and now J contains pairs of rows from original table where id1 and id2 are
    reversed.

    is this what you want?

    On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
    renatoj.marroquin@gmail.com> wrote:
    Hi everyone, today I came across with a particular query that I don't know
    how to model in PIG. Part of my data looks like this:

    Id1 Id2 Sc Va P1 P2
    --------- --------- ----- --------- ----- ----
    770011 990201 401 1e-125 100 65
    990201 770011 440 1e-125 100 42
    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100
    990201 770083 341 1e-125 73 41
    770083 990201 421 1e-125 90 85
    .
    .
    .

    what I would like to retrieve is something like
    this:                                             770011 990201 770083
    because they are records actually related.
    Any kind of ideas are highly appreciated. Thanks in advanced.

    Renato M.
  • Hc busy at Jun 10, 2010 at 10:33 pm
    What's a transitive closure?
    On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco wrote:

    I think what he wants is a transitive closure of the relation, which
    is not achievable in SQL-like languages alone (first order logic
    expressive power).
    I suppose Pig Latin falls in this category.


    -- Gianmarco
    On Thu, Jun 10, 2010 at 19:54, hc busy wrote:
    Is this like a tricky interview question? I don't see the pattern between
    those three numbers you listed and the sample of the table.

    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100

    ahh, I guess these are related because id1=id2 an id2=id1... Here's a first
    pass at the problem. Project:

    P1 = foreach table generate id1 as id1, id2 as id2, *;
    P2 = foreach table generate id2 as id1, id1 as id2, *;
    J = join P1 by (id1, id2), P2 by (id1,id2);

    and now J contains pairs of rows from original table where id1 and id2 are
    reversed.

    is this what you want?

    On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
    renatoj.marroquin@gmail.com> wrote:
    Hi everyone, today I came across with a particular query that I don't
    know
    how to model in PIG. Part of my data looks like this:

    Id1 Id2 Sc Va P1 P2
    --------- --------- ----- --------- ----- ----
    770011 990201 401 1e-125 100 65
    990201 770011 440 1e-125 100 42
    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100
    990201 770083 341 1e-125 73 41
    770083 990201 421 1e-125 90 85
    .
    .
    .

    what I would like to retrieve is something like
    this: 770011 990201 770083
    because they are records actually related.
    Any kind of ideas are highly appreciated. Thanks in advanced.

    Renato M.
  • Tanton Gibbs at Jun 10, 2010 at 10:37 pm
    To be specific, he's looking for connected components in the graph.

    It's not overly easy to do in an ETL tool (or in pig), but if you can
    wrap the script in a loop it is possible.

    There are actually some really interesting parallel algorithms for
    finding connected components. If you know you are only going to have
    two keys, it is a bit simpler to code (but not any more efficient).
    Essentially, you can implement your union-find algorithm as a series
    of sorts and merges, which on a large parallel system is actually
    quite fast.

    Tanton
    On Thu, Jun 10, 2010 at 3:33 PM, hc busy wrote:
    What's a transitive closure?
    On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco wrote:

    I think what he wants is a transitive closure of the relation, which
    is not achievable in SQL-like languages alone (first order logic
    expressive power).
    I suppose Pig Latin falls in this category.


    -- Gianmarco
    On Thu, Jun 10, 2010 at 19:54, hc busy wrote:
    Is this like a tricky interview question? I don't see the pattern between
    those three numbers you listed and the sample of the table.

    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100

    ahh, I guess these are related because id1=id2 an id2=id1... Here's a first
    pass at the problem. Project:

    P1 = foreach table generate id1 as id1, id2 as id2, *;
    P2 = foreach table generate id2 as id1, id1 as id2, *;
    J = join P1 by (id1, id2), P2 by (id1,id2);

    and now J contains pairs of rows from original table where id1 and id2 are
    reversed.

    is this what you want?

    On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
    renatoj.marroquin@gmail.com> wrote:
    Hi everyone, today I came across with a particular query that I don't
    know
    how to model in PIG. Part of my data looks like this:

    Id1 Id2 Sc Va P1 P2
    --------- --------- ----- --------- ----- ----
    770011 990201 401 1e-125 100 65
    990201 770011 440 1e-125 100 42
    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100
    990201 770083 341 1e-125 73 41
    770083 990201 421 1e-125 90 85
    .
    .
    .

    what I would like to retrieve is something like
    this:                                             770011 990201 770083
    because they are records actually related.
    Any kind of ideas are highly appreciated. Thanks in advanced.

    Renato M.
  • Renato Marroquín Mogrovejo at Jun 11, 2010 at 12:39 am
    Hi everybody, thanks a lot for your responses.

    I am actually not looking for a transitive closure, I am not trying to
    "infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy: that is
    in short terms a transitive closure ^^) because I have the data that proves
    it. And yeah I am aware of first logic expressive power, but maybe I will
    think about giving an inference engine a try some time in the future.

    I am actually looking at a connected graph-like problem. The sample records
    resemble a bidirectional triangle.

    990201
    / \
    / \
    / \
    / \
    770011 ------- 770083

    I tried using a smaller version of the problem by using SQL and got a
    horribly huge query which is a non-scalable possibility for me. I have over
    500Gbs in structured files. I used a triple join but I had to retrieve all
    existing possibilities.
    So that is why I thought on using something like Hc.busy mentioned.

    Queries: idx1, idx2, sc, va, p1, p2
    possibilities: possibs = foreach queries generate idx1, flatten(a triple
    self join of the data)

    and by using the flattening command, get all possible combinations, but I am
    not sure that would be the correct approach. Anyhow I thought maybe
    performing some breadth first search but that would give me an algorithm
    O(n2) so )=
    Hey Tanton how would you implement and use an union-find structure? Do you
    think it is possible with PIG?

    Thanks again.


    2010/6/10 Tanton Gibbs <tanton.gibbs@gmail.com>
    To be specific, he's looking for connected components in the graph.

    It's not overly easy to do in an ETL tool (or in pig), but if you can
    wrap the script in a loop it is possible.

    There are actually some really interesting parallel algorithms for
    finding connected components. If you know you are only going to have
    two keys, it is a bit simpler to code (but not any more efficient).
    Essentially, you can implement your union-find algorithm as a series
    of sorts and merges, which on a large parallel system is actually
    quite fast.

    Tanton
    On Thu, Jun 10, 2010 at 3:33 PM, hc busy wrote:
    What's a transitive closure?
    On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco wrote:

    I think what he wants is a transitive closure of the relation, which
    is not achievable in SQL-like languages alone (first order logic
    expressive power).
    I suppose Pig Latin falls in this category.


    -- Gianmarco
    On Thu, Jun 10, 2010 at 19:54, hc busy wrote:
    Is this like a tricky interview question? I don't see the pattern
    between
    those three numbers you listed and the sample of the table.

    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100

    ahh, I guess these are related because id1=id2 an id2=id1... Here's a first
    pass at the problem. Project:

    P1 = foreach table generate id1 as id1, id2 as id2, *;
    P2 = foreach table generate id2 as id1, id1 as id2, *;
    J = join P1 by (id1, id2), P2 by (id1,id2);

    and now J contains pairs of rows from original table where id1 and id2 are
    reversed.

    is this what you want?

    On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
    renatoj.marroquin@gmail.com> wrote:
    Hi everyone, today I came across with a particular query that I don't
    know
    how to model in PIG. Part of my data looks like this:

    Id1 Id2 Sc Va P1 P2
    --------- --------- ----- --------- ----- ----
    770011 990201 401 1e-125 100 65
    990201 770011 440 1e-125 100 42
    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100
    990201 770083 341 1e-125 73 41
    770083 990201 421 1e-125 90 85
    .
    .
    .

    what I would like to retrieve is something like
    this: 770011 990201
    770083
    because they are records actually related.
    Any kind of ideas are highly appreciated. Thanks in advanced.

    Renato M.
  • Tanton Gibbs at Jun 11, 2010 at 4:55 am
    It really depends on the depth of your graph. Are you only dealing
    with connected components of depth 3 or could they be deeper.

    For instance, can you have x -> y, y -> z, and z -> a? If so, do you
    need your record to be x, y, z, a? Or, are you guaranteed that it
    will always be x -> z and y -> z?

    I really need more information about your data before I can offer too
    much advice.

    On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
    wrote:
    Hi everybody, thanks a lot for your responses.

    I am actually not looking for a transitive closure, I am not trying to
    "infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy: that is
    in short terms a transitive closure ^^)  because I have the data that proves
    it. And yeah I am aware of first logic expressive power, but maybe I will
    think about giving an inference engine a try some time in the future.

    I am actually looking at a connected graph-like problem. The sample records
    resemble a bidirectional triangle.

    990201
    /              \
    /                \
    /                  \
    /                    \
    770011  ------- 770083

    I tried using a smaller version of the problem by using SQL and got a
    horribly huge query which is a non-scalable possibility for me. I have over
    500Gbs in structured files. I used a triple join but I had to retrieve all
    existing possibilities.
    So that is why I thought on using something like Hc.busy mentioned.

    Queries: idx1, idx2, sc, va, p1, p2
    possibilities: possibs = foreach queries generate idx1, flatten(a triple
    self join of the data)

    and by using the flattening command, get all possible combinations, but I am
    not sure that would be the correct approach. Anyhow I thought maybe
    performing some breadth first search but that would give me an algorithm
    O(n2)  so  )=
    Hey Tanton how would you implement and use an union-find structure? Do you
    think it is possible with PIG?

    Thanks again.


    2010/6/10 Tanton Gibbs <tanton.gibbs@gmail.com>
    To be specific, he's looking for connected components in the graph.

    It's not overly easy to do in an ETL tool (or in pig), but if you can
    wrap the script in a loop it is possible.

    There are actually some really interesting parallel algorithms for
    finding connected components.  If you know you are only going to have
    two keys, it is a bit simpler to code (but not any more efficient).
    Essentially, you can implement your union-find algorithm as a series
    of sorts and merges, which on a large parallel system is actually
    quite fast.

    Tanton
    On Thu, Jun 10, 2010 at 3:33 PM, hc busy wrote:
    What's a transitive closure?

    On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <gianmarco.dfm@gmail.com>
    wrote:
    I think what he wants is a transitive closure of the relation, which
    is not achievable in SQL-like languages alone (first order logic
    expressive power).
    I suppose Pig Latin falls in this category.


    -- Gianmarco
    On Thu, Jun 10, 2010 at 19:54, hc busy wrote:
    Is this like a tricky interview question? I don't see the pattern
    between
    those three numbers you listed and the sample of the table.

    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100

    ahh, I guess these are related because id1=id2 an id2=id1... Here's a first
    pass at the problem. Project:

    P1 = foreach table generate id1 as id1, id2 as id2, *;
    P2 = foreach table generate id2 as id1, id1 as id2, *;
    J = join P1 by (id1, id2), P2 by (id1,id2);

    and now J contains pairs of rows from original table where id1 and id2 are
    reversed.

    is this what you want?

    On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
    renatoj.marroquin@gmail.com> wrote:
    Hi everyone, today I came across with a particular query that I don't
    know
    how to model in PIG. Part of my data looks like this:

    Id1 Id2 Sc Va P1 P2
    --------- --------- ----- --------- ----- ----
    770011 990201 401 1e-125 100 65
    990201 770011 440 1e-125 100 42
    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100
    990201 770083 341 1e-125 73 41
    770083 990201 421 1e-125 90 85
    .
    .
    .

    what I would like to retrieve is something like
    this:                                             770011 990201
    770083
    because they are records actually related.
    Any kind of ideas are highly appreciated. Thanks in advanced.

    Renato M.
  • Hc busy at Jun 11, 2010 at 6:00 pm
    Yeah, that IS hard in pig. I'm not even sure how to do a self-join in Pig.
    Like you can't really say

    T = join Table by id1, Table by id2, Table by id3;

    I think PigLatin will complain that it's confused which Table is and which
    id1 goes with which table.

    I had been proposing that we allow PigLatin to allow recursive functions,
    that way we can do loops. But recursive functions doesn't fit in the
    data-flow language paradigm.

    But I think people have offered many alternatives in terms of scripting
    language that can wrap PigLatin that has loop and flow control constructs.

    On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs wrote:

    It really depends on the depth of your graph. Are you only dealing
    with connected components of depth 3 or could they be deeper.

    For instance, can you have x -> y, y -> z, and z -> a? If so, do you
    need your record to be x, y, z, a? Or, are you guaranteed that it
    will always be x -> z and y -> z?

    I really need more information about your data before I can offer too
    much advice.

    On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
    wrote:
    Hi everybody, thanks a lot for your responses.

    I am actually not looking for a transitive closure, I am not trying to
    "infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy: that is
    in short terms a transitive closure ^^) because I have the data that proves
    it. And yeah I am aware of first logic expressive power, but maybe I will
    think about giving an inference engine a try some time in the future.

    I am actually looking at a connected graph-like problem. The sample records
    resemble a bidirectional triangle.

    990201
    / \
    / \
    / \
    / \
    770011 ------- 770083

    I tried using a smaller version of the problem by using SQL and got a
    horribly huge query which is a non-scalable possibility for me. I have over
    500Gbs in structured files. I used a triple join but I had to retrieve all
    existing possibilities.
    So that is why I thought on using something like Hc.busy mentioned.

    Queries: idx1, idx2, sc, va, p1, p2
    possibilities: possibs = foreach queries generate idx1, flatten(a triple
    self join of the data)

    and by using the flattening command, get all possible combinations, but I am
    not sure that would be the correct approach. Anyhow I thought maybe
    performing some breadth first search but that would give me an algorithm
    O(n2) so )=
    Hey Tanton how would you implement and use an union-find structure? Do you
    think it is possible with PIG?

    Thanks again.


    2010/6/10 Tanton Gibbs <tanton.gibbs@gmail.com>
    To be specific, he's looking for connected components in the graph.

    It's not overly easy to do in an ETL tool (or in pig), but if you can
    wrap the script in a loop it is possible.

    There are actually some really interesting parallel algorithms for
    finding connected components. If you know you are only going to have
    two keys, it is a bit simpler to code (but not any more efficient).
    Essentially, you can implement your union-find algorithm as a series
    of sorts and merges, which on a large parallel system is actually
    quite fast.

    Tanton
    On Thu, Jun 10, 2010 at 3:33 PM, hc busy wrote:
    What's a transitive closure?

    On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <gianmarco.dfm@gmail.com>
    wrote:
    I think what he wants is a transitive closure of the relation, which
    is not achievable in SQL-like languages alone (first order logic
    expressive power).
    I suppose Pig Latin falls in this category.


    -- Gianmarco
    On Thu, Jun 10, 2010 at 19:54, hc busy wrote:
    Is this like a tricky interview question? I don't see the pattern
    between
    those three numbers you listed and the sample of the table.

    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100

    ahh, I guess these are related because id1=id2 an id2=id1... Here's
    a
    first
    pass at the problem. Project:

    P1 = foreach table generate id1 as id1, id2 as id2, *;
    P2 = foreach table generate id2 as id1, id1 as id2, *;
    J = join P1 by (id1, id2), P2 by (id1,id2);

    and now J contains pairs of rows from original table where id1 and
    id2
    are
    reversed.

    is this what you want?

    On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
    renatoj.marroquin@gmail.com> wrote:
    Hi everyone, today I came across with a particular query that I
    don't
    know
    how to model in PIG. Part of my data looks like this:

    Id1 Id2 Sc Va P1 P2
    --------- --------- ----- --------- ----- ----
    770011 990201 401 1e-125 100 65
    990201 770011 440 1e-125 100 42
    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100
    990201 770083 341 1e-125 73 41
    770083 990201 421 1e-125 90 85
    .
    .
    .

    what I would like to retrieve is something like
    this: 770011 990201
    770083
    because they are records actually related.
    Any kind of ideas are highly appreciated. Thanks in advanced.

    Renato M.
  • Dmitriy Ryaboy at Jun 12, 2010 at 2:11 am
    hc, self-join should just work, but if it doesn't:

    split table into table_2 if 1==1, table3 if 1==1;

    OR

    table_2 = foreach table generate *;
    table3 = foreach table generate *;

    AND THEN

    T = join table by id1, table_2 by id2, table_3 by id3

    -D
    On Fri, Jun 11, 2010 at 10:59 AM, hc busy wrote:

    Yeah, that IS hard in pig. I'm not even sure how to do a self-join in Pig.
    Like you can't really say

    T = join Table by id1, Table by id2, Table by id3;

    I think PigLatin will complain that it's confused which Table is and which
    id1 goes with which table.

    I had been proposing that we allow PigLatin to allow recursive functions,
    that way we can do loops. But recursive functions doesn't fit in the
    data-flow language paradigm.

    But I think people have offered many alternatives in terms of scripting
    language that can wrap PigLatin that has loop and flow control constructs.


    On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs <tanton.gibbs@gmail.com
    wrote:
    It really depends on the depth of your graph. Are you only dealing
    with connected components of depth 3 or could they be deeper.

    For instance, can you have x -> y, y -> z, and z -> a? If so, do you
    need your record to be x, y, z, a? Or, are you guaranteed that it
    will always be x -> z and y -> z?

    I really need more information about your data before I can offer too
    much advice.

    On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
    wrote:
    Hi everybody, thanks a lot for your responses.

    I am actually not looking for a transitive closure, I am not trying to
    "infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy: that is
    in short terms a transitive closure ^^) because I have the data that proves
    it. And yeah I am aware of first logic expressive power, but maybe I
    will
    think about giving an inference engine a try some time in the future.

    I am actually looking at a connected graph-like problem. The sample records
    resemble a bidirectional triangle.

    990201
    / \
    / \
    / \
    / \
    770011 ------- 770083

    I tried using a smaller version of the problem by using SQL and got a
    horribly huge query which is a non-scalable possibility for me. I have over
    500Gbs in structured files. I used a triple join but I had to retrieve all
    existing possibilities.
    So that is why I thought on using something like Hc.busy mentioned.

    Queries: idx1, idx2, sc, va, p1, p2
    possibilities: possibs = foreach queries generate idx1, flatten(a
    triple
    self join of the data)

    and by using the flattening command, get all possible combinations, but
    I
    am
    not sure that would be the correct approach. Anyhow I thought maybe
    performing some breadth first search but that would give me an
    algorithm
    O(n2) so )=
    Hey Tanton how would you implement and use an union-find structure? Do you
    think it is possible with PIG?

    Thanks again.


    2010/6/10 Tanton Gibbs <tanton.gibbs@gmail.com>
    To be specific, he's looking for connected components in the graph.

    It's not overly easy to do in an ETL tool (or in pig), but if you can
    wrap the script in a loop it is possible.

    There are actually some really interesting parallel algorithms for
    finding connected components. If you know you are only going to have
    two keys, it is a bit simpler to code (but not any more efficient).
    Essentially, you can implement your union-find algorithm as a series
    of sorts and merges, which on a large parallel system is actually
    quite fast.

    Tanton
    On Thu, Jun 10, 2010 at 3:33 PM, hc busy wrote:
    What's a transitive closure?

    On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <gianmarco.dfm@gmail.com
    wrote:
    I think what he wants is a transitive closure of the relation,
    which
    is not achievable in SQL-like languages alone (first order logic
    expressive power).
    I suppose Pig Latin falls in this category.


    -- Gianmarco
    On Thu, Jun 10, 2010 at 19:54, hc busy wrote:
    Is this like a tricky interview question? I don't see the pattern
    between
    those three numbers you listed and the sample of the table.

    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100

    ahh, I guess these are related because id1=id2 an id2=id1...
    Here's
    a
    first
    pass at the problem. Project:

    P1 = foreach table generate id1 as id1, id2 as id2, *;
    P2 = foreach table generate id2 as id1, id1 as id2, *;
    J = join P1 by (id1, id2), P2 by (id1,id2);

    and now J contains pairs of rows from original table where id1
    and
    id2
    are
    reversed.

    is this what you want?

    On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
    renatoj.marroquin@gmail.com> wrote:
    Hi everyone, today I came across with a particular query that I
    don't
    know
    how to model in PIG. Part of my data looks like this:

    Id1 Id2 Sc Va P1 P2
    --------- --------- ----- --------- ----- ----
    770011 990201 401 1e-125 100 65
    990201 770011 440 1e-125 100 42
    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100
    990201 770083 341 1e-125 73 41
    770083 990201 421 1e-125 90 85
    .
    .
    .

    what I would like to retrieve is something like
    this: 770011 990201
    770083
    because they are records actually related.
    Any kind of ideas are highly appreciated. Thanks in advanced.

    Renato M.
  • Renato Marroquín Mogrovejo at Jun 13, 2010 at 5:39 am
    Hi everybody, thanks again for the responses. I am really obtaining many
    ideas (:

    About the depth of my data, yeah I am only looking for connected components
    of depth three which I think will be a great help because my BFS would be
    limited in the number of iterations.

    And about the example I must have these records x -> y, y -> z and z -> x in
    order to output the x, y, z result record. But there can probably be
    elements who do not close the graph. For example, I might have a record like
    x -> b, and b doesn't references nobody, so I would have to ignore it. Is
    that what you meant Tanton?
    What about a self-join three times?Would that be a bad idea?
    Thanks again.

    Renato M.

    2010/6/11 Dmitriy Ryaboy <dvryaboy@gmail.com>
    hc, self-join should just work, but if it doesn't:

    split table into table_2 if 1==1, table3 if 1==1;

    OR

    table_2 = foreach table generate *;
    table3 = foreach table generate *;

    AND THEN

    T = join table by id1, table_2 by id2, table_3 by id3

    -D
    On Fri, Jun 11, 2010 at 10:59 AM, hc busy wrote:

    Yeah, that IS hard in pig. I'm not even sure how to do a self-join in Pig.
    Like you can't really say

    T = join Table by id1, Table by id2, Table by id3;

    I think PigLatin will complain that it's confused which Table is and which
    id1 goes with which table.

    I had been proposing that we allow PigLatin to allow recursive functions,
    that way we can do loops. But recursive functions doesn't fit in the
    data-flow language paradigm.

    But I think people have offered many alternatives in terms of scripting
    language that can wrap PigLatin that has loop and flow control
    constructs.

    On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs <tanton.gibbs@gmail.com
    wrote:
    It really depends on the depth of your graph. Are you only dealing
    with connected components of depth 3 or could they be deeper.

    For instance, can you have x -> y, y -> z, and z -> a? If so, do you
    need your record to be x, y, z, a? Or, are you guaranteed that it
    will always be x -> z and y -> z?

    I really need more information about your data before I can offer too
    much advice.

    On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
    wrote:
    Hi everybody, thanks a lot for your responses.

    I am actually not looking for a transitive closure, I am not trying
    to
    "infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy:
    that
    is
    in short terms a transitive closure ^^) because I have the data that proves
    it. And yeah I am aware of first logic expressive power, but maybe I
    will
    think about giving an inference engine a try some time in the future.

    I am actually looking at a connected graph-like problem. The sample records
    resemble a bidirectional triangle.

    990201
    / \
    / \
    / \
    / \
    770011 ------- 770083

    I tried using a smaller version of the problem by using SQL and got a
    horribly huge query which is a non-scalable possibility for me. I
    have
    over
    500Gbs in structured files. I used a triple join but I had to
    retrieve
    all
    existing possibilities.
    So that is why I thought on using something like Hc.busy mentioned.

    Queries: idx1, idx2, sc, va, p1, p2
    possibilities: possibs = foreach queries generate idx1, flatten(a
    triple
    self join of the data)

    and by using the flattening command, get all possible combinations,
    but
    I
    am
    not sure that would be the correct approach. Anyhow I thought maybe
    performing some breadth first search but that would give me an
    algorithm
    O(n2) so )=
    Hey Tanton how would you implement and use an union-find structure?
    Do
    you
    think it is possible with PIG?

    Thanks again.


    2010/6/10 Tanton Gibbs <tanton.gibbs@gmail.com>
    To be specific, he's looking for connected components in the graph.

    It's not overly easy to do in an ETL tool (or in pig), but if you
    can
    wrap the script in a loop it is possible.

    There are actually some really interesting parallel algorithms for
    finding connected components. If you know you are only going to
    have
    two keys, it is a bit simpler to code (but not any more efficient).
    Essentially, you can implement your union-find algorithm as a series
    of sorts and merges, which on a large parallel system is actually
    quite fast.

    Tanton
    On Thu, Jun 10, 2010 at 3:33 PM, hc busy wrote:
    What's a transitive closure?

    On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <
    gianmarco.dfm@gmail.com
    wrote:
    I think what he wants is a transitive closure of the relation,
    which
    is not achievable in SQL-like languages alone (first order logic
    expressive power).
    I suppose Pig Latin falls in this category.


    -- Gianmarco
    On Thu, Jun 10, 2010 at 19:54, hc busy wrote:
    Is this like a tricky interview question? I don't see the
    pattern
    between
    those three numbers you listed and the sample of the table.

    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100

    ahh, I guess these are related because id1=id2 an id2=id1...
    Here's
    a
    first
    pass at the problem. Project:

    P1 = foreach table generate id1 as id1, id2 as id2, *;
    P2 = foreach table generate id2 as id1, id1 as id2, *;
    J = join P1 by (id1, id2), P2 by (id1,id2);

    and now J contains pairs of rows from original table where id1
    and
    id2
    are
    reversed.

    is this what you want?

    On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
    renatoj.marroquin@gmail.com> wrote:
    Hi everyone, today I came across with a particular query that
    I
    don't
    know
    how to model in PIG. Part of my data looks like this:

    Id1 Id2 Sc Va P1 P2
    --------- --------- ----- --------- ----- ----
    770011 990201 401 1e-125 100 65
    990201 770011 440 1e-125 100 42
    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100
    990201 770083 341 1e-125 73 41
    770083 990201 421 1e-125 90 85
    .
    .
    .

    what I would like to retrieve is something like
    this: 770011
    990201
    770083
    because they are records actually related.
    Any kind of ideas are highly appreciated. Thanks in advanced.

    Renato M.
  • Dmitriy Ryaboy at Jun 13, 2010 at 6:42 am
    I don't see why not.

    If you have (start, end) pairs: (a, b), (b, c), (c, a)
    And let's say there's something non-triangular also: (c, d), (d, e), (e, f)

    You want to self join on rel.start == rel.end, which will produce

    a, (a, b), (c, a)
    b, (b, c), (a, b)
    d, (d, e), (c, d)

    You will now want to pull out start, middle, and end. That's easy -- it's
    just ($2.$0, $0, $1.$1)

    So now, after a simple join and a foreach-generate, we have

    (c, a, b)
    (a, b, c)
    (c, d, e)

    Now we want to match, all segments where the start is equal to the end of
    another segment, and the middle is the beginning of that segment

    (c, a), (c, a, b), (a, b, c)

    Your triple is either $1 or $2 -- they are essentially the same.

    You might be able to do a group instead of a join for the last step, if
    certain conditions hold (for example, you must be guaranteed that this does
    not exist: a->b->c->b->a)
    In that case, you could sort the contents of the triples and group on the
    result, saving only those results that have > 1 entry in the group. This
    would be faster as you would need to shuffle only a single copy of the data.

    -D
    On Sat, Jun 12, 2010 at 10:39 PM, Renato Marroquín Mogrovejo wrote:

    Hi everybody, thanks again for the responses. I am really obtaining many
    ideas (:

    About the depth of my data, yeah I am only looking for connected components
    of depth three which I think will be a great help because my BFS would be
    limited in the number of iterations.

    And about the example I must have these records x -> y, y -> z and z -> x
    in
    order to output the x, y, z result record. But there can probably be
    elements who do not close the graph. For example, I might have a record
    like
    x -> b, and b doesn't references nobody, so I would have to ignore it. Is
    that what you meant Tanton?
    What about a self-join three times?Would that be a bad idea?
    Thanks again.

    Renato M.

    2010/6/11 Dmitriy Ryaboy <dvryaboy@gmail.com>
    hc, self-join should just work, but if it doesn't:

    split table into table_2 if 1==1, table3 if 1==1;

    OR

    table_2 = foreach table generate *;
    table3 = foreach table generate *;

    AND THEN

    T = join table by id1, table_2 by id2, table_3 by id3

    -D
    On Fri, Jun 11, 2010 at 10:59 AM, hc busy wrote:

    Yeah, that IS hard in pig. I'm not even sure how to do a self-join in Pig.
    Like you can't really say

    T = join Table by id1, Table by id2, Table by id3;

    I think PigLatin will complain that it's confused which Table is and which
    id1 goes with which table.

    I had been proposing that we allow PigLatin to allow recursive
    functions,
    that way we can do loops. But recursive functions doesn't fit in the
    data-flow language paradigm.

    But I think people have offered many alternatives in terms of scripting
    language that can wrap PigLatin that has loop and flow control
    constructs.

    On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs <tanton.gibbs@gmail.com
    wrote:
    It really depends on the depth of your graph. Are you only dealing
    with connected components of depth 3 or could they be deeper.

    For instance, can you have x -> y, y -> z, and z -> a? If so, do you
    need your record to be x, y, z, a? Or, are you guaranteed that it
    will always be x -> z and y -> z?

    I really need more information about your data before I can offer too
    much advice.

    On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
    wrote:
    Hi everybody, thanks a lot for your responses.

    I am actually not looking for a transitive closure, I am not trying
    to
    "infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy:
    that
    is
    in short terms a transitive closure ^^) because I have the data
    that
    proves
    it. And yeah I am aware of first logic expressive power, but maybe
    I
    will
    think about giving an inference engine a try some time in the
    future.
    I am actually looking at a connected graph-like problem. The sample records
    resemble a bidirectional triangle.

    990201
    / \
    / \
    / \
    / \
    770011 ------- 770083

    I tried using a smaller version of the problem by using SQL and got
    a
    horribly huge query which is a non-scalable possibility for me. I
    have
    over
    500Gbs in structured files. I used a triple join but I had to
    retrieve
    all
    existing possibilities.
    So that is why I thought on using something like Hc.busy mentioned.

    Queries: idx1, idx2, sc, va, p1, p2
    possibilities: possibs = foreach queries generate idx1, flatten(a
    triple
    self join of the data)

    and by using the flattening command, get all possible combinations,
    but
    I
    am
    not sure that would be the correct approach. Anyhow I thought maybe
    performing some breadth first search but that would give me an
    algorithm
    O(n2) so )=
    Hey Tanton how would you implement and use an union-find structure?
    Do
    you
    think it is possible with PIG?

    Thanks again.


    2010/6/10 Tanton Gibbs <tanton.gibbs@gmail.com>
    To be specific, he's looking for connected components in the
    graph.
    It's not overly easy to do in an ETL tool (or in pig), but if you
    can
    wrap the script in a loop it is possible.

    There are actually some really interesting parallel algorithms for
    finding connected components. If you know you are only going to
    have
    two keys, it is a bit simpler to code (but not any more
    efficient).
    Essentially, you can implement your union-find algorithm as a
    series
    of sorts and merges, which on a large parallel system is actually
    quite fast.

    Tanton
    On Thu, Jun 10, 2010 at 3:33 PM, hc busy wrote:
    What's a transitive closure?

    On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <
    gianmarco.dfm@gmail.com
    wrote:
    I think what he wants is a transitive closure of the relation,
    which
    is not achievable in SQL-like languages alone (first order
    logic
    expressive power).
    I suppose Pig Latin falls in this category.


    -- Gianmarco

    On Thu, Jun 10, 2010 at 19:54, hc busy <hc.busy@gmail.com>
    wrote:
    Is this like a tricky interview question? I don't see the
    pattern
    between
    those three numbers you listed and the sample of the table.

    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100

    ahh, I guess these are related because id1=id2 an id2=id1...
    Here's
    a
    first
    pass at the problem. Project:

    P1 = foreach table generate id1 as id1, id2 as id2, *;
    P2 = foreach table generate id2 as id1, id1 as id2, *;
    J = join P1 by (id1, id2), P2 by (id1,id2);

    and now J contains pairs of rows from original table where
    id1
    and
    id2
    are
    reversed.

    is this what you want?

    On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
    renatoj.marroquin@gmail.com> wrote:
    Hi everyone, today I came across with a particular query
    that
    I
    don't
    know
    how to model in PIG. Part of my data looks like this:

    Id1 Id2 Sc Va P1 P2
    --------- --------- ----- --------- ----- ----
    770011 990201 401 1e-125 100 65
    990201 770011 440 1e-125 100 42
    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100
    990201 770083 341 1e-125 73 41
    770083 990201 421 1e-125 90 85
    .
    .
    .

    what I would like to retrieve is something like
    this: 770011
    990201
    770083
    because they are records actually related.
    Any kind of ideas are highly appreciated. Thanks in
    advanced.
    Renato M.
  • Renato Marroquín Mogrovejo at Jun 23, 2010 at 4:11 am
    Hi everybody, sorry for not being able to answer back before, but I have
    just been way too busy lately.
    Thanks a lot I think that will work, I haven't been able to test yet, but I
    am getting there these week, I will post about it later. Thanks again.

    Renato M.

    2010/6/13 Dmitriy Ryaboy <dvryaboy@gmail.com>
    I don't see why not.

    If you have (start, end) pairs: (a, b), (b, c), (c, a)
    And let's say there's something non-triangular also: (c, d), (d, e), (e, f)

    You want to self join on rel.start == rel.end, which will produce

    a, (a, b), (c, a)
    b, (b, c), (a, b)
    d, (d, e), (c, d)

    You will now want to pull out start, middle, and end. That's easy -- it's
    just ($2.$0, $0, $1.$1)

    So now, after a simple join and a foreach-generate, we have

    (c, a, b)
    (a, b, c)
    (c, d, e)

    Now we want to match, all segments where the start is equal to the end of
    another segment, and the middle is the beginning of that segment

    (c, a), (c, a, b), (a, b, c)

    Your triple is either $1 or $2 -- they are essentially the same.

    You might be able to do a group instead of a join for the last step, if
    certain conditions hold (for example, you must be guaranteed that this does
    not exist: a->b->c->b->a)
    In that case, you could sort the contents of the triples and group on the
    result, saving only those results that have > 1 entry in the group. This
    would be faster as you would need to shuffle only a single copy of the
    data.

    -D

    On Sat, Jun 12, 2010 at 10:39 PM, Renato Marroquín Mogrovejo <
    renatoj.marroquin@gmail.com> wrote:
    Hi everybody, thanks again for the responses. I am really obtaining many
    ideas (:

    About the depth of my data, yeah I am only looking for connected
    components
    of depth three which I think will be a great help because my BFS would be
    limited in the number of iterations.

    And about the example I must have these records x -> y, y -> z and z -> x
    in
    order to output the x, y, z result record. But there can probably be
    elements who do not close the graph. For example, I might have a record
    like
    x -> b, and b doesn't references nobody, so I would have to ignore it. Is
    that what you meant Tanton?
    What about a self-join three times?Would that be a bad idea?
    Thanks again.

    Renato M.

    2010/6/11 Dmitriy Ryaboy <dvryaboy@gmail.com>
    hc, self-join should just work, but if it doesn't:

    split table into table_2 if 1==1, table3 if 1==1;

    OR

    table_2 = foreach table generate *;
    table3 = foreach table generate *;

    AND THEN

    T = join table by id1, table_2 by id2, table_3 by id3

    -D
    On Fri, Jun 11, 2010 at 10:59 AM, hc busy wrote:

    Yeah, that IS hard in pig. I'm not even sure how to do a self-join in Pig.
    Like you can't really say

    T = join Table by id1, Table by id2, Table by id3;

    I think PigLatin will complain that it's confused which Table is and which
    id1 goes with which table.

    I had been proposing that we allow PigLatin to allow recursive
    functions,
    that way we can do loops. But recursive functions doesn't fit in the
    data-flow language paradigm.

    But I think people have offered many alternatives in terms of
    scripting
    language that can wrap PigLatin that has loop and flow control
    constructs.

    On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs <
    tanton.gibbs@gmail.com
    wrote:
    It really depends on the depth of your graph. Are you only dealing
    with connected components of depth 3 or could they be deeper.

    For instance, can you have x -> y, y -> z, and z -> a? If so, do
    you
    need your record to be x, y, z, a? Or, are you guaranteed that it
    will always be x -> z and y -> z?

    I really need more information about your data before I can offer
    too
    much advice.

    On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
    wrote:
    Hi everybody, thanks a lot for your responses.

    I am actually not looking for a transitive closure, I am not
    trying
    to
    "infer" that given *x -> y* and *y -> z*, then *x -> z
    *(@hc.busy:
    that
    is
    in short terms a transitive closure ^^) because I have the data
    that
    proves
    it. And yeah I am aware of first logic expressive power, but
    maybe
    I
    will
    think about giving an inference engine a try some time in the
    future.
    I am actually looking at a connected graph-like problem. The
    sample
    records
    resemble a bidirectional triangle.

    990201
    / \
    / \
    / \
    / \
    770011 ------- 770083

    I tried using a smaller version of the problem by using SQL and
    got
    a
    horribly huge query which is a non-scalable possibility for me. I
    have
    over
    500Gbs in structured files. I used a triple join but I had to
    retrieve
    all
    existing possibilities.
    So that is why I thought on using something like Hc.busy
    mentioned.
    Queries: idx1, idx2, sc, va, p1, p2
    possibilities: possibs = foreach queries generate idx1, flatten(a
    triple
    self join of the data)

    and by using the flattening command, get all possible
    combinations,
    but
    I
    am
    not sure that would be the correct approach. Anyhow I thought
    maybe
    performing some breadth first search but that would give me an
    algorithm
    O(n2) so )=
    Hey Tanton how would you implement and use an union-find
    structure?
    Do
    you
    think it is possible with PIG?

    Thanks again.


    2010/6/10 Tanton Gibbs <tanton.gibbs@gmail.com>
    To be specific, he's looking for connected components in the
    graph.
    It's not overly easy to do in an ETL tool (or in pig), but if
    you
    can
    wrap the script in a loop it is possible.

    There are actually some really interesting parallel algorithms
    for
    finding connected components. If you know you are only going to
    have
    two keys, it is a bit simpler to code (but not any more
    efficient).
    Essentially, you can implement your union-find algorithm as a
    series
    of sorts and merges, which on a large parallel system is
    actually
    quite fast.

    Tanton

    On Thu, Jun 10, 2010 at 3:33 PM, hc busy <hc.busy@gmail.com>
    wrote:
    What's a transitive closure?

    On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <
    gianmarco.dfm@gmail.com
    wrote:
    I think what he wants is a transitive closure of the
    relation,
    which
    is not achievable in SQL-like languages alone (first order
    logic
    expressive power).
    I suppose Pig Latin falls in this category.


    -- Gianmarco

    On Thu, Jun 10, 2010 at 19:54, hc busy <hc.busy@gmail.com>
    wrote:
    Is this like a tricky interview question? I don't see the
    pattern
    between
    those three numbers you listed and the sample of the table.

    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100

    ahh, I guess these are related because id1=id2 an
    id2=id1...
    Here's
    a
    first
    pass at the problem. Project:

    P1 = foreach table generate id1 as id1, id2 as id2, *;
    P2 = foreach table generate id2 as id1, id1 as id2, *;
    J = join P1 by (id1, id2), P2 by (id1,id2);

    and now J contains pairs of rows from original table where
    id1
    and
    id2
    are
    reversed.

    is this what you want?

    On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo
    <
    renatoj.marroquin@gmail.com> wrote:
    Hi everyone, today I came across with a particular query
    that
    I
    don't
    know
    how to model in PIG. Part of my data looks like this:

    Id1 Id2 Sc Va P1 P2
    --------- --------- ----- --------- ----- ----
    770011 990201 401 1e-125 100 65
    990201 770011 440 1e-125 100 42
    770011 770083 524 1e-120 89 12
    770083 770011 494 1e-120 39 100
    990201 770083 341 1e-125 73 41
    770083 990201 421 1e-125 90 85
    .
    .
    .

    what I would like to retrieve is something like
    this: 770011
    990201
    770083
    because they are records actually related.
    Any kind of ideas are highly appreciated. Thanks in
    advanced.
    Renato M.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categoriespig, hadoop
postedJun 10, '10 at 1:55a
activeJun 23, '10 at 4:11a
posts12
users5
websitepig.apache.org

People

Translate

site design / logo © 2022 Grokbase