Grokbase Groups Pig user October 2010
Hi,

If I have bags that have a dynamic number of fields that look something like
this:

("park", "building", "office")
("store", "school")
("building", "school", "restaurant", "hotel")

Is it possible to transform this into one tuple per field, so that my data
looks like this and I can then do group-bys and counts? Maybe I can do this in
an eval UDF?

("park")
("building")
("office")
("store")
...


-Kim


  • Sal Uryasev at Oct 14, 2010 at 5:47 pm
    Hey Kim,

    You can use the FLATTEN operator. The results don't necessarily stay tuples, but you probably don't need them to.

    http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#Example:+Flattening
  • Dmitriy Ryaboy at Oct 14, 2010 at 6:34 pm
    Kim,
    You can't just flatten it? Not sure I am following the example right.

    -D
  • Kim Vogt at Oct 14, 2010 at 7:10 pm
    Hey Dmitriy,

    I tried to keep my example simple but maybe that doesn't work so here it
    goes. I'm trying to do group/counts on the "tag" key/values in json data
    that looks like this:

    {
        "type": "Feature",
        "id": 61561312,
        "geometry": {
            "type": "Polygon",
            "coordinates": [
                [
                    ["53.18119", "4.85247"],
                    ["53.180908", "4.8518934"],
                    ["53.1807441", "4.8520919"],
                    ["53.181027", "4.8526444"],
                    ["53.18119", "4.85247"]
                ]
            ]
        },
        "properties": {
            "uid": 26959,
            "timestamp": "2010-06-09T12:25:02Z",
            "changeset": 4944796,
            "user": "ttwimlex",
            "version": 1
        },
        "tags": [
            ["amenity", "parking"],
            ["name", "Vuurtoren"]
        ]
    }

    So in the end, I want stats on how many unique key/value tag pairs I see.
    For just this one record, my output would be something like:

    (["amenity", "parking"], 1)
    (["name", "Vuurtoren"], 1)

    I load the data using my jsonLoader and grab just the tags, like this:

    data = LOAD 'data.txt' using PigJsonLoader as (json: map[]);
    data = FOREACH data GENERATE json#'tags';
    dump data;

    ([["amenity","parking"],["name","Vuurtoren"]])

    and then I get stuck, because there can be any number of these tags in each
    json record. I thought if I could split them out into multiple bags, I
    could group and count. Or maybe I'm missing something obvious :-)

    -Kim
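
To make the goal above concrete, here is a minimal sketch in plain Python (not Pig) of the result being asked for: one count per distinct key/value tag pair across all records. The function name `count_tag_pairs` and the second sample record are hypothetical and only there for illustration; the first record is a trimmed version of the one in the post.

```python
import json
from collections import Counter


def count_tag_pairs(json_lines):
    """Count how often each (key, value) tag pair occurs across records."""
    counts = Counter()
    for line in json_lines:
        record = json.loads(line)
        # "tags" is a list of [key, value] lists; a record may hold any number.
        for key, value in record.get("tags", []):
            counts[(key, value)] += 1
    return counts


records = [
    # Trimmed version of the record from the post.
    '{"id": 61561312, "tags": [["amenity", "parking"], ["name", "Vuurtoren"]]}',
    # Hypothetical second record, added only to show a count above 1.
    '{"id": 2, "tags": [["amenity", "parking"]]}',
]

counts = count_tag_pairs(records)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

In Pig terms this is the "one row per pair, then group and count" shape; the open question in the thread is how to get from the loaded field to one row per pair.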
  • Dmitriy Ryaboy at Oct 14, 2010 at 8:33 pm
    I see. I think you can flatten each row in data to un-nest it, so you will get
    (["amenity","parking"],["name","Vuurtoren"]); then for each resulting row
    call ToBag(*), getting ({["amenity","parking"],["name","Vuurtoren"]}); then
    flatten *that*, getting a row per pair. Now you can group and count.

    Haven't tried it, let me know how it goes.

    -D
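
The three steps suggested above can be modeled outside Pig to see what each one is meant to produce. This is only a sketch in plain Python under the assumption that the loaded field really is a nested collection of pairs; the helper names `unnest`, `to_bag` and `flatten_bag` are hypothetical stand-ins for FLATTEN, ToBag(*) and the second FLATTEN, not Pig APIs.

```python
from collections import Counter

# One input row per JSON record; its single field holds that record's tag pairs.
rows = [
    ([("amenity", "parking"), ("name", "Vuurtoren")],),
]


def unnest(row):
    """Step 1: flatten the single field so each pair becomes its own field."""
    (pairs,) = row
    return tuple(pairs)


def to_bag(row):
    """Step 2: wrap all fields of the row in one bag (modeled as a list)."""
    return list(row)


def flatten_bag(bag):
    """Step 3: emit one single-field row per element of the bag."""
    return [(pair,) for pair in bag]


# Apply the three steps to every row, collecting one row per pair.
pair_rows = [r for row in rows for r in flatten_bag(to_bag(unnest(row)))]

# Group and count: one entry per distinct pair.
counts = Counter(pair for (pair,) in pair_rows)
```

The sketch only works because `rows` already contains real nested structures; whether the loader in this thread delivers the field in that form is a separate question.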
  • Kim Vogt at Oct 15, 2010 at 6:05 pm
    grunt> data = LOAD 'data.txt' using PigJsonLoader as (json: map[]);
    grunt> data = FOREACH data GENERATE json#'tags';
    grunt> describe data;
    data: {bytearray}
    grunt> data = FOREACH data GENERATE FLATTEN($0);
    grunt> describe data;
    data: {bytearray}
    grunt> dump data;
    ([["amenity","parking"],["name","Vuurtoren"]])

    It's not removing the outside brackets, and then ToBag doesn't work
    correctly.

    Instead, I wrote my own "SplitIntoBag", which creates a tuple out of each
    key/value pair, adds it to a bag, and returns the bag. Not sure if this is
    the most efficient way, but it works, so I'll roll with it.

    -Kim
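
The describe output above reports the field as a bytearray rather than a bag or tuple, which is presumably why FLATTEN left it unchanged. The actual "SplitIntoBag" code is not shown in the thread, so the following is only a plain-Python sketch of the idea as described: parse the raw field, build one tuple per key/value pair, and return them as a bag (modeled here as a list). The name `split_into_bag` and the assumption that the raw field is UTF-8 JSON text are mine.

```python
import json


def split_into_bag(raw_tags):
    """Parse the raw tags field and return a bag (list) of (key, value) tuples."""
    if raw_tags is None:
        return []
    # Assumption: the untyped field carries UTF-8 encoded JSON text.
    if isinstance(raw_tags, (bytes, bytearray)):
        raw_tags = raw_tags.decode("utf-8")
    return [(key, value) for key, value in json.loads(raw_tags)]


raw = b'[["amenity","parking"],["name","Vuurtoren"]]'
bag = split_into_bag(raw)
```

Once a function like this returns a real bag, flattening it yields one row per pair, and the group and count step becomes straightforward.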

Discussion Overview
group: user @
categories: pig, hadoop
posted: Oct 14, '10 at 5:30p
active: Oct 15, '10 at 6:05p
posts: 6
users: 3
website: pig.apache.org
