Grokbase Groups Pig user April 2011
I noticed that there is a Pig JSON Loader (which might or might not be in
piggybank).
Could anyone confirm the existence or absence of a JSONToTuple UDF (not a
loader)?

I am inspired by the UDF mentioned on Slide 23 here:
http://www.slideshare.net/danharvey/hbase-at-mendeley

doc = FOREACH rawdocs GENERATE DocumentProtobufBytesToTuple(protodoc) as DOC;

My desire is to store a raw JSON doc in a cell in HBase and run Pig queries
against the tuples generated by the UDF.
I have already used the HBase Loader to get the cell data, and now I need a
JSON deserializer.

I would be willing to roll my own (and contribute), but I figured I'd see if
there was anything out there first.

thanks,
daniel

  • Bill Graham at Apr 19, 2011 at 6:26 pm
    We're doing the same thing using a JsonToMap UDF followed by a
    MapToBag UDF. The former was similarly inspired by the elephant bird
    JSONLoader. I'd be glad to collaborate on a contribution if you'd
    like.

    Here's what our scripts look like:

    define mapToBag cnwk.hadoop.mapreduce.pig.udf.MapToBag();
    define jsonToMap cnwk.hadoop.mapreduce.pig.udf.JsonToMap();
    define concat org.apache.pig.builtin.StringConcat();

    raw = LOAD 'hbase://user_info'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('events:*')
        AS (events_map:map[]);

    -- Convert our maps to bags so we can flatten them out
    B = FOREACH raw GENERATE mapToBag(events_map) AS event_bag;

    C = FOREACH B GENERATE FLATTEN(event_bag)
        AS (event_k:chararray, event_v:chararray);

    -- Convert the JSON events into maps
    D = FOREACH C GENERATE event_k, jsonToMap(event_v) AS event_map:map[];

    -- Example showing how to filter on a given field
    E = FILTER D BY (event_map#'levt.astid' IS NOT NULL AND
        event_map#'levt.asid' IS NOT NULL);

    -- Example showing how to pull data out of a map
    F = FOREACH E GENERATE event_map#'levt.asid' AS asid,
        event_map#'levt.astid' AS astid;


    thanks,
    Bill
  • Daniel Eklund at Apr 19, 2011 at 6:44 pm
    Bill, thanks...

    So that is a confirmation... people have rolled their own, and it's not in
    piggybank.
    I would absolutely be willing to work with you to get a contribution going,
    but (as a warning) I am extremely new to Pig.

    I was looking at this:
    http://wiki.apache.org/pig/UDFManual
    to get my mind wrapped around the framework. I also discovered this:
    https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java
    (I am assuming this was the UDF you mentioned that inspired you)...

    A quick question about the UDFs registered at the top of a Pig script:

    does
    REGISTER myJar.jar
    distribute the jar across HDFS (like a Hadoop job jar) so that the
    distribution of the code to the cluster nodes is transparent?
    In other words, do we NOT have to distribute myJar.jar to each node on the
    cluster?

    thanks more,
    daniel


  • John Hui at Apr 19, 2011 at 6:50 pm
    I have a JSON library and pig script working. Should I just contribute it
    instead of reinventing the wheel?

    John
  • Dmitriy Ryaboy at Apr 19, 2011 at 6:56 pm
    YES :)
  • Xavier Stevens at Apr 19, 2011 at 6:58 pm
    For what it's worth I have one as well. This one uses Jackson to parse
    everything.

    https://github.com/xstevens/akela/blob/master/src/java/com/mozilla/pig/eval/json/JsonMap.java

  • Daniel Eklund at Apr 19, 2011 at 7:00 pm
    great...

    this was exactly what I was hoping for... (although I have a bit of sadness,
    as I was just about ready to get my hands dirty)
  • Alan Gates at Apr 19, 2011 at 7:01 pm

    On Apr 19, 2011, at 11:44 AM, Daniel Eklund wrote:

    <snip>
    A quick question about the UDF's registered at the top of a pig script:

    does
    REGISTER myJar.jar
    distribute the jar across HDFS (like a Hadoop job jar) so that the
    distribution of the code to the cluster nodes is transparent?
    In other words, do we NOT have to distribute myJar.jar to each node on the
    cluster.

    Pig takes care of getting myJar.jar to the task nodes; you do not have
    to worry about it.

    Alan.
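
    As a rough illustration of the point above, a script only needs to
    register the jar once, on the machine where the script is launched,
    and Pig handles getting it to the tasks. This is a minimal sketch,
    not taken from the thread: the jar path, the class name
    com.example.pig.JsonToMap, and the input file are all hypothetical.

    ```pig
    -- Hypothetical jar path: the jar only needs to exist where the script is launched.
    REGISTER /path/to/myJar.jar;

    -- Hypothetical UDF class, assumed to be packaged inside myJar.jar.
    DEFINE jsonToMap com.example.pig.JsonToMap();

    -- Hypothetical input: one JSON document per line, read as a single chararray.
    raw = LOAD 'events.json' USING TextLoader() AS (line:chararray);

    -- The UDF runs inside the map/reduce tasks; no manual copy of the jar to each node.
    parsed = FOREACH raw GENERATE jsonToMap(line) AS event_map:map[];

    DUMP parsed;
    ```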
  • John Hui at Apr 19, 2011 at 7:02 pm
    I don't think one parser will work for every case. It really depends on
    your data, since there might be a list within a list.

    But pick any one as a starting point and customize it for your own JSON
    data format.
  • John Hui at Apr 19, 2011 at 7:03 pm
    I'll post my solution in a few hours =)
  • Xavier Stevens at Apr 19, 2011 at 7:09 pm
    Hey John,

    If you take a look at mine it looks explicitly for Lists and converts
    them to DataBags. I ran into that issue with our data. That said I won't
    make any claims that it'll work for all data.

    Cheers,

    -Xavier
  • John Hui at Apr 19, 2011 at 7:11 pm
    Really? Cool. Let me take a look when I have some "downtime". If that's
    the case, Xavier's parser is much better than mine.

    Who wants to take the lead in adding this to the piggybank? I am sure this
    makes for a very useful "storage" utility.

    John
  • Dmitriy Ryaboy at Apr 19, 2011 at 7:20 pm
    FYI there's a ticket open already, though it hasn't seen much action:

    https://issues.apache.org/jira/browse/PIG-1914

    Perhaps the best thing would be to discuss implementation approaches, etc,
    there.

    D

Discussion Overview
group: user @
categories: pig, hadoop
posted: Apr 19, '11 at 5:09p
active: Apr 19, '11 at 7:20p
posts: 13
users: 6
website: pig.apache.org
