FAQ
Just a quick question before I go and do this myself: has anyone written a
StoreFunc (or, better, a reversible one that does both load and store) for
JSON? Basically, I have a relation that I would like to write to a bunch of
files as plain JSON objects, and I have the associated JSON schema. Is there
an easy way to do this that I'm missing?

--
Zaki Rahaman


  • Dmitriy Ryaboy at Dec 11, 2009 at 7:07 pm
    Zaki,
    Judging by the resounding silence, no one has a Load/Store func that
    reads JSON (or at least one they are willing to share). It should be
    pretty straightforward to write, though. The one trick is how to
    determine the start of a JSON record if you are dropped in the middle
    of the file by the slicer.
    If you write one, consider using the Jackson JSON parser. It's the one
    Avro uses, so that will save us from having to use multiple jars that
    do the same thing (parse JSON) when Avro starts being used for
    internal Pig stuff.

    -D
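
    [Editor's note: for what it's worth, here is a minimal sketch of the
    resync trick Dmitriy describes. It assumes one JSON object per line
    (an assumption, not something established in this thread), and the
    class and method names are illustrative, not any actual Pig or
    Hadoop API.]

        import java.io.IOException;
        import org.apache.hadoop.fs.FSDataInputStream;

        // Sketch: a reader dropped at an arbitrary byte offset resyncs by
        // discarding bytes up to the next newline; the reader that owns the
        // previous split consumes that partial record instead.
        public class JsonRecordSync {

            // Return the offset of the first complete record at or after 'offset'.
            public static long syncToRecordStart(FSDataInputStream in, long offset)
                    throws IOException {
                if (offset == 0) {
                    return 0; // the start of the file is always a record boundary
                }
                in.seek(offset);
                long pos = offset;
                int b;
                while ((b = in.read()) != -1) {
                    pos++;
                    if (b == '\n') {
                        break; // the next byte starts a complete JSON object
                    }
                }
                return pos;
            }
        }

    [With newline-delimited records this reduces to what TextInputFormat
    already does for plain text; truly multi-line JSON would need a real
    resync heuristic, such as scanning for a top-level '{'.]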
  • Zaki Rahaman at Dec 11, 2009 at 8:36 pm
    I came across pig-json in an old 2007 mailing list archive, although it
    appears to be ancient and completely out of date (it contains references
    to Datum and DataAtom). Can it be reworked into what I am looking for?
    At the least, I'd like to update it to test and use as a starting point.

    And out of curiosity, why would someone keep something so simple but so
    useful to themselves? Also, I noticed mention of a new Load/Store
    implementation that includes schema support. I suspect this would be very
    useful to implement or make use of, as I don't see how to efficiently
    parse heavily nested/complex JSON into Pig data types without some kind
    of additional schema; at the least, a schema should make such data much
    easier to parse (see the conversion sketch after this message).

    As for the parser itself, I have already hacked together something using
    Jackson, but it is essentially a glorified EvalFunc that returns tuples
    as JSON-encoded strings using a given schema; this obviously won't work
    for loading/storing and is strictly a one-way hack.
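
    [Editor's note: for reference, a hedged sketch of what such a one-way
    EvalFunc might look like. The constructor argument and field handling
    are assumptions, and the Jackson package shown is the modern
    com.fasterxml one; the 2009-era parser lived under org.codehaus.jackson.]

        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.Tuple;
        import com.fasterxml.jackson.databind.ObjectMapper;
        import com.fasterxml.jackson.databind.node.ObjectNode;

        // One-way hack: serialize a tuple to a JSON string against a
        // caller-supplied list of field names. Nested tuples/bags omitted.
        public class TupleToJson extends EvalFunc<String> {
            private final ObjectMapper mapper = new ObjectMapper();
            private final String[] fieldNames;

            public TupleToJson(String commaSeparatedFieldNames) {
                this.fieldNames = commaSeparatedFieldNames.split(",");
            }

            @Override
            public String exec(Tuple input) throws IOException {
                if (input == null) {
                    return null;
                }
                ObjectNode obj = mapper.createObjectNode();
                for (int i = 0; i < fieldNames.length && i < input.size(); i++) {
                    // putPOJO handles String/Integer/Long/Double transparently.
                    obj.putPOJO(fieldNames[i], input.get(i));
                }
                return mapper.writeValueAsString(obj);
            }
        }

    [In Pig Latin this might be wired up along the lines of
    DEFINE ToJson TupleToJson('id,name'); B = FOREACH A GENERATE ToJson(*);
    with hypothetical aliases.]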

    --
    Zaki Rahaman
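
    [Editor's note: on the nested-JSON concern above, even without a schema
    one workable (if type-lossy) approach is to walk the Jackson tree,
    mapping JSON objects to Pig maps, arrays to bags, and scalars to plain
    Java types. A hedged sketch, again using modern Jackson package names
    rather than the era's org.codehaus.jackson:]

        import java.util.HashMap;
        import java.util.Iterator;
        import java.util.Map;
        import org.apache.pig.data.BagFactory;
        import org.apache.pig.data.DataBag;
        import org.apache.pig.data.TupleFactory;
        import com.fasterxml.jackson.databind.JsonNode;

        // Recursively convert a parsed Jackson tree into Pig-friendly values.
        public class JsonToPig {
            private static final TupleFactory TF = TupleFactory.getInstance();
            private static final BagFactory BF = BagFactory.getInstance();

            public static Object convert(JsonNode node) {
                if (node == null || node.isNull()) {
                    return null;
                } else if (node.isObject()) {
                    Map<String, Object> m = new HashMap<String, Object>();
                    Iterator<Map.Entry<String, JsonNode>> it = node.fields();
                    while (it.hasNext()) {
                        Map.Entry<String, JsonNode> e = it.next();
                        m.put(e.getKey(), convert(e.getValue()));
                    }
                    return m; // JSON object -> Pig map
                } else if (node.isArray()) {
                    DataBag bag = BF.newDefaultBag();
                    for (JsonNode elem : node) {
                        // JSON array -> bag of single-field tuples
                        bag.add(TF.newTuple(convert(elem)));
                    }
                    return bag;
                } else if (node.isInt()) {
                    return node.intValue();
                } else if (node.isLong()) {
                    return node.longValue();
                } else if (node.isNumber()) {
                    return node.doubleValue();
                } else if (node.isBoolean()) {
                    return node.booleanValue(); // early Pig had no boolean type
                } else {
                    return node.asText(); // everything else as chararray
                }
            }
        }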
  • Dmitriy Ryaboy at Dec 11, 2009 at 9:25 pm
    Technically, nothing's stopping you from using schemas without waiting
    for the load/store redesign -- you just need to implement
    determineSchema(). The trouble is that you'd need to build Pig's Schema
    object by hand, which can be a bit tricky, especially for nested data.
    You can check out what I am doing in PIG-760 (schemas serialized to
    JSON metadata files -- not to be confused with the actual data being
    in JSON).

    You can probably rework the Datum/DataAtom stuff; I ported a few
    piggybank contribs a while back that were written for Pig 0.1, and it
    was pretty easy. I only had to deal with strings, though.

    -D
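
    [Editor's note: to make the "build Pig's Schema object by hand" point
    concrete, here is a sketch of what a determineSchema()-style method
    might return for JSON shaped like {"id": 1, "tags": [{"name": "..."}]}.
    The field names are made up for illustration.]

        import org.apache.pig.data.DataType;
        import org.apache.pig.impl.logicalLayer.schema.Schema;
        import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

        // Hand-built nested schema: (id: int, tags: {(name: chararray)})
        public class JsonSchemaExample {
            public static Schema build() throws Exception {
                // Inner tuple: (name: chararray)
                Schema tagTuple = new Schema(
                        new FieldSchema("name", DataType.CHARARRAY));
                // Bag wrapping that tuple: {(name: chararray)}
                Schema bagSchema = new Schema(
                        new FieldSchema("t", tagTuple, DataType.TUPLE));
                // Top level: (id: int, tags: bag)
                Schema top = new Schema();
                top.add(new FieldSchema("id", DataType.INTEGER));
                top.add(new FieldSchema("tags", bagSchema, DataType.BAG));
                return top;
            }
        }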
  • Zaki Rahaman at Dec 11, 2009 at 9:43 pm
    Any hints or pointers on reworking Datum and DataAtom? Maybe I skipped
    over it, but I didn't see a migration guide.

    Also, what were these used for, and why were they removed?

    I will get to work on this when I can; right now I've shifted gears and
    am working on a sort of puppet-master framework for Pig and Elastic
    MapReduce. If it proves worthwhile and there's interest (and assuming I
    get clearance from my higher-ups), I'd love to contribute this as well.
    The basic idea is to abstract away some of the nitty-gritty of working
    with transient clusters and one-off jobs, while also providing something
    extensible for repeated workflows/jobflows. I'm also thinking about
    adding some further abstraction to let end users quickly run some very
    common jobs (load, filter, group, count/avg/sum). I know there's already
    the Ruby client that serves some of these purposes, but is there anything
    out there that I'm missing? I'd like to avoid duplicating existing
    efforts if possible.

    --
    Zaki Rahaman
  • Alan Gates at Dec 11, 2009 at 10:10 pm

    On Dec 11, 2009, at 1:42 PM, zaki rahaman wrote:

    Any hints or pointers on reworking Datum and DataAtom? Maybe I skipped
    over it, but I didn't see a migration guide.

    Also, what were these used for, and why were they removed?

    Originally (0.1) Pig had only four data types: bag, tuple, map, and
    scalar. Scalar was represented by DataAtom. All data types derived
    from Datum (I think; I'm working from memory here). As part of the
    complete rewrite for 0.2, scalar was broken out into the current int,
    long, etc., so DataAtom was removed. Since we decided to use Java
    types to represent some data types (Integer for int, ...), it no longer
    made sense to have a common supertype, so Datum was removed as well.

    While we don't have a porting guide, there is the functional spec for
    types, which includes some information on how they should be used in
    UDFs: http://wiki.apache.org/pig/PigTypesFunctionalSpec

    Alan.
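
    [Editor's note: to illustrate the change Alan describes, a post-0.2 UDF
    receives plain Java objects inside the tuple instead of DataAtom
    wrappers. A minimal, hypothetical example:]

        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.Tuple;

        // Post-0.2 style: chararray arrives as java.lang.String, int as
        // java.lang.Integer, and so on -- no DataAtom/Datum unwrapping.
        public class FirstChar extends EvalFunc<String> {
            @Override
            public String exec(Tuple input) throws IOException {
                if (input == null || input.size() == 0) {
                    return null;
                }
                // A 0.1-era UDF would have unwrapped a DataAtom here.
                String s = (String) input.get(0);
                return (s == null || s.isEmpty()) ? null : s.substring(0, 1);
            }
        }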
  • Tommy Chheng at Dec 12, 2009 at 10:19 pm
    Hey Zaki,
    Have you taken a look at Opscode's Chef? It's similar to Puppet's goal
    of automated deployments. They have some existing automated deployment
    scripts for Cloudera's Hadoop:
    http://www.opscode.com/blog/2009/06/22/cloudera-hadoop-at-velocity/

    P.S. I renamed the thread to change the topic.

    Tommy
