FAQ
Hi there,

I want to create a new JSON Field/Column type. I know there exists
get_json_object(), but the things is I want to multiple JSON
operations in a single select statement and don't want to incur the
cost of parsing the JSON over and over again, because our json
structures are quite large.

Our schema is two fields - the first is a string of some sort used as
an id, and the second is a huge JSON structure. I want to be able
write sql like the following:

select Foo(json), Bar(json), Baz(json) where Quux(json) like '%hello%';

Foo(), Bar(), Baz() and Quux() are all UDF's that operate on json.
Right now what I'm doing is storing the JSON as a string and in each
of those UDF's I am de-serializing the json, and returning another
string.

Ideally I'd want to be able to return JSON objects from these UDF's so
I could compose my functions like "Foo(Bar(Baz(json))"

I'm thinking I need to create a SerDe but SerDe's seem to work at the
table level, not at the field level (in other words, I just need a
custom field, not a entire custom table type...or do I!?)

Also, if there's any docs on writing custom SerDe's please let me
know...I haven't been able to find much.

Thanks!

Bobby

Search Discussions

  • Zheng Shao at Oct 3, 2009 at 5:55 am
    We have 2 example serdes, one for text data (regexserde), one for
    binary data (thriftserde).

    But the simplest solution for this is to add a udf get_json_objects
    that returns a bunch of fields in an array. Then we can use sub query
    to extract the array elements as individual elements, until we have
    support for common expression extraction.


    There are a few example udf in the ctrib package illustrating the way
    to accept variable number of arguments.

    Let us know how it goes.

    Zheng


    On 10/2/09, Bobby Rullo wrote:
    Hi there,

    I want to create a new JSON Field/Column type. I know there exists
    get_json_object(), but the things is I want to multiple JSON
    operations in a single select statement and don't want to incur the
    cost of parsing the JSON over and over again, because our json
    structures are quite large.

    Our schema is two fields - the first is a string of some sort used as
    an id, and the second is a huge JSON structure. I want to be able
    write sql like the following:

    select Foo(json), Bar(json), Baz(json) where Quux(json) like '%hello%';

    Foo(), Bar(), Baz() and Quux() are all UDF's that operate on json.
    Right now what I'm doing is storing the JSON as a string and in each
    of those UDF's I am de-serializing the json, and returning another
    string.

    Ideally I'd want to be able to return JSON objects from these UDF's so
    I could compose my functions like "Foo(Bar(Baz(json))"

    I'm thinking I need to create a SerDe but SerDe's seem to work at the
    table level, not at the field level (in other words, I just need a
    custom field, not a entire custom table type...or do I!?)

    Also, if there's any docs on writing custom SerDe's please let me
    know...I haven't been able to find much.

    Thanks!

    Bobby
    --
    Sent from Gmail for mobile | mobile.google.com

    Yours,
    Zheng
  • Bobby Rullo at Oct 3, 2009 at 6:51 am
    Zheng,

    Thanks for your reply.

    I'm not sure what this "get_json_objects" function would do...do you
    mean it would return a canned set of fields? That's not what I'm
    interested in. The UDFs that I have in mind dig deep into the JSON and
    fetch/filter/transform data. I don't want a few canned fields to play
    with, I want our users to be able to create AdHoc queries on the JSON
    using any and all of the data inside.

    Here's a more realistic example. 'Txn' is a JSON field:

    select TimeSpentQuerying(txn), PageType(txn), TotalRevenue(txn) from
    txn_logs where UserAgent(txn) like '%Safari%'

    I have no problem writing the UDFs, I'm just trying to avoid parsing
    the JSON in each one of them - in other words, I want the input to the
    UDF's to not be a String (or Text) which I need to parse into JSON,
    but rather a JSON object.

    Is that possible? Can I create new field types? And if so, where do I
    register them?

    My fantasy is to be able to do this:

    CREATE TABLE txn_logs (tid String, txn JSON);

    Thanks,

    Bobby


    On Oct 2, 2009, at 10:54 PM, Zheng Shao wrote:

    We have 2 example serdes, one for text data (regexserde), one for
    binary data (thriftserde).

    But the simplest solution for this is to add a udf get_json_objects
    that returns a bunch of fields in an array. Then we can use sub query
    to extract the array elements as individual elements, until we have
    support for common expression extraction.


    There are a few example udf in the ctrib package illustrating the way
    to accept variable number of arguments.

    Let us know how it goes.

    Zheng


    On 10/2/09, Bobby Rullo wrote:
    Hi there,

    I want to create a new JSON Field/Column type. I know there exists
    get_json_object(), but the things is I want to multiple JSON
    operations in a single select statement and don't want to incur the
    cost of parsing the JSON over and over again, because our json
    structures are quite large.

    Our schema is two fields - the first is a string of some sort used as
    an id, and the second is a huge JSON structure. I want to be able
    write sql like the following:

    select Foo(json), Bar(json), Baz(json) where Quux(json) like '%hello
    %';

    Foo(), Bar(), Baz() and Quux() are all UDF's that operate on json.
    Right now what I'm doing is storing the JSON as a string and in each
    of those UDF's I am de-serializing the json, and returning another
    string.

    Ideally I'd want to be able to return JSON objects from these UDF's
    so
    I could compose my functions like "Foo(Bar(Baz(json))"

    I'm thinking I need to create a SerDe but SerDe's seem to work at the
    table level, not at the field level (in other words, I just need a
    custom field, not a entire custom table type...or do I!?)

    Also, if there's any docs on writing custom SerDe's please let me
    know...I haven't been able to find much.

    Thanks!

    Bobby
    --
    Sent from Gmail for mobile | mobile.google.com

    Yours,
    Zheng
  • Zheng Shao at Oct 3, 2009 at 8:29 am
    I got it. You mean TimeSpentQuerying, PageType, TotalRevenue and UserAgent
    are all UDFs that takes a JSON object and outputs a STRING.

    Allowing such a new object type just for UDFs are simpler than supporting a
    new type in all parts of the system, because we don't need to care about
    serialization/deserialization.


    If we go that route, the final query will look like this. Note that
    StringToJson takes a STRING and returns a JSON, and TimeSpentQuerying (etc)
    takes a JSON and returns either STRING or other types that are supported by
    Hive.

    SELECT TimeSpentQuerying(txnjson), PageType(txnjson), TotalRevenue(txnjson)
    from (SELECT StringToJson(txn) as txnjson FROM txn_logs) tnxjson_logs
    where UserAgent(txnjson) like '%Safari%';

    Is this good enough?


    Zheng
    On Fri, Oct 2, 2009 at 11:50 PM, Bobby Rullo wrote:

    Zheng,
    Thanks for your reply.

    I'm not sure what this "get_json_objects" function would do...do you mean
    it would return a canned set of fields? That's not what I'm interested in.
    The UDFs that I have in mind dig deep into the JSON and
    fetch/filter/transform data. I don't want a few canned fields to play with,
    I want our users to be able to create AdHoc queries on the JSON using any
    and all of the data inside.

    Here's a more realistic example. 'Txn' is a JSON field:

    select TimeSpentQuerying(txn), PageType(txn), TotalRevenue(txn) from
    txn_logs where UserAgent(txn) like '%Safari%'

    I have no problem writing the UDFs, I'm just trying to avoid parsing the
    JSON in each one of them - in other words, I want the input to the UDF's to
    not be a String (or Text) which I need to parse into JSON, but rather a JSON
    object.

    Is that possible? Can I create new field types? And if so, where do I
    register them?

    My fantasy is to be able to do this:

    CREATE TABLE txn_logs (tid String, txn JSON);

    Thanks,

    Bobby



    On Oct 2, 2009, at 10:54 PM, Zheng Shao wrote:

    We have 2 example serdes, one for text data (regexserde), one for
    binary data (thriftserde).

    But the simplest solution for this is to add a udf get_json_objects
    that returns a bunch of fields in an array. Then we can use sub query
    to extract the array elements as individual elements, until we have
    support for common expression extraction.


    There are a few example udf in the ctrib package illustrating the way
    to accept variable number of arguments.

    Let us know how it goes.

    Zheng



    On 10/2/09, Bobby Rullo wrote:

    Hi there,


    I want to create a new JSON Field/Column type. I know there exists

    get_json_object(), but the things is I want to multiple JSON

    operations in a single select statement and don't want to incur the

    cost of parsing the JSON over and over again, because our json

    structures are quite large.


    Our schema is two fields - the first is a string of some sort used as

    an id, and the second is a huge JSON structure. I want to be able

    write sql like the following:


    select Foo(json), Bar(json), Baz(json) where Quux(json) like '%hello%';


    Foo(), Bar(), Baz() and Quux() are all UDF's that operate on json.

    Right now what I'm doing is storing the JSON as a string and in each

    of those UDF's I am de-serializing the json, and returning another

    string.


    Ideally I'd want to be able to return JSON objects from these UDF's so

    I could compose my functions like "Foo(Bar(Baz(json))"


    I'm thinking I need to create a SerDe but SerDe's seem to work at the

    table level, not at the field level (in other words, I just need a

    custom field, not a entire custom table type...or do I!?)


    Also, if there's any docs on writing custom SerDe's please let me

    know...I haven't been able to find much.


    Thanks!


    Bobby



    --
    Sent from Gmail for mobile | mobile.google.com

    Yours,
    Zheng


    --
    Yours,
    Zheng
  • Bobby Rullo at Oct 4, 2009 at 1:45 am

    On Oct 3, 2009, at 1:28 AM, Zheng Shao wrote:

    I got it. You mean TimeSpentQuerying, PageType, TotalRevenue and
    UserAgent are all UDFs that takes a JSON object and outputs a STRING. Exactly!
    Allowing such a new object type just for UDFs are simpler than
    supporting a new type in all parts of the system, because we don't
    need to care about serialization/deserialization.
    Interesting. So you're saying it's ok to return whatever type from
    UDF.evaluate(), as long as it doesn't get serialized (i.e. as long as
    it's not the final output)?
    If we go that route, the final query will look like this. Note that
    StringToJson takes a STRING and returns a JSON, and
    TimeSpentQuerying (etc) takes a JSON and returns either STRING or
    other types that are supported by Hive.

    SELECT TimeSpentQuerying(txnjson), PageType(txnjson),
    TotalRevenue(txnjson) from (SELECT StringToJson(txn) as txnjson FROM
    txn_logs) tnxjson_logs
    where UserAgent(txnjson) like '%Safari%';

    Is this good enough?
    That's pretty damn good. I like that all I have to do is create one
    more UDF which does the JSON parsing (and maybe JsonToString as well.)

    The only thing that I'm concerned about is whether that sub-query is
    going to create an extra map job. Is the planner smart enough to stick
    stick all that work in one map?


    Thanks!

    bobby
    Zheng

    On Fri, Oct 2, 2009 at 11:50 PM, Bobby Rullo wrote:
    Zheng,

    Thanks for your reply.

    I'm not sure what this "get_json_objects" function would do...do you
    mean it would return a canned set of fields? That's not what I'm
    interested in. The UDFs that I have in mind dig deep into the JSON
    and fetch/filter/transform data. I don't want a few canned fields to
    play with, I want our users to be able to create AdHoc queries on
    the JSON using any and all of the data inside.

    Here's a more realistic example. 'Txn' is a JSON field:

    select TimeSpentQuerying(txn), PageType(txn), TotalRevenue(txn) from
    txn_logs where UserAgent(txn) like '%Safari%'

    I have no problem writing the UDFs, I'm just trying to avoid parsing
    the JSON in each one of them - in other words, I want the input to
    the UDF's to not be a String (or Text) which I need to parse into
    JSON, but rather a JSON object.

    Is that possible? Can I create new field types? And if so, where do
    I register them?

    My fantasy is to be able to do this:

    CREATE TABLE txn_logs (tid String, txn JSON);

    Thanks,

    Bobby


    On Oct 2, 2009, at 10:54 PM, Zheng Shao wrote:

    We have 2 example serdes, one for text data (regexserde), one for
    binary data (thriftserde).

    But the simplest solution for this is to add a udf get_json_objects
    that returns a bunch of fields in an array. Then we can use sub query
    to extract the array elements as individual elements, until we have
    support for common expression extraction.


    There are a few example udf in the ctrib package illustrating the way
    to accept variable number of arguments.

    Let us know how it goes.

    Zheng


    On 10/2/09, Bobby Rullo wrote:
    Hi there,

    I want to create a new JSON Field/Column type. I know there exists
    get_json_object(), but the things is I want to multiple JSON
    operations in a single select statement and don't want to incur the
    cost of parsing the JSON over and over again, because our json
    structures are quite large.

    Our schema is two fields - the first is a string of some sort used
    as
    an id, and the second is a huge JSON structure. I want to be able
    write sql like the following:

    select Foo(json), Bar(json), Baz(json) where Quux(json) like
    '%hello%';

    Foo(), Bar(), Baz() and Quux() are all UDF's that operate on json.
    Right now what I'm doing is storing the JSON as a string and in each
    of those UDF's I am de-serializing the json, and returning another
    string.

    Ideally I'd want to be able to return JSON objects from these
    UDF's so
    I could compose my functions like "Foo(Bar(Baz(json))"

    I'm thinking I need to create a SerDe but SerDe's seem to work at
    the
    table level, not at the field level (in other words, I just need a
    custom field, not a entire custom table type...or do I!?)

    Also, if there's any docs on writing custom SerDe's please let me
    know...I haven't been able to find much.

    Thanks!

    Bobby
    --
    Sent from Gmail for mobile | mobile.google.com

    Yours,
    Zheng



    --
    Yours,
    Zheng
  • Zheng Shao at Oct 4, 2009 at 11:39 am
    Sub query is NOT going to create additional jobs. Hive is smart enough to
    compile the query into optimized plans, and a lot of obvious optimizations
    are already done.

    Yes that's what I mean. The functionality (for UDF to return arbitrary
    class) is NOT there yet.


    Actually there is an even simpler approach without the need of modifying the
    Hive framework:

    Just create a global String to JSON conversion cache (using LRU algorithm).
    The size of the cache can be set to something like 2.
    This makes sure the conversion from String to JSON will only happen once for
    each row.

    For example code of LRU algorithm, take a look at:
    http://www.source-code.biz/snippets/java/6.htm

    Inside each UDF like TimeSpentQuerying(String), we just call "JSON myjson =
    myLRUCache.get(myString);".

    How do you like this one?


    heng
    On Sat, Oct 3, 2009 at 6:44 PM, Bobby Rullo wrote:


    On Oct 3, 2009, at 1:28 AM, Zheng Shao wrote:

    I got it. You mean TimeSpentQuerying, PageType, TotalRevenue and UserAgent
    are all UDFs that takes a JSON object and outputs a STRING.

    Exactly!

    Allowing such a new object type just for UDFs are simpler than supporting a
    new type in all parts of the system, because we don't need to care about
    serialization/deserialization.


    Interesting. So you're saying it's ok to return whatever type from
    UDF.evaluate(), as long as it doesn't get serialized (i.e. as long as it's
    not the final output)?


    If we go that route, the final query will look like this. Note that
    StringToJson takes a STRING and returns a JSON, and TimeSpentQuerying (etc)
    takes a JSON and returns either STRING or other types that are supported by
    Hive.

    SELECT TimeSpentQuerying(txnjson), PageType(txnjson), TotalRevenue(txnjson)
    from (SELECT StringToJson(txn) as txnjson FROM txn_logs) tnxjson_logs
    where UserAgent(txnjson) like '%Safari%';

    Is this good enough?


    That's pretty damn good. I like that all I have to do is create one more
    UDF which does the JSON parsing (and maybe JsonToString as well.)

    The only thing that I'm concerned about is whether that sub-query is going
    to create an extra map job. Is the planner smart enough to stick stick all
    that work in one map?


    Thanks!

    bobby


    Zheng
    On Fri, Oct 2, 2009 at 11:50 PM, Bobby Rullo wrote:

    Zheng,
    Thanks for your reply.

    I'm not sure what this "get_json_objects" function would do...do you mean
    it would return a canned set of fields? That's not what I'm interested in.
    The UDFs that I have in mind dig deep into the JSON and
    fetch/filter/transform data. I don't want a few canned fields to play with,
    I want our users to be able to create AdHoc queries on the JSON using any
    and all of the data inside.

    Here's a more realistic example. 'Txn' is a JSON field:

    select TimeSpentQuerying(txn), PageType(txn), TotalRevenue(txn) from
    txn_logs where UserAgent(txn) like '%Safari%'

    I have no problem writing the UDFs, I'm just trying to avoid parsing the
    JSON in each one of them - in other words, I want the input to the UDF's to
    not be a String (or Text) which I need to parse into JSON, but rather a JSON
    object.

    Is that possible? Can I create new field types? And if so, where do I
    register them?

    My fantasy is to be able to do this:

    CREATE TABLE txn_logs (tid String, txn JSON);

    Thanks,

    Bobby



    On Oct 2, 2009, at 10:54 PM, Zheng Shao wrote:

    We have 2 example serdes, one for text data (regexserde), one for
    binary data (thriftserde).

    But the simplest solution for this is to add a udf get_json_objects
    that returns a bunch of fields in an array. Then we can use sub query
    to extract the array elements as individual elements, until we have
    support for common expression extraction.


    There are a few example udf in the ctrib package illustrating the way
    to accept variable number of arguments.

    Let us know how it goes.

    Zheng



    On 10/2/09, Bobby Rullo wrote:

    Hi there,


    I want to create a new JSON Field/Column type. I know there exists

    get_json_object(), but the things is I want to multiple JSON

    operations in a single select statement and don't want to incur the

    cost of parsing the JSON over and over again, because our json

    structures are quite large.


    Our schema is two fields - the first is a string of some sort used as

    an id, and the second is a huge JSON structure. I want to be able

    write sql like the following:


    select Foo(json), Bar(json), Baz(json) where Quux(json) like '%hello%';


    Foo(), Bar(), Baz() and Quux() are all UDF's that operate on json.

    Right now what I'm doing is storing the JSON as a string and in each

    of those UDF's I am de-serializing the json, and returning another

    string.


    Ideally I'd want to be able to return JSON objects from these UDF's so

    I could compose my functions like "Foo(Bar(Baz(json))"


    I'm thinking I need to create a SerDe but SerDe's seem to work at the

    table level, not at the field level (in other words, I just need a

    custom field, not a entire custom table type...or do I!?)


    Also, if there's any docs on writing custom SerDe's please let me

    know...I haven't been able to find much.


    Thanks!


    Bobby



    --
    Sent from Gmail for mobile | mobile.google.com

    Yours,
    Zheng


    --
    Yours,
    Zheng


    --
    Yours,
    Zheng
  • Bobby Rullo at Oct 5, 2009 at 5:45 pm
    I think I might like this approach the best!

    Thanks for all your advice....going off to experiment.

    Bobby
    On Oct 4, 2009, at 4:38 AM, Zheng Shao wrote:

    Sub query is NOT going to create additional jobs. Hive is smart
    enough to compile the query into optimized plans, and a lot of
    obvious optimizations are already done.

    Yes that's what I mean. The functionality (for UDF to return
    arbitrary class) is NOT there yet.


    Actually there is an even simpler approach without the need of
    modifying the Hive framework:

    Just create a global String to JSON conversion cache (using LRU
    algorithm).
    The size of the cache can be set to something like 2.
    This makes sure the conversion from String to JSON will only happen
    once for each row.

    For example code of LRU algorithm, take a look at: http://www.source-code.biz/snippets/java/6.htm

    Inside each UDF like TimeSpentQuerying(String), we just call "JSON
    myjson = myLRUCache.get(myString);".

    How do you like this one?


    heng

    On Sat, Oct 3, 2009 at 6:44 PM, Bobby Rullo wrote:
    On Oct 3, 2009, at 1:28 AM, Zheng Shao wrote:

    I got it. You mean TimeSpentQuerying, PageType, TotalRevenue and
    UserAgent are all UDFs that takes a JSON object and outputs a STRING.
    Exactly!

    Allowing such a new object type just for UDFs are simpler than
    supporting a new type in all parts of the system, because we don't
    need to care about serialization/deserialization.
    Interesting. So you're saying it's ok to return whatever type from
    UDF.evaluate(), as long as it doesn't get serialized (i.e. as long
    as it's not the final output)?
    If we go that route, the final query will look like this. Note that
    StringToJson takes a STRING and returns a JSON, and
    TimeSpentQuerying (etc) takes a JSON and returns either STRING or
    other types that are supported by Hive.

    SELECT TimeSpentQuerying(txnjson), PageType(txnjson),
    TotalRevenue(txnjson) from (SELECT StringToJson(txn) as txnjson
    FROM txn_logs) tnxjson_logs
    where UserAgent(txnjson) like '%Safari%';

    Is this good enough?
    That's pretty damn good. I like that all I have to do is create one
    more UDF which does the JSON parsing (and maybe JsonToString as well.)

    The only thing that I'm concerned about is whether that sub-query is
    going to create an extra map job. Is the planner smart enough to
    stick stick all that work in one map?


    Thanks!

    bobby
    Zheng

    On Fri, Oct 2, 2009 at 11:50 PM, Bobby Rullo <bobby@metaweb.com>
    wrote:
    Zheng,

    Thanks for your reply.

    I'm not sure what this "get_json_objects" function would do...do
    you mean it would return a canned set of fields? That's not what
    I'm interested in. The UDFs that I have in mind dig deep into the
    JSON and fetch/filter/transform data. I don't want a few canned
    fields to play with, I want our users to be able to create AdHoc
    queries on the JSON using any and all of the data inside.

    Here's a more realistic example. 'Txn' is a JSON field:

    select TimeSpentQuerying(txn), PageType(txn), TotalRevenue(txn)
    from txn_logs where UserAgent(txn) like '%Safari%'

    I have no problem writing the UDFs, I'm just trying to avoid
    parsing the JSON in each one of them - in other words, I want the
    input to the UDF's to not be a String (or Text) which I need to
    parse into JSON, but rather a JSON object.

    Is that possible? Can I create new field types? And if so, where do
    I register them?

    My fantasy is to be able to do this:

    CREATE TABLE txn_logs (tid String, txn JSON);

    Thanks,

    Bobby


    On Oct 2, 2009, at 10:54 PM, Zheng Shao wrote:

    We have 2 example serdes, one for text data (regexserde), one for
    binary data (thriftserde).

    But the simplest solution for this is to add a udf get_json_objects
    that returns a bunch of fields in an array. Then we can use sub
    query
    to extract the array elements as individual elements, until we have
    support for common expression extraction.


    There are a few example udf in the ctrib package illustrating the
    way
    to accept variable number of arguments.

    Let us know how it goes.

    Zheng


    On 10/2/09, Bobby Rullo wrote:
    Hi there,

    I want to create a new JSON Field/Column type. I know there exists
    get_json_object(), but the things is I want to multiple JSON
    operations in a single select statement and don't want to incur the
    cost of parsing the JSON over and over again, because our json
    structures are quite large.

    Our schema is two fields - the first is a string of some sort
    used as
    an id, and the second is a huge JSON structure. I want to be able
    write sql like the following:

    select Foo(json), Bar(json), Baz(json) where Quux(json) like
    '%hello%';

    Foo(), Bar(), Baz() and Quux() are all UDF's that operate on json.
    Right now what I'm doing is storing the JSON as a string and in
    each
    of those UDF's I am de-serializing the json, and returning another
    string.

    Ideally I'd want to be able to return JSON objects from these
    UDF's so
    I could compose my functions like "Foo(Bar(Baz(json))"

    I'm thinking I need to create a SerDe but SerDe's seem to work at
    the
    table level, not at the field level (in other words, I just need a
    custom field, not a entire custom table type...or do I!?)

    Also, if there's any docs on writing custom SerDe's please let me
    know...I haven't been able to find much.

    Thanks!

    Bobby
    --
    Sent from Gmail for mobile | mobile.google.com

    Yours,
    Zheng



    --
    Yours,
    Zheng



    --
    Yours,
    Zheng

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedOct 3, '09 at 2:37a
activeOct 5, '09 at 5:45p
posts7
users2
websitehive.apache.org

2 users in discussion

Bobby Rullo: 4 posts Zheng Shao: 3 posts

People

Translate

site design / logo © 2022 Grokbase