Hey guys,

I am new to Pig.
I was wondering whether it is possible to pass a schema to the Pig LOAD
statement when loading data for the first time.

Suppose I have a huge dataset with around 100 columns. Is there a way to
pass a schema defined in some other file (some kind of meta file) into the
LOAD statement, or do I have to define it inside the LOAD statement every
time?

Thanks,
Praveenesh


  • Stan Rosenberg at Feb 3, 2012 at 10:32 pm
    My hunch is you'll have to write a custom loader, but I'll let the
    experts chime in. E.g., AvroStorage loader can parse the schema
    from a json file passed to it via the constructor. I don't think
    PigStorage has the same option.

    stan
  • Praveenesh kumar at Feb 3, 2012 at 10:36 pm
    Thanks Stan,
    If you were facing this kind of scenario, how would you proceed?
    Can you give me some pointers on how to write a custom loader, or some good
    tutorials on it?
    What is the current practice for solving this scenario in Pig?

    Praveenesh

  • Stan Rosenberg at Feb 3, 2012 at 10:42 pm
    Hi Praveenesh,

    Assuming you have already read these:

    http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html
    http://pig.apache.org/docs/r0.9.2/udf.html#load-store-functions

    my next step would be to peruse the source code of some existing
    loaders, e.g., PigStorage.

    Best,

    stan

  • Praveenesh kumar at Feb 3, 2012 at 10:45 pm
    Thanks Stan,
    I had been going through exactly these. I was wondering whether there is an
    easy way to do it, or whether I am misreading something.
    I will now focus on what you have suggested, but I hope there is some easy
    solution to my problem.

    Praveenesh
  • Stan Rosenberg at Feb 4, 2012 at 2:41 am
    Hi Praveenesh,

    Maybe this will get you started.

    Suppose we have the desired schema parsed and stored in 'map' of type
    LinkedHashMap<String, String>. The key is your field name, and the
    value denotes the data type, e.g., 'string', 'int', etc.

    Now, let's derive Pig's schema from this map:

    // Imports the fragments below rely on:
    import java.io.IOException;
    import java.util.Map.Entry;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.pig.ResourceSchema;
    import org.apache.pig.data.DataType;
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    Schema schema = new Schema(); // Pig schema

    for (Entry<String, String> entry : map.entrySet()) {
        schema.add(new Schema.FieldSchema(entry.getKey(),
                getPigType(entry.getValue())));
    }

    where getPigType returns the corresponding Pig data type:

    byte getPigType(String fieldType) {
        if (fieldType.equalsIgnoreCase("string")) {
            return DataType.CHARARRAY;
        } else if (fieldType.equalsIgnoreCase("int")) {
            return DataType.INTEGER;
        } else if (fieldType.equalsIgnoreCase("long")) {
            return DataType.LONG;
        } else if (fieldType.equalsIgnoreCase("float")) {
            return DataType.FLOAT;
        } else if (fieldType.equalsIgnoreCase("double")) {
            return DataType.DOUBLE;
        } else if (fieldType.equalsIgnoreCase("boolean")) {
            return DataType.BOOLEAN;
        } else {
            // Unrecognized type names fall back to chararray.
            return DataType.CHARARRAY;
        }
    }

    Now, you'll want to implement 'getSchema' in your custom loader:

    @Override
    public ResourceSchema getSchema(String location, Job job) throws IOException {
        // I'd actually cache this result if the schema is fixed
        return new ResourceSchema(schema);
    }

    This should take care of the schema, except you'd probably also need to
    serialize it to the back-end so that you can enforce the schema inside
    'getNext'.

    stan

    P.S. The above is essentially pseudo-code; I haven't actually type-checked it.
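
Stan's sketch assumes the schema has already been parsed into 'map'. As a
minimal sketch of that missing step, assuming a hypothetical plain-text meta
file with one "name:type" line per column, the parsing might look roughly like
this. The file format, the '#' comment convention, and the class name
SchemaFileParser are all assumptions made for illustration; none of them is
something Pig defines.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: turns "name:type" lines into an ordered name-to-type map. */
class SchemaFileParser {

    /**
     * Parses one field per line in the assumed form "name:type".
     * Blank lines and lines starting with '#' are skipped.
     * A LinkedHashMap keeps fields in file order, i.e. column order.
     */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int colon = line.indexOf(':');
            String name = colon < 0 ? "" : line.substring(0, colon).trim();
            String type = colon < 0 ? "" : line.substring(colon + 1).trim();
            if (name.isEmpty() || type.isEmpty()) {
                throw new IllegalArgumentException("Expected name:type, got: " + raw);
            }
            if (map.put(name, type) != null) {
                throw new IllegalArgumentException("Duplicate field: " + name);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        // In practice the lines would come from the meta file,
        // e.g. Files.readAllLines(Paths.get(pathToMetaFile)).
        Map<String, String> map = parse(Arrays.asList(
                "# student data", "id:int", "name:string", "marks:int"));
        System.out.println(map);
    }
}
```

The resulting map can then be fed to the loop that builds the Pig Schema. The
ordered map matters because the position of each field in the schema has to
match the position of the column in the data.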
  • Praveenesh kumar at Feb 4, 2012 at 6:49 am
    Thanks Stan,
    This is a great help! I'll try to implement it. :-)

    Praveenesh
  • Dmitriy Ryaboy at Feb 6, 2012 at 6:26 am
    It's pretty straightforward; that's why the LoadMetadata interface exists.
    You just have to implement it and translate however you store the schema to
    a Pig Schema object.

    PigStorageSchema will read a JSON file that describes the schema; you can
    look at how that's done there (actually, PigStorage itself will do that in
    trunk).

    You can also check out what the Elephant-Bird library does for loading
    protocol buffers and thrift objects, where schema is derived from the
    object itself.

    -Dmitriy
  • Praveenesh kumar at Feb 6, 2012 at 6:36 am
    Thanks,
    I was also looking at the -schema option in PigStorage.
    Can anyone explain how to define that JSON schema file?
    A tutorial or small example would be very helpful.

    Praveenesh
  • Dmitriy Ryaboy at Feb 6, 2012 at 6:49 am
    It's a json serialization of the Pig schema object, and isn't really meant
    to be created by hand.
    Patches to make it more human-friendly would be quite welcome.

    D
  • Praveenesh kumar at Feb 6, 2012 at 7:00 am
    OK, so how can I make use of the -schema option with PigStorage?

    Suppose my JSON schema is:

    {
      "name": "Student_Data",
      "properties": {
        "id": {
          "type": "INTEGER",
          "description": "Student id"
        },
        "name": {
          "type": "CHARARRAY",
          "description": "Name of the student"
        },
        "marks": {
          "type": "INTEGER",
          "description": "Marks of the student"
        }
      }
    }

    I tried to write the above schema using Pig data types. Can I use it, or is
    there a different way to use the "-schema" option? The documentation says:
    "-schema: Reads/Stores the schema of the relation using a hidden JSON file."

    Or is there some other way to pass a schema defined in a separate
    plain-text file and read it using PigStorage?

    Thanks,
    Praveenesh

  • Dmitriy Ryaboy at Feb 6, 2012 at 9:12 am
    It reads the schema file *it creates*. So, you process some data, store
    it, then read it back later, and the schema is back.
    Like I said, the JSON is not very human-readable -- the types are integers
    rather than words like "chararray", etc.
    Try saving something and check out the .pig_schema file to see an example.

    D
  • Praveenesh kumar at Feb 6, 2012 at 9:17 am
    Yeah, I tried that.
    Here's what I get for a small sample dataset:

    {
      "fields": [
        {"name":"name","type":55,"description":"autogenerated from Pig Field Schema","schema":null},
        {"name":"age","type":10,"description":"autogenerated from Pig Field Schema","schema":null},
        {"name":"gpa","type":20,"description":"autogenerated from Pig Field Schema","schema":null}
      ],
      "version":0,
      "sortKeys":[],
      "sortKeyOrders":[]
    }

    I am looking to see whether I can decode this format, define my own
    schema this way, and use it in a Pig loader function.

    Thanks,
    Praveenesh
  • Dmitriy Ryaboy at Feb 6, 2012 at 9:20 pm
    The integer values for types come from org.apache.pig.data.DataType
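
Read against org.apache.pig.data.DataType, the sample file Praveenesh posted
decodes as name = chararray (55), age = int (10), gpa = float (20). A rough
sketch of going the other way, producing JSON in the same layout from
name/type pairs, could look like the Java below. Treat it as an illustration
under assumptions: the class name PigSchemaJsonWriter is hypothetical, the
numeric codes other than 55, 10 and 20 are written from memory of DataType and
should be checked against your Pig version, and nothing in this thread
confirms that PigStorage will accept a hand-written .pig_schema file.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper that renders name/type pairs in the .pig_schema layout shown in the thread. */
class PigSchemaJsonWriter {

    /**
     * Maps a type name to its numeric code. The values are assumed to match
     * the constants in org.apache.pig.data.DataType; verify before relying on them.
     */
    static int typeCode(String typeName) {
        switch (typeName.toLowerCase(Locale.ROOT)) {
            case "boolean":   return 5;
            case "int":       return 10;
            case "long":      return 15;
            case "float":     return 20;
            case "double":    return 25;
            case "bytearray": return 50;
            case "string":
            case "chararray": return 55;
            default:
                throw new IllegalArgumentException("Unsupported type: " + typeName);
        }
    }

    /** Builds the JSON text; field names must be plain identifiers, so no escaping is needed. */
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{\"fields\":[");
        boolean first = true;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String name = field.getKey();
            if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
                throw new IllegalArgumentException("Unexpected field name: " + name);
            }
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append("{\"name\":\"").append(name)
              .append("\",\"type\":").append(typeCode(field.getValue()))
              .append(",\"description\":\"autogenerated from Pig Field Schema\"")
              .append(",\"schema\":null}");
        }
        return sb.append("],\"version\":0,\"sortKeys\":[],\"sortKeyOrders\":[]}").toString();
    }

    public static void main(String[] args) {
        // Insertion order is column order, so use a LinkedHashMap.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", "chararray");
        fields.put("age", "int");
        fields.put("gpa", "float");
        System.out.println(toJson(fields));
    }
}
```

Only flat schemas of simple types are covered here; nested tuples, bags and
maps would need the "schema" member filled in, which this sketch leaves as
null.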

Discussion Overview
group: user @
categories: pig, hadoop
posted: Feb 3, '12 at 12:35p
active: Feb 6, '12 at 9:20p
posts: 14
users: 3
website: pig.apache.org
