Grokbase Groups Hive user April 2010
FAQ
Hi,

I currently run a MapReduce job to rewrite a tab delimited file, and then I
use Hive for everything after that stage.

Am I correct in thinking that I can create a Jar with my own method which
can then be called in SQL?

Would the syntax be:

hive> ADD JAR /tmp/parse.jar;
hive> INSERT OVERWRITE TABLE target SELECT s.id,
s.canonical, parsedName FROM source s MAP s.canonical using 'parse' as
parsedName;

and 'parse' be an MR job? If so, what are the input and output formats for
'parse'? Or is it perhaps a class implementing an interface, with Hive taking
care of the rest?

Thanks for any pointers,
Tim


  • Edward Capriolo at Apr 27, 2010 at 3:12 pm

    Tim,

    A UDF is a SQL function, like toString() or max().
    An InputFormat teaches Hive to read data from key/value files.
    A SerDe tells Hive how to parse input data into columns.
    Finally, the map(), reduce(), and transform() keywords you described are a
    way to pipe data to an external process and read the results back in.
    Almost like a UDF that is non-native to Hive.

    So you have munged four concepts together :) Do not feel bad, however; I
    struggled through an InputFormat for the last month.

    It sounds most like you want a UDF that takes a string and returns a
    canonical representation.


    hive> ADD JAR /tmp/parse.jar;
    hive> CREATE TEMPORARY FUNCTION canonical AS 'my.package.canonical';
    hive> SELECT canonical(my_column) FROM source;

    Regards,
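
    For reference, a minimal UDF along these lines might look like the sketch
    below. The package, class name, and parsing logic are placeholders rather
    than code from the thread (and note 'my.package' itself cannot be a real
    Java package name, since package is a reserved word, so the sketch uses a
    hypothetical my.udfs):

    package my.udfs;

    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    @Description(name = "canonical",
        value = "_FUNC_(str) - returns a canonical form of str")
    public final class Canonical extends UDF {
      // Hive locates evaluate() by reflection: Text in, Text out.
      public Text evaluate(Text source) {
        if (source == null) {
          return null;  // pass NULLs through untouched
        }
        // Placeholder canonicalization; substitute the real parsing logic.
        return new Text(source.toString().trim().toLowerCase());
      }
    }

    With that package name, the registration would be CREATE TEMPORARY FUNCTION
    canonical AS 'my.udfs.Canonical'; after the ADD JAR.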
  • Tim Robertson at Apr 27, 2010 at 5:14 pm
    Thanks Edward,

    You are indeed correct - I am confused!

    So I checked out the source and poked around. If I were to extend UDF and
    implement public Text evaluate(Text source), would I be heading along the
    right lines for what you describe above?

    Thanks,
    Tim


  • Adam O'Donnell at Apr 27, 2010 at 5:37 pm
    It sounds like what you want is a custom SerDe. I have tried to write
    one but ran into some difficulty.

    --
    Adam J. O'Donnell, Ph.D.
    Immunet Corporation
    Cell: +1 (267) 251-0070
  • Tim Robertson at Apr 27, 2010 at 5:48 pm
    Hmmm... I am not trying to serialize or deserialize custom content, but
    simply to take an input String (Text), run some Java, and return a new Text
    by calling a function.

    Looking at public class UDFYear extends UDF, the annotation at the top
    suggests that extending UDF and adding the annotation might be enough.

    I'll try it anyways...
    Tim
  • Edward Capriolo at Apr 27, 2010 at 5:53 pm

    Tim,

    I think you are on the right track with the UDF approach.

    You could accomplish something similar with a SerDe, except from the client
    perspective it would be more "transparent".

    A UDF is a bit more reusable than a SerDe. You can only choose a SerDe once,
    when the table is created, but a UDF is applied on the result set.

    Edward
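
    To make that contrast concrete, a rough HiveQL sketch (the SerDe class and
    table names here are hypothetical):

    -- A SerDe is chosen once, when the table is defined:
    CREATE TABLE names (raw STRING)
      ROW FORMAT SERDE 'my.udfs.NameSerDe';

    -- A UDF can be applied to any query's result set, ad hoc:
    SELECT canonical(s.canonical) FROM source s;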
  • Tim Robertson at Apr 27, 2010 at 5:56 pm
    Thanks Edward,

    I get where you are coming from now with that explanation.

    Cheers,
    Tim

  • Avram Aelony at Apr 27, 2010 at 6:06 pm
    Hi -

    If you would like to "simply take an input String (Text), run some Java, and return a new (Text) by calling a function", then you may wish to consider using the "map" and "reduce" keywords directly from Hive, with a scripting language like Perl containing your mapper and reducer code.

    for example:

    create external table some_input_table ( field_1 string ) row format (etc...);
    create table your_next_table ( output_field_1 string, output_field_2 string, output_field_3 string );


    from (
    from some_input_table i
    map i.field_1 using 'some_custom_mapper_code.pl' ) mapper_output
    insert overwrite table your_next_table
    reduce mapper_output.* using 'some_custom_reducer_code.pl' as output_field_1, output_field_2, output_field_3
    ;

    --test it
    select * from your_next_table ;

    Hope that helps.

    cheers,
    Avram





  • Tim Robertson at Apr 28, 2010 at 8:12 am
    OK, so it turns out I overlooked some things in my current MR job with
    configure(), and a UDF isn't enough.

    I do want to use the Hive Map keyword and call my own MR map().

    Currently my map() looks like the following, which works on a tab-delimited
    input file:

    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> collector, Reporter reporter)
        throws IOException {
      Pattern tab = Pattern.compile("\t");
      String[] atoms = tab.split(value.toString());
      String parsed = myParseFunction(atoms);
      collector.collect(new Text(parsed), new Text(atoms[0]));
    }

    What would I need to implement to make this usable with the MAP keyword in
    Hive, please, so I can run it with input from table 1 to populate table 2?

    Sorry for this confusion, but it is not really clear to me - all help is
    very gratefully received.

    Cheers
    Tim



  • Carl Steinbach at Apr 28, 2010 at 9:00 am
    Hi Tim,

    Larry Ogrodnek has a nice blog post describing how to use Hive's
    TRANSFORM/MAP/REDUCE syntax with Java code here:
    http://dev.bizo.com/2009/10/hive-map-reduce-in-java.html

    A version of the library he describes in the blog post has been added to
    Hive's contrib directory, along with some examples and illustrative test
    cases. Check out the following files:

    src/java/org/apache/hadoop/hive/contrib/mr/example/IdentityMapper.java
    src/java/org/apache/hadoop/hive/contrib/mr/example/WordCountReduce.java
    src/java/org/apache/hadoop/hive/contrib/mr/GenericMR.java
    src/java/org/apache/hadoop/hive/contrib/mr/Mapper.java
    src/java/org/apache/hadoop/hive/contrib/mr/Output.java
    src/java/org/apache/hadoop/hive/contrib/mr/Reducer.java
    src/test/queries/clientpositive/java_mr_example.q

    Hope this helps.

    Carl
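
    For illustration, here is a sketch of Tim's mapper re-expressed against the
    contrib API Carl points to (interfaces as in the GenericMR files above; the
    class name and myParseFunction are placeholders from Tim's description):

    import org.apache.hadoop.hive.contrib.mr.GenericMR;
    import org.apache.hadoop.hive.contrib.mr.Mapper;
    import org.apache.hadoop.hive.contrib.mr.Output;

    public final class ParseMapper {
      public static void main(final String[] args) throws Exception {
        // GenericMR reads tab-separated rows from stdin and writes
        // tab-separated rows to stdout, which is how Hive's MAP ... USING
        // exchanges data with a child process.
        new GenericMR().map(System.in, System.out, new Mapper() {
          public void map(final String[] record, final Output output)
              throws Exception {
            // Mirror Tim's mapper: emit the parsed value plus the first field.
            output.collect(new String[] { myParseFunction(record), record[0] });
          }
        });
      }

      // Placeholder for Tim's real parsing logic.
      private static String myParseFunction(final String[] atoms) {
        return atoms[0];
      }
    }

    From Hive it would then be invoked along the lines of MAP s.canonical USING
    'java -cp parse.jar ParseMapper' AS parsedName, id (exact classpath handling
    depends on how the jar is shipped to the cluster).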
  • Tim Robertson at Apr 28, 2010 at 9:06 am
    Thanks Carl - that looks ideal for this.

  • Edward Capriolo at Apr 28, 2010 at 2:17 pm

    Based on your description, this is now looking more like a SerDe. I say this
    because map(), reduce(), and transform() are almost identical to a UDF, with
    the exception that they can return multiple columns more easily.

    Let's look at an example.

    RawFile

    1\t6,7,8

    By default Hive uses a TextInputFormat: it reads the key as "1" and the
    value as "6,7,8".

    The default OutputFormat is HiveIgnoreKeyTextOutputFormat, so Hive drops
    the "1", leaving "6,7,8".

    Now the SerDe takes over. Based on the delimiter you specified when you
    created the table, the SerDe attempts to split the row.

    If I chose the delimiter ',', Hive would split the row into
    ['6','7','8'].
    If I chose the wrong delimiter, say '^K' or something wacky, Hive would not
    split the row at all, and you would get a single field back:
    ['6,7,8']

    So you can choose the 'wrong' delimiter as Carl suggests above; essentially
    this turns the row into a single string, and you are then free to operate on
    it as one. You can use a UDF that works with a single string, or use the
    map() syntax, which can split the string and return multiple columns.

    Either approach is valid and will yield results, but treating a row as a
    single string is anti-pattern-ish, since you ideally want Hive to understand
    the schema of the row.
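
    In HiveQL, the two delimiter choices in this example look roughly like the
    following (table names hypothetical; '\013' is the octal escape for ^K,
    ASCII 11):

    -- Delimiter ',': the row "6,7,8" splits into three columns.
    CREATE TABLE split_row (a STRING, b STRING, c STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- A delimiter that never occurs in the data: the whole row
    -- lands in a single string column.
    CREATE TABLE whole_row (line STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\013';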
  • Edward Capriolo at Apr 28, 2010 at 2:32 pm
    Just wanted to follow up a bit. One benefit of Carl's approach is that
    map(), reduce(), and transform() are definitely easier to try. Making a
    SerDe takes a while: you have to learn the framework, write some code, alter
    your current table definition, etc. map(), reduce(), and transform() can be
    done without writing Java code to extend Hive.
  • Tim Robertson at Apr 28, 2010 at 2:55 pm
    Thanks Edward. As I venture further into Hive I will want it to understand
    the schema of the row (especially for numeric types), so I will put the
    effort in now to understand SerDes.

    For context...
    We have a 200 million row MySQL table which we currently scan in batches and
    process (parsing, georeferencing, etc.) into another 200 million row "fact"
    table which serves the web. I am moving this to Hive processing, with really
    amazing results so far. Sqoop -> Hive will be integrated soon. Following
    that, the source table will be ported to HBase, with Hive performing this
    processing on HBase input and then Sqoop pushing the processed results to
    MySQL / PostGIS DBs (and probably some Lucene). If I can get this SerDe
    working (as I find time) it is really going to give us a clean processing
    workflow.

    More questions to follow I am sure.

    Tim



