FAQ
Is it possible to do a query like the following:
INSERT OVERWRITE TABLE table1 PARTITION(...)
FROM table2
SELECT TRANSFORM(table2.col1, table2.col2, ...) USING '/my/script' AS (col1,
col2, ...)
WHERE (...)

I can run the SELECT TRANSFORM part of the query by itself just fine, and
I get the results I expect.

When I try to do the insert as well, I get errors about column type
mismatches, even though my script outputs 3 columns with exactly the same
types, in exactly the same order, as they appear in table1. I tried this
with both a mapper and a reducer, similar to what was shown in the
ApacheCon slides, and it still didn't work. Am I doing something wrong
query-wise?
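
(For context: the executable named after USING is an arbitrary program that
reads rows on stdin and writes rows on stdout, tab-delimited by default. A
minimal sketch of such a pass-through script follows; the behavior shown is
illustrative, not the actual '/my/script' from the question:)

#!/usr/bin/env python
# Minimal sketch of a Hive TRANSFORM script. Hive pipes each input row to
# stdin as tab-separated columns; every line printed to stdout becomes one
# output row, split on tabs into the AS (...) columns. This version passes
# the columns through unchanged, like '/bin/cat'.
import sys

for line in sys.stdin:
    cols = line.rstrip("\n").split("\t")
    print("\t".join(cols))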

I'm using the 0.19 release.

Josh Ferguson


  • Josh Ferguson at Jan 10, 2009 at 8:17 pm
    I want to follow up on this a little; here are the schemas for the
    source and destination tables, and the query I am trying to run.
    Source table:

    hive> DESCRIBE EXTENDED users;

    OK
    occurred_at int
    id string
    properties map<string,string>
    account string
    application string
    dataset string
    hour int
    Detailed Table Information:
    Table(tableName:users,dbName:default,owner:Josh,createTime:1231485489,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:occurred_at,type:int,comment:null),
    FieldSchema(name:id,type:string,comment:null),
    FieldSchema(name:properties,type:map<string,string>,comment:null)],location:/user/hive/warehouse/users,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:32,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe,parameters:{colelction.delim=44,mapkey.delim=58,serialization.format=org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol}),bucketCols:[id],sortCols:[],parameters:{}),partitionKeys:[FieldSchema(name:account,type:string,comment:null),
    FieldSchema(name:application,type:string,comment:null),
    FieldSchema(name:dataset,type:string,comment:null),
    FieldSchema(name:hour,type:int,comment:null)],parameters:{})


    Destination table:

    hive> DESCRIBE EXTENDED distinct_users;
    OK
    occurred_at int
    id string
    properties map<string,string>
    account string
    application string
    dataset string
    hour int
    Detailed Table Information:
    Table(tableName:distinct_users,dbName:default,owner:Josh,createTime:1231488500,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:occurred_at,type:int,comment:null),
    FieldSchema(name:id,type:string,comment:null),
    FieldSchema(name:properties,type:map<string,string>,comment:null)],location:/user/hive/warehouse/distinct_users,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:32,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe,parameters:{colelction.delim=44,mapkey.delim=58,serialization.format=org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol}),bucketCols:[id],sortCols:[],parameters:{}),partitionKeys:[FieldSchema(name:account,type:string,comment:null),
    FieldSchema(name:application,type:string,comment:null),
    FieldSchema(name:dataset,type:string,comment:null),
    FieldSchema(name:hour,type:int,comment:null)],parameters:{})
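
    (A small aside on reading the serde parameters above: colelction.delim=44
    and mapkey.delim=58 are decimal ASCII codes, so map entries are separated
    by ',' and keys from values by ':'. "colelction" is Hive's own misspelled
    parameter name, not a transcription error here. In Python terms:)

    assert chr(44) == ","   # colelction.delim: separates map entries
    assert chr(58) == ":"   # mapkey.delim: separates a key from its value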

    The query:

    hive> INSERT OVERWRITE TABLE distinct_users SELECT
    TRANSFORM(users.occurred_at, users.id, users.properties) USING '/bin/cat' AS
    (occurred_at, id, properties) FROM users;
    FAILED: Error in semantic analysis: line 1:23 Cannot insert into target
    table because column number/types are different distinct_users: Cannot
    convert column 2 from string to map<string,string>.

    I'm really confused, because the two tables are identical except for
    their names, and I'm just trying to insert from one of them into the
    other using a script.

    For reference, this appears to work:

    hive> INSERT OVERWRITE TABLE distinct_users SELECT occurred_at, id,
    properties FROM users;

    What is it about transforming that is messing up the semantic analysis?

    Josh Ferguson
  • Josh Ferguson at Jan 11, 2009 at 6:25 am
    One more small update: it seems that transform doesn't work at all for
    inserting into columns of type MAP<X,Y>. I suspect this is because the
    semantic analyzer treats every column coming out of a custom map phase
    as type 'STRING' and then complains when it can't convert that assumed
    type into the required type, which is MAP<STRING,STRING> in this case.
    Is this correct? Is anyone else using a MAP type with custom map or
    reduce scripts? What queries have you gotten to work?
    Josh
  • Zheng Shao at Jan 11, 2009 at 6:46 am
    Hi Josh,

    Yes, the transform assumes every output column will be a string.
    And if the input of the transform is not a string, it will be converted
    to a string. But we don't have a mechanism in the language to convert a
    string back to map<string,string>. What do you think we should do to
    support that?
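
    (To make the flattening concrete: with the tables above, the map column
    reaches the script already serialized using the table's delimiters, and
    whatever the script prints comes back to Hive as plain strings. A hedged
    sketch, assuming ',' between entries and ':' between key and value as in
    the DESCRIBE output, and a pass-through transformation:)

    #!/usr/bin/env python
    # Sketch of what a transform script sees for (occurred_at, id, properties).
    # The map arrives as one string field, e.g. "k1:v1,k2:v2". We can parse
    # it, but anything we print is still typed STRING by the semantic
    # analyzer, which is exactly why the INSERT above fails.
    import sys

    for line in sys.stdin:
        occurred_at, user_id, props = line.rstrip("\n").split("\t")
        entries = dict(kv.split(":", 1) for kv in props.split(",") if ":" in kv)
        # Re-serialize with the same delimiters before emitting.
        props_out = ",".join("%s:%s" % (k, v) for k, v in entries.items())
        print("\t".join([occurred_at, user_id, props_out]))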

    --
    Yours,
    Zheng
  • Josh Ferguson at Jan 11, 2009 at 7:23 am
    My initial assumption when I wrote the query was that it would use the
    same delimiters I defined in the schema definition of the target table.
    That led to my confusion, because I thought Hive had enough information
    (in the schema) to do a proper string -> map<x,x> data conversion.

    Maybe something like that could work?

    Josh F.
  • Zheng Shao at Jan 11, 2009 at 7:38 am
    I don't think it's a good idea to rely on the information from the
    table: the data might go through multiple processing steps before it
    reaches the final destination table, and the destination table may
    store the data in any way (it may not even be delimited).


    What about allowing some syntax like this:

    SELECT TRANSFORM(myint, mymap)
        ROW FORMAT DELIMITED KEY TERMINATED BY '3' COLLECTION ITEM TERMINATED BY '2'
    USING '/bin/cat'
    AS (myint INT, mymap MAP<STRING,STRING>)
        ROW FORMAT DELIMITED KEY TERMINATED BY '3' COLLECTION ITEM TERMINATED BY '2'

    The first ROW FORMAT describes the input format for the script, and the
    second describes the output format of the script.
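
    (If syntax like this were adopted — reading '3' and '2' as the ASCII
    control codes \003 and \002, Hive's default map-key and collection-item
    delimiters — a script written against it might look like the sketch
    below; the doubling of myint is just an arbitrary example transformation:)

    #!/usr/bin/env python
    # Hedged sketch against the proposed ROW FORMAT: fields are tab-
    # separated, and the map field uses \x02 between entries and \x03
    # between a key and its value.
    import sys

    COLLECTION_DELIM = "\x02"
    MAPKEY_DELIM = "\x03"

    for line in sys.stdin:
        myint, mymap = line.rstrip("\n").split("\t")
        pairs = [kv.split(MAPKEY_DELIM, 1)
                 for kv in mymap.split(COLLECTION_DELIM) if MAPKEY_DELIM in kv]
        out_map = COLLECTION_DELIM.join("%s%s%s" % (k, MAPKEY_DELIM, v)
                                        for k, v in pairs)
        print("\t".join([str(int(myint) * 2), out_map]))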


    --
    Yours,
    Zheng
  • Josh Ferguson at Jan 11, 2009 at 7:55 am
    I don't know when I'd use the input formatting (my script would have to
    understand special formatting information), but the output formatting
    would be useful, as long as the semantic analyzer can take it into
    account when figuring things out. :)

    Josh
  • Josh Ferguson at Jan 12, 2009 at 1:57 am
    I just want to note that, conceptually, using a transform like this is
    a series of 3 steps:

    1) dump
    2) some stuff in the middle the semantic analyzer shouldn't care about
    3) load

    I'm not really sure what the difference is between what is actually
    happening and those steps, but I know it feels wrong to have to specify
    my delimiter information in multiple places and in every script I want
    to run that inserts data into a table containing a column with a MAP or
    a LIST type.

    Maybe Hive isn't set up like this right now, but I think all the
    information you need to do any of those steps (even repeatedly) is
    already available somewhere long before this query is ever run, so I
    should be able to use that information and not have to specify it again
    every time I want to run a query like this.

    Maybe there is something I'm missing? What was the use case for the
    "multiple processing steps" that you mentioned in your last email?

    Josh F.
