This is Impala 1.0, Hive 0.10 (CDH 4.2.1).

It looks like Hive will do schema resolution nicely, e.g. if we start out
with a table like:

hive> CREATE TABLE test_avro
   ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
   STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
   TBLPROPERTIES (
     'avro.schema.literal'='{"name":"test_record",
                             "type":"record",
                             "fields": [
                               {"name":"test", "type":"string"}]}');

hive> LOAD DATA LOCAL INPATH 'test1.avro' OVERWRITE INTO TABLE test_avro;
hive> SELECT * FROM test_avro;
OK
X4250
X7594
X7968
X2843
X6344
X419
X956
X6603
X6842
X5522
Time taken: 0.082 seconds

Now the same thing in Impala:
select * from test_avro;
Query: select * from test_avro
Query finished, fetching results ...
+-------+
| test  |
+-------+
| X4250 |
| X7594 |
| X7968 |
| X2843 |
| X6344 |
| X419  |
| X956  |
| X6603 |
| X6842 |
| X5522 |
+-------+
Returned 10 row(s) in 0.27s

All is well. Now we alter the schema to add another column with a default:

hive> ALTER TABLE test_avro SET TBLPROPERTIES (
     'avro.schema.literal'='{"name":"test_record",
                             "type":"record",
                             "fields": [
                               {"name":"test", "type":"string"},
                               {"name":"new", "type":"string",
                                "default":"nothing"}]}');

hive> SELECT * FROM test_avro;
OK
X4250 nothing
X7594 nothing
X7968 nothing
X2843 nothing
X6344 nothing
X419 nothing
X956 nothing
X6603 nothing
X6842 nothing
X5522 nothing
Time taken: 0.081 seconds
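
For reference, the behaviour Hive shows here matches the Avro schema-resolution rule for records: a field that is in the reader's (table) schema but not in the writer's (file) schema takes the reader's default. Below is a minimal pure-Ruby sketch of that single rule. It is not the avro gem's or Hive's actual code; `resolve_record` is a hypothetical helper and it only handles flat records.

```ruby
require 'json'

# Reader (table) schema, as set by the ALTER TABLE above.
READER_SCHEMA = JSON.parse(<<~JSON)
  {"name":"test_record",
   "type":"record",
   "fields": [
     {"name":"test", "type":"string"},
     {"name":"new",  "type":"string", "default":"nothing"}]}
JSON

# Hypothetical helper: resolve one record written with an older schema
# against the reader schema. Only the "missing field takes the reader's
# default" rule is covered; fields unknown to the reader are dropped.
def resolve_record(written, reader_schema)
  reader_schema['fields'].each_with_object({}) do |field, out|
    name = field['name']
    if written.key?(name)
      out[name] = written[name]
    elsif field.key?('default')
      out[name] = field['default']
    else
      raise ArgumentError, "no value and no default for field '#{name}'"
    end
  end
end

old_row  = { 'test' => 'X4250' }   # shape of a row stored in test1.avro
resolved = resolve_record(old_row, READER_SCHEMA)
puts resolved.values.join(' ')     # prints: X4250 nothing
```

A reader that skips this step has no value to put in the `new` column for rows written before the schema change.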

Pretty awesome! Now let's try Impala:
refresh test_avro;
Successfully refreshed table: default.test_avro
select * from test_avro;
Query: select * from test_avro
Query finished, fetching results ...

Returned 0 row(s) in 0.29s

:(

Is this the expected behavior, a bug, or something not yet implemented?

P.S. The test1.avro file was created using the following Ruby code:

require 'rubygems'
require 'avro'

FILE = 'test1.avro'

# Writer schema: a record with a single string field, "test".
schema = Avro::Schema.parse('{"name":"test_record", ' +
                            ' "type":"record", ' +
                            ' "fields": [' +
                            ' {"name":"test", "type":"string"}]}')

writer = Avro::IO::DatumWriter.new(schema)
file = File.open(FILE, 'wb')
dw = Avro::DataFile::Writer.new(file, writer, schema)

# Write ten records with random values such as "X4250".
10.times do
  dw << {'test' => "X#{rand(10000)}"}
end
dw.flush
dw.close


Thanks!

Grisha


  • Cng1067 at Jun 10, 2013 at 8:27 am
    From what I remember, doing select * doesn't work; you need to select some
    actual columns.
    On Saturday, June 8, 2013 2:00:15 AM UTC, Grisha Trubetskoy wrote:

  • Gregory Trubetskoy at Jun 10, 2013 at 2:48 pm
    Verified - same problem with columns selected as well.

    The real bug is that nothing indicates that something went wrong. If
    Impala isn't doing schema resolution, it should error out. Currently it
    silently skips rows.
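
To illustrate the "error out" behaviour argued for here: a reader that does no schema resolution could still compare the schema embedded in the data file with the table schema and refuse to scan when they differ. The Ruby sketch below shows only that idea. `check_schemas!` and `SchemaMismatch` are hypothetical names, it compares only top-level field names and types, and it says nothing about how Impala's scanner is actually written.

```ruby
require 'json'

class SchemaMismatch < StandardError; end

# Hypothetical guard for a scanner without schema resolution: raise
# instead of returning zero rows when the file (writer) schema and the
# table (reader) schema do not list the same fields in the same order.
def check_schemas!(writer_schema, reader_schema)
  writer_fields = writer_schema['fields'].map { |f| [f['name'], f['type']] }
  reader_fields = reader_schema['fields'].map { |f| [f['name'], f['type']] }
  return true if writer_fields == reader_fields

  table_only = (reader_fields - writer_fields).map(&:first)
  file_only  = (writer_fields - reader_fields).map(&:first)
  raise SchemaMismatch,
        "file and table schemas differ " \
        "(table-only fields: #{table_only.join(', ')}; " \
        "file-only fields: #{file_only.join(', ')})"
end

# The two schemas from this thread: the file was written with the first,
# the table was altered to the second.
file_schema = JSON.parse(
  '{"name":"test_record","type":"record",' \
  '"fields":[{"name":"test","type":"string"}]}'
)
table_schema = JSON.parse(
  '{"name":"test_record","type":"record",' \
  '"fields":[{"name":"test","type":"string"},' \
  '{"name":"new","type":"string","default":"nothing"}]}'
)

begin
  check_schemas!(file_schema, table_schema)
rescue SchemaMismatch => e
  puts "ERROR: #{e.message}"
end
```

With a check like this, the query in the original post would fail with a message naming the `new` field rather than returning 0 rows.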

    Grisha


  • Bewang Tech at Jun 11, 2013 at 7:33 pm
    The Impala backend (BE) uses its own Avro scanner, which doesn't use
    Avro's ResolvingDecoder. I filed a ticket about this:
    https://issues.cloudera.org/browse/IMPALA-401.

    I'm waiting for this feature too.

Discussion Overview
group: impala-user @
categories: hadoop
posted: Jun 8, '13 at 2:00a
active: Jun 11, '13 at 7:33p
posts: 4
users: 4
website: cloudera.com
irc: #hadoop
