Hi there.

I would just like to double check that my understanding of hive/impala is
correct (I am most likely missing something)

hive: I have created a new table and inserted some data (via Hive's LOAD
DATA).
hive: Query data -> all there.

impala-shell: I execute "refresh" -> my query returns all the results
just added.
impala-shell: Query data -> all there.

hive: Insert more data (again via LOAD DATA).
hive: Query data -> all there.

impala-shell: Query data -> new data is not returned in the query (only
the previous result set is shown).
impala-shell: I execute "refresh".
impala-shell: Query data -> all there.


This is where my understanding fails me:

- It's documented that Impala needs a refresh when the Hive metadata
changes. I assumed this meant adding tables, modifying tables, etc., not
inserting new data into existing tables?
- Does one need to run refresh in Impala to pick up new data added to
existing tables via Hive?
- Must one insert new data via the Impala APIs to make Impala aware of
it?

The expected behavior could very well be that a refresh is required (my
understanding of "metadata" being incorrect in this context or something
else).

Clarification by anyone would be much appreciated.

Kind Regards
Stephan Kotze


  • Marcel Kornacker at Mar 1, 2013 at 6:22 pm
    On Fri, Mar 1, 2013 at 9:00 AM, Stephan Kotze wrote:
    > It's documented that impala needs refresh when the hive metadata changes, I
    > assumed this would be adding tables, modifying tables etc. not insertion of
    > new data into existing tables?

    That's correct.

    > Does one need to run refresh in impala to pick up new data added to existing
    > tables via hive?

    No, that's not necessary.

    > Must one insert new data via the impala api's to make impala aware of it?

    That is also not necessary.
  • Stephan Kotze at Mar 5, 2013 at 10:34 am
    Thanks for the replies.

    Indeed, it's reproducible (it's the only behavior I'm currently
    experiencing).

    Basically,

    1) I downloaded the Impala VM image from Cloudera
    (https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Impala+Demo+VM#Cloudera%27sImpalaDemoVM-DemoVMwareImage)
    2) Fired the box up.
    3) Created 2 tables as per:
    https://ccp.cloudera.com/display/IMPALA10BETADOC/Learning+Impala+Tutorial
    4) Refreshed from an Impala shell
    5) Loaded some data

    The behavior as per my original email persists.

    The table isn't partitioned.

    Any other info I can provide or places I can look to find out why my impala
    is behaving oddly?

    Stephan



    On Sun, Mar 3, 2013 at 10:15 PM, Eric Sammer wrote:

    One other question: When you load data, are you creating a new partition
    or simply adding to an existing partition? If it's the former, this
    behavior makes sense as Impala would need to know about the newly created
    partitions (which are tracked explicitly in the metadata). Instead, if
    you're just creating new files (and it's reproducible, as Marcel asked),
    it's likely to be a bug.

    On Sun, Mar 3, 2013 at 1:17 PM, Marcel Kornacker wrote:

    On Fri, Mar 1, 2013 at 9:00 AM, Stephan Kotze <stephanus.kotze@gmail.com>
    wrote:
    impala-shell: Query data -> new data is not returned in the query (only the
    result set is shown)
    Sorry, I missed this part. Is this a reproducible problem?


    --
    Eric Sammer
    twitter: esammer
    data: www.cloudera.com
  • Stephan Kotze at Mar 8, 2013 at 11:23 am
    Hi, sure.

    Sorry for not getting back to these quickly enough.

    Anyways:

    1) in the hive shell execute:
    ____________
    create database tpcds;
    use tpcds;

    create external table customer
    (
    c_customer_sk int,
    c_customer_id string,
    c_current_cdemo_sk int,
    c_current_hdemo_sk int,
    c_current_addr_sk int,
    c_first_shipto_date_sk int,
    c_first_sales_date_sk int,
    c_salutation string,
    c_first_name string,
    c_last_name string,
    c_preferred_cust_flag string,
    c_birth_day int,
    c_birth_month int,
    c_birth_year int,
    c_birth_country string,
    c_login string,
    c_email_address string,
    c_last_review_date string
    )
    row format delimited fields terminated by '|'
    location '/hive/warehouse/tpcds.db/customer';

    create external table customer_address
    (
    ca_address_sk int,
    ca_address_id string,
    ca_street_number string,
    ca_street_name string,
    ca_street_type string,
    ca_suite_number string,
    ca_city string,
    ca_county string,
    ca_state string,
    ca_zip string,
    ca_country string,
    ca_gmt_offset float,
    ca_location_type string
    )
    row format delimited fields terminated by '|'
    location '/hive/warehouse/tpcds.db/customer_address';
    ____________

    2) in the hive shell execute:
    ____________
    load data local inpath '/tmp/test' into TABLE customer;
    select * from customer;
    ____________
    The expected data is visible;

    3) in impala-shell execute:
    ____________
    connect localhost;
    refresh;
    use tpcds;
    select * from customer;
    ____________
    The expected data is visible;

    4) in the hive shell execute:
    ____________
    load data local inpath '/tmp/test2' into TABLE customer;
    select * from customer;
    ____________
    The expected data is visible;

    5) in impala-shell execute:
    ____________
    select * from customer;
    ____________
    Only one row is visible.

    6) in impala-shell execute:
    ____________
    refresh;
    select * from customer;
    ____________
    All data is now shown.


    Need anything else?

    Regards
    Stephan

    PS. Contents of the two test files:
    test
    1|cust1|1|1|1|1|1|Mr|Stephan|Kotze|Y|2|2|1900|SA|skotze|Stephan.Kotze@blah.com|01/01/2013
    test2
    2|cust2|2|2|2|2|2|Me|Person1|Person1|N|30|07|1979|UK|p3rs0n1|person1@blah.com|14/08/2012

    On Tuesday, 5 March 2013 10:17:53 UTC-5, Marcel Kornacker wrote:
    Stephan, could you send us the exact statements you use for steps 3)-5)?
  • Marcel Kornacker at Mar 9, 2013 at 12:29 am
    Stephan,

    I'm sorry for my earlier misinformation: you actually do need to run
    the refresh command when you add new data files, because Impala caches
    file and block locations metadata in order to minimize interactions
    with the name node.

    Marcel
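
    The sequence described above can be sketched with the statements already
    used in this thread: load a new file in Hive, then refresh in impala-shell
    before querying. This is only a sketch under the thread's setup (the
    tpcds.customer table and the /tmp/test2 file). The commented-out
    `refresh customer;` form is an assumption about later Impala releases,
    which as far as I know expect a table name; check the documentation for
    your version.

    ```sql
    -- In the hive shell: append another data file to the existing table.
    use tpcds;
    load data local inpath '/tmp/test2' into table customer;

    -- In impala-shell: reload the cached file and block metadata first,
    -- otherwise the query only sees the files known at the last refresh.
    -- The bare "refresh" is the form used in this thread.
    refresh;
    use tpcds;
    select count(*) from customer;

    -- Assumption: newer Impala releases take a table name instead.
    -- refresh customer;
    ```

    Without the refresh step, the count would be expected to reflect only the
    rows from the files Impala already knew about.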
  • Stephan Kotze at Mar 12, 2013 at 12:20 pm
    Aaah, no worries. Good to know.

    Thanks for your time.

    Stephan


Discussion Overview
group: impala-user
categories: hadoop
posted: Mar 1, '13 at 5:00p
active: Mar 12, '13 at 12:20p
posts: 6
users: 2
website: cloudera.com
irc: #hadoop
