FAQ
I'm using Impala with a Parquet table and the staging phase described here:
https://github.com/cloudera/cdk-examples/tree/master/dataset-staging

All looks good, but I'm wondering how Impala actually handles concurrent
reads and writes.
I mean, what will happen if I overwrite a Parquet file in the data
warehouse and refresh the corresponding table while another person is
querying that table?

Thanks!

To unsubscribe from this group and stop receiving emails from it, send an email to impala-user+unsubscribe@cloudera.org.

  • Nong Li at Feb 27, 2014 at 6:32 pm
    Impala doesn't do anything special for concurrent reads and writes, as this
    really needs to be handled at the storage layer. When a file is deleted in
    HDFS, the file will be removed even if there are active readers (HDFS
    doesn't track active readers).

    Simultaneous read queries and refreshes are fine. Simultaneous read queries
    and deletes will cause the read queries to fail with "file doesn't exist"
    errors.

  • Tivona Hu at Mar 3, 2014 at 8:15 am
    Thanks Nong for the reply :)

    Then I'm wondering if there's any way to do a "virtual drop" on a file...

    For example, I convert staging Avro files to Parquet every hour:
    data_1pm.parquet
    data_2pm.parquet (= data_1pm + new data generated between 1pm and 2pm)

    At 2pm, I want to do a refresh to virtually remove data_1pm.parquet and add
    data_2pm.parquet to the metastore.
    That way, any ongoing query that started between the data_1pm and data_2pm
    refreshes will not fail, since the file is still there physically.
    And any query that starts after the data_2pm refresh will only read
    data_2pm, since the metastore has been refreshed.

    Then finally, after a few hours, I can delete the file data_1pm physically,
    since there should be no remaining queries associated with it.

    Does anyone know if it's possible to do this kind of thing?

    Thanks!

  • Keith Simmons at Mar 4, 2014 at 9:28 pm
    We do something similar to this. We manually tell Impala where the Parquet
    files are by writing to the Hive metastore. You can only tell the
    metastore about directories, not specific files, so when we want to drop in
    a new file, we create a new timestamped directory. We then drop the new
    file in there, update the metastore, then tell Impala to update its
    metadata. Once the invalidate metadata call has completed, we delete the
    old timestamped directory. For example:

    Before update:

    my_table/partition_1/20140203/old-parquet-file.parquet

    After update:

    my_table/partition_1/20140203/old-parquet-file.parquet
    my_table/partition_1/20140204/new-merged-parquet-file.parquet

    After invalidate metadata has completed:

    my_table/partition_1/20140204/new-merged-parquet-file.parquet

    We don't have long in-flight queries, so we normally delete the old data
    directory as soon as impala has refreshed its metadata, but you could
    easily give it some extra time to make sure all queries have completed.
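    The sequence above can be sketched as follows. This is only an illustration
    against a local filesystem rather than HDFS, and `update_metastore` and
    `invalidate_metadata` are hypothetical hooks standing in for a Hive
    metastore client call and an impala-shell INVALIDATE METADATA call:

```python
import os
import shutil

def swap_partition_dir(partition_root, new_stamp, new_file_src,
                       update_metastore, invalidate_metadata):
    """Create a new timestamped directory under the partition, drop the
    merged file in, repoint the metastore, then delete the old directories.

    update_metastore and invalidate_metadata are placeholder callbacks,
    not real Impala/Hive APIs."""
    old_dirs = sorted(os.listdir(partition_root))   # e.g. ['20140203']
    new_dir = os.path.join(partition_root, new_stamp)
    os.makedirs(new_dir)
    shutil.copy(new_file_src, new_dir)              # drop the new file in
    update_metastore(new_dir)                       # point the metastore here
    invalidate_metadata()                           # Impala picks up the change
    for d in old_dirs:                              # old data no longer visible
        shutil.rmtree(os.path.join(partition_root, d))
```

    In a real deployment you would add a grace period before the final delete
    loop if long-running queries might still hold the old files open.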

    Keith

  • Ananth Gundabattula at Mar 5, 2014 at 12:12 am
    We have a similar requirement, and we solve it at the application layer.
    The application layer makes the JDBC call only after checking for a lock
    in ZooKeeper. The lock is held by a continuous compactor process, with the
    approach that it only locks the partition that is being updated. The
    compactor continuously merges data from Kafka, so that the old data and
    the new data are written out as one file. The compaction process keeps
    compacting into a temporary HDFS file, and the Oozie job at the end just
    obtains a lock on the partition in ZooKeeper, drops the old files that
    were impacted by compaction, moves the new file into the partition, and
    then releases the ZooKeeper lock.

    The downside of this approach is that it might not work from the Impala
    shell, which is not aware of the locks taken by the continuous compactor.
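    The locking discipline could be sketched like this. A real deployment
    would use ZooKeeper locks (e.g. the kazoo Lock recipe) so they are
    visible cluster-wide; here a per-partition threading.Lock is a local
    stand-in, and the callbacks are hypothetical:

```python
import threading
from contextlib import contextmanager

# Stand-in for per-partition ZooKeeper locks; one lock per partition so the
# compactor only blocks readers/writers of the partition being updated.
_locks = {}
_registry_guard = threading.Lock()

@contextmanager
def partition_lock(partition):
    with _registry_guard:
        lock = _locks.setdefault(partition, threading.Lock())
    with lock:
        yield

def compactor_swap(partition, swap_files):
    # Writer: hold the partition lock only for the cheap file swap,
    # not for the long-running compaction into the temporary file.
    with partition_lock(partition):
        swap_files()

def query(partition, run_jdbc_query):
    # Reader: the application layer takes the same lock before the JDBC
    # call, so it never reads a partition mid-swap.
    with partition_lock(partition):
        return run_jdbc_query()
```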

    Regards,
    Ananth


  • Tivona Hu at Mar 10, 2014 at 10:15 am
    Hi Keith,

    Could you kindly provide some hints on how to write to the metastore? I
    couldn't find an appropriate API for this.
    Really appreciate the reply :)

  • Alan Choi at Mar 10, 2014 at 9:38 pm
    Hi Tivona,

    You can issue the following statement in Impala:

    ALTER TABLE <table name> PARTITION (part_col=val, part_col=val, ...) SET
    LOCATION '<path>'

    to update the partition location. By doing it in Impala, you don't even
    need to call "invalidate metadata".

    Thanks,
    Alan

  • Tivona Hu at Mar 11, 2014 at 7:54 am
    Thanks Alan :)

    This approach makes me think about using a "double-buffer" partition for a
    growing Parquet table. However, it seems we can't get the current
    partition location if the data is in Parquet format?
    http://stackoverflow.com/questions/18003038/is-there-a-way-to-show-partitions-on-cloudera-impala

    Does anyone know a way to get the location from the shell or an API? Or I
    may have to use another database to keep track of it.

    Thanks!



  • Alan Choi at Mar 11, 2014 at 5:58 pm
    Until we implement SHOW PARTITIONS in Impala, I guess you might have to
    use Hive's DESCRIBE FORMATTED and do some parsing:

    DESCRIBE FORMATTED <table> PARTITION (<partition_spec>)


Discussion Overview
group: impala-user
categories: hadoop
posted: Feb 27, '14 at 7:36a
active: Mar 11, '14 at 5:58p
posts: 9
users: 5
website: cloudera.com
irc: #hadoop
