We do something similar to this. We manually tell Impala where the parquet
files are by writing to the Hive metastore. You can only tell the
metastore about directories, not specific files, so when we want to drop in
a new file, we create a new timestamped directory. We then drop the new
file in there, update the metastore, then tell Impala to update its
metadata. Once the INVALIDATE METADATA call has completed, we delete the
old timestamped directory. So, for example:

Before update: a single timestamped directory holds the current file, and
the metastore points at it.

After update: both the old and new timestamped directories exist, and the
metastore points at the new one.

After INVALIDATE METADATA has completed: only the new timestamped
directory remains.
We don't have long in-flight queries, so we normally delete the old data
directory as soon as Impala has refreshed its metadata, but you could
easily allow some extra time to make sure all queries have completed.
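The swap described above can be sketched as follows. This is a minimal local simulation (the directory layout, file name `data.parquet`, and the `swap_in_new_file` helper are all illustrative, not from the original posts); the comments note which real HDFS/Impala step each part corresponds to:

```python
import os
import shutil
import tempfile
from datetime import datetime

def swap_in_new_file(table_root, new_parquet_bytes):
    """Simulate the timestamped-directory swap on the local filesystem.

    In production each step would map roughly to:
      1. hdfs dfs -mkdir <new timestamped dir>
      2. hdfs dfs -put <new file> into that directory
      3. update the metastore to point the table at the new directory
      4. tell Impala to refresh (INVALIDATE METADATA)
      5. hdfs dfs -rm -r <old dir> once the refresh has completed
    """
    old_dirs = sorted(os.listdir(table_root))

    # Steps 1-2: create a new timestamped directory and drop the file in.
    new_dir = os.path.join(table_root,
                           datetime.now().strftime("%Y%m%d%H%M%S%f"))
    os.makedirs(new_dir)
    with open(os.path.join(new_dir, "data.parquet"), "wb") as f:
        f.write(new_parquet_bytes)

    # Steps 3-4 (metastore update and metadata refresh) would happen here.

    # Step 5: delete the old timestamped directories only after the refresh,
    # so in-flight queries never see their files disappear mid-read.
    for d in old_dirs:
        shutil.rmtree(os.path.join(table_root, d))
    return new_dir
```

Because readers are always pointed at a directory that never changes after creation, the only coordination needed is delaying step 5 long enough for running queries to drain.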


On Mon, Mar 3, 2014 at 12:14 AM, Tivona Hu wrote:

Thanks Nong for the reply :)

Then I'm wondering whether there's any way to do a "virtual drop" on a file.

For example, I convert staging avro files to parquet every hour:
data_2pm.parquet (= data_1pm + new data generated between 1pm~2pm)

At 2pm, I want to do a refresh to virtually remove data_1pm.parquet and
add data_2pm.parquet to the metastore.
That way, any ongoing query that started between data_1pm and data_2pm
will not fail, since the file is still there physically.
And any query that starts after data_2pm will only read data_2pm, since
the metastore has been refreshed.

Then, finally, after a few hours, I can delete the file data_1pm
physically, since there should be no remaining queries associated with it.

Does anyone have an idea whether it's possible to do this kind of thing?
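The "virtual drop" described above amounts to a grace-period cleanup: the metastore pointer is swapped immediately, but the physical delete is deferred until in-flight queries have had time to finish. A minimal sketch (the `virtual_drop` helper, file names, and grace period are illustrative assumptions, not anything from Impala's API):

```python
import os
import time
import tempfile

def virtual_drop(data_dir, old_file, grace_seconds):
    """Return a cleanup callable; the old file stays on disk until it runs.

    In production, the "drop" itself would be the metastore refresh (so new
    queries only see data_2pm.parquet), and the returned cleanup would be
    scheduled hours later, after queries against data_1pm.parquet finish.
    """
    deadline = time.monotonic() + grace_seconds

    def cleanup():
        # Wait out the remainder of the grace period, then delete the old
        # file physically (hdfs dfs -rm in a real deployment).
        remaining = deadline - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
        os.remove(os.path.join(data_dir, old_file))

    return cleanup
```

The key design point is that the metadata swap and the physical delete are two separate events, with the grace period in between chosen to exceed the longest expected query runtime.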


Nong wrote on Friday, February 28, 2014 at 2:31:59 AM UTC+8:
Impala doesn't do anything special for concurrent reads and writes, as this
really needs to be handled at the storage layer. When a file is deleted
in HDFS, the file will be removed even if there are active readers (HDFS
doesn't track active readers).

Simultaneous read queries and refreshes are fine. Simultaneous reads
and deletes will cause the read queries to fail with a "file doesn't
exist" error.

On Wed, Feb 26, 2014 at 11:36 PM, Tivona Hu wrote:

I'm using Impala with parquet table and staging phase described here:

All looks good, but I'm wondering how Impala actually handles concurrent
reads and writes. I mean, what will happen if I overwrite a parquet file
in the data warehouse and refresh the corresponding table while another
person is querying that table?


To unsubscribe from this group and stop receiving emails from it, send an email to impala-user+unsubscribe@cloudera.org.

Discussion Overview: group impala-user; posted Feb 27, '14 at 7:36a; active Mar 11, '14 at 5:58p


