We do something similar to this. We manually tell Impala where the parquet
files are by writing to the Hive metastore. You can only tell the
metastore about directories, not specific files, so when we want to drop in
a new file, we create a new timestamped directory. We then drop the new
file in there, update the metastore, and tell Impala to update its
metadata. Once the invalidate metadata call has completed, we delete the
old timestamped directory. For example:

Before update:

my_table/partition_1/20140203/old-parquet-file.parquet

After update:

my_table/partition_1/20140203/old-parquet-file.parquet
my_table/partition_1/20140204/new-merged-parquet-file.parquet

After invalidate metadata has completed:

my_table/partition_1/20140204/new-merged-parquet-file.parquet

We don't have long in-flight queries, so we normally delete the old data
directory as soon as Impala has refreshed its metadata, but you could
easily give it some extra time to make sure all queries have completed.
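
Roughly, the whole cycle looks like the sketch below (illustrative only:
the partition column, paths, and file names are made up, and it assumes
the hdfs and impala-shell command-line tools are on the PATH):

import subprocess
from datetime import date

TABLE = "my_table"
PART_COL = "part"                        # hypothetical partition column
PART_VAL = "partition_1"
BASE = "/warehouse/my_table/partition_1"
OLD_DIR = BASE + "/20140203"
NEW_DIR = f"{BASE}/{date.today():%Y%m%d}"    # e.g. .../20140204

def run(cmd):
    subprocess.check_call(cmd)

# Create the new timestamped directory and drop the merged file in.
run(["hdfs", "dfs", "-mkdir", "-p", NEW_DIR])
run(["hdfs", "dfs", "-put", "new-merged-parquet-file.parquet", NEW_DIR])

# Repoint the partition at the new directory in the metastore, then
# tell Impala to reload its cached metadata for the table.
run(["impala-shell", "-q",
     f"ALTER TABLE {TABLE} PARTITION ({PART_COL}='{PART_VAL}') "
     f"SET LOCATION '{NEW_DIR}'"])
run(["impala-shell", "-q", f"INVALIDATE METADATA {TABLE}"])

# Only once the invalidate has completed do we remove the old directory.
run(["hdfs", "dfs", "-rm", "-r", OLD_DIR])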

Keith

On Mon, Mar 3, 2014 at 12:14 AM, Tivona Hu wrote:

Thanks Nong for the reply :)

Then I'm wondering if there's any way to do a "virtual drop" on a file...

For example, I convert staging Avro files to parquet every hour:
data_1pm.parquet
data_2pm.parquet (= data_1pm + new data generated between 1pm and 2pm)

At 2pm, I want to do a refresh to virtually remove data_1pm.parquet and
add data_2pm.parquet to the metastore.
So if any ongoing query started between data_1pm and data_2pm, it will not
fail, since the file is still there physically.
And any query that happens after data_2pm will only query data_2pm, since
the metastore has been refreshed.

Then finally, after a few hours, I can delete the file data_1pm physically,
since there should be no remaining queries associated with it.
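
In other words, something like this hourly job (just a sketch of what I
mean; the table name, paths, and grace period are made up):

import subprocess
import time

TABLE = "staging_rollup"                    # made-up table name
NEW_DIR = "/warehouse/staging_rollup/2pm"   # holds data_2pm.parquet
OLD_DIR = "/warehouse/staging_rollup/1pm"   # holds data_1pm.parquet
GRACE_SECONDS = 4 * 3600                    # "after a few hours"

# Repoint the table and refresh: queries that start after this see only
# data_2pm.parquet, while queries already running can still read
# data_1pm.parquet because the file is still physically on HDFS.
subprocess.check_call(["impala-shell", "-q",
    f"ALTER TABLE {TABLE} SET LOCATION '{NEW_DIR}'"])
subprocess.check_call(["impala-shell", "-q", f"REFRESH {TABLE}"])

# Delete the old file only after any in-flight query should be finished.
time.sleep(GRACE_SECONDS)
subprocess.check_call(["hdfs", "dfs", "-rm", "-r", OLD_DIR])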

Does anyone have an idea whether it's possible to do this kind of thing?

Thanks!

On Friday, February 28, 2014 at 2:31:59 AM UTC+8, Nong wrote:
Impala doesn't do anything special for concurrent reads and writes, as this
really needs to be handled at the storage layer. When a file is deleted in
HDFS, the file will be removed even if there are active readers (HDFS
doesn't track active readers).

Simultaneous read queries and refreshes are fine. Simultaneous read queries
and deletes will cause the read queries to fail with "file doesn't exist"
errors.

On Wed, Feb 26, 2014 at 11:36 PM, Tivona Hu wrote:

I'm using Impala with a parquet table and the staging phase described here:
https://github.com/cloudera/cdk-examples/tree/master/dataset-staging

All looks good, but I'm wondering how Impala actually handles concurrent
reads and writes. I mean, what will happen if I overwrite a parquet file
in the data warehouse and refresh the corresponding table while another
person is querying that table?

Thanks!
