Hi Felix,

Thanks for letting us know about your experience! The way inserts work
in Impala is that the new data is first staged into a temporary
directory (the "_dir" you are seeing) and then moved into the parent
directory at the very end of execution.
We do this to (1) make statements like "insert overwrite t select *
from t" work properly, and (2) add at least a weak degree of
"atomicity" to the insert operation, so that a user can clean up the
staging area if an insert failed in the middle of its operation
(without affecting the existing data, even if "overwrite" was
specified).

It's possible that the "_dir" you are seeing is a leftover directory
from a failed insert. It is also not recommended to run Hive queries
concurrently with Impala inserts, as you may run into the issues you
have observed.
In the future, we may look into putting the staging area for inserts
somewhere else to allow Hive to run queries concurrently.
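The stage-then-move pattern described above can be sketched roughly as follows. This is an illustrative sketch only, not Impala's actual implementation; the function name and file layout are invented for the example:

```python
import os
import shutil
import tempfile

def staged_insert(table_dir, filename, rows):
    """Illustrative sketch (not Impala's real code) of a staged insert:
    write new data into a temporary "_dir"-style staging directory inside
    the table directory, then move it into place only once the write has
    fully succeeded."""
    # Stage: create a temporary staging directory and write the new file there.
    staging_dir = tempfile.mkdtemp(suffix="_dir", dir=table_dir)
    staged_file = os.path.join(staging_dir, filename)
    with open(staged_file, "w") as f:
        f.writelines(line + "\n" for line in rows)
    # Move: publish the file into the table directory. Existing data is
    # untouched until this point, so a failure before here leaves only
    # the staging directory behind, which can be deleted safely.
    os.rename(staged_file, os.path.join(table_dir, filename))
    shutil.rmtree(staging_dir)
```

If the process dies before the final move, the partial data sits only in the `*_dir` staging directory, which is why a leftover directory like the one Felix saw can be cleaned up without affecting committed files.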



On Tue, Jun 25, 2013 at 7:34 PM, Felix Xu wrote:
Hi all,

I encountered a problem. Currently I use Hive and Impala in
combination: some queries are executed by Hive and some by Impala.
This works fine for a period; however, sometimes I find that Hive is
not able to read the output files of an Impala table, simply because
Impala has generated a sub-folder (e.g. 123456789--442211_123456_dir)
within that table's directory (/user/hive/warehouse/sometable). It is
also strange that not all the files are in that sub-folder; most of
them are placed directly in the table directory. Thus when Hive reads
that table, it throws a FileNotFoundException saying that
/user/hive/warehouse/sometable/123456789--442211_123456_dir is not a
file.

Group: impala-user
Posted: Jun 28, '13 at 2:24a
Active: Jun 28, '13 at 2:24a

Alex Behm: 1 post


