So with a columnar format such as Parquet, what's the best way to deal with
streaming data? The data doesn't need to be real time; I can load it in
batches, for example every 10 minutes or every hour.

I noticed that each time I run INSERT ... SELECT into a Parquet table, Impala
creates new files. So when streaming or near-streaming, I end up with lots of
small files. I presume this is bad for compression and performance?
On Thursday, June 6, 2013 2:06:23 AM UTC+1, Alex Behm wrote:

Hi Paul,

You are correct:
- We don't recommend "insert into values" for large volumes of data
(either lots of data via a single query, or lots of data via many
small such queries).
- After adding a new file to a table directory, you'll need to refresh
the table in Impala.

My understanding is that the main issue will be whether the desired
file format supports appends.
SequenceFile, for example, does not support appending to the same file
(see https://issues.apache.org/jira/browse/HADOOP-7139).

You'll need to check which file formats support appending to the same
file, but my guess is that many don't (especially the columnar formats,
which benefit from bulk data loads).
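
One way to get bulk-style loads out of a steady trickle of records is to buffer them and write one file per batch instead of one file per record. The following is only a minimal sketch of that idea, not something taken from Impala or Hadoop: BatchBuffer, its maxRows threshold, and the batch-NNNNN.txt naming are hypothetical, and a local directory stands in for a staging directory on HDFS. Each file it produces would still have to be loaded into the table (followed by a refresh) as a separate step.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical helper: buffers rows in memory and writes one file per batch. */
final class BatchBuffer {
    private final Path stagingDir;   // assumption: local stand-in for an HDFS staging directory
    private final int maxRows;       // assumption: flush threshold, tune per workload
    private final List<String> rows = new ArrayList<>();
    private int batchNo = 0;

    BatchBuffer(Path stagingDir, int maxRows) {
        this.stagingDir = stagingDir;
        this.maxRows = maxRows;
    }

    /** Buffers a row; returns the file written if this row triggered a flush, else null. */
    Path add(String row) throws IOException {
        rows.add(row);
        return rows.size() >= maxRows ? flush() : null;
    }

    /** Writes all buffered rows to a single new file; returns null if nothing was buffered. */
    Path flush() throws IOException {
        if (rows.isEmpty()) {
            return null;
        }
        Files.createDirectories(stagingDir);
        Path out = stagingDir.resolve(String.format("batch-%05d.txt", batchNo++));
        Files.write(out, rows, StandardCharsets.UTF_8);
        rows.clear();
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("staging");
        BatchBuffer buf = new BatchBuffer(dir, 3);
        for (int i = 0; i < 7; i++) {
            buf.add("row-" + i);
        }
        buf.flush(); // write the final partial batch
        try (Stream<Path> files = Files.list(dir)) {
            System.out.println(files.count() + " files written to " + dir);
        }
    }
}
```

In a real pipeline flush() would more likely be driven by a timer (every 10 minutes, say) or by a byte-size threshold rather than a row count; the row count just keeps the sketch short.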

I believe you haven't needed to explicitly enable append in HDFS since
version 0.21 (https://issues.apache.org/jira/browse/HDFS-1107).

Cheers,

Alex

On Wed, Jun 5, 2013 at 5:38 PM, Paul Birnie wrote:

I see that Impala 1.0.1 supports "insert into values", but it's not
recommended for large volumes of data.

Q) I am looking for a simple mechanism to append data to a file and have
this data "immediately" available as part of the Impala table.

(The data is already structured and doesn't need Pig ETL etc.)

I believe that if I create a new file inside the table's directory, I will
need to call refresh.

Can I use a simple piece of Java code that simply appends to a sequence
file?

writer = new SequenceFile.Writer(fs, conf, out, Text.class, Text.class);

Q) Also, do I need to explicitly enable HDFS file append functionality?

Discussion Overview
group: impala-user @
categories: hadoop
posted: Jun 6, '13 at 12:38a
active: Jun 10, '13 at 5:39p
posts: 7
users: 4
website: cloudera.com
irc: #hadoop
