Option 1 (mixed partition formats) looks interesting. Can you do that in
Impala? I don't see it in the docs.

On Friday, June 7, 2013 5:37:03 PM UTC+1, Alex Behm wrote:

Sure you can. Two possible solutions to that problem:
1. Use a table with mixed partition formats. All partitions except a
special ingestion partition would use Parquet. When you query the
table, you'll get all the data (including the non-Parquet data from
the special ingestion partition).
2. Use two tables. One table is completely non-Parquet and used for
ingestion, and the other is in Parquet. You can write queries
against the UNION ALL of those two tables. (Both are sketched below.)
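
A rough sketch of both options. The table names (events,
events_parquet, events_staging), the partition column, and the exact
keywords are assumptions on my part, and on 2013-era versions the
per-partition DDL may need to be run through Hive rather than Impala:

-- Option 1: keep a special ingestion partition in a row format
-- while the rest of the table stays Parquet
ALTER TABLE events PARTITION (part='ingest') SET FILEFORMAT TEXTFILE;

-- Option 2: queries read the UNION ALL of a staging table and a
-- Parquet table
SELECT col1, col2 FROM events_parquet
UNION ALL
SELECT col1, col2 FROM events_staging;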

Cheers,

Alex

On Fri, Jun 7, 2013 at 1:14 AM, <gerrard...@gmail.com> wrote:
Hm, yeah, but then you can't query that data in Impala until you
batch it in.
On Thursday, June 6, 2013 9:14:29 PM UTC+1, Alex Behm wrote:

A reasonable approach is to ingest your streaming data into HDFS in
whatever format is convenient for ingestion. Then you periodically
convert the data in batch to Parquet for faster querying via Impala.
Obviously, the devil is in the details, but those depend heavily on
your setup, data, ingestion rate, etc.
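
A minimal sketch of that batch step, assuming a text-format staging
table events_staging and a Parquet table events_parquet with
matching schemas (both names hypothetical):

-- run every 10 minutes / hour / etc.: convert newly ingested rows
-- to Parquet in one bulk operation
INSERT INTO events_parquet
SELECT * FROM events_staging;
-- then clear or replace the staging data; the exact housekeeping
-- depends on your ingestion pipeline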

Cheers,

Alex
On Thu, Jun 6, 2013 at 4:32 AM, wrote:
So with a columnar format such as Parquet, what's the best way to
deal with streaming data? The data doesn't need to be real time; I
can do something like load the data every 10 minutes, hour, etc.

I noticed that each time I call insert...select into a Parquet
table, Impala creates new files. So when streaming/near-streaming, I
end up with lots of small files. I presume this is bad for
compression and performance?
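
For example (names hypothetical), each periodic load such as

INSERT INTO events_parquet
SELECT * FROM events_staging;

writes brand-new Parquet files rather than appending to existing
ones, so a load every few minutes accumulates many small files.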

On Thursday, June 6, 2013 2:06:23 AM UTC+1, Alex Behm wrote:

Hi Paul,

you are correct:
- We don't recommend "insert into values" for large volumes of data
(either lots of data via a single query, or lots of data via many
small such queries).
- After adding a new file to a table directory, you'll need to
issue a refresh in Impala (see the sketch after this list).
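
For example (table name hypothetical; the exact refresh syntax has
varied across Impala versions):

-- make files newly added to the table's HDFS directory visible
REFRESH my_table;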

My understanding is that the main issue will be whether the desired
file format supports appends. SequenceFile, for example, does not
support appending to the same file (see
https://issues.apache.org/jira/browse/HADOOP-7139).

You'll need to check which file formats support appending to the
same file, but my guess is that many don't (the columnar formats
especially benefit from bulk data loads).

I believe you don't need to explicitly enable append in HDFS since
version 0.21 (https://issues.apache.org/jira/browse/HDFS-1107).

Cheers,

Alex

On Wed, Jun 5, 2013 at 5:38 PM, Paul Birnie wrote:

I see that Impala 1.0.1 supports "insert into values" - but it's
not recommended for large volumes of data.

Q) I am looking for a simple mechanism to append data to a file and
have this data "immediately" available as part of the Impala table.

(The data is already structured and doesn't need Pig ETL, etc.)

I believe if I create a new file inside the table's directory, I
will need to call refresh.


Can I use a simple piece of Java code that simply appends to a
SequenceFile?

// old-style writer: fs is a FileSystem, conf a Configuration,
// out a Path; keys and values are both Text
writer = new SequenceFile.Writer(fs, conf, out, Text.class, Text.class);

Q) Also, do I need to explicitly enable HDFS file append
functionality?
