On Jun 7, 2013 at 4:37 pm, wrote:
Sure you can. Two possible solutions to that problem:
1. Use a table with mixed partition formats. All partitions except a
special ingestion partition would use Parquet. When you query the
table you'll get all the data (including the non-Parquet data from the
special ingestion partition).
2. Use two tables. One table is completely non-Parquet and used for
ingestion, and the other table is in Parquet. You can write queries
against the UNION ALL of those two tables. (Both approaches are sketched
below.)
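For example, a minimal Impala SQL sketch of both approaches. All table,
view, and partition names here (events, events_staging, events_parquet,
events_all, day) are hypothetical, and CREATE VIEW plus the PARQUET
keyword only arrived in later Impala releases (older ones use
PARQUETFILE, and you'd repeat the UNION ALL inline in each query):

  -- Approach 1: one table, mixed partition file formats. The current
  -- ingestion partition stays in text format; closed partitions are Parquet.
  ALTER TABLE events PARTITION (day='2013-06-07') SET FILEFORMAT TEXTFILE;
  ALTER TABLE events PARTITION (day='2013-06-06') SET FILEFORMAT PARQUETFILE;

  -- Approach 2: two tables, queried through a UNION ALL view.
  CREATE TABLE events_staging (ts BIGINT, payload STRING);  -- text, for ingestion
  CREATE TABLE events_parquet (ts BIGINT, payload STRING) STORED AS PARQUETFILE;
  CREATE VIEW events_all AS
    SELECT ts, payload FROM events_parquet
    UNION ALL
    SELECT ts, payload FROM events_staging;

Queries against events_all see both the already-converted Parquet data
and the freshly ingested rows still sitting in the staging table.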
On Fri, Jun 7, 2013 at 1:14 AM, wrote:
Hm, yeah, but then you can't query that data in Impala until you batch it in.
On Thursday, June 6, 2013 9:14:29 PM UTC+1, Alex Behm wrote:
A reasonable approach is to ingest your streaming data into HDFS in
whatever format is convenient for ingestion. Then you periodically
convert the data in batch to Parquet for faster querying via Impala.
Obviously, the devil is in the details, but those depend heavily on
your setup, data, ingestion rate, etc.
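Concretely, the periodic batch job could boil down to a single bulk
INSERT ... SELECT; a minimal sketch, assuming a text staging table and a
Parquet target table (both names hypothetical):

  -- move everything ingested so far into the Parquet table in one bulk write
  INSERT INTO events_parquet SELECT * FROM events_staging;
  -- then empty the staging table for the next window by overwriting it
  -- with an empty result set
  INSERT OVERWRITE events_staging SELECT * FROM events_staging WHERE false;

Note the race between the two statements (rows arriving in between would
be dropped); a real pipeline would rotate staging directories or
partitions rather than overwriting in place.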
On Thu, Jun 6, 2013 at 4:32 AM, wrote:
So with a columnar format such as Parquet, what's the best way to deal
with streaming data? The data doesn't need to be real time; I can do
something like load the data every 10 minutes, every hour, etc.
I noticed that each time I call INSERT ... SELECT into a Parquet table, Impala
creates new files. So when streaming/near-streaming, I end up with lots of
small files. I presume this is bad for compression and performance?
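To illustrate the buildup and one possible mitigation (an occasional bulk
rewrite into a second table; all table names hypothetical):

  -- every small INSERT ... SELECT adds at least one new Parquet file per
  -- participating node, so frequent 10-minute loads pile up small files
  INSERT INTO events_parquet SELECT * FROM events_staging;
  -- an occasional compaction pass rewrites the data in one bulk job,
  -- producing far fewer, larger files
  INSERT OVERWRITE events_parquet_compacted SELECT * FROM events_parquet;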
On Thursday, June 6, 2013 2:06:23 AM UTC+1, Alex Behm wrote:
you are correct:
- We don't recommend "insert into values" for large volumes of data
(either lots of data via a single query, or lots of data via many
small such queries).
- After adding a new file to a table directory, you'll need to refresh
the table's metadata in Impala (example below).
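For example (table name hypothetical; the exact refresh syntax varies
across early Impala releases):

  -- after adding files out-of-band, e.g.
  --   hdfs dfs -put new_batch.txt /user/hive/warehouse/mytable/
  -- tell Impala to reload the table's file metadata:
  REFRESH mytable;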
My understanding is that the main issue will be whether the desired
file format supports appends.
SequenceFile, for example, does not support appending to the same file.
You'll need to check which file formats support appending to the same
file, but my guess is that many don't (especially the columnar formats,
which benefit from bulk data loads).
I believe you don't need to explicitly enable append in HDFS since
version 0.21 (https://issues.apache.org/jira/browse/HDFS-1107).
On Wed, Jun 5, 2013 at 5:38 PM, Paul Birnie wrote:
I see that Impala 1.0.1 supports "insert into values", but it's not
recommended for large volumes of data.
Q) I am looking for a simple mechanism to append data into a file and have
this data "immediately" available as part of the Impala table.
(The data is already structured and doesn't need Pig ETL, etc.)
I believe if I create a new file inside the table's directory, I will need
to refresh before the data shows up.
Can I use a simple piece of Java code that simply appends to a SequenceFile,
e.g.:
writer = new SequenceFile.Writer(fs, conf, out, Text.class, Text.class); // value class assumed
Q) Also, do I need to explicitly enable HDFS file append?