Hi Swati,

Please see my inline comment.

On Wed, Apr 16, 2014 at 2:49 PM, swavai wrote:

Hello Alan,
Thanks for your reply.
There was a similar question: "External Tables : Sequence files updates
and Impala Query".
In it, Ishaan said it's possible to update a sequence file, and you are saying
it's not. I am a bit confused.
Hive has a built-in capability to combine small files:

1. Hive <https://issues.apache.org/jira/browse/HIVE>
2. HIVE-74 <https://issues.apache.org/jira/browse/HIVE-74>

Does Impala also have something similar?
alan: Impala can compact files for you, but you can't control the file size
or the number of files. It's better to use Hive.
b) Can Impala take advantage of file names for external tables (the table name
can act as a sort of query parameter)?
alan: You can't get the names of the files you're reading from in Impala.
c) We will be having approximately 3 K per minute (spanning, say, 6000 to 7000
tables). Is calling refresh that many times a good idea (as Impala won't look
at the new files until we call refresh)?
alan: You might want to think about the latency of the data and the
underlying storage. HDFS is good for large scans, but not good for tiny
incremental updates. HBase is good for small lookups, small scans, lots of
tiny updates, and immediate data visibility.
d) As append is not there, our file size may be as small as 1 KB. If we are
able to compact the files at night, what happens to the queries we are running
against the present data (it's related to b)?
alan: Write out to a new dir. Point the partition to the new dir. When all
existing queries are done, drop the old files.
e) So every night after compacting, we delete the current partition and insert
a new partition with a different file format?
alan: Yes. Write out to Parquet!
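
The swap described in d) and e) can be sketched as a small Python helper that builds the two statements for repointing a partition. This is only an illustration under assumptions: the table name, partition columns, and path are hypothetical, and the exact `ALTER TABLE ... PARTITION ... SET` syntax should be checked against the Hive or Impala version in use.

```python
def swap_partition_statements(table, partition_spec, new_location, new_format="PARQUET"):
    """Build statements that repoint one partition at a freshly compacted directory.

    All names passed in are hypothetical examples; nothing here talks to a cluster.
    """
    # Render the partition spec, e.g. {"day": "2014-04-16"} becomes day='2014-04-16'.
    parts = []
    for column, value in partition_spec.items():
        if isinstance(value, str):
            escaped = value.replace("'", "\\'")
            parts.append(f"{column}='{escaped}'")
        else:
            parts.append(f"{column}={value}")
    target = f"ALTER TABLE {table} PARTITION ({', '.join(parts)})"

    # The two statements are separate operations, not one atomic change,
    # so run them in a quiet window if possible.
    return [
        f"{target} SET FILEFORMAT {new_format}",
        f"{target} SET LOCATION '{new_location}'",
    ]


if __name__ == "__main__":
    for statement in swap_partition_statements(
        "events", {"day": "2014-04-16"}, "hdfs:///data/events/compacted/2014-04-16"
    ):
        print(statement + ";")
```

The old directory is left alone by these statements, which matches the advice above: drop the old files only after the queries that were still reading them have finished.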
On Tuesday, April 15, 2014 4:36:18 PM UTC-7, Alan wrote:

Hi Swati,

The number of partitions should be less than 30k. If you have 2 million
data sources, then you can't create a partition for each data source. I
would suggest that you group data sources into a small number of partitions.
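
One possible way to do that grouping is to hash each data source ID into a fixed number of buckets and partition by bucket and day. The sketch below is an assumption, not something prescribed in this thread: the function names, the bucket count, and the retention period are all hypothetical.

```python
import zlib


def partition_bucket(source_id, num_buckets=50):
    """Map a data source ID to a stable bucket number in [0, num_buckets)."""
    # crc32 is deterministic across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(source_id.encode("utf-8")) % num_buckets


def total_partitions(num_buckets, days_retained):
    """Partitions needed when partitioning by (bucket, day)."""
    return num_buckets * days_retained


if __name__ == "__main__":
    print(partition_bucket("sensor-000123"))
    # 50 buckets kept for 365 days stays under the 30k guideline.
    print(total_partitions(50, 365))  # 18250
```

Queries for a single data source would then filter on both the bucket and the source ID, so only one bucket per day has to be scanned.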

I'm not sure what you mean by "append to sequencefile via thrift c++
client". Impala doesn't append to files. Every INSERT will create a new

You can consider landing the small files as is. Then use a nightly job to
compact the files and convert them into Parquet.
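
As a rough illustration of the planning step in such a nightly job, the Python sketch below groups landed files into batches that each stay near a target output size. It only plans the batches; the actual rewrite to Parquet would still be done by Hive or Impala. The 256 MB target and the function name are assumptions, not values from this thread.

```python
def plan_compaction(file_sizes, target_bytes=256 * 1024 * 1024):
    """Greedily group files into batches of at most target_bytes each.

    file_sizes maps a file path to its size in bytes. A single file larger
    than target_bytes ends up in a batch of its own.
    """
    batches = []
    current, current_size = [], 0
    for path, size in sorted(file_sizes.items()):
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    sizes = {"a": 600, "b": 600, "c": 300}
    print(plan_compaction(sizes, target_bytes=1000))  # [['a'], ['b', 'c']]
```

For scale: if the roughly 2 million files per day mentioned in the original question really are about 1 KB each, that is only around 2 GB per day, or about eight output files at a 256 MB target.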


On Sat, Apr 12, 2014 at 11:52 PM, swavai wrote:

We need to store data coming from various data sources (as small files,
approximately 2 million per day) at regular intervals. We need to partition
the data by data source and day. Also, each file can be part of multiple
schemas, so we thought we would create symbolic links.
As our files are small, we thought we would use sequencefile as the
storage mechanism. Can we append to sequencefile via thrift c++ client?

Is the approach correct?


To unsubscribe from this group and stop receiving emails from it, send
an email to impala-user...@cloudera.org.

Discussion Overview
group: impala-user
posted: Apr 15, '14 at 1:19p
active: Apr 19, '14 at 1:23a

2 users in discussion

Alan Choi: 2 posts Swavai: 1 post


