FAQ
Hi Swati,

The number of partitions should be kept below 30k. If you have 2 million data
sources, you can't create a partition for each data source. I would suggest
grouping the data sources into a small number of partitions.
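One way to do that grouping is to hash each data source into a fixed number of buckets and partition on the bucket instead of the raw source id. This is only a sketch in Impala SQL; the table and column names (events, staging_events, source_id, payload) and the bucket count of 1000 are hypothetical, and it assumes a Parquet table since Impala can write Parquet (and text) but not SequenceFile:

```sql
-- Hypothetical schema: ~2 million sources hashed into 1000 buckets,
-- so partition count stays bounded at (days x 1000) instead of 2M/day.
CREATE TABLE events (
  source_id STRING,
  payload   STRING
)
PARTITIONED BY (day STRING, bucket INT)
STORED AS PARQUET;

-- fnv_hash() is an Impala built-in; the modulo picks the bucket.
INSERT INTO events PARTITION (day = '2014-04-15', bucket)
SELECT source_id,
       payload,
       abs(fnv_hash(source_id)) % 1000 AS bucket
FROM staging_events;
```

Queries that filter on a single data source still get partition pruning down to one bucket per day, which is usually a good enough trade-off.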

I'm not sure what you mean by "append to SequenceFile via the Thrift C++
client". Impala doesn't append to files; every INSERT creates a new file.

You can consider landing the small files as-is, then using a nightly job to
compact the files and convert them to Parquet.
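A nightly compaction along those lines might look like the following in Impala SQL. The table names (events_landing, events_parquet) are illustrative; the assumption is that both tables share the same columns and are partitioned by day:

```sql
-- Hypothetical nightly job: rewrite yesterday's small landed files into
-- a Parquet table. INSERT OVERWRITE replaces the partition's contents
-- with a small number of larger Parquet files.
INSERT OVERWRITE TABLE events_parquet PARTITION (day = '2014-04-14')
SELECT source_id, payload
FROM events_landing
WHERE day = '2014-04-14';
```

After the rewrite succeeds, the corresponding landing partition can be dropped to reclaim the small files.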

Thanks,
Alan

On Sat, Apr 12, 2014 at 11:52 PM, swavai wrote:

Hello,
We need to store data coming from various data sources (as small files,
roughly 2 million per day) at regular intervals. We need to partition the
data by data source and day, and since each file can be part of multiple
schemas, we thought we would create symbolic links.
As our files are small, we thought we would use SequenceFile as the storage
mechanism. Can we append to a SequenceFile via the Thrift C++ client?

Is the approach correct?

regards
Swati

To unsubscribe from this group and stop receiving emails from it, send an
email to impala-user+unsubscribe@cloudera.org.

Discussion Overview
group: impala-user
categories: hadoop
posted: Apr 15, '14 at 1:19p
active: Apr 19, '14 at 1:23a
posts: 3
users: 2
website: cloudera.com
irc: #hadoop

2 users in discussion: Alan Choi (2 posts), Swavai (1 post)
