The number of partitions should be kept under 30k. If you have 2 million data
sources, you can't create a partition for each one. I would suggest grouping
the data sources into a smaller number of partitions.
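One common way to do that grouping is to hash each data-source id into a fixed number of buckets and partition on the bucket instead of the source. This is a minimal sketch, not anything Impala-specific; the bucket count (256 here) and the id format are assumptions you'd tune so that buckets × days stays well under the 30k limit.

```python
import hashlib

NUM_BUCKETS = 256  # hypothetical; choose so buckets * retained days << 30k partitions

def partition_bucket(source_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a data-source id to a stable partition bucket.

    Uses a cryptographic hash rather than Python's built-in hash() so the
    mapping is identical across processes and restarts.
    """
    digest = hashlib.md5(source_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets
```

The writer would then place each incoming file under a path like `.../bucket=<n>/day=<d>/`, and queries that know the source id can still prune to a single bucket partition.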
I'm not sure what you mean by "append to a SequenceFile via the Thrift C++
client". Impala doesn't append to existing files; every INSERT creates a new
file.
You could consider landing the small files as-is, then running a nightly job
to compact them and convert them to Parquet.
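A nightly compaction job along those lines typically just issues one `INSERT OVERWRITE ... SELECT` per day, rewriting that day's many small files into a few larger Parquet files. Below is a hedged sketch that only builds the statement string; the table names (`landing_seq`, `events_parquet`) and the `(bucket, dt)` partition columns are assumptions for illustration, and you'd submit the statement through impala-shell or a client driver.

```python
from datetime import date

def compaction_sql(day: date,
                   src_table: str = "landing_seq",
                   dst_table: str = "events_parquet") -> str:
    """Build an Impala statement that rewrites one day's small files as Parquet.

    Assumes dst_table is a Parquet table partitioned by (bucket, dt) and that
    src_table's columns line up with it (dynamic partition insert: the last
    select columns populate the partition keys).
    """
    d = day.isoformat()
    return (
        f"INSERT OVERWRITE TABLE {dst_table} PARTITION (bucket, dt) "
        f"SELECT * FROM {src_table} WHERE dt = '{d}'"
    )

print(compaction_sql(date(2014, 4, 12)))
```

Because the overwrite replaces the whole partition in one shot, the small landing files can be dropped afterwards and readers only ever see the compacted Parquet data.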
Thanks,
Alan
On Sat, Apr 12, 2014 at 11:52 PM, swavai wrote:
Hello,
We need to store data coming from various data sources (as small files,
approx 2 million per day) at regular intervals. We need to partition the data
by data source and day, and since each file can be part of multiple schemas,
we thought we would create symbolic links.
As our files are small, we thought we would use SequenceFile as the storage
format. Can we append to a SequenceFile via the Thrift C++ client?
Is this approach correct?
regards
Swati
To unsubscribe from this group and stop receiving emails from it, send an
email to impala-user+unsubscribe@cloudera.org.