On Tue, Jul 9, 2013 at 9:27 PM, Andrei wrote:
We have a similar problem. Our use case is storing market data, roughly
700 billion records (about 3 years at 600M-1B records per day). One of
the frequent queries is of the form
"select ... from quotes where symbol = $P_SYMBOL and timestamp between
$P_START and $P_END ... "
So the first problem is that far too many files are created for 7 TB of
data (due to the aforementioned 260 MB limit). Do you think adding a
MAXSIZE option to the CREATE [internal] TABLE statement is a good idea?
And what if I want the storage file size to be larger than the HDFS block?
Regarding partitioning, splitting the table by symbol/date results in 6M
files (8000 symbols x 750 days), which might be too much for the NameNode
to handle. Will Impala's query optimizer be smart enough if I instead
partition the table by, say, [symbol, year(timestamp) || weekofyear(timestamp)]
(i.e. use a function in the partition definition)?
Thanks in advance,

What exactly do you mean by "use function in partition definition"?
The select list of the INSERT statement that populates the table?
If you have a table that is partitioned by something like (symbol
string, year int, weekofyear int), you will need predicates on those
partition columns in your query in order to take advantage of
partition pruning. In other words, simply having "timestamp between
<start> and <end>" won't be sufficient; you'll need to add "year
between <> and <> ..." as well.
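To make the pruning requirement concrete, here is a minimal sketch of the
workaround described above: since Impala cannot evaluate a function inside
the partition definition itself, the derived values are declared as ordinary
partition columns and computed in the select list of the INSERT. The table
name quotes_part, the source table quotes, and the data columns (ts, price,
volume) are illustrative assumptions, not taken from the thread:

    -- Partition columns are declared explicitly; Impala cannot use a
    -- function such as year() directly in the PARTITIONED BY clause.
    CREATE TABLE quotes_part (
      ts TIMESTAMP,
      price DOUBLE,
      volume BIGINT
    )
    PARTITIONED BY (symbol STRING, yr INT, wk INT)
    STORED AS PARQUETFILE;

    -- The derived partition values are computed in the select list of
    -- the INSERT; partition columns come last, in PARTITION-clause order.
    INSERT INTO quotes_part PARTITION (symbol, yr, wk)
    SELECT ts, price, volume,
           symbol,
           year(ts),
           weekofyear(ts)
    FROM quotes;

    -- For partition pruning, the query must constrain the partition
    -- columns themselves, not just the raw timestamp.
    SELECT ts, price, volume
    FROM quotes_part
    WHERE symbol = 'ABC'
      AND yr = 2013
      AND wk BETWEEN 1 AND 4
      AND ts BETWEEN '2013-01-01' AND '2013-01-28';

Without the yr and wk predicates, Impala would have to scan every week of
data for that symbol, since the timestamp range alone says nothing about
which partitions can be skipped.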
On Friday, June 21, 2013 5:40:59 PM UTC-4, Skye Wanderman-Milne wrote:
Impala 1.1 will be released within the next month.
On Fri, Jun 21, 2013 at 2:05 PM, Daniel Uribe <daniel....@gmail.com> wrote:
Thanks for the information, Nong. I'm glad to know that this is something
that will be fixed in the next release. Any estimate of the timeframe for
when it might be available?
On Friday, June 21, 2013 1:28:07 PM UTC-4, Nong wrote:
This is an issue in our implementation that causes Impala to generate files
smaller than the target size. It has been fixed for our upcoming release.
On Thu, Jun 20, 2013 at 9:17 AM, Daniel Uribe <daniel....@gmail.com> wrote:
We have a Parquet table for a day of activity (close to 5 billion
records) that is partitioned by minute, and there are several minutes where
the total data size is over 1 GB. But when populating the table, Impala
split the files, and I can't find any files larger than 260 MB. Since the
HDFS block size is 1 GB, this is not ideal: the optimal Parquet file size
should also be 1 GB.
Is there any setting in Impala to make sure that the files don't get
split into smaller sizes, to improve I/O performance during queries?
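(For anyone finding this thread later: newer Impala releases added a
PARQUET_FILE_SIZE query option that sets the target size of Parquet files
written by INSERT. This is a hedged sketch, not something available in the
release discussed above, and the table names are placeholders; check your
version's documentation.)

    -- PARQUET_FILE_SIZE was added in a later release than the one
    -- discussed in this thread; the value here is in bytes.
    SET PARQUET_FILE_SIZE=1073741824;  -- target roughly 1 GB per file

    -- my_parquet_table and staging_table are placeholder names.
    INSERT INTO my_parquet_table SELECT * FROM staging_table;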