Grokbase Groups Hive user March 2011

Our Hive table import process does a dynamic partition insert into a temporary table, then loads the resulting sequence files into the master table with LOAD DATA INPATH, because we want the data online for querying immediately. The loaded data does not overwrite files that already exist in the partitions, so we are essentially doing an "append" to each partition. Our questions: is this bad practice, and how does it affect table sampling? It seems that the table sampling mechanism expects as many files in the partition folder as there are buckets. "Compacting" the table with INSERT OVERWRITE to rewrite the partitions fixes the sampling problem, but we would like to avoid the expensive write. Is there a better way to get data online quickly while preserving the ability to sample the table?

Luke Forehand
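
For readers following along, here is a toy Python model of the situation described in the post. It is only a sketch, not Hive's implementation: it assumes, as the post suspects, that sampling expects exactly one file per bucket in a partition. The bucket count, the file naming scheme, and all function names (`bucket_of`, `write_bucket_files`, `layout_matches_buckets`, `compact`) are hypothetical and chosen for illustration.

```python
NUM_BUCKETS = 4  # hypothetical bucket count for the table


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Toy bucket assignment: non-negative key modulo the bucket count."""
    return (key & 0x7FFFFFFF) % num_buckets


def write_bucket_files(keys, load_id, num_buckets=NUM_BUCKETS):
    """Simulate one bucketed load: one file per bucket, tagged with the load id."""
    # Create a file for every bucket up front, so empty buckets still get a file.
    files = {f"{b:06d}_{load_id}": [] for b in range(num_buckets)}
    for key in keys:
        files[f"{bucket_of(key, num_buckets):06d}_{load_id}"].append(key)
    return files


def layout_matches_buckets(partition_files, num_buckets=NUM_BUCKETS):
    """True when the partition holds exactly one file per bucket (assumed requirement)."""
    return len(partition_files) == num_buckets


def compact(partition_files, num_buckets=NUM_BUCKETS):
    """Simulate the INSERT OVERWRITE rewrite: all rows back into one file per bucket."""
    all_keys = [k for rows in partition_files.values() for k in rows]
    return write_bucket_files(all_keys, load_id=0, num_buckets=num_buckets)


if __name__ == "__main__":
    partition = write_bucket_files(range(0, 8), load_id=0)
    print(len(partition), layout_matches_buckets(partition))   # 4 True

    # "Append": a second load adds its own files next to the existing ones.
    partition.update(write_bucket_files(range(8, 16), load_id=1))
    print(len(partition), layout_matches_buckets(partition))   # 8 False

    # Compaction restores one file per bucket, at the cost of rewriting every row.
    partition = compact(partition)
    print(len(partition), layout_matches_buckets(partition))   # 4 True
```

In this model each append adds another full set of bucket files, so the file count grows by the bucket count per load, and only a full rewrite of the partition brings it back to one file per bucket. That mirrors the trade-off raised in the question; whether Hive behaves exactly this way depends on the version and table definition.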

Discussion Overview
group: user @
categories: hive, hadoop
posted: Mar 29, '11 at 2:37p
active: Mar 29, '11 at 2:37p
