Apr 28, 2014 at 7:20 pm
Parquet files, or more specifically Parquet row groups, are not splittable.
In other words, if a row group spans two nodes, only one node can be
responsible for processing its data.
Splits in Impala are generated based on physical boundaries (i.e. HDFS
block splits). For some file formats, like CSV, we can process a split
even if it starts in the middle of the file; these files are splittable.
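
To make the CSV case concrete, here is a minimal Python sketch of how a
line-oriented reader can start in the middle of a file. It is not Impala's
code; the function name read_split_lines and the ownership rule (a line
belongs to the split that contains its first byte) are assumptions made for
illustration.

```python
from typing import List


def read_split_lines(data: bytes, start: int, length: int) -> List[bytes]:
    """Return the complete lines owned by the byte range [start, start + length).

    Assumed rule: a line is owned by the split containing its first byte.
    """
    end = start + length
    pos = start
    if start != 0:
        # We may be in the middle of a line that the previous split owns.
        # Search from start - 1 so a line beginning exactly at `start`
        # is kept rather than skipped.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1

    lines = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        # The last owned line may run past `end`; read it to completion.
        lines.append(data[pos:stop])
        pos = stop + 1
    return lines
```

With this rule every line is read exactly once no matter where the split
boundaries fall. A Parquet row group has no such resynchronization point,
which is why it cannot be handled the same way.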
We expect most Parquet files to contain a single row group, and therefore
assign the entire file to one node. There's definitely room for improvement
for files that contain multiple row groups: we can read the file metadata to
find where each row group starts in the file (i.e. logical splits).
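
As a rough illustration of the two strategies, here is a small Python
sketch. It is not the Impala scanner; Split, RowGroup, and both functions
are hypothetical names, and the rule "a row group goes to the split that
contains its first byte" is only one possible way to form logical splits.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Split:
    offset: int  # byte offset of the physical split within the file
    length: int  # number of bytes in the split


@dataclass(frozen=True)
class RowGroup:
    start: int  # byte offset of the row group, as read from file metadata
    size: int   # total bytes in the row group


def whole_file_assignment(splits: List[Split],
                          row_groups: List[RowGroup]) -> Dict[Split, List[RowGroup]]:
    """Behavior described in this thread: only the split at offset 0 does
    any work, and it scans every row group in the file."""
    return {s: (list(row_groups) if s.offset == 0 else []) for s in splits}


def logical_assignment(splits: List[Split],
                       row_groups: List[RowGroup]) -> Dict[Split, List[RowGroup]]:
    """Possible improvement (assumed rule): give each row group to the
    split whose byte range contains the row group's first byte."""
    result: Dict[Split, List[RowGroup]] = {s: [] for s in splits}
    for rg in row_groups:
        for s in splits:
            if s.offset <= rg.start < s.offset + s.length:
                result[s].append(rg)
                break
    return result
```

In both cases each row group is processed by exactly one node; the second
strategy only spreads the row groups of a multi-row-group file across the
nodes that hold its splits.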
On Wed, Apr 23, 2014 at 2:17 PM, Xiu Guo wrote:
I noticed that when Impala reads a Parquet file, it ignores splits whose
offset is not 0, whereas ParquetMR reads all splits.
I understand the reason is that Impala reads the whole file instead for
better sequential IO performance.
Can anyone tell me what the concept of splits is in the Impala Parquet
scanner and why it is there? (If we read whole files, why do we still
introduce splits?)