Hi:

I noticed that when Impala reads a Parquet file, it ignores splits whose
offset is not 0, whereas ParquetMR reads all splits.

I understand the reason is that Impala reads the whole file instead, for
better sequential I/O performance.

Can anyone explain what a split means in the Impala Parquet scanner and
why it exists? (If we read whole files, why do we still introduce splits?)

Thanks!

To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].


  • Nong Li at Apr 28, 2014 at 7:20 pm
    Parquet files, or more specifically Parquet row groups, are not splittable.
    In other words, if you have a row group that spans two nodes, only one node
    can be responsible for processing its data.

    Splits in Impala are generated based on physical boundaries (i.e. HDFS
    block boundaries). For some file formats, like CSV, we can process a split
    even if it starts in the middle of the file; these files are splittable.
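
To illustrate what "splittable" means for a line-oriented format such as CSV, here is a minimal sketch in Python. It is not Impala's code; the function name `read_lines_for_split` and the in-memory `bytes` input are assumptions made for the example. It uses the common convention that a line belongs to the split in which its first byte falls, so a reader that starts mid-file skips the partial line at its start and reads past its end to finish the last line it owns.

```python
import io
from typing import List


def read_lines_for_split(data: bytes, offset: int, length: int) -> List[bytes]:
    """Return the lines owned by the byte range [offset, offset + length).

    Hypothetical helper for illustration only. A line is owned by the
    split that contains its first byte.
    """
    stream = io.BytesIO(data)
    end = offset + length
    if offset > 0:
        # Step back one byte and discard through the next newline. If the
        # previous byte is itself a newline, a line starts exactly at
        # `offset` and is kept; otherwise the partial line is skipped,
        # because the previous split owns it.
        stream.seek(offset - 1)
        stream.readline()
    lines = []
    # Keep reading while the next line starts inside this split, even if
    # that line runs past the end of the split.
    while stream.tell() < end:
        line = stream.readline()
        if not line:
            break  # end of file
        lines.append(line.rstrip(b"\n"))
    return lines


if __name__ == "__main__":
    data = b"a,1\nbb,2\nccc,3\ndddd,4\n"
    for off in (0, 8, 16):
        print(off, read_lines_for_split(data, off, 8))
```

Each split can be processed independently and every line is read exactly once. A Parquet row group has no comparable record delimiter to resynchronize on, which is why it cannot be handled the same way.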

    We expect most Parquet files to contain one row group and therefore assign
    the entire file to one node. There's definitely room for improvement for
    files that contain multiple row groups: we could read the file metadata to
    find the start of each row group in the file (i.e. logical splits).
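
The two assignment strategies discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not Impala's implementation: the `Split` and `RowGroup` types, both function names, and the midpoint rule for logical splits are all hypothetical choices made for the example. The row group offsets are assumed to have already been read from the file metadata.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Split:
    offset: int  # byte offset of the physical split within the file
    length: int  # number of bytes in the split


@dataclass(frozen=True)
class RowGroup:
    start: int  # byte offset of the row group within the file
    size: int   # total bytes occupied by the row group


def scans_whole_file(split: Split) -> bool:
    """Behaviour described in the thread: only the split at offset 0
    processes the file; splits at any other offset are ignored."""
    return split.offset == 0


def row_groups_for_split(split: Split, row_groups: List[RowGroup]) -> List[RowGroup]:
    """Hypothetical logical-split rule: a row group is owned by the split
    that contains its midpoint, so each row group has exactly one owner."""
    end = split.offset + split.length
    return [
        rg for rg in row_groups
        if split.offset <= rg.start + rg.size // 2 < end
    ]


if __name__ == "__main__":
    splits = [Split(0, 128), Split(128, 128), Split(256, 128)]
    groups = [RowGroup(4, 100), RowGroup(104, 100), RowGroup(204, 60)]
    for s in splits:
        print(s, scans_whole_file(s), row_groups_for_split(s, groups))
```

Under the first rule a single node does all the work for the file. Under the second, row groups are distributed across the nodes holding the splits, while each row group is still processed by only one node.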




Discussion Overview
group: impala-user @
categories: hadoop
posted: Apr 23, '14 at 9:17p
active: Apr 28, '14 at 7:20p
posts: 2
users: 2
website: cloudera.com
irc: #hadoop

2 users in discussion

Xiu Guo: 1 post Nong Li: 1 post
