Yes, this is expected behavior in a column store.

This page describes the relative benefits and trade-offs of column stores
versus row stores:

http://en.wikipedia.org/wiki/Column-oriented_DBMS#Benefits

On Fri, May 24, 2013 at 12:18 PM, wrote:

Hi,

When I run a select query on Parquet data with ~50 million rows and 10
columns, I get much worse performance as I select more columns.
Suppose the following query returns 3 rows:

select a from table where a = 12345;

This query returns in 2 seconds. Then if I query:

select a, b from table where a = 12345;

the query returns in 4 seconds, and so on. Is this expected behaviour,
given that Parquet is a columnar store? Is there a way to optimise this?


  • Todd Lipcon at May 25, 2013 at 12:32 am
    One thing worth noting is that many column stores do lazy IO on
    dependent columns after applying predicates on others. In this
    example, it seems like your predicate on the 'a' column is restrictive
    -- so it shouldn't have to do any IO on the 'b' column except where a
    = 12345.

    AFAIK Impala doesn't yet make this optimization in the case of
    Parquet, but I imagine it's something to look forward to in future
    versions.

    -Todd

    --
    Todd Lipcon
    Software Engineer, Cloudera
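Todd's point about lazy IO (often called late materialization) can be sketched with a toy columnar layout. This is a hedged illustration, not Impala or Parquet code; the column arrays, the `fetch` helper, and the read counters are all invented for the example:

```python
# A toy columnar layout: each column is stored separately, as in Parquet.
# Everything here is invented for illustration; it is not Impala's
# implementation.
columns = {
    "a": [10, 12345, 7, 12345, 99, 3],
    "b": ["u", "v", "w", "x", "y", "z"],
}
reads = {"a": 0, "b": 0}  # per-column value fetches, a stand-in for IO

def fetch(col, i):
    reads[col] += 1          # one unit of simulated IO
    return columns[col][i]

def eager_scan(pred):
    """Materialize every selected column for every row, then filter."""
    n = len(columns["a"])
    rows = [(fetch("a", i), fetch("b", i)) for i in range(n)]
    return [r for r in rows if pred(r[0])]

def lazy_scan(pred):
    """Late materialization: evaluate the predicate on 'a' first, then
    fetch 'b' only at the matching row positions."""
    n = len(columns["a"])
    hits = [i for i in range(n) if pred(fetch("a", i))]
    return [(columns["a"][i], fetch("b", i)) for i in hits]

pred = lambda a: a == 12345
assert eager_scan(pred) == [(12345, "v"), (12345, "x")]
eager_b_reads = reads["b"]           # 6: 'b' fetched for every row
reads = {"a": 0, "b": 0}
assert lazy_scan(pred) == [(12345, "v"), (12345, "x")]
lazy_b_reads = reads["b"]            # 2: 'b' fetched only for matches
```

Both scans read every value of the predicate column 'a', but the lazy scan touches 'b' only where the predicate held, which is why a highly selective predicate should make the extra selected columns nearly free once the optimization exists.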
  • Gerrard Mcnulty at May 27, 2013 at 9:09 am
    Is this optimization on the roadmap? Meanwhile, is there a way to mitigate
    this? E.g. if I change the schema so that there are more partitions, but
    run the query over the same amount of data, will Impala know to read
    fewer partitions for the dependent columns?
  • Nong Li at May 28, 2013 at 3:44 pm
    This optimization is on the roadmap. The first step would be to simply
    apply predicates before decoding the other columns, which lets us skip
    all of the CPU cost for those columns. Saving the IO cost requires a
    bit more thought.

    I'm not sure what you are suggesting as a way to mitigate this. Impala
    does partition pruning and will not read from files that belong to
    filtered-out partitions. You can change the schema and this will be
    reflected, but changing which columns are the partitioning columns is
    *not* a metadata-only change. The partitioning columns are not stored
    in the actual data files, so changing them will likely require
    re-writing the data files.
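Nong's two points, that pruning skips whole partitions and that partition values live in the directory structure rather than in the data files, can be sketched like this. The paths and helper functions are hypothetical, not Impala code:

```python
# Partition values are encoded in directory names (key=value), not stored
# inside the data files, so pruning means skipping whole directories.
# All paths below are invented for illustration.
data_files = [
    "warehouse/t/year=2012/month=11/part-0.parq",
    "warehouse/t/year=2013/month=04/part-0.parq",
    "warehouse/t/year=2013/month=05/part-0.parq",
    "warehouse/t/year=2013/month=05/part-1.parq",
]

def partition_values(path):
    """Parse key=value partition components out of a file path."""
    parts = {}
    for seg in path.split("/"):
        if "=" in seg:
            k, v = seg.split("=", 1)
            parts[k] = v
    return parts

def prune(files, **filters):
    """Keep only files whose partition values satisfy the equality
    filters; files in non-matching partitions are never opened."""
    return [f for f in files
            if all(partition_values(f).get(k) == v
                   for k, v in filters.items())]

# WHERE year = '2013' AND month = '05' touches two files, not four:
assert prune(data_files, year="2013", month="05") == data_files[2:]
```

This also shows why re-partitioning is not a metadata-only change: to partition the same table by a different column, the values of that column would have to move out of the file contents and into new directory names, i.e. the files must be rewritten.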

  • Jung-Yup Lee at May 25, 2013 at 5:08 am
    I found the "-*num_threads_per_disk*" option in Impala, which determines
    the maximum number of threads per disk (the default value is 1).

    I am not quite sure, but increasing this value might help read the
    column chunks in parallel and thereby lower query latency.

    Am I right, experts?

    Thanks


  • Nong Li at May 28, 2013 at 3:26 pm

    On Sat, May 25, 2013 at 1:07 AM, Jung-Yup Lee wrote:
    I find the "-*num_threads_per_disk*" option in Impala which determines
    the maximum number of the threads per disk. (The default value is 1)

    I am not quite sure, but increasing this option value might help to read
    the column chunks in parallel. Therefore, the query latency will be lowered.

    Am I right, experts?
    This is not correct. On spinning disks, the optimal value for
    num_threads_per_disk is 1, which minimizes the number of disk seeks.
    To read multiple column chunks in parallel, we'd need each column
    stored on a different disk, which would require storage-layer
    (i.e. HDFS) changes.

Discussion Overview
group: impala-user
categories: hadoop
posted: May 24, '13 at 10:27p
active: May 28, '13 at 3:44p
posts: 6
users: 5
website: cloudera.com
irc: #hadoop
