Grokbase Groups: Pig user, March 2013
Fellow Hadoopers,

We'd like to introduce a joint project between Twitter and Cloudera
engineers -- a new columnar storage format for Hadoop called Parquet (
http://parquet.github.com).

We created Parquet to make the advantages of compressed, efficient columnar
data representation available to any project in the Hadoop ecosystem,
regardless of the choice of data processing framework, data model, or
programming language.

Parquet is built from the ground up with complex nested data structures in
mind. We adopted the repetition/definition level approach to encoding such
data structures, as described in Google's Dremel paper; we have found this
to be a very efficient method of encoding data in non-trivial object
schemas.
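
To make the repetition/definition level idea concrete, here is a small,
self-contained sketch (in Java, not part of the Parquet code; the schema and
records are made up) showing how the values of one optional, repeated field
are flattened into a column of (repetition level, definition level, value)
entries in the style of the Dremel paper. Only the non-null values need to be
stored; the levels alone record which rows were missing the group or the value.

    // Illustrative only -- not the Parquet API. Hypothetical schema:
    //
    //   message Contact {
    //     optional group info {      // definition level 1 when the group is present
    //       repeated binary phone;   // definition level 2 when a value is present
    //     }
    //   }
    //
    // Repetition level 1 means "another phone within the same record";
    // repetition level 0 starts a new record.
    public class LevelsExample {
        public static void main(String[] args) {
            // Record A: info { phone: "555-0001", phone: "555-0002" }
            // Record B: info { }             (group present, no phone values)
            // Record C: (info missing entirely)
            int[]    rep = {0, 1, 0, 0};
            int[]    def = {2, 2, 1, 0};
            String[] val = {"555-0001", "555-0002", null, null};

            for (int i = 0; i < rep.length; i++) {
                System.out.printf("r=%d d=%d value=%s%n", rep[i], def[i], val[i]);
            }
        }
    }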

Parquet is built to support very efficient compression and encoding
schemes. Parquet allows compression schemes to be specified at the per-column
level, and is future-proofed to allow more encodings to be added as they are
invented and implemented. We separate the concepts of encoding and
compression, allowing Parquet consumers to implement operators that work
directly on encoded data without paying a decompression and decoding penalty
when possible.
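
As a toy illustration of that last point (operating directly on encoded data),
and not of Parquet's actual internals: with a run-length-encoded column, a
count over a predicate can be answered run by run, without ever materializing
the individual rows.

    // Toy illustration of operating on encoded data -- not Parquet code.
    // A run-length-encoded column stores (value, run length) pairs; a count of
    // rows matching a predicate is computed per run, without expanding the runs.
    public class RleCountExample {
        public static void main(String[] args) {
            // Encoded column: 5 x "US", 3 x "CA", 4 x "US"  (12 logical rows)
            String[] runValues  = {"US", "CA", "US"};
            long[]   runLengths = {5, 3, 4};

            long matches = 0;
            for (int i = 0; i < runValues.length; i++) {
                if (runValues[i].equals("US")) {
                    matches += runLengths[i];   // the whole run is counted at once
                }
            }
            System.out.println("rows where country = US: " + matches);  // 9
        }
    }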

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
data processing frameworks, and we are not interested in playing favorites.
We believe that an efficient, well-implemented columnar storage substrate
should be useful to all frameworks without the cost of extensive and
difficult-to-set-up dependencies.

The initial code, available at https://github.com/Parquet, defines the file
format, provides Java building blocks for processing columnar data, and
implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an example
of a complex integration -- Input/Output formats that can convert
Parquet-stored data directly to and from Thrift objects.
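
For readers wondering how the Hadoop Input/Output Formats are meant to plug
in, here is a minimal job-wiring sketch. The parquet.hadoop class names follow
the code in the GitHub repo; the MyReadSupport/MyWriteSupport classes are
placeholders for user-provided adapters, and the exact helper signatures
should be treated as assumptions and checked against the current code.

    // Sketch only: wiring the Parquet Hadoop formats into a MapReduce job.
    // Package/class names follow the initial GitHub code (parquet.hadoop.*);
    // the ReadSupport/WriteSupport placeholders below stand in for real
    // adapters that map records (e.g. Pig tuples or Thrift objects) to columns.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    import parquet.hadoop.ParquetInputFormat;
    import parquet.hadoop.ParquetOutputFormat;

    public class ParquetJobSetup {
        // Hypothetical placeholders; in the real library these would extend the
        // ReadSupport/WriteSupport base classes provided by parquet-hadoop.
        static class MyReadSupport {}
        static class MyWriteSupport {}

        public static Job configure(Configuration conf) throws Exception {
            Job job = Job.getInstance(conf, "parquet-example");

            // Read Parquet files; the read support turns columnar data back
            // into the caller's record type.
            job.setInputFormatClass(ParquetInputFormat.class);
            ParquetInputFormat.setReadSupportClass(job, MyReadSupport.class);

            // Write Parquet files; the write support does the reverse.
            job.setOutputFormatClass(ParquetOutputFormat.class);
            ParquetOutputFormat.setWriteSupportClass(job, MyWriteSupport.class);

            return job;
        }
    }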

A preview version of Parquet support will be available in Cloudera's Impala
0.7.

Twitter is starting to convert some of its major data sources to Parquet in
order to take advantage of the compression and deserialization savings.

Parquet is currently under heavy development. Parquet's near-term roadmap
includes:
* Hive SerDes (Criteo)
* Cascading Taps (Criteo)
* Support for dictionary encoding, zigzag encoding, and RLE encoding of
data (Cloudera and Twitter); see the zigzag sketch below
* Further improvements to Pig support (Twitter)

Company names in parentheses indicate whose engineers signed up to do the
work -- others can feel free to jump in too, of course.
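
Since zigzag and RLE encodings appear on the roadmap above, here is a quick,
self-contained sketch of zigzag encoding (the standard technique, not
necessarily the eventual Parquet implementation): it maps signed integers to
small unsigned codes so that downstream variable-length or run-length
encodings stay compact.

    // Illustrative sketch of zigzag encoding for 32-bit signed integers.
    // Small negative and positive numbers both map to small unsigned codes:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    public class ZigZagExample {
        static int encode(int n) {
            return (n << 1) ^ (n >> 31);    // arithmetic shift propagates the sign bit
        }

        static int decode(int z) {
            return (z >>> 1) ^ -(z & 1);    // undo the interleaving
        }

        public static void main(String[] args) {
            int[] samples = {0, -1, 1, -2, 2, 123, -123};
            for (int n : samples) {
                int z = encode(n);
                System.out.printf("%5d -> zigzag %5d -> back %5d%n", n, z, decode(z));
            }
        }
    }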

We've also heard requests to provide an Avro container layer, similar to
what we do with Thrift. Seeking volunteers!

We welcome all feedback, patches, and ideas; to foster community
development, we plan to contribute Parquet to the Apache Incubator when the
development is further along.

Regards,
Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
Jonathan Coveney, and friends.

  • Stan Rosenberg at Mar 12, 2013 at 6:01 pm
    Dmitriy,

Please excuse my ignorance. What is/was wrong with Trevni
(https://github.com/cutting/trevni)?

    Thanks,

    stan
  • Kevin Olson at Mar 12, 2013 at 8:06 pm
    Second on that. Parquet looks compelling, but I'm curious to understand why
    Cloudera suddenly switched from espousing future support for Trevni to
    teaming with Twitter on Parquet.

  • Jonathan Coveney at Mar 12, 2013 at 10:08 pm
    Super excited that this is finally public. The benefits are huge, and
having an (eventually) battle-tested columnar storage format developed for
    a diverse set of needs will be awesome.


  • Jarek Jarcec Cecho at Mar 13, 2013 at 7:39 pm
Cloudera has published a blog post [1] about Parquet that seems to answer most of the questions. I would encourage you to read that article. It specifically talks about the relationship with Trevni:

    Parquet is designed to bring efficient columnar storage to Hadoop. Compared to, and learning from, the initial work done toward this goal in Trevni, Parquet includes the following enhancements:

    * Efficiently encode nested structures and sparsely populated data based on the Google Dremel definition/repetition levels
    * Provide extensible support for per-column encodings (e.g. delta, run length, etc)
    * Provide extensibility of storing multiple types of data in column data (e.g. indexes, bloom filters, statistics)
    * Offer better write performance by storing metadata at the end of the file
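
To illustrate the last point above, here is a simplified, hypothetical sketch
of a single-pass writer: column chunks are streamed out as they are produced,
their offsets are recorded, and a footer describing them is appended last,
followed by its own length, so nothing has to be buffered or rewritten. (In
the actual Parquet format the footer is a Thrift-serialized metadata
structure; the byte layout below is purely illustrative.)

    // Simplified, hypothetical sketch of end-of-file metadata -- not the real
    // Parquet footer. Chunks are written in one pass while their offsets are
    // recorded; the footer goes last, followed by its length so a reader can
    // find it by seeking to the tail of the file.
    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class FooterSketch {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream file = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(file);

            // Single pass over the data: write each column chunk, remember its offset.
            String[] chunks = {"col-a-data", "col-b-data"};
            int[] offsets = new int[chunks.length];
            for (int i = 0; i < chunks.length; i++) {
                offsets[i] = out.size();
                out.writeBytes(chunks[i]);
            }

            // Footer written last: no buffering of data, no seeking back to a header.
            int footerStart = out.size();
            for (int i = 0; i < chunks.length; i++) {
                out.writeInt(offsets[i]);           // where each chunk begins
                out.writeInt(chunks[i].length());   // and how many bytes it holds
            }
            out.writeInt(out.size() - footerStart); // footer length, read from the tail

            System.out.println("total file bytes: " + file.size());
        }
    }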

    Jarcec

    Links:
    1: http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/
