FAQ
I have some questions about this upcoming data format.

First of all, is there going to be an easy transition to using it? We
currently have all our data in Avro Snappy-compressed format, and to get
from the original raw CSV Zip files into this format, we use Hive along
with a custom CSV Zip InputFormat and SerDe to insert the data into a
table that already uses Avro Snappy as its data format. Will it be
similar to this?

Will MapReduce be compatible with this? If so, how soon and what will it
take? Some of our users are still writing streaming MR jobs.

Will Hive work with it at the same time as Impala when Parquet is released?
Can we just specify the InputFormat and SerDe?

Lastly, how is Parquet pronounced? Is the "t" silent like in French?

Thanks,
Ben


  • Harsh J at Apr 2, 2013 at 4:34 pm
    Hi Benjamin,

    I can answer the MR compatibility question: format implementations plus
    Apache Hadoop InputFormat/OutputFormat classes are already available in
    Parquet-MR: https://github.com/Parquet/parquet-mr, which should be readily
    consumable; I also see test cases there.

    I'll let more qualified Impala/Hive/French folks answer the rest.


    --
    Harsh J
  • Adam Smieszny at Apr 2, 2013 at 5:54 pm
    Re: pronunciation: yes, sounds like "par-kay"



    --
    Adam Smieszny
    Cloudera | Systems Engineer | http://www.linkedin.com/in/adamsmieszny
    917.830.4156
  • Benjamin Kim at Apr 17, 2013 at 5:12 pm
    Hi Nong,

    Any information on how to get Parquet installed into CDH4.2 for use? I want
    to create a Hive table into which I can insert some data for a test. I also
    want to see how it compares with our Avro-backed tables in terms of speed
    and storage.

    Thanks,
    Ben
    On Wednesday, April 3, 2013 4:16:53 PM UTC-7, Nong wrote:

    Answers inline:

    > First of all, is there going to be an easy transition to using it? We
    > currently have all our data in Avro Snappy-compressed format, and to get
    > from the original raw CSV Zip files into this format, we use Hive along
    > with a custom CSV Zip InputFormat and SerDe to insert the data into a
    > table that already uses Avro Snappy as its data format. Will it be
    > similar to this?
    Converting data into Parquet will be very similar to converting it to Avro
    Snappy. If your current data conversion is done using MR, you will be able
    to use the Parquet MR output format. If you are currently converting your
    data using Hive (i.e. INSERT OVERWRITE tbl SELECT *), you'll be able to do
    that using Impala.
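
    As an illustration, the Impala-side conversion described above might look
    like the following minimal sketch (hypothetical table and column names;
    STORED AS PARQUETFILE was the Impala beta syntax for Parquet tables):

        -- Minimal sketch (hypothetical names): create a Parquet-backed table
        -- in Impala and populate it from an existing Avro-backed table.
        CREATE TABLE events_pq (
          event_id   BIGINT,
          event_time STRING,
          payload    STRING
        ) STORED AS PARQUETFILE;

        -- Rewrite the Avro data as Parquet in one pass.
        INSERT OVERWRITE TABLE events_pq
        SELECT event_id, event_time, payload
        FROM events_avro;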

    > Will MapReduce be compatible with this? If so, how soon and what will it
    > take? Some of our users are still writing streaming MR jobs.
    Yup. The MR input/output format work has been developed side by side with
    the file format spec and the Impala implementation. You can find more
    details in the public repo: https://github.com/Parquet/parquet-mr

    > Will Hive work with it at the same time as Impala when Parquet is
    > released? Can we just specify the InputFormat and SerDe?
    The Hive SerDe is being developed by Criteo, building on top of the MR
    project I linked above. We won't hold up the Impala release for the Hive
    support, but the plan is to have it ready around the Impala GA time frame.
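
    As a sketch of what "just specifying the InputFormat and SerDe" could look
    like once that work landed, the Hive DDL below uses the class names from
    the early Criteo parquet-hive code; the table definition is hypothetical,
    so verify the class names against the parquet-hive jar in your build:

        -- Hypothetical Hive table wired to the early parquet-hive classes.
        CREATE TABLE events_pq_hive (
          event_id   BIGINT,
          event_time STRING
        )
        ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
        STORED AS
          INPUTFORMAT  'parquet.hive.DeprecatedParquetInputFormat'
          OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';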

    > Lastly, how is Parquet pronounced? Is the "t" silent like in French?
  • Lenni Kuff at Apr 17, 2013 at 5:22 pm
    Hi,
    Please take a look at this page. It's pretty basic right now, but should
    get you started:
    http://www.cloudera.com/content/cloudera-content/cloudera-docs/ImpalaBeta/0.7/Installing-and-Using-Impala/ciiu_topic_7_3.html

    Let us know if you have any questions.

    Thanks,
    Lenni
    Software Engineer - Cloudera
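
    Following that page, a quick speed comparison like the one Ben describes
    might look roughly as below (hypothetical table names; storage is easiest
    to compare on the HDFS side, e.g. with hadoop fs -du -s on each table's
    directory):

        -- Hypothetical head-to-head timing test in impala-shell: the same
        -- scans over an Avro-backed table and its Parquet-backed copy.
        SELECT COUNT(*) FROM events_avro;   -- baseline scan over Avro
        SELECT COUNT(*) FROM events_pq;     -- same scan over Parquet

        -- A column-selective query, where Parquet's columnar layout helps.
        SELECT event_time, COUNT(*)
        FROM events_pq
        GROUP BY event_time;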
  • Benjamin Kim at Apr 18, 2013 at 3:46 pm
    Hi,

    Thanks for the info. I tried it out, and it worked. I love the speed!

    But there are a couple of things I noticed:
    - Dropping a table does not remove the underlying data directory in HDFS
    - Parquet data size is much larger than the equivalent Avro Snappy-compressed data

    Is there a way to make the Parquet data more compact? I created a
    Parquet-based table and inserted an hour's worth of data (from March 31st)
    from an Avro Snappy-compressed table.

    Avro Snappy: 7,962,451,273 bytes
    Parquet: 20,481,070,712 bytes

    Thanks,
    Ben

  • Nong Li at Apr 18, 2013 at 8:01 pm
    Parquet files being much bigger is unexpected. Can you share the schema?

    Thanks
    Nong

  • Benjamin Kim at Apr 18, 2013 at 8:53 pm
    Nong,

    I attached the Avro schema file for the table. The Parquet table's schema
    is identical to that of the Avro Snappy-compressed one.

    Below is the directory listing for the Parquet table.

    622838218   /user/beeswax/warehouse/adr_pq_event_events/-2081508980424160290--9192484851525113256_529777478_dir
    445251639   /user/beeswax/warehouse/adr_pq_event_events/-2081508980424160290--9192484851525113258_1477730649_dir
    1093088     /user/beeswax/warehouse/adr_pq_event_events/-2081508980424160290--9192484851525113259_347052596_dir
    530347694   /user/beeswax/warehouse/adr_pq_event_events/-2081508980424160290--9192484851525113260_1839317137_dir
    531299711   /user/beeswax/warehouse/adr_pq_event_events/-2081508980424160290--9192484851525113261_842811330_dir
    529442293   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983538_867261985_data.0
    530630467   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983538_867261985_data.1
    409234859   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983538_867261985_data.2
    525925073   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983539_1728157970_data.0
    527085682   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983539_1728157970_data.1
    293639074   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983539_1728157970_data.2
    530454237   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983540_867261985_data.0
    529808232   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983540_867261985_data.1
    276754375   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983540_867261985_data.2
    528974747   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983541_1076416864_data.0
    530230019   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983541_1076416864_data.1
    336322572   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983541_1076416864_data.2
    529983586   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983542_1299525363_data.0
    527976350   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983542_1299525363_data.1
    352928775   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983542_1299525363_data.2
    531242724   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983543_677657164_data.0
    529453910   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983543_677657164_data.1
    376021654   /user/beeswax/warehouse/adr_pq_event_events/-7074976523056625241--6632766509309983543_677657164_data.2
    368009839   /user/beeswax/warehouse/adr_pq_event_events/-7205768127447347556--7935676597130767607_988488665_dir
    482934248   /user/beeswax/warehouse/adr_pq_event_events/-7205768127447347556--7935676597130767608_9596934_dir
    319472164   /user/beeswax/warehouse/adr_pq_event_events/-7205768127447347556--7935676597130767609_1604000666_dir
    387702859   /user/beeswax/warehouse/adr_pq_event_events/-7205768127447347556--7935676597130767610_534348516_dir
    522622      /user/beeswax/warehouse/adr_pq_event_events/-7205768127447347556--7935676597130767611_712544616_dir
    8395490001  /user/beeswax/warehouse/adr_pq_event_events/dt=2013-03-31%2000%3A00

    Below is the directory listing for the Avro Snappy table.

    372020871   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000000_0.avro
    371528387   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000001_0.avro
    370889465   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000002_0.avro
    370593611   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000003_0.avro
    369253632   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000004_0.avro
    368696528   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000005_0.avro
    372592997   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000006_0.avro
    367655440   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000007_0.avro
    367738990   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000008_0.avro
    364263869   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000009_0.avro
    361501692   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000010_0.avro
    361533226   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000011_0.avro
    364234195   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000012_0.avro
    360737435   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000013_0.avro
    361056888   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000014_0.avro
    359081228   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000015_0.avro
    361034999   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000016_0.avro
    358344591   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000017_0.avro
    357988193   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000018_0.avro
    360549978   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000019_0.avro
    269585170   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000020_0.avro
    241061342   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000021_0.avro
    100405343   /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000022_0.avro
    50103203    /amg/adr/bi/event/events/1.0.0/ymdh/2013/03/31/00/1364828298062-965be4009adc11e2ade9782bcb2198c4/000023_0.avro

    Thanks,
    Ben
  • Benjamin Kim at Apr 18, 2013 at 8:54 pm
    Oops! I forgot to attach the schema file.
