Impala won't work with large parquet files
Use Parquet with a 1 GB block size and Snappy compression; hope that works.

Thanks
Deepak Gattala


Sent via the Samsung Galaxy Note® 3, an AT&T 4G LTE smartphone

-------- Original message --------
From: Pengcheng Liu <zenonlpc@gmail.com>
Date: 05/13/2014 7:20 AM (GMT-08:00)
To: impala-user@cloudera.org
Subject: Re: Impala won't work with large parquet files

I have Impala version vcdh5-1.3.0.

But I just noticed my block size is not 4 GB, it is 3.96 GB. Is that why my test failed? Does the block size have to be a multiple of 1 MB or 1 GB?

Thanks
Pengcheng


On Tue, May 13, 2014 at 10:10 AM, Zesheng Wu wrote:
I've tried the option on Impala 1.2.4, and it does work.


2014-05-13 22:07 GMT+08:00 Pengcheng Liu <zenonlpc@gmail.com>:

Hello Zesheng

I tried that, but it's still not working. This time, with a 4 GB block size, the query failed and did not return any values. Before, with a 1 GB block size, the query would complete and give me a result, along with some additional error log information.

Thanks
Pengcheng


On Sat, May 10, 2014 at 10:54 PM, Zesheng Wu wrote:
Hi Pengcheng, you can try this one in impala-shell:
set PARQUET_FILE_SIZE=${block_size_you_want_to_set};



2014-05-10 4:22 GMT+08:00 Pengcheng Liu <zenonlpc@gmail.com>:

Hello Lenni

I already tried the invalidate metadata command; it doesn't work.

I am writing the Parquet files from a MapReduce job, and after the job finishes I bring those files online through the Impala JDBC API.

Then I have to call invalidate metadata to see the table in Impala.
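
(For reference, a minimal sketch of that JDBC step; the driver class, connection URL, port 21050 with no authentication, and the table name are assumptions for illustration, not taken from this thread.)

// Sketch: make newly added Parquet files visible to Impala by issuing
// INVALIDATE METADATA over the same kind of JDBC connection used to online them.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RefreshImpalaMetadata {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // Hive JDBC driver
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://impalad-host:21050/;auth=noSasl");
             Statement stmt = conn.createStatement()) {
            stmt.execute("INVALIDATE METADATA mytable");  // placeholder table name
        }
    }
}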

I was wondering whether there are any configuration settings for Impala or HDFS that control the maximum block size of a file on HDFS.

Thanks
Pengcheng


On Thu, May 8, 2014 at 3:43 PM, Lenni Kuff wrote:
Hi Pengcheng,
Since Impala caches the table metadata, including block location information, you will need to run an "invalidate metadata <table name>" after you change the block size. Can you try running that command and then re-running your query?

Let me know how this works out. If it resolves the problem we can look at how to improve the error message in Impala to make it easier to diagnose.

Thanks,
Lenni


On Thu, May 8, 2014 at 8:15 AM, Pengcheng Liu wrote:
Hello experts

I have been working with Impala for a year, and the new Parquet format is really exciting.

I have Impala version vcdh5-1.3.0.

I have a data set of about 40 GB in Parquet (raw data is 500 GB) with 20 partitions, but the partitions are not evenly distributed.

When I set the block size to 1 GB, some of the files are split into multiple blocks since they are larger than 1 GB.

The Impala query works, but it gives me a warning saying it cannot query Parquet files with multiple blocks.

I saw some folks post a similar problem here, and one of the responses was to set the block size larger than the actual size of the file.

So I went ahead and tried that, using 10 GB as my HDFS block size.

Now my query failed with this error message:

ERROR: Error seeking to 3955895608 in file: hdfs://research-mn00.saas.local:8020/user/tablepar/201309/-r-00106.snappy.parquet
Error(22): Invalid argument
ERROR: Invalid query handle

Is this error due to the large block size I used? Is there any limit on the maximum block size we can create on HDFS?

Thanks
Pengcheng





--
Best Wishes!

Yours, Zesheng



--
Best Wishes!

Yours, Zesheng

  • Pengcheng Liu at May 19, 2014 at 2:02 pm
    Hello Deepak

    When using a 1 GB block size, the query works, but the query results include
    a warning that a Parquet file should not span multiple blocks.

    That is why I tried a larger block size to get rid of the warning, but so far
    I have been unsuccessful.

    Thanks
    Pengcheng


  • Alan Choi at May 24, 2014 at 2:07 am
    Hi,

    Would you mind sharing the following with us?

    1. The script that generates the Parquet files.
    2. The Parquet file size and the HDFS block size.

    Thanks,
    Alan

    On Tue, May 20, 2014 at 6:40 AM, Pengcheng Liu wrote:

    Hello Alan

    Thanks for the answer. I was generating the Parquet files from a MapReduce
    job using the parquet-mr package.

    When I set the block size to 1 GB, the query works fine but with a warning
    that says a Parquet file should not span multiple blocks.

    Then I tried to get rid of the warning by increasing the block size so my
    big Parquet file can live in one block. But the query fails, and so far I
    have not been able to query the new table successfully. I also followed a
    suggestion here and set the PARQUET_FILE_SIZE parameter to my block size,
    which doesn't work.

    I tried all of this on both versions: 1.3.0 and 1.3.1.

    Thanks
    Pengcheng

    On Mon, May 19, 2014 at 8:37 PM, Alan Choi wrote:

    Hi Pengcheng,

    If you're generating the Parquet file from Impala, then Impala should
    correctly create one block per file for you. If your data is more than 1 GB,
    Impala should split it into multiple files.

    If you're generating the Parquet file in Hive, then you need to set
    "dfs.block.size" to 1 GB.

    Thanks,
    Alan
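
    (A minimal sketch of that setting for a non-Impala writer, using a plain
    Hadoop Configuration; the 1 GB value is illustrative, and both the old and
    the current property names are set since different versions read different
    keys.)

    import org.apache.hadoop.conf.Configuration;

    // Request a 1 GB HDFS block size for the files the writing job creates.
    long oneGb = 1024L * 1024L * 1024L;
    Configuration conf = new Configuration();
    conf.setLong("dfs.block.size", oneGb);  // older key, as mentioned above
    conf.setLong("dfs.blocksize", oneGb);   // current key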

  • 邓 展成 at May 27, 2014 at 9:42 am
    Hi, Impala team members:

    I cloned Impala from GitHub (https://github.com/cloudera/Impala) and found
    that the Hadoop version in bin/impala-config.sh is still 4.5.0. Why not
    upgrade to 5.0.0 or 5.0.1?

    Best wishes,

    Charles Deng

  • 邓 展成 at May 28, 2014 at 2:01 am
    Hi, Lenni & Matt:

    I get it now. I downloaded the impala-1.3.1-cdh5.0.1-src.tar.gz from your
    website, uncompressed it, and then compiled it with the buildall.sh script,
    but a compile error occurs.

    I think there is no .git repository in the Impala source tar.gz, so when
    git clean -dfx is run in bin/build_thirdparty.sh, the files in thirdparty
    are deleted.

    How can I work around this problem?

    Thanks very much.

    On May 28, 2014, at 2:06 AM, Lenni Kuff wrote:

    To add on to what Matt said: although external mirroring is not yet configured between our internal and external repos for all release branches, you can find the Impala source for CDH5 here:
    http://archive.cloudera.com/cdh5/cdh/5/

    For example:
    http://archive.cloudera.com/cdh5/cdh/5/impala-1.3.1-cdh5.0.1-src.tar.gz

    Thanks,
    Lenni



    On Tue, May 27, 2014 at 10:39 AM, Matthew Jacobs wrote:
    Hi Charles,

    The master branch on the public Impala GitHub
    (https://github.com/cloudera/Impala) is our upstream branch and, as you
    noticed, it is set up to target CDH4. Internally, we have a build process to
    make Impala target CDH5, but that is not yet public. Sorry for the
    inconvenience.

    Best,
    Matt
  • Pengcheng Liu at May 27, 2014 at 1:49 pm
    Hello Alan

    Here is the code that generates the Parquet files.

    // Reducer that writes out the Parquet files.
    // (Imports shown for completeness; the parquet.* package names are the
    // pre-Apache parquet-mr ones that shipped with CDH5 at the time.)
    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import parquet.example.data.Group;
    import parquet.hadoop.example.ExampleOutputFormat;
    import parquet.hadoop.metadata.CompressionCodecName;

    public static class OutputReducer extends
            Reducer<Text, Text, NullWritable, Group> {

        private MultipleOutputs<NullWritable, Group> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<NullWritable, Group>(context);
        }

        @Override
        protected void reduce(Text offset, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // The key carries the partition name (year/month); every value
            // becomes one Parquet Group written under that partition directory.
            // createGroup(...) is the poster's helper that converts a text
            // record into a Parquet Group.
            String partitionName = offset.toString();
            for (Text v : values) {
                Group group = createGroup(v.toString());
                mos.write("yearmonth", null, group, partitionName + "/");
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            mos.close();
        }
    }

    // MapReduce job configuration
    conf.setInt("mapreduce.job.reduces", numTasks);
    conf.set("mapred.child.java.opts", "-Xmx6g");
    conf.set("mapreduce.map.java.opts", "-Xmx2g");
    conf.set("mapreduce.reduce.java.opts", "-Xmx6g");
    conf.set("mapreduce.map.speculative", "false");
    conf.set("mapreduce.reduce.speculative", "false");
    conf.set("mapreduce.map.output.compress", "true");
    conf.set("mapred.output.compress", "true");
    conf.set("mapreduce.map.output.compress.codec",
            "org.apache.hadoop.io.compress.SnappyCodec");
    conf.set("mapreduce.output.fileoutputformat.compress.codec",
            "org.apache.hadoop.io.compress.SnappyCodec");

    // HDFS block size for the output files: 4 GB (the 1 GB setting is commented out).
    conf.set("dfs.blocksize", "4294967296");
    //conf.set("dfs.blocksize", "1073741824");

    // Parquet output: schema, Snappy compression, 128 MB row-group ("block")
    // size, dictionary encoding enabled.
    ExampleOutputFormat.setSchema(job, writeMessageType);
    job.setOutputFormatClass(ExampleOutputFormat.class);
    ExampleOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);
    ExampleOutputFormat.setBlockSize(job, 134217728);
    ExampleOutputFormat.setEnableDictionary(job, true);

    MultipleOutputs.addNamedOutput(job, "yearmonth", ExampleOutputFormat.class,
            NullWritable.class, Group.class);

    The HDFS block size is 128 MB.

    For the Parquet file block size I tried 4 GB, 10 GB, and 1 GB.

    Thanks
    Pengcheng

  • Pengcheng Liu at May 29, 2014 at 8:16 pm
    Hello guys

    I transferred those Parquet files to another cluster with Impala 1.2.4, and
    the query failed with a different error message and did not return any
    results.

    For now, I just changed my MapReduce code to make sure the Parquet file size
    doesn't go over the block size.
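
    (For reference, a minimal sketch of that kind of change, based on the job
    configuration posted earlier in this thread: give the output files an HDFS
    block size at least as large as the largest Parquet file, and keep the
    Parquet row-group size below it. The 1 GB and 256 MB figures are
    illustrative, and the cap on how much data goes into each output file still
    has to be enforced by the job itself.)

    // Illustrative only: 1 GB HDFS blocks for the output files, 256 MB Parquet
    // row groups. Each output file must still be kept under 1 GB by the job,
    // for example by limiting how many rows are written per named output.
    // Package names are the pre-Apache parquet-mr ones used with CDH5 then.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import parquet.hadoop.example.ExampleOutputFormat;

    long oneGb = 1024L * 1024L * 1024L;
    Configuration conf = new Configuration();
    conf.setLong("dfs.blocksize", oneGb);
    Job job = Job.getInstance(conf, "parquet-write");
    ExampleOutputFormat.setBlockSize(job, 256 * 1024 * 1024);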

    Thanks
    Pengcheng

