Hi,

Would you mind sharing the following with us?

1. The script that generates the Parquet file.
2. The Parquet file size and the HDFS block size.

Thanks,
Alan

On Tue, May 20, 2014 at 6:40 AM, Pengcheng Liu wrote:

Hello Alan

Thanks for the answer. I was generating the Parquet files from a MapReduce
job using the parquet-mr package.

When I set the block size to 1 GB, the query works fine, but with a
warning saying that Parquet files shouldn't span multiple blocks.

I then tried to get rid of the warning by increasing the block size so my
big Parquet file can live in one block. But then my query fails, and so far I
haven't been able to query the new table. I also followed a suggestion here
and set the PARQUET_FILE_SIZE parameter to my block size, which
didn't work.

I tried all of this on both versions: 1.3.0 and 1.3.1.
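
For reference, in parquet-mr the row-group size is typically controlled by
the parquet.block.size property, and the HDFS block size of the output
files by dfs.blocksize. A hedged sketch of a job submission that keeps both
at 1 GiB (the jar, class, and paths are made up):

```shell
# Hedged sketch: align the Parquet row-group size with the HDFS block
# size so each row group lands in a single block. The jar name, class
# name, and paths are hypothetical; the two -D properties are the point.
GB=$((1024 * 1024 * 1024))   # 1 GiB in bytes
cat <<EOF
hadoop jar my-etl-job.jar com.example.ParquetWriterJob \\
  -D parquet.block.size=${GB} \\
  -D dfs.blocksize=${GB} \\
  /input/path /output/parquet
EOF
```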

Thanks
Pengcheng

On Mon, May 19, 2014 at 8:37 PM, Alan Choi wrote:

Hi Pengcheng,

If you're generating the Parquet file from Impala, then Impala should
correctly create one block per file for you. If your data is more than 1 GB,
Impala should split it into multiple files.

If you're generating the Parquet file in Hive, then you need to set
"dfs.block.size" to 1 GB.
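
A hedged sketch of what that could look like (the table names are
hypothetical, and parquet.block.size is assumed to apply to Hive's Parquet
writer as well):

```shell
# Hedged sketch: set a 1 GiB HDFS block size (and matching Parquet
# row-group size) for the files Hive writes. Table names are made up.
GB=$((1024 * 1024 * 1024))   # 1 GiB in bytes
cat <<EOF
hive -e "
  SET dfs.block.size=${GB};
  SET parquet.block.size=${GB};
  INSERT OVERWRITE TABLE my_parquet_table SELECT * FROM staging_table;
"
EOF
```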

Thanks,
Alan

On Mon, May 19, 2014 at 7:02 AM, Pengcheng Liu wrote:

Hello Deepak

When using a 1 GB block size, the query works, but there is a warning in
the query results about Parquet files spanning multiple blocks.

That is why I tried a larger block size to get rid of the warning, but so
far without success.

Thanks
Pengcheng


On Thu, May 15, 2014 at 3:13 PM, gvr.deepak wrote:

Use Parquet with a 1 GB block size and Snappy compression; hope that works.

Thanks
Deepak Gattala




-------- Original message --------
From: Pengcheng Liu
Date:05/13/2014 7:20 AM (GMT-08:00)
To: impala-user@cloudera.org
Subject: Re: Impala won't work with large parquet files

I have Impala version vcdh5-1.3.0.

But I just noticed my block size is not 4 GB, it is 3.96 GB. Is this the
reason my test failed? Does the block size have to be a multiple of 1 MB or 1 GB?

Thanks
Pengcheng

On Tue, May 13, 2014 at 10:10 AM, Zesheng Wu wrote:

I've tried the option on impala 1.2.4, it does work.


2014-05-13 22:07 GMT+08:00 Pengcheng Liu <zenonlpc@gmail.com>:

Hello Zesheng
I tried that; it's still not working. This time, when I used a 4 GB block,
the query failed without returning any values. Before, when I used a 1 GB
block size, the query would complete and give me a result, along with some
additional error-log information.

Thanks
Pengcheng

On Sat, May 10, 2014 at 10:54 PM, Zesheng Wu wrote:

Hi Pengcheng, you can try this one in impala-shell:

set PARQUET_FILE_SIZE=${block_size_you_want_to_set};
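
For example, for a 1 GiB target (the value is in bytes; the table names
below are hypothetical):

```shell
# Hedged sketch: set PARQUET_FILE_SIZE before the INSERT that writes
# the table, so Impala caps each output file at roughly 1 GiB.
GB=$((1024 * 1024 * 1024))   # 1073741824 bytes
cat <<EOF
impala-shell -q "
  set PARQUET_FILE_SIZE=${GB};
  insert overwrite table my_parquet_table select * from staging_table;
"
EOF
```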


2014-05-10 4:22 GMT+08:00 Pengcheng Liu <zenonlpc@gmail.com>:

Hello Lenni
I already tried the invalidate metadata command; it doesn't work.

I am writing the Parquet files from a MapReduce job, and after the
job finishes I bring those files online through the Impala JDBC API.

Then I have to call invalidate metadata to see the table in Impala.

I was wondering if there is any configuration setting for Impala
or HDFS that controls the maximum block size of a file on HDFS.

Thanks
Pengcheng
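
For what it's worth, a hedged sketch of commands one might use to inspect
the configured default block size and the blocks actually backing a file
(the path is hypothetical):

```shell
# Hedged sketch: inspect HDFS block sizes. The path is made up; these
# commands are printed rather than executed here.
cat <<'EOF'
# Default block size the client is configured with, in bytes:
hdfs getconf -confKey dfs.blocksize
# Blocks actually backing existing files under a directory:
hdfs fsck /user/tablepar/201309 -files -blocks
EOF
```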

On Thu, May 8, 2014 at 3:43 PM, Lenni Kuff wrote:

Hi Pengcheng,
Since Impala caches the table metadata, including block location
information, you will need to run an "invalidate metadata <table name>"
after you change the block size. Can you try running that command and then
re-running your query?

Let me know how this works out. If it resolves the problem we can
look at how to improve the error message in Impala to make it easier to
diagnose.
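
In impala-shell that would look something like this (the table name is
hypothetical):

```shell
# Hedged sketch: refresh Impala's cached metadata for one table after
# its files or block layout change. Table name is made up.
cat <<'EOF'
impala-shell -q "invalidate metadata my_parquet_table"
EOF
```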

Thanks,
Lenni

On Thu, May 8, 2014 at 8:15 AM, Pengcheng Liu wrote:

Hello experts

I have been working with Impala for a year, and the new
Parquet format is really exciting.

I have Impala version vcdh5-1.3.0.

I have a data set of about 40 GB in Parquet (raw data is 500 GB)
with 20 partitions, but the partitions are not evenly distributed.

When I set the block size to 1 GB, some of the files are split into
multiple blocks since they are larger than 1 GB.

The Impala query works, but it gives me a warning saying it
cannot query Parquet files with multiple blocks.

I also saw some folks post a similar problem here, and one
response was to set the block size larger than the actual size of the file.

So I went ahead and tried that: I used 10 GB as my HDFS file block size.

Now my query fails with this error message:

ERROR: Error seeking to 3955895608 in file:
hdfs://research-mn00.saas.local:8020/user/tablepar/201309/-r-00106.snappy.parquet


Error(22): Invalid argument
ERROR: Invalid query handle

Is this error due to the large block size I used? Are there any
limits on the maximum block size we can create on HDFS?

Thanks
Pengcheng



To unsubscribe from this group and stop receiving emails from it,
send an email to impala-user+unsubscribe@cloudera.org.


--
Best Wishes!

Yours, Zesheng





Discussion Overview
group: impala-user
categories: hadoop
posted: May 15, '14 at 7:13p
active: May 29, '14 at 8:16p
posts: 7
users: 4
website: cloudera.com
irc: #hadoop
