We ran into the same error. The parquet-hadoop subproject of parquet-mr
has a bug where it writes an incorrect size for dictionary pages (the
size covers only the data, not the header). Impala uses these sizes
to figure out how much to read off disk, so it ends up with an incomplete
read, and hence the thrift deserialization error. The parquet-hadoop jar only
uses offsets when deserializing, which are set correctly, so Hive doesn't
exhibit this bug.
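
To make the difference concrete, here is a rough sketch in Python. It is a toy model, not Impala or parquet-mr code: a 4-byte length prefix stands in for the real Thrift-encoded page header, and every function name is made up for illustration. It only shows why a reader that trusts an undersized recorded size runs out of bytes, while a reader that walks from the offset does not.

```python
import struct

# Toy stand-in for a page header: a 4-byte big-endian payload length.
# (Real Parquet page headers are Thrift structs; this is an assumption
# made purely to keep the sketch short.)
HEADER_FMT = ">I"
HEADER_LEN = struct.calcsize(HEADER_FMT)


def make_page(payload: bytes) -> bytes:
    """Serialize one page as header + payload."""
    return struct.pack(HEADER_FMT, len(payload)) + payload


def read_page(buf: bytes, pos: int):
    """Parse one page at pos; raise EOFError if the buffer ends early."""
    if pos + HEADER_LEN > len(buf):
        raise EOFError("No more data to read (header truncated)")
    (length,) = struct.unpack_from(HEADER_FMT, buf, pos)
    start = pos + HEADER_LEN
    if start + length > len(buf):
        raise EOFError("No more data to read (payload truncated)")
    return buf[start:start + length], start + length


def build_chunk(dict_payload: bytes, data_payload: bytes, buggy: bool):
    """Return (chunk bytes, recorded size) for a dictionary page + data page."""
    dict_page = make_page(dict_payload)
    data_page = make_page(data_payload)
    # The bug being modeled: the dictionary page is counted without its header.
    dict_size = len(dict_payload) if buggy else len(dict_page)
    return dict_page + data_page, dict_size + len(data_page)


def read_by_size(file_bytes: bytes, chunk_offset: int, recorded_size: int):
    """Size-driven reader: fetch exactly recorded_size bytes, then parse."""
    buf = file_bytes[chunk_offset:chunk_offset + recorded_size]
    pages, pos = [], 0
    while pos < len(buf):
        payload, pos = read_page(buf, pos)
        pages.append(payload)
    return pages


def read_by_offset(file_bytes: bytes, chunk_offset: int, num_pages: int):
    """Offset-driven reader: start at the offset and parse pages in place."""
    pages, pos = [], chunk_offset
    for _ in range(num_pages):
        payload, pos = read_page(file_bytes, pos)
        pages.append(payload)
    return pages


if __name__ == "__main__":
    chunk, size = build_chunk(b"dict", b"datapage", buggy=True)
    file_bytes = b"PAR1" + chunk + b"footer"
    print(read_by_offset(file_bytes, 4, 2))
    try:
        read_by_size(file_bytes, 4, size)
    except EOFError as err:
        print("size-driven read failed:", err)
```

In this model the size-driven reader comes up short by exactly one header, so the last page in the chunk cannot be parsed, which is roughly the shape of the failure described above.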

We haven't sent a pull request back to the parquet folks yet, but you
can look at the last commit here to see the necessary change if you want to
compile your own jar: https://github.com/pulseio/parquet-mr


On Sun, Dec 8, 2013 at 10:30 PM, Sean O'Brien wrote:

Hi All,

We've been having issues with a bunch of queries that run over some
parquet tables we generate from our own MR jobs. These tables have worked since
our upgrade to Impala 1.2, so it's not a total breakage with the upgrade.
It seems more like a bug or something particular to one or two odd rows.

I can produce the failure on a given partition that I've figured out is
bad with a query like:

select max(some_string_filed) from parquet_table where dt='2013-12-07' and

the error I get:
Backend 4:couldn't deserialize thrift msg:
No more data to read.

When I find a node that I believe was the source of the error (I'm still not
sure how to determine which 'backend' is which), I see:

I1208 21:17:44.664397 2897 status.cc:44] couldn't deserialize thrift msg:
No more data to read.
@ 0x6c56e0 impala::Status::Status()
@ 0x9ab081 impala::DeserializeThriftMsg<>()
@ 0x9ac177
@ 0x9ad46f impala::HdfsParquetScanner::AssembleRows()
@ 0x9b0268 impala::HdfsParquetScanner::ProcessSplit()
@ 0x99229a impala::HdfsScanNode::ScannerThreadHelper()
@ 0x98d7f3 impala::HdfsScanNode::ScannerThread()
@ 0x7dfdfc impala::Thread::SuperviseThread()
@ 0x7e070e boost::detail::thread_data<>::run()
@ 0xa28884 thread_proxy
@ 0x7f3493e37e9a start_thread
@ 0x7f3492adc3fd (unknown)

We've tried version matching parquet-mr to 1.2.5 since that appeared to be
the version cdh4.5 uses. What version of parquet is Impala 1.2.1 using?
Perhaps we need to compile our parquet-generating MR jobs against another
version to get things working again.

Also is there any way to get impalad to give me more information about
where in a given file it's failing?


To unsubscribe from this group and stop receiving emails from it, send an
email to impala-user+unsubscribe@cloudera.org.

Discussion Overview
group: impala-user
posted: Dec 9, '13 at 7:04p
active: Dec 9, '13 at 8:43p

2 users in discussion

Keith Simmons: 2 posts Dmitriy Ryaboy: 2 posts


