FAQ
We ran into the same error. The parquet-hadoop subproject of parquet-mr has a
bug where it writes an incorrect size for dictionary pages (the size covers
only the data, not the header). Impala uses these sizes to figure out how much
to read off disk, so the read comes up short, and hence the thrift
deserialization error. The parquet-hadoop reader only uses offsets when
deserializing, and those are set correctly, so Hive doesn't exhibit the bug.
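
To make that size accounting concrete, here is a minimal, hypothetical sketch
(the class and method names are illustrative, not the real parquet-mr API) of
the difference between recording only the data length and recording header
plus data:

import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical sketch only: names and structure are illustrative, not the
// real parquet-mr classes. It shows why recording a dictionary page size
// that omits the header leaves a size-based reader short by exactly the
// header length.
public class DictionaryPageSizeSketch {

    // Stand-in for a thrift-encoded PageHeader; real headers are
    // variable-length thrift structs.
    static byte[] serializeHeader(int dataLength) {
        return new byte[] { 0x15, 0x00, (byte) dataLength };
    }

    public static void main(String[] args) throws IOException {
        byte[] dictionaryData = new byte[128];                  // encoded dictionary values
        byte[] header = serializeHeader(dictionaryData.length);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(header);
        out.write(dictionaryData);

        long buggySize = dictionaryData.length;                  // header not counted
        long fixedSize = header.length + dictionaryData.length;  // header + data

        // A reader that trusts the recorded size (as Impala does) reads
        // 'buggySize' bytes from the page offset and comes up header.length
        // bytes short, failing mid-deserialization.
        System.out.println("bytes actually written: " + out.size());  // 131
        System.out.println("buggy recorded size:    " + buggySize);   // 128
        System.out.println("fixed recorded size:    " + fixedSize);   // 131
    }
}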

We haven't sent a pull request back to the parquet folks yet, but you can
look at the last commit here to see the necessary change if you want to
compile your own jar: https://github.com/pulseio/parquet-mr

Keith

On Sun, Dec 8, 2013 at 10:30 PM, Sean O'Brien wrote:

Hi All,

We've been having issues with a bunch of queries that run over some parquet
tables we generate from our own MR jobs. These tables have worked since our
upgrade to Impala 1.2, so it's not a total breakage with the upgrade; it
seems more like a bug or something particular to one or two odd rows.

I can reproduce the failure on a partition that I've figured out is bad with
a query like:

select max(some_string_field) from parquet_table where dt='2013-12-07' and
hr='12';

the error I get:
ERRORS ENCOUNTERED DURING EXECUTION:
Backend 4:couldn't deserialize thrift msg:
No more data to read.

When I look at the node that I believe was the source of the error (I'm still
not sure how to determine which 'backend' is which), I see:

I1208 21:17:44.664397 2897 status.cc:44] couldn't deserialize thrift msg:
No more data to read.
@ 0x6c56e0 impala::Status::Status()
@ 0x9ab081 impala::DeserializeThriftMsg<>()
@ 0x9ac177 impala::HdfsParquetScanner::BaseColumnReader::ReadDataPage()
@ 0x9ad46f impala::HdfsParquetScanner::AssembleRows()
@ 0x9b0268 impala::HdfsParquetScanner::ProcessSplit()
@ 0x99229a impala::HdfsScanNode::ScannerThreadHelper()
@ 0x98d7f3 impala::HdfsScanNode::ScannerThread()
@ 0x7dfdfc impala::Thread::SuperviseThread()
@ 0x7e070e boost::detail::thread_data<>::run()
@ 0xa28884 thread_proxy
@ 0x7f3493e37e9a start_thread
@ 0x7f3492adc3fd (unknown)

We've tried version-matching parquet-mr to 1.2.5, since that appeared to be
the version CDH 4.5 uses. What version of parquet is Impala 1.2.1 using?
Perhaps we need to compile our parquet-generating MR jobs against another
version to get things working again.

Also, is there any way to get impalad to give me more information about
where in a given file it's failing?

Thanks
-Sean




  • Dmitriy Ryaboy at Dec 9, 2013 at 8:18 pm
    Keith,
    Is this the same fix?
    https://github.com/Parquet/parquet-mr/commit/b297c73c1082728ad9626d17ce0f7abe6abaa36b

  • Keith Simmons at Dec 9, 2013 at 8:34 pm
    Ah, it does look the same, at least on the writer side. Seems I'm not the
    only one that's run into the issue.

  • Dmitriy Ryaboy at Dec 9, 2013 at 8:43 pm
    Nice thing about that patch is that it works on the reader side with old (incorrectly written) files.

    We will make a new parquet-mr release this week to include this and other fixes & optimizations.
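
A hypothetical sketch (illustrative only, not the actual patch or the real
parquet-mr/Impala code) of why a reader that parses the page header from the
stream tolerates the incorrectly written files, while a reader that pre-reads
exactly the recorded size does not:

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical sketch contrasting a size-driven read (trust the size
// recorded in the column-chunk metadata) with a stream-driven read (parse
// the header at the page offset and trust the length the header itself
// declares). The latter is unaffected by the too-small recorded size.
public class DictionaryPageReadSketch {

    public static void main(String[] args) throws IOException {
        // A fake on-disk page: a 3-byte "header" whose last byte declares
        // the data length, followed by 16 bytes of dictionary data.
        byte[] dictionaryData = new byte[16];
        byte[] page = new byte[3 + dictionaryData.length];
        page[2] = (byte) dictionaryData.length;

        int buggyRecordedSize = dictionaryData.length;  // header omitted by the buggy writer

        // Size-driven reader: pre-read exactly 'buggyRecordedSize' bytes and
        // try to fit header + data into them -> short by the header length.
        boolean sizeDrivenFits = buggyRecordedSize >= page.length;

        // Stream-driven reader: parse the header from the stream, then read
        // the data length the header declares -> unaffected by the bad size.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(page));
        in.skipBytes(2);
        int declaredDataLength = in.readUnsignedByte();
        byte[] dictionary = new byte[declaredDataLength];
        in.readFully(dictionary);

        System.out.println("size-driven read fits:   " + sizeDrivenFits);                               // false
        System.out.println("stream-driven read fits: " + (dictionary.length == dictionaryData.length)); // true
    }
}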