I ran into a problem where an Impala query may return wrong results
when the data contains '\x00'.

I made a small dataset to reproduce the bug. However, when I ran the select
query, impala-server crashed, impala-shell returned "Error
communicating with impalad: TSocket read 0 bytes", and I had to restart
impala-server.

I used Hive to create the table. The DDL is:

create table mytest(id int, name string, value double)
ROW FORMAT SERDE
   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
   'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

My default field terminator is '\x01' and the line terminator is '\n'.

The dataset is:

1^Atest^@^A12.3

^A means '\x01' and ^@ means '\x00' (copied from vim).
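
For anyone who wants to recreate the file without typing control characters by hand, here is a minimal sketch that writes the same bytes. The function names are made up for illustration, and it assumes the row ends with a single '\n', which the post implies but does not show.

```cpp
#include <fstream>
#include <string>

// Builds the single row from the post: 1 ^A test ^@ ^A 12.3, then '\n'.
// The trailing newline is an assumption; the post only shows the row itself.
std::string MakeTestRow() {
  std::string row;
  row += "1";
  row += '\x01';   // field terminator (^A)
  row += "test";
  row += '\0';     // the embedded NUL byte (^@)
  row += '\x01';   // field terminator (^A)
  row += "12.3";
  row += '\n';     // line terminator
  return row;
}

// Writes the row to `path` in binary mode so that no byte is translated.
// Returns true if the file was written successfully.
bool WriteTestFile(const std::string& path) {
  std::ofstream out(path, std::ios::binary);
  if (!out) return false;
  const std::string row = MakeTestRow();
  out.write(row.data(), static_cast<std::streamsize>(row.size()));
  return static_cast<bool>(out);
}
```

Calling WriteTestFile with a local path and then copying the result into the table's HDFS directory should give the same input as above. The piecewise appends are deliberate: a literal such as "\x0112.3" would be read as one oversized hex escape.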

The select query is very simple:

select * from mytest;

The results Hive returned:

hive> select * from mytest;
OK
1 test 12.3

The Impala version is 1.0.0 and Hadoop is CDH4.2.1.

The error log (impala-server.log) is:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00000000009484d9, pid=3775, tid=140010238035712
#
# JRE version: 7.0_15-b20
# Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C
[error occurred during error reporting (printing problematic frame), id 0xb]

# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/zeadom/hs_err_pid3775.log
[thread 140010295965440 also had an error]
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
# https://bugs.launchpad.net/ubuntu/+source/openjdk-7/
#

But I don't think OpenJDK is the cause of the bug.

  • Nong Li at Jun 14, 2013 at 7:18 pm
    I think you are running into this issue:
    https://issues.cloudera.org/browse/IMPALA-13.

    Can you send us the hs_err file to confirm?

  • Imzeadom at Jun 17, 2013 at 12:42 pm
    I guess that Impala is designed not to process escapes when the escape
    char is '\0'.
    However, when handling the remaining characters after ParseSse has run,
    Impala processes escapes regardless of what the escape char is.

    file: be/src/exec/delimited-text-parser.cc

    // Handle the remaining characters
    while (remaining_len > 0) {
      bool new_tuple = false;
      bool new_col = false;

      if (!last_char_is_escape_) {
        if (tuple_delim_ != '\0' && (**byte_buffer_ptr == tuple_delim_ ||
            (tuple_delim_ == '\n' && **byte_buffer_ptr == '\r'))) {
          new_tuple = true;
          new_col = true;
        } else if (**byte_buffer_ptr == field_delim_
            || **byte_buffer_ptr == collection_item_delim_) {
          new_col = true;
        }
      }

      if (**byte_buffer_ptr == escape_char_) {
        current_column_has_escape_ = true;
        last_char_is_escape_ = !last_char_is_escape_;
      } else {
        last_char_is_escape_ = false;
      }
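
To make the effect concrete, here is a small standalone model of that per-byte escape logic. It is only a sketch: TailParser, guard_null_escape and SampleRow are hypothetical names, tuple and collection delimiters are left out, and the guarded branch is just one possible way to skip escape handling when the escape char is '\0', not Impala's actual fix. It shows how a '\x00' in the data can swallow the following field delimiter. It does not claim to explain the SIGSEGV itself.

```cpp
#include <cstddef>
#include <string>

// Simplified model of the tail loop's escape state. Not Impala's real class.
struct TailParser {
  char field_delim;
  char escape_char;        // '\0' is taken to mean "no escape character"
  bool guard_null_escape;  // hypothetical: ignore escapes when escape_char == '\0'

  // Returns how many field delimiters are recognised in `data`.
  std::size_t CountFieldDelims(const std::string& data) const {
    bool last_char_is_escape = false;
    std::size_t delims = 0;
    const bool escapes_enabled = !(guard_null_escape && escape_char == '\0');
    for (char c : data) {
      // A delimiter only counts if the previous byte was not an escape.
      if (!last_char_is_escape && c == field_delim) ++delims;
      if (escapes_enabled && c == escape_char) {
        last_char_is_escape = !last_char_is_escape;
      } else {
        last_char_is_escape = false;
      }
    }
    return delims;
  }
};

// The row from the original post: 1 ^A test ^@ ^A 12.3 followed by '\n'.
std::string SampleRow() {
  std::string row = "1";
  row += '\x01';
  row += "test";
  row += '\0';
  row += '\x01';
  row += "12.3";
  row += '\n';
  return row;
}
```

With field_delim '\x01' and escape_char '\0', the unguarded parser counts one delimiter in SampleRow(), because the NUL byte is treated as an escape and hides the '\x01' after it. With guard_null_escape set to true it counts two, which matches the three columns Hive returned.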

Discussion Overview
group: impala-user
categories: hadoop
posted: Jun 14, '13 at 7:42a
active: Jun 17, '13 at 12:42p
posts: 3
users: 2
website: cloudera.com
irc: #hadoop

2 users in discussion

Imzeadom: 2 posts Nong Li: 1 post
