All:

I apologize in advance if this is common. I've searched and I can't find
an explanation.

I'm loading a plain text tab delimited file into a Hive (0.4.1-dev)
table. This file is a small sample set of my full dataset and is the
result of a M/R job, written by TextOutputFormat, if it matters. When I
query the table, a small percentage (a few hundred out of a few million)
of the rows contain null values, whereas the input file does not contain
any null values. The number of rows with null fields seems to grow
proportionally with the total number of records. It looks like a SerDe
error or misconfiguration of some kind, but I can't pinpoint anything
that would cause the issue.

To confirm, I've done an fs -cat of the file to local disk and used cut
and sort to verify that all fields are properly formatted and populated.
Below is the extended table description along with some additional
information. Any help is greatly appreciated, as using Hive for simple
aggregation is saving me a ton of time over hand-writing the M/R
jobs myself.

I'm sure there's something I've done wrong. Unfortunately, I'm in a
situation where I can't deal with any portion of the records being
dumped (part of a reporting system).

Original create:

hive> create table raw_facts ( year int, month int, day int, application
string, company_id int, country_code string, receiver_code_id int,
keyword string, total int ) row format delimited fields terminated by '\t';

(I've also tried row format TEXTFORMAT or whatever it is; all fields
were null - I assumed it was because Hive was expecting ^A-delimited input.)

Table:

hive> describe extended raw_facts;
OK
year int
month int
day int
application string
company_id int
country_code string
receiver_code_id int
keyword string
total int

Detailed Table Information Table(tableName:raw_facts,
dbName:default, owner:<snip>, createTime:1262631537, lastAccessTime:0,
retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:year, type:int,
comment:null), FieldSchema(name:month, type:int, comment:null),
FieldSchema(name:day, type:int, comment:null),
FieldSchema(name:application, type:string, comment:null),
FieldSchema(name:company_id, type:int, comment:null),
FieldSchema(name:country_code, type:string, comment:null),
FieldSchema(name:receiver_code_id, type:int, comment:null),
FieldSchema(name:keyword, type:string, comment:null),
FieldSchema(name:total, type:int, comment:null)],
location:hdfs://<snip>/home/hive/warehouse/raw_facts,
inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
parameters:{serialization.format=9,field.delim= }), bucketCols:[],
sortCols:[], parameters:{}), partitionKeys:[], parameters:{})

Sample (real) rows (these are tab-separated in the file):

2009 12 01 f 98 US 171 test 222
2009 12 01 f 98 US 199 test 222
2009 12 01 f 98 US 220 test 222

Load command used:

hive> load data inpath 'hdfs://<snip>/some/path/out/part-r-00000'
overwrite into table raw_facts ;

Some queries:

hive> select count(1) from raw_facts;
OK
4723253

hive> select count(1) from raw_facts where year is null;
OK
277

hive> select year,count(1) from raw_facts group by year;
OK
NULL 277
2009 4722976

Thanks in advance.
--
Eric Sammer
eric@lifless.net
http://esammer.blogspot.com


  • Zheng Shao at Jan 5, 2010 at 3:34 am
    Hi Eric,

    Most probably there are leading/trailing spaces in the columns that
    are defined as "int".
    If Hive cannot parse the field successfully, the field will become null.
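    (A rough illustration of that behavior, not Hive's actual code: a
    strict parse like the SerDe's turns any int field with stray
    whitespace into a null.)

    ```python
    import re

    def parse_int_strict(field):
        """Roughly mimic LazySimpleSerDe's int handling: any field that
        is not a clean integer (e.g. one with stray whitespace) becomes
        None, which Hive surfaces as NULL."""
        return int(field) if re.fullmatch(r'-?\d+', field) else None

    print(parse_int_strict("98"))     # 98
    print(parse_int_strict(" 98"))    # None -> NULL in query results
    ```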

    You can try this to find out the rows:
    SELECT * FROM raw_facts WHERE year IS NULL;

    Zheng
    On Mon, Jan 4, 2010 at 4:10 PM, Eric Sammer wrote:
    [quoted message elided; see the original post above]


    --
    Yours,
    Zheng
  • Todd Lipcon at Jan 5, 2010 at 4:15 am
    Hi Zheng,

    Is there any way to convince the LazySimpleSerde to allow leading/trailing
    spaces in non-text fields?

    -Todd
    On Mon, Jan 4, 2010 at 7:33 PM, Zheng Shao wrote:

    Hi Eric,

    Most probably there are leading/trailing spaces in the columns that
    are defined as "int".
    If Hive cannot parse the field successfully, the field will become null.

    You can try this to find out the rows:
    SELECT * FROM raw_facts WHERE year IS NULL;

    Zheng
  • Zheng Shao at Jan 5, 2010 at 4:56 am
    Currently we don't have such an option.

    However, since Hive is totally pluggable, it shouldn't be hard to add
    a new SerDe which does what the users want.
    We should be able to reuse most of the code from LazySimpleSerDe as well.

    Zheng
    On Mon, Jan 4, 2010 at 8:14 PM, Todd Lipcon wrote:
    Hi Zheng,

    Is there any way to convince the LazySimpleSerde to allow leading/trailing
    spaces in non-text fields?

    -Todd

    --
    Yours,
    Zheng
  • Eric Sammer at Jan 5, 2010 at 6:29 am

    On 1/4/10 10:33 PM, Zheng Shao wrote:
    Hi Eric,

    Most probably there are leading/trailing spaces in the columns that
    are defined as "int".
    If Hive cannot parse the field successfully, the field will become null.

    You can try this to find out the rows:
    SELECT * FROM raw_facts WHERE year IS NULL;
    I just finished triple checking and I can't find any rows that contain
    any leading / trailing whitespace in any of the fields or the lines in
    the file produced by the M/R process. Is it possible the leading zero in
    the month field causes some kind of problem? I'll try rebuilding the
    table with year, month, and day as string rather than int.

    Even though after splitting on \t, no field matches /\t/, /^\s/, or
    /\s$/ (according to perl), the query above does return rows that are
    null. Seems strange. I also checked for \t\s and other similar patterns
    within lines and fields. Is there any kind of debugging output for SerDe
    / Hive when it encounters a malformed record / field? I would imagine
    not given how chatty it might be...

    Any other suggestions are appreciated.
    Thanks!
    --
    Eric Sammer
    eric@lifless.net
    http://esammer.blogspot.com
  • Zheng Shao at Jan 5, 2010 at 6:56 am
    LazySimpleSerDe treats missing fields as NULL.

    A hint to look for is in the stderr/stdout of the mapper tasks.
    If you see something like this:
    WARN: Missing fields! Expected 6 fields but only got 1.
    Then that's the problem.

    Also, if you want to change the column type definition, try "ALTER
    TABLE REPLACE COLUMNS".
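
    For the table in this thread that might look like the following (a
    sketch, not from the original mails, redefining the three date
    columns as string for debugging):

    ```sql
    -- Hypothetical: keep all nine columns but make year/month/day
    -- strings so unparseable values survive instead of becoming NULL.
    ALTER TABLE raw_facts REPLACE COLUMNS (
      year string, month string, day string,
      application string, company_id int, country_code string,
      receiver_code_id int, keyword string, total int
    );
    ```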

    Zheng
    On Mon, Jan 4, 2010 at 10:28 PM, Eric Sammer wrote:
    [quoted message elided; see Eric's reply above]

    --
    Yours,
    Zheng
  • Eric Sammer at Jan 5, 2010 at 7:00 am

    On 1/5/10 1:28 AM, Eric Sammer wrote:
    [quoted message elided; see the exchange above]
    Mea culpa. I knew I was doing something stupid. I missed a \r in one of
    the fields that caused the remainder of the file to be offset. I was
    under the impression (wrongly so) that Hive explicitly only cared about
    \n as a line (record) terminator by default. Fixing this field left me
    with a perfectly working Hive system.
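
    (The stray \r is easy to find once you know to look for it. A small
    check over the raw file, as a sketch: it assumes the tab-delimited,
    nine-column layout from this thread, and the function name is made up.)

    ```python
    EXPECTED_FIELDS = 9  # raw_facts has nine columns

    def check(lines):
        """Report (line number, reason) for lines with an embedded
        carriage return or an unexpected field count."""
        bad = []
        for n, line in enumerate(lines, 1):
            line = line.rstrip('\n')
            if '\r' in line:
                bad.append((n, 'embedded \\r'))
            elif len(line.split('\t')) != EXPECTED_FIELDS:
                bad.append((n, 'field count %d' % len(line.split('\t'))))
        return bad

    # Usage: check(open('part-r-00000'))
    ```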

    Thanks, Zheng, for your patience. I should not have doubted. ;)

    It might be nice (as a 1.0 feature) to be able to analyze a table and
    get a report of metrics to aid in debugging these types of things:
    the number of rows as the SerDe sees the table, possibly value
    distributions, partition / bucket information, etc. Another option,
    for mission-critical tables, might be a "validating SerDe" of sorts
    that can confirm simple things like the number of fields. I suppose
    one could implement an M/R job to filter out bad records and do
    this outside of Hive entirely, given the transparency of the
    warehouse / table directories.

    For posterity: switching the int fields to string and doing a group by
    revealed values that led me to the bad lines, whereas the NULLs from
    the int fields were not helpful in debugging.
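
    (Concretely, that trick reads something like the following sketch,
    based on the description above:)

    ```sql
    -- With year redefined as string, malformed values form their own
    -- groups instead of collapsing into NULL:
    SELECT year, count(1) FROM raw_facts GROUP BY year;
    ```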

    Thanks again!
    --
    Eric Sammer
    eric@lifless.net
    http://esammer.blogspot.com
  • Jeff Hammerbacher at Jan 5, 2010 at 11:38 pm
    Hey Eric,

    Those feature requests are more likely to be implemented if they have a JIRA
    ticket to reference. Why not open one (or more) at
    https://issues.apache.org/jira/browse/HIVE?

    Regards,
    Jeff
    On Mon, Jan 4, 2010 at 11:00 PM, Eric Sammer wrote:
    [quoted message elided; see Eric's message above]

Discussion Overview
group: user
categories: hive, hadoop
posted: Jan 5, '10 at 12:11a
active: Jan 5, '10 at 11:38p
posts: 8
users: 4
website: hive.apache.org