Maximum Number of Hive Partitions = 256?
Grokbase Groups: Hive user, May 2011
I created a partitioned table, partitioned daily. If I query the earlier
partitions, everything works. The later ones fail with error:

hive> select substr(user_name,1,1),count(*) from u_s_h_b where
dtpartition='2010-10-24' group by substr(user_name,1,1) ;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.hadoop.mapred.FileInputFormat.identifyHosts(FileInputFormat.java:556)
at org.apache.hadoop.mapred.FileInputFormat.getSplitHosts(FileInputFormat.java:524)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:235)
......snip.......
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Job Submission failed with exception 'java.lang.ArrayIndexOutOfBoundsException(0)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask

It turns out that 2010-10-24 is 257 days from the very first partition in my
dataset (2010-01-09):
+-----------------------------------------+
| date_sub('2010-10-24',interval 257 day) |
+-----------------------------------------+
| 2010-02-09                              |
+-----------------------------------------+
That seems like an interesting coincidence. But try as I might, the Great
Googles will not show me a way to tune this, or even tell me whether it is
tunable or expected. Has anyone else run into a 256-partition limit in Hive?
How do you work around it? Why is that even the limit?! Shouldn't it be more
like 32-bit maxint??!!

Thanks!

--
Tim Ellis
Riot Games


  • Steven Wong at May 4, 2011 at 2:02 am
    I have way more than 256 partitions per table. AFAIK, there is no partition limit.
    From your stack trace, you have some host name issue somewhere.
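    If you want to double-check the count on your side, a quick way (using the
    table name from your query; hive -e runs a statement from the shell):

    hive> show partitions u_s_h_b;

    -bash-3.2$ hive -e 'show partitions u_s_h_b' | wc -l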

  • Viral Bajaria at May 4, 2011 at 2:53 am
    Same here ... we have way more than 256 partitions in multiple tables. I am
    sure the issue has something to do with an empty string passed to the substr
    function. Can you validate that the table has no null/empty strings for
    user_name, or try running the query with len(user_name) > 1 (not sure about
    query syntax)?
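    A minimal sketch of that check (Hive's built-in is length(), not len();
    table and column names taken from the original query):

    hive> select count(*) from u_s_h_b
        > where dtpartition='2010-10-24'
        > and (user_name is null or length(user_name) = 0);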
  • Time Less at May 4, 2011 at 5:53 pm

    > I am sure the issue has something to do with an empty string passed to the
    > substr function.

    We can rule out the substr() function. I get the same stack trace with any
    query like:

    hive> select <anyColumn> from ushb where dtpartition='2010-10-25' limit 10;

    But this query succeeds:

    hive> select * from ushb where dtpartition='2010-10-25' limit 10 ;

    So SOMETHING about the data makes Hive (Hadoop?) unhappy. More specifically,
    something about trying to select a particular column from the data on certain
    days. I'm looking at the data to see if I can sort out what it is.

    > I have way more than 256 partitions per table. AFAIK, there is no partition
    > limit.
    >
    > From your stack trace, you have some host name issue somewhere.
    I see why you'd think that from the stack trace, though I can't imagine why
    it'd have a "host name issue somewhere." The partition create statements
    have no hostname component. The query has no hostname component.

    This is definitely a curious problem.
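    In case it really is host-related, cluster health is quick to check from the
    shell (a sketch, assuming stock Hadoop tooling; dfsadmin -report summarizes
    live and dead datanodes):

    -bash-3.2$ hadoop dfsadmin -report | head -20

    A dead datanode that held the only replica of some blocks could plausibly
    cause this kind of failure even though no query or DDL mentions a hostname.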

    --
    Tim Ellis
    Riot Games
  • Time Less at May 4, 2011 at 6:12 pm
    > This is definitely a curious problem.

    It's data corruption. The file is tab-separated, so I created a quick Perl
    pipe to print out the number of tabs on each line:

    -bash-3.2$ hadoop fs -cat /user/hive/warehouse/ushb/2010-10-25/data-2010-10-25 \
        | perl -pe 's/[^\t\n]//g' | perl -pe 's/\t/-/g' | sort | uniq -c
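    # Reading the pipeline: hadoop fs -cat streams the raw partition file; the
    # first perl strips every character except tabs and newlines; the second
    # turns each tab into a dash; sort | uniq -c then counts lines by their
    # tab pattern, i.e. by their number of tab-separated fields.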

    The STDOUT was slightly disturbing. Each dash below is one tab, so exactly
    one line has just 2 tabs while the other 1,552,318 lines have 7:

    1 --
    1552318 -------

    The STDERR even more so:

    11/05/04 11:07:49 INFO hdfs.DFSClient: No node available for block: blk_-1511269407958713809_10494 file=/user/hive/warehouse/ushb/2010-10-25/data-2010-10-25
    11/05/04 11:07:49 INFO hdfs.DFSClient: Could not obtain block blk_-1511269407958713809_10494 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
    11/05/04 11:07:52 INFO hdfs.DFSClient: No node available for block: blk_-1511269407958713809_10494 file=/user/hive/warehouse/ushb/2010-10-25/data-2010-10-25
    11/05/04 11:07:52 INFO hdfs.DFSClient: Could not obtain block blk_-1511269407958713809_10494 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
    11/05/04 11:07:58 WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_-1511269407958713809_10494 file=/user/hive/warehouse/ushb/2010-10-25/data-2010-10-25
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1977)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1784)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1932)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    (...etc)
    cat: Could not obtain block: blk_-1511269407958713809_10494 file=/user/hive/warehouse/ushb/2010-10-25/data-2010-10-25
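    A file-by-file way to confirm the damage (a sketch, assuming stock HDFS
    tooling; fsck reports missing and corrupt blocks under a path):

    -bash-3.2$ hadoop fsck /user/hive/warehouse/ushb/2010-10-25 -files -blocks -locations

    That would also square with the original stack trace: getSplits asks for the
    hosts serving each block, and a block with no live replicas presumably comes
    back with an empty host list, hence the ArrayIndexOutOfBoundsException: 0.
    Nothing to do with partition counts at all.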

    --
    Tim Ellis
    Riot Games
  • Time Less at May 4, 2011 at 5:59 pm

    > It turns out that 2010-10-24 is 257 days from the very first partition in
    > my dataset (2010-01-09):
    >
    > +-----------------------------------------+
    > | date_sub('2010-10-24',interval 257 day) |
    > +-----------------------------------------+
    > | 2010-02-09                              |
    > +-----------------------------------------+
    I just noticed that 257 days back from 2010-10-24 is FEBRUARY 9th, not
    JANUARY 9th, as the above shows. So there isn't even any 256-ness to this
    problem in the first place. The human brain tends to pay attention to the
    beginnings and ends of strings, ignoring the middle.
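    A quick sanity check of the real gap, in the same MySQL-style syntax as the
    date_sub call above:

    select datediff('2010-10-24','2010-01-09');
    -- returns 288

    So the first partition is actually 288 days back, not 257; the 257-day mark
    falls on 2010-02-09, exactly as date_sub reported.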

    --
    Tim Ellis
    Riot Games
