Grokbase Groups HBase user July 2011
Hi all,
I'm running a scan using the M/R framework.
My table contains hundreds of millions of rows, and I'm scanning about 50
million of them using a start/stop key.

The problem is that some map tasks get stuck and the task tracker kills
these maps after 600 seconds. When the task is retried it (sometimes) works
fine.

To verify that the problem is in HBase (and not in the map code) I removed
all the code from my map function, so it looks like this:
public void map(ImmutableBytesWritable key, Result value, Context context)
    throws IOException, InterruptedException {
  // intentionally empty: rules out the map body as the cause
}

Also, when a map got stuck on a region, I tried to scan that region (using
a simple standalone scan from a Java main) and it worked fine.
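(For reference, a minimal sketch of such a standalone region scan, assuming
the 0.90 client API; the table name and key range below are placeholders,
not the actual ones from this job:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionScanCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "urls");              // placeholder table name
    // scan only the stuck region's key range
    Scan scan = new Scan(Bytes.toBytes("regionStartKey"), // placeholder start key
                         Bytes.toBytes("regionEndKey"));  // placeholder end key
    ResultScanner scanner = table.getScanner(scan);
    long rows = 0;
    try {
      for (Result r = scanner.next(); r != null; r = scanner.next()) {
        rows++;                                           // count rows only
      }
    } finally {
      scanner.close();
      table.close();
    }
    System.out.println("scanned " + rows + " rows");
  }
}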

Any ideas?

Thanks,
Lior


  • Ted Yu at Jul 4, 2011 at 2:23 pm
    Do you use TableInputFormat?
    To scan a large number of rows, it would be better to produce one split per
    region.

    Which HBase version do you use?
    Do you find any exceptions in the master/region server logs around the
    moment of the timeout?

    Cheers
    On Mon, Jul 4, 2011 at 4:48 AM, Lior Schachter wrote:

    ...
  • Lior Schachter at Jul 4, 2011 at 2:37 pm
    1. Yes - I configure my job using this line (a fuller sketch of the setup
    appears below):
    TableMapReduceUtil.initTableMapperJob(HBaseConsts.URLS_TABLE_NAME, scan,
    ScanMapper.class, Text.class, MapWritable.class, job);
    which internally uses TableInputFormat.

    2. One split per region? What do you mean? How do I do that?

    3. HBase version 0.90.2.

    4. No exceptions; the logs are very clean.
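
    (A minimal sketch of the full job wiring around that call, assuming the
    0.90-era mapreduce API; the table name, key range, and job name are
    placeholders for the real ones:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class UrlsScanJob {
      // stand-in for the user's ScanMapper, empty as in the test above
      static class ScanMapper extends TableMapper<Text, MapWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value,
            Context context) {
          // intentionally empty
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "urls-scan");              // placeholder job name
        Scan scan = new Scan(Bytes.toBytes("startKey"),    // placeholder range
                             Bytes.toBytes("stopKey"));
        scan.setCaching(500);       // rows fetched per RPC; a common tuning knob
        scan.setCacheBlocks(false); // usually disabled for full MR scans
        TableMapReduceUtil.initTableMapperJob(
            "urls",                 // placeholder for HBaseConsts.URLS_TABLE_NAME
            scan,
            ScanMapper.class,
            Text.class,             // map output key class
            MapWritable.class,      // map output value class
            job);
        job.setNumReduceTasks(0);   // map-only, as in the test described above
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }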


    On Mon, Jul 4, 2011 at 5:22 PM, Ted Yu wrote:

    ...
  • Ted Yu at Jul 4, 2011 at 2:55 pm
    For #2, see the javadoc of TableInputFormatBase.getSplits():
    * Calculates the splits that will serve as input for the map tasks. The
    * number of splits matches the number of regions in a table.
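
    (A hedged way to confirm this from client code, assuming 0.90 APIs: wire up
    a TableInputFormat by hand and count the splits it would hand to the job.
    The table name is a placeholder:)

    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;

    public class SplitCount {
      public static void main(String[] args) throws Exception {
        Job job = new Job(HBaseConfiguration.create(), "split-count");
        job.getConfiguration().set(TableInputFormat.INPUT_TABLE, "urls"); // placeholder
        TableInputFormat tif = new TableInputFormat();
        tif.setConf(job.getConfiguration()); // connects to the table
        List<InputSplit> splits = tif.getSplits(job);
        // one split per region covered by the scan range
        System.out.println(splits.size() + " splits");
      }
    }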

    On Mon, Jul 4, 2011 at 7:37 AM, Lior Schachter wrote:

    ...
  • Lior Schachter at Jul 4, 2011 at 3:16 pm
    1. Currently every map gets one region, so I don't understand what
    difference using the splits will make.
    2. How should I use TableInputFormatBase.getSplits()? I could not find any
    examples of that.

    Thanks,
    Lior

    On Mon, Jul 4, 2011 at 5:55 PM, Ted Yu wrote:

    ...
  • Ted Yu at Jul 4, 2011 at 3:22 pm
    I wasn't clear in my previous email.
    It was not an answer to why the map tasks got stuck;
    TableInputFormatBase.getSplits() is already being called.

    Can you try getting a jstack of one of the map tasks before the task
    tracker kills it?

    Thanks
    On Mon, Jul 4, 2011 at 8:15 AM, Lior Schachter wrote:

    ...
  • Lior Schachter at Jul 4, 2011 at 4:27 pm
    I used kill -3; the thread dump follows:

    Full thread dump Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode):

    "IPC Client (47) connection to /127.0.0.1:59759 from hadoop" daemon
    prio=10 tid=0x00002aaab05ca800 nid=0x4eaf in Object.wait()
    [0x00000000403c1000]
    java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000f9dba860> (a org.apache.hadoop.ipc.Client$Connection)
    at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
    - locked <0x00000000f9dba860> (a org.apache.hadoop.ipc.Client$Connection)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)

    "SpillThread" daemon prio=10 tid=0x00002aaab0585000 nid=0x4c99 waiting
    on condition [0x00000000404c2000]
    java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000f9af0c38> (a
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1169)

    "main-EventThread" daemon prio=10 tid=0x00002aaab035d000 nid=0x4c95
    waiting on condition [0x0000000041207000]
    java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000f9af5f58> (a
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)

    "main-SendThread(hadoop09.infolinks.local:2181)" daemon prio=10
    tid=0x00002aaab035c000 nid=0x4c94 runnable [0x0000000040815000]
    java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    - locked <0x00000000f9af61a8> (a sun.nio.ch.Util$2)
    - locked <0x00000000f9af61b8> (a java.util.Collections$UnmodifiableSet)
    - locked <0x00000000f9af6160> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1107)

    "communication thread" daemon prio=10 tid=0x000000004d020000
    nid=0x4c93 waiting on condition [0x0000000042497000]
    java.lang.Thread.State: RUNNABLE
    at java.util.Hashtable.put(Hashtable.java:420)
    - locked <0x00000000f9dbaa58> (a java.util.Hashtable)
    at org.apache.hadoop.ipc.Client$Connection.addCall(Client.java:225)
    - locked <0x00000000f9dba860> (a org.apache.hadoop.ipc.Client$Connection)
    at org.apache.hadoop.ipc.Client$Connection.access$1600(Client.java:176)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:854)
    at org.apache.hadoop.ipc.Client.call(Client.java:720)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source)
    at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:548)
    at java.lang.Thread.run(Thread.java:662)

    "Thread for syncLogs" daemon prio=10 tid=0x00002aaab02e9800 nid=0x4c90
    runnable [0x0000000040714000]
    java.lang.Thread.State: RUNNABLE
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
    at java.lang.StringBuilder.append(StringBuilder.java:119)
    at java.io.UnixFileSystem.resolve(UnixFileSystem.java:93)
    at java.io.File.<init>(...)
    at org.apache.hadoop.mapred.TaskLog.<elided>(TaskLog.java:72)
    at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:180)
    at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:230)
    - locked <0x00000000eea92fc0> (a java.lang.Class for
    org.apache.hadoop.mapred.TaskLog)
    at org.apache.hadoop.mapred.Child$2.run(Child.java:89)

    "Low Memory Detector" daemon prio=10 tid=0x00002aaab0001800 nid=0x4c86
    runnable [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

    "CompilerThread1" daemon prio=10 tid=0x000000004cb4e800 nid=0x4c85
    waiting on condition [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

    "CompilerThread0" daemon prio=10 tid=0x000000004cb4b000 nid=0x4c84
    waiting on condition [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

    "Signal Dispatcher" daemon prio=10 tid=0x000000004cb49000 nid=0x4c83
    runnable [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

    "Finalizer" daemon prio=10 tid=0x000000004cb2c800 nid=0x4c82 in
    Object.wait() [0x0000000041d7a000]
    java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000f9c52630> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
    - locked <0x00000000f9c52630> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

    "Reference Handler" daemon prio=10 tid=0x000000004cb25000 nid=0x4c81
    in Object.wait() [0x0000000041005000]
    java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000f9af5ea0> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:485)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
    - locked <0x00000000f9af5ea0> (a java.lang.ref.Reference$Lock)

    "main" prio=10 tid=0x000000004cab9000 nid=0x4c77 runnable [0x0000000040f04000]
    java.lang.Thread.State: RUNNABLE
    at com.infolinks.hadoop.hbase.HBaseURLsDaysAggregator$ScanMapper.map(HBaseURLsDaysAggregator.java:124)
    at com.infolinks.hadoop.hbase.HBaseURLsDaysAggregator$ScanMapper.map(HBaseURLsDaysAggregator.java:108)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

    "VM Thread" prio=10 tid=0x000000004cb1e800 nid=0x4c80 runnable

    "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004cacc000
    nid=0x4c78 runnable

    "GC task thread#1 (ParallelGC)" prio=10 tid=0x000000004cace000
    nid=0x4c79 runnable

    "GC task thread#2 (ParallelGC)" prio=10 tid=0x000000004cad0000
    nid=0x4c7a runnable

    "GC task thread#3 (ParallelGC)" prio=10 tid=0x000000004cad1800
    nid=0x4c7b runnable

    "GC task thread#4 (ParallelGC)" prio=10 tid=0x000000004cad3800
    nid=0x4c7c runnable

    "GC task thread#5 (ParallelGC)" prio=10 tid=0x000000004cad5800
    nid=0x4c7d runnable

    "GC task thread#6 (ParallelGC)" prio=10 tid=0x000000004cad7000
    nid=0x4c7e runnable

    "GC task thread#7 (ParallelGC)" prio=10 tid=0x000000004cad9000
    nid=0x4c7f runnable

    "VM Periodic Task Thread" prio=10 tid=0x00002aaab000c000 nid=0x4c87
    waiting on condition

    JNI global references: 1099

    Heap
    PSYoungGen total 59712K, used 51200K [0x00000000fbd60000,
    0x0000000100000000, 0x0000000100000000)
    eden space 51200K, 100% used
    [0x00000000fbd60000,0x00000000fef60000,0x00000000fef60000)
    from space 8512K, 0% used
    [0x00000000ff7b0000,0x00000000ff7b0000,0x0000000100000000)
    to space 8512K, 0% used
    [0x00000000fef60000,0x00000000fef60000,0x00000000ff7b0000)
    PSOldGen total 136576K, used 104203K [0x00000000f3800000,
    0x00000000fbd60000, 0x00000000fbd60000)
    object space 136576K, 76% used
    [0x00000000f3800000,0x00000000f9dc2d68,0x00000000fbd60000)
    PSPermGen total 21504K, used 11998K [0x00000000ee600000,
    0x00000000efb00000, 0x00000000f3800000)
    object space 21504K, 55% used
    [0x00000000ee600000,0x00000000ef1b7aa0,0x00000000efb00000)

    2011-07-04 15:35:02

    Full thread dump Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode):

    "IPC Client (47) connection to /127.0.0.1:59759 from hadoop" daemon
    prio=10 tid=0x00002aaab05ca800 nid=0x4eaf in Object.wait()
    [0x00000000403c1000]
    java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000f9dba860> (a org.apache.hadoop.ipc.Client$Connection)
    at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
    - locked <0x00000000f9dba860> (a org.apache.hadoop.ipc.Client$Connection)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)

    "SpillThread" daemon prio=10 tid=0x00002aaab0585000 nid=0x4c99 waiting
    on condition [0x00000000404c2000]
    java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000f9af0c38> (a
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1169)

    "main-EventThread" daemon prio=10 tid=0x00002aaab035d000 nid=0x4c95
    waiting on condition [0x0000000041207000]
    java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000f9af5f58> (a
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)

    "main-SendThread(hadoop09.infolinks.local:2181)" daemon prio=10
    tid=0x00002aaab035c000 nid=0x4c94 runnable [0x0000000040815000]
    java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    - locked <0x00000000f9af61a8> (a sun.nio.ch.Util$2)
    - locked <0x00000000f9af61b8> (a java.util.Collections$UnmodifiableSet)
    - locked <0x00000000f9af6160> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1107)

    "communication thread" daemon prio=10 tid=0x000000004d020000
    nid=0x4c93 waiting on condition [0x0000000042497000]
    java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:529)
    at java.lang.Thread.run(Thread.java:662)

    "Thread for syncLogs" daemon prio=10 tid=0x00002aaab02e9800 nid=0x4c90
    waiting on condition [0x0000000040714000]
    java.lang.Thread.State: RUNNABLE
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(...)
    at java.lang.StringBuilder.toString(StringBuilder.java:430)
    at org.apache.hadoop.fs.Path.<init>(Path.java:50)
    at org.apache.hadoop.fs.ChecksumFileSystem.getChecksumFile(ChecksumFileSystem.java:73)
    at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:410)
    at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:193)
    at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:230)
    - locked <0x00000000eea92fc0> (a java.lang.Class for
    org.apache.hadoop.mapred.TaskLog)
    at org.apache.hadoop.mapred.Child$2.run(Child.java:89)

    "Low Memory Detector" daemon prio=10 tid=0x00002aaab0001800 nid=0x4c86
    runnable [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

    "CompilerThread1" daemon prio=10 tid=0x000000004cb4e800 nid=0x4c85
    waiting on condition [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

    "CompilerThread0" daemon prio=10 tid=0x000000004cb4b000 nid=0x4c84
    waiting on condition [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

    "Signal Dispatcher" daemon prio=10 tid=0x000000004cb49000 nid=0x4c83
    runnable [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

    "Finalizer" daemon prio=10 tid=0x000000004cb2c800 nid=0x4c82 in
    Object.wait() [0x0000000041d7a000]
    java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000f9c52630> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
    - locked <0x00000000f9c52630> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

    "Reference Handler" daemon prio=10 tid=0x000000004cb25000 nid=0x4c81
    in Object.wait() [0x0000000041005000]
    java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000f9af5ea0> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:485)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
    - locked <0x00000000f9af5ea0> (a java.lang.ref.Reference$Lock)

    "main" prio=10 tid=0x000000004cab9000 nid=0x4c77 runnable [0x0000000040f04000]
    java.lang.Thread.State: RUNNABLE
    at com.infolinks.hadoop.hbase.HBaseURLsDaysAggregator$ScanMapper.map(HBaseURLsDaysAggregator.java:131)
    at com.infolinks.hadoop.hbase.HBaseURLsDaysAggregator$ScanMapper.map(HBaseURLsDaysAggregator.java:108)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

    "VM Thread" prio=10 tid=0x000000004cb1e800 nid=0x4c80 runnable

    "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004cacc000
    nid=0x4c78 runnable

    "GC task thread#1 (ParallelGC)" prio=10 tid=0x000000004cace000
    nid=0x4c79 runnable

    "GC task thread#2 (ParallelGC)" prio=10 tid=0x000000004cad0000
    nid=0x4c7a runnable

    "GC task thread#3 (ParallelGC)" prio=10 tid=0x000000004cad1800
    nid=0x4c7b runnable

    "GC task thread#4 (ParallelGC)" prio=10 tid=0x000000004cad3800
    nid=0x4c7c runnable

    "GC task thread#5 (ParallelGC)" prio=10 tid=0x000000004cad5800
    nid=0x4c7d runnable

    "GC task thread#6 (ParallelGC)" prio=10 tid=0x000000004cad7000
    nid=0x4c7e runnable

    "GC task thread#7 (ParallelGC)" prio=10 tid=0x000000004cad9000
    nid=0x4c7f runnable

    "VM Periodic Task Thread" prio=10 tid=0x00002aaab000c000 nid=0x4c87
    waiting on condition

    JNI global references: 1099

    Heap
    PSYoungGen total 59712K, used 51200K [0x00000000fbd60000,
    0x0000000100000000, 0x0000000100000000)
    eden space 51200K, 100% used
    [0x00000000fbd60000,0x00000000fef60000,0x00000000fef60000)
    from space 8512K, 0% used
    [0x00000000ff7b0000,0x00000000ff7b0000,0x0000000100000000)
    to space 8512K, 0% used
    [0x00000000fef60000,0x00000000fef60000,0x00000000ff7b0000)
    PSOldGen total 136576K, used 104212K [0x00000000f3800000,
    0x00000000fbd60000, 0x00000000fbd60000)
    object space 136576K, 76% used
    [0x00000000f3800000,0x00000000f9dc50b8,0x00000000fbd60000)
    PSPermGen total 21504K, used 11998K [0x00000000ee600000,
    0x00000000efb00000, 0x00000000f3800000)
    object space 21504K, 55% used
    [0x00000000ee600000,0x00000000ef1b7aa0,0x00000000efb00000)


    On Mon, Jul 4, 2011 at 6:22 PM, Ted Yu wrote:

    ...
  • Ted Yu at Jul 4, 2011 at 4:33 pm
    In the future, please provide the full dump via pastebin.com and put only
    a snippet of the log in the email.

    Can you tell us what the following lines are about?
    HBaseURLsDaysAggregator.java:124
    HBaseURLsDaysAggregator.java:131

    How many mappers were launched?

    What value is used for hbase.zookeeper.property.maxClientCnxns?
    You may need to increase that setting.
    On Mon, Jul 4, 2011 at 9:26 AM, Lior Schachter wrote:

    ...
  • Lior Schachter at Jul 4, 2011 at 4:47 pm
    1. HBaseURLsDaysAggregator.java:124 and :131 are not important, since the
    tasks got stuck even when I removed all my map code (the thread dumps were
    generated after I restored the code). If you think it's important, I'll
    remove the map code again and regenerate the thread dumps...

    2. 82 maps were launched but only 36 ran simultaneously.

    3. hbase.zookeeper.property.maxClientCnxns = 300. Should I increase it?

    Thanks,
    Lior

    On Mon, Jul 4, 2011 at 7:33 PM, Ted Yu wrote:

    ...
  • Ted Yu at Jul 4, 2011 at 5:12 pm
    The reason I asked about HBaseURLsDaysAggregator.java was that I see no
    HBase (client) code in the call stack.
    I have little clue about the problem you're experiencing.

    There may be more than one connection to ZooKeeper from one map task,
    so it doesn't hurt to increase hbase.zookeeper.property.maxClientCnxns.

    Cheers
    On Mon, Jul 4, 2011 at 9:47 AM, Lior Schachter wrote:

    ...
  • Ted Yu at Jul 4, 2011 at 5:14 pm
    From the master UI, click 'zk dump' (:60010/zk.jsp); it shows the active
    connections. See if the count reaches 300 while the map tasks run.
    On Mon, Jul 4, 2011 at 10:12 AM, Ted Yu wrote:

    ...
  • Lior Schachter at Jul 4, 2011 at 5:14 pm
    I will increase the number of connections to 1000.

    Thanks!

    Lior



    On Mon, Jul 4, 2011 at 8:12 PM, Ted Yu wrote:

    ...
  • Ted Yu at Jul 4, 2011 at 5:35 pm
    Although the connection count may not be the root cause, please read
    http://zhihongyu.blogspot.com/2011/04/managing-connections-in-hbase-090-and.html
    if you have time.
    0.92.0 will do a much better job of managing connections.
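
    (In the 0.90 client, one hedged way to keep a task's connection footprint
    down, along the lines of that post: share a single Configuration across
    HTable instances and explicitly release the shared connection when done.
    A sketch, assuming the 0.90 HConnectionManager API; the table name is a
    placeholder:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTable;

    public class SharedConnectionExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try {
          // both tables share the connection keyed by this conf
          HTable t1 = new HTable(conf, "urls");
          HTable t2 = new HTable(conf, "urls");
          // ... use the tables ...
          t1.close();
          t2.close();
        } finally {
          // drop the shared connection (and its ZooKeeper session)
          HConnectionManager.deleteConnection(conf, true);
        }
      }
    }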
    On Mon, Jul 4, 2011 at 10:14 AM, Lior Schachter wrote:

    ...
  • Michel Segel at Jul 4, 2011 at 11:32 pm
    Did a quick trim...

    Sorry to jump in on the tail end of this...
    Two things you may want to look at...

    Are you timing out because you haven't updated your status within the task, or are you taking 600 seconds to complete a single map() iteration?

    You can test this by tracking how long you spend in each map iteration and printing out the result if it is longer than 2 minutes...

    Also try updating your status in each iteration by sending a unique status update, like the current system time...
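
    (A hedged sketch of both suggestions in one mapper, using the same new-API
    signature as the map function quoted earlier in the thread; the 2-minute
    threshold and status text are illustrative, not prescribed:)

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ScanMapper
        extends Mapper<ImmutableBytesWritable, Result, Text, MapWritable> {

      private static final long WARN_MS = 2 * 60 * 1000; // 2 minutes

      @Override
      public void map(ImmutableBytesWritable key, Result value, Context context)
          throws IOException, InterruptedException {
        long start = System.currentTimeMillis();

        // ... real per-row work would go here ...

        long elapsed = System.currentTimeMillis() - start;
        if (elapsed > WARN_MS) {
          System.err.println("slow map() iteration: " + elapsed + " ms");
        }
        // a unique status string each call, so the task tracker sees progress
        context.setStatus("alive at " + System.currentTimeMillis());
        context.progress();
      }
    }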
    ...


    Sent from a remote device. Please excuse any typos...

    Mike Segel
    On Jul 4, 2011, at 12:35 PM, Ted Yu wrote:

    Although connection count may not be the root cause, please read
    http://zhihongyu.blogspot.com/2011/04/managing-connections-in-hbase-090-and.htmlif
    you have time.
    0.92.0 would do a much better job of managing connections.
    On Mon, Jul 4, 2011 at 10:14 AM, Lior Schachter wrote:
