On Thu, Jun 23, 2011 at 7:11 PM, bijieshan wrote:The following steps can recreate the problem:
1. There's thousands of regions in the cluster.
2. Stop the cluster.
3. Start the cluster. Killing one regionserver while the regions were opening. Restarted it after 10 seconds.
The shutted regionserver will appear in the online list again:
2011-06-23 14:14:30,775 DEBUG org.apache.hadoop.hbase.master.LoadBalancer: Server information: 167-6-1-12,20020,1308803390123=2220, 167-6-1-13,20020,1308803391742=2374, 167-6-1-11,20020,1308803386333=2205, 167-6-1-13,20020,1308803514394=2183
Two regionservers had the same hostname but different startcode:
167-6-1-13,20020,1308803391742=2374
167-6-1-13,20020,1308803514394=2183
Some other related logs:
2011-06-23 12:31:03,192 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for 167-6-1-13,20020,1308803391742
2011-06-23 12:31:06,433 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 2468 region(s) that 167-6-1-13,20020,1308803391742 was carrying (skipping 0 regions(s) that are already in transition)
2011-06-23 12:31:54,687 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=167-6-1-13,20020,1308803514394, regionCount=0, userLoad=false
2011-06-23 12:39:30,773 DEBUG org.apache.hadoop.hbase.master.LoadBalancer: Server information: 167-6-1-12,20020,1308803390123=2942, 167-6-1-13,20020,1308803391742=2374, 167-6-1-11,20020,1308803386333=2917, 167-6-1-13,20020,1308803514394=749
2374
2011-06-23 12:32:41,315 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 2d655e795ea0044885ad60d2ea057113 on serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)
2011-06-23 12:32:41,315 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting ac98d67ec85c370a39fcf8b9b2ce76d1 on serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)
...
2011-06-23 12:34:30,613 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 24e833b7503a74c66f1c62c72549cb70 on serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)
2011-06-23 12:34:30,681 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting d6c4392a580527d23018576357abed3e on serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)
2011-06-23 12:34:30,771 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 832f72e0155c6a664c1a05a2b57e8687 on serverName=167-6-1-13,20020,1308803391742, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)
Jieshan Bean
--------------------------------------------------
Yes, it seems pretty strange.
You're right, there's so many logs, so I just picked off part of the logs.
I feel a hiding bug here. And I'm still working on it.
I feel the function of AssignmentManager#regionOnline has a problem:
1. ServerShutdownHandler will remove the server from the AssignmentManager.servers.
2. regionOnline will add the server into AssignmentManager.servers again.
What about a competition happened here?
public void regionOnline(HRegionInfo regionInfo, HServerInfo serverInfo) {
synchronized (this.regionsInTransition) {
---------------
}
synchronized (this.regions) {
// Add check
---------------
addToServers(hsiWithoutLoad, regionInfo);
this.regions.notifyAll();
}
// Remove plan if one.
clearRegionPlan(regionInfo);
// Update timers for all regions in transition going against this server.
updateTimers(serverInfo);
}
Thanks J-D.
Jieshan Bean
------------------------------------------------
As far as I can tell your server is alive because it opened the region
just fine:
Handling transition=M_ZK_REGION_OFFLINE, server=167-6-1-11:20000,
region=612342de1fe4733f72299d70addb6d11
Handling transition=RS_ZK_REGION_OPENING,
server=167-6-1-11,20020,1308106004710,
region=612342de1fe4733f72299d70addb6d11
Handling transition=RS_ZK_REGION_OPENING,
server=167-6-1-11,20020,1308106004710,
region=612342de1fe4733f72299d70addb6d11
Handling transition=RS_ZK_REGION_OPENED,
server=167-6-1-11,20020,1308106004710,
region=612342de1fe4733f72299d70addb6d11
It has a different start code from when it was assigned (1308106004710
vs 1308105402003) which is pretty strange, and on top of that the line
with M_ZK_REGION_OFFLINE shows port 20000 instead of 20020.
I believe that this is only a small snippet of a bigger log, right? I
bet there is more we could learn in there, also it'd be nice if you
told us what was happening on your cluster at the same time,
understanding the context would help.
J-D
On Wed, Jun 22, 2011 at 2:31 AM, bijieshan wrote:
1.The Rs 167-6-1-11,20020,1308105402003 finished shutdown at "10:46:37,774":
10:46:37,774 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of 167-6-1-11,20020,1308105402003
2.Overwriting happened, it seemed the RS was still exist in the set of AssignmentManager#regions:
10:45:55,081 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 612342de1fe4733f72299d70addb6d11 on serverName=167-6-1-11,20020,1308105402003, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)
3.Region was assigned to this dead RS at "10:50:20,671":
10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. to 167-6-1-11,20020,1308105402003
I'm still working on this issue. I hope anyone can give out any suggestions.
Thanks!
Jieshan Bean
Here's the logs tracked the region of "612342de1fe4733f72299d70addb6d11":
10:39:51,467 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x330912836370000 Async create of unassigned node for 612342de1fe4733f72299d70addb6d11 with OFFLINE state
10:39:51,994 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback: rs=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. state=OFFLINE, ts=1308105591467, server=167-6-1-11,20020,1308105402003
10:39:52,182 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback: rs=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. state=OFFLINE, ts=1308105591467
10:41:43,316 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=167-6-1-11,20020,1308105402003, region=612342de1fe4733f72299d70addb6d11
10:43:22,938 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 612342de1fe4733f72299d70addb6d11; deleting unassigned node
10:43:22,938 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x330912836370000 Deleting existing unassigned node for 612342de1fe4733f72299d70addb6d11 that is in expected state RS_ZK_REGION_OPENED
10:43:23,153 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x330912836370000 Successfully deleted unassigned node for region 612342de1fe4733f72299d70addb6d11 in expected state RS_ZK_REGION_OPENED
10:43:23,157 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. on 167-6-1-11,20020,1308105402003
10:43:52,300 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 2994 region(s) that 167-6-1-11,20020,1308105402003 was carrying (skipping 0 regions(s) that are already in transition)
10:45:54,914 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x330912836370000 Creating (or updating) unassigned node for 612342de1fe4733f72299d70addb6d11 with OFFLINE state
10:45:54,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. so generated a random one; hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11., src=, dest=167-6-1-12,20020,1308105397174; 2 (online=2, exclude=null) available servers
10:45:54,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. so generated a random one; hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11., src=, dest=167-6-1-12,20020,1308105397174; 2 (online=2, exclude=null) available servers
10:45:54,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. to 167-6-1-12,20020,1308105397174
10:45:55,029 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=167-6-1-12,20020,1308105397174, region=612342de1fe4733f72299d70addb6d11
10:45:55,063 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=167-6-1-12,20020,1308105397174, region=612342de1fe4733f72299d70addb6d11
10:45:55,063 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 612342de1fe4733f72299d70addb6d11; deleting unassigned node
10:45:55,063 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x330912836370000 Deleting existing unassigned node for 612342de1fe4733f72299d70addb6d11 that is in expected state RS_ZK_REGION_OPENED
10:45:55,072 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x330912836370000 Successfully deleted unassigned node for region 612342de1fe4733f72299d70addb6d11 in expected state RS_ZK_REGION_OPENED
10:45:55,081 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 612342de1fe4733f72299d70addb6d11 on serverName=167-6-1-11,20020,1308105402003, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)
10:45:55,081 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. on 167-6-1-12,20020,1308105397174
10:46:37,774 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of 167-6-1-11,20020,1308105402003
10:48:45,067 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11., src=167-6-1-12,20020,1308105397174, dest=167-6-1-11,20020,1308105402003
10:48:45,067 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. (offlining)
10:48:45,068 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=167-6-1-12,20020,1308105397174, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.
10:48:45,439 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/612342de1fe4733f72299d70addb6d11 (region=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11., server=167-6-1-12,20020,1308105397174, state=RS_ZK_REGION_CLOSED)
10:48:45,439 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/612342de1fe4733f72299d70addb6d11 (region=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11., server=167-6-1-12,20020,1308105397174, state=RS_ZK_REGION_CLOSED)
10:48:45,439 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=167-6-1-12,20020,1308105397174, region=612342de1fe4733f72299d70addb6d11
10:50:20,527 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 612342de1fe4733f72299d70addb6d11
10:50:20,580 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. state=CLOSED, ts=1308106125379
10:50:20,580 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x330912836370000 Creating (or updating) unassigned node for 612342de1fe4733f72299d70addb6d11 with OFFLINE state
10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.; plan=hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11., src=167-6-1-12,20020,1308105397174, dest=167-6-1-11,20020,1308105402003
10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11.; plan=hri=Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11., src=167-6-1-12,20020,1308105397174, dest=167-6-1-11,20020,1308105402003
10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. to 167-6-1-11,20020,1308105402003
10:50:20,692 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=167-6-1-11:20000, region=612342de1fe4733f72299d70addb6d11
10:50:20,720 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=167-6-1-11,20020,1308106004710, region=612342de1fe4733f72299d70addb6d11
10:50:20,743 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=167-6-1-11,20020,1308106004710, region=612342de1fe4733f72299d70addb6d11
10:50:20,779 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=167-6-1-11,20020,1308106004710, region=612342de1fe4733f72299d70addb6d11
10:51:33,245 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 612342de1fe4733f72299d70addb6d11; deleting unassigned node
10:51:33,245 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x330912836370000 Deleting existing unassigned node for 612342de1fe4733f72299d70addb6d11 that is in expected state RS_ZK_REGION_OPENED
10:51:33,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x330912836370000 Successfully deleted unassigned node for region 612342de1fe4733f72299d70addb6d11 in expected state RS_ZK_REGION_OPENED
10:51:33,312 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. on 167-6-1-11,20020,1308106004710