Grokbase Groups HBase user May 2016
Hey everyone,

We are noticing a file descriptor leak that is only affecting nodes in our
cluster running 5.7.0, not those still running 5.3.8. I ran an lsof against
an affected regionserver, and noticed that there were 10k+ Unix sockets
that are just called "socket", as well as another 10k+ of the form
"/dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-<int>_1_<int>". The
two seem related, based on how closely the counts match.

We are in the middle of a rolling upgrade from CDH5.3.8 to CDH5.7.0 (we
handled the namenode upgrade separately). The 5.3.8 nodes *do not*
experience this issue. The 5.7.0 nodes *do*. We are holding off upgrading
more regionservers until we can figure this out. I'm not sure if any
intermediate versions between the two have the issue.

We traced the root cause to a hadoop job running against a basic table:

'my-table-1', {TABLE_ATTRIBUTES => {MAX_FILESIZE => '107374182400',
MEMSTORE_FLUSHSIZE => '67108864'}}, {NAME => '0', VERSIONS => '50',
BLOOMFILTER => 'NONE', COMPRESSION => 'LZO', METADATA =>
{'COMPRESSION_COMPACT' => 'LZO', 'ENCODE_ON_DISK' => 'true'}}

This is very similar to all of our other tables (we have many). However,
its regions are getting up there in size, 40+ GB per region, compressed.
This has not been an issue for us previously.

The hadoop job is a simple TableMapper job with no special parameters,
though we haven't updated our client yet to the latest (we will do that once
we finish the server side). The job runs on a separate hadoop cluster,
remotely accessing the HBase cluster. It does not do any reads or writes
other than the TableMapper scans.

Moving the regions off of an affected server, or killing the hadoop job,
causes the file descriptors to gradually go back down to normal.

Any ideas?

Thanks,

Bryan


  • Ted Yu at May 23, 2016 at 4:59 pm
    Have you taken a look at HBASE-9393?
  • Bryan Beaudreault at May 23, 2016 at 5:31 pm
    Thanks Ted,

    I was not familiar with that JIRA, though I have read it now. The next time
    it happens I will run the check from the top of the ticket:

    netstat -nap | grep CLOSE_WAIT | grep 21592 | wc -l

    However, from what I can tell in the JIRA, that issue should affect almost
    all versions of HBase above 0.94. The issue we are hitting does not appear
    to affect 0.98/CDH5.3.8, and we also never saw it when we were on 0.94.
    This seems new in either 1.0+ or 1.2+.
  • Michael Stack at May 23, 2016 at 6:03 pm

    On Mon, May 23, 2016 at 9:55 AM, Bryan Beaudreault wrote:

    Hey everyone,

    We are noticing a file descriptor leak that is only affecting nodes in our
    cluster running 5.7.0, not those still running 5.3.8.

    Translation: roughly hbase-1.2.0+hadoop-2.6.0 vs hbase-0.98.6+hadoop-2.5.0.

    This is very similar to all of our other tables (we have many).

    You are doing MR against some of these also? They have different schemas?
    No leaks here?


    Any ideas?

    Is it just the FD cache running 'normally'? 10k seems like a lot though.
    256 seems to be the default in HDFS, but maybe it is different in CM or in
    HBase?

    What is your dfs.client.read.shortcircuit.streams.cache.size set to?
    St.Ack


  • Bryan Beaudreault at May 23, 2016 at 6:19 pm
    We run MR against many tables in all of our clusters; they mostly have
    similar schema definitions, though they vary in terms of key length, number
    of columns, etc. This is the only cluster and only table we've seen leak so
    far. It's probably the table with the biggest regions that we run MR
    against, though it's hard to verify that (anyone in engineering can run
    such a job).

    dfs.client.read.shortcircuit.streams.cache.size = 256

    Our typical FD count is around 3000. When this hadoop job runs, that can
    climb up to our limit of over 30k if we don't act -- it is a gradual
    build-up over the course of a couple of hours. When we move the regions off
    or kill the job, the FDs will gradually go back down at roughly the same
    pace. It forms a graph in the shape of a pyramid.

    We don't use CM; we mostly use the default *-site.xml. We haven't
    overridden anything related to this. The configs between CDH5.3.8 and
    5.7.0 are identical for us.
  • Michael Stack at May 23, 2016 at 6:53 pm

    Chatting w/ our Colin, Mr ShortCircuit, he notes that it doesn't sound
    like a leak if it comes down when you kill the job. Anything about the job
    itself that is holding open references or throwing away files w/o closing
    them? It could also just be the size that you note.

    Seems like the expiry is set to 5 minutes by default for the FD cache:

    <property>
      <name>dfs.client.read.shortcircuit.streams.cache.expiry.ms</name>
      <value>300000</value>
      <description>
        This controls the minimum amount of time
        file descriptors need to sit in the client cache context
        before they can be closed for being inactive for too long.
      </description>
    </property>

    Even waiting a while, the FD count doesn't fall?
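
    If it is useful, a quick way to double-check what a client-side Configuration
    actually resolves those two settings to, run from anything with your
    *-site.xml files on the classpath (just a sketch, nothing CDH-specific
    about it):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class CheckShortCircuitConf {
      public static void main(String[] args) {
        // HBaseConfiguration.create() picks up hbase-default.xml/hbase-site.xml
        // (plus core-site.xml); pull in hdfs-site.xml explicitly in case the
        // dfs.client.* overrides live there.
        Configuration conf = HBaseConfiguration.create();
        conf.addResource("hdfs-site.xml");
        System.out.println("dfs.client.read.shortcircuit.streams.cache.size = "
            + conf.getInt("dfs.client.read.shortcircuit.streams.cache.size", 256));
        System.out.println("dfs.client.read.shortcircuit.streams.cache.expiry.ms = "
            + conf.getLong("dfs.client.read.shortcircuit.streams.cache.expiry.ms", 300000L));
      }
    }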

    St.Ack
  • Bryan Beaudreault at May 23, 2016 at 7:13 pm
    I've forced the issue to happen again. netstat takes a while to run on this
    host while it's happening, but I do not see an abnormal number of CLOSE_WAIT
    connections (compared to other hosts).

    I forced more than the usual number of regions for the affected table onto
    the host to speed up the process. File descriptors are now growing quite
    rapidly, about 8-10 per second.

    This is what lsof looks like, multiplied by a couple thousand:

    COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
    java 23180 hbase DEL REG 0,16 3848784656 /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_1702253823
    java 23180 hbase DEL REG 0,16 3847643924 /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_1614925966
    java 23180 hbase DEL REG 0,16 3847614191 /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_888427288

    The only thing that varies is the int on the end.

    Anything about the job itself that is holding open references or throwing
    away files w/o closing them?

    The MR job runs a TableMapper directly against HBase, which as far as I
    know goes through the HBase RPC and does not hit HDFS directly at all. Is it
    possible that a long-running scan (one with many, many next() calls) could
    keep some references to HDFS open for the duration of the overall scan?

  • Bryan Beaudreault at May 23, 2016 at 7:20 pm
    For reference, the Scan backing the job is pretty basic:

    Scan scan = new Scan();
    scan.setCaching(500); // probably too small for the data size we're dealing with
    scan.setCacheBlocks(false);
    scan.setScanMetricsEnabled(true);
    scan.setMaxVersions(1);
    scan.setTimeRange(startTime, stopTime);

    Otherwise it is using the out-of-the-box TableInputFormat.
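
    The rest of the setup is just the stock TableMapReduceUtil wiring, roughly
    like the following (class and job names here are placeholders, not our
    actual code):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;

    public class MyTableScanJob {

      // The mapper receives one row per call as an HBase Result; the real
      // per-row work happens here.
      public static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
            throws IOException, InterruptedException {
          // process one row
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "my-table-1-scan");
        job.setJarByClass(MyTableScanJob.class);

        Scan scan = new Scan();                 // same settings as above
        scan.setCaching(500);
        scan.setCacheBlocks(false);
        scan.setScanMetricsEnabled(true);
        scan.setMaxVersions(1);
        scan.setTimeRange(0L, Long.MAX_VALUE);  // placeholder for our real start/stop times

        TableMapReduceUtil.initTableMapperJob(
            "my-table-1",                       // input table
            scan,
            MyMapper.class,
            ImmutableBytesWritable.class,       // mapper output key class
            Result.class,                       // mapper output value class
            job);
        job.setNumReduceTasks(0);               // map-only, no writes back to HBase

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }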


  • Michael Stack at May 23, 2016 at 8:07 pm
    How hard would it be to change the scan settings you posted, if only
    temporarily? (Trying to get a datapoint or two to act on; the short-circuit
    code hasn't changed that we know of... perhaps the scan chunking facility
    in 1.1 has some side effect we've not noticed up to this point.)

    If you up the caching to something bigger, does it lower the rate of FD
    creation?

    If you cache the blocks, assuming it does not blow the cache for others,
    does that make a difference?
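
    Concretely, something like this, mirroring the Scan you posted with just
    those two knobs changed (the 5000 is arbitrary, just to move the needle):

    Scan scan = new Scan();
    scan.setCaching(5000);        // bigger batches per RPC, so fewer next() round trips
    scan.setCacheBlocks(true);    // let the scan read through the block cache
    scan.setScanMetricsEnabled(true);
    scan.setMaxVersions(1);
    scan.setTimeRange(startTime, stopTime);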

    Hang on... will be back in a sec... just sending this in the meantime...

    St.Ack
  • Bryan Beaudreault at May 23, 2016 at 8:11 pm
    I'll try running with a higher caching value to see how that changes things. Thanks!

Discussion Overview
Group: user@hbase.apache.org
Categories: hbase, hadoop
Posted: May 23, 2016 at 4:55 PM
Active: May 23, 2016 at 8:11 PM
Posts: 10
Users: 3
Website: hbase.apache.org
