
[HBase-user] Average RPC Queue Time

Shawn Hermans
Nov 20, 2013 at 4:31 pm
I am using CDH 4.3.1 with HBase 0.94.6. In Cloudera Manager, I notice that a
metric called Average RPC Queue Time looks abnormal. It is normally over 3
hours and drops to a few minutes during non-peak times. What is the
meaning of this metric? Are these high queue times normal?

Thanks,
Shawn

8 responses

  • Bryan Beaudreault at Nov 20, 2013 at 4:56 pm
    A regionserver is configured with a certain number of RPC handlers
    (hbase.regionserver.handler.count). When these handlers are all occupied,
    the calls back up into a callQueue. This call queue is bounded by
    ipc.server.max.callqueue.size (defaulting to 1GB of serialized requests)
    and ipc.server.max.callqueue.length (10 * numHandlers). So, with 5
    handlers a maximum of 50 calls will be queued up before requests are
    rejected outright.
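
    As a rough illustration of the arithmetic above (a sketch, not HBase
    code; the function and constant names are made up for this example), the
    default queue bounds work out as follows:

```python
# Hypothetical helper names; only the 10-per-handler multiplier and the 1GB
# size bound come from the description above.
CALLS_PER_HANDLER = 10
MAX_CALL_QUEUE_BYTES = 1024 ** 3  # 1GB of serialized requests


def max_call_queue_length(handler_count, per_handler=CALLS_PER_HANDLER):
    """Number of calls that can wait in the queue before rejection."""
    return handler_count * per_handler


print(max_call_queue_length(5))  # 50, matching the 5-handler example
```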

    The RpcQueueTime metrics measure how long individual calls stay in this
    queued state. If your handlers were never 100% occupied, this value would
    be 0. An average of 3 hours is concerning: it basically means that when a
    call comes into the RegionServer it waits on average 3 hours before
    processing starts, because all handlers are occupied for that long.
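
    To make the idea of queue time concrete, here is a toy model (an
    assumption-laden sketch, not how HBase computes the metric): calls are
    served first-come-first-served by a fixed pool of handlers, every call
    takes the same service time, and queue time is the gap between a call's
    arrival and the moment a handler picks it up. The function name and the
    sample numbers are invented for illustration.

```python
import heapq


def average_queue_time(arrivals, service_time, handlers):
    """Average wait before a handler picks up each call (FIFO toy model)."""
    if not arrivals:
        return 0.0
    # Min-heap of the times at which each handler next becomes free.
    free_at = [0.0] * handlers
    heapq.heapify(free_at)
    total_wait = 0.0
    for arrival in sorted(arrivals):
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)  # wait only if no handler is idle
        total_wait += start - arrival
        heapq.heappush(free_at, start + service_time)
    return total_wait / len(arrivals)


# Handlers always have spare capacity: calls never wait.
light = average_queue_time([0, 10, 20, 30], service_time=5, handlers=2)
# Four calls arrive at once but only two handlers exist: two calls wait.
heavy = average_queue_time([0, 0, 0, 0], service_time=5, handlers=2)
print(light, heavy)  # 0.0 2.5
```

    In this model the average stays at 0 as long as an idle handler is always
    available, and it grows only once every handler is busy.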

    You can lower this time through a few options:

    - Up the max number of handlers (beware using too many, as this just shifts
    load to the disks, and also takes more memory)
    - Make your requests smaller (use caching or batching on a scan to return
    less data per RPC call)
    - Lower your client-side timeouts, so that you can handle the issue on the
    client side (i.e. retries)
    - Investigate disk or network issues that could be causing extremely slow
    response times (ensure data is 100% local, too)

    Just for perspective, the nominal operating value of this probably varies
    greatly with the workload/environment, but in our clusters we have an
    Average RPC Queue Time of near 0. We only see the callQueue fill up in the
    case of real problems, and almost always respond with immediate
    redistribution of data to other servers.

    HTH

      - Bryan

  • Vladimir Rodionov at Nov 20, 2013 at 5:08 pm

    The RpcQueueTime metrics are a measurement of how long individual calls
    stay in this queued state. If your handlers were never 100% occupied, this
    value would be 0. An average of 3 hours is concerning, it basically means
    that when a call comes into the RegionServer it takes on average 3 hours to
    start processing, because handlers are all occupied for that amount of time.
    This metric is definitely meaningless as reported: the default RPC timeout
    is 60 seconds, and under no circumstances can call data survive longer
    than those 60 seconds in the callQueue unless we have a bug.

    Best regards,
    Vladimir Rodionov
    Principal Platform Engineer
    Carrier IQ, www.carrieriq.com
    e-mail: vro...@...com

  • Jean-Marc Spaggiari at Nov 20, 2013 at 5:24 pm
    But that will depend on the timeout that they have configured, right?

    I have seen some third-party applications recommend increasing timeouts
    to 1h30...

    JMS
  • Shawn Hermans at Nov 20, 2013 at 5:47 pm
    Our hbase.rpc.timeout is set to 60 seconds. I am confused as to why I
    would see such large values for the average RPC queue time. Are there any
    other metrics I should look at? The RPC call queue length is consistently
    between 150 and 200 during peak usage time. Is this normal?

    Regards,
    Shawn
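
    One rough consistency check on these numbers (an editorial sketch, not
    something proposed in the thread) is Little's law, which for a queue in
    steady state relates average queue length L, throughput rate and average
    wait W as L = rate * W. The midpoint of 175 queued calls and the function
    name are assumptions made for illustration.

```python
def implied_throughput(queue_length, avg_wait_seconds):
    """Calls per second implied by Little's law: L = rate * W."""
    return queue_length / avg_wait_seconds


# Assumed midpoint of the reported 150-200 queue length.
QUEUE_LENGTH = 175

# If calls really waited about 3 hours on average:
rate_at_3h = implied_throughput(QUEUE_LENGTH, 3 * 3600)
# If calls waited no longer than the 60-second RPC timeout:
rate_at_60s = implied_throughput(QUEUE_LENGTH, 60)

print(f"{rate_at_3h:.3f} calls/s vs {rate_at_60s:.2f} calls/s")
```

    Under these assumptions a 3-hour average wait would imply the server
    drains roughly one call per minute, which is hard to reconcile with a busy
    RegionServer and points toward the reported value being suspect.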

  • Bryan Beaudreault at Nov 20, 2013 at 5:52 pm
    I'm not sure about the Cloudera Manager UI, but the metric posted to JMX is
    in milliseconds. Are we sure that is not what is causing the confusion?

  • Shawn Hermans at Nov 20, 2013 at 5:55 pm
    Shouldn't be. Looks like Cloudera just converts it to nicer values. So
    the actual peak value is 14438088.62 ms for Average RPC queue time.
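
    For reference, the unit conversion can be checked with a few lines (a
    small sketch; the helper name is made up, and the 60-second figure is the
    hbase.rpc.timeout value mentioned earlier in the thread):

```python
MS_PER_HOUR = 3_600_000
RPC_TIMEOUT_MS = 60_000  # hbase.rpc.timeout of 60 seconds


def ms_to_hours(milliseconds):
    """Convert a duration in milliseconds to hours."""
    return milliseconds / MS_PER_HOUR


peak_ms = 14438088.62
peak_hours = ms_to_hours(peak_ms)
timeouts_elapsed = peak_ms / RPC_TIMEOUT_MS

print(f"{peak_hours:.2f} hours, about {timeouts_elapsed:.0f}x the RPC timeout")
```

    So the raw value does correspond to roughly 4 hours, which rules out a
    simple milliseconds-versus-seconds display mix-up.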

  • Bryan Beaudreault at Nov 20, 2013 at 6:10 pm
    I'm not sure why it is so much higher than your rpc timeout. Enabling
    DEBUG log level on org.apache.hadoop.ipc.HBaseServer.trace and
    org.apache.hadoop.ipc.HBaseServer loggers might provide you with some
    insight.

  • Shawn Hermans at Nov 20, 2013 at 7:16 pm
    Thanks for all the help. Follow-up question. Is it normal to see the
    average RPC call queue length stay at over 100 for times of peak usage?

