FAQ
I've written a simple UDF that parses a chararray (which looks like
...[a].....[b]...[a]...) to capture the text inside brackets and return the
counts as a String such as a=2;b=1; and so on. The input chararrays are
rarely more than 1000 characters and never more than 100000 (I've added
log.warn to my UDF to verify this). But I still see a Java heap error while
running this UDF (even in local mode, the job simply fails). My assumption
is that the maps and lists I use locally will be reclaimed by the GC. Am I
missing something?

Thanks,
Aniket


  • Dmitriy Ryaboy at Feb 24, 2011 at 3:57 am
    Aniket, share the code?
    It really depends on how you create them.

    -D
  • Jai Krishna at Feb 24, 2011 at 8:59 am
    Sharing the code would be useful, as mentioned. The heap settings the JVM was running with would also help.

    However, off the top of my head, one common situation (esp. in text processing/tokenizing) is instantiating Strings in a tight loop.

    Besides, you could also exercise your UDF in a local JVM and take a heap dump / profile it.
    If your heap is less than 512M, you could use basic profiling via hprof/hat (see http://java.sun.com/developer/technicalArticles/Programming/HPROF.html ).
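
As a rough illustration of exercising a UDF body in a local JVM, the sketch below calls a function repeatedly and reports heap use through `Runtime`. The names `HeapProbe`, `usedHeapBytes`, and `growthAfter` are hypothetical, and the lambda in `main` is only a stand-in for the real parsing code, which had not been shared at this point in the thread. On HotSpot JVMs, the `-XX:+HeapDumpOnOutOfMemoryError` option is another way to get a dump to inspect.

```java
import java.util.function.Function;

// Hypothetical local harness; none of these names come from Pig or from the thread.
final class HeapProbe {

    /** Heap currently in use by this JVM, in bytes (approximate; GC timing affects it). */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Applies body to input the given number of times and returns the change in used heap. */
    static long growthAfter(Function<String, ?> body, String input, int iterations) {
        long before = usedHeapBytes();
        for (int i = 0; i < iterations; i++) {
            body.apply(input); // result is dropped, so it should be collectable
        }
        return usedHeapBytes() - before;
    }

    public static void main(String[] args) {
        // Stand-in for the UDF body; replace with a call into the real parsing code.
        Function<String, Object> standIn = line -> line.split("\\[");
        long delta = growthAfter(standIn, "...[a].....[b]...[a]...", 100_000);
        System.out.println("change in used heap (bytes): " + delta);
        System.out.println("max heap (bytes): " + Runtime.getRuntime().maxMemory());
    }
}
```

If the real body never returns for some input, this harness will hang or run out of memory in the same way, which at least narrows the problem down to the UDF itself rather than to Pig.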

    Thanks,
    Jai


  • Aniket Mokashi at Feb 24, 2011 at 11:49 pm
    Hi Jai,

    Thanks for your email. I suspect that it's the Strings-in-a-tight-loop
    issue you suggested. I have a loop in my UDF that does the following:

        while((startInd = someLog.indexOf('[',startInd)) > 0) {
            endInd = someLog.indexOf(']', startInd);
            if(endInd > 0) {
                category = someLog.substring(startInd, endInd+1);
                cats.add(category);
            }
            startInd = endInd;
        }

    My jobs are failing in both local and MR mode. The UDF works fine for
    smaller input (a few lines). Also, I checked that the size of someLog
    doesn't exceed 10000.

    Thanks,
    Aniket


  • Dmitriy Ryaboy at Feb 25, 2011 at 12:14 am
    That's a max of 3.3K single-character strings. Even with the Java overhead,
    that shouldn't be more than a meg, right?
    None of these should make it out of young gen, assuming the list "cats"
    doesn't stick around outside the UDF.
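
The estimate above can be reproduced with a little arithmetic. In this sketch the 10000-character bound comes from the thread, while `BYTES_PER_SMALL_STRING` is an assumed round figure for illustration only; the real per-String footprint depends on the JVM version and settings. The class and method names are hypothetical.

```java
// Back-of-envelope check of the "3.3K strings, under a meg" estimate.
final class TokenMemoryEstimate {
    static final int MAX_INPUT_CHARS = 10_000;    // upper bound reported in the thread
    static final int MIN_TOKEN_CHARS = 3;         // shortest bracketed token, e.g. "[a]"
    static final int BYTES_PER_SMALL_STRING = 64; // ASSUMPTION: illustrative, not measured

    /** Most tokens a single input line could contain. */
    static int maxTokens() {
        return MAX_INPUT_CHARS / MIN_TOKEN_CHARS;
    }

    /** Estimated bytes held by the tokens of one line under the assumption above. */
    static long estimatedBytes() {
        return (long) maxTokens() * BYTES_PER_SMALL_STRING;
    }

    public static void main(String[] args) {
        System.out.println("max tokens per line: " + maxTokens());
        System.out.println("estimated bytes per line: " + estimatedBytes());
        System.out.println("under 1 MiB: " + (estimatedBytes() < (1L << 20)));
    }
}
```

Even if the assumed per-String figure were several times too low, one line's tokens would stay well under a megabyte, so a heap error points to something other than the size of a single input.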

  • Daniel Dai at Feb 25, 2011 at 12:26 am
    Hi, Aniket,
    What is your Pig script? Is the UDF on the map side or the reduce side?

    Daniel



  • Aniket Mokashi at Feb 25, 2011 at 12:47 am
    This is a map-side UDF.
    The Pig script loads a log file and grabs the contents inside square brackets:
    a = load; b = foreach a generate F(a); dump b;

    I see the following on the tasktrackers:
    2011-02-23 18:01:25,992 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call - Collection threshold init = 5439488(5312K) used = 409337824(399743K) committed = 534118400(521600K) max = 715849728(699072K)
    2011-02-23 18:01:26,102 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call- Usage threshold init = 5439488(5312K) used = 546751088(533936K) committed = 671547392(655808K) max = 715849728(699072K)

    I am trying out some changes in the UDF to see if they work.

    Thanks,
    Aniket



  • Aniket Mokashi at Feb 25, 2011 at 1:26 am
    Thanks, everyone, for helping me out. I figured out that it was one of
    those logic errors that lead to infinite loops. The indexOf operation
    doesn't always return -1 when there is nothing new to find, which was
    sending this into an infinite loop (I should have thought about this);
    i.e. indexOf('[', 187) would return 187 and the loop would continue forever.
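
The `String.indexOf(int, int)` behaviour mentioned above can be checked directly: the search includes `fromIndex` itself, and a negative `fromIndex` is treated as 0. The sketch below also replays the loop posted earlier in the thread with an iteration cap added, to show one way it can fail to terminate (an opening bracket with no closing bracket after it). This is a plausible reconstruction, not a confirmed diagnosis; `IndexOfLoopDemo`, `replayOriginalLoop`, and the starting value of `startInd` are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexOfLoopDemo {

    /**
     * Replays the loop from the thread with an iteration cap so it can be run safely.
     * Returns how many entries ended up in the list.
     */
    static int replayOriginalLoop(String someLog, int maxIterations) {
        List<String> cats = new ArrayList<>();
        int startInd = 0; // ASSUMPTION: the thread does not show the initial value
        int endInd;
        int iterations = 0;
        while ((startInd = someLog.indexOf('[', startInd)) > 0) {
            if (++iterations > maxIterations) {
                break; // safety cap added for this demo only
            }
            endInd = someLog.indexOf(']', startInd);
            if (endInd > 0) {
                cats.add(someLog.substring(startInd, endInd + 1));
            }
            startInd = endInd; // becomes -1 when no ']' follows, restarting the scan at 0
        }
        return cats.size();
    }

    public static void main(String[] args) {
        System.out.println("ab[cd".indexOf('[', 2)); // search includes fromIndex itself
        System.out.println("x[a".indexOf('[', -1));  // negative fromIndex behaves like 0
        System.out.println(replayOriginalLoop("x[a]y[b]z", 1000)); // well-formed input terminates
        System.out.println(replayOriginalLoop("x[a]y[b", 1000));   // unterminated '[' runs into the cap
    }
}
```

In the unterminated case the list keeps growing for as long as the loop is allowed to run, which would be consistent with the heap errors reported earlier, although the thread does not confirm that this was the exact input that triggered them.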
    Thanks again,
    Aniket
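
For readers who land here with the same problem, a minimal sketch of a loop that always advances is given below. It is not the code the poster ended up with (that was never shared); `BracketCounter` and `countCategories` are hypothetical names, and the output format follows the a=2;b=1; example from the original question. Each search starts one position past the previous match, and an unterminated bracket ends the scan.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BracketCounter {

    /** Counts the text found between '[' and ']' pairs, e.g. "a=2;b=1;". */
    static String countCategories(String someLog) {
        // LinkedHashMap keeps categories in first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        int startInd = someLog.indexOf('[');
        while (startInd >= 0) { // >= 0 so a '[' at index 0 is not skipped
            int endInd = someLog.indexOf(']', startInd + 1);
            if (endInd < 0) {
                break; // unterminated '[': stop instead of rescanning
            }
            String category = someLog.substring(startInd + 1, endInd);
            counts.merge(category, 1, Integer::sum);
            startInd = someLog.indexOf('[', endInd + 1); // always moves forward
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.append(e.getKey()).append('=').append(e.getValue()).append(';');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(countCategories("...[a].....[b]...[a]..."));
    }
}
```

Because both searches start strictly after the previous match, the loop runs at most once per opening bracket, so the map cannot grow beyond the number of brackets in the line.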





Discussion Overview
group: user @
categories: pig, hadoop
posted: Feb 24, '11 at 3:50a
active: Feb 25, '11 at 1:26a
posts: 8
users: 4
website: pig.apache.org
