I'm working on a tool that runs 1,000s of regexp replace instructions
across ~200,000 files (adding up to ~4gb atm). Currently this is written as
a shell script using git grep + perl + xargs + lockfile.

The source is here: https://github.com/daaku/pie

The current implementation has a simple model:

spawn runtime.NumCPU() worker goroutines all listening on work chan
filewalk tree, sending each file on the worker channel

Additionally since most files don't match, I'm optimizing for that case by
using gommap.
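
In sketch form, the model is roughly this (boiled down, with illustrative
names rather than the actual pie code):

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "runtime"
    "sync"
)

// process stands in for reading a file, applying the regexp
// replacements, and writing it back if anything changed.
func process(path string) { fmt.Println("would process", path) }

func main() {
    root := "."
    work := make(chan string)
    var wg sync.WaitGroup

    // spawn runtime.NumCPU() workers, all draining the work channel
    for i := 0; i < runtime.NumCPU(); i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for path := range work {
                process(path)
            }
        }()
    }

    // walk the tree, sending each regular file to the workers
    filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
        if err == nil && info.Mode().IsRegular() {
            work <- path
        }
        return nil
    })
    close(work)
    wg.Wait()
}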

I'm running this on a beefy 72gb/24 core machine, and while the process
starts off well using all cores split across the threads, it falls off
after some time, slows down, and ends up using just 1-2 cores for a long
period. Eventually it goes back to using all the cores, and the cycle
repeats.

My guess was this is because it's IO bound, but that doesn't seem to be
the case, because reading _all_ of those 4gb of files takes less time than
one of the cycles where it's only using 1-2 cores. This also makes sense
because running 1,000s of regexps seems like it should end up being
CPU/memory bound.

In profiling it I see the regexp logic dominating. That's what confuses me
though, because if that is the case then shouldn't it be using all the
cores most of the time?

Any insight or pointers would be much appreciated.


Thanks,

--
-Naitik


  • Dave Cheney at Dec 1, 2012 at 7:38 am
    When your job gets down to one core, hit it with a SIGQUIT and have a
    look at the output; it should be fairly easy to tell what the goroutines
    are doing and whether they are blocked.
    --
  • Dustin Sallings at Dec 1, 2012 at 7:41 am

    Naitik Shah writes:

    In profiling it I see the regexp logic dominating. That's what
    confuses me though, because if that is the case then shouldn't it be
    using all the cores most of the time?

    Any insight or pointers would be much appreciated.
    Are you setting GOMAXPROCS?

    Are you out of memory?

    --
    dustin

    --
  • Naitik Shah at Dec 1, 2012 at 10:22 am

    When your job gets down to one core, hit it with a SIGQUIT and have a
    look at the output; it should be fairly easy to tell what the goroutines
    are doing and whether they are blocked.
    Awesome, did not know about this. I was using
    github.com/surma/stacksignal for something similar but it didn't work
    when things were stuck. With SIGQUIT I can now see they are all stuck
    trying to close their files. Not sure what this means yet.
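    (Roughly, a stack-dump-on-signal helper like that amounts to the
    sketch below -- this is not the actual stacksignal code, and SIGUSR1
    is just an illustrative choice. It needs its handler goroutine to get
    scheduled, which is presumably why it didn't help when things were
    stuck, whereas the SIGQUIT dump comes from the runtime itself.)

    package main

    import (
        "os"
        "os/signal"
        "runtime"
        "syscall"
    )

    // dumpStacksOnSignal prints all goroutine stacks to stderr whenever
    // the process receives SIGUSR1 (illustrative choice of signal).
    func dumpStacksOnSignal() {
        c := make(chan os.Signal, 1)
        signal.Notify(c, syscall.SIGUSR1)
        go func() {
            buf := make([]byte, 1<<20)
            for range c {
                n := runtime.Stack(buf, true) // true: include all goroutines
                os.Stderr.Write(buf[:n])
            }
        }()
    }

    func main() {
        dumpStacksOnSignal()
        select {} // block forever; send SIGUSR1 to get a dump
    }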
    Are you setting GOMAXPROCS?
    Yep, setting it to runtime.NumCPU().
    Are you out of memory?
    No, the last run was using ~2% of memory when it got into this state.


    --
    -Naitik

    --
  • Dave Cheney at Dec 1, 2012 at 10:37 am

    Awesome, did not know about this. I was using
    github.com/surma/stacksignal for something similar but it didn't work
    when things were stuck. With SIGQUIT I can now see they are all stuck
    trying to close their files. Not sure what this means yet.
    Are they all trying to close the same file ?

    Possibly related,
    https://github.com/daaku/pie/blob/master/pie/pie.go#L48 should be a
    defer and https://github.com/daaku/pie/blob/master/pie/pie.go#L87
    should close work to signal to the workers to exit, thus releasing
    their waitgroup semaphore, allowing wg.Wait() to complete.

    --
  • Naitik Shah at Dec 1, 2012 at 10:49 am
    I think they're trying to close different files. Here are the stack
    traces if they help: https://gist.github.com/a0c9170d83861e80619c

    I'll fix the defer (though I guess unless I see errors it probably
    isn't the cause). But I'm not sure I follow your second comment about
    close(work). I've also removed the use of mmap in an updated rev to
    eliminate the use of a foreign library, but I'm still seeing the same
    behavior.


    --
    -Naitik

    --
  • Naitik Shah at Dec 1, 2012 at 10:58 am
    Actually, I don't get the first one either: why should that be a defer?
    I want to execute that before the next loop iteration.
    --
  • Dave Cheney at Dec 1, 2012 at 10:58 am
    Sorry, that was my mistake, please ignore that advice.
    --
  • Naitik Shah at Dec 1, 2012 at 11:27 am
    It's cool, thanks for helping. I've tried a few different approaches
    but nothing seems to have helped. I'm trying a run with an explicit
    close before I remove the file, though I think unlinking open files is
    allowed. I'm also using watch+lsof and I see a stream of files being
    opened and closed.

    --
  • Naitik Shah at Dec 1, 2012 at 11:53 am
    Just had another such cycle, and based on lsof output it seems like
    the 2 running goroutines were working on some larger (400k & 500k)
    files, and all the others were blocked, presumably by the close syscall
    again. And one more cycle, again with a 3.1mb and a 140k file being
    processed and all others blocked; this one is a much longer pause.

    Seems like working with bigger files ends up blocking other threads on
    close, even though according to lsof those files must already be
    closed.

    --
  • Naitik Shah at Dec 1, 2012 at 11:59 am
    I SIGQUIT it while it was in the mostly blocked state and got a
    different stack trace. Most of the goroutines say running but don't
    have a stack trace! What does that mean?

    https://gist.github.com/f446ea777971acd9ff8a

    --
  • Kyle Lemons at Dec 1, 2012 at 5:01 pm
    They do have a stack trace, though it doesn't really make a lot of
    sense (it says they're waiting on the "go" line), which might just mean
    that the code on github doesn't match what generated the stack traces.

    A few comments:

    *WaitGroup*
    You're using sync.WaitGroup in an odd way, and I suspect it is not working.
    Use it to count workers, not work units. The model should be something
    like this:

    var wg sync.WaitGroup
    work := make(chan Work)

    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            run(work)
        }()
    }

    /* ... send work over the channel ... */
    close(work)
    wg.Wait()

    You always .Add before you go x(); the Add should not be inside the
    function that's being counted.

    *Work channel*
    I would buffer the work channel by at least as many workers as you have,
    just to prevent the scheduler from having to thrash back and forth. I
    would also use 1.5 to 2 times the number of cores you have (maybe more),
    because you're going to be I/O bound in most of your goroutines, not CPU
    bound.

    Notice that in the above I closed the work channel before I waited on the
    wait group. The worker function won't complete (and thus decrement the
    WaitGroup) until it exits its range loop over the work channel, and that
    won't happen until you close it. Once you do that, they'll start exiting
    (once the work queue is exhausted) and you can Wait for them to complete.

    *Worker function*
    Inside your worker function, you have something like for { r, ok := <-work;
    ... }; the idiomatic way of writing that is

    for r := range work {
        // ...
    }

    This loop will run once per work item and exit when the channel is closed.

    Style nit:
    if err := f.Run(compiledInstructions); err != nil {
        // log error and exit
    }

    Use the initializer clause when you're only returning an error and that
    error is what you're checking (or, more generally, when you're computing a
    value that is only relevant for one if statement).

    --
  • Naitik Shah at Dec 1, 2012 at 7:56 pm
    Thank you for all the great suggestions!

    I bumped the number of worker goroutines to 2*NumCPU, and the channel
    buffer to 4*NumCPU. GOMAXPROCS is still set to NumCPU though; does it
    make sense to set this to 2*NumCPU as well (I'm trying that now)?

    I tried a run with the updated code (also on github) and here are the
    stack traces from when it got into a pause cycle:

    https://gist.github.com/37baca8b54bad149d2c1

    Looks like only 1 goroutine shows the strange no-stack-trace behavior,
    while the others are either waiting on syscall.Close or a chan receive.
    The goroutine which sends on the worker chan is stuck on an Lstat.

    --
  • Naitik Shah at Dec 1, 2012 at 8:25 pm
    The new run with gomaxprocs=2*NumCPU shows similar behavior. Most of
    the 50 goroutines are stuck on syscall.Close and the work is basically
    stalled:

    https://gist.github.com/e0867641a03d7ac913f6

    --
  • Kevin Gillette at Dec 1, 2012 at 11:57 pm
    I'd expect goroutines to be blocked on close, since that's likely when the kernel will finish flushing the files. Since you're transforming the files fully buffered in memory, if there are any changes, the entire contents get written out. If you had a uniproc machine, it could be cpu-bound, but certainly with 24 cores, it's io-bound.

    You can possibly smooth things out by having a few goroutines dedicated to the io, with the rest doing all the in-memory processing.

    Regarding file handling, you may want to do an atomic rename into the destination instead of removing the destination and writing it anew; that way, you are guaranteed to have either the old or the new contents, as opposed to neither in the worst case.
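
    Something along these lines (a sketch with illustrative names; error
    handling is trimmed and the temp-file prefix is arbitrary):

    package main

    import (
        "io/ioutil"
        "os"
        "path/filepath"
    )

    // replaceAtomically writes contents to a temp file in the same
    // directory and then renames it over path. rename(2) is atomic
    // within a filesystem, so readers see either the old contents or
    // the new, never a partial file.
    func replaceAtomically(path string, contents []byte, perm os.FileMode) error {
        tmp, err := ioutil.TempFile(filepath.Dir(path), ".tmp-")
        if err != nil {
            return err
        }
        defer os.Remove(tmp.Name()) // harmless if the rename already succeeded
        if _, err := tmp.Write(contents); err != nil {
            tmp.Close()
            return err
        }
        if err := tmp.Chmod(perm); err != nil {
            tmp.Close()
            return err
        }
        if err := tmp.Close(); err != nil {
            return err
        }
        return os.Rename(tmp.Name(), path)
    }

    func main() {
        // illustrative usage: replace /tmp/pie-demo.txt atomically
        _ = replaceAtomically("/tmp/pie-demo.txt", []byte("new contents\n"), 0644)
    }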

    --
  • Kyle Lemons at Dec 2, 2012 at 8:37 pm

    On Sat, Dec 1, 2012 at 6:57 PM, Kevin Gillette wrote:

    I'd expect goroutines to be blocked on close, since that's likely when the
    kernel will finish flushing the files. Since you're transforming the files
    fully buffered in memory, if there are any changes, the entire contents get
    written out. If you had a uniproc machine, it could be cpu-bound, but
    certainly with 24 cores, it's io-bound.
    I didn't think about Close() waiting for the kernel write buffer to flush.
    That certainly would explain the behavior. If that's the case, you could
    defer a "go Close()" instead and see if that helps?

    --
  • Naitik Shah at Dec 2, 2012 at 9:04 pm

    I'd expect goroutines to be blocked on close, since that's likely when the
    kernel will finish flushing the files. Since you're transforming the files
    fully buffered in memory, if there are any changes, the entire contents get
    written out. If you had a uniproc machine, it could be cpu-bound, but
    certainly with 24 cores, it's io-bound.
    I didn't think about Close() waiting for the kernel write buffer to flush.
    That certainly would explain the behavior. If that's the case, you could
    defer a "go Close()" instead and see if that helps?
    I tried "go Close()" and in that case I ended up with all threads
    waiting on the close when the slow down happened. My guess was that a
    syscall locks the thread, so eventually most threads were waiting on
    the close and none were doing any work. So I tried two dedicated
    goroutines for close instead, details for that run are in my previous
    email (it showed the same odd slowdown as well).

    I also just tried not closing the files at all, since in my case that
    should be fine. But oddly lsof still showed files being opened which
    were then gone from the list, and I'm not sure yet how that was
    happening.

    My current best guess is that I'm triggering a lot of random-access IO
    (this server does not have an SSD) with so many threads reading files
    and iterating through the directory tree. I'm going to rework it next
    to split the IO, including reading, into dedicated goroutines.


    --
    -Naitik

    --
  • Naitik Shah at Dec 2, 2012 at 9:06 pm
    I should clear something up wrt the close -- I'm using
    ioutil.WriteFile, which does its own close, so the explicit close was
    on the original file that was being read.

    --
  • Dave Cheney at Dec 2, 2012 at 9:08 pm
    Run iostat -xm 3 $whateveryourdeviceis

    Wait for a stall

    Please post the last few lines.
    --
  • Naitik Shah at Dec 2, 2012 at 11:21 pm
    The files are on the local disk, though there are nfs mounts on the machine.

    Here's the iostat from before/during/after a few stalls. The third one
    is still stalled many minutes later:

    https://raw.github.com/gist/ddca326eaff358f0ddfd/88cdab29f0c9866587625e1c0d9f25171727da51/gistfile1.txt

    https://raw.github.com/gist/58c25b1273f4ceadaad5/aa3a0ee8def5b355a6bed37367566e3d84cb6b19/gistfile1.txt

    https://raw.github.com/gist/513a73f93822f4cac3b9/711143b598e995e711693e725cee3df59da58a5c/gistfile1.txt

    --
  • Dave Cheney at Dec 3, 2012 at 12:18 am
    So, to be clear.

    You are writing to files hosted on NFS? And these are the files you
    say block while closing? If both of these points are true, what are
    the mount/export settings you are using?
    --
  • Naitik Shah at Dec 3, 2012 at 1:16 am
    The files are being read and written (replaced) on the local disk. My
    home directory is on nfs, but it isn't involved in the work the
    application is doing. The disk is ext3 with no custom options (notably
    my usual mount has noatime but this one doesn't). Kernel is 3.2.18-70
    if it helps.

    --
  • Dave Cheney at Dec 3, 2012 at 1:24 am
    In that case, looking at the iostat data I cannot see any evidence of
    the device being saturated on IO:

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              13.81    0.00    0.96    0.49    0.00   84.74

    Device:   rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await   svctm  %util
    sda         0.00   36.67    0.00  255.33    0.00    0.51     4.05     0.62    2.43    2.15  54.90

    The only thing to note is 54% util, and 0.51MB written per sec is
    quite low. Is this disk sick? Are you running in a VM?

    I'm not sure it was requested before, but can you please strace your process.
    --
  • Naitik Shah at Dec 3, 2012 at 2:19 am
    It's still running and has gone through the few initial stalls it
    recovers from, but right after one of these stalls is done I see
    thousands of calls like this:

    27889 madvise(0xf89a2f8000, 4096, MADV_DONTNEED) = 0

    --
  • Dave Cheney at Dec 3, 2012 at 2:20 am
    That is the gc scavenger informing the operating system that part of
    the heap is no longer needed. Calls to read/write and close will
    probably be more interesting.
    --
  • Naitik Shah at Dec 3, 2012 at 2:36 am
    Forgot to answer your earlier question -- this isn't a vm, and I ran
    this a couple of times on another machine and it showed similar
    behavior (though this was many versions ago).

    Here's the trace from the stall:

    https://raw.github.com/gist/b4ec713c507babc319d3/e5c498615fc767f3b1a27e8f06b5fa5c1008e483/gistfile1.txt

    At this point the strace output has stopped (it's going to a file) and
    the last line in the output is where it's been stuck for several
    minutes now.

    --
  • Naitik Shah at Dec 3, 2012 at 2:49 am
    Here's an updated version of the strace output once the process resumed.

    https://raw.github.com/gist/f25803383b2e0ad8dc09/0030099c5d04bf16bda718e7797a86b5ad7bdff1/gistfile1.txt

    I'm not sure what to make of it, but there are a lot of close calls that
    eventually appeared once it got past the stall.

    --
  • Dave Cheney at Dec 3, 2012 at 3:00 am
    Is it possible to generate a simpler program that we can run on our
    machines using /dev/zero or something for sample input so we do not
    need access to your large source data ?
    --
  • Dave Cheney at Dec 5, 2012 at 7:28 am
    Repeating my request for a version of your program that we can run on
    our machines without access to your large dataset.
    --
  • Rémy Oudompheng at Dec 5, 2012 at 7:42 am

    * How many threads does your program have? (on linux, you can see
    their PIDs in /proc/PID/task). I think you have hundreds, maybe
    thousands of OS threads and they might be very large.

    * Is your process swapping ?

    * Can you run strace with the -C and -T options (or -c) so that the
    time spent in syscalls is displayed ?

    Rémy.

    --
  • Dmitry Vyukov at Dec 2, 2012 at 8:28 am

    Maybe it's the garbage collector. Try running it with the GOGCTRACE=1
    env var; it will print GC logs.

    --
  • Naitik Shah at Dec 2, 2012 at 7:31 pm
    I moved the file close into a pair of dedicated goroutines:

    https://github.com/daaku/pie/commit/641dd82d8f339d1b8599657bc3df169459a9167f

    (Still need to work on the atomic rename.)
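
    (The pattern, boiled down to a sketch with illustrative names rather
    than the exact code in that commit: workers hand finished files to a
    couple of goroutines whose only job is to call Close, so the workers
    never sit in the close syscall themselves.)

    package main

    import (
        "os"
        "sync"
    )

    // startClosers spawns n goroutines that drain a buffered channel of
    // files and close them.
    func startClosers(n, buffer int) (chan<- *os.File, *sync.WaitGroup) {
        closer := make(chan *os.File, buffer)
        var wg sync.WaitGroup
        for i := 0; i < n; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for f := range closer {
                    f.Close()
                }
            }()
        }
        return closer, &wg
    }

    func main() {
        closer, wg := startClosers(2, 10000)
        if f, err := os.Open("/etc/hostname"); err == nil {
            closer <- f // stand-in for a file a worker just finished with
        }
        close(closer) // once all workers are done sending
        wg.Wait()
    }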

    Also used GOGCTRACE=1 (thanks! learning a lot from this thread),
    and this is the output so far when the process is in the stalled
    state:

    https://gist.github.com/b171e6fcbf6071cef8d4

    As far as I can understand, the GC isn't the problem.

    The stack traces output by SIGQUIT still seem wrong. In the most
    recent version they seem to be stuck on the defer that sends the
    *os.File down the closer channel, which is odd because it has a buffer
    of 10k.


    -Naitik

    --
  • Dmitry Vyukov at Dec 3, 2012 at 4:16 am

    On Sun, Dec 2, 2012 at 11:31 PM, Naitik Shah wrote:
    As far as I can understand, the GC isn't the problem.
    As far as I can see, it is the problem.
    You have scavenger-forced GCs, and some of them take up to 140 seconds:
    gc23(4): 141287+95+2 ms 2491 -> 2452 MB 9529964 -> 9483201
    (20516886-11033685) objects 762 handoff

    Please collect stack dumps with the GOTRACEBACK=2 env var; it will show
    longer stack traces.
    It seems that the stop-the-world phase takes long for some reason, and
    there are these strange goroutines:

    goroutine 2 [running]:
    created by runtime.main
    /home/naitik/usr/go/src/pkg/runtime/proc.c:221

    You see a lot of goroutines blocked on chan send/recv because a GC is
    pending, so the runtime tries to park goroutines at safe points (one of
    which is chan operations).



    --
  • Naitik Shah at Dec 5, 2012 at 7:25 am
    Thanks for the explanation, that makes sense. Here's some sample
    output with GOTRACEBACK=2:

    https://raw.github.com/gist/6dd75b2f332f27ee0bcf/f4bab749a48070d9fb368704c2abd7bf10bc7358/gistfile1.txt

    I guess I should reuse the memory for reading files to start with.
    That will cover the dominant case of no modifications. But I'm not
    sure how to replace the use of regexp.ReplaceAll to avoid new
    allocations.
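
    (For the read side, something like the sketch below is what I have in
    mind -- illustrative names, not actual code: a per-worker buffer that
    only grows when needed, plus a cheap Match check so ReplaceAll is only
    called, and only allocates, on the rare files that actually match.)

    package main

    import (
        "io"
        "os"
        "regexp"
    )

    // processFile reads path into a reusable buffer and applies the
    // regexps, skipping the allocating ReplaceAll when there is no match.
    // It returns the (possibly grown) buffer for reuse on the next file.
    func processFile(path string, buf []byte, regexps []*regexp.Regexp, repls [][]byte) ([]byte, error) {
        f, err := os.Open(path)
        if err != nil {
            return buf, err
        }
        defer f.Close()
        fi, err := f.Stat()
        if err != nil {
            return buf, err
        }
        if n := int(fi.Size()); n > cap(buf) {
            buf = make([]byte, n) // grow only when a file doesn't fit
        } else {
            buf = buf[:n]
        }
        if _, err := io.ReadFull(f, buf); err != nil {
            return buf, err
        }
        content, changed := buf, false
        for i, re := range regexps {
            if re.Match(content) { // runs the regexp twice on matches, but most files don't match
                content = re.ReplaceAll(content, repls[i])
                changed = true
            }
        }
        if changed {
            _ = content // write content back out here (e.g. atomic rename); omitted in this sketch
        }
        return buf, nil
    }

    func main() {
        buf := make([]byte, 0, 1<<20) // reused across files by one worker
        re := regexp.MustCompile(`foo`)
        buf, _ = processFile("/etc/hostname", buf, []*regexp.Regexp{re}, [][]byte{[]byte("bar")})
        _ = buf
    }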

    --
  • Dave Cheney at Dec 5, 2012 at 7:40 am
    That stack trace is very strange: all the goroutines are at a safe
    point, and I can even see the goroutine that caused the GC pass, but I
    can't see any goroutines actually inside the gc code. Dmitry, should I
    be able to see that?

    Naitik -- I've searched the conversation but couldn't find this
    information: what OS and version of Go are you running?

    Dave
    --
  • Dmitry Vyukov at Dec 5, 2012 at 8:12 am

    I see what is happening, but I do not yet understand why.

    Most garbage collections pass normally and take ~2 seconds, which
    is fine for a 2.5GB heap:
    gc79(4): 2073+92+0 ms 2450 -> 2450 MB 9447709 -> 9447621
    (20366628-10919007) objects 801 handoff

    But some GCs take much longer:
    gc65(4): 31385+95+3 ms 2518 -> 2451 MB 9536771 -> 9453496
    (20364147-10910651) objects 952 handoff
    gc23(4): 141287+95+2 ms 2491 -> 2452 MB 9529964 -> 9483201
    (20516886-11033685) objects 762 handoff

    That longer time is the stop-the-world phase -- the GC waits for all
    goroutines to voluntarily park.

    In the log we see the goroutine that presumably caused the GC (the stack
    is somewhat strange, though):
    goroutine 2 [running]:
    runtime.gosched()
    /home/naitik/usr/go/src/pkg/runtime/proc.c:927 +0x5c
    runtime.unlock(0x1, 0xdf8475800)
    /home/naitik/usr/go/src/pkg/runtime/lock_futex.c:97 +0x55
    runtime.MHeap_Scavenger()
    /home/naitik/usr/go/src/pkg/runtime/mheap.c:370 +0x121
    runtime.goexit()
    /home/naitik/usr/go/src/pkg/runtime/proc.c:271
    created by runtime.main
    /home/naitik/usr/go/src/pkg/runtime/proc.c:221

    And these 2 goroutines are running and refuse to park:

    goroutine 4 [running]:
    runtime.entersyscall()
    /home/naitik/usr/go/src/pkg/runtime/proc.c:952 +0x37
    created by github.com/daaku/pie/pie.(*Run).Run
    /home/naitik/usr/go/src/pkg/github.com/daaku/pie/pie/pie.go:65 +0x15b

    goroutine 15 [running]:
    runtime.entersyscall()
    /home/naitik/usr/go/src/pkg/runtime/proc.c:952 +0x37
    created by github.com/daaku/pie/pie.(*Run).Run
    /home/naitik/usr/go/src/pkg/github.com/daaku/pie/pie/pie.go:65 +0x15b

    I don't know yet what the goroutines are doing and why they refuse to
    park in reasonable time.
    Is it the case that the long pause happens only when GC is triggered
    from the scavenger?

    --
  • Naitik Shah at Dec 5, 2012 at 8:31 am

    Repeating my request for a version of your program that we can run on
    our machines without access to your large dataset.
    I'm working on this for sure, will hopefully have it soon.
    What OS and version of Go are you running?
    I'm on a Linux box running 3.2.18-70 (x86_64). The binary is compiled
    on a different machine with go 1.0.3 (but I can compile it locally if
    it matters).
    How many threads does your program have? (on linux, you can see
    their PIDs in /proc/PID/task). I think you have hundreds, maybe
    thousands of OS threads and they might be very large.
    There are 43 entries in this directory, and it stays constant. I
    create NumCPU*2 worker goroutines and gomaxprocs is set to NumCPU*2.
    NumCPU = 24 on this machine.
    Is your process swapping ?
    Nope, swap usage has been 0 all this while. In total the system is
    using ~10gb of 72gb while the process is running.


    I made some changes including reusing the buffer for reading files
    (though I allocate a larger than necessary buffer atm):

    https://github.com/daaku/pie/compare/641dd82d8f...72f5056c64

    This definitely helped, but it eventually got into the bad state. Here
    are the stack traces:

    https://raw.github.com/gist/d29300318abe46d25fcc/682ae855967018484fbbab4272ffa35d0785234e/gistfile1.txt
    Can you run strace with the -C and -T options (or -c) so that the
    time spent in syscalls is displayed.

    I have the full log, but it's large. I imagine this summary might be
    enough though:

    https://raw.github.com/gist/34978a0b8d8af620d2a8/baf40a454252b5bb12cdabc4679d70d9e25eb90d/gistfile1.txt

    I should point out that tailing the strace output shows that when it
    gets into a paused state there is _nothing_ being written out. I guess
    this means it's just waiting. The last few calls are a mix of close
    and futex wait.

    --
  • Dmitry Vyukov at Dec 5, 2012 at 8:32 am


    OK, I see what is happening.

    The 4 and 15 stacks are caused by the fact that entersyscall calls
    gosave; the goroutine is actually executing somewhere else.

    It just seems that the worker goroutines are busy matching regexps for
    dozens of seconds and block the GC.

    How many different regexps do you have? And how much time can each regexp take?
    You can try inserting runtime.Gosched() into the matching loop between
    different regexps (and between any other long operations, if any) and
    see if it fixes the issue.
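
    Something like this (a sketch; the yield frequency is just an example):

    package main

    import (
        "regexp"
        "runtime"
    )

    // applyAll runs every regexp over content, periodically yielding so
    // that a pending GC's stop-the-world phase isn't blocked for the
    // whole pass over a big file.
    func applyAll(content []byte, regexps []*regexp.Regexp, repls [][]byte) []byte {
        for i, re := range regexps {
            content = re.ReplaceAll(content, repls[i])
            if i%100 == 99 {
                runtime.Gosched() // give the runtime a safe point to park this goroutine
            }
        }
        return content
    }

    func main() {
        out := applyAll([]byte("hello gopher"),
            []*regexp.Regexp{regexp.MustCompile(`gopher`)},
            [][]byte{[]byte("world")})
        _ = out
    }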

    --
  • Naitik Shah at Dec 5, 2012 at 8:47 am
    I have ~10k regular expressions. Not sure how much time each one
    takes. I'm trying a run which calls Gosched every 10th loop, will
    report back on how it goes.

    --
  • Dmitry Vyukov at Dec 5, 2012 at 8:44 am

    On Wed, Dec 5, 2012 at 12:40 PM, Naitik Shah wrote:
    I have ~10k regular expressions. Not sure how much time each one
    takes. I'm trying a run which calls Gosched every 10th loop, will
    report back on how it goes.
    Calling runtime.Gosched() every 1ms-10ms should not incur too much
    overhead and will bound the stop-the-world phase.

    --
  • Andy Balholm at Dec 5, 2012 at 2:38 pm

    On Wednesday, December 5, 2012 12:40:08 AM UTC-8, Naitik Shah wrote:

    I have ~10k regular expressions.
    What kind of regular expressions are they? Do most of them include
    distinctive text, or are they more like `\(\d\d\d\) \d\d\d-\d\d\d\d`?
    If they have distinctive text, I know how to speed things up. (Although
    it's probably also good to track down the GC problem.)

    In code.google.com/p/redwood-filter, I match URLs against a large number of
    regular expressions. But before I try the regexes, I use Aho-Corasick
    string search to narrow the possibilities. This gives a huge speedup with
    the regexes I use.

    I use some code derived from Russ's codesearch project to extract the
    strings from the regexes.
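
    A much cruder version of the same idea, just to show the shape of it
    (this is not the redwood or codesearch code; it only uses the literal
    prefix that the regexp package reports, whereas Aho-Corasick over
    extracted strings catches far more cases):

    package main

    import (
        "bytes"
        "fmt"
        "regexp"
    )

    // prefilteredReplace skips a regexp entirely when its required
    // literal prefix doesn't appear anywhere in the content, since any
    // match would have to contain that literal.
    func prefilteredReplace(content []byte, res []*regexp.Regexp, repls [][]byte) []byte {
        for i, re := range res {
            if prefix, _ := re.LiteralPrefix(); prefix != "" && !bytes.Contains(content, []byte(prefix)) {
                continue // the cheap check rules this regexp out
            }
            content = re.ReplaceAll(content, repls[i])
        }
        return content
    }

    func main() {
        res := []*regexp.Regexp{regexp.MustCompile(`oldName\(`)}
        repls := [][]byte{[]byte("newName(")}
        fmt.Printf("%s\n", prefilteredReplace([]byte("call oldName(1)"), res, repls))
    }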

    --
  • Naitik Shah at Dec 7, 2012 at 7:43 am
    Thank you all for your help. I have something that works reasonably
    well and the issues have been resolved.

    The runtime.Gosched call was key; the lack of it was what caused the
    long pauses. Like Dmitry said, a big file with 10k regexps can take a
    long time, which the GC ends up waiting on, and during that time it
    parks the other goroutines.

    Once it actually worked on the large data set, the first successful run
    took a few hours!

    The 10k regexps were the bottleneck, and as Andy suggested this was a
    case where narrowing the possibilities could help a lot. So I hooked
    it up to use the codesearch index, which helped reduce the time from a
    few hours to less than an hour!

    The code lives at https://github.com/daaku/pie if anyone's interested.

    --
