I'm looking for code that will calculate the running median of a
sequence, efficiently. (I'm trying to subtract the running median from
a signal to correct for gradual drift).

My naive attempt (taking the median of a sliding window) is
unfortunately too slow as my sliding windows are quite large (~1k) and
so are my sequences (~50k). On my PC it takes about 18 seconds per
sequence. 17 of those seconds is spent in sorting the sliding windows.

I've googled around and it looks like there are some recent journal
articles on it, but no code. Any suggestions?

Thanks
Janto


  • Peter Otten at Oct 13, 2009 at 4:12 pm

    Janto Dreijer wrote:

    I'm looking for code that will calculate the running median of a
    sequence, efficiently. (I'm trying to subtract the running median from
    a signal to correct for gradual drift).

    My naive attempt (taking the median of a sliding window) is
    unfortunately too slow as my sliding windows are quite large (~1k) and
    so are my sequences (~50k). On my PC it takes about 18 seconds per
    sequence. 17 of those seconds is spent in sorting the sliding windows.

    I've googled around and it looks like there are some recent journal
    articles on it, but no code. Any suggestions?
    If you aren't using numpy, try that. Showing your code might also be a good
    idea...

    Peter
  • Janto Dreijer at Oct 14, 2009 at 12:30 am

    On Oct 13, 6:12 pm, Peter Otten wrote:
    Janto Dreijer wrote:
    I'm looking for code that will calculate the running median of a
    sequence, efficiently. (I'm trying to subtract the running median from
    a signal to correct for gradual drift).
    My naive attempt (taking the median of a sliding window) is
    unfortunately too slow as my sliding windows are quite large (~1k) and
    so are my sequences (~50k). On my PC it takes about 18 seconds per
    sequence. 17 of those seconds is spent in sorting the sliding windows.
    I've googled around and it looks like there are some recent journal
    articles on it, but no code. Any suggestions?
    If you aren't using numpy, try that. Showing your code might also be a good
    idea...

    Peter
    I placed the test code and its output here:
    http://bitbucket.org/janto/snippets/src/tip/running_median.py

    I also have a version that uses numpy. On random data it seems to be
    about twice as fast as the pure python one. It spends half the time
    sorting and the other half converting the windows from python lists to
    numpy arrays.
    If the data is already sorted, the version using python's builtin sort
    outperforms the numpy convert-and-sort by about 5 times. Strangely
    satisfying :)

    I'm hoping to find a trick to do better than scipy.signal.medfilt.
    Looking at its code (PyArray_OrderFilterND in
    http://svn.scipy.org/svn/scipy/trunk/scipy/signal/sigtoolsmodule.c)
    there may be hope. It looks like they're sorting the entire window
    instead of doing a QuickSelect (which I could do with STL's
    nth_element).
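    For what it's worth, newer numpy releases expose a quickselect-style partial sort as numpy.partition; a sketch of a per-window median along those lines (my own illustration, assuming a recent numpy, not the nth_element wrapper discussed here):

```python
import numpy as np

def window_median(window):
    """Median of one window via numpy.partition (quickselect-style).

    Partitioning places the k-th smallest element at index k in O(n)
    expected time, so no full O(n log n) sort of the window is needed.
    """
    a = np.asarray(window, dtype=float)
    n = len(a)
    mid = n // 2
    if n % 2:
        # odd length: the middle element is the median
        return np.partition(a, mid)[mid]
    # even length: average the two middle elements
    part = np.partition(a, [mid - 1, mid])
    return 0.5 * (part[mid - 1] + part[mid])
```

Applying this per window is still linear in the window size per step, but it avoids the constant factor of a full sort, which is the same saving nth_element would give.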

    Janto
  • Peter Otten at Oct 14, 2009 at 2:53 pm

    Janto Dreijer wrote:
    On Oct 13, 6:12 pm, Peter Otten wrote:
    Janto Dreijer wrote:
    I'm looking for code that will calculate the running median of a
    sequence, efficiently. (I'm trying to subtract the running median from
    a signal to correct for gradual drift).
    My naive attempt (taking the median of a sliding window) is
    unfortunately too slow as my sliding windows are quite large (~1k) and
    so are my sequences (~50k). On my PC it takes about 18 seconds per
    sequence. 17 of those seconds is spent in sorting the sliding windows.
    I've googled around and it looks like there are some recent journal
    articles on it, but no code. Any suggestions?
    If you aren't using numpy, try that. Showing your code might also be a
    good idea...

    Peter
    I placed the test code and its output here:
    http://bitbucket.org/janto/snippets/src/tip/running_median.py
    That gives me something to tinker with ;)
    I also have a version that uses numpy. On random data it seems to be
    about twice as fast as the pure python one. It spends half the time
    sorting and the other half converting the windows from python lists to
    numpy arrays.
    If the data is already sorted, the version using python's builtin sort
    outperforms the numpy convert-and-sort by about 5 times. Strangely
    satisfying :)
    I was thinking of using as many of numpy's bulk operations as possible:

    from numpy import array, median

    def running_median_numpy(seq):
        # window_size is assumed to be defined at module level
        data = array(seq, dtype=float)
        result = []
        for i in xrange(1, window_size):
            window = data[:i]
            result.append(median(window))
        for i in xrange(len(data) - window_size + 1):
            window = data[i:i + window_size]
            result.append(median(window))
        return result

    But it didn't help as much as I had hoped.
    The fastest I came up with tries hard to keep the data sorted:

    from bisect import bisect_left, insort
    from collections import deque
    from itertools import islice

    def running_median_insort(seq):
        seq = iter(seq)
        d = deque()
        s = []
        result = []
        for item in islice(seq, window_size):
            d.append(item)
            insort(s, item)
            result.append(s[len(d) // 2])
        m = window_size // 2
        for item in seq:
            old = d.popleft()
            d.append(item)
            del s[bisect_left(s, old)]
            insort(s, item)
            result.append(s[m])
        return result

    Some numbers:

    10.197 seconds for running_median_scipy_medfilt
    25.043 seconds for running_median_python
    13.040 seconds for running_median_python_msort
    14.280 seconds for running_median_python_scipy_median
    4.024 seconds for running_median_numpy
    0.221 seconds for running_median_insort

    What would be an acceptable performance, by the way?

    Peter

    PS: code not tested for correctness
  • Janto Dreijer at Oct 14, 2009 at 9:54 pm

    On Oct 14, 4:53 pm, Peter Otten wrote:
    Some numbers:

    10.197 seconds for running_median_scipy_medfilt
    25.043 seconds for running_median_python
    13.040 seconds for running_median_python_msort
    14.280 seconds for running_median_python_scipy_median
    4.024 seconds for running_median_numpy
    0.221 seconds for running_median_insort

    What would be an acceptable performance, by the way?
    That's great!
    Well, the faster it works, the better. It means I can process more
    data before getting frustrated. So if you have a faster version I'd
    like to see it :)

    Thank you!
    Janto
  • Dave Angel at Oct 13, 2009 at 5:27 pm

    Janto Dreijer wrote:
    I'm looking for code that will calculate the running median of a
    sequence, efficiently. (I'm trying to subtract the running median from
    a signal to correct for gradual drift).

    My naive attempt (taking the median of a sliding window) is
    unfortunately too slow as my sliding windows are quite large (~1k) and
    so are my sequences (~50k). On my PC it takes about 18 seconds per
    sequence. 17 of those seconds is spent in sorting the sliding windows.

    I've googled around and it looks like there are some recent journal
    articles on it, but no code. Any suggestions?

    Thanks
    Janto
    Since the previous window is sorted, it should just be a "case of delete
    one, add one" in the sorted list. And if that isn't fast enough,
    consider some form of tree.

    DaveA
  • Ethan Furman at Oct 13, 2009 at 5:37 pm

    Janto Dreijer wrote:
    I'm looking for code that will calculate the running median of a
    sequence, efficiently. (I'm trying to subtract the running median from
    a signal to correct for gradual drift).

    My naive attempt (taking the median of a sliding window) is
    unfortunately too slow as my sliding windows are quite large (~1k) and
    so are my sequences (~50k). On my PC it takes about 18 seconds per
    sequence. 17 of those seconds is spent in sorting the sliding windows.

    I've googled around and it looks like there are some recent journal
    articles on it, but no code. Any suggestions?

    Thanks
    Janto
    You might look at http://pypi.python.org/pypi/blist/0.9.4

    ~Ethan~
  • Dale Dalrymple at Oct 13, 2009 at 6:29 pm

    On Oct 13, 8:22 am, Janto Dreijer wrote:
    I'm looking for code that will calculate the running median of a
    sequence, efficiently. (I'm trying to subtract the running median from
    a signal to correct for gradual drift).
    ...
    Any suggestions?
    For a reference try:

    Comparison of algorithms for standard median filtering
    Juhola, M. Katajainen, J. Raita, T.
    Signal Processing, IEEE Transactions on
    Jan. 1991, Volume: 39, Issue: 1, page(s): 204-208
    Abstract

    An algorithm I have used comes from:

    On computation of the running median
    Astola, J.T. Campbell, T.G.
    Acoustics, Speech and Signal Processing, IEEE Transactions on
    Apr 1989, Volume: 37, Issue: 4, page(s): 572-574

    This is a dual heap approach. No code though. There are some obvious
    (yeah, right) errors in their pseudo-code.

    The point of the dual-heap algorithm (loosely put) is that sliding
    the window by one element costs about two bubble (sift) passes of
    O(log window-size) each, instead of a sort of the whole window.
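A minimal Python sketch of that dual-heap idea (my own illustration using heapq with lazy deletion of expired elements, not the Astola-Campbell pseudo-code; it returns the lower-middle element for even window lengths):

```python
import heapq
from collections import deque

def running_median_heaps(seq, window_size):
    """Sliding-window median with two heaps and lazy deletion.

    lo is a max-heap (values negated) holding the smaller half of the
    window; hi is a min-heap holding the larger half.  Elements that
    fall out of the window are deleted lazily, when they surface at a
    heap top.
    """
    lo, hi = [], []        # heaps of (key, index) pairs
    side = {}              # live index -> which heap it currently lives in
    lo_n = hi_n = 0        # live-element counts per heap
    window = deque()
    out = []

    def prune(heap):
        # discard expired entries sitting on top of a heap
        while heap and heap[0][1] not in side:
            heapq.heappop(heap)

    for i, x in enumerate(seq):
        window.append((x, i))
        prune(lo)
        if lo and x <= -lo[0][0]:
            heapq.heappush(lo, (-x, i)); side[i] = 'lo'; lo_n += 1
        else:
            heapq.heappush(hi, (x, i)); side[i] = 'hi'; hi_n += 1

        if len(window) > window_size:
            _, old = window.popleft()
            if side.pop(old) == 'lo':   # mark the departing element dead
                lo_n -= 1
            else:
                hi_n -= 1

        # restore the invariant lo_n == hi_n or lo_n == hi_n + 1
        while lo_n > hi_n + 1:
            prune(lo)
            v, j = heapq.heappop(lo)
            heapq.heappush(hi, (-v, j)); side[j] = 'hi'
            lo_n -= 1; hi_n += 1
        while hi_n > lo_n:
            prune(hi)
            v, j = heapq.heappop(hi)
            heapq.heappush(lo, (-v, j)); side[j] = 'lo'
            hi_n -= 1; lo_n += 1

        prune(lo)
        out.append(-lo[0][0])  # maximum of the lower half = median
    return out
```

Each step does a constant number of heap operations, so sliding the window costs O(log window-size) amortized, matching the loose description above.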

    Good luck.

    Dale B. Dalrymple
  • Janto Dreijer at Oct 13, 2009 at 9:23 pm

    On Oct 13, 8:29 pm, Dale Dalrymple wrote:
    On Oct 13, 8:22 am, Janto Dreijer wrote:

    I'm looking for code that will calculate the running median of a
    sequence, efficiently. (I'm trying to subtract the running median from
    a signal to correct for gradual drift).
    ...
    Any suggestions?
    For a reference try:

    Comparison of algorithms for standard median filtering
    Juhola, M., Katajainen, J., Raita, T.
    Signal Processing, IEEE Transactions on
    Jan. 1991, Volume: 39, Issue: 1, page(s): 204-208
    Abstract

    An algorithm I have used comes from:

    On computation of the running median
    Astola, J.T., Campbell, T.G.
    Acoustics, Speech and Signal Processing, IEEE Transactions on
    Apr 1989, Volume: 37, Issue: 4, page(s): 572-574

    This is a dual heap approach. No code though. There are some obvious
    (yeah, right) errors in their pseudo-code.

    The point of the dual-heap algorithm (loosely put) is that sliding
    the window by one element costs about two bubble (sift) passes of
    O(log window-size) each, instead of a sort of the whole window.

    Good luck.

    Dale B. Dalrymple
    Thanks. I'll have a look.

    I found a PDF by Soumya D. Mohanty entitled "Efficient Algorithm for
    computing a Running Median" (2003) by Googling. It has code snippets
    at the end, but it's not going to be a simple cut-and-paste job. It
    will take some work figuring out the missing parts.

    Regards
    Janto
  • Raymond Hettinger at Oct 15, 2009 at 10:41 am
    [Janto Dreijer]
    I found a PDF by Soumya D. Mohanty entitled "Efficient Algorithm for
    computing a Running Median" (2003) by Googling. It has code snippets
    at the end, but it's not going to be a simple cut-and-paste job. It
    will take some work figuring out the missing parts.
    See http://code.activestate.com/recipes/576930/ for working Python
    code adapted from that paper.


    Raymond
  • Janto Dreijer at Oct 15, 2009 at 7:33 pm

    On Oct 15, 12:41 pm, Raymond Hettinger wrote:
    [Janto Dreijer]
    I found a PDF by Soumya D. Mohanty entitled "Efficient Algorithm for
    computing a Running Median" (2003) by Googling. It has code snippets
    at the end, but it's not going to be a simple cut-and-paste job. It
    will take some work figuring out the missing parts.
    See http://code.activestate.com/recipes/576930/ for working Python
    code adapted from that paper.

    Raymond
    Wow! Thanks. I'll have a look.

    Janto
  • Denis at Oct 19, 2009 at 2:38 pm
    Folks,
    approximate medians -- would you settle for 49 - 51 % ? --
    open up new possibilities, and there's quite a lot of work on that,
    for huuuge datasets.

    A class of problems:
    from a data stream X1 X2 ... you want, every so often,
    a histogram / quantile summary / distribution estimator such that
    H approximates the distribution of X, so
    median of H(t) ~ median of some of X.
    Some parameters of this class:
    NH, size of H: 100 => H(p) ~ pth percentile of X
    A, how often you want H
    window, drop or age old data
    runtime, memory of course
    accuracy -- in theory, in practice

    A histogram viewer in which you can change buckets, colors, zoom,
    in REALTIME sounds tantalizing -- fast loop >> fast algorithm + slow
    loop.
    Anyone know of one off-the-shelf, python or anything else?

    Refs:
    Manku et al., Approximate medians in one pass with limited memory,
    1998, 10p
    under http://infolab.stanford.edu/~manku/papers.html
    nice tree pictures
    they optimize mem (NH * Nbuf) not runtime, and don't window

    Suri +, Quantiles on Streams, 2006, 5p, http://www.cs.ucsb.edu/~suri/psdir/ency.pdf
    ~ 20 refs zzz

    cheers
    -- denis
  • Denis at Oct 20, 2009 at 9:28 am
    Yet another link: http://en.wikipedia.org/wiki/Median_filter
    --> Perreault + Hebert, Median Filtering in Constant Time,
    nice 6-page paper + 400 lines well-documented C code:
    http://nomis80.org/ctmf.html
    (for 2d images, dropping to 1d must be possible :)

    They're also in opencv, which I haven't tried --
    http://www.intel.com/technology/computing/opencv
    http://code.google.com/p/ctypes-opencv
    http://code.google.com/p/opencv-cython

    cheers
    -- denis
  • Denis at Oct 26, 2009 at 3:28 pm
    Based on Perreault + Hebert, here's python code for a rather different
    algorithm:
    - for quantized e.g. 8-bit data
    - runs faster for wider windows / slowly changing medians
    - no heaps, no trees: the only import is numpy, and even that's not
    essential

    http://stackoverflow.com/questions/1309263/rolling-median-algorithm-in-c
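In that spirit, a minimal 1-d sketch of the histogram approach for quantized data (my own illustration, not the posted code; lower-middle convention for even window lengths, and a plain histogram scan per step rather than the remembered-cursor refinement):

```python
def running_median_hist(seq, window_size, levels=256):
    """Sliding median for quantized (e.g. 8-bit) integer data.

    The window is summarized by a histogram with `levels` buckets;
    sliding the window by one element updates only two buckets, and
    the median is read off by scanning the histogram.  (Perreault &
    Hebert additionally keep a cursor so the scan is amortized O(1);
    that refinement is omitted here for clarity.)
    """
    hist = [0] * levels
    out = []
    for i, x in enumerate(seq):
        hist[x] += 1                          # new sample enters the window
        if i >= window_size:
            hist[seq[i - window_size]] -= 1   # oldest sample leaves
        count = min(i + 1, window_size)
        target = (count - 1) // 2             # lower-middle convention
        c = 0
        for level in range(levels):
            c += hist[level]
            if c > target:
                out.append(level)
                break
    return out
```

The per-step cost is independent of the window width, which is why the approach runs faster for wide windows, as noted above.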

    cheers
    -- denis
  • Sturlamolden at Oct 13, 2009 at 6:57 pm

    On 13 Oct, 18:33, Paul Rubin wrote:

    The obvious way to compute a running median involves a tree structure
    so you can quickly insert and delete elements, and find the median.
    That would be asymptotically O(n log n) but messy to implement.
    QuickSelect will find the median in O(log n) time.
  • Janto Dreijer at Oct 13, 2009 at 9:15 pm

    On Oct 13, 9:59 pm, Paul Rubin wrote:
    sturlamolden <sturlamol... at yahoo.no> writes:
    The obvious way to compute a running median involves a tree structure
    so you can quickly insert and delete elements, and find the median.
    That would be asymptotically O(n log n) but messy to implement.
    QuickSelect will find the median in O(log n) time.
    That makes no sense, you can't even examine the input in O(log n) time.
    True. I think he means O(n).
    I've tried wrapping the lesser-known nth_element function from the C++
    STL (http://www.cppreference.com/wiki/stl/algorithm/nth_element).
    Unfortunately it seems the conversion to std::vector<double> is quite
    slow...I'll have a look again. Will probably have to rewrite my whole
    median_filter function in C++ to avoid unnecessary conversions.
    Anyway, the problem isn't to find the median of n elements. It's to
    find n different medians of n different sets. You might be able to
    get close to linear time though, depending on how the data acts.
    Yes, that's what I've been hoping for. If I want it reeeeaaally fast
    I'll have to figure out how to do that.

    Thanks
    Janto
  • Steven D'Aprano at Oct 13, 2009 at 10:03 pm

    On Tue, 13 Oct 2009 12:59:47 -0700, Paul Rubin wrote:

    sturlamolden <sturlamolden at yahoo.no> writes:
    The obvious way to compute a running median involves a tree structure
    so you can quickly insert and delete elements, and find the median.
    That would be asymptotically O(n log n) but messy to implement.
    QuickSelect will find the median in O(log n) time.
    That makes no sense, you can't even examine the input in O(log n) time.
    If that were a relevant argument, no algorithms would ever be described
    as running in O(log n).

    Obviously to run in O(log n) you must have already built the tree. If you
    include the time required to build the tree, then O(n) is the lower-bound
    (because you have to see each element to put it in the tree) but that
    applies to any data structure. The point is, once you have that tree, you
    can perform repeated calculations in O(log n) time (or so it has been
    claimed).

    For example, if you want the m running medians of m data points, you
    might do the following:

    find the median of element 1
    find the median of element 1 and 2
    find the median of element 1 through 3
    find the median of element 1 through 4
    ...
    find the median of element 1 through m

    Each median calculation may take O(log n), where n varies from 1 to m,
    giving a total order of:

    log 1 + log 2 + log 3 + ... + log m
    => O(log m!)

    steps, which is much better than:

    1 + 2 + 3 + ... + m = 1/2*m*(m+1)
    => O(m**2)

    Anyway, the problem isn't to find the median of n elements. It's to
    find n different medians of n different sets. You might be able to get
    close to linear time though, depending on how the data acts.
    But since the data sets aren't independent, a tree structure seems like
    the way to go to me. Inserting and deleting elements should be fast, and
    assuming the claim of calculating the median in O(log n) is correct,
    that's quite fast too.




    --
    Steven
  • Sturlamolden at Oct 13, 2009 at 11:20 pm

    On 14 Oct, 00:03, Steven D'Aprano wrote:

    Obviously to run in O(log n) you must have already built the tree.
    You don't need a tree. Quickselect is a partial quicksort.

    But my memory served me badly, quickselect is O(n).
  • Raymond Hettinger at Oct 13, 2009 at 11:40 pm

    On Oct 13, 11:57 am, sturlamolden wrote:
    On 13 Okt, 18:33, Paul Rubin wrote:

    The obvious way to compute a running median involves a tree structure
    so you can quickly insert and delete elements, and find the median.
    That would be asymptotically O(n log n) but messy to implement.
    QuickSelect will find the median in O(log n) time.
    Right. See http://code.activestate.com/recipes/269554/ for a sample
    implementation in Python.


    Raymond
  • Daniel Stutzbach at Oct 13, 2009 at 7:22 pm

    On Tue, Oct 13, 2009 at 10:22 AM, Janto Dreijer wrote:

    I'm looking for code that will calculate the running median of a
    sequence, efficiently. (I'm trying to subtract the running median from
    a signal to correct for gradual drift).
    In the past, I've used the following algorithm which approximates the
    running median. I got it from an article in a peer-reviewed journal, but I
    can't locate the reference at the moment.

    my_median = some_reasonable_default
    step_size = some_value_depending_on_your_application

    def adjust_median(value):
        global my_median
        if my_median < value:
            my_median += step_size
        else:
            my_median -= step_size
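The same step-adjustment idea can be wrapped without the global (a hypothetical make_median_tracker helper for illustration, not from the original post):

```python
def make_median_tracker(initial, step_size):
    """Closure version of the step-adjustment approximate median."""
    state = {'median': initial}

    def adjust(value):
        # nudge the estimate toward each new sample by a fixed step
        if state['median'] < value:
            state['median'] += step_size
        else:
            state['median'] -= step_size
        return state['median']

    return adjust
```

With a roughly stationary input the estimate walks toward the true median and then oscillates within one step of it, so step_size trades tracking speed against accuracy.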

    Let me know if you have any questions about it.

    --
    Daniel Stutzbach, Ph.D.
    President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>
  • Janto Dreijer at Oct 13, 2009 at 9:55 pm

    On Oct 13, 6:33 pm, Paul Rubin wrote:
    Janto Dreijer <jan... at gmail.com> writes:
    I'm looking for code that will calculate the running median of a
    sequence, efficiently. (I'm trying to subtract the running median from
    a signal to correct for gradual drift).
    Is that a known and valid technique of correcting for drift? Maybe
    what you really want is a high-pass filter, to remove very low
    frequency components (drift) from the signal. You can do that with
    standard digital filters. Using something depending on ordering
    within the samples (i.e. the median) seems weird to me, but I'm no
    expert.
    Well, I don't have a lot of theoretical reasoning behind wanting to
    use a median filter. I have some experience applying it to images with
    very good results, so that was the first thing I tried. It felt right
    at the time and the results looked good. Also, most of the elements in
    the sequence are supposed to be near zero, so the median is supposed
    to give the amount of drift.

    That said, you're probably right. I should have first tried to apply a
    HPF.
    The obvious way to compute a running median involves a tree structure
    so you can quickly insert and delete elements, and find the median.
    That would be asymptotically O(n log n) but messy to implement.
    Messy. Yes. I tried and failed :P

    Thanks
    Janto
  • Janto Dreijer at Oct 13, 2009 at 10:22 pm

    On Oct 14, 12:13 am, Paul Rubin wrote:
    Janto Dreijer <jan... at gmail.com> writes:
    Well, I don't have a lot of theoretical reasoning behind wanting to
    use a median filter. I have some experience applying it to images with
    very good results, so that was the first thing I tried. It felt right
    at the time and the results looked good.
    If this is image data, which typically would have fairly low
    resolution per pixel (say 8-12 bits), then maybe you could just use
    bins to count how many times each value occurs in a window. That
    would let you find the median of a window fairly quickly, and then
    update it with each new sample by remembering the number of samples
    above and below the current median, etc.
    Thanks, unfortunately it's not image data. It's electrocardiogram
    sequences. So it's sequences of floats.

    Janto
  • Janto Dreijer at Oct 13, 2009 at 10:04 pm

    On Oct 13, 7:37 pm, Ethan Furman wrote:
    Janto Dreijer wrote:
    I'm looking for code that will calculate the running median of a
    sequence, efficiently. (I'm trying to subtract the running median from
    a signal to correct for gradual drift).
    My naive attempt (taking the median of a sliding window) is
    unfortunately too slow as my sliding windows are quite large (~1k) and
    so are my sequences (~50k). On my PC it takes about 18 seconds per
    sequence. 17 of those seconds is spent in sorting the sliding windows.
    I've googled around and it looks like there are some recent journal
    articles on it, but no code. Any suggestions?
    Thanks
    Janto
    You might look at http://pypi.python.org/pypi/blist/0.9.4

    ~Ethan~
    Very nice! I assume you mean I can use it to quickly insert items into
    the sliding window?

    Thanks
    Janto
  • Ethan Furman at Oct 14, 2009 at 10:53 pm

    Janto Dreijer wrote:
    On Oct 13, 7:37 pm, Ethan Furman wrote:

    Janto Dreijer wrote:
    I'm looking for code that will calculate the running median of a
    sequence, efficiently. (I'm trying to subtract the running median from
    a signal to correct for gradual drift).
    My naive attempt (taking the median of a sliding window) is
    unfortunately too slow as my sliding windows are quite large (~1k) and
    so are my sequences (~50k). On my PC it takes about 18 seconds per
    sequence. 17 of those seconds is spent in sorting the sliding windows.
    I've googled around and it looks like there are some recent journal
    articles on it, but no code. Any suggestions?
    Thanks
    Janto
    You might look at http://pypi.python.org/pypi/blist/0.9.4

    ~Ethan~

    Very nice! I assume you mean I can use it to quickly insert items into
    the sliding window?

    Thanks
    Janto
    I'm afraid I can't help any further. Going from your post, I thought a
    quicker list implementation might be useful, but beyond that I have no
    knowledge to share.

    Who said ignorance is bliss? *hangs head*

    ~Ethan~
  • Hendrik van Rooyen at Oct 14, 2009 at 6:48 am

    On Tuesday, 13 October 2009 17:22:55 Janto Dreijer wrote:
    I'm looking for code that will calculate the running median of a
    sequence, efficiently. (I'm trying to subtract the running median from
    a signal to correct for gradual drift).

    My naive attempt (taking the median of a sliding window) is
    unfortunately too slow as my sliding windows are quite large (~1k) and
    so are my sequences (~50k). On my PC it takes about 18 seconds per
    sequence. 17 of those seconds is spent in sorting the sliding windows.

    I've googled around and it looks like there are some recent journal
    articles on it, but no code. Any suggestions?
    Can you not use the mean instead? It is easier to calculate, as you do not
    have to work through all the values.

    - Hendrik

Discussion Overview
group: python-list
categories: python
posted: Oct 13, '09 at 3:22p
active: Oct 26, '09 at 3:28p
posts: 25
users: 11
website: python.org
