I believe this was caused by the switch to using lock addl [esp], 0 instead of mfence for volatile membars, 6822204. My review request for that change noted that at the time I didn't measure any performance difference on Intel, http://cr.openjdk.java.net/~never/6822204. On your microbenchmark I can measure the difference, though, so I'm going to remeasure derby, which previously showed the big difference. We may want to make the lock addl be AMD-specific.
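
For anyone who wants to see which barrier their own VM emits, here is a minimal sketch (the class and method names are illustrative, not from the thread): a hot method containing a volatile store, inspectable with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly (which requires the hsdis disassembler plugin).

public class VolatileStore {
    static volatile int flag;

    public static void main(String[] args) {
        // Enough calls to pass the compile threshold so store() gets JIT-compiled.
        for (int i = 0; i < 20000; i++) {
            store(i);
        }
    }

    private static void store(int v) {
        // On x86 the JIT emits a StoreLoad barrier right after this store:
        // either "lock addl [esp], 0" or "mfence", depending on the VM version.
        flag = v;
    }
}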

tom
On Aug 11, 2011, at 11:05 AM, Clemens Eisserer wrote:

Hi Vitaly,

> I tried this bench on 6u23, and if I first run that code in a 10k-iteration
> loop and then time the 1M-iteration loop I get about 10 ms of speedup. The
> first loop would trigger JIT compilation (10k is the default threshold, I
> believe) and the second should then run without compilation interruptions.
>
> Can you try the same? Also, it might be interesting to time it under the
> interpreter (-Xint).
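
A minimal sketch of that warm-up-then-measure pattern (the class name and iteration counts are illustrative, not from the thread):

import java.util.concurrent.locks.ReentrantLock;

public class WarmLockPerf {
    static ReentrantLock lock = new ReentrantLock();

    public static void main(String[] args) {
        // Warm-up: exceed the compile threshold so the timed loop below
        // runs fully JIT-compiled instead of being compiled mid-measurement.
        for (int i = 0; i < 20000; i++) {
            lockPair();
        }
        long start = System.nanoTime();
        for (int i = 0; i < 1000000; i++) {
            lockPair();
        }
        System.out.println("Lock bench: " + (System.nanoTime() - start) / 1000000 + " ms");
    }

    private static void lockPair() {
        lock.lock();
        lock.unlock();
    }
}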

I changed the test case a bit so that it no longer relies on OSR, as lockBench() will hit the compilation threshold after a few runs anyway.

I get the following timings for 1M runs:

jdk7-server: 53ms
jdk7-client: 62ms
jdk7-xint:   955ms

jdk6-server: 52ms
jdk6-client: 68ms
jdk6-xint:   1000ms

jdk5-server: 40ms
jdk5-client: 61ms
jdk5-xint:   832ms

So JDK 7 is slower than JDK 5 in every case; the regression seems to have landed in JDK 6 (I was using OpenJDK 6).
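
(Back-of-the-envelope, as a sanity check on the scale: each printed interval covers 1,000 outer x 1,000 inner iterations, i.e. one million lock/unlock pairs, so the 53 ms - 40 ms = 13 ms gap between jdk7-server and jdk5-server works out to roughly 13 ns of extra cost per pair, consistent with one slightly costlier memory fence in each unlock.)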

Should I file a bug report about this behaviour?

Thanks, Clemens


import java.util.concurrent.locks.ReentrantLock;

public class LockPerf {
    static ReentrantLock lock = new ReentrantLock();

    public static void main(String[] args) {
        // Run with -server, -client or -Xint to reproduce the numbers above.
        while (true) {
            long start2 = System.nanoTime();
            for (int i = 0; i < 1000; i++) {
                lockBench();
            }
            // 1000 calls x 1000 pairs = 1M lock/unlock pairs per printed line.
            System.out.println("Lock bench: " + (System.nanoTime() - start2) / 1000000);
        }
    }

    // Separate method so it gets JIT-compiled as a whole after a few calls,
    // instead of relying on on-stack replacement (OSR).
    private static void lockBench() {
        for (int i = 0; i < 1000; i++) {
            lock.lock();
            lock.unlock();
        }
    }
}

On Aug 11, 2011 11:38 AM, "Clemens Eisserer" wrote:
Hi Vitaly,

> Which OS are you using?
Linux 3.0 (Fedora 15)

> Also, you should use System.nanoTime() for this type of timing as it gives
> you a more precise timer.
I tried, but the results remained the same: ~53 ms for JDK 6/7, ~41 ms for JDK 5.
I was using the server compiler both times.

Thanks, Clemens


  • Vitaly Davidovich at Aug 11, 2011 at 3:39 pm
    Hi Tom,

    Just curious - I recall reading on Dave Dice's blog that he found locked add
    to perform better than mfence. Granted, he tested on a Nehalem box - do you
    think the JIT may need more granular decision making than just AMD vs.
    Intel, i.e. checking the Intel generation as well?
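
    A hypothetical sketch of that finer-grained choice (in HotSpot the real
    decision would live in the C++ VM code; the vendor strings and the model
    cutoff below are illustrative guesses, not tested values):

    static boolean useLockAddlForMembar(String cpuVendor, int family, int model) {
        // lock addl [esp], 0 was the win that motivated 6822204 on AMD.
        if (cpuVendor.equals("AuthenticAMD")) {
            return true;
        }
        // Hypothetical cutoff: Nehalem-class Intel cores (family 6, model 0x1A
        // and later) reportedly handle the locked add as cheaply as mfence;
        // earlier Intel cores paid a heavier cost, so keep mfence for them.
        return cpuVendor.equals("GenuineIntel") && family == 6 && model >= 0x1A;
    }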

    Thanks
    On Aug 11, 2011 6:03 PM, "Tom Rodriguez" wrote:

    > I believe this was caused by the switch to using lock addl [esp], 0
    > instead of mfence for volatile membars, 6822204. [...] We may want to
    > make the lock addl be AMD specific.
  • Florian Weimer at Aug 12, 2011 at 12:57 am

    * Tom Rodriguez:

    > I believe this was caused by the switch to using lock addl [esp], 0
    > instead of mfence for volatile membars, 6822204. [...] We may want
    > to make the lock addl be AMD specific.

    Couldn't the relative speed of the two instructions also depend on the
    type of benchmark?

    --
    Florian Weimer <fweimer at bfk.de>
    BFK edv-consulting GmbH http://www.bfk.de/
    Kriegsstraße 100 tel: +49-721-96201-1
    D-76133 Karlsruhe fax: +49-721-96201-99
  • Tom Rodriguez at Aug 12, 2011 at 11:22 am

    On Aug 12, 2011, at 12:57 AM, Florian Weimer wrote:

    > Couldn't the relative speed of the two instructions also depend on the
    > type of benchmark?

    These are primarily being emitted for volatile fences, so many programs won't care about their speed at all. If you look at my other email, it suggests that the difference is that Intel chips prior to Nehalem had a heavier-weight implementation of lock addl than was required. mfence stayed approximately the same between processor versions, with its speed pretty much tracking the relative clock speeds: 2.4 GHz for the Tigerton and 2.8 GHz for the Nehalem. The original data suggested no performance change on Nehalem when switching instructions, so it probably doesn't care either way.
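
    For comparing the raw barrier cost across machines, a minimal sketch (not
    from the thread): each volatile store below is followed by the StoreLoad
    barrier, so the per-store time roughly tracks the fence cost.

    public class FenceBench {
        static volatile long sink;

        public static void main(String[] args) {
            for (int r = 0; r < 10; r++) {
                long start = System.nanoTime();
                for (int i = 0; i < 10000000; i++) {
                    sink = i; // barrier emitted after each volatile store
                }
                // Average cost per volatile store, in nanoseconds.
                System.out.println((System.nanoTime() - start) / 10000000.0 + " ns/store");
            }
        }
    }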

    tom

Discussion Overview
group: hotspot-compiler-dev
categories: openjdk
posted: Aug 11, '11 at 3:02p
active: Aug 12, '11 at 11:22a
posts: 4
users: 3
website: openjdk.java.net
