FAQ
In do_matrix in src/array.c there is a type switch containing :

case LGLSXP :
for (i = 0; i < nr; i++)
for (j = 0; j < nc; j++)
LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

That seems page inefficient, iiuc. Think it should be :

case LGLSXP :
for (j = 0; j < nc; j++)
for (i = 0; i < nr; i++)
LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

or more simply :

case LGLSXP :
for (i = 0; i < nc*nr; i++)
LOGICAL(ans)[i] = NA_LOGICAL;

( with some fine tuning required since NR is type R_xlen_t whilst i, nc
and nr are type int ).

Same goes for all the other types in that switch.

This came up on Stack Overflow here :
http://stackoverflow.com/questions/12220128/reason-for-faster-matrix-allocation-in-r

Matthew

Search Discussions

  • Simon Urbanek at Sep 3, 2012 at 2:32 am

    On Sep 2, 2012, at 10:04 PM, Matthew Dowle wrote:
    In do_matrix in src/array.c there is a type switch containing :

    case LGLSXP :
    for (i = 0; i < nr; i++)
    for (j = 0; j < nc; j++)
    LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

    That seems page inefficient, iiuc. Think it should be :

    case LGLSXP :
    for (j = 0; j < nc; j++)
    for (i = 0; i < nr; i++)
    LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

    or more simply :

    case LGLSXP :
    for (i = 0; i < nc*nr; i++)
    LOGICAL(ans)[i] = NA_LOGICAL;

    ( with some fine tuning required since NR is type R_xlen_t whilst i, nc
    and nr are type int ).

    Same goes for all the other types in that switch.

    This came up on Stack Overflow here :
    http://stackoverflow.com/questions/12220128/reason-for-faster-matrix-allocation-in-r
    That is completely irrelevant - modern compilers will optimize the loops accordingly and there is no difference in speed. If you don't believe it, run benchmarks ;)

    original
    microbenchmark(matrix(nrow000, ncol�99), times)
    Unit: milliseconds
    expr min lq median uq max
    1 matrix(nrow = 10000, ncol = 9999) 940.5519 940.6644 941.136 954.7196 1409.901


    swapped
    microbenchmark(matrix(nrow000, ncol�99), times)
    Unit: milliseconds
    expr min lq median uq max
    1 matrix(nrow = 10000, ncol = 9999) 949.9638 950.6642 952.7497 961.001 1246.573

    Cheers,
    Simon

    Matthew

    ______________________________________________
    R-devel at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/r-devel
  • Simon Urbanek at Sep 3, 2012 at 3:47 am
    Actually, my apologies, I was assuming that your example was based on the SO question while it is not at all (the code is not involved in that test case). Reversing the order does indeed cause a delay. Switching to a single index doesn't seem to have any impact. R-devel has the faster version now (which now also works with large vectors).

    Cheers,
    Simon
    On Sep 2, 2012, at 10:32 PM, Simon Urbanek wrote:
    On Sep 2, 2012, at 10:04 PM, Matthew Dowle wrote:


    In do_matrix in src/array.c there is a type switch containing :

    case LGLSXP :
    for (i = 0; i < nr; i++)
    for (j = 0; j < nc; j++)
    LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

    That seems page inefficient, iiuc. Think it should be :

    case LGLSXP :
    for (j = 0; j < nc; j++)
    for (i = 0; i < nr; i++)
    LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

    or more simply :

    case LGLSXP :
    for (i = 0; i < nc*nr; i++)
    LOGICAL(ans)[i] = NA_LOGICAL;

    ( with some fine tuning required since NR is type R_xlen_t whilst i, nc
    and nr are type int ).

    Same goes for all the other types in that switch.

    This came up on Stack Overflow here :
    http://stackoverflow.com/questions/12220128/reason-for-faster-matrix-allocation-in-r
    That is completely irrelevant - modern compilers will optimize the loops accordingly and there is no difference in speed. If you don't believe it, run benchmarks ;)

    original
    microbenchmark(matrix(nrow000, ncol�99), times)
    Unit: milliseconds
    expr min lq median uq max
    1 matrix(nrow = 10000, ncol = 9999) 940.5519 940.6644 941.136 954.7196 1409.901


    swapped
    microbenchmark(matrix(nrow000, ncol�99), times)
    Unit: milliseconds
    expr min lq median uq max
    1 matrix(nrow = 10000, ncol = 9999) 949.9638 950.6642 952.7497 961.001 1246.573

    Cheers,
    Simon

    Matthew

    ______________________________________________
    R-devel at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/r-devel
    ______________________________________________
    R-devel at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/r-devel
  • Matthew Dowle at Sep 4, 2012 at 12:07 pm

    Actually, my apologies, I was assuming that your example was based on the
    SO question while it is not at all (the code is not involved in that test
    case). Reversing the order does indeed cause a delay. Switching to a
    single index doesn't seem to have any impact. R-devel has the faster
    version now (which now also works with large vectors).

    Cheers,
    Simon
    I was intrigued why the compiler doesn't swap the loops when you thought
    it should, though. You're not usually wrong! From GCC's documentation (end
    of last paragraph is the most significant) :

    ===
    -floop-interchange
    Perform loop interchange transformations on loops. Interchanging two
    nested loops switches the inner and outer loops. For example, given a loop
    like:
    DO J = 1, M
    DO I = 1, N
    A(J, I) = A(J, I) * C
    ENDDO
    ENDDO

    loop interchange transforms the loop as if it were written:

    DO I = 1, N
    DO J = 1, M
    A(J, I) = A(J, I) * C
    ENDDO
    ENDDO

    which can be beneficial when N is larger than the caches, because in
    Fortran, the elements of an array are stored in memory contiguously by
    column, and the original loop iterates over rows, potentially creating at
    each access a cache miss. This optimization applies to all the languages
    supported by GCC and is not limited to Fortran. To use this code
    transformation, GCC has to be configured with --with-ppl and --with-cloog
    to enable the Graphite loop transformation infrastructure.

    ===
    Could R build scripts be configured to set these gcc flags to turn on
    "Graphite", then? I guess one downside could be the time to compile.

    Matthew

    On Sep 2, 2012, at 10:32 PM, Simon Urbanek wrote:
    On Sep 2, 2012, at 10:04 PM, Matthew Dowle wrote:


    In do_matrix in src/array.c there is a type switch containing :

    case LGLSXP :
    for (i = 0; i < nr; i++)
    for (j = 0; j < nc; j++)
    LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

    That seems page inefficient, iiuc. Think it should be :

    case LGLSXP :
    for (j = 0; j < nc; j++)
    for (i = 0; i < nr; i++)
    LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

    or more simply :

    case LGLSXP :
    for (i = 0; i < nc*nr; i++)
    LOGICAL(ans)[i] = NA_LOGICAL;

    ( with some fine tuning required since NR is type R_xlen_t whilst i, nc
    and nr are type int ).

    Same goes for all the other types in that switch.

    This came up on Stack Overflow here :
    http://stackoverflow.com/questions/12220128/reason-for-faster-matrix-allocation-in-r
    That is completely irrelevant - modern compilers will optimize the loops
    accordingly and there is no difference in speed. If you don't believe
    it, run benchmarks ;)

    original
    microbenchmark(matrix(nrow000, ncol�99), times)
    Unit: milliseconds
    expr min lq median uq
    max
    1 matrix(nrow = 10000, ncol = 9999) 940.5519 940.6644 941.136 954.7196
    1409.901


    swapped
    microbenchmark(matrix(nrow000, ncol�99), times)
    Unit: milliseconds
    expr min lq median uq
    max
    1 matrix(nrow = 10000, ncol = 9999) 949.9638 950.6642 952.7497 961.001
    1246.573

    Cheers,
    Simon

    Matthew

    ______________________________________________
    R-devel at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/r-devel
    ______________________________________________
    R-devel at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/r-devel
  • Simon Urbanek at Sep 4, 2012 at 9:05 pm
    On Sep 4, 2012, at 8:07 AM, "Matthew Dowle" wrote:
    Actually, my apologies, I was assuming that your example was based on the
    SO question while it is not at all (the code is not involved in that test
    case). Reversing the order does indeed cause a delay. Switching to a
    single index doesn't seem to have any impact. R-devel has the faster
    version now (which now also works with large vectors).

    Cheers,
    Simon
    I was intrigued why the compiler doesn't swap the loops when you thought
    it should, though. You're not usually wrong! From GCC's documentation (end
    of last paragraph is the most significant) :

    ===>
    -floop-interchange
    Perform loop interchange transformations on loops. Interchanging two
    nested loops switches the inner and outer loops. For example, given a loop
    like:
    DO J = 1, M
    DO I = 1, N
    A(J, I) = A(J, I) * C
    ENDDO
    ENDDO

    loop interchange transforms the loop as if it were written:

    DO I = 1, N
    DO J = 1, M
    A(J, I) = A(J, I) * C
    ENDDO
    ENDDO

    which can be beneficial when N is larger than the caches, because in
    Fortran, the elements of an array are stored in memory contiguously by
    column, and the original loop iterates over rows, potentially creating at
    each access a cache miss. This optimization applies to all the languages
    supported by GCC and is not limited to Fortran. To use this code
    transformation, GCC has to be configured with --with-ppl and --with-cloog
    to enable the Graphite loop transformation infrastructure.

    ===>
    Could R build scripts be configured to set these gcc flags to turn on
    "Graphite", then? I guess one downside could be the time to compile.
    The is something odd happening - when I use stand-alone code it works:

    $ gcc -o t2 -O3 t2.c

    $ time ./t2
    0x7fbae3b4f010
    real 0m1.045s
    user 0m0.784s
    sys 0m0.260s

    $ gcc -o t2 -floop-interchange -O3 t2.c

    $ time ./t2
    0x7f4e516f2010
    real 0m0.418s
    user 0m0.044s
    sys 0m0.372s

    However, when I split off the loop into a parametrized function it doesn't:

    $ gcc -floop-interchange -O3 t.c tt.c -o t && ./t
    0x7fdd37cca010
    loop time = 1772.085ms
    $ gcc -O3 t.c tt.c -o t && ./t
    0x7f3aa8777010
    loop time = 1763.888ms

    For comparison , manually swapping i and j:
    $ gcc -floop-interchange -O3 t.c tt.c -o t && ./t
    0x7feecd4c9010
    loop time = 451.744ms

    For the same reason, it doesn't work for the old R code. I wonder what's happening there - I guess the optimizer is not smart enough to realize the coverage is the entire m*n span despite the fact that m and n are parameters ... But it's certainly something it should optimize as it did when used directly in a function... Odd ...

    Cheers,
    Simon

    --

    PS: Note this has nothing to do with "R builds scripts" - it is user's responsibility to add optimization flags since they are very much compiler and architecture-dependent. Also optimizations are occasionally known to backfire, so the default is always more conservative.

    FWIW this is my latest incarnation of configuring optimized R on E5 machines (minus BLAS and prefix flags which will vary by system):

    '--enable-lto' '--enable-R-shlib' 'CFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'CXXFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'FFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'FCFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native'


    -- test code (test compiled with gcc (Ubuntu/Linaro 4.7.1-7ubuntu1) 4.7.1 20120814 (prerelease))

    t.c:

    void foo(int *x, const unsigned int m, const unsigned int n) {
    int i, j;
    for (i = 0; i < m; i++)
    for(j = 0; j < n; j++)
    x[i + j * m] = 1;
    }

    tt.c:

    #include <sys/time.h>
    #include <stdlib.h>
    #include <stdio.h>

    void foo(int *, int, int);

    static double ts() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return (double)tv.tv_sec + ((double) tv.tv_usec) / 1000000.0;
    }

    int main() {
    int m = 10000, n = 10000;
    int *x = (int*) malloc(m * n * sizeof(int));
    double a = ts(), b;
    foo(x, m , n);
    b = ts();
    printf("%p\nloop time = %.3fms\n", x, (b - a) * 1000);
    return 0;
    }

    t2.c:

    #include <stdio.h>
    #include <stdlib.h>

    int *foo() {
    int m = 10000, n = 10000;
    int *x = (int*) malloc(m * n * sizeof(int));
    int i, j;
    for (i = 0; i < m; i++)
    for(j = 0; j < n; j++)
    x[i + j * m] = 1;
    return x;
    }

    int main() {
    printf("%p", foo());
    return 0;
    }

    Matthew

    On Sep 2, 2012, at 10:32 PM, Simon Urbanek wrote:
    On Sep 2, 2012, at 10:04 PM, Matthew Dowle wrote:


    In do_matrix in src/array.c there is a type switch containing :

    case LGLSXP :
    for (i = 0; i < nr; i++)
    for (j = 0; j < nc; j++)
    LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

    That seems page inefficient, iiuc. Think it should be :

    case LGLSXP :
    for (j = 0; j < nc; j++)
    for (i = 0; i < nr; i++)
    LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

    or more simply :

    case LGLSXP :
    for (i = 0; i < nc*nr; i++)
    LOGICAL(ans)[i] = NA_LOGICAL;

    ( with some fine tuning required since NR is type R_xlen_t whilst i, nc
    and nr are type int ).

    Same goes for all the other types in that switch.

    This came up on Stack Overflow here :
    http://stackoverflow.com/questions/12220128/reason-for-faster-matrix-allocation-in-r
    That is completely irrelevant - modern compilers will optimize the loops
    accordingly and there is no difference in speed. If you don't believe
    it, run benchmarks ;)

    original
    microbenchmark(matrix(nrow000, ncol�99), times)
    Unit: milliseconds
    expr min lq median uq
    max
    1 matrix(nrow = 10000, ncol = 9999) 940.5519 940.6644 941.136 954.7196
    1409.901


    swapped
    microbenchmark(matrix(nrow000, ncol�99), times)
    Unit: milliseconds
    expr min lq median uq
    max
    1 matrix(nrow = 10000, ncol = 9999) 949.9638 950.6642 952.7497 961.001
    1246.573

    Cheers,
    Simon

    Matthew

    ______________________________________________
    R-devel at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/r-devel
    ______________________________________________
    R-devel at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/r-devel

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupr-devel @
categoriesr
postedSep 3, '12 at 2:04a
activeSep 4, '12 at 9:05p
posts5
users2
websiter-project.org
irc#r

2 users in discussion

Simon Urbanek: 3 posts Matthew Dowle: 2 posts

People

Translate

site design / logo © 2022 Grokbase