http://cr.openjdk.java.net/~kvn/7119644/webrev

7119644: Increase superword's vector size up to 256 bits

Increase superword's vector size up to 256 bits for the YMM AVX registers on x86.
Added generation of different vector sizes for different types of arrays in the
same loop. Allow generation of small (4-byte) vectors for loops which were
unrolled only a small number of iterations.
Added new C2 types for vectors and reworked the VectorNode implementation. Used
MachTypeNode as the base node for vector mach nodes to keep the vector type.
Moved the XMM register definitions and vector instructions into one file, x86.ad
(this required renaming eRegI to rRegI in x86_32.ad).
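
For context, a minimal Java loop of the kind this change targets is sketched below (illustrative only, not part of the webrev): two arrays with different element sizes are written in the same loop body, so the int stores can use a wider vector than the byte stores.

```java
// Illustrative sketch only: a loop shape this change can vectorize.
// The int stores can use a 256-bit (32-byte) YMM vector while the
// byte stores in the same loop use a smaller vector size.
class SuperwordExample {
    static void fill(int[] a, byte[] b) {
        for (int i = 0; i < a.length && i < b.length; i++) {
            a[i] = 1;        // 4-byte elements
            b[i] = 2;        // 1-byte elements
        }
    }

    public static void main(String[] args) {
        int[] a = new int[64];
        byte[] b = new byte[64];
        fill(a, b);
        System.out.println(a[0] + " " + b[0]);
    }
}
```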

Tested with full CTW, NSK, the C2 regression tests, and JPRT, and added a new test.

Thanks,
Vladimir


  • Vladimir Kozlov at Apr 3, 2012 at 4:33 pm
    I found a few issues in the superword code during testing on SB; the rest of
    the changes seem fine. I will send an updated webrev after I resolve the issues.

    Vladimir

  • Vladimir Kozlov at Apr 4, 2012 at 3:52 pm
    I updated the webrev.

    Thanks,
    Vladimir

  • Tom Rodriguez at Apr 9, 2012 at 2:56 pm
    VecX and VecY aren't great names. Why not use Op_Vec{4,8,16,32} for these instead? Maybe it needs a trailing identifier like B to indicate the size?

    sparc.ad:

    apart from moving the code around and the new predicates, did the instruction definitions change in any other way?

    x86.ad:

    Please make sure that all the %{ are on the preceding lines instead of on lines by themselves.

    Is it possible to share the new spill copy code here instead of duplicating it?

    chaitin.cpp:

    Why do we need the machine dependent ShouldNotReachHeres? Aren't the asserts enough?

    Can't this:

    if( lrg.num_regs() == 1 ) {
      reg = tempmask.find_first_elem();
    + } else if (lrg._is_vector) {
    +   tempmask.clear_to_sets(lrg.num_regs());
    +   reg = tempmask.find_first_set(lrg.num_regs());
    } else {
    !   tempmask.ClearToPairs();   // old line
    !   tempmask.clear_to_pairs(); // new line
      reg = tempmask.find_first_pair();
    }

    be simplified to this:

    tempmask.clear_to_sets(lrg.num_regs());
    reg = tempmask.find_first_set(lrg.num_regs());

    same with the other uses and the Clear/Insert/set_mask_size sequence.
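
    (For intuition, a rough sketch of the assumed set semantics, not HotSpot's
    actual RegMask code: clear_to_sets(n) keeps only registers that start an
    aligned run of n free registers, and find_first_set(n) picks the first such
    run, so the pair case is just n == 2.)

```java
// Hypothetical model of the set-based mask operations, using a 64-bit
// mask of free registers; n is assumed to be a power of two.
class RegMaskSketch {
    // Keep only the bits that start an aligned, fully free run of n registers.
    static long clearToSets(long mask, int n) {
        long result = 0L;
        for (int base = 0; base + n <= 64; base += n) {
            long run = (n == 64 ? -1L : (1L << n) - 1) << base;
            if ((mask & run) == run) {
                result |= 1L << base;  // mark the first register of the run
            }
        }
        return result;
    }

    // Index of the first aligned run of n free registers, or -1 if none.
    static int findFirstSet(long mask, int n) {
        long sets = clearToSets(mask, n);
        return sets == 0 ? -1 : Long.numberOfTrailingZeros(sets);
    }
}
```

    With n == 1 this degenerates to the single-element case and with n == 2 to
    the pair case, which is why one set-based path could replace the three-way
    branch above.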

    matcher.cpp:

    this is kind of gross.

    + #if defined(IA32) || defined(AMD64)
    + Op_VecS, Op_VecD, Op_VecX, Op_VecY, /* Vectors */
    + #else
    + 0, Op_RegD, 0, 0, /* Vectors */
    + #endif

    can't the other platforms be required to use Op_VecS instead?

    postaloc.cpp:

    Don't you need to generalize this for n_reg > 2:


    // See if it happens to already be in the correct register!
    // (either Phi's direct register, or the common case of the name
    // never-clobbered original-def register)
    ! if( value[val_reg] == val &&                      // old line
    ! if (value[val_reg] == val &&                      // new line
        // Doubles check both halves
    !     ( single || value[val_reg-1] == val ) ) {     // old line
    !     ((n_regs == 1) || value[val_reg-1] == val)) { // new line
      blk_adjust += use_prior_register(n,k,regnd[val_reg],current_block,value,regnd);
      if( n->in(k) == regnd[val_reg] ) // Success! Quit trying
        return blk_adjust;
    }



    Why aren't you doing copy elimination on vectors?

    regmask.cpp:

    the logic in num_registers is too clever. Please make it return the appropriate values directly.

    superword.cpp:

    You have an extra copy of the s2 line.

    set_alignment(s2, align + data_size(s1));
    + if (align == top_align || align == bottom_align)
    +   set_alignment(s2, align);
    + else {
    +   set_alignment(s2, align + data_size(s1));
    + }

    I'm not sure I understand/trust the new logic in find_adjacent_refs. It's iterating over all the candidate memory ops, collecting sets at potentially different alignments and then setting the overall alignment? Somehow I don't think this logic will work in the general case or at least it has some hidden limitations. Can you explain this part more?

    find_align_to_ref shouldn't be setting _align_to_ref directly. It should be returning the values it finds and the caller should be responsible for saving off the proper value.

    I like the simplification of the loads and stores but I end up wondering why the rest of the operations aren't similarly simplified.

    Overall the code looks good.

    tom

    On Apr 3, 2012, at 10:03 AM, Vladimir Kozlov wrote:

  • Vladimir Kozlov at Apr 9, 2012 at 5:54 pm

    Tom Rodriguez wrote:
    VecX and VecY aren't great names. Why not use Op_Vec{4,8,16,32} for these instead? Maybe it needs a trailing identifier like B to indicate the size?
    I know, but the code in adlc is a lot simpler this way. For example, in archDesc.cpp:

    // Match Vector types.
    if (strncmp(idealOp, "Vec", 3) == 0) {
      switch (last_char) {
      case 'S': return "TypeVect::VECTS";
      case 'D': return "TypeVect::VECTD";
      case 'X': return "TypeVect::VECTX";
      case 'Y': return "TypeVect::VECTY";
      default:
        internal_err("Vector type %s with unrecognized type\n", idealOp);
      }
    }

    I don't think Vec32B is better than VecY. And I don't want VecQ and VecO,
    which are difficult to distinguish.
    sparc.ad:

    apart from moving the code around and the new predicates, did the instruction definitions change in any other way?
    I did not change the existing Replicate instructions. I changed replicate_immI()
    a little to simplify the code which constructs an initialization vector placed
    into the constant table. I also added replication for Float values and replaced
    the type-specific Load instructions with one loadV8 instruction.
    x86.ad:

    Please make sure that all the %{ are on the preceding lines instead of on lines by themselves.
    Fixed (Vec* operands had it).
    Is it possible to share the new spill copy code here instead of duplicating it?
    Yes, it looks possible (I just tried). The stack-stack spills are different,
    but the xmm-xmm and xmm-stack cases could be shared.
    chaitin.cpp:

    Why do we need the machine dependent ShouldNotReachHeres? Aren't the asserts enough?
    I removed them.
    Can't this:

    if( lrg.num_regs() == 1 ) {
      reg = tempmask.find_first_elem();
    + } else if (lrg._is_vector) {
    +   tempmask.clear_to_sets(lrg.num_regs());
    +   reg = tempmask.find_first_set(lrg.num_regs());
    } else {
    !   tempmask.ClearToPairs();   // old line
    !   tempmask.clear_to_pairs(); // new line
      reg = tempmask.find_first_pair();
    }

    be simplified to this:

    tempmask.clear_to_sets(lrg.num_regs());
    reg = tempmask.find_first_set(lrg.num_regs());

    same with the other uses and the Clear/Insert/set_mask_size sequence.
    Yes, it could be done. I tested it by calling the *_sets() methods from the
    *_pairs() methods. But I thought it was too aggressive a code change :)
    If you are OK with such a change, I will do it.
    matcher.cpp:

    this is kind of gross.

    + #if defined(IA32) || defined(AMD64)
    + Op_VecS, Op_VecD, Op_VecX, Op_VecY, /* Vectors */
    + #else
    + 0, Op_RegD, 0, 0, /* Vectors */
    + #endif

    can't the other platforms be required to use Op_VecS instead?
    You mean Op_VecD? The reason I did it this way is to reduce code changes in the
    .ad files on other platforms, which would not gain any new functionality. There
    is also code in the RA for pair misalignment on SPARC which I don't want to
    duplicate for vectors.
    postaloc.cpp:

    Don't you need to generalize this for n_reg > 2:

    // See if it happens to already be in the correct register!
    // (either Phi's direct register, or the common case of the name
    // never-clobbered original-def register)
    ! if( value[val_reg] == val &&                      // old line
    ! if (value[val_reg] == val &&                      // new line
        // Doubles check both halves
    !     ( single || value[val_reg-1] == val ) ) {     // old line
    !     ((n_regs == 1) || value[val_reg-1] == val)) { // new line
      blk_adjust += use_prior_register(n,k,regnd[val_reg],current_block,value,regnd);
      if( n->in(k) == regnd[val_reg] ) // Success! Quit trying
        return blk_adjust;
    }
    Yes, I cut a corner here. I will add a correct check that all of the vector's
    value[] entries == val.
    Why aren't you doing copy elimination on vectors?
    What do you mean? All the changes in postaloc.cpp were done to make that work
    for vectors. I verified that it eliminates MSC where possible. Do you think I
    missed something?
    regmask.cpp:

    the logic in num_registers is too clever. Please make it return the appropriate values directly.
    Done.
    superword.cpp:

    You have an extra copy of the s2 line.

    set_alignment(s2, align + data_size(s1));
    + if (align == top_align || align == bottom_align)
    +   set_alignment(s2, align);
    + else {
    +   set_alignment(s2, align + data_size(s1));
    + }
    Fixed.
    I'm not sure I understand/trust the new logic in find_adjacent_refs. It's iterating over all the candidate memory ops, collecting sets at potentially different alignments and then setting the overall alignment? Somehow I don't think this logic will work in the general case or at least it has some hidden limitations. Can you explain this part more?
    The new code was added to vectorize operations which have different basic types:

    static void test_IBci(int[] a, byte[] b) {
      for (int i = 0; i < a.length; i += 1) {
        a[i] = -456;
        b[i] = -(byte)123;
      }
    }

    0a0 B12: # B12 B13 <- B11 B12 Loop: B12-B12 inner main of N90 Freq: 9340.37
    0a0 movslq R10, R11 # i2l
    0a3 movdqu [RSI + #16 + R10 << #2],XMM0 ! store vector (16 bytes)
    0aa movq [RDX + #16 + R10],XMM1 ! store vector (8 bytes)
    0b1 movslq R10, R11 # i2l
    0b4 movdqu [RSI + #32 + R10 << #2],XMM0 ! store vector (16 bytes)
    0bb addl R11, #8 # int
    0bf cmpl R11, R8
    0c2 jl,s B12 # loop end P=0.999893 CF701.000000

    The previous code vectorized only one type in such a case. The new code collects
    only related memory operations during one iteration (as before these changes).
    Then it removes those operations from the memops list and tries to collect the
    other related memory ops. Such vectors need different loop index alignments
    since the vector sizes are different. The new code sets the maximum loop index
    alignment. The maximum alignment also works for the smaller sizes since the
    sizes are powers of 2.
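
    The power-of-2 argument can be checked directly; a small sketch (illustrative
    only, not compiler code):

```java
// If an offset is a multiple of the largest vector size (a power of 2),
// it is also a multiple of every smaller power-of-2 vector size.
class AlignmentDemo {
    static boolean aligned(int offset, int vectorBytes) {
        return offset % vectorBytes == 0;
    }

    public static void main(String[] args) {
        int maxVector = 32;  // 256-bit YMM vector
        for (int offset = 0; offset < 1024; offset += maxVector) {
            // Alignment for 32 bytes implies alignment for 16, 8, and 4.
            if (!aligned(offset, 16) || !aligned(offset, 8) || !aligned(offset, 4)) {
                throw new AssertionError("max alignment did not cover " + offset);
            }
        }
        System.out.println("max alignment covers all smaller power-of-2 sizes");
    }
}
```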
    find_align_to_ref shouldn't be setting _align_to_ref directly. It should be returning the values it finds and the caller should be responsible for saving off the proper value.
    Done.
    I like the simplification of the loads and stores but I end up wondering why the rest of the operations aren't similarly simplified.
    They could be, but they would need different formats since they use different
    asm instructions for different basic type sizes. But I can definitely merge
    ReplicateS and ReplicateC, which have the same size.
    Overall the code looks good.

    tom
    Thanks,
    Vladimir
  • Tom Rodriguez at Apr 9, 2012 at 6:15 pm

    On Apr 9, 2012, at 5:54 PM, Vladimir Kozlov wrote:

    Tom Rodriguez wrote:
    VecX and VecY aren't great names. Why not use Op_Vec{4,8,16,32} for these instead? Maybe it needs a trailing identifier like B to indicate the size?
    I know, but the code in adlc is a lot simpler this way. For example, in archDesc.cpp:

    // Match Vector types.
    if (strncmp(idealOp, "Vec", 3) == 0) {
      switch (last_char) {
      case 'S': return "TypeVect::VECTS";
      case 'D': return "TypeVect::VECTD";
      case 'X': return "TypeVect::VECTX";
      case 'Y': return "TypeVect::VECTY";
      default:
        internal_err("Vector type %s with unrecognized type\n", idealOp);
      }
    }

    I don't think Vec32B is better than VecY. And I don't want VecQ and VecO, which are difficult to distinguish.
    Except that VecY and VecX are fairly meaningless names.
    sparc.ad:
    apart from moving the code around and the new predicates, did the instruction definitions change in any other way?
    I did not change the existing Replicate instructions. I changed replicate_immI() a little to simplify the code which constructs an initialization vector placed into the constant table. I also added replication for Float values and replaced the type-specific Load instructions with one loadV8 instruction.
    OK.

    x86.ad:
    Please make sure that all the %{ are on the preceding lines instead of on lines by themselves.
    Fixed (Vec* operands had it).
    OK.
    Is it possible to share the new spill copy code here instead of duplicating it?
    Yes, it looks possible (I just tried). The stack-stack spills are different, but the xmm-xmm and xmm-stack cases could be shared.
    chaitin.cpp:
    Why do we need the machine dependent ShouldNotReachHeres? Aren't the asserts enough?
    I removed them.
    Can't this:
    if( lrg.num_regs() == 1 ) {
      reg = tempmask.find_first_elem();
    + } else if (lrg._is_vector) {
    +   tempmask.clear_to_sets(lrg.num_regs());
    +   reg = tempmask.find_first_set(lrg.num_regs());
    } else {
    !   tempmask.ClearToPairs();   // old line
    !   tempmask.clear_to_pairs(); // new line
      reg = tempmask.find_first_pair();
    }
    be simplified to this:
    tempmask.clear_to_sets(lrg.num_regs());
    reg = tempmask.find_first_set(lrg.num_regs());
    same with the other uses and the Clear/Insert/set_mask_size sequence.
    Yes, it could be done. I tested it by calling the *_sets() methods from the *_pairs() methods. But I thought it was too aggressive a code change :)
    If you are OK with such a change, I will do it.
    It's aggressive but it should work I think. I could go either way.
    matcher.cpp:
    this is kind of gross.
    + #if defined(IA32) || defined(AMD64)
    + Op_VecS, Op_VecD, Op_VecX, Op_VecY, /* Vectors */
    + #else
    + 0, Op_RegD, 0, 0, /* Vectors */
    + #endif
    can't the other platforms be required to use Op_VecS instead?
    You mean Op_VecD? The reason I did it this way is to reduce code changes in the .ad files on other platforms, which would not gain any new functionality. There is also code in the RA for pair misalignment on SPARC which I don't want to duplicate for vectors.
    Could those be filled in by asking the ad file instead?
    postaloc.cpp:
    Don't you need to generalize this for n_reg > 2:
    // See if it happens to already be in the correct register!
    // (either Phi's direct register, or the common case of the name
    // never-clobbered original-def register)
    ! if( value[val_reg] == val &&                      // old line
    ! if (value[val_reg] == val &&                      // new line
        // Doubles check both halves
    !     ( single || value[val_reg-1] == val ) ) {     // old line
    !     ((n_regs == 1) || value[val_reg-1] == val)) { // new line
      blk_adjust += use_prior_register(n,k,regnd[val_reg],current_block,value,regnd);
      if( n->in(k) == regnd[val_reg] ) // Success! Quit trying
        return blk_adjust;
    }
    Yes, I cut a corner here. I will add a correct check that all of the vector's value[] entries == val.
    Why aren't you doing copy elimination on vectors?
    What do you mean? All the changes in postaloc.cpp were done to make that work for vectors. I verified that it eliminates MSC where possible. Do you think I missed something?
    The block to update the mapping doesn't have the eliminate_copy_of_constant logic or the replace_and_yank_if_dead logic.
    regmask.cpp:
    the logic in num_registers is too clever. Please make it return the appropriate values directly.
    Done.
    superword.cpp:
    You have an extra copy of the s2 line.
    set_alignment(s2, align + data_size(s1));
    + if (align == top_align || align == bottom_align)
    +   set_alignment(s2, align);
    + else {
    +   set_alignment(s2, align + data_size(s1));
    + }
    Fixed.
    I'm not sure I understand/trust the new logic in find_adjacent_refs. It's iterating over all the candidate memory ops, collecting sets at potentially different alignments and then setting the overall alignment? Somehow I don't think this logic will work in the general case or at least it has some hidden limitations. Can you explain this part more?
    The new code was added to vectorize operations which have different basic types:

    static void test_IBci(int[] a, byte[] b) {
      for (int i = 0; i < a.length; i += 1) {
        a[i] = -456;
        b[i] = -(byte)123;
      }
    }

    0a0 B12: # B12 B13 <- B11 B12 Loop: B12-B12 inner main of N90 Freq: 9340.37
    0a0 movslq R10, R11 # i2l
    0a3 movdqu [RSI + #16 + R10 << #2],XMM0 ! store vector (16 bytes)
    0aa movq [RDX + #16 + R10],XMM1 ! store vector (8 bytes)
    0b1 movslq R10, R11 # i2l
    0b4 movdqu [RSI + #32 + R10 << #2],XMM0 ! store vector (16 bytes)
    0bb addl R11, #8 # int
    0bf cmpl R11, R8
    0c2 jl,s B12 # loop end P=0.999893 CF701.000000

    The previous code vectorized only one type in such a case. The new code collects only related memory operations during one iteration (as before these changes). Then it removes those operations from the memops list and tries to collect the other related memory ops. Such vectors need different loop index alignments since the vector sizes are different. The new code sets the maximum loop index alignment. The maximum alignment also works for the smaller sizes since the sizes are powers of 2.
    How does it deal with index variables that might be offset? Something like this:
    static void test_IBci(int[] a, byte[] b) {
      for (int i = 0; i < a.length - 1; i += 1) {
        a[i] = -456;
        b[i + 1] = -(byte)123;
      }
    }

    It's not obvious to me where that will be weeded out.

    tom
    find_align_to_ref shouldn't be setting _align_to_ref directly. It should be returning the values it finds and the caller should be responsible for saving off the proper value.
    Done.
    I like the simplification of the loads and stores but I end up wondering why the rest of the operations aren't similarly simplified.
    They could be, but they would need different formats since they use different asm instructions for different basic type sizes. But I can definitely merge ReplicateS and ReplicateC, which have the same size.
    Right, but that's the same as LoadVector and StoreVector.
    Overall the code looks good.
    tom
    Thanks,
    Vladimir
  • Vladimir Kozlov at Apr 9, 2012 at 7:35 pm

    Tom Rodriguez wrote:
    On Apr 9, 2012, at 5:54 PM, Vladimir Kozlov wrote:

    Tom Rodriguez wrote:

    I don't think Vec32B is better than VecY. And I don't want VecQ and VecO, which are difficult to distinguish.
    Except that VecY and VecX are fairly meaningless names.
    They are not meaningless; they represent the YMM and XMM CPU registers. That is
    not general, but they are currently used only on x86.
    be simplified to this:
    tempmask.clear_to_sets(lrg.num_regs());
    reg = tempmask.find_first_set(lrg.num_regs());
    same with the other uses and the Clear/Insert/set_mask_size sequence.
    Yes, it could be done. I tested it by calling the *_sets() methods from the *_pairs() methods. But I thought it was too aggressive a code change :)
    If you are OK with such a change, I will do it.
    It's aggressive but it should work I think. I could go either way.
    OK, I will go with the _sets() methods and the simplified code, and I will rerun
    the tests after that.
    matcher.cpp:
    this is kind of gross.
    + #if defined(IA32) || defined(AMD64)
    + Op_VecS, Op_VecD, Op_VecX, Op_VecY, /* Vectors */
    + #else
    + 0, Op_RegD, 0, 0, /* Vectors */
    + #endif
    can't the other platforms be required to use Op_VecS instead?
    You mean Op_VecD? The reason I did it this way is to reduce code changes in the .ad files on other platforms, which would not gain any new functionality. There is also code in the RA for pair misalignment on SPARC which I don't want to duplicate for vectors.
    Could those be filled in by asking the ad file instead?
    I can move this whole table initialization into .ad file. It works.
    Why aren't you doing copy elimination on vectors?
    What do you mean? All the changes in postaloc.cpp were done to make that work for vectors. I verified that it eliminates MSC where possible. Do you think I missed something?
    The block to update the mapping doesn't have the eliminate_copy_of_constant logic or the replace_and_yank_if_dead logic.
    A vector value can't be a constant (there is no type which represents it). The
    construction of a vector from a constant value is hidden in the Mach replicate node.

    But you are right about the case when (n->is_Copy() && value[nreg] == val). I
    need to call replace_and_yank_if_dead() in that case.
    0a0 B12: # B12 B13 <- B11 B12 Loop: B12-B12 inner main of N90 Freq: 9340.37
    0a0 movslq R10, R11 # i2l
    0a3 movdqu [RSI + #16 + R10 << #2],XMM0 ! store vector (16 bytes)
    0aa movq [RDX + #16 + R10],XMM1 ! store vector (8 bytes)
    0b1 movslq R10, R11 # i2l
    0b4 movdqu [RSI + #32 + R10 << #2],XMM0 ! store vector (16 bytes)
    0bb addl R11, #8 # int
    0bf cmpl R11, R8
    0c2 jl,s B12 # loop end P=0.999893 CF701.000000

    The previous code vectorized only one type in such a case. The new code collects only related memory operations during one iteration (as before these changes). Then it removes those operations from the memops list and tries to collect the other related memory ops. Such vectors need different loop index alignments since the vector sizes are different. The new code sets the maximum loop index alignment. The maximum alignment also works for the smaller sizes since the sizes are powers of 2.
    How does it deal with index variables that might be offset? Something like this:
    static void test_IBci(int[] a, byte[] b) {
      for (int i = 0; i < a.length - 1; i += 1) {
        a[i] = -456;
        b[i + 1] = -(byte)123;
      }
    }

    It's not obvious to me where that will be weeded out.
    0b3 movq [RDX + #17 + R10],XMM1 ! store vector (8 bytes)

    It generates an unaligned move. On x86 it does not matter (I used only unaligned
    asm instructions), but on SPARC it is a disaster (40 times slower since it traps):

    IBci: 2382
    IBvi: 65

    I will need to add a check (current alignment vs. max alignment) in
    find_adjacent_refs to vectorize only aligned memory ops on SPARC. I want to keep
    unaligned memory ops on x86 since we can win on vector arithmetic.
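
    The misalignment itself is simple address arithmetic; a hedged sketch (the
    16-byte base offset is an assumption matching the #16/#17 displacements in
    the listings above):

```java
// Illustrative: why the b[i + 1] store above cannot be 8-byte aligned.
class OffsetAlignment {
    static final int BASE = 16;  // assumed offset of element 0 (per the listings)

    // Byte offset of the store b[i + indexOffset].
    static int storeOffset(int i, int indexOffset) {
        return BASE + i + indexOffset;
    }

    public static void main(String[] args) {
        // b[i] with i % 8 == 0: offset 16, which is 8-byte aligned.
        System.out.println(storeOffset(0, 0) % 8);
        // b[i + 1] with i % 8 == 0: offset 17, always misaligned by one.
        System.out.println(storeOffset(0, 1) % 8);
    }
}
```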

    Vladimir
  • Tom Rodriguez at Apr 9, 2012 at 7:49 pm

    On Apr 9, 2012, at 7:35 PM, Vladimir Kozlov wrote:

    Tom Rodriguez wrote:
    On Apr 9, 2012, at 5:54 PM, Vladimir Kozlov wrote:
    Tom Rodriguez wrote:

    I don't think Vec32B is better than VecY. And I don't want VecQ and VecO, which are difficult to distinguish.
    Except that VecY and VecX are fairly meaningless names.
    They are not meaningless; they represent the YMM and XMM CPU registers. That is not general, but they are currently used only on x86.
    I mean that they lack implicit meaning. They are platform-dependent names promoted to the ideal form, and the S and D forms are really xmm too. Anyway, I seem to remember there are places where we want single letters for the ideal reg types, so maybe we just live with it. I agree that VecQ and VecO are unclear, and I can't think of a better name for the octo-word case. Maybe Vec[1234], where the number represents the power-of-2 size? Any of [SDXY] are livable, I guess.
    be simplified to this:
    tempmask.clear_to_sets(lrg.num_regs());
    reg = tempmask.find_first_set(lrg.num_regs());
    same with the other uses and the Clear/Insert/set_mask_size sequence.
    Yes, it could be done. I tested it by calling the *_sets() methods from the *_pairs() methods. But I thought it was too aggressive a code change :)
    If you are OK with such a change, I will do it.
    It's aggressive but it should work I think. I could go either way.
    OK, I will go with the _sets() methods and the simplified code, and I will rerun the tests after that.
    matcher.cpp:
    this is kind of gross.
    + #if defined(IA32) || defined(AMD64)
    + Op_VecS, Op_VecD, Op_VecX, Op_VecY, /* Vectors */
    + #else
    + 0, Op_RegD, 0, 0, /* Vectors */
    + #endif
    can't the other platforms be required to use Op_VecS instead?
    You mean Op_VecD? The reason I did it this way is to reduce code changes in the .ad files on other platforms, which would not gain any new functionality. There is also code in the RA for pair misalignment on SPARC which I don't want to duplicate for vectors.
    Could those be filled in by asking the ad file instead?
    I can move this whole table initialization into .ad file. It works.
    Why aren't you doing copy elimination on vectors?
    What do you mean? All the changes in postaloc.cpp were done to make that work for vectors. I verified that it eliminates MSC where possible. Do you think I missed something?
    The block to update the mapping doesn't have the eliminate_copy_of_constant logic or the replace_and_yank_if_dead logic.
    A vector value can't be a constant (there is no type which represents it). The construction of a vector from a constant value is hidden in the Mach replicate node.
    Oh right, I remember running into that with the original RegQ stuff. Since there's no ConNode for it, it will never trigger.
    But you are right about the case when (n->is_Copy() && value[nreg] == val). I need to call replace_and_yank_if_dead() in that case.
    Ok.
    0a0 B12: # B12 B13 <- B11 B12 Loop: B12-B12 inner main of N90 Freq: 9340.37
    0a0 movslq R10, R11 # i2l
    0a3 movdqu [RSI + #16 + R10 << #2],XMM0 ! store vector (16 bytes)
    0aa movq [RDX + #16 + R10],XMM1 ! store vector (8 bytes)
    0b1 movslq R10, R11 # i2l
    0b4 movdqu [RSI + #32 + R10 << #2],XMM0 ! store vector (16 bytes)
    0bb addl R11, #8 # int
    0bf cmpl R11, R8
    0c2 jl,s B12 # loop end P=0.999893 CF701.000000

    The previous code vectorized only one type in such a case. The new code collects only related memory operations during one iteration (as before these changes). Then it removes those operations from the memops list and tries to collect the other related memory ops. Such vectors need different loop index alignments since the vector sizes are different. The new code sets the maximum loop index alignment. The maximum alignment also works for the smaller sizes since the sizes are powers of 2.
    How does it deal with index variables that might be offset? Something like this:
    static void test_IBci(int[] a, byte[] b) {
      for (int i = 0; i < a.length - 1; i += 1) {
        a[i] = -456;
        b[i + 1] = -(byte)123;
      }
    }
    It's not obvious to me where that will be weeded out.
    0b3 movq [RDX + #17 + R10],XMM1 ! store vector (8 bytes)

    It generates an unaligned move. On x86 it does not matter (I used only unaligned asm instructions), but on SPARC it is a disaster (40 times slower since it traps):

    IBci: 2382
    IBvi: 65

    I will need to add a check (current alignment vs. max alignment) in find_adjacent_refs to vectorize only aligned memory ops on SPARC. I want to keep unaligned memory ops on x86 since we can win on vector arithmetic.
    Ok.

    tom
    Vladimir
    tom
    find_align_to_ref shouldn't be setting _align_to_ref directly. It should be returning the values it finds and the caller should be responsible for saving off the proper value. Done.
    I like the simplification of the loads and stores but I end up wondering why the rest of the operations aren't similarly simplified.
    They could, but they would need different formats since they use different asm instructions for different basic type sizes. But I can definitely merge ReplicateS and ReplicateC which have the same size.
    Right, but that's the same as LoadVector and StoreVector.
    Overall the code looks good.
    tom
    Thanks,
    Vladimir
    On Apr 3, 2012, at 10:03 AM, Vladimir Kozlov wrote:
    http://cr.openjdk.java.net/~kvn/7119644/webrev

    7119644: Increase superword's vector size up to 256 bits

    Increase superword's vector size up to 256 bits for YMM AVX registers on x86. Added generation of different vector sizes for different types of arrays in the same loop. Allow generating small (4 byte) vectors for loops which were unrolled a small number of iterations.
    Add new C2 types for vectors and rework VectorNode implementation. Used MachTypeNode as base node for vector mach nodes to keep vector type.
    Moved XMM registers definition and vector instructions into one file x86.ad (have to rename eRegI to rRegI in x86_32.ad).

    Tested with full CTW, NSK, C2 regression tests, JPRT and added new test.

    Thanks,
    Vladimir
  • Vladimir Kozlov at Apr 10, 2012 at 10:02 am

    Tom Rodriguez wrote:
    On Apr 9, 2012, at 7:35 PM, Vladimir Kozlov wrote:

    Tom Rodriguez wrote:
    On Apr 9, 2012, at 5:54 PM, Vladimir Kozlov wrote:
    Tom Rodriguez wrote:

    I don't think Vec32B is better than VecY. And I don't want VecQ and VecO which are difficult to distinguish.
    Except that VecY and VecX are fairly meaningless names.
    They are not meaningless, they represent YMM and XMM cpu registers. It is not general but they are used now only on x86.
    I mean that they lack implicit meaning. They are platform dependent names promoted to the ideal form and the S and D forms are really xmm too. Anyway, I seem to remember that there are places where we seem to want single letters for the ideal reg types so maybe we just live with it. I agree that VecQ and VecO are unclear and I can't think of a better name for the octo word case. Maybe Vec[1234] where the number represents power of 2 size? Any [SDXY] are livable I guess.
    I think we can use Greek prefixes:

    VecS (English single, I don't like VecM for Greek mono)
    VecD (duo)
    VecT (tetra)
    VecO (octa)

    What do you think?

    Vladimir
    be simplified to this:
    tempmask.clear_to_sets(lrg.num_regs());
    reg = tempmask.find_first_set(lrg.num_regs());
    same with the other uses and the Clear/Insert/set_mask_size sequence.
    Yes, it could be done. I tested it by calling *_sets() methods from *_pairs() methods. But I thought it was too aggressive a code change :)
    If you're OK with such a change I will do it.
    It's aggressive but it should work I think. I could go either way.
    OK. I will go with _sets() and simplified code. And I will rerun tests again after that.
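For readers outside the register allocator, here is a toy model (not the HotSpot RegMask API) of what a find_first_set-style scan over aligned register sets does; the helper name and bitmask layout are hypothetical:

```java
// Toy model: registers are bits in a long; scan aligned groups of
// numRegs consecutive bits and return the first fully-free group.
public class RegSetDemo {
    // Loosely mirrors find_first_set(num_regs) after clear_to_sets has
    // removed incomplete groups: returns the first aligned group whose
    // bits are all set, or -1 if none fits.
    static int findFirstSet(long mask, int numRegs) {
        long group = (1L << numRegs) - 1;
        for (int reg = 0; reg + numRegs <= 64; reg += numRegs) {
            if (((mask >>> reg) & group) == group) return reg;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Registers 0..2 and 4..7 free: a 4-register set first fits at 4,
        // a 2-register set already fits at 0.
        long mask = 0b1111_0111L;
        System.out.println(findFirstSet(mask, 4)); // prints 4
        System.out.println(findFirstSet(mask, 2)); // prints 0
    }
}
```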
    matcher.cpp:
    this is kind of gross.
    + #if defined(IA32) || defined(AMD64)
    + Op_VecS, Op_VecD, Op_VecX, Op_VecY, /* Vectors */
    + #else
    + 0, Op_RegD, 0, 0, /* Vectors */
    + #endif
    can't the other platforms be required to use Op_VecS instead?
    You mean Op_VecD? The reason I did it this way is to reduce code changes in .ad files on other platforms which will not add the new functionality. There is also code in RA for pairs misalignment on SPARC which I don't want to duplicate for vectors.
    Could those be filled in by asking the ad file instead?
    I can move this whole table initialization into .ad file. It works.
    Why aren't you doing copy elimination on vectors?
    What do you mean? All changes in postaloc.cpp are done for that to work for vectors. I verified that it eliminates MSC where possible. Do you think I missed something?
    The block to update the mapping doesn't have the eliminate_copy_of_constant logic or the replace_and_yank_if_dead logic.
  • Tom Rodriguez at Apr 11, 2012 at 9:41 pm

    On Apr 10, 2012, at 10:02 AM, Vladimir Kozlov wrote:

    I think we can use Greek prefixes:

    VecS (English single, I don't like VecM for Greek mono)
    VecD (duo)
    VecT (tetra)
    VecO (octa)

    What do you think?
    I don't know. It's got a clearer meaning in some ways but it's not so much better that it seems worth changing. I'll leave it up to you.

    tom

Discussion Overview
group: hotspot-compiler-dev @ openjdk
categories: openjdk
posted: Apr 3, '12 at 10:03a
active: Apr 11, '12 at 9:41p
posts: 10
users: 2 (Vladimir Kozlov: 6 posts, Tom Rodriguez: 4 posts)
website: openjdk.java.net
