New instruction that intel/amd should add

Programming Topics (Computer Chess) and technical aspects as test techniques, book building, program tuning etc

Moderator: Andres Valverde

New instruction that intel/amd should add

Postby Michael Sherwin » 05 Dec 2006, 23:51

If anyone from intel/amd reads here or somebody that knows somebody at intel/amd then they need to get the word out that there should be a new instruction included in future chips. This instruction would be very powerful and useful in any aplication that needs to compress scattered bits into a smaller space or just move bits around.

It would work like this:

Given a source mask and a destination mask, apply the source mask to a bit pattern in a register, so that the first bit common to the source mask and the bit pattern is set in the corresponding position (determined by the destination mask) of a register. So bit 7 of the source mask will be mapped to bit 7 of the destination mask if it exist in the bit pattern.

Please, please, please, please! :D . Not that I'm begging! :mrgreen:
User avatar
Michael Sherwin
 
Posts: 168
Joined: 21 Jul 2005, 06:10
Location: USA

Re: New instruction that intel/amd should add

Postby H.G.Muller » 06 Dec 2006, 00:13

I would have much more use for instructions that rotate a bitboard by n*45 degrees.
User avatar
H.G.Muller
 
Posts: 3340
Joined: 16 Nov 2005, 12:02
Location: Diemen, NL

Re: New instruction that intel/amd should add

Postby Michael Sherwin » 06 Dec 2006, 01:39

H.G.Muller wrote:I would have much more use for instructions that rotate a bitboard by n*45 degrees.


Are you sure? With what I suggested an index to a lookup table containg an attack set would take very few machine instructions and would produce a perfectly compressed index. I bet people could think up a hundred uses for such an instruction other than creating an index. Bits would be able to be shifted seperatly, some could be shifted up while others are shifted down at the same time as well as mapped in all kinds of crazy ways and they could be expanded as well as compressed. A sequence of such instructions could do massive amounts of parallel processing of smaller data types!

What are some of the things that can be done with a degree rotation instruction?
User avatar
Michael Sherwin
 
Posts: 168
Joined: 21 Jul 2005, 06:10
Location: USA

Re: New instruction that intel/amd should add

Postby Michael Sherwin » 06 Dec 2006, 08:38

Michael Sherwin wrote:If anyone from intel/amd reads here or somebody that knows somebody at intel/amd then they need to get the word out that there should be a new instruction included in future chips. This instruction would be very powerful and useful in any aplication that needs to compress scattered bits into a smaller space or just move bits around.

It would work like this:

Given a source mask and a destination mask, apply the source mask to a bit pattern in a register, so that the first bit common to the source mask and the bit pattern is set in the corresponding position (determined by the destination mask) of a register. So bit 7 of the source mask will be mapped to bit 7 of the destination mask if it exist in the bit pattern.

Please, please, please, please! :D . Not that I'm begging! :mrgreen:


An excellant idea would be to be able to do simple math on the source data pattern controled by the source mask such that, if for example, you wanted to add 1 to the masked pattern the 1 would be added to the first bit of the source data under the source mask and the carry would go to the next masked bit. This would allow simple iteration through all the possible sets (values) of the masked data. Also an add mask could be created that would add one to every bit in the pattern under the add mask.

If anyone of any influence is reading this and can see the power of this computing paradigm then they need to get the ball rolling on this. The next generation single core chip that would result would be more powerful at doing many things than this generations Quad core!

Maybe a non contiguous bit coprosessor can be created first and integrated in the future. Graphics card manufacturers may also want this ability. Just make sure that us chess programmers have access to it!

So people of this forum, we need to enumerate all the useful things that this idea can do, create a monster thread and then send it all to intel and/or amd. :D Or everyone could just ignor me. :(
User avatar
Michael Sherwin
 
Posts: 168
Joined: 21 Jul 2005, 06:10
Location: USA

Re: New instruction that intel/amd should add

Postby H.G.Muller » 06 Dec 2006, 10:20

Michael Sherwin wrote:What are some of the things that can be done with a degree rotation instruction?

What you could do is rotate the occupation board such that the attack ray you want to calculate becomes contiguous and ordered in the direction low bit to high bit, so that you can use hardware carry propagation to create the attacks without any table lookups.

An even more useful alternative would be an instruction that packed the bits corresponding to various half-rays in the bytes of a (64-bit) word. It would be similar in syntax to a shift instruction, one operand would be the 64-bit source/destination register containing the bitboard, the low-order 6 bits of the second source operand would contain the square number. If that number was (say) D3, E3-H3 would go to bits 0-3, C3-A3 to bits 8-10, D4-D8 to bits 16-20, D2-D1 to bits 24-25, E4-H7 to bits 32-35, C2-B1 to bits 40-41, C4-A6 to bits 48-50 and E2-F1 to bits 56-57. The remaining bits would stay zero.

Applying such an instruction to the occupied bitboard, you could then calculate the attacks of all 8 rays simultaneaously by normal carry propagation, through subtracting 0x0101010101010101LL (similar to the 'occupied^(occupied-2*rooks)' trick. To calculate the attacks for a Rook or Bishop you would subtract 0x01010101 or 0x0101010100000000LL, respectively. Of course you would have the inverse instruction, shuffling the 8 rays back into regular bitboard order (filling all bits not on the rays from the specified square with zeros) to directly give you the attack set.

For collecting key bits the instruction already exists. It is called multiply...
User avatar
H.G.Muller
 
Posts: 3340
Joined: 16 Nov 2005, 12:02
Location: Diemen, NL

Re: New instruction that intel/amd should add

Postby H.G.Muller » 06 Dec 2006, 11:35

To not make the demands on chip real-estate too heavy, I might even settle for an instruction that does the shuffle of the bits in a square-independent way. Using a normal rotate instruction (ror) to position the occupied square into the most-significant bit without loss of bits, and then use the new instruction to collect the 8 x 7 bits that could be on the 8 rays (if they had not wrapped around the board edge) in the proper order into the 8 bytes of the word.

This would just require 2 new instructions with a fixed shuffled pattern in its single (64-bit) register operand: ray-to-board (r2b) and its inverse board-to-ray (b2r).

You could then generate e.g. QueenAttacks by rotating the Occupied bitboard right by the Square number, applying a board-to-ray instruction, masking with a square-dependent mask (from a 64-entry table) to kill any bits that went over the edge (using an OR to create edge stops in the 8th (most-significant) bit of each byte), subtracting 0x01010101...., XORing the result with the unsubtracted value, and use ray-to-board and the opposite rotate to create the Attacks bitboard. So 2 rotates, the 2 new instructions, an OR, a SUB and an XOR, 7 instructions in total.

The exact mapping of the new instructions would be:
Code: Select all
brd ray
  1  0 Rank to the left
  2  1
  3  2
  4  3
  5  4
  6  5
  7  6
 63 8 Rank to the right
 62  9
 61 10
 60 11
 59 12
 58 13
 57 14
  8 16 File forward
 16 17
 24 18
 32 19
 40 20
 48 21
 56 22
 56 24 File backward
 48 25
 40 26
 32 27
 24 28
 16 29
  8 30
  9 32 Diagonal left forward
 18 33
 27 34
 36 35
 45 36
 54 37
 63 38
 55 40 Diagonal right backward
 46 41
 37 42
 28 43
 19 44
 10 45
  1 46
  7 48 Diagonal right forward
 14 49
 21 50
 28 51
 35 52
 42 53
 49 54
 57 56 Diagonal left backward
 50 57
 43 58
 36 59
 29 60
 22 61
 15 62

Note that the mapping is not 1-to-1, and some bits might have two destinations (in board-to-ray) or two sources (in ray-to-board). With two sources the result bit would simply be the OR of the two source bits. This would not create any ambiguity, since the Attacks, once generated in ray form, would not contain any over-the-edge bits.

That a certain square (e.g. 7) seems to be on 2 rays to board-to-ray is no problem: one of the two paths reaching it will always cross the board edge, and will be masked off (to a one). Which one of the two that is depends on the location of the square, so the mask can know that.
Last edited by H.G.Muller on 06 Dec 2006, 12:13, edited 4 times in total.
User avatar
H.G.Muller
 
Posts: 3340
Joined: 16 Nov 2005, 12:02
Location: Diemen, NL

Re: New instruction that intel/amd should add

Postby Michael Sherwin » 06 Dec 2006, 11:43

H.G.Muller wrote:
Michael Sherwin wrote:What are some of the things that can be done with a degree rotation instruction?

What you could do is rotate the occupation board such that the attack ray you want to calculate becomes contiguous and ordered in the direction low bit to high bit, so that you can use hardware carry propagation to create the attacks without any table lookups.

An even more useful alternative would be an instruction that packed the bits corresponding to various half-rays in the bytes of a (64-bit) word. It would be similar in syntax to a shift instruction, one operand would be the 64-bit source/destination register containing the bitboard, the low-order 6 bits of the second source operand would contain the square number. If that number was (say) D3, E3-H3 would go to bits 0-3, C3-A3 to bits 8-10, D4-D8 to bits 16-20, D2-D1 to bits 24-25, E4-H7 to bits 32-35, C2-B1 to bits 40-41, C4-A6 to bits 48-50 and E2-F1 to bits 56-57. The remaining bits would stay zero.

Applying such an instruction to the occupied bitboard, you could then calculate the attacks of all 8 rays simultaneaously by normal carry propagation, through subtracting 0x0101010101010101LL (similar to the 'occupied^(occupied-2*rooks)' trick. To calculate the attacks for a Rook or Bishop you would subtract 0x01010101 or 0x0101010100000000LL, respectively. Of course you would have the inverse instruction, shuffling the 8 rays back into regular bitboard order (filling all bits not on the rays from the specified square with zeros) to directly give you the attack set.

For collecting key bits the instruction already exists. It is called multiply...


Very nice for chess! :D However, if it can not apply to other domains other than chess and chess like games then the chip manufacturers won't add it to their chips. What other types of applications can benifit from those instructions?
User avatar
Michael Sherwin
 
Posts: 168
Joined: 21 Jul 2005, 06:10
Location: USA

Re: New instruction that intel/amd should add

Postby Michael Sherwin » 06 Dec 2006, 11:54

H.G.Muller wrote:To not make the demands on chip real-estate too heavy, I might even settle for an instruction that does the shuffle of the bits in a square independent way. Using a normal rotate instruction (ror) to position the occupied square into the most-significant bit without loss of bits, and then use the new instruction to collect the 8 x 7 bits that could be on the 8 rays (if they had not wrapped around the board edge) in the proper order into the 8 bytes of the word.

This would just require 2 new instructions with a fixed shuffled pattern in its single (64-bit) register operand: ray-to-board (r2b) and its inverse board-to-ray (b2r).

You could then generate e.g. QueenAttacks by rotating the Occupied bitboard right by the Square number, applying a board-to-ray instruction, masking with a square-dependent mask (from a 64-entry table) to kill any bits that went over the edge (using an OR to create edge stops in the 8th (most-significant) bit of each byte), subtracting 0x01010101...., XORing the result with the unsubtracted value, and use ray-to-board and the opposite rotate to create the Attacks bitboard. So 2 rotates, the 2 new instructions, an OR, a SUB and an XOR, 7 instructions in total.


What we have both proposed would surely allow the next generation single core chip to run circles around todays quad core chips. I do not see a realestate problem in that respect. Even if a quad core had but one specialized bit operations coprossor built in, the extra realestate added would be minimal compared to that of the 4 cores!
User avatar
Michael Sherwin
 
Posts: 168
Joined: 21 Jul 2005, 06:10
Location: USA

Re: New instruction that intel/amd should add

Postby H.G.Muller » 06 Dec 2006, 12:20

I am not sure we couldn't convince CPU builders to add two simple instructions if it helps them to score better on the SpecInt benchmark. Crafty is part of that benchmark.
User avatar
H.G.Muller
 
Posts: 3340
Joined: 16 Nov 2005, 12:02
Location: Diemen, NL

Re: New instruction that intel/amd should add

Postby Michael Sherwin » 06 Dec 2006, 12:54

H.G.Muller wrote:I am not sure we couldn't convince CPU builders to add two simple instructions if it helps them to score better on the SpecInt benchmark. Crafty is part of that benchmark.


Cool! 8-)
User avatar
Michael Sherwin
 
Posts: 168
Joined: 21 Jul 2005, 06:10
Location: USA

Re: New instruction that intel/amd should add

Postby Gian-Carlo Pascutto » 08 Dec 2006, 17:17

You would also need a new compiler that is smart enough to use the new instructions.

BTW. The latest SPEC contains Sjeng, but not Crafty.
Gian-Carlo Pascutto
 
Posts: 42
Joined: 04 Jul 2005, 13:24
Location: Belgium

Re: New instruction that intel/amd should add

Postby Steffan Westcott » 10 Dec 2006, 01:08

Michael Sherwin wrote:An excellant idea would be to be able to do simple math on the source data pattern controled by the source mask such that, if for example, you wanted to add 1 to the masked pattern the 1 would be added to the first bit of the source data under the source mask and the carry would go to the next masked bit. This would allow simple iteration through all the possible sets (values) of the masked data.

Code to do this with simple existing instructions has already been presented to you, perhaps you missed it:
Code: Select all
void enumsets(const BitBoard d)
{
    BitBoard n = 0;
    do {
        do_something_with_set(n);
        n = (n-d) & d;
    } while (n);
}


Cheers,
Steffan
Steffan Westcott
 
Posts: 3
Joined: 04 Aug 2005, 08:58
Location: England, UK

Re: New instruction that intel/amd should add

Postby Michael Sherwin » 10 Dec 2006, 07:17

Steffan Westcott wrote:
Michael Sherwin wrote:An excellant idea would be to be able to do simple math on the source data pattern controled by the source mask such that, if for example, you wanted to add 1 to the masked pattern the 1 would be added to the first bit of the source data under the source mask and the carry would go to the next masked bit. This would allow simple iteration through all the possible sets (values) of the masked data.

Code to do this with simple existing instructions has already been presented to you, perhaps you missed it:
Code: Select all
void enumsets(const BitBoard d)
{
    BitBoard n = 0;
    do {
        do_something_with_set(n);
        n = (n-d) & d;
    } while (n);
}


Cheers,
Steffan


As anything can be done in software then the whole point of my idea was to do it faster in hardware. Or did you not grasp that?

Cheers,
Mike
User avatar
Michael Sherwin
 
Posts: 168
Joined: 21 Jul 2005, 06:10
Location: USA

Re: New instruction that intel/amd should add

Postby Gerd Isenberg » 10 Dec 2006, 10:35

Michael Sherwin wrote: As anything can be done in software then the whole point of my idea was to do it faster in hardware. Or did you not grasp that?


You really want a suband alu-instruction? Seems like a waste of silicon. We will get population-count and leading zero count in the near future - bsf/bsr has become fast again on new intels. I still vote for a reverse add 2+2=1 ;-)

Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: New instruction that intel/amd should add

Postby H.G.Muller » 10 Dec 2006, 10:57

Would that really be more useful than the board-to-ray and ray-to-board?

How would you use it?
User avatar
H.G.Muller
 
Posts: 3340
Joined: 16 Nov 2005, 12:02
Location: Diemen, NL

Re: New instruction that intel/amd should add

Postby Gerd Isenberg » 10 Dec 2006, 11:16

H.G.Muller wrote:Would that really be more useful than the board-to-ray and ray-to-board?

How would you use it?


Board-to-ray and ray-to-board would be nice, but seems quite restricted to the domain of chess or some other board games, while reverse add/sub might be more general for bit-twiddling tricks based to mask/reset msb instead of lsb.
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: New instruction that intel/amd should add

Postby H.G.Muller » 10 Dec 2006, 12:47

Reverse add also seems not very general, and is rather expensive to add: you would basically need a separate adder for that.

You could achieve nearly the same by a reverse instruction, to reverse the bit-order before you add, with much less new hardware. If we are going for generality, we might best ask for that, as part of an entire series of bit-shuffle instructions. Or, preferably, a programmable bit-shuffle instruction, where a second operand specifies what kind of shuffle we want.

How about a 'barrel swapper', similar to a barrel shifter. In the latter, the n-th bit of the second operand (the shift count) controls if you shift by 2^n or not, and it is usually implemented as a sequence of two-input multiplexers that are wired to either transmit the unshifted input, or the input shifted (or rotated) by 2^n. (In practice they probably implement it as two 8-input multiplexers in tandem, because I can't imagine that you could propagate through the 6 multiplexers needed to do any shift from 0 to 63 in a single clock cycle.) In the barrel swapper you would use the similar strategy, but in stead of shifting by 2^n bit (or not) you would swap the two halfs of a subunit of 2^(n+1) bits (or not). So if the swap selector would be 63, you would completely reverse the word, if it was 7, you would only reverse the bits in each byte, if it was 56 you would reverse the byte order, etc.

Of course it would be a pity to only use 6 bits out of the available 64 in the swap-selector operand (as the shifts do). Without any additional hardware you could control the multiplexers that perform the swap over a certain distance not for the full width of the word, but in sub-sections. E.g. in stead of having a single bit that controls if bit 2*i and 2*i+1 are swapped for every i=0-31 you could have 8 bits that independently control it for i=0-3, i=4-7, ... i=28-31, so that you can independently shuffle all eight bytes in a word. Similarly, for swap distances of a byte or more, you could independently specify if that swap should occur for each bit in the byte.

This should provide very flexible bit manipulation.
Last edited by H.G.Muller on 10 Dec 2006, 13:24, edited 3 times in total.
User avatar
H.G.Muller
 
Posts: 3340
Joined: 16 Nov 2005, 12:02
Location: Diemen, NL

Re: New instruction that intel/amd should add

Postby Gerd Isenberg » 10 Dec 2006, 13:20

While my idea was in the sense to don't discriminate carry propagation in different directions, your 'barrel swapper' is really a serious proposal.
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: New instruction that intel/amd should add

Postby H.G.Muller » 10 Dec 2006, 13:28

If I am not mistaking, you can mirror a bitboard in the diagonal with two such instructions: (At least, for 4x4, I was too lazy to write it out for 8x8, but I believe it follows through induction. :wink: )
Code: Select all
original
0 1 2 3
4 5 6 7
8 9 A B
C D E F

swap bits of bytes (well, nybbles in this example, actually)
0 1 2 3   1st 'byte': unchanged
5 4 7 6   2nd 'byte': bits of duos
A B 8 9   3rd 'byte': duos of nybbles
F E D C   4th 'byte': both

swap bytes of bits (in same instruction!)
 _______ 1st bit: unchanged
|  _____ 2nd bit: swap neighboring 'bytes'
| |  ___ 3rd bit: swap neighboring 'words'
| | |  _ 4th bit: swap both
| | | |
v v v v
0 4 8 C
5 1 D 9
A E 2 6
F B 7 3

swap bits of bytes (second instruction)
0 4 8 C   1st 'byte': unchanged
1 5 9 D   2nd 'byte': bits of duos
2 6 A E   3rd 'byte': duos of nybble
3 7 B F   4th 'byte': both

(shuffling bits between bytes is not necessary in the second instruction)

Horizontal and vertical reflections go with a single instruction, (actually, hallf such an instruction :D ), as does 180-degree rotation. 90-degree rotations are just variants of the diagonal mirror, and also take two instructions.

From a hardware point of view the instructions would not be any more complex than normal shifts. (Just wired differently. Perhaps a clever design could even combine the existing barrel shifter with the barrel swapper, because half of those are doing the same thing.) So this should be an instruction with latency 1, and a throughput of 1 per cycle.
User avatar
H.G.Muller
 
Posts: 3340
Joined: 16 Nov 2005, 12:02
Location: Diemen, NL

Re: New instruction that intel/amd should add

Postby Michael Sherwin » 10 Dec 2006, 18:39

Gerd Isenberg wrote:
Michael Sherwin wrote: As anything can be done in software then the whole point of my idea was to do it faster in hardware. Or did you not grasp that?


You really want a suband alu-instruction? Seems like a waste of silicon. We will get population-count and leading zero count in the near future - bsf/bsr has become fast again on new intels. I still vote for a reverse add 2+2=1 ;-)

Gerd


There was a whole lot more to my proposal than the one simple example that Westcott decided to pick on and was not even the main focus of my idea. The main focus of my idea was a noncintiguous bit sub processor under control of source mask and destination mask. A much bigger waste of silicon! :wink:
User avatar
Michael Sherwin
 
Posts: 168
Joined: 21 Jul 2005, 06:10
Location: USA

Next

Return to Programming and Technical Discussions

Who is online

Users browsing this forum: No registered users and 5 guests