Fast(er) bitboard move generator

Programming Topics (Computer Chess) and technical aspects as test techniques, book building, program tuning etc

Moderator: Andres Valverde

Re: Fast(er) bitboard move generator

Postby Gerd Isenberg » 29 Jun 2006, 21:24

great - interestingly your popcounts are 45,46 and 48.
Guess your random generator takes care of that - or is it some luck?

Shrinking down an value range is really fascinating stuff.
I really wonder what is the theorical lowest value range for rook-attacks with and/mul/shift? Intuitively i would say that at least one edge becomes very difficult, if not impossible, to go with 11-bit addresses. The one where most of 12 relevant occupied bits already reside in the upper 11-bits. Is your h8-square 63?
I'm not a mathematician and may be wrong of course. May be with so many magic bits set and magic overflows you'll find all the good collisions ;-)

Lasse's antirotated approach is really a serious one - and the fastest at least when we have very fast and huge caches (and fast mul). I fear my 1.5KByte approach is nice for bootloading to initialize the arrays ;-)

Cheers,
Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Fast(er) bitboard move generator

Postby Grant Osborne » 29 Jun 2006, 21:48

Gerd

In my engine A8 = 0, B8 = 1... H1 = 63, and the most difficult edge is A1 - H1 but I am making sure I generate very dense magic numbers here and this seems to be working. I have now found C1 and D1.

Just for fun I tried 10-bits. No success yet but I am getting very high good collisions. My feelings are that 10 bits would be the limit for rooks and 7-bits for bishops.

Regards
Grant
User avatar
Grant Osborne
 
Posts: 69
Joined: 16 Jun 2006, 14:05

Re: Fast(er) bitboard move generator

Postby Edsel Apostol » 30 Jun 2006, 04:14

Hi Lasse,

I think this is possible:
Code: Select all
AttackRook = ((AllWpieces|AllBpieces)&RookMask[sq])  * RookMagicTable[sq];


You only have to test the 12 bit combination or 4096 combinations of the RookMask every random number you generate for every square. If a random number produces the right attack bits for the the 4096 combinations of the RookMask for that square, that would be the magic number.

Would you give it a try?

Edsel
User avatar
Edsel Apostol
 
Posts: 73
Joined: 01 Aug 2005, 05:27
Location: Antique, Philippines

Re: Fast(er) bitboard move generator

Postby Gerd Isenberg » 30 Jun 2006, 06:07

Edsel,
i suggest before you post your proposals - to prove at least the simple case, where ((AllWpieces|AllBpieces)&RookMask[sq]) becomes zero. May be you need some additional xor with a supermagic ;-)

Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Fast(er) bitboard move generator

Postby Grant Osborne » 30 Jun 2006, 12:01

Lasse

Keys found for rook 2048 :-

C1 2AFFFF7FEDBFE74A
D1 4DFFFFDCAFFFDAFB
E1 4BFFFFEAEFE992B6

Only 3 left to find.

Grant
User avatar
Grant Osborne
 
Posts: 69
Joined: 16 Jun 2006, 14:05

Re: Fast(er) bitboard move generator

Postby Edsel Apostol » 04 Jul 2006, 05:05

Edsel,
i suggest before you post your proposals - to prove at least the simple case, where ((AllWpieces|AllBpieces)&RookMask[sq]) becomes zero. May be you need some additional xor with a supermagic

Gerd


Hi Gerd,

I think you are right. I may have overlooked it, you know, I'm just brainstorming and my ideas are often wild. I am just a 22 year old student and I don't have much knowledge on chess programming, so please forgive me sometimes for my not so knowledgeable postings. Thanks for pointing it out to me anyway.

Edsel
User avatar
Edsel Apostol
 
Posts: 73
Joined: 01 Aug 2005, 05:27
Location: Antique, Philippines

Re: Fast(er) bitboard move generator

Postby Gerd Isenberg » 04 Jul 2006, 08:06

Hi Edsel,
no problem ;-)
The fact to consider with integer multiplication - there is no way to shift some bits right. I think if we want rook- or bishop attacks without lookup but pure computation, Kogge-Stone comes in mind - for one of four straight directions there is the
Code: Select all
occupied ^ (occupied -2*rooks)
trick, rooks element of occupied. Kogge-Stone as well as this sub/xor-trick are able to get disjoint direction attack sets of multiple rooks or bishops.

Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Fast(er) bitboard move generator

Postby Tony van Roon-Werten » 05 Jul 2006, 09:26

Gerd Isenberg wrote:Hi Edsel,
no problem ;-)
The fact to consider with integer multiplication - there is no way to shift some bits right. ....

Gerd


Isn't there ?

How about:

Code: Select all

newMask=(oldMask & ~bitsToShift)|((bitsToShift & oldMask) >>shiftAmount)



Tony
Tony van Roon-Werten
 
Posts: 99
Joined: 02 Oct 2004, 15:31
Location: 's Hertogenbosch, Netherlands

Re: Fast(er) bitboard move generator

Postby Gerd Isenberg » 05 Jul 2006, 11:19

Yes, with additional instructions - i mean with a pure multiplication it is hard - if one factor has some trailing zeros - to get some of those bits set in the product, independent on how many and which bits are set in the second factor ;-)

If you have one single rook bitboard (let say e4) you will not find a magic factor to get either the rank- or file-attacks of this rook only by 64-bit integer multiplication, assuming otherwise empty file or rank. Thus, an obvious contradiction of Edsel's proposed formula.

Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Fast(er) bitboard move generator

Postby Grant Osborne » 05 Jul 2006, 14:11

Lasse

Did you find any of the 3 remaining rook 2048 keys?

Grant
User avatar
Grant Osborne
 
Posts: 69
Joined: 16 Jun 2006, 14:05

Re: Fast(er) bitboard move generator

Postby Lasse Hansen » 06 Jul 2006, 00:18

I have not focused on finding 2048 rook keys, so I havent found any new ones. Instead I have tried to improve move generation, and am currently implementing a faster check detection routine. The old routine is dead slow, because its testing -all- possible checks for -every- move.

One nice improvement I got by using an idea from Gerd, when using deBruin key/table for FirstOne, I use the (A&-A) result to clear the bit in the attack board. This gave a quite remarkable speed improvement for making quasi-legal moves.
Pos1:"rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w QKqk -"
Pos2:"r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w QKqk -"
Pos3:"q3k3/6q1/4q3/2q5/5Q2/3Q4/1Q6/3K3Q w - -"

Pos1: from 50Mnps to 71Mnps perft 6ply
Pos2: from 68Mnps to 86Mnps perft 5ply
Pos3: from 83Mnps to 124Mnps perft 5ply

Lasse
Lasse Hansen
 
Posts: 41
Joined: 07 Jun 2006, 21:25
Location: Porsgrunn, Norway

Re: Fast(er) bitboard move generator

Postby Gerd Isenberg » 06 Jul 2006, 18:51

Lasse Hansen wrote:I have not focused on finding 2048 rook keys, so I havent found any new ones. Instead I have tried to improve move generation, and am currently implementing a faster check detection routine. The old routine is dead slow, because its testing -all- possible checks for -every- move.

One nice improvement I got by using an idea from Gerd, when using deBruin key/table for FirstOne, I use the (A&-A) result to clear the bit in the attack board. This gave a quite remarkable speed improvement for making quasi-legal moves.
Pos1:"rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w QKqk -"
Pos2:"r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w QKqk -"
Pos3:"q3k3/6q1/4q3/2q5/5Q2/3Q4/1Q6/3K3Q w - -"

Pos1: from 50Mnps to 71Mnps perft 6ply
Pos2: from 68Mnps to 86Mnps perft 5ply
Pos3: from 83Mnps to 124Mnps perft 5ply

Lasse


Puhh - that is a huge speedup of 1.26 up to 1.5!
I guess your bitboard traversals are really a hotspot in your perft application, most critcal inner loops. How did you reset the bits before? Xor by Lookup with powOf2[foundIndex]?
It would be really interesting to see the generated assembly of a typical bitboard serialization loop. As an assembly programmer, not yet experienced with x86-64 calling convention, i would do something like this, keeping the loop invariant magic factor and pointer to DeBruijn lookup- table inside registers, as well the prepared from square:

Code: Select all
; input rdx - the target bitboard to serialize b
;       ecx - the source square
;       rdi - pointer to the move buffer
;       direction flag
    mov  rax, rdx ;  b
    neg  rax      ; -b
    jz   done
    shl  ecx, 6  ; from << 6
    lea  ebx, [magicLookup]
    mov  rsi, 0x03... ; De Bruijn factor
align 16
tloop:
    and  rax, rdx  ; b & -b
    xor  rdx, rax  ; reset found bit
    imul rax, rsi
    shr  rax, 58
    mov  eax, [rbx+4*rax]  ; table lookup -> to square
    add  eax, ecx  ; (from << 6) + to
    mov [rdi], eax ; store 32-bit move *
    mov  rax, rdx  ; b
    add  rdi, 4    ; pointer++ *
    neg  rax       ; -b
    jnz   tloop
done:

* four cycle vector path stosd replaced by mov,add

Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Fast(er) bitboard move generator

Postby Lasse Hansen » 07 Jul 2006, 07:43

Gerd Isenberg wrote:I guess your bitboard traversals are really a hotspot in your perft application, most critcal inner loops. How did you reset the bits before? Xor by Lookup with powOf2[foundIndex]?

I reset the bits using A&=ClrBit[square]; where Clrbit[.]=~SetBit[.]. So the cache have an easier time :). Here's a code example:
Code: Select all
      Pcs=WRooks;                     // WRooks
      while (Pcs) {
         From=FirstOne2(C=Pcs&-Pcs);
         Pattern = (AllWPieces | AllBPieces) & MaskRook[From];
         Address = (Pattern*HashKeyRook[From])>>(64-RAT_BITS);
         A = AllBPieces & RookAttacks[From][Address];
         while (A) {
            To=FirstOne2(B=A&-A);
            if ((CaptPiece=Board[To])==BKing) return -1;
            *M++=From+(To<<6)+(WRook<<12)+(CaptPiece<<16);
            A &= ~B;
         }
         Pcs &= ~C;
      }

I also modified FirstOne() to not do (A&-A). The assembly code generated by gcc with full optimisation is so dirty that I wont post it here.

Lasse
Lasse Hansen
 
Posts: 41
Joined: 07 Jun 2006, 21:25
Location: Porsgrunn, Norway

Re: Fast(er) bitboard move generator

Postby Gerd Isenberg » 07 Jul 2006, 18:59

Lasse Hansen wrote:I reset the bits using A&=ClrBit[square]; where Clrbit[.]=~SetBit[.]. So the cache have an easier time :). Here's a code example:
Code: Select all
      Pcs=WRooks;                     // WRooks
      while (Pcs) {
         From=FirstOne2(C=Pcs&-Pcs);
         Pattern = (AllWPieces | AllBPieces) & MaskRook[From];
         Address = (Pattern*HashKeyRook[From])>>(64-RAT_BITS);
         A = AllBPieces & RookAttacks[From][Address];
         while (A) {
            To=FirstOne2(B=A&-A);
            if ((CaptPiece=Board[To])==BKing) return -1;
            *M++=From+(To<<6)+(WRook<<12)+(CaptPiece<<16);
            A &= ~B;
         }
         Pcs &= ~C;
      }

I also modified FirstOne() to not do (A&-A). The assembly code generated by gcc with full optimisation is so dirty that I wont post it here.

Lasse


I see, already a symptom of cache pollution. Some profiler that reports L1-cache misses would be a nice tool. Isn't xor a tight bit faster than &~, aka one versus two instructions?

I suggest some loop hoisting and to reset immediatly to reduce register pressure. Sometimes if-do-while is faster than while.
<Edit>But yes, i saw to late it was already a capture loop, thus the loop hoisting is almost done by yourself. Since the set is rarely populated, loop hoisting is not so important here as i thought before.</Edit>
Code: Select all
      Pcs=WRooks;                     // WRooks
      while (Pcs) {
         From=FirstOne2(C=Pcs&-Pcs);
         Pcs ^= C;
         Pattern = (AllWPieces | AllBPieces) & MaskRook[From];
         Address = (Pattern*HashKeyRook[From])>>(64-RAT_BITS);
         A = AllBPieces & RookAttacks[From][Address];
         if ( A & Kings ) return -1;
         while (A) {
            To=FirstOne2(B=A&-A);
            A ^= B;
            *M++=From+(To<<6)+(WRook<<12)+(Board[To]<<16);
         }
      }

Is the assembly so worse? I guess x86-64 calling conventions.
Btw. Agner Fog's updated manuals might be of interest:
http://www.agner.org/optimize/#manuals

Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Fast(er) bitboard move generator

Postby Lasse Hansen » 08 Jul 2006, 11:04

I have tried those ideas Gerd gave.

1. Having an own Kings bitmap to check king captured outside inner loop improved the speed by ~1Mnps (quasi-legal).

2. Moving the bit-clearing to after FirstOne -redused- the speed by ~4Mnps.

3. Using XOR instead of &~ increased the speed by ~0.5 Mnps, but -only- in the capture move generator. When i tried in non-capture gen. it slowed down. It might have something to do with the inner loop code size (?).

Gcc acts strange..., even by adding comments to the source code I -might- get reduced speed.

Lasse
Lasse Hansen
 
Posts: 41
Joined: 07 Jun 2006, 21:25
Location: Porsgrunn, Norway

Re: Fast(er) bitboard move generator

Postby Gerd Isenberg » 08 Jul 2006, 21:08

Lasse Hansen wrote:I have tried those ideas Gerd gave.

1. Having an own Kings bitmap to check king captured outside inner loop improved the speed by ~1Mnps (quasi-legal).

2. Moving the bit-clearing to after FirstOne -redused- the speed by ~4Mnps.

3. Using XOR instead of &~ increased the speed by ~0.5 Mnps, but -only- in the capture move generator. When i tried in non-capture gen. it slowed down. It might have something to do with the inner loop code size (?).

Gcc acts strange..., even by adding comments to the source code I -might- get reduced speed.

Lasse

Advanced chaos theory ;-)
You may give this a try:
Code: Select all
      // pragma I LOVE GCC
      Pcs = WRooks;
      while ( Pcs ) {
         A = Pcs;
         Pcs &= Pcs-1; // reset found rook
         From=FirstOne2(Pcs ^ A);
         Pattern = (AllWPieces | AllBPieces) & MaskRook[From];
         Address = (Pattern*HashKeyRook[From])>>(64-RAT_BITS);
         A = AllBPieces & RookAttacks[From][Address];
         if ( A & Kings ) return -1;
         if ( A )  {
            From += (WRook<<12);
            do {
               B = A;
               A&= A-1;
               To = FirstOne2(A ^ B);
               *M++ = From + (To<<6) + (Board[To]<<16);
            } while ( A );
         }
      }

Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Fast(er) bitboard move generator

Postby Rémi Coulom » 09 Jul 2006, 07:15

Gerd Isenberg wrote:I see, already a symptom of cache pollution. Some profiler that reports L1-cache misses would be a nice tool.

http://valgrind.org/
in particular:
http://valgrind.org/info/tools.html#cachegrind

R?mi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Re: Fast(er) bitboard move generator

Postby Lasse Hansen » 12 Jul 2006, 13:52

I just had another idea of using hash keys to find moves. Instead of looking up an attack board, why not replace tha attack bitboard with a list of To-squares? A piece can never capture more than 8 pieces at once, and by using 6 bits per To-square, one will have 64-48=16 bits to tell how many genuine moves there are in the 64 bit word. This replaces the FirstOne with a shift+and. For non-captures, the queen might attack 27 squares, this could fit in 3x 64 bit words.

Lasse
Lasse Hansen
 
Posts: 41
Joined: 07 Jun 2006, 21:25
Location: Porsgrunn, Norway

Re: Fast(er) bitboard move generator

Postby Gerd Isenberg » 12 Jul 2006, 19:18

Lasse Hansen wrote:I just had another idea of using hash keys to find moves. Instead of looking up an attack board, why not replace tha attack bitboard with a list of To-squares? A piece can never capture more than 8 pieces at once, and by using 6 bits per To-square, one will have 64-48=16 bits to tell how many genuine moves there are in the 64 bit word. This replaces the FirstOne with a shift+and. For non-captures, the queen might attack 27 squares, this could fit in 3x 64 bit words.

Lasse

You really mean queen Addresses? Table will take some space.
Code: Select all
queenPattern = allPieces & queenMask[queenSquare];
queenAddress = (queenPattern * queenMagic[queenSquare]) >> QUAT_BITS;
...

Even for rooks and bishops you need to check up to four potential occluded squares, whether they are occupied by own or opposite piece, especially the enemy king for legality check. That requires some conditions in the inner loop with the risk of branch miss-predictions. Also if you pack the to-squares so dense, you need additional costs to prepare the moves. Consider how fast a 0x88 approach with precalculated nextSquare, nextDir is.

The point with bitboards for movegen is a fast capture generation for q-search. To distinguish certain disjoint subsets of target squares by intersection (and), like occupied by opposite pieces, queen, rooks etc. and empty square, or not attacked by enemy pawns etc.

I like the idea to avoid bitscans as well - i have the bitboard metric world with Kogge-Stone fills in mind with an almost branchless movegen...

Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Fast(er) bitboard move generator

Postby kalyankumar_77 » 16 Oct 2011, 06:30

Hi Lasse,

Can you please point me towards a detailed forum discussion or some other documentation with some code snippets preferably for bitboard move generation?
Note that I'm not interested in rotated bitboards.

Thanks in advance!

--kalyan
kalyankumar_77
 
Posts: 1
Joined: 19 Dec 2010, 16:59

PreviousNext

Return to Programming and Technical Discussions

Who is online

Users browsing this forum: No registered users and 2 guests

cron