Winboard Forum

by **Jeroen de Bruin** » 07 Jul 2010, 12:34

Hi

I'm implementing a few bitboard methods and when done will benchmark them on various different hardware platforms, so far I have the following (most courtesy of Harald Lüßen):

Code: Select all: +----------------------------+-----------------------+--------------+ | Name | Inventor or Supporter | Table Size | +----------------------------+-----------------------+--------------+ | Rotated Bitboards (naive) | Robert Hyatt | 77.2 KByte | | Rotated Bitboards (aligned)| ? | 35.0 KByte | | Rotated Bitboards (switch) | (Harald Lüßen?) | 14.9 KByte | | Rotated Indices | Alessandro Damiani | 128.3 KByte | | Exploding Bitboards | Harald Lüßen | 3.5 KByte | | Magic Multiplication | Gerd Isenberg | 9.7 KByte | | Magic Multiplication 32 bit| Gerd Isenberg | 9.7 KByte | | Sherwin Index | Michael Sherwin | 1402.4 KByte | | Pradu Minimize Magic | Pradyumna Kannan | 844.0 KByte | | Pradu Perfect Magic Hash | and | 627.4 KByte | | Pradu Big | Lasse Hansen | 2306.0 KByte | | Kogge-Stone | Steffan Westcott | 0.0 KByte | | Simple Shift | -- | 0.0 KByte | | Naive Shift | -- | 0.0 KByte | | Onno Fixed Shift Magic | Onno Garms | -- | +----------------------------+-----------------------+--------------+

I was wondering if any competitive variants are omitted or if there were any significant advancements made in any of the aforementioned methods in the last 3 odd years that I might be unaware of?

Jeroen

by **Gerd Isenberg** » 07 Jul 2010, 23:05

Hi Jeroen,

A perft-like isolated performance benchmark is not necessarily adaptable to a real chess program and its own memory-, cache- and ipc foot-print. With a real eval, or where one often temporary alters occupancy for x-ray stuff and pinned piece detection, etc.. For the usual piece-wise attack design, magic seems fastest in most programs, may be even with fixed shift and MB pages instead of 4K pages. My favorite with SSE* is to build attack-tables and legal move-target sets from a quad-bitboard on the fly after pre-fetching a cache-line from TT to hide its latency with register and ALU intensive direction-wise Kogge-Stone stuff, which implies to probe TT also in qs.

https://chessprogramming.wikispaces.com ... ce+Attacks

Cheers,
Gerd

by **Miguel A. Ballicora** » 08 Jul 2010, 02:24

Jeroen de Bruin wrote:Hi

I'm implementing a few bitboard methods and when done will benchmark them on various different hardware platforms, so far I have the following (most courtesy of Harald Lüßen):

Code: Select all
+----------------------------+-----------------------+--------------+ | Name | Inventor or Supporter | Table Size | +----------------------------+-----------------------+--------------+ | Rotated Bitboards (naive) | Robert Hyatt | 77.2 KByte | | Rotated Bitboards (aligned)| ? | 35.0 KByte | | Rotated Bitboards (switch) | (Harald Lüßen?) | 14.9 KByte | | Rotated Indices | Alessandro Damiani | 128.3 KByte | | Exploding Bitboards | Harald Lüßen | 3.5 KByte | | Magic Multiplication | Gerd Isenberg | 9.7 KByte | | Magic Multiplication 32 bit| Gerd Isenberg | 9.7 KByte | | Sherwin Index | Michael Sherwin | 1402.4 KByte | | Pradu Minimize Magic | Pradyumna Kannan | 844.0 KByte | | Pradu Perfect Magic Hash | and | 627.4 KByte | | Pradu Big | Lasse Hansen | 2306.0 KByte | | Kogge-Stone | Steffan Westcott | 0.0 KByte | | Simple Shift | -- | 0.0 KByte | | Naive Shift | -- | 0.0 KByte | | Onno Fixed Shift Magic | Onno Garms | -- | +----------------------------+-----------------------+--------------+

I was wondering if any competitive variants are omitted or if there were any significant advancements made in any of the aforementioned methods in the last 3 odd years that I might be unaware of?

Jeroen

This is what I do in Gaviota. Certainly not the fastest, but it is very simple. I does some sort of a bucketfill procedure. I have no idea how original it is, but I have not heard that anybody do it that way. At least, I did not see a big slowdown compared to my own rotated version that I had before. I do not think you will choose them if performance is an issue though.

I hope the code is self explanatory. You will need an array uint64_t Reach[PIECE][FROM] that contains the bitboard that a PIECE attacks from the square FROM in an empty board. I generate those in some other part of the program. "occ" is the bitboard with the occupancy (bits that are "on" are the ones with a piece or pawn in them, regardless of color)
not_A and not_H are bitboards with the column A and H turned off.

Miguel

Code: Select all: static uint64_t grow (uint64_t x) { x |= (x << 1) & not_A; x |= (x >> 1) & not_H; x |= x >> 8; x |= x << 8; return x; } static uint64_t growlat (uint64_t x) { uint64_t y = x; x |= (x << 1) & not_A; x |= (x >> 1) & not_H; y |= y >> 8; y |= y << 8; return x | y; } extern uint64_t B_attacks (SQUARE from, uint64_t occ) { uint64_t g, corral, prev, r, ini; r = Reach[BISHOP][from]; corral = r & ~occ; ini = U64(1) << from; g = ini; do { prev = g; g = grow (prev) & corral; } while (prev != g); g |= ini; return (grow(g) & r) & ~ini; } extern uint64_t R_attacks (SQUARE from, uint64_t occ) { uint64_t g, corral, prev, r, ini; r = Reach[ROOK][from]; corral = r & ~occ; ini = U64(1) << from; g = ini; do { prev = g; g = grow (prev) & corral; } while (prev != g); g |= ini; return (grow(g) & r) & ~ini; } extern uint64_t Q_attacks (SQUARE from, uint64_t occ) { uint64_t g, corral, prev, r, ini, h; r = Reach[QUEEN][from]; ini = U64(1) << from; g = grow (ini) & ~ini; h = growlat (g & occ); h &= ~g; corral = r & ~occ; corral &= ~h; g &= ~occ; do { prev = g; g = grow (prev) & corral; } while (prev != g); g |= ini; return (grow(g) & r) & ~ini & ~h ; }

by **Jeroen de Bruin** » 08 Jul 2010, 13:56

Gerd Isenberg wrote:Hi Jeroen,

A perft-like isolated performance benchmark is not necessarily adaptable to a real chess program and its own memory-, cache- and ipc foot-print. With a real eval, or where one often temporary alters occupancy for x-ray stuff and pinned piece detection, etc.. For the usual piece-wise attack design, magic seems fastest in most programs, may be even with fixed shift and MB pages instead of 4K pages. My favorite with SSE* is to build attack-tables and legal move-target sets from a quad-bitboard on the fly after pre-fetching a cache-line from TT to hide its latency with register and ALU intensive direction-wise Kogge-Stone stuff, which implies to probe TT also in qs.

https://chessprogramming.wikispaces.com ... ce+Attacks

Cheers,
Gerd

Hi Gerd

Thanks for the tip. Would fixed depth search in a few positions (start - middle - endgame) suffice? Is there a set of positions that is deemed typical? Perhaps I should implement the various methods in Stockfish (I assume it relatively bugfree).

Your solution sounds interesting! Would you be so kind to send me your implementation so I can add it to my benchmark or better still if you have time could you implement it (in Stockfish) and send me the source so I can benchmark it on the various platforms? PMed you my email address.

I ran across Obstruction Difference on the wiki. Is it faster than your Hyperbola Quintessence approach on current Intel and AMD processors? Any guess on how it would do on ARM's Cortex?

by **Jeroen de Bruin** » 08 Jul 2010, 14:29

Miguel A. Ballicora wrote:
Jeroen de Bruin wrote:Hi

I'm implementing a few bitboard methods and when done will benchmark them on various different hardware platforms, so far I have the following (most courtesy of Harald Lüßen):

Code: Select all
+----------------------------+-----------------------+--------------+ | Name | Inventor or Supporter | Table Size | +----------------------------+-----------------------+--------------+ | Rotated Bitboards (naive) | Robert Hyatt | 77.2 KByte | | Rotated Bitboards (aligned)| ? | 35.0 KByte | | Rotated Bitboards (switch) | (Harald Lüßen?) | 14.9 KByte | | Rotated Indices | Alessandro Damiani | 128.3 KByte | | Exploding Bitboards | Harald Lüßen | 3.5 KByte | | Magic Multiplication | Gerd Isenberg | 9.7 KByte | | Magic Multiplication 32 bit| Gerd Isenberg | 9.7 KByte | | Sherwin Index | Michael Sherwin | 1402.4 KByte | | Pradu Minimize Magic | Pradyumna Kannan | 844.0 KByte | | Pradu Perfect Magic Hash | and | 627.4 KByte | | Pradu Big | Lasse Hansen | 2306.0 KByte | | Kogge-Stone | Steffan Westcott | 0.0 KByte | | Simple Shift | -- | 0.0 KByte | | Naive Shift | -- | 0.0 KByte | | Onno Fixed Shift Magic | Onno Garms | -- | +----------------------------+-----------------------+--------------+

I was wondering if any competitive variants are omitted or if there were any significant advancements made in any of the aforementioned methods in the last 3 odd years that I might be unaware of?

Jeroen

This is what I do in Gaviota. Certainly not the fastest, but it is very simple. I does some sort of a bucketfill procedure. I have no idea how original it is, but I have not heard that anybody do it that way. At least, I did not see a big slowdown compared to my own rotated version that I had before. I do not think you will choose them if performance is an issue though.

I hope the code is self explanatory. You will need an array uint64_t Reach[PIECE][FROM] that contains the bitboard that a PIECE attacks from the square FROM in an empty board. I generate those in some other part of the program. "occ" is the bitboard with the occupancy (bits that are "on" are the ones with a piece or pawn in them, regardless of color)
not_A and not_H are bitboards with the column A and H turned off.

Miguel
Code: Select all
static uint64_t grow (uint64_t x) { x |= (x << 1) & not_A; x |= (x >> 1) & not_H; x |= x >> 8; x |= x << 8; return x; } static uint64_t growlat (uint64_t x) { uint64_t y = x; x |= (x << 1) & not_A; x |= (x >> 1) & not_H; y |= y >> 8; y |= y << 8; return x | y; } extern uint64_t B_attacks (SQUARE from, uint64_t occ) { uint64_t g, corral, prev, r, ini; r = Reach[BISHOP][from]; corral = r & ~occ; ini = U64(1) << from; g = ini; do { prev = g; g = grow (prev) & corral; } while (prev != g); g |= ini; return (grow(g) & r) & ~ini; } extern uint64_t R_attacks (SQUARE from, uint64_t occ) { uint64_t g, corral, prev, r, ini; r = Reach[ROOK][from]; corral = r & ~occ; ini = U64(1) << from; g = ini; do { prev = g; g = grow (prev) & corral; } while (prev != g); g |= ini; return (grow(g) & r) & ~ini; } extern uint64_t Q_attacks (SQUARE from, uint64_t occ) { uint64_t g, corral, prev, r, ini, h; r = Reach[QUEEN][from]; ini = U64(1) << from; g = grow (ini) & ~ini; h = growlat (g & occ); h &= ~g; corral = r & ~occ; corral &= ~h; g &= ~occ; do { prev = g; g = grow (prev) & corral; } while (prev != g); g |= ini; return (grow(g) & r) & ~ini & ~h ; }

Thanks Miguel. Very nicely formatted and certainly self-explanatory.

by **Gerd Isenberg** » 08 Jul 2010, 16:16

Jeroen de Bruin wrote:Hi Gerd

Thanks for the tip. Would fixed depth search in a few positions (start - middle - endgame) suffice? Is there a set of positions that is deemed typical? Perhaps I should implement the various methods in Stockfish (I assume it relatively bugfree).

Your solution sounds interesting! Would you be so kind to send me your implementation so I can add it to my benchmark or better still if you have time could you implement it (in Stockfish) and send me the source so I can benchmark it on the various platforms? PMed you my email address.

I ran across Obstruction Difference on the wiki. Is it faster than your Hyperbola Quintessence approach on current Intel and AMD processors? Any guess on how it would do on ARM's Cortex?

Hi Jeroen,

piece-wise and direction-wise approaches or paradigms are not "compatible" and mine uses very different data structures also acquiring data used later elsewhere, and a huge finite state plausible movegen. I don't like to share that code, except some "public domain" snippets already part of cpw, based on SSE2 C++ wrapper - even if it is a perft framework only. It also color-flips the side each move, so that it is a "White only" approach.

OD relies on fast bsr or lzc and it seems (a lot) slower than HQ on AMD-boxes (might be different for core2 and Nehalem, though). In my experience and what I've heard from a few people, HQ in combination with 1/2KByte rank-lookups is very close to magics on K8, K10 with 1/3 cycle bswap throughput. No idea on ARM's Cortex.

You may try all none rotated approaches relying on same data-structures as mentioned in Hiding the Implementation.

Gerd

Winboard Forum

Bitboards

Bitboards

Re: Bitboards

Re: Bitboards

Re: Bitboards

Re: Bitboards

Re: Bitboards

Who is online