Population count (popcount) algorithm

Brian Kernighan’s Algorithm

1
2
3
4
5
6
7
int popcount(unsigned x)
{
    int c = 0;
    for (; x != 0; x &= x - 1)
        c++;
    return c;
}

iter 1:

x = 0b100011
x - 1 = 0b100010
x & (x-1) = 0b100010

iter 2:

x = 0b100010
x - 1 = 0b100001
x & (x-1) = 0b100000

iter 3:

x = 0b100000
x - 1 = 0b011111
x & (x-1) = 0b000000

因為 x - 1 會把 x 的最靠近低位的那個 1 變成 0，其餘更下方的位數為 1（x 的其餘更下方位數為 0），兩者 AND 起來就會把最靠近低位的那個 1 變成 0。

總共 iteration 數量為 1 的數量。

SWAR (SIMD within a register) Algorithm

塞在一個 register 用類似 SIMD 做 divide and conquer，這是 gcc 在沒有 popcnt instruction 可以用的時候會採用的軟體實現

1
2
3
4
5
6
7
8
int popcount2(int n) {
    n = (n & 0x55555555) + ((n >> 1) & 0x55555555); // (1)
    n = (n & 0x33333333) + ((n >> 2) & 0x33333333); // (2)
    n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F); // (3)
    n = (n & 0x00FF00FF) + ((n >> 8) & 0x00FF00FF); // (4)
    n = (n & 0x0000FFFF) + ((n >> 16) & 0x0000FFFF); // (5)
    return n;
}

0x5555555 就是 8 個重複的 0b0101，也就是說 (1) 做完就是 2 個相鄰 bit 相加的結果。(2) 做完就是 4 個相鄰 bit 相加，依此類推，總共 iteration 數量為 log(bit 的數量)。除了快以外，另外一個好處是可以不用用到 branch instruction。

(1) 還可以進一步優化成：

1
n = n - (n >> 1) & 0x55555555;

分析：只考慮 n 只有兩個 bits 的情況的話 ($n = 2 * x + y, x, y \in \{0,1\}$)：

1
2
3
4
5
(n & 0b01) + ((n >> 1) & 0b01)
= y + x
= (2 * x + y) - x
= n - x
= n - (n >> 1) & 0b01

剩下的也是同理只是左移了幾個 bits，依此類推。

(3) 還可以進一步優化成：

1
n = (n + (n >> 4)) & 0x0F0F0F0F;

這是因為兩個 4 bits 的 popcount 相加一定小於等於 8，所以其實可以加完再做 AND。

(4), (5) 也可以用相同方法簡化，這邊只需要 5 個 bits 就可以儲存最終的結果了，所以其實可以直接以 byte 為單位將四個區塊相加：

1
n = (n + (n << 8) + (n << 16) + (n <<24)) >> 24;

也就是：

1
n = (n * 0x01010101) >> 24;

最終的 code：

1
2
3
4
5
6
int popcount2(int n) {
    n = n - (n >>1) & 0x55555555;
    n = (n & 0x33333333) + (n >> 2) & 0x33333333;
    n = (n + n >> 4) & 0x0F0F0F0F;
    return (n * 0x01010101) >> 24;
}

Interesting Algorithm (?)

1
2
3
4
5
6
7
8
unsigned popcnt(unsigned x) {
	unsigned diff = x;
	while (x) {
		x >>= 1;
		diff -= x;
	}
	return diff;
}

其實就是：

1
2
3
4
5
6
7
8
9
unsigned popcnt(unsigned x) {
	unsigned sum = 0;
	unsigned remaining = x;
	while (remaining) {
		remaining >>= 1;
		sum += remaining;
	}
	return x - sum;
}

大家都知道 x/2 + x/4 + x/8 + ... == x，但是這是如果 x 為 real number 才會成立，如果 x 是 integer 的話，會因為丟掉 bit shift 的結果造成誤差，如下例子所示。

x = 0b10111
      real     vs int
x/2:  11.5     vs 11 (diff1: 0b1 / 2)
x/4:  5.75     vs 5  (diff2: 0b11 / 4)
x/8:  2.875    vs 2  (diff3: 0b111 / 8)
x/16: 1.4375   vs 1  (diff4: 0b0111 / 16)
x/32: 0.71875  vs 0  (diff5: 0b10111 / 32)
x/64: 0.359375 vs 0  (diff6: 0b10111 / 64)

這個造成誤差的位數會被不斷重複被計算，也就是說只要當 x 的某個位元是 1 的話，計算無限項的 sum 的時候就會造成 1/2 + 1/4 + 1/8 + ... == 1 的誤差。因此我們只需計算使用 x 做除法到無限項的這個誤差為多少，即可知道 x 有多少個 1，實際上 iteration 的次數就是最高位 1 的 bit 位數。

這個算法雖然不是最快的算法，不過計算方式滿有趣的。

References

Algorithms behind Popcount | Nimrod’s Coding Lab

Counting set bits in an interesting way | Hacker News

Counting set bits in an interesting way