It Depends, episode 1

Tuesday, January 26, 2021

The answer to almost every software engineering question is "It Depends". In this post we explore options for finding an item in a small array. Grab a cup of tea and down the rabbit hole we go.

I've been working on art, a golang implementation of an adaptive radix tree. In this tree, nodes come in a few different sizes, and a node moves between sizes to optimize memory usage. One of these node sizes is node16, a node that can store up to 16 child items. Each item is identified by a single byte key. The common read path is: given a key, return the matching child item, or report that it's not there.
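
For context, the adaptive radix tree paper defines four node sizes, and a node is grown or shrunk to a neighboring size as children are added and removed. A rough sketch of the shapes involved (illustrative only; these are not art's actual type definitions):

    // node4 and node16 keep parallel key/child arrays that are searched
    // by key. node48 adds a 256-entry index keyed by the byte itself,
    // and node256 is just an array indexed directly by the key byte.
    type node4 struct {
        key      [4]byte
        children [4]interface{}
    }
    type node16 struct {
        key      [16]byte
        children [16]interface{}
    }
    type node48 struct {
        index    [256]byte // key byte -> slot in children
        children [48]interface{}
    }
    type node256 struct {
        children [256]interface{}
    }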

My first implementation stored the keys & values in no particular order and scanned them for lookups. At that point in the project I was more interested in something that worked than something optimal. Here's a simplified version of the code. Note that put makes many simplifying assumptions that aren't relevant to the discussion at hand. You can get all the code from this article from the loopVsBinarySearch repo.

    type nodeLoop struct {
        key   [16]byte
        val   [16]interface{}
        count int
    }
    func (n *nodeLoop) put(k byte, v interface{}) {
        idx := n.count
        n.key[idx] = k
        n.val[idx] = v
        n.count++
    }
    func (n *nodeLoop) get(k byte) interface{} {
        for i := 0; i < n.count; i++ {
            if n.key[i] == k {
                return n.val[i]
            }
        }
        return nil
    }

Should I use a binary search instead? It depends. Binary search does better for larger arrays, but is it still better for our little 16 byte array? The loop is O(n) while the binary search is O(log n), but in my experience Big O estimates of algorithmic time can be very misleading for small values of n: for n=16 the binary search makes at most 4 probes where the loop averages 8, but each probe does more work. Let's find out. Go has a binary search function in the sort package, which makes this straightforward.

    type nodeSorted struct {
        key   [16]byte
        val   [16]interface{}
        count int
    }
    func (n *nodeSorted) put(k byte, v interface{}) {
        idx := sort.Search(n.count, func(i int) bool {
            return n.key[i] >= k
        })
        copy(n.key[idx+1:], n.key[idx:int(n.count)])
        copy(n.val[idx+1:], n.val[idx:int(n.count)])
        n.key[idx] = k
        n.val[idx] = v
        n.count++
    }
    func (n *nodeSorted) get(k byte) interface{} {
        idx := sort.Search(n.count, func(i int) bool {
            return n.key[i] >= k
        })
        if idx < int(n.count) && n.key[idx] == k {
            return n.val[idx]
        }
        return nil
    }

And let's write a benchmark to compare the two.

    var rnd = rand.New(rand.NewSource(42))
    var keys = make([]byte, 16)
    func init() {
        for i := 0; i < len(keys); i++ {
            keys[i] = byte(i)
        }
        rnd.Shuffle(len(keys), func(i, j int) {
            keys[i], keys[j] = keys[j], keys[i]
        })
    }
    func Benchmark_Loop(b *testing.B) {
        n := new(nodeLoop)
        for _, k := range keys {
            n.put(k, int(k))
        }
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            for _, k := range keys {
                v := n.get(k)
                if v != int(k) {
                    b.Errorf("Got unexpected value of %d for key %d", v, k)
                }
            }
        }
    }
    func Benchmark_BinarySearch(b *testing.B) {
        n := new(nodeSorted)
        for _, k := range keys {
            n.put(k, int(k))
        }
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            for _, k := range keys {
                v := n.get(k)
                if v != int(k) {
                    b.Errorf("Got unexpected value of %d for key %d", v, k)
                }
            }
        }
    }

Let's run these and see what we get. All these tests were done using go 1.15.7 on OSX 10.15.7, running on an iMac with an Intel i7-6700K CPU @ 4.00GHz. You may get different results with a different combination of OS, go version & processor. It depends.

$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/superfell/loopVsBinarySearch
Benchmark_Loop-8                      11602429           104 ns/op
Benchmark_BinarySearch-8               4848440           247 ns/op
PASS

The loop version doesn't look so shabby anymore; it runs in ~40% of the time of the binary search. (Side note for VSCode users: the Go plugin defaults to collecting code coverage while running benchmarks, which can change the relative results between benchmarks quite a bit. Either turn this off, or run them from the command line.) What's going on? Fortunately Go has some great tools integrated into the toolchain, so let's grab a CPU profile and look at that. We'll use the `-benchtime=10000000x` option to have each benchmark run the same number of times rather than for the same duration. This will make comparing times between the different implementations easier.

$ go test -bench . -benchtime=10000000x -cpuprofile=cpu.prof
goos: darwin
goarch: amd64
pkg: github.com/superfell/loopVsBinarySearch
Benchmark_Loop-8               10000000           104 ns/op
Benchmark_BinarySearch-8       10000000           253 ns/op
PASS

We can use pprof to examine the generated CPU profile and produce timing-annotated versions of the source code. The pprof command list get shows the timings for all functions named get.

$ go tool pprof cpu.prof
Type: cpu
Time: Jan 26, 2021 at 11:05am (PST)
Duration: 3.76s, Total samples = 3.13s (83.32%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

Type list get at the (pprof) prompt.

(pprof) list get
Total: 3.13s
ROUTINE ============= github.com/superfell/loopVsBinarySearch.(*nodeLoop).get in /Users/simon/Github/loopVsBinarySearch/n.go
     480ms      560ms (flat, cum) 17.89% of Total
         .          .     15:    n.key[idx] = k
         .          .     16:    n.val[idx] = v
         .          .     17:    n.count++
         .          .     18:}
         .          .     19:
      20ms       40ms     20:func (n *nodeLoop) get(k byte) interface{} {
     110ms      170ms     21:    for i := 0; i < n.count; i++ {
     260ms      260ms     22:        if n.key[i] == k {
      90ms       90ms     23:            return n.val[i]
         .          .     24:        }
         .          .     25:    }
         .          .     26:    return nil
         .          .     27:}
         .          .     28:
ROUTINE ============= github.com/superfell/loopVsBinarySearch.(*nodeSorted).get in /Users/simon/Github/loopVsBinarySearch/n.go
     290ms      1.93s (flat, cum) 61.66% of Total
         .          .    157:    n.key[idx] = k
         .          .    158:    n.val[idx] = v
         .          .    159:    n.count++
         .          .    160:}
         .          .    161:
      50ms       50ms    162:func (n *nodeSorted) get(k byte) interface{} {
     110ms      1.75s    163:    idx := sort.Search(n.count, func(i int) bool {
         .          .    164:        return n.key[i] >= k
         .          .    165:    })
      70ms       70ms    166:    if idx < int(n.count) && n.key[idx] == k {
      60ms       60ms    167:        return n.val[idx]
         .          .    168:    }
         .          .    169:    return nil
         .          .    170:}
         .          .    171:
ROUTINE ============= github.com/superfell/loopVsBinarySearch.(*nodeSorted).get.func1 in /Users/simon/Github/loopVsBinarySearch/n.go
     680ms      680ms (flat, cum) 21.73% of Total
         .          .    158:    n.val[idx] = v
         .          .    159:    n.count++
         .          .    160:}
         .          .    161:
         .          .    162:func (n *nodeSorted) get(k byte) interface{} {
     260ms      260ms    163:    idx := sort.Search(n.count, func(i int) bool {
     420ms      420ms    164:        return n.key[i] >= k
         .          .    165:    })
         .          .    166:    if idx < int(n.count) && n.key[idx] == k {
         .          .    167:        return n.val[idx]
         .          .    168:    }
         .          .    169:    return nil

Hmmmm, that call to Search is taking a lot of time; let's see what that's doing.

(pprof) list Search

ROUTINE ======================== sort.Search in /usr/local/go/src/sort/search.go
950ms      1.64s (flat, cum) 52.40% of Total
    .          .     54://            return s != "" && s[0] == 'y'
    .          .     55://        })
    .          .     56://        fmt.Printf("Your number is %d.\n", answer)
    .          .     57://    }
    .          .     58://
 50ms       50ms     59:func Search(n int, f func(int) bool) int {
    .          .     60:    // Define f(-1) == false and f(n) == true.
    .          .     61:    // Invariant: f(i-1) == false, f(j) == true.
    .          .     62:    i, j := 0, n
290ms      290ms     63:    for i < j {
110ms      110ms     64:        h := int(uint(i+j) >> 1) // avoid overflow when computing h
    .          .     65:        // i ≤ h < j
430ms      1.12s     66:        if !f(h) {
 30ms       30ms     67:            i = h + 1 // preserves f(i-1) == false
    .          .     68:        } else {
    .          .     69:            j = h // preserves f(j) == true
    .          .     70:        }
    .          .     71:    }
    .          .     72:    // i == j, f(i-1) == false, and f(j) (= f(i)) == true  =>  answer is i.
 40ms       40ms     73:    return i
    .          .     74:}
    .          .     75:

The f(h) call (line 66) is consuming most of the time here. This is the callback function we passed in. At a guess, the compiler can't inline the callback, and because there's so little else going on, the function call overhead comes to dominate the time taken. As an experiment we can be our own optimizer and manually inline the sort.Search call. This is basically a cut and paste job from sort.Search, replacing !f(h) with n.key[h] < k.

    func (n *nodeSorted) getInlinedBinSearch(k byte) interface{} {
        // impl of sort.Search manually inlined here
        i, j := 0, int(n.count)
        for i < j {
            h := int(uint(i+j) >> 1) // avoid overflow when computing h
            // i ≤ h < j
            if n.key[h] < k {
                i = h + 1 // preserves f(i-1) == false
            } else {
                j = h // preserves f(j) == true
            }
        }
        // i == j, f(i-1) == false, and f(j) (= f(i)) == true  =>  answer is i.
        if i < int(n.count) && n.key[i] == k {
            return n.val[i]
        }
        return nil
    }
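
(Side note: you can ask the compiler to print its inlining decisions, which is a good way to confirm or refute guesses like the one above. I believe the inliner of this vintage won't inline functions containing loops, so sort.Search itself should show up as not inlinable.)

$ go build -gcflags=-m .
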
$ go test -bench . 
Benchmark_Loop-8                      11763114           101 ns/op
Benchmark_BinarySearch-8               4796678           246 ns/op
Benchmark_BinarySearchInlined-8       10631300           114 ns/op
PASS

That looks better, but surprisingly it's still slower than the initial loop version. Let's grab another CPU profile and look at the getInlinedBinSearch function.

$ go test -bench . -benchtime=10000000x -cpuprofile=cpu.prof
...
$ go tool pprof cpu.prof
...
(pprof) list getInlinedBinSearch
ROUTINE === github.com/superfell/loopVsBinarySearch.(*nodeSorted).getInlinedBinSearch in /Users/simon/Github/loopVsBinarySearch/n.go
     660ms      680ms (flat, cum) 16.67% of Total
         .          .    178:        return n.val[idx]
         .          .    179:    }
         .          .    180:    return nil
         .          .    181:}
         .          .    182:
      40ms       40ms    183:func (n *nodeSorted) getInlinedBinSearch(k byte) interface{} {
         .          .    184:    // impl of sort.Search manually inlined here
      70ms       70ms    185:    i, j := 0, int(n.count)
     180ms      200ms    186:    for i < j {
      40ms       40ms    187:        h := int(uint(i+j) >> 1) // avoid overflow when computing h
         .          .    188:        // i ≤ h < j
      70ms       70ms    189:        if n.key[h] < k {
     100ms      100ms    190:            i = h + 1 // preserves f(i-1) == false
         .          .    191:        } else {
         .          .    192:            j = h // preserves f(j) == true
         .          .    193:        }
         .          .    194:    }
         .          .    195:    // i == j, f(i-1) == false, and f(j) (= f(i)) == true  =>  answer is i.
      20ms       20ms    196:    if i < int(n.count) && n.key[i] == k {
     140ms      140ms    197:        return n.val[i]
         .          .    198:    }
         .          .    199:    return nil
         .          .    200:}
         .          .    201:

Comparing back to the profile of the original loop, this version spends less time comparing the key to the array entries. That makes sense; that's the whole point of the binary search. The rest of the loop overhead negates that, though. Possibly at this point we're at the mercy of the CPU's branch predictor. If you know more about the relative overhead of the two loops, let me know.

Is there anything else we can do to make this faster? In both cases the n.key[index] access comes with a bounds check. I vaguely remember reading somewhere that the compiler can optimize away some of the bounds checks if it sees a check against a higher bound first. Makes sense: if n[10] is not out of bounds of the array, then you already know that a subsequent n[9] access is also not out of bounds. Let's try reversing the order of our loop and see if that helps.

    func (n *nodeLoop) getReverse(k byte) interface{} {
        for i := n.count - 1; i >= 0; i-- {
            if n.key[i] == k {
                return n.val[i]
            }
        }
        return nil
    }
$ go test -bench . -benchtime=10000000x -cpuprofile=cpu.prof 
Benchmark_Loop-8                      10000000           101 ns/op
Benchmark_LoopRev-8                   10000000           108 ns/op
Benchmark_BinarySearch-8              10000000           248 ns/op
Benchmark_BinarySearchInlined-8       10000000           114 ns/op

Slightly worse, but were those bounds checks skipped? I'd guess not, given the runtime, but how can we tell? As far as I know, at this point we have to look at the code generated by the compiler. We can get the compiler to give us an assembly dump of the generated code.

If you get this far, it's time to make another cup of tea.

$ go tool compile -S n.go > n5.asm
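
(As an aside, and assuming your Go toolchain supports it, the compiler can also report bounds checks directly: building with the check_bce debug flag prints a line for every bounds check left in the generated code. We'll read the assembly anyway, since we want to see the whole loop.)

$ go build -gcflags="-d=ssa/check_bce/debug=1" .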

Look through the file and find the section for (*nodeLoop).getReverse. Now, I haven't done assembly since writing 68000 assembly for an Amiga A500, so after copious amounts of web searching, I've annotated the assembly with what (I think) is going on.

"".(*nodeLoop).getReverse STEXT nosplit size=125 args=0x20 locals=0x18
    0x0000 00000 (n.go:39)    TEXT    "".(*nodeLoop).getReverse(SB), NOSPLIT|ABIInternal, $24-32
    0x0000 00000 (n.go:39)    SUBQ    $24, SP
    0x0004 00004 (n.go:39)    MOVQ    BP, 16(SP)
    0x0009 00009 (n.go:39)    LEAQ    16(SP), BP
    0x000e 00014 (n.go:39)    FUNCDATA    $0, gclocals·4032f753396f2012ad1784f398b170f4(SB)
    0x000e 00014 (n.go:39)    FUNCDATA    $1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
    0x000e 00014 (n.go:40)    MOVQ    "".n+32(SP), DX           ;; register DX points to the start of the nodeLoop struct
                                                                ;; in memory (the n variable in the code)
    0x0013 00019 (n.go:40)    MOVQ    272(DX), BX               ;; copy the value from DX + 272 bytes into BX. 16 for n.keys, 
                                                                ;; 16 x 16 for n.vals get you 272, so BX has the value of the
                                                                ;; count field
    0x001a 00026 (n.go:40)    DECQ    BX                        ;; bx = bx-1. register bx is the loop count variable i
    0x001d 00029 (n.go:40)    MOVBLZX    "".k+40(SP), SI        ;; set register SI to the value of the parameter k
    0x0022 00034 (n.go:40)    JMP    39
    0x0024 00036 (n.go:40)    DECQ    BX                        ;; i-- 
    0x0027 00039 (n.go:40)    TESTQ   BX, BX                    ;; bitwise AND of BX and BX, sets the flags used to check i against 0
    0x002a 00042 (n.go:40)    JLT    93                         ;; if i < 0 goto 93 which will do the stack dance for the
                                                                ;; return nil statement.
    0x002c 00044 (n.go:41)    CMPQ    BX, $16                   ;; cmp i to the literal value 16
    0x0030 00048 (n.go:41)    JCC    111                        ;; if i >= 16 goto 111 which will generate an array index out
                                                                ;; of bounds panic
    0x0032 00050 (n.go:41)    MOVBLZX    (DX)(BX*1), DI         ;; set register DI to n.keys[i]
    0x0036 00054 (n.go:41)    CMPB    DIB, SIB
    0x0039 00057 (n.go:41)    JNE    36                         ;; if n.keys[i] != k goto back to 36, which decrements the loop
                                                                ;; counter and repeats the above tests
    0x003b 00059 (n.go:42)    SHLQ    $4, BX                    ;; bit shift BX left 4, i = i * 16
                                                                ;; remember that interface{} takes up 16 bytes
    0x003f 00063 (n.go:42)    MOVQ    16(DX)(BX*1), AX          ;; 
    0x0044 00068 (n.go:42)    MOVQ    24(DX)(BX*1), CX          ;; ax & cx contain n.vals[i] 
    0x0049 00073 (n.go:42)    MOVQ    AX, "".~r1+48(SP)         ;; shuffle them onto the stack for the return value
    0x004e 00078 (n.go:42)    MOVQ    CX, "".~r1+56(SP)
    0x0053 00083 (n.go:42)    MOVQ    16(SP), BP
    0x0058 00088 (n.go:42)    ADDQ    $24, SP
    0x005c 00092 (n.go:42)    RET                               ;; we're done
    0x005d 00093 (n.go:45)    XORPS    X0, X0                   ;; here to 110 is for the return nil
    0x0060 00096 (n.go:45)    MOVUPS    X0, "".~r1+48(SP)
    0x0065 00101 (n.go:45)    MOVQ    16(SP), BP
    0x006a 00106 (n.go:45)    ADDQ    $24, SP
    0x006e 00110 (n.go:45)    RET                               ;; we're done
    0x006f 00111 (n.go:41)    MOVQ    BX, AX
    0x0072 00114 (n.go:41)    MOVL    $16, CX
    0x0077 00119 (n.go:41)    PCDATA    $1, $1
    0x0077 00119 (n.go:41)    CALL    runtime.panicIndex(SB)
    0x007c 00124 (n.go:41)    XCHGL    AX, AX

There's some pointer math going on to access the fields and array indexes in the nodeLoop struct. The struct is laid out in memory following the order of the fields in its definition.

    Byte Offset   Value
    0-15          n.key[0] through n.key[15]
    16-31         n.val[0]
    32-47         n.val[1]
    ...           ...
    256-271       n.val[15]
    272-279       n.count
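
If you'd rather verify those offsets than count bytes by hand, here's a minimal sketch using the unsafe package (values shown are for 64-bit platforms, where an interface{} value is two words, i.e. 16 bytes):

    package main

    import (
        "fmt"
        "unsafe"
    )

    type nodeLoop struct {
        key   [16]byte
        val   [16]interface{}
        count int
    }

    func main() {
        var n nodeLoop
        fmt.Println(unsafe.Offsetof(n.key))   // 0
        fmt.Println(unsafe.Offsetof(n.val))   // 16
        fmt.Println(unsafe.Offsetof(n.count)) // 272
        fmt.Println(unsafe.Sizeof(n))         // 280
    }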

Code offsets 36 through 57 are the loop, and you can see that it checks BX, which is the variable i, against 16 every time around. So the compiler has not optimized away any of the bounds checks. Let's see if we can do a better job of convincing it. We'll access the largest array index used by the loop before starting the loop.

    func (n *nodeLoop) getReverse2(k byte) interface{} {
        _ = n.key[n.count-1]
        for i := n.count - 1; i >= 0; i-- {
            if n.key[i] == k {
                return n.val[i]
            }
        }
        return nil
    }
$ go test -bench . -benchtime=10000000x -cpuprofile=cpu.prof 
Benchmark_Loop-8                      10000000           102 ns/op
Benchmark_LoopRev-8                   10000000           108 ns/op
Benchmark_LoopRevOneBoundsCheck-8     10000000            82.4 ns/op
Benchmark_BinarySearch-8              10000000           250 ns/op
Benchmark_BinarySearchInlined-8       10000000           116 ns/op

Alright, looks like we might have done it this time; that's a ~20% decrease in time. Let's double check the assembly to confirm it's doing what we think.

"".(*nodeLoop).getReverse2 STEXT nosplit size=124 args=0x20 locals=0x18
    0x0000 00000 (n.go:48)  TEXT    "".(*nodeLoop).getReverse2(SB), NOSPLIT|ABIInternal, $24-32
    0x0000 00000 (n.go:48)  SUBQ    $24, SP
    0x0004 00004 (n.go:48)  MOVQ    BP, 16(SP)
    0x0009 00009 (n.go:48)  LEAQ    16(SP), BP
    0x000e 00014 (n.go:48)  FUNCDATA    $0, gclocals·4032f753396f2012ad1784f398b170f4(SB)
    0x000e 00014 (n.go:48)  FUNCDATA    $1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
    0x000e 00014 (n.go:49)  MOVQ    "".n+32(SP), DX     ;; register DX points to the start of the nodeLoop struct
    0x0013 00019 (n.go:49)  MOVQ    272(DX), BX         ;; BX = n.count
    0x001a 00026 (n.go:49)  LEAQ    -1(BX), AX          ;; AX = BX-1, i.e. n.count-1
    0x001e 00030 (n.go:49)  NOP
    0x0020 00032 (n.go:49)  CMPQ    AX, $16
    0x0024 00036 (n.go:49)  JCC    113                  ;; if AX >= 16 goto 113, which will generate a panic
    0x0026 00038 (n.go:50)  MOVBLZX    "".k+40(SP), CX  ;; CX = param k
    0x002b 00043 (n.go:50)  JMP    48
    0x002d 00045 (n.go:50)  DECQ    AX                  ;; AX-- (AX is our loop variable i)
    0x0030 00048 (n.go:50)  TESTQ    AX, AX         
    0x0033 00051 (n.go:50)  JLT    95                   ;; if i < 0 goto 95, which is return nil
    0x0035 00053 (n.go:51)  MOVBLZX    (DX)(AX*1), BX   ;; BX = n.keys[i]
    0x0039 00057 (n.go:51)  CMPB    BL, CL              
    0x003b 00059 (n.go:51)  JNE    45                   ;; if n.keys[i] != k goto 45 
    0x003d 00061 (n.go:52)  SHLQ    $4, AX
    0x0041 00065 (n.go:52)  MOVQ    16(DX)(AX*1), CX    ;; here through 94 are return n.vals[i]
    0x0046 00070 (n.go:52)  MOVQ    24(DX)(AX*1), AX
    0x004b 00075 (n.go:52)  MOVQ    CX, "".~r1+48(SP)
    0x0050 00080 (n.go:52)  MOVQ    AX, "".~r1+56(SP)
    0x0055 00085 (n.go:52)  MOVQ    16(SP), BP
    0x005a 00090 (n.go:52)  ADDQ    $24, SP
    0x005e 00094 (n.go:52)  RET
    0x005f 00095 (n.go:55)  XORPS    X0, X0
    0x0062 00098 (n.go:55)  MOVUPS    X0, "".~r1+48(SP)
    0x0067 00103 (n.go:55)  MOVQ    16(SP), BP
    0x006c 00108 (n.go:55)  ADDQ    $24, SP
    0x0070 00112 (n.go:55)  RET
    0x0071 00113 (n.go:49)  MOVL    $16, CX                 ; array out of bounds panic
    0x0076 00118 (n.go:49)  PCDATA    $1, $1
    0x0076 00118 (n.go:49)  CALL    runtime.panicIndex(SB)
    0x007b 00123 (n.go:49)  XCHGL    AX, AX

Now the loop runs from 45 to 59, and you can see that the bounds check (at 32/36) only occurs once.

We can add the same additional array check to the other methods and see if it helps those.
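
For example, here's the hint applied to the forward loop (a sketch; the method name here is mine and may not match the repo, and like getReverse2 it assumes count is at least 1):

    func (n *nodeLoop) getOneBoundsCheck(k byte) interface{} {
        // one access at the highest index used by the loop; this proves
        // to the compiler that indexes 0..count-1 are all in bounds
        _ = n.key[n.count-1]
        for i := 0; i < n.count; i++ {
            if n.key[i] == k {
                return n.val[i]
            }
        }
        return nil
    }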

    Benchmark_Loop-8                                	10000000	       102 ns/op
    Benchmark_LoopOneBoundsCheck-8                  	10000000	        88.3 ns/op
    Benchmark_LoopRev-8                             	10000000	       113 ns/op
    Benchmark_LoopRevOneBoundsCheck-8               	10000000	        82.1 ns/op
    Benchmark_BinarySearch-8                        	10000000	       250 ns/op
    Benchmark_BinarySearchOneBoundsCheck-8          	10000000	       253 ns/op
    Benchmark_BinarySearchInlined-8                 	10000000	       114 ns/op
    Benchmark_BinarySearchInlinedOneBoundsCheck-8   	10000000	       106 ns/op

That worked for the forward loop as well, but didn't work for the function-based binary search. From the numbers it's not clear what's going on with the inlined binary search versions: looking at the assembly, they both seem to have the same main loop, and both still do a bounds check each time around. Why the last one is quicker I'm not sure.

For the loops, it's interesting that with the bounds checks reduced to one, the reverse loop is faster, but with a bounds check every iteration, the forward loop is faster. Comparing the assembly again, the reverse loop with one bounds check uses TESTQ to compare against 0, while the forward one has to use CMPQ to test against n.count. Perhaps TESTQ is faster than CMPQ? Perhaps the branch predictor can predict the terminate-on-zero case of the reverse loop better than the terminate-on-n.count case of the forward loop. If you have a more concrete explanation for the difference between these two, I'd love to hear it.

Come back for episode 2 where we look at some more esoteric approaches to this array search problem.