Go when pointers hide the benefits of CPU caching
<p>We often use the <code>(*Struct)</code> syntax for receivers, as in this example, even though in this case, we don’t need a pointer since we aren’t planning to modify the <code>User</code> structure. So why do we do it? Because we’ve been taught that working with pointers is much faster than working with values, which is true in most cases. But not always.</p>
<p>You’ve probably heard about processor cache lines. This is a small-sized memory that is orders (even tens of times) faster than main memory. Any data you work with is cached by the processor to optimize operations.</p>
<p>Here are the cache lines (different levels of processor cache) on my Mac:</p>
<pre>
sudo sysctl -a | grep cache
...
hw.l1icachesize: 32768
hw.l1dcachesize: 32768
hw.l2cachesize: 262144
hw.l3cachesize: 12582912</pre>
<p>These are byte values. Now let’s imagine that our structure fits into L1 cache. Does it make sense to cache the pointer? If the cache contains the pointer, caching doesn’t provide much benefit when accessing a field of the structure that the pointer points to. After fetching the pointer that points to a location in main memory, you’ll still need to access the main memory to retrieve the data. Consequently, <em>it’s much more advantageous to pass small structures by value</em>.</p>
<p>For instance, consider having the following structure:</p>
<pre>
type Small struct {
str string // 16 bytes
i int // 8 bytes
b bool // 1 byte + 7 padding bytes
}</pre>
<p>This structure occupies only 32 bytes in memory. And it will fit into L1! Consequently, it’s advantageous to access it without using a pointer. Let’s verify this.</p>
<pre>
func BenchmarkValue_L1(b *testing.B) {
s := Small{"str", 1, true} // 32 bytes, fits L1 cache
println(unsafe.Sizeof(s), unsafe.Offsetof(s.str), unsafe.Offsetof(s.i), unsafe.Offsetof(s.b))
// prepare data
items := make([]Small, 0, 100)
for j := 0; j < 100; j++ {
items = append(items, s)
}
// bench
for i := 0; i < b.N; i++ {
processSmallByVal(items)
}
}
func BenchmarkPointer_L1(b *testing.B) {
s := &Small{"str", 1, true} // 32 bytes, fits L1 cache
// prepare data
items := make([]*Small, 0, 100)
for j := 0; j < 100; j++ {
items = append(items, s)
}
// bench
for i := 0; i < b.N; i++ {
processSmallByPointer(items)
}
}
/* data manipulation */
func processSmallByVal(items []Small) {
for _, item := range items { // get item value, item var reused to store values from slice
for i := 0; i < 10000; i++ {
item.i = i // access item's field from cache, mutate field, store in cache line directly
item.str = "123" // same
someint, somestr, somebool := item.i, item.str, item.b // read item's fields from cache
_, _, _ = someint, somestr, somebool // stub
}
}
}
func processSmallByPointer(items []*Small) {
for _, item := range items { // get pointer, store in 'item'
for i := 0; i < 10000; i++ {
item.i = i // go to the main mem
item.str = "123" // go to the main mem again
someint, somestr, somebool := item.i, item.str, item.b // go to the main mem again
_, _, _ = someint, somestr, somebool // stub
}
}
}</pre>
<p>Results:</p>
<pre>
go test -bench=. -benchtime=10s
goos: darwin
goarch: amd64
pkg: module05/fibonachi
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkValue_L1-12 44923 257923 ns/op
BenchmarkPointer_L1-12 15602 775480 ns/op</pre>
<p>Directly accessing the value without using a pointer <strong>is three times faster</strong>.</p>
<p>If we increase the size of our structure to be larger than 32kb but smaller than 262,144kb, then we will be working with values from L2.</p>
<pre>
type Big struct {
arr [5000]int // 40kb
}</pre>
<p>This structure is too large for L1 on my machine but fits into L2. Let’s write a similar benchmark:</p>
<pre>
func BenchmarkValue_L2(b *testing.B) {
a := [5000]int{} // 40kb, fits L2 cache
for i := 0; i < 5000; i++ {
a[i] = i // fill array with values
}
s := Big{arr: a}
// bench
for i := 0; i < b.N; i++ {
processBigByVal(s)
}
}
func BenchmarkPointer_L2(b *testing.B) {
a := [5000]int{} // 40kb, fits L2 cache
for i := 0; i < 5000; i++ {
a[i] = i
}
s := &Big{arr: a}
// bench
for i := 0; i < b.N; i++ {
processBigByPointer(s)
}
}
/* data manipulation */
func processBigByVal(item Big) {
for i := 0; i < 5000; i++ {
item.arr[i] = i
somearr := item.arr
_ = somearr
}
}
func processBigByPointer(item *Big) {
for i := 0; i < 5000; i++ {
item.arr[i] = i
somearr := item.arr
_ = somearr
}
}</pre>
<p>Results:</p>
<pre>
go test -bench=. -benchtime=1s
BenchmarkValue_L2-12 502042 2396 ns/op
BenchmarkPointer_L2-12 468314 2855 ns/op</pre>
<p>The results are nearly identical. As you may have noticed, this time I used <code>-benchtime=1s</code> instead of 10s, as before, because the longer the benchmark time, the more iterations with the function call <code>processBigByVal(s)</code> will occur. Now that we have a relatively large structure, copying it when passing to the function starts to overshadow all the benefits of accessing the value from the cache (what happens inside the function). Copying a pointer is easier than copying a structure because it's smaller. The larger the size of the structures we use, the less efficient the cache becomes for our values, and the more expensive it becomes to pass such values to functions and methods.</p>
<p><a href="https://blog.stackademic.com/go-when-pointers-reduce-the-impact-of-cpu-caching-1e290fc30077">Website</a></p>