Generally, each CPU core has its own L1 and L2 caches, while the L3 cache is shared across cores
Data is cached in units of cache lines, typically 64 bytes each
This locality-exploiting cache design is the main reason sequential array access is faster than linked-list traversal
False Sharing Problem
If two cores each hold a copy of the same cache line but operate on different variables within it, false sharing occurs. The processor tracks each cache line's modification state with flags and must propagate modified lines between cores via the MESI coherence protocol. On the surface, the two variables are modified independently in different L1 caches; in fact they occupy the same cache line, so every write invalidates the other core's copy and forces coherence traffic, resulting in significant synchronization overhead.
Conditions under which this coherence communication occurs:
A thread's work migrates from one core to another (this relates to the bound M in Go's GPM model, and is also why it is better for the originally bound M to keep running G: it can reuse that processor's local cache!)
Two different cores operate on the same cache line
Solving False Sharing Problems
Example: when multiple threads need to operate on adjacent data items
Such as: concurrent modification of an array, or concurrent reads and writes to adjacent fields of a struct, etc.
One feasible approach is padding. In Go, you can insert blank fields of a fixed-size byte array (e.g. `_ [56]byte`; a slice type like `[]byte` would only embed a slice header, not in-line bytes) between the affected fields, ensuring they land in different cache lines. This breaks the sharing and avoids the communication overhead of coherence traffic.