Making Sense of the Intel Haswell Transactional Synchronization eXtensionsby Johan De Gelas on September 20, 2012 12:15 AM EST
Multi-core and Locking
Intel has released additional information on their Transactional Synchronization technology (TSX), which is basically a instruction set architecture (ISA) extension to make hardware accelerated transactional memory possible. What does that mean in the real world?
The more cores you get in a system, the more threads you need to keep them busy. Unfortunately, doing so is not that easy: threads have to work on shared data and need locks to make sure that the end result is correct. A thread has to acquire a lock, which may necessitate waiting until another thread releases the lock. That can lead to serious lock contention which can result in bad scaling, even to the point where more cores (and threads) can lead to a performance loss instead of a gain.
We measured this in a real-world benchmark (based upon MySQL InnoDB) and showed that spinning locks were indeed to blame. Spinlocks that are "busy waiting" too much do more than just destroying the scalability of your multi-core server: they can waste quite a bit of energy. As cores hardly perform any useful work and can no longer go to sleep, a torrent of spinlocks can wreak havoc on a server's performance per watt ratio. The following chart, based upon measurements on a HPC benchmark, is another example. Looking at the performance that you get from each core helps to summarize the problem.
The HPC bench shows how the more cores the software uses, the less performance we get per core. When 16 cores are working, each core offers only 60% of the performance of a single threaded run. A shared data structure is the problem. Careful tuning could make this software scale better, but this would require a substantial amount of time and effort.
The root of the locking problems is that locking is a trade-off. Suppose you have a shared data structure such as a small table or list. If the developer is time constrained, which is often the case, the easiest way to guarantee consistency is to let a certain thread lock the entire data structure (e.g. the table lock of MySQL MyISAM). The thread acquires the lock and then executes its code (the critical section). Meanwhile all other threads are blocked until the thread releases the lock on the structure. However, locking a piece of data is only necessary if two threads want to write to it. The classical example is a sequence of two bank transactions:
Locking an entire data structure is something else. You can imagine that the other threads might all write to different values and should be able to do a lot of work in parallel. The coarse grained locking of entire structures is generally not necessary, but it makes the developers' job easier.
The developer could rewrite his/her code and apply a much finer grained locking, but this is a pretty complex and time consuming task, and the critical code section will probably need several variables to do its job. In that case you can get deadlocks: one thread locks variable A, another thread variable B. They both need A and B, and they keep trying to get a lock on the other variable. There is a reason why getting good multi-core scalability requires a large investment of time and money.