Introduction to Shared Memory Multiprocessors

Cache-coherence protocols, memory-consistency models, and synchronization support are the three main types of architectural support necessary for the correct execution of shared-memory parallel programs on multiprocessor systems.

Multiprocessors

Inter-Process Communication (IPC)

IPC defines the methods used to exchange information between the processes or threads of a parallel program; between threads it is sometimes called inter-thread communication. The four main types of communication between processes are message passing, synchronization, shared memory, and remote procedure calls. While shared memory is the fastest form of IPC, it requires synchronization mechanisms such as mutexes and locks so that access to the shared data remains serially consistent. Another commonly used form of IPC is message passing. A message-passing protocol imitates communication between a server, which sends information related to a shared resource, and a client, which receives that information as input for computing some output. The communication medium in this protocol usually lies within the operating system's kernel. For a multi-threaded program, these messages are queued in FIFO order to maintain serial consistency.
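
As an illustration of the shared-memory form of IPC, the following minimal C sketch (the names and structure are ours, not taken from any source cited here) protects a shared counter with a POSIX mutex so that concurrent updates remain serialized:

```c
#include <pthread.h>
#include <stdio.h>

/* Shared data and the mutex that serializes access to it. */
static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);   /* enter critical section */
        counter++;                           /* shared-memory update   */
        pthread_mutex_unlock(&counter_lock); /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);      /* always 200000 */
    return 0;
}
```

Without the mutex, the two threads' read-modify-write sequences could interleave and lose updates, which is exactly the serial-consistency problem the paragraph above describes.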

Symmetric Shared-Memory Multi-Processing (SMP)

In traditional SMP (Symmetric Multiprocessing) systems, the computer has a single memory controller that is shared by all CPUs. This single memory connection often becomes a bottleneck when all processors access memory at the same time. It also does not scale very well for larger systems with a higher number of CPUs. For this reason, more and more modern systems are using a CC/NUMA (Cache Coherent/Nonuniform Memory Access) architecture. Examples are the AMD Opteron, IBM Power5, HP Superdome, and SGI Altix. [1]

Non-Uniform Memory Access (NUMA)

NUMA refers to a hardware architecture in which a processor can access its own local memory faster than memory local to another processor. The latency is determined by the physical distance between the processor and the memory. NUMA multiprocessors require some support from the operating system to achieve good performance, for example by considering spatial locality when allocating memory pages. One great advantage of the NUMA architecture is that even in a large system with many CPUs, it is possible to get very low latency on local memory. Because modern CPUs are much faster than memory chips, a CPU often spends considerable time waiting when reading data from memory; minimizing memory latency can therefore improve software performance.[1] NUMA is also referred to as distributed shared memory.

In a NUMA system, each node is itself a multiprocessor. As a result, if some memory is committed to a specific node, that node is treated as the owner of the resource, and any thread running on a different node that needs the resource will generate traffic on the interconnect and experience higher latency. NUMA systems usually have kernel-level support for bandwidth optimization, interleaving policies, and so on.
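
On Linux, the kernel-level NUMA support mentioned above is exposed through the libnuma API described in [1]. The following is a minimal sketch (error handling mostly omitted, node number chosen arbitrarily) of pinning a thread to a node and allocating memory from that same node so that accesses stay local:

```c
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    int node = 0;
    numa_run_on_node(node);            /* schedule this thread on node 0 */
    size_t size = 1 << 20;
    void *buf = numa_alloc_onnode(size, node); /* 1 MiB from node 0's memory */
    /* ... work on buf: all accesses are node-local ... */
    numa_free(buf, size);
    return 0;
}
```

Compiled with -lnuma, this keeps both the thread and its data on one node, avoiding the interconnect traffic described above.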

(Figure: CC-NUMA system organization)

Cache-coherent NUMA (CC-NUMA) is most commonly used in modern systems. Most modern systems use some sort of local, non-shared cache, and hence NUMA hardware can incur memory-access overhead due to coherence misses. In an SMP-based CC-NUMA multiprocessor system, SMP nodes are interconnected via an interconnection network based on the cache-coherent non-uniform memory access model. All processors belonging to the same SMP node access that node's memory modules uniformly, and all SMP nodes have access to all of the physically distributed memories. [2]

NUMA configurations are also common in Massively Parallel Processing (MPP) systems because they provide scalability in terms of disk space. MPP systems are systems in which a mesh of processors acts as a single functional unit. Many supercomputers running large database services, business-intelligence software, and the like use MPP for high performance and fine granularity on large volumes of data. A majority of the academics surveyed (58%) would recommend MPP technology as mature enough for use by commercial IT managers or CIOs, though price/performance was rated fourth after data parallelism, high transaction-processing capability, and scalability as contributing to commercial advantage. [3]


The Cray XT5m, one of the fastest supercomputers in the world, is one such MPP system. A description of this supercomputer can be found at http://www.cray.com/Assets/PDF/products/xt/CrayXT5mBrochure.pdf

Cache Coherence

Hardware-Based Coherence

Hardware solutions generally combine shared caches, snooping schemes, directory schemes, write-through protocols, write-back protocols, update protocols, invalidation protocols, and dirty-sharing and non-dirty-sharing protocols.[4] Under an invalidation protocol, when one cache writes a block, all other caches holding a copy invalidate it; under an update protocol, all other caches holding a copy are updated with the newly written data. Cache memory is a cost-effective method of increasing performance in uniprocessor systems. Write-through schemes update main memory and update or invalidate other caches containing the item; they are simple to implement but use a large amount of bandwidth. Write-back schemes are more efficient: despite the increased hardware complexity of cache-coherency support, they generate less bus traffic than write-through.[5]
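
To make the invalidation idea concrete, here is a small C sketch (entirely ours, not from the cited papers) of the state transitions a simplified MSI-style write-invalidate protocol performs for one block cached by several processors:

```c
#include <stdio.h>

/* Simplified MSI states for one cache block. */
typedef enum { INVALID, SHARED, MODIFIED } msi_state;

#define NCACHES 4
static msi_state caches[NCACHES]; /* state of the same block in each cache */

/* Processor `writer` writes the block: it gains MODIFIED locally,
 * and every other copy is invalidated (write-invalidate policy). */
static void write_block(int writer) {
    for (int i = 0; i < NCACHES; i++)
        if (i != writer && caches[i] != INVALID)
            caches[i] = INVALID;   /* snooped invalidation message */
    caches[writer] = MODIFIED;
}

/* Processor `reader` reads the block: a MODIFIED copy elsewhere is
 * written back and both copies become SHARED. */
static void read_block(int reader) {
    for (int i = 0; i < NCACHES; i++)
        if (i != reader && caches[i] == MODIFIED)
            caches[i] = SHARED;    /* write-back, then share */
    if (caches[reader] == INVALID)
        caches[reader] = SHARED;
}

int main(void) {
    write_block(0);  /* cache 0: MODIFIED, others INVALID   */
    read_block(1);   /* caches 0 and 1: SHARED              */
    write_block(2);  /* cache 2: MODIFIED, 0 and 1 INVALID  */
    for (int i = 0; i < NCACHES; i++)
        printf("cache %d: state %d\n", i, caches[i]);
    return 0;
}
```

An update protocol would replace the invalidation loop with one that copies the new value into the other caches instead of discarding their copies.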

Snoop devices are attached to cores and their caches so that shared data may be cached safely. A comparison of a variety of snoop-based cache-coherency schemes shows a “sensitivity to cache write policy more than the specific coherency protocol.” [6]


Software-Based Coherence

In software-based coherence, shared data are simply not cached. More advanced schemes require the compiler to perform correct analysis, so that some shared data may be cached when it is safe to do so.[6] 'Light-weight' schemes avoid caching nonshared data for energy efficiency and have the advantage of scalability.

In the sample code in [6], declaring a variable with the shared keyword implies that it cannot be cached.
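
The paper's own listing is not reproduced here; the C sketch below only illustrates the idea, approximating a hypothetical `shared` qualifier with a linker-section attribute (the section name `.uncached` is our assumption; a real MPSoC toolchain would map such a section to an uncacheable region of the address space):

```c
/* Hypothetical illustration of an uncacheable "shared" variable. */
#define shared __attribute__((section(".uncached"))) volatile

shared int flag;      /* never cached: every access goes to memory   */
int private_counter;  /* ordinary variable, freely cacheable         */

void producer(void) {
    private_counter++;  /* safe to cache: no other core reads it     */
    flag = 1;           /* visible to other cores without any
                           coherence hardware, at the cost of a
                           memory access on every use                */
}
```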

OS-Based Coherence

In OS-based coherence, a communication infrastructure is built on message queues, which are implemented as packets. Remote processes use global identifiers in queues to obtain packet buffers, and locks are used for synchronized access to the buffers. The OS is thus able to guarantee coherence. OS-based coherence, however, has an extremely high cost, and “general-purpose libraries... are not a practical alternative in a highly-performance and power-constrained context.”[6]
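
On a POSIX system, the message-queue infrastructure described above looks roughly like the following sketch (the queue name and message contents are ours). Every transfer is mediated by the kernel, which is what makes the approach coherent but also what makes it costly:

```c
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    struct mq_attr attr = { .mq_maxmsg = 8, .mq_msgsize = 64 };
    /* Kernel-managed queue: both sides name it by a global identifier. */
    mqd_t q = mq_open("/coherence_demo", O_CREAT | O_RDWR, 0600, &attr);
    if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

    const char *update = "block 42 = 7";   /* "packet" carrying shared data */
    mq_send(q, update, strlen(update) + 1, 0);

    char buf[64];
    mq_receive(q, buf, sizeof buf, NULL);  /* consumer sees a coherent copy */
    printf("received: %s\n", buf);

    mq_close(q);
    mq_unlink("/coherence_demo");
    return 0;
}
```

Compile with -lrt. Because the data only ever moves through the kernel's queue, no two processes hold writable copies of the same buffer at once.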

(Figure: cache organization [5])



Memory Consistency

Memory-consistency problems occur because memory instructions can be reordered during the execution of threaded programs.
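
The classic store-load litmus test illustrates this. In the C11 sketch below (the names are ours), each thread writes one flag and then reads the other. On hardware or compilers that reorder a store past the following load, both threads can read 0, an outcome that is impossible under sequential consistency:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

/* Relaxed atomics permit the hardware and compiler to reorder
 * each thread's store past its subsequent load. */
atomic_int x, y;
int r1, r2;

void *t1(void *arg) {
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

void *t2(void *arg) {
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* r1 == 0 && r2 == 0 is allowed under relaxed/weak ordering. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```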

Load Queues

Enforcing memory consistency can be done by strictly ordering memory operations, but this can result in unnecessary overhead. Instead, load queues track dependencies between memory operations to ensure memory consistency while detecting potential violations. There are two main types of load queues: snooping and insulated.

In a processor with a snooping load queue, originally described by Gharachorloo et al., the memory system forwards external write requests (i.e., invalidate messages from other processors or I/O devices) to the load queue, which searches for already-issued loads whose addresses match the invalidation address [9], quashing any overlapping load. If inclusion is enforced between the load queue and any cache, replacements from that cache also trigger an external load-queue search. Insulated load queues enforce the memory-consistency model without processing external invalidations, by squashing and replaying loads that may have violated the consistency model. [7]

Alpha Memory Model

The Alpha model provides two different fence instructions: the memory barrier (MB) and the write memory barrier (WMB). The MB instruction can be used to maintain program order from any memory operations before the MB to any memory operations after the MB. The WMB instruction provides this guarantee only among write operations. The Alpha model does not require a safety net for write atomicity. [8]

(Figure: ordering constraints of the Alpha memory model [9])
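
As a sketch of how these fences are used (compilable only for Alpha targets; the producer/consumer shape and names are ours, not from [8]), a producer can use WMB to order its data write before its flag write, while the consumer uses MB before reading the data:

```c
/* Alpha-only sketch: wmb orders earlier writes before later writes;
 * mb orders all earlier memory operations before all later ones. */
int data;
volatile int flag;

void producer(void) {
    data = 42;
    __asm__ __volatile__("wmb" ::: "memory"); /* WMB: write-write ordering */
    flag = 1;
}

int consumer(void) {
    while (flag == 0)
        ;                                     /* wait for the flag         */
    __asm__ __volatile__("mb" ::: "memory");  /* MB: full ordering         */
    return data;                              /* guaranteed to observe 42  */
}
```

The weaker WMB suffices on the producer side because only two writes need ordering; the consumer needs the full MB to order its read of the flag before its read of the data.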

Synchronization

MCS Lock

The MCS lock is a spin-lock algorithm designed by Mellor-Crummey and Scott which ensures FIFO ordering of lock acquisition, spins on local flag variables, uses a small amount of space per lock, and works well on machines with or without coherent caches. Its only drawback is that the time necessary to release an MCS lock depends on whether another processor is waiting. The MCS-lock algorithm "relies on two atomic read-modify-write operations"[10]. The swap instruction takes a pointer and a value: the value is stored into the memory location named by the pointer, and the location's old value is returned. The compare_and_swap instruction in the code below takes a pointer (L) and two values (I, X): the old value of the pointed-to location is returned, and I is stored into it only if it already contains X. Furthermore, each processor allocates a record when it requests a lock. The record holds a boolean flag and a pointer for forming a queue, with the lock holder at the head of the queue and the lock's pointer L pointing at the tail.

The algorithm, seen in the sample code below, maintains a queue of the processors requesting the lock. This enables each processor to wait on a “unique, locally-accessible flag variable”. [11]

(Figure: the MCS lock algorithm [11])
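
Since the original listing is an image, here is a minimal C11 sketch of the same idea (our transcription, not Mellor-Crummey and Scott's code). atomic_exchange plays the role of swap and atomic_compare_exchange_strong the role of compare_and_swap:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *_Atomic next; /* successor in the queue        */
    atomic_bool locked;            /* local flag this CPU spins on  */
} mcs_node;

typedef struct {
    mcs_node *_Atomic tail;        /* pointer L: tail of the queue  */
} mcs_lock;

void mcs_acquire(mcs_lock *lock, mcs_node *me) {
    atomic_store(&me->next, (mcs_node *)NULL);
    atomic_store(&me->locked, true);
    /* swap: atomically make ourselves the tail, learning the old tail */
    mcs_node *pred = atomic_exchange(&lock->tail, me);
    if (pred != NULL) {                     /* queue was not empty      */
        atomic_store(&pred->next, me);      /* link behind predecessor  */
        while (atomic_load(&me->locked))    /* spin on our OWN flag     */
            ;
    }
}

void mcs_release(mcs_lock *lock, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* compare_and_swap: reset the tail to NULL only if we are
         * still the tail, i.e. nobody is waiting. */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&lock->tail, &expected,
                                           (mcs_node *)NULL))
            return;
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                               /* a newcomer is still linking in */
    }
    atomic_store(&succ->locked, false);     /* hand the lock to the successor */
}
```

The release-side loop is the source of the drawback noted above: if a newcomer has swapped itself onto the tail but not yet linked in, the releaser must wait for it.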

PowerPC Sync Instruction

In context synchronization, the isync instruction is used to guarantee that memory accesses have been completed; instructions after the isync will execute in the new context. In execution synchronization, the sync instruction synchronizes execution and broadcasts addresses on the bus. This may be done to synchronize coherent memory with other processors. The difference between isync and sync is that with sync, external addresses must complete “with respect to other processors and mechanisms that access memory”. [12]
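
As a sketch of typical barrier placement (PowerPC targets only; the spinlock shape is the conventional idiom, not code taken from [12]), isync follows a lock acquisition so that no later access executes in the old context, while sync precedes the releasing store so that all prior accesses complete with respect to other processors:

```c
/* PowerPC-only sketch of barrier placement around a critical section. */
volatile int lock_word; /* 0 = free, 1 = held; assume an atomic acquire */

void after_acquire(void) {
    /* ... atomic test-and-set of lock_word has succeeded ... */
    __asm__ __volatile__("isync" ::: "memory"); /* context sync: later
                                                   accesses cannot start
                                                   before the acquire   */
}

void before_release(void) {
    __asm__ __volatile__("sync" ::: "memory");  /* execution sync: prior
                                                   accesses complete for
                                                   all processors       */
    lock_word = 0;                              /* release the lock     */
}
```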

References

  1. A NUMA API For Linux (http://www.novell.com/rc/docrepository/public/37/basedocument.2009-11-18.5883877819/4621437_en.pdf?noredir=True)
  2. Cache Coherent Protocols in NUMA Multiprocessors (http://ettrends.etri.re.kr/PDFData/13-5-2.pdf)
  3. MPP Systems (http://www.springerlink.com/content/p22v3316888827jn/)
  4. Cache Coherence (http://parasol.tamu.edu/~rwerger/Courses/654/cachecoherence1.pdf)
  5. A Low-Overhead Coherence Solution for Multiprocessors With Private Cache Memories (http://portal.acm.org/citation.cfm?id=808204)
  6. Cache Coherence Tradeoffs in Shared-Memory MPSoCs (https://wiki.ittc.ku.edu/ittc/images/0/0f/Loghi.pdf)
  7. Memory Ordering: A Value-Based Approach (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.2874&rep=rep1&type=pdf)
  8. Shared Memory Consistency Models: A Tutorial
  9. Shared Memory Consistency Protocol Verification Against Weak Memory Models: Refinement via Model-Checking (http://www.cs.utah.edu/formal_verification/papers/cav02paper.pdf)
  10. A Simple Correctness Proof of the MCS Contention-Free Lock (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.41.1194&rep=rep1&type=ps?noredir=True)
  11. Synchronization Without Contention (http://www.freescale.com/files/32bit/doc/app_note/AN2540.pdf?noredir=True)
  12. Synchronizing Instructions of PowerPC Instruction Set Architecture (http://www.freescale.com/files/32bit/doc/app_note/AN2540.pdf?noredir=True)