Homework 1: Design a cache coherence protocol

Due Wednesday, April 14

This assignment is to complete the high-level design of a directory-based multiprocessor cache coherence protocol, while using Murphi to verify the design.

A completed solution will consist of the following:

  1. Your murphi description, with comments.
  2. A 1-2 page description (in plain text) of your solution and two of the most interesting problems you encountered.
  3. A table of state counts for various values of NumProcs, NumAddrs, and NumValues, with and without symmetry reduction. If you have any energy left, also try to demonstrate data saturation using symmetry reduction.
  4. Any improvements you were able to make beyond the basic requirements above.

I encourage you to discuss this problem and your solutions with others, but I don't want you to share code. So don't borrow from the text of others solutions.

This is hard. Start as soon as you can.

Please direct all questions to me (call 5-3642, drop by, or email me at "dill@cs.stanford.edu").

What is directory-based cache coherence?

Shared memory multiprocessors want to maintain the illusion of a single memory that is shared among many processors, while allowing individual processors to cache memory locations for speed. All the caches need to be kept consistent, so that you don't have two processors writing private copies of the (supposedly) same memory location.

If there are few processors, one solution is to have all of them "snoop" the memory bus, so everyone can see what everyone else is doing. If a processor asks for a writable copy of a memory value, other processors that have readable copies can invalidate them immediately, because they see the request from the first processor.

In a large multiprocessor, it becomes too expensive to broadcast all requests to all processors (which are no longer sharing a common bus). Instead, a directory-based protocol sends messages to only those processors who need to know about a transaction. For example, when a processor asks for a writable copy, only those processors that have read-only copies are notified (so they can invalidate their copies). To do this, the multiprocessor needs to know which processors have copies of each memory location. This information is kept in a directory. (In reality, "memory locations" are multi-word blocks of memory, called "cache lines" when they are in the cache.)

These protocols can be very tricky, especially when they are optimized. Murphi has been used to verify a number of them, and now is pretty well suited for the task. Your assignment is to take a core protocol and add some missing cases to it, with optimizations.

The protocol in the assignment

The protocol assumes that there is a central memory with directory information for each cache line, and a collection of processors with caches. (In real processors, the memory is often distributed among the processors.) The processors are not modeled; instead, the caches "spontaneously" decide to write back or invalidate entries, when in reality these events would occur because of instructions executed in the processor (and algorithms in the cache system that decide which entry to dump when the cache wants to load a new line).

Each memory location in the cache has some associated state information. It can be invalid (I, meaning the location is not cached), read-only (RO), meaning that the value can be read but not written, or writable (W), meaning that the value can be updated. There are additional states that keep track of transactions in progress; for example, Wait_Read means that the cache has sent out a request for a readable copy of an address, but hasn't received it yet.

When a writable value is updated, it is not automatically written back to the main memory, so the value in the cache may be newer than the main memory.

The main memory has a copy of each memory location and its value. In addition, there is a directory entry for each memory location. The directory entry has a state, which says whether the location is writable in some cache (the state is called Excl, since that cache should have exclusive access to the location), readable (the Shared state), or not cached (Invalid). In addition, the directory entry has a field called "users" holds the set of processor ids that have the location cached read-only, represented as a bit-vector, and a field called "owner," which is the id of the latest processor to have an exclusive copy (I would like to have avoided the "owner" field, but I couldn't find an efficient and correct protocol without it, given the other decisions I made).

This protocol is strongly influenced by the nature of the communications network that handles the messages. It is reliable (no messages are lost or duplicated), but it provides no guarantees about ordering. Two messages can be sent by the same processor, and the first can be received long after the second. This assumption would be appropriate if the communication network could route consecutive messages differently. (Many multiprocessors assume message ordering will be preserved, though.)

Specification

Specifying the correct behavior of a cache coherence protocol is a problem that we are only partially going to address here. Essentially, the memory system is presenting an illusion to the user of a big shared memory. However, the world is not perfect, because simple memory models for the user require inefficient cache coherence protocols. So various liberal memory models, which sometimes produce surprising results, have been proposed. Ideally, we could verify that our protocol implements the appropriate memory model.

Unfortunately, this is a hard problem. Instead, we will specify some simple invariant conditions which will catch many errors. The most important conditions are assertions and error messages about illegal combinations of states (in the processors or directory) and messages. There should also be global invariants to check that there is no more than one exclusive copy of an address at any time, and that there are never both shared and exclusive copies, that the directory state correctly represents the states of the caches, etc.

Another major limitation of Murphi is that it can't check "liveness" properties. For example, we would like to be able to prove that every transaction eventually completes, but there is no way to do that. Murphi can check for deadlocks, whether the protocol gets totally stuck (you'll probably have some of these), but it will miss problems where a message is ping-ponged back and forth forever, or part of the system freezes while the rest keeps going, even though some activity will fail to make further progress. The problem is that these conditions can be fatal design errors. They can also hide other problems: e.g., the protocol would have fail catastrophically, but it didn't get to that point because something else got stuck.

The assignment

There is a Murphi description of the "core" of the algorithm, which handles the transactions for getting a read-only copy, invalidating it, get a writable copy, and writing it back. However, this description "cheats" in the following way: If a processor (say P1) requests read-only access to a location, and (according to the directory state), another processor already has a writable copy, the request for the read-only copy is ignored and left sitting in the network. It will not be processed until the directory state becomes invalid, which will only happen after the writable copy is written back to the home. Unfortunately, there is currently no mechanism for letting the holder of the writable copy know that someone else is waiting for a copy! There is a similar cheat when a writable copy is requested and there are already read-only copies.

The assignment is to do the following:

  1. Verify early and often. Run the protocol through Murphi every time you make a change. Record some of the more interesting bugs as you go.
  2. Add assertions to check for various properties. This will be an on-going process as you develop the protocol. Sometimes, you will see errors that could have been caught much earlier by an additional invariant. In this case, add the invariant! Initially, it is easier if add logic saying "if the network is empty", then the property should hold (partially completed transactions make things more complicated). The core solution has a property like this. However, this is very dangerous if there are transactions that fail to complete. To receive full credit, you must eventually weaken the assumptions to check the properties when the network is not empty.
  3. Handle the case where a Read_Req message arrives when the address in the directory is in the Excl state. Your solution should have the following properties: This probably sounds a lot easier than it is. I found an astonishing number of bugs in my solution, including several that required major changes in the protocol.
  4. I intended to ask you to "Handle the case where a Read_Excl message arrives when the directory is in the Shared state. This requires causing the processors with readable copies to invalidate their copies before the exclusive copy is written." However, I failed do it in the time I allotted to work on this, so it will be extra credit , especially if you can find a nice solution.

Hints on using Murphi

Make small changes, and run Murphi after every change. Sometimes, you know that a change is incomplete and there will be a particular error, but even then Murphi might find another error you don't know about.

Work on one transaction at a time, until you get it working.

Think of Murphi as an astute critic who can always find the example where your protocol doesn't work. Fixing it by trial and error only works for minor problems. At several points, you should stop and re-think using pencil and paper.

At first, verify with the smallest scaling parameters you can get away with. There are several advantages to this:

You also need to run large instances of the description to uncover errors that happen, say, with 3 processors instead of 2. You should try this as soon as you can (I found such a bug that caused me to rewrite a lot of the protocol). Then go back to the smaller scaling parameters when you are revising the protocol.

Even when you think you understand an error, look through the counterexample carefully (at least, whenever you get a new error). You will probably see a number of weird things that will increase your understanding of the protocol, uncover conditions you need to check, or reveal bugs that you need to fix.

More on the previous point: remember that Murphi doesn't check every property your protocol should have. It can't check for "livelock", where part of the system is making no progress but there is some other activity that prevents it from being a deadlock. It can't detect most performance problems in your protocol. You need to keep thinking, and not rely on Murphi for everything. Specification suggestions:

Whenever you have a FOR loop that loops over a ScalarSet, Murphi generates a warning saying the iterations of the loop should be independent. Please check that this condition holds, then don't worry about it. Murphi generates the message all the time because it can't easily check that the loop iterations are independent.

Don't use the multiset type for anything except the message network. There are a number of surprising restrictions that will cause you problems.

Murphi sometimes reports "internal error". This looks like a system problem but usually isn't. Run-time type checks, bounds checks, etc. are treated like invariants, and Murphi is reporting that they are violated. One such message is something about the network overflowing. If your protocol goes berserk and generates lots of messages, or if you forget to remove messages from the network after processing them, you get this problem (or if you make your network too small).

Getting and running Murphi

The sources to Murphi can be downloaded from a web page, and you can install it on the unix system of your choice.

I've gotten it running on the saga systems. I will make this publically accessible right after the lecture on Monday.

Murphi sources are in ~dill/Murphi3.1 on DCC (e.g. epic) machines. It only runs on Solaris at the moment.

Directions for Murphi:

Copy
~dill/list5.m
to your directory. Run
 ~dill/mu list5.m 
Run
 g++ -I ~dill/murphi_include list5.C 
Ignore the warning about an abstract declarator.

Run

 a.out -tf 
Check
 a.out -h 
for more information about options.

Files

You can download these files in Netscape by holding down the shift key when you click them with the mouse.

Murphi user documentation

Core description of the protocol

Copyright © by David L. Dill, 2001, 2002, 2003, 2004. All rights reserved. This material may be freely distributed for educational purposes, so long as (1) this copyright notice remains, (2) the notes are not modified in any way, and (3) no fee is charged. Other uses require the express written consent of the authors.