
AOS Lecture Notes - Lesson 7 - Distributed Subsystems

Distributed Subsystems

  • Discuss the design and implementation of selected distributed systems

Global Memory Systems

  • How can we use peer memory for paging across a LAN?

Context for Global Memory System

  • img
  • Memory pressure
    • Different for each node
    • How best can we use idle cluster memory?
    • Remote memory access may soon be faster than local disk access
      • 10-gigabit links will soon be common for LANs, and they are faster than local disk access speeds (especially with spinning-rust drives)
  • Memory manager
    • VA->PA or read from disk
    • gives the illusion that all of virtual address space is in physical memory (covered thoroughly in GIOS notes)
  • Global memory system
    • VA->PA, or cluster memory, or read from disk
  • Generally, GMS favors networked access to cluster memory over reading from local disk (a sketch of this read path follows this list).
    • Only pull from local disk if the required page is not in cluster memory
    • Does not interfere with writing to local disk, though. The disk always has copies of all pages, because only clean (not “dirty”) pages are paged out to cluster memory.
    • GMS does not add any new failure modes to worry about, as a node crash only loses clean pages, which are still on local disk
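
A minimal sketch, in Python, of the read path described above; flat dictionaries stand in for the node's local memory, the cluster's idle memory, and the disk (all names here are illustrative, not GMS APIs):

```python
def gms_read(page_id, local_mem, cluster_mem, disk):
    """Return page contents, preferring cluster memory over local disk."""
    if page_id in local_mem:            # ordinary VA -> PA hit
        return local_mem[page_id]
    if page_id in cluster_mem:          # hit in a peer's idle memory
        local_mem[page_id] = cluster_mem.pop(page_id)  # clean copy migrates
        return local_mem[page_id]
    local_mem[page_id] = disk[page_id]  # miss everywhere: read local disk
    return local_mem[page_id]
```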

GMS Basics

  • “Cache” refers to physical memory (i.e., DRAM), not processor cache
  • Sense of “community” to handle page faults at a node
    • Use peer memories as a supplement for the disk
  • Physical memory of a node
    • img
      • The global component is where storage for the community service is kept
      • The boundary between local and global memory shifts depending on memory pressure at the node: as the working set grows, the global component shrinks, and vice versa (a toy illustration follows this list).
      • Shared pages will be in the local component of node memory for multiple nodes, so pages in local memory may be private or shared.
      • Paradoxically, though, pages in the global part of a node’s memory are guaranteed to be private copies, as the global section is equivalent to a disk.
  • The function of GMS is to serve as a remote paging facility
  • Coherence of shared pages is outside the purview of GMS. Maintaining coherence is the concern of the application
  • GMS, like most pagers, uses a variant of the LRU page replacement algorithm
    • As it is a global structure, it picks the globally oldest page to replace. This means age information needs to be managed by GMS
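
A toy illustration (not from the paper) of the shifting local/global boundary, assuming a fixed per-node frame count; the local part tracks the working set and everything left over is donated to the community:

```python
TOTAL_FRAMES = 1024  # hypothetical per-node capacity

def split_frames(working_set_size):
    """Local part tracks the working set; the remainder is global."""
    local = min(working_set_size, TOTAL_FRAMES)
    return local, TOTAL_FRAMES - local   # global part shrinks as WS grows

print(split_frames(300))    # (300, 724): light pressure, large global part
print(split_frames(2000))   # (1024, 0): heavy pressure, no community service
```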

Handling Page Faults – Case 1

  • img
  • Common Case
    • Page fault for X on node P
    • Hit in the global cache of some node Q

Handling Page Faults – Case 2

  • img
  • Common case with memory pressure at P
    • Page fault for X on node P
    • Swap LRU page Y for X

Handling Page Faults – Case 3

  • img
    • Note that if the least recently used page on R is in the global section, it must be clean and so can be discarded without writing it anywhere. If it is in the local section, it could be dirty and must be written to disk first.
  • Faulting page on disk
    • Page fault for X on node P
    • Page not in cluster memory

Handling Page Faults – Case 4

  • img
    • Page X is copied to P’s local section but not removed from Q’s local section
  • Faulting page actively shared
    • Page fault for X on node P
    • Page in some peer node Q’s local cache (a consolidated sketch of all four cases follows)
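
A consolidated sketch of the four page-fault cases, as a toy Python model rather than the GMS implementation; the Node class, the oldest helper, and the age bookkeeping (larger age = older page) are assumptions introduced for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    local: dict = field(default_factory=dict)  # page_id -> age (working set)
    glob: dict = field(default_factory=dict)   # page_id -> age (community part)

def oldest(cache):
    """LRU victim: the page with the largest age."""
    return max(cache, key=cache.get)

def handle_fault(p, x, peers, disk):
    """Node p faults on page x; peers are the other cluster nodes."""
    # Case 4: x is actively shared. Copy it from a peer's *local* section,
    # leaving that copy in place; coherence is the application's problem.
    for q in peers:
        if x in q.local:
            p.local[x] = 0                 # age 0 == just used
            return "case 4"
    # Cases 1 and 2: x sits in some peer q's global (community) section.
    for q in peers:
        if x in q.glob:
            del q.glob[x]
            if p.glob:                     # case 1: trade away our oldest
                y = oldest(p.glob)         #   global page; boundary shifts
                q.glob[y] = p.glob.pop(y)
            elif p.local:                  # case 2: memory pressure at p,
                y = oldest(p.local)        #   so swap LRU local page y for x
                q.glob[y] = p.local.pop(y)
            p.local[x] = 0
            return "case 1 or 2"
    # Case 3: x is only on disk. (The full protocol also pushes p's LRU
    # page toward the node holding the globally oldest page, which can
    # discard its victim if clean; that cascade is omitted here.)
    assert x in disk
    p.local[x] = 0
    return "case 3"

p = Node("P", local={"a": 3})
q = Node("Q", glob={"x": 9})
print(handle_fault(p, "x", [q], disk={"x"}))  # case 2 here: "a" moves to Q
```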

Summary Quiz

img

Geriatrics (Age Management)

  • img
  • How can we identify the globally oldest page for our LRU page replacement variant?
  • Epoch parameters
    • T: Max duration
    • M: Max replacements
  • At the start of each epoch
    • Every node will send age info to the initiator
      • i.e., the ages of the pages resident at this node
    • The initiator node will compute and respond with:
      • The M oldest pages in the cluster that need to be replaced
      • Out of those M pages, the minimum age (the “youngest” page that must be replaced)
        • Expressed as MinAge
      • The fraction of the M replacements that will come from each node (called here a weight)
        • Expressed as Wi
  • Selection of an initiator should be dynamic. Any one node should not be stuck with this work every time, especially if it gets busy.
    • Initiator for next epoch will be the node with highest replacement weight (Wi). This is easy because each node receives the weights of all nodes from the initiator each epoch.
  • Action at a node on a page fault (a sketch of the epoch computation and this eviction decision follows this list)
    • Page Y is the eviction candidate
    • if age(page Y) > MinAge => discard
    • if age(page Y) < MinAge => send to peer Ni
      • Ni is identified using the Wi weights from the most recent epoch. The decision algorithm is slightly complicated and not given in detail.
  • Age management only approximates global LRU; it does not establish it exactly. This is a compromise for performance.
  • “Think global, act local” is hammered on throughout the course and applies again here.
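
A hedged Python sketch of the epoch computation and the per-fault eviction decision; the function names, the age representation (larger = older), and the weight-proportional peer selection are assumptions, since the exact decision algorithm is not given here:

```python
import random

def run_epoch(ages_by_node, M):
    """ages_by_node: {node_id: [page ages]} -> (MinAge, {node_id: Wi})."""
    # Flatten to (age, node) pairs and take the M oldest cluster-wide.
    all_pages = sorted(
        ((age, node) for node, ages in ages_by_node.items() for age in ages),
        reverse=True)
    victims = all_pages[:M]
    min_age = victims[-1][0]              # the "youngest" page that must go
    # Wi: fraction of the M victims that live on node i.
    weights = {node: 0.0 for node in ages_by_node}
    for _, node in victims:
        weights[node] += 1 / len(victims)
    return min_age, weights

def on_eviction(page_age, min_age, weights, rng):
    """Per-fault decision at a node, using the latest epoch results."""
    if page_age > min_age:
        return None                       # older than MinAge: just discard
    # Otherwise send to a peer Ni chosen in proportion to the weights
    # (one plausible reading of the selection rule).
    nodes, probs = zip(*weights.items())
    return rng.choices(nodes, weights=probs, k=1)[0]

min_age, w = run_epoch({"A": [5, 9, 2], "B": [7, 1], "C": [8, 3]}, M=3)
print(min_age, w)                         # MinAge 7; each node holds 1/3
print(on_eviction(4, min_age, w, random.Random(0)))
```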

Implementation of GMS in Unix

  • img
  • GMS integrated with DEC OSF/1
    • Access to anonymous pages and filesystem-mapped pages goes through GMS on reads
  • Collecting age information to calculate global LRU is the hard part, as described above
    • In particular, within the VM manager the OS does not see what user applications are doing with their pages.
    • The solution here was to add a daemon that periodically dumps information from the TLB into a data structure that GMS maintains (sketched below).
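
A rough sketch of that daemon's per-period work, assuming a hypothetical read_tlb_entries hook that reports which pages appeared in the TLB since the last dump (the real system modified the kernel to expose this):

```python
page_ages = {}   # GMS-maintained structure: page_id -> age in sample periods

def tlb_dump(read_tlb_entries):
    """Run once per sampling period by the daemon."""
    recently_used = set(read_tlb_entries())
    for pid in page_ages:
        # Pages seen in the TLB are young again; unseen pages grow older.
        page_ages[pid] = 0 if pid in recently_used else page_ages[pid] + 1
    for pid in recently_used:
        page_ages.setdefault(pid, 0)      # first sighting of a new page
```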

Distributed Data Structures

  • Must convert VA to UID (universal ID)
    • img
    • derived from the VM system and the unified buffer cache (UBC)
    • 3 key data structures
      • PFD
        • page frame directory
        • UID -> PFD -> PFN
        • like a page table on a caching node
        • states: local private/shared, global private, on disk
      • GCD
        • Global cache directory
        • A cluster-wide hash table
        • We want to distribute management of PFD lookups
        • UID -> GCD -> Ni
          • Ni is which node has the PFD for this UID
          • This is dynamic
      • POD
        • Page Ownership Directory
        • UID -> POD -> Ni
        • Given a UID, which node has the GCD for it?
        • Replicated on all nodes
      • In short: use the POD to find the GCD, use the GCD to find the PFD, and use the PFD to find the PFN (sketched after this list)
        • img
          • If the GCD for a UID moves to a different node, the POD must be updated
  • On page eviction
    • paging daemon is invoked
      • every node has a paging daemon
      • invoked if the freelist is below a specified threshold
        • there’s always a freelist; the system doesn’t run with zero free pages as a matter of course
      • putpage the oldest pages
      • update the GCD and PFD for those UIDs
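
A minimal Python sketch of the POD -> GCD -> PFD chain and a putpage-style update, flattened into local dictionaries for illustration; the real system distributes the GCD as a cluster-wide hash table and exchanges messages between nodes, and all node names and UIDs below are made up:

```python
pod = {"uid42": "nodeB"}                    # uid -> node holding the GCD entry
gcd = {"nodeB": {"uid42": "nodeC"}}         # per node: uid -> caching node
pfd = {"nodeC": {"uid42": (0x1A3, "global-private")},
       "nodeA": {}}                         # per node: uid -> (pfn, state)

def lookup(uid):
    """Resolve a UID to (caching node, PFN, state)."""
    gcd_node = pod[uid]              # step 1: POD (replicated on all nodes)
    pfd_node = gcd[gcd_node][uid]    # step 2: GCD (dynamic) names the cacher
    pfn, state = pfd[pfd_node][uid]  # step 3: PFD, the node's UID page table
    return pfd_node, pfn, state

def putpage(old_node, new_node, uid):
    """Eviction by the paging daemon: the page moves; GCD and PFD follow."""
    pfd[new_node][uid] = pfd[old_node].pop(uid)
    gcd[pod[uid]][uid] = new_node    # POD changes only if the GCD itself moves

print(lookup("uid42"))               # ('nodeC', 419, 'global-private')
putpage("nodeC", "nodeA", "uid42")
print(lookup("uid42"))               # ('nodeA', 419, 'global-private')
```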

Distributed Shared Memory

  • Can we make the cluster appear like a shared memory machine?

Distributed File Systems

  • How can we use cluster memory for cooperative caching of files?