CPS 210 22F, Midterm 2 Study Notes
Covers lectures L09 through L18. L19 is review and preview. The exam is cumulative: L01-L08 are in scope.
Part 2 of the course covers PL constructs and their implementation on the metal. It then moves into topics relating to operating systems, performance, and memory/storage devices and their behavior.

Topics:
Instructions and their operands: registers and references (L08-L09)
Control constructs (L10, L11)
Arrays: their use and misuse (L12, L13, L15)
Procedures and the stack (L12, L13)
Interaction of procedure calls and registers (L13)
Risks: vulnerabilities, attacks, and defenses (L12, L13)
Separate compilation and linking (L14)
Memory, I/O, the memory/storage hierarchy, locality, and caching (L15-L16)
OS kernel and exceptional control flow (ECF): trap, fault, interrupt; process model and virtualization (L17-L18)

L10
How many jump instructions (minimum) does it take to implement a while loop?
How many jump instructions (minimum) does it take to implement a do-while loop?
How many of those jumps are conditional jumps? How many are unconditional?
How does a label define the target of a jump?
How about a for loop? How about an if-then-else statement?
Give an example of a loop that needs more than the minimal number of jump instructions.
How/when do the assembler and linker determine the address of a label?
How/when do the assembler and linker put the address of a label in a machine instruction?
Direct vs. indirect jumps: give two examples where indirect jumps are useful / how to use them.
How does a jump table reduce the number of jumps needed to implement a switch statement?
How many bytes for a jump table with n targets (on x64)?
Why is an indirect jump more expensive (slower) than a direct jump?
Why is a conditional jump more expensive (slower) than an unconditional jump?
What benefit is offered by PC-relative addressing of direct jump targets?

L11
What is LIFO and how does it match the PL abstraction of procedures? (procedure == function, routine, subroutine)
How does use of a stack for procedure calls help ensure that code is re-entrant?
The pushq and popq instructions access memory: at what address?
How do pushq and popq affect the value in the stack pointer register %rsp?
How many operands for pushq and popq? What is the meaning of the operand?
There is no pushb or pushl instruction on x64. Why not? Push with no suffix always pushes 8 bytes. Why?
How is the x64 call (callq) instruction like a jump/branch? Is it conditional or unconditional?
How does the call (callq) instruction affect %rsp?
How does the call (callq) instruction affect memory? Where does the stored value come from?
How does the ret (retq) instruction affect %rsp?
In what way is ret/retq an indirect jump? Why not use a direct jump, which is less expensive?
From a code example, determine how many stack frames are created, and the maximum depth.
Why do procedure call/return often use subq and addq instructions to modify %rsp? What determines the operand?
Give an example of an x64 instruction to allocate space for local variables on the stack.
What is the role of the compiler in determining how much space to allocate for local variables on the stack?
How/why are uninitialized local variables "worse" than uninitialized global variables?
What determines the value that appears in an uninitialized local variable?
Give an example where an "old" value of a previous instance of local variable x reappears in a new instance of x.
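A minimal C sketch for the preceding question (a hypothetical example, not from lecture). Whether the "old" value actually reappears depends on the compiler's frame layout and optimization level, and reading an uninitialized local is undefined behavior, so this only illustrates a typical unoptimized outcome.

    #include <stdio.h>

    void set_local(void) {
        int x = 42;        /* lives in set_local's stack frame */
        (void) x;
    }

    void read_local(void) {
        int x;             /* uninitialized: takes whatever value is in its stack slot */
        printf("x = %d\n", x);
    }

    int main(void) {
        set_local();       /* leaves 42 in the (now-popped) frame */
        read_local();      /* a new frame may reuse that slot: the "old" 42 can reappear, or garbage */
        return 0;
    }

A variant that declares the local in read_local with a different type (e.g., float) would show the old bits reinterpreted, which is the next question below.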
Give an example where the value of a previous instance of local variable x reappears in a new instance of y with a different type.
Why is it a bad idea for a procedure to return a pointer to a local variable (e.g., with &)?
What happens if the caller uses a pointer to a callee's local variable returned by the callee? Give examples for these possible outcomes: (a) it "works", (b) it returns a "garbage" value, (c) it results in a segfault.
Give an example of a dangling reference to a variable on the stack.
How do x64 register conventions reduce the need for stack space?
How do x64 register conventions reduce the need for stack accesses? How does this improve performance?
Why are leaf procedures "faster"? (Or why might they be faster?)

L12
Much of L12 and L13 deals with choices made by the compiler. These choices are "static": they are represented in the binary and happen the same way every time the program runs. Sometimes we say the choice is made by a part of the program (like a procedure), but we mean it is a static choice made by the compiler and "baked in" to the compiled code.
How are these static choices represented in the compiled code?
How can you determine from disassembled code how the compiler lays out a struct in memory?
It is important that compiler choices for struct layout are known and deterministic. Why?
In what ways does the compiler "minimize use of the stack"? Why is this desirable?
What is register allocation? (compiler "picks registers")
How does smart register allocation reduce the need for stack space and stack accesses?
What is a register spill?
Why might it be necessary to save the value of a register and restore it later? (save/restore) Give an example.
What instructions are used to save and restore registers?
Why is it important to avoid register save/restore when we can? Why does it impact performance, i.e., slow down a program?
Distinguish caller vs. callee. Can a procedure be both a caller and a callee?
Register save/restore conventions: caller-saved vs. callee-saved registers. Why do we need these conventions? Why both?
Give an example where callee-saved is better, e.g., where callee-saved might avoid a spill.
Give an example where caller-saved is better, e.g., where caller-saved might avoid a spill.
When is the compiler forced to place local variables in the stack frame rather than in registers?
Why does the callee allocate space for local variables in the stack frame (generally), rather than the caller?
Why is it the caller (rather than the callee) that pushes the return address on the stack?
How does the caller determine the return address to push on the stack? What instruction does it use to push it?
Answer these with reference to the "x = 0b101" example in L12:
- For register save/restore by caller or by callee, why not just save the value at a fixed memory location? (like a global variable)
- Why not just save the return address for each procedure at a fixed memory location? (like a global variable)
Recursion uses a lot of stack space: what factors determine how much?
Why does Chase say "don't take large arrays on the stack"? Can the compiler avoid this choice? Can the programmer? How?
What happens if an executing program takes "too much" space on the stack? (stack overflow)
Stacks may have a smaller fixed size (e.g., 64 KiB or less) in some environments, e.g., in kernel space or in a multithreaded process. How does that influence a programmer's choices about recursion and local variables?

L13
Why are function pointers useful? Why are tables of function pointers useful?
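A minimal C sketch of a table of function pointers (a hypothetical example, not from lecture): the table entries hold addresses of code, and a call through the table is an indirect call, much like a compiler-generated jump table for a switch.

    #include <stdio.h>

    int add(int a, int b) { return a + b; }
    int sub(int a, int b) { return a - b; }

    /* Table of function pointers: the index selects the operation. */
    int (*ops[])(int, int) = { add, sub };

    int main(void) {
        int which = 1;                      /* e.g., chosen from input */
        printf("%d\n", ops[which](7, 3));   /* indirect call through the table: prints 4 */
        return 0;
    }

The pointers in the table hold code addresses, which leads directly into the next question about which segment they point into.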
What segment of the VAS does a function pointer point into?
Give an example in which failure to check inputs from an untrusted source leads to a buffer overflow.
How can C code check these inputs or otherwise guard against buffer overflow?
How are C's "n" variants for string-handling routines valuable for guarding against buffer overflow? (e.g., strncpy vs. strcpy)
In what way is buffer/array overflow "worse" when the array is a local variable on the stack?
Give an example of buffer overflow when reading (as opposed to writing) an array.
How does a buffer overflow on read constitute a vulnerability? What is the risk?
Stack smash: why does overflow when writing an array on the stack "clobber" data on the stack that is "earlier" than the array?
A stack smashing buffer overflow can cause control to "jump to anywhere". What instruction makes the jump?
When overflow of a character buffer "clobbers" a return address on the stack, which byte of the address is clobbered first? Does it matter which byte of the return address is clobbered first?
For a stack smash attack with code injection, what segment contains the attacker's injected code?
Can an attacker use a stack smash to branch to a selected address in the text or heap segments? How?
How does randomizing the text start address (ASLR) reduce the risk of successful stack smash attacks?
How does randomizing the stack start addresses (ASLR) reduce the risk of successful stack smash attacks?
How does protecting the stack and heap segments with "no-execute" protection (NX) reduce the risk of successful attacks?
Would stack smash attacks be less dangerous if the stack grew toward higher addresses?

L14
In what respect are targets in a makefile "like" procedures?
What is a dependency of a target?
Under what conditions does make execute the actions of a target?
Under what conditions does make execute the actions of a dependency?
How does separating the linker from the compiler enable separate compilation?
What is a symbol? Who chooses the set of symbols and their names?
What does it mean for a symbol to be U (undefined/unresolved) in an object file? Why would a U symbol appear in the symbol table?
How does the linker handle a U symbol in an object file?
How does the linker handle a global symbol (text or global data) that is defined in multiple object files?
An object file might contain multiple relocation records for a symbol. What determines how many?
What does it mean to say that an object file is relocatable?
Would an object file ever have a relocation record for a symbol defined in the same file?
Which segments of a virtual address space are mapped directly to some section of an executable file? Why is it useful to map these segments to the executable file?
Which segments are not mapped to the file? Why not those segments too?
Distinguish static libraries from shared libraries (DLLs).
Consider two different programs that use the same library. What determines whether they can share a copy of the library code in memory?
Why is PC-relative (%rip-relative) addressing useful? How does it facilitate code sharing and ASLR?

L15
Locality: spatial vs. temporal: examples.
Consider a program that traverses (accesses each element of) an array once. Can these accesses benefit from temporal locality? Spatial locality?
Arrays are stored in row-major order. What is the fastest way to traverse a 2D row-major array? How does this principle apply to 3D arrays? ("inside out from right to left")
Building/running programs (e.g., with make) shows locality of access for file data.
Explain how/why, with reference to header files, assembly files, and linking.
Consider an instance of instruction MOVB (%rxx), %ryy. (Pretend %rxx and %ryy are any valid 64-bit general-purpose registers.)
- How many bytes of data are needed from memory?
- What/where is the start address of the bytes needed from memory?
- How many bytes of data are overwritten in register %ryy? (This x64 detail is too detailed to test.)
- Does this instruction cause the CPU to fetch data from DRAM? Choose one of: {never, sometimes but not always, always}
- How many bytes of data does this instruction fetch from DRAM, if any?
- Do your answers change if the instruction is MOVW, MOVL, or MOVQ?
- How about for two such MOV* instructions executed as a pair?
How would you measure memory latency of a read from DRAM if you could observe signals on the memory bus?
In general and for purposes of this question, OS software reads from disk in units of blocks of size N bytes, where N is some fixed power of two, e.g., 8KB (meaning 8 KiB). A disk read is an I/O command from a CPU to a disk controller to read a physical block number "bx" from a disk and use DMA to deposit its contents in a DRAM buffer at address "vy".
Suppose the controller is idle and then receives two commands from the CPU in sequence within a few hundred microseconds, and both commands are disk reads from the same spinning disk (called a hard disk drive or HDD). Suppose the first command is a disk read for some block number a and the second is for block number b.
- Could you measure the latency of those disk reads by observing the memory bus? How?
- Could you measure the latency of those disk reads by observing the I/O bus (or serial I/O interconnect)? How?
- What effect does increasing the block size N have on latency?
- What effect does block distance have on latency? The distance is |a-b|, the absolute difference of the block numbers.
- The answer to the previous question might be "it depends". What does it depend on?
- How does increasing rotational speed affect that latency, other things being equal?
- How does widening the platter (e.g., from 2 inches to 4) affect the latency, other things being equal?
Suppose the CPU issues a sequence of K such disk read commands in a tight burst within a few hundred microseconds. Suppose the last DMA transfer resulting from the request sequence completes t seconds later.
- What is the throughput in blocks per second? (Called I/O operations per second or IOPS.)
- What is the throughput in bytes per second?
- Can you determine the maximum transfer rate of the disk device in bytes per second? (Called its peak bandwidth or "spindle speed".)
- The answer to the previous question (determine spindle speed) might be "it depends". What does it depend on?
- What is the effect of increasing N on throughput in bytes per second, other things being equal?
- What is the effect of increasing N on throughput in IOPS, other things being equal?
Reconsider your answers to the above questions if the disk is a solid-state storage device (SSD, e.g., flash) instead of a spinning disk (HDD).

L16
Be sure to understand how to think of an array of bytes as an array of blocks each of size N bytes, where N is some fixed power of two. Think of memory and storage devices as an array of bytes or as an array of blocks. Any cache examples on the exam will be like that.
So, bytes and blocks, byte addresses vs. block numbers:
- Given N and a byte address, what block number contains that byte?
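A minimal C sketch for these byte-address questions (block number, offset, and which C operations to use), assuming the block size N is a power of two, here 8 KiB:

    #include <stdio.h>
    #include <stdint.h>

    #define N 8192u                         /* block size: 8 KiB = 2^13 bytes */

    int main(void) {
        uint64_t addr = 0x2A4F3u;           /* some byte address */
        uint64_t blockno = addr / N;        /* equivalently addr >> 13 */
        uint64_t offset  = addr % N;        /* equivalently addr & (N - 1) */
        printf("block %llu, offset 0x%llx\n",
               (unsigned long long) blockno, (unsigned long long) offset);
        return 0;
    }

In hex or binary the split is just the low-order digits vs. the rest, which is why it is so much easier than in decimal (like splitting 53 cents into dimes and pennies).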
- Given N and a byte address, what is that byte's offset in its block?
- What C operations would you use to compute these functions?
- Why is this WAY EASIER when the byte address is in hex (or binary)?
- 53 cents: how many dimes? how many pennies? It's easy, right?
A cache of capacity C blocks may hold up to C of the blocks, indexed by their block numbers.
- Hits: what effect on performance?
- Misses: what brings a block into the cache?
- Placement: which "slots" of the cache can a block occupy, given the block number?
- Placement: direct-mapped vs. set-associative. Direct-mapped has set size 1.
- Placement: consider set size 2 or 4.
- Placement: fully associative means any block can occupy any cache slot: set size is C.
- Generally C is a power of two.
- Lookup/indexing: given a block number, how to tell if it is a hit or a miss? How to find the block contents if it is a hit?
- Lookup/indexing: data structures for set sizes 1, 2, 4, and C.
- Lookup/indexing: importance of tags and valid bits.
- Replacement/eviction: for a given N, C, and set size, which slots are candidates to evict to cache a block, given its number?
- Replacement/eviction: if there are multiple eligible candidates, how to choose?
- Replacement policies: random, not last used, least recently used (LRU), FIFO.
- Replacement: conflict misses for direct-mapped.
- Why is higher associativity (bigger sets) likely to improve hit rates?
- Why use the low-order bits and not the high-order bits to index the set?
Memory caches implemented in hardware (CPU caches L1, L2, L3) tend to have low set size and evict from one of the slots in the corresponding set. Why?
Caches implemented in software (like the I/O cache) tend to be fully associative and use a variant of LRU replacement. Why?
Lookup/indexing:
- Direct-mapped: each block can occupy only one slot, which is determined from the block number. How?
- Direct-mapped: how to look up a block in the cache?
- For set size 2 or 4, how to determine the set number from the block number?
- How does software do lookup in a fully associative cache? What data structure?
- How does software implement LRU eviction?
- Software cache lookup: a hash table indexed by block number, with a tag in each entry.
- Software cache lookup: on a hit, the tag matches the block number and the valid bit is set.
- Software cache lookup: on a hit, the entry maps/points to a slot that contains the indexed block.
- Keep a list of blocks for each hash bucket (set), move an entry to the tail on each access, and replace from the front for LRU.
Writes:
- Consider the subset of write operations that hit in the cache (write hit rate or write hit ratio).
- What is the impact of the write hit ratio at level k on the number/rate of writes to level k+1?
- Equivalently, what percentage of writes are "absorbed" by the cache at level k without a write to level k+1?
- What is it for a write-through cache?
- What is it for a write-back cache?
- What are the drawbacks of a write-back cache?
- Software-implemented caches (like the I/O cache) tend to be write-back with periodic "push" of "dirty" blocks to level k+1.
Suppose a cache (at level k) holds a subset of blocks from some home block space (at level k+1). Suppose levels k and k+1 each have some given fixed access time. If the access time for the cache at level k is 25% of the access time to the home at level k+1, and 80% of the accesses hit in the cache (the hit rate), then how much does the cache help average access time? (See the worked sketch below.)
Concept of working set: the subset of blocks that the workload or reference stream is "actively using".
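A worked sketch for the average-access-time question above, under the simple model avg = hit_rate * t_k + (1 - hit_rate) * t_(k+1); some treatments also charge the cache probe on a miss, so check which model the lecture uses.

    #include <stdio.h>

    int main(void) {
        double t_home  = 1.0;               /* access time at level k+1 (normalized) */
        double t_cache = 0.25 * t_home;     /* cache access is 25% of home access */
        double hit     = 0.80;              /* 80% of accesses hit at level k */
        double avg = hit * t_cache + (1.0 - hit) * t_home;
        printf("average access time = %.2f of home access time\n", avg);   /* prints 0.40 */
        return 0;
    }

Under this model the cache cuts average access time to 40% of the home access time, a 2.5x improvement.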
If the working set fits in the cache, then hit rates are very high. That depends on cache size C and the locality of the access requests in the workload.
- What happens to the level-k hit rate as we increase/decrease C to hold a larger/smaller share of the blocks in the home (level k+1)?
- What happens to the level-k hit rate as we increase the share of the blocks that are in "active use"?

L17
ECF: exceptional control flow: trap, fault, interrupt.
CPU exceptions: trap, fault, interrupt.
The OS kernel! What is it? Where is it? How does it get there? How does control enter the kernel?
- Kernel space: just another segment in each process virtual address space, always at the same address in every process.
- Kernel space contains kernel code and data structures.
- Kernel space is protected against access from untrusted user code.
- Kernel code runs with the CPU in kernel mode. (aka supervisor mode or protected mode)
- User code runs with the CPU in user mode.
- Kernel vs. user mode is indicated by a bit in a special register.
- A CPU/core in user mode cannot access kernel space: a reference to a kernel-space address causes the CPU to raise a fault.
- A CPU core in user mode can call the OS by executing a special instruction that causes a trap.
- A trap or fault causes the CPU to enter kernel mode.
- A trap or fault causes the CPU to transfer control to a corresponding kernel routine designated as a handler at boot time.
- The handler routine executes kernel code in kernel mode in kernel space.
- A trap or fault handler may call other kernel code in kernel mode in kernel space.
- Kernel code in kernel mode can change any memory, change any register values, and issue commands to devices.
- Kernel code in kernel mode can execute a special instruction to return to user mode.
- Kernel code in kernel mode can return to user mode whenever and wherever it "wants".
- Kernel code in kernel mode can return to user mode with whatever register values it "wants".
- Kernel code can set the PC (%rip) to go back to the faulting instruction, return to the next instruction, or go somewhere else.
- In this way, the kernel can fix problems, service requests, and suspend/resume running programs (processes, threads).
- In this way, the kernel can control EVERYTHING.
- Trap, fault, interrupt.
Trap examples: system calls.
Fault examples: page fault, segfault ("not on the map"), divide by zero, illegal instruction, ...
Interrupts
- The CPU raises an interrupt in response to some external condition.
- Interrupts occur as time passes: timer interrupts occur at regular intervals as the clock ticks.
- Interrupts occur when an I/O command completes, to notify the OS that the requested DMA is done.
- Interrupts occur when a device needs attention (e.g., incoming network packets, a tap or click, the mouse is moving).
- On interrupt, the CPU (core) transfers control to a corresponding kernel routine designated as an interrupt handler at boot time.
- On interrupt, the CPU switches to kernel mode.
- The interrupt handler runs as kernel code in kernel mode in kernel space.
- In this way, the kernel can control EVERYTHING.
- Trap, fault, interrupt.

L18
Syscall API and ABI
Syscall numbers
Need for register usage conventions for syscalls
Syscall stubs in the runtime library: what do they do?
errno: what does it mean? How/when does the kernel return it to user code? What does it look like to C? (See the sketch below.)
exit syscall
exit status
The process abstraction
The process illusion
Virtualization: virtualizing the CPU (timeslicing), virtual memory
Each process has a process ID.
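A minimal sketch, assuming a POSIX/Linux system, of what syscall stubs and errno look like from C. The getpid and read calls here are libc stubs that trap into the kernel; the deliberately failing read is just one way to force an error.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <errno.h>

    int main(void) {
        char buf[1];

        /* getpid() is a library stub that traps into the kernel and
           returns this process's ID. */
        printf("my pid: %ld\n", (long) getpid());

        /* A failing syscall returns -1; the stub leaves the kernel's
           error code in errno for C code to inspect. */
        if (read(-1, buf, sizeof buf) == -1)    /* -1 is not an open file descriptor */
            printf("read failed: errno=%d (%s)\n", errno, strerror(errno));
        return 0;
    }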
A process can invoke a system call (e.g., fork) to create another process: parent and child.
A syscall to create a child process returns the child's process ID to the parent (e.g., fork).
A syscall to exit a process passes its exit status (e.g., exit).
A syscall to wait for a child process with a given ID to exit returns the child's exit status after it exits (e.g., wait*).
A syscall to execute a binary program file sets up a process virtual address space as specified in the file (e.g., exec*).
Your terminal shell does fork/exec/exit/wait all day long. (See the sketch at the end of these notes.)
The Unix/Linux fork/exec system calls are cool and weird.

Other notes:
How to address args passed by reference.
Addressing relative to stack pointer, base pointer, PC: why?
DIS Chapters 7, 10, 11 cover most of it.

L08
registers don't have addresses
register allocation by compiled code
register conventions: why?
size specifier on instructions: b, w, l, q
address computation specifiers
LEAQ: what is it for?
Why can't we add more registers just like we add more memory?
ABI
end of Moore's law
cycles vs. mcpi
register usage conventions
caller-saved vs. callee-saved
asm suffix for byte, word, longword, quadword

L09
addressing structs with offsets
instruction sequence to push a stack frame (allocate space on the stack)
how does the compiled code "know" how big a stack frame to push?
condition code bits: what do they mean, implicit vs. explicit setting of the codes, ways to act on the codes
For condition codes, you should understand the big picture but we won't go into the x64 details.
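Relating to the L18 process questions above, a minimal shell-style fork/exec/wait sketch in C (assuming a POSIX system; /bin/echo is just a stand-in for any program a shell might run):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void) {
        pid_t pid = fork();                   /* create a child process */
        if (pid == 0) {
            /* child: replace this process's virtual address space with a new program */
            char *argv[] = { "echo", "hello from the child", NULL };
            execv("/bin/echo", argv);
            perror("execv");                  /* reached only if exec fails */
            exit(1);
        } else if (pid > 0) {
            /* parent: wait for the child and collect its exit status */
            int status = 0;
            waitpid(pid, &status, 0);
            if (WIFEXITED(status))
                printf("child %ld exited with status %d\n",
                       (long) pid, WEXITSTATUS(status));
        } else {
            perror("fork");
        }
        return 0;
    }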