DSC102
Missing Vocabulary:
Most common approach: load-store architecture, in which the processor must copy data from memory into registers before operating on it
Registers: Tiny local memory (“scratch space”) on the processor into which instructions and data are copied
Concepts of Memory Management:
❖ Caching: Buffering a copy of bytes (instructions and/or data) from a lower level at a higher level to exploit locality
❖ Prefetching: Preemptively retrieving bytes (typically data) from addresses not yet explicitly asked for by the program
❖ Spill/Miss/Fault: Data needed by the program is not yet available at a higher level; need to get it from the lower level
❖ Register Spill (register to cache); Cache Miss (cache to main memory); “Page” Fault (main memory to disk)
❖ Hit: Data needed is already available at the higher level
❖ Cache Replacement Policy: When new data needs to be loaded to the higher level, which old data to evict to make room? Many policies exist with different properties (e.g., LRU; see the sketch below)
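To make the replacement-policy idea concrete, here is a minimal sketch of an LRU (Least Recently Used) policy in Python; the capacity of 3 and the page names are arbitrary illustrative choices, not anything from the course.

from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used entry when full."""
    def __init__(self, capacity=3):  # tiny capacity, purely for illustration
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                # miss: caller must fetch from the lower level
        self.data.move_to_end(key)     # hit: mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        elif len(self.data) >= self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry
        self.data[key] = value

cache = LRUCache()
for page in ["A", "B", "C", "A", "D"]:  # loading "D" evicts "B", the LRU page
    if cache.get(page) is None:
        cache.put(page, f"contents of {page}")
print(list(cache.data))  # ['C', 'A', 'D']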
Memory Hierarchy in PA0:
❖ Dask DataFrame automatically manages Disk vs. DRAM for you
❖ Full data sits on Disk, brought to DRAM upon compute()
❖ Dask stages out computations using Pandas (see the sketch below)
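A minimal sketch of that Disk-vs-DRAM handoff; the file path and column name are hypothetical.

import dask.dataframe as dd

# Lazy: builds a task graph over on-disk partitions; nothing is read yet.
df = dd.read_csv("data.csv", blocksize="64MB")   # hypothetical path
result = df["price"].mean()                      # "price" is a hypothetical column

# compute() is what actually pulls partitions into DRAM, one Pandas
# chunk at a time, and materializes the final answer.
print(result.compute())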
OS: An OS is a large set of interrelated programs that make it easier for applications and user-written programs to use computer hardware effectively, efficiently, and securely. Without an OS, computer users would have to speak machine code!
The Abstraction of a Process: High-level steps OS takes to get a process going (see the sketch below):
1. Create a process (get Process ID; add to Process List)
2. Assign part of DRAM to the process, aka its Address Space
3. Load code and static data (if applicable) into that space
4. Set up the inputs needed to run the program’s main()
5. Update the process’s State to Ready
6. When the process is scheduled (Running), OS temporarily hands off control to the process to run the show!
7. Eventually, the process finishes or is destroyed (Destroy)
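As a rough user-space analogy (a sketch, not the OS's internal mechanism): Python's multiprocessing module asks the OS to carry out these steps on our behalf.

import multiprocessing as mp
import os

def work():
    # Runs in the child process's own address space.
    print(f"child PID {os.getpid()} is running")

if __name__ == "__main__":
    p = mp.Process(target=work)  # OS creates the process and sets up its address space
    p.start()                    # process becomes Ready, then Running when scheduled
    p.join()                     # parent waits; child finishes and is destroyed
    print(f"child exit code: {p.exitcode}")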
Virtualization: Each hardware resource is treated as a virtual entity that OS can divide up and share among processes in a controlled way
Limited Direct Execution:
❖ OS mechanism to time-share CPU and preempt a process to run a different one, aka “context switch”
❖ A Scheduling Policy tells OS which process to run next and for how long
❖ Processes must also transfer control to OS for “privileged” operations (e.g., I/O) via the System Calls API (see the sketch below)
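For instance, even an ordinary print ultimately becomes a write() system call. Python's os module exposes thin wrappers over such syscalls (a sketch; fd 1 is the conventional Unix standard-output descriptor):

import os

# os.write() traps into the OS, which performs the privileged I/O.
os.write(1, b"hello from a system call\n")  # 1 = stdout file descriptor
print(f"my PID (via the getpid() syscall): {os.getpid()}")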
Concurrency – Multiprocessing: Different processes run on different cores (or entire CPUs) simultaneously.
Thread: Generalization of OS’s Process abstraction
❖ A program spawns many threads; each runs parts of the program’s computations simultaneously
❖ Multithreading: Same core is shared by many threads
Issues in dealing with multithreaded programs that write shared data (see the sketch below):
❖ Cache coherence
❖ Locking; deadlocks
❖ Complex scheduling
Partitioning / replication of data simplifies concurrency
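A classic illustration of the shared-data problem, as a minimal sketch: threads doing read-modify-write on a shared counter need a lock, or their updates can interleave and be lost.

import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:        # remove this lock and the final count may come up short
            counter += 1  # read-modify-write on shared data

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 with the lock; possibly less without it (lost updates)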
Filesystem: The part of OS that helps programs create, manage, and delete files on disk (secondary storage)
File Descriptor: An OS-assigned non-negative integer identifier/reference for a file’s virtual object that a process can use. File Handle: A programming language’s (PL’s) abstraction on top of a file descriptor (fd). See the sketch below.
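In Python the two levels sit side by side (a sketch; /tmp/example.txt is a hypothetical path):

import os

# File descriptor: the raw OS-level integer.
fd = os.open("/tmp/example.txt", os.O_CREAT | os.O_WRONLY)
print(fd)                 # e.g., 3: just an integer the OS handed out
os.write(fd, b"via fd\n")
os.close(fd)

# File handle: Python's higher-level object wrapping an fd.
with open("/tmp/example.txt") as f:
    print(f.fileno())     # the underlying file descriptor
    print(f.read())       # buffered, decoded I/O layered on top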
Database: An organized collection of interrelated data.
Data Model: An abstract model to define organization of data in a formal (mathematically precise) way.
❖ Logical level: Data model for higher-level reasoning
❖ Physical level: How bytes are layered on top of files
Virtualization of DRAM with Pages — Page: An abstraction of fixed-size chunks of memory/storage
❖ Makes it easier to virtualize and manage DRAM
Page Frame: A slot in DRAM that holds one page’s content
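The page size is fixed by the hardware/OS; you can query it from Python (a sketch for Unix-like systems):

import mmap
print(mmap.PAGESIZE)  # commonly 4096 bytes on x86-64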
A process’s Address Space: Slice of virtualized DRAM assigned to it alone!
“Persistent Memory (PMEM)”: Marketing term for large DRAM that is backed up by battery power! Non-Volatile RAM (NVRAM): Popular term for DRAM-like device that is genuinely non-volatile (no battery)
Data Organization on Disk – ❖ Disk space is organized into files ❖ Files are made up of disk pages aka blocks
Magnetic Disk Quirks — Key Principle: Sequential vs. Random Access Dichotomy; sequential reads/writes are far faster than random ones, because each random access pays seek and rotational latency. See the sketch below.
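A rough micro-benchmark sketch of the dichotomy (the file name and sizes are arbitrary; the OS page cache and SSDs will shrink the gap, while on magnetic disks it is dramatic):

import os, random, time

PAGE = 4096
N_PAGES = 16_384  # 64 MB total
with open("scratch.bin", "wb") as f:  # hypothetical scratch file
    f.write(os.urandom(PAGE * N_PAGES))

def timed_read(offsets):
    start = time.perf_counter()
    with open("scratch.bin", "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(PAGE)
    return time.perf_counter() - start

sequential = [i * PAGE for i in range(N_PAGES)]
shuffled = random.sample(sequential, len(sequential))
print(f"sequential: {timed_read(sequential):.2f}s  random: {timed_read(shuffled):.2f}s")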
Cloud Computing — ❖ Manageability: Managing hardware is not user’s problem ❖ Pay-as-you-go: Fine-grained pricing economics based on actual usage (granularity: seconds to years!) ❖ Elasticity: Can dynamically add or reduce capacity based on actual workload’s demand
IaaS:
❖ Compute: Elastic Compute Cloud (EC2); Elastic Container Service (ECS); serverless compute engines: Fargate (serverless containers), Lambda (serverless functions)
❖ Storage: Simple Storage Service (S3); Elastic Block Store (EBS); Elastic File System (EFS); Glacier (archival storage classes)
❖ Networking: CloudFront (low-latency content delivery); Virtual Private Cloud (VPC)
PaaS:
❖ Database/Analytics Systems: Aurora, Redshift, Neptune, ElastiCache, DynamoDB, Timestream, EMR, Athena
❖ Blockchain: QLDB
❖ IoT: Greengrass
❖ ML/AI: SageMaker*
SaaS:
❖ ML/AI: SageMaker*, Elastic Inference, Lex, Polly, Translate, Transcribe, Textract, Rekognition, Ground Truth
❖ Business Apps: Chime, WorkDocs, WorkMail
Decoupling of compute+memory from storage is common in the cloud; such designs are hybrids of shared-disk parallelism and shared-nothing parallelism
Data Partition Sizes (see the sketch below):
❖ Avoid too few chunks (low degree of parallelism)
❖ Avoid too many chunks (task graph overhead)
❖ Be mindful of available DRAM
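In Dask, partition size can be set at load time or adjusted afterwards (a sketch; the path and the numbers are illustrative):

import dask.dataframe as dd

# blocksize sets roughly how many bytes of the file go into each partition.
df = dd.read_csv("data.csv", blocksize="128MB")  # hypothetical path
print(df.npartitions)

# Fewer partitions shrink the task graph; more raise the degree of parallelism.
df = df.repartition(npartitions=32)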
Dask: Task-Parallelism Best Practices
❖ Use the Diagnostics dashboard: monitor # tasks, core/node usage, and task completion (see the sketch below)
❖ Task Graph sizes: if too large, bottlenecks (serialization / communication / scheduling); if too small, under-utilization of cores/nodes
❖ Rough guidelines: tune data chunk size to adjust # tasks (see previous point); break up a task/computation when cores/nodes are under-utilized; fuse tasks/computations (aka “batching”) when the graph is too large, or in other cases break a job apart into distinct stages
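Starting a local cluster exposes that dashboard (a sketch; the worker counts, file path, and column name are illustrative):

from dask.distributed import Client
import dask.dataframe as dd

if __name__ == "__main__":
    client = Client(n_workers=4, threads_per_worker=2)  # local cluster
    print(client.dashboard_link)  # open this URL to watch tasks execute

    df = dd.read_csv("data.csv", blocksize="64MB")  # hypothetical path
    df.groupby("key").size().compute()              # "key" is a hypothetical column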