Summary of GFS
2021/12/05
Assumptions and technology constraints that set GFS apart from earlier distributed file systems:
- component failures are common
- files are huge
- prefer append to in-place edit
- files are mostly read-only once written
- well-defined semantics for concurrent appends, to support producer-consumer queues
- prefer bandwidth to latency
Components
- One master, plus master replicas and shadow masters:
- A single master keeps the design simple and has the global knowledge needed to make placement decisions. Client-master communication is limited to metadata (never file data), so the master does not become a bottleneck
- shadow masters support read-only access while the master is down
- master replicas serve as backups if the master cannot be recovered
- master persists the namespace and the file-to-chunk mapping, but not chunk locations; it rebuilds those from chunkserver heartbeats after startup. Chunkserver membership changes (join/leave/restart/fail) so often that asking the chunkservers directly is simpler (see the sketch after this list)
- Chunkservers store the file data and serve reads and writes
- Files in chunkservers are divided into fixed-size 64 MB chunks; the larger the chunk, the less metadata the master has to keep (under 64 bytes per chunk)
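To make that metadata split concrete, here is a minimal sketch in Go (names and types are assumptions, not the real GFS interfaces): the master persists only the namespace and file-to-chunk mapping, keeps chunk locations in memory from heartbeats, and answers clients with metadata only.

```go
package main

// ChunkHandle is a globally unique ID the master assigns to each chunk.
type ChunkHandle uint64

// 64 MB chunks; the paper notes the master keeps under 64 bytes of metadata per chunk.
const chunkSize = 64 << 20

// Master state, split the way the paper describes it.
type Master struct {
	// Persisted via the operation log and checkpoints.
	fileChunks map[string][]ChunkHandle // file name -> ordered chunk handles

	// Volatile: rebuilt from heartbeats after startup, never written to disk.
	locations map[ChunkHandle][]string // chunk handle -> chunkserver addresses
}

// OnHeartbeat records which chunks a chunkserver currently holds; since
// chunkservers join, leave, restart, and fail all the time, they are treated
// as the source of truth for chunk locations.
func (m *Master) OnHeartbeat(addr string, held []ChunkHandle) {
	for _, h := range held {
		m.locations[h] = appendUnique(m.locations[h], addr)
	}
}

// Lookup is essentially all a client asks the master for: a chunk handle and
// the replicas holding it. File bytes then flow to/from chunkservers directly,
// which is why the master does not become a data bottleneck.
func (m *Master) Lookup(file string, offset int64) (ChunkHandle, []string) {
	h := m.fileChunks[file][offset/chunkSize]
	return h, m.locations[h]
}

func appendUnique(addrs []string, addr string) []string {
	for _, a := range addrs {
		if a == addr {
			return addrs
		}
	}
	return append(addrs, addr)
}

func main() {}
```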
One interesting aspect of GFS is that it does not distinguish between normal and abnormal termination, which makes it crash-only software.
Read/Write
When a client writes data, it gets the chunk's replica locations (say, servers A, B, and C) from the master and pushes the data to the closest chunkserver (say, A). A saves the data in an LRU buffer and forwards it to the next closest chunkserver (B) while still receiving it, and B does the same for C; this pipelining minimizes latency and uses each machine's full outbound bandwidth. Once all three chunkservers have buffered the data, the client sends a write request to the primary (say, B, which holds a lease for this chunk from the master). The primary assigns the mutation a serial order and forwards the request to the secondaries (A and C); only after that is the data actually written into the chunk replicas.
Data flow and control flow are decoupled for network efficiency.
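A rough Go sketch of that split (the RPC names here are made up, not GFS's actual interfaces): data is pushed down a pipelined chain of replicas first, and the commit request goes to the primary, which imposes one mutation order on all replicas.

```go
package main

import "io"

// Replica is a chunkserver holding one replica of the chunk being written.
type Replica interface {
	// PushData buffers the data (in an LRU cache, not yet applied to the chunk)
	// while simultaneously forwarding it to the next replica in the chain.
	PushData(writeID string, data io.Reader, rest []Replica) error
	// Apply writes previously pushed data at the given serial position.
	Apply(writeID string, serial int) error
}

// Primary is the replica currently holding the chunk lease from the master.
type Primary interface {
	Replica
	// Write assigns the mutation a serial number, applies it locally, and
	// forwards the same order to the secondaries.
	Write(writeID string, secondaries []Replica) error
}

// WriteChunk is roughly what the client library does for one chunk mutation.
// chain is all replicas ordered by network distance, closest first.
func WriteChunk(chain []Replica, primary Primary, secondaries []Replica, writeID string, data io.Reader) error {
	// 1. Data flow: push to the closest replica; it pipelines to the rest,
	//    so every machine's outbound bandwidth is used as soon as bytes arrive.
	if err := chain[0].PushData(writeID, data, chain[1:]); err != nil {
		return err
	}
	// 2. Control flow: ask the primary to commit; it picks the order and
	//    forwards that order to the secondaries. On failure the client retries.
	return primary.Write(writeID, secondaries)
}

func main() {}
```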
Concurrent writes can partially fail, and GFS does not keep every chunk replica byte-wise identical: a chunk may contain padding or duplicate records (when a record append fails, the client retries it, even though some chunkservers may have already written the record). The client library handles the padding, but leaves deduplication to application code.
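Because record append is at-least-once, application code usually stamps each record with a unique ID and filters duplicates when reading. A small sketch of that consumer-side deduplication (the record format here is an assumption):

```go
package main

import "fmt"

// Record is a hypothetical application-level record stored via record append.
type Record struct {
	ID      string // unique per logical record, e.g. producer ID + sequence number
	Payload []byte
}

// Dedup drops records whose ID has already been seen, in arrival order.
func Dedup(records []Record) []Record {
	seen := make(map[string]bool)
	out := records[:0:0]
	for _, r := range records {
		if seen[r.ID] {
			continue // duplicate left behind by a retried append
		}
		seen[r.ID] = true
		out = append(out, r)
	}
	return out
}

func main() {
	rs := []Record{{ID: "p1-0"}, {ID: "p1-1"}, {ID: "p1-1"}} // one retried append
	fmt.Println(len(Dedup(rs)))                              // 2
}
```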
When clients read data, the chunkserver checksums the blocks covering the read range and compares them against its stored in-memory checksums, so corrupted data is detected before it reaches the client.
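A sketch of that verification, with chunks checksummed in 64 KB blocks as the paper describes (the CRC32 here is an assumption; the paper only says 32-bit checksums):

```go
package main

import (
	"errors"
	"hash/crc32"
)

const blockSize = 64 << 10 // 64 KB checksum blocks

// VerifyRead checks every block overlapping [offset, offset+length) of a chunk
// against its stored checksum and returns the requested bytes only on success.
// It assumes length > 0 and the range lies within the chunk.
func VerifyRead(chunk []byte, checksums []uint32, offset, length int) ([]byte, error) {
	first, last := offset/blockSize, (offset+length-1)/blockSize
	for b := first; b <= last; b++ {
		start := b * blockSize
		end := start + blockSize
		if end > len(chunk) {
			end = len(chunk)
		}
		if crc32.ChecksumIEEE(chunk[start:end]) != checksums[b] {
			// Corrupted block: report it instead of serving it; the client
			// reads another replica and the master re-replicates the chunk.
			return nil, errors.New("checksum mismatch")
		}
	}
	return chunk[offset : offset+length], nil
}

func main() {}
```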
Master operation
The master manages chunk placement, re-replicates chunks according to priority, rebalances replicas, and performs garbage collection. In all of these operations it considers each chunkserver's disk utilization, network bandwidth, and physical (rack) placement.
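A toy sketch of the placement side of that decision (the scoring is illustrative, not the paper's formula): prefer chunkservers with low disk utilization, avoid ones with many recent creations since they are about to receive heavy write traffic, and spread replicas across racks.

```go
package main

import "sort"

// Chunkserver holds the signals the master weighs when placing a replica.
type Chunkserver struct {
	Addr            string
	DiskUtilization float64 // fraction of disk in use
	RecentCreations int     // recent creations predict imminent write traffic
	Rack            string
}

// PickTarget chooses a chunkserver for a new replica, skipping racks that
// already hold a replica of this chunk so replicas stay spread out.
func PickTarget(candidates []Chunkserver, usedRacks map[string]bool) *Chunkserver {
	eligible := []Chunkserver{}
	for _, c := range candidates {
		if !usedRacks[c.Rack] {
			eligible = append(eligible, c)
		}
	}
	if len(eligible) == 0 {
		return nil
	}
	// Lower score is better: mostly disk utilization, penalized by recent creations.
	sort.Slice(eligible, func(i, j int) bool {
		si := eligible[i].DiskUtilization + 0.01*float64(eligible[i].RecentCreations)
		sj := eligible[j].DiskUtilization + 0.01*float64(eligible[j].RecentCreations)
		return si < sj
	})
	return &eligible[0]
}

func main() {}
```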
The default behavior of file deletion is just to rename the file to a hidden name; the chunks are only reclaimed after about three days. This has several benefits:
- safety net for accidental deletion
- simpler and uniform deletion logic: failures are common in a distributed cluster, so an active scan for orphaned chunks is needed even if files were deleted eagerly; relying on that garbage-collection scan alone simplifies the design
The drawback of lazy GC is slow reclamation of disk space, so GFS lets users delete an already-deleted file again to reclaim the storage immediately, and supports different reclamation policies for different parts of the namespace.
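A sketch of the lazy-deletion idea with hypothetical names: deletion only renames the file to a hidden name, and a periodic namespace scan removes hidden files older than the grace period, leaving their chunks as orphans for the regular GC scan to reclaim.

```go
package main

import "time"

const gracePeriod = 3 * 24 * time.Hour // the paper's default is about three days

// Namespace is a toy stand-in for the master's file namespace.
type Namespace struct {
	deletedAt map[string]time.Time // hidden name -> when the file was "deleted"
}

// Delete hides the file instead of reclaiming its chunks right away; the file
// can still be read or undeleted under the hidden name during the grace period.
func (ns *Namespace) Delete(name string) string {
	hidden := ".deleted/" + name
	ns.deletedAt[hidden] = time.Now()
	return hidden
}

// Scan runs periodically on the master: it removes hidden files whose grace
// period has expired, which turns their chunks into orphans that the normal
// garbage-collection scan (driven by chunkserver heartbeats) later reclaims.
func (ns *Namespace) Scan(now time.Time) []string {
	expired := []string{}
	for hidden, deleted := range ns.deletedAt {
		if now.Sub(deleted) > gracePeriod {
			delete(ns.deletedAt, hidden)
			expired = append(expired, hidden)
		}
	}
	return expired
}

func main() {}
```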
Other materials
http://nil.csail.mit.edu/6.824/2020/notes/l-gfs.txt
https://queue.acm.org/detail.cfm?id=1594206
- The single-master design simplified the system and helped it come online in time, but the master eventually became a bottleneck
- When you design both the infrastructure and the applications on top of it, you can push problems back and forth and find the best accommodation
- Using a system in a way it was not designed for is painful (e.g. using GFS for latency-sensitive apps)
- Colossus is the successor to GFS
- Its metadata model is distributed and therefore more scalable
- Colossus supports many different use cases and scales them to exabytes of storage and tens of thousands of machines. It provisions for the peak demand of low-latency workloads and fills idle time with batch analytics
- It is built on many kinds of storage hardware and offers different tiers, specified by I/O, availability, and durability requirements
- Hot data lives in flash, aged data on disk
https://www.systutorials.com/colossus-successor-to-google-file-system-gfs/