Project Title: Distributed MapReduce Framework Implementation in Go
System Overview
This project implements a fault-tolerant, distributed MapReduce framework based on the Google MapReduce paradigm. Developed in Go, the system is designed for cloud deployment, executing across a cluster of AWS EC2 instances. It handles distributed coordination, fault tolerance, and network communication to process large-scale datasets in parallel.
Architecture: Coordinator & Worker Nodes
The system employs a strict Coordinator-Worker architecture, using Go's standard net/rpc (Remote Procedure Call) package for control-plane communication.
- Coordinator: Acts as the central authority, managing the task lifecycle, tracking worker availability, and synchronizing the transition between processing phases.
- Workers: Execute the actual data processing. Notably, workers in this implementation also function as ephemeral file servers for intermediate data.
- Peer-to-Peer Data Transfer: Rather than relying on shared storage (such as S3 or a distributed filesystem like GFS), intermediate data is transferred directly between workers. Map workers store their output locally, and Reduce workers retrieve specific partitions via RPC calls to the source node, as sketched below.
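A minimal sketch of that worker-side file server, assuming Go's standard net/rpc package; the names (WorkerFileServer, FetchPartition) and the mr-<mapID>-<bucket>.json file layout are illustrative assumptions, not the project's actual API:

```go
package worker

import (
	"fmt"
	"net"
	"net/rpc"
	"os"
)

// FetchArgs / FetchReply describe a worker-to-worker request for one
// intermediate partition (types are illustrative, not the project's API).
type FetchArgs struct {
	MapTaskID    int // Map task that produced the partition
	ReduceBucket int // reduce partition to return
}

type FetchReply struct {
	Data []byte // JSON-encoded key/value pairs for the requested bucket
}

// WorkerFileServer exposes locally stored Map output over RPC so Reduce
// workers can pull partitions directly, with no shared storage in between.
type WorkerFileServer struct{}

// FetchPartition reads the locally persisted intermediate file for the
// requested (map task, reduce bucket) pair and returns its raw contents.
func (s *WorkerFileServer) FetchPartition(args *FetchArgs, reply *FetchReply) error {
	path := fmt.Sprintf("mr-%d-%d.json", args.MapTaskID, args.ReduceBucket)
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	reply.Data = data
	return nil
}

// serveIntermediateFiles registers the file server and accepts RPC
// connections from Reduce workers in the background.
func serveIntermediateFiles(addr string) error {
	srv := rpc.NewServer()
	if err := srv.Register(&WorkerFileServer{}); err != nil {
		return err
	}
	listener, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	go srv.Accept(listener)
	return nil
}
```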
Implementation Details
1. The Map Phase
The Coordinator divides input data into discrete Map tasks. Assigned workers perform the following operations:
- Execution: The user-defined Map function is applied to input files to generate key-value pairs.
- Partitioning: Output pairs are distributed into NReduce buckets via hash partitioning (ihash(key) % NReduce), so every pair with the same key lands in the same bucket.
- Serialization: Intermediate partitions are serialized as JSON and persisted to the worker's local storage, ready for retrieval.
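As a concrete illustration of the Map-side steps above, the sketch below applies hash partitioning and JSON serialization; the FNV-1a choice for ihash and the mr-<mapID>-<bucket>.json naming are assumptions made for the example rather than details confirmed by the project:

```go
package worker

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
	"os"
)

// KeyValue is the classic MapReduce intermediate pair.
type KeyValue struct {
	Key   string
	Value string
}

// ihash maps a key to a non-negative integer; masking to 31 bits keeps
// the subsequent modulo well-defined.
func ihash(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() & 0x7fffffff)
}

// writeMapOutput partitions Map output into nReduce buckets and persists
// each bucket as a JSON stream on local disk (mr-<mapID>-<bucket>.json),
// ready to be pulled by Reduce workers.
func writeMapOutput(mapID, nReduce int, kvs []KeyValue) error {
	buckets := make([][]KeyValue, nReduce)
	for _, kv := range kvs {
		b := ihash(kv.Key) % nReduce
		buckets[b] = append(buckets[b], kv)
	}
	for b, bucket := range buckets {
		f, err := os.Create(fmt.Sprintf("mr-%d-%d.json", mapID, b))
		if err != nil {
			return err
		}
		enc := json.NewEncoder(f)
		for _, kv := range bucket {
			if err := enc.Encode(&kv); err != nil {
				f.Close()
				return err
			}
		}
		if err := f.Close(); err != nil {
			return err
		}
	}
	return nil
}
```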
2. The Reduce Phase
Upon completion of all Map tasks, the system transitions to the Reduce phase:
- Shuffle Mechanism: Reduce workers query the Coordinator to locate intermediate files and pull the data directly from the relevant Map workers.
- Aggregation: Data is sorted by key and processed by the Reduce function.
- Atomic Commits: To ensure output consistency, results are written to temporary files first. Only upon successful completion are they atomically renamed to the final output file, preventing corrupted data in the event of a crash.
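The temp-file-then-rename commit can be sketched as follows (reusing the KeyValue type from the Map-phase example; the mr-out-<reduceID> output name and the grouping loop are illustrative):

```go
package worker

import (
	"fmt"
	"os"
	"sort"
)

// KeyValue is defined in the Map-phase sketch above.

// writeReduceOutput sorts the pulled intermediate pairs by key, applies the
// user-defined Reduce function to each key group, and commits the result
// atomically: output goes to a temporary file first and is renamed only
// after a successful write, so a crash never leaves a partial output file.
func writeReduceOutput(reduceID int, kvs []KeyValue, reduceFn func(key string, values []string) string) error {
	sort.Slice(kvs, func(i, j int) bool { return kvs[i].Key < kvs[j].Key })

	tmp, err := os.CreateTemp(".", "mr-out-tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op once the rename below has succeeded

	// Group consecutive pairs sharing a key and reduce each group.
	for i := 0; i < len(kvs); {
		j := i
		var values []string
		for j < len(kvs) && kvs[j].Key == kvs[i].Key {
			values = append(values, kvs[j].Value)
			j++
		}
		if _, err := fmt.Fprintf(tmp, "%v %v\n", kvs[i].Key, reduceFn(kvs[i].Key, values)); err != nil {
			tmp.Close()
			return err
		}
		i = j
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Atomic rename: readers never observe a half-written output file.
	return os.Rename(tmp.Name(), fmt.Sprintf("mr-out-%d", reduceID))
}
```

Because os.Rename is atomic on POSIX filesystems when source and destination sit on the same volume, a crash mid-write leaves only an unreferenced temporary file, never a truncated output.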
Fault Tolerance & Reliability
Given the unreliability of distributed environments, the system implements robust failure handling mechanisms:
- Liveness Detection: The Coordinator tracks the status of every task. If a worker fails to report completion within a 10-second threshold, that worker is presumed dead and its task is reclaimed.
- Task Re-assignment: Failed tasks are reset to an "Idle" state and immediately reassigned to the next available worker.
- Idempotency: The system handles "phantom" timeouts (where a slow worker finishes after a task has been reassigned) by ensuring that atomic file operations overwrite previous attempts cleanly, maintaining output consistency.
- Concurrency Control: A sync.Mutex within the Coordinator ensures thread-safe state management during concurrent worker requests.
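A condensed sketch of how the timeout, re-assignment, and mutex pieces fit together in the Coordinator; the TaskState names and the reapStragglers helper are illustrative, not the project's actual identifiers:

```go
package coordinator

import (
	"sync"
	"time"
)

// Task states tracked by the Coordinator (names are illustrative).
type TaskState int

const (
	Idle TaskState = iota
	InProgress
	Completed
)

type Task struct {
	ID        int
	State     TaskState
	StartedAt time.Time
}

// Coordinator guards its task table with a mutex so that concurrent worker
// RPCs (task requests, completion reports) always see consistent state.
type Coordinator struct {
	mu    sync.Mutex
	tasks []Task
}

// reapStragglers is intended to run periodically: any In-Progress task that
// has not reported completion within the timeout (10 seconds in this
// project) is assumed lost and reset to Idle so it can be reassigned to the
// next available worker.
func (c *Coordinator) reapStragglers(timeout time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for i := range c.tasks {
		t := &c.tasks[i]
		if t.State == InProgress && time.Since(t.StartedAt) > timeout {
			t.State = Idle
		}
	}
}
```

In a full implementation this check would typically run from a background goroutine on a ticker, with the timeout set to the 10-second threshold described above; because every mutation happens under c.mu, it composes safely with the RPC handlers that assign tasks and record completions.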
Key Technical Competencies
- Horizontal Scalability: Validated deployment and orchestration across multiple AWS EC2 instances.
- Network Resilience: Error handling for the RPC failures and network latency inherent in cloud infrastructure (see the retry sketch below).
- State Synchronization: Complex management of distributed state transitions (Idle → In-Progress → Completed) across independent nodes.
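A hedged sketch of what such RPC error handling can look like with Go's standard net/rpc client; callWithRetry and its linear backoff are assumptions made for illustration rather than the project's actual helper:

```go
package worker

import (
	"net/rpc"
	"time"
)

// callWithRetry dials the remote node, issues the RPC, and retries with a
// simple linear backoff when the dial or the call fails. The method name
// and argument types are supplied by the caller; nothing here is specific
// to this project's actual RPC surface.
func callWithRetry(addr, method string, args, reply interface{}, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		client, err := rpc.Dial("tcp", addr)
		if err != nil {
			lastErr = err
			time.Sleep(time.Duration(i+1) * 500 * time.Millisecond)
			continue
		}
		err = client.Call(method, args, reply)
		client.Close()
		if err == nil {
			return nil
		}
		lastErr = err
		time.Sleep(time.Duration(i+1) * 500 * time.Millisecond)
	}
	return lastErr
}
```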