
Distributed Systems Course Project

Distributed Systems · Go · Raft · Consensus · Backend

Overview

As part of my Software Engineering curriculum at CODE University of Applied Sciences, I designed and implemented a distributed key-value store that demonstrates core concepts of distributed systems, including consensus, replication, and fault tolerance.

System Design

Architecture

The system consists of:

  • Multiple server nodes forming a cluster
  • Raft consensus algorithm for leader election and log replication
  • Client library for read/write operations
  • Monitoring and debugging tools

Key Features

  • Fault Tolerance: System remains operational even when a minority of nodes fail
  • Consistency: Strong consistency guarantees through Raft consensus
  • Scalability: Horizontal scaling by adding more nodes
  • Partition Tolerance: Handles network partitions gracefully

Implementation Details

Raft Consensus Algorithm

Implemented the complete Raft protocol, including:

  • Leader election
  • Log replication
  • Safety guarantees
  • Log compaction

The core per-node state follows the Raft paper's description:

type RaftNode struct {
    id          int        // unique identifier of this node within the cluster
    state       NodeState  // Follower, Candidate, or Leader
    currentTerm int        // latest term this node has seen
    votedFor    *int       // candidate voted for in currentTerm; nil if none
    log         []LogEntry // replicated log of state-machine commands
    commitIndex int        // highest log index known to be committed
    lastApplied int        // highest log index applied to the state machine
}
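
To give a sense of how leader election is driven, here is a minimal sketch of the follower's election timer (using the standard time and math/rand packages). The heartbeat channel and becomeCandidate callback are illustrative assumptions, not part of the struct above:

func electionTimeout() time.Duration {
    // Randomized 150-300 ms timeout, as recommended by the Raft paper, so
    // that followers rarely time out simultaneously and split the vote.
    return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

func runElectionTimer(heartbeat <-chan struct{}, becomeCandidate func()) {
    timer := time.NewTimer(electionTimeout())
    defer timer.Stop()
    for {
        select {
        case <-heartbeat:
            // Leader contact observed: push the election deadline back.
            if !timer.Stop() {
                <-timer.C
            }
            timer.Reset(electionTimeout())
        case <-timer.C:
            // No leader heard from: become a candidate and request votes.
            becomeCandidate()
            timer.Reset(electionTimeout())
        }
    }
}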

Network Communication

  • gRPC for inter-node communication (message shapes and transport interface sketched after this list)
  • Protocol buffers for message serialization
  • Efficient batching for log replication
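
As a rough illustration of what travels between nodes, the AppendEntries exchange carries batched log entries, and the consensus core talks to gRPC through a small transport interface so it can be exercised against an in-memory mock. The Go structs below only mirror the Raft paper's RPC fields; the project's actual messages are generated from a .proto file and may differ (context is the standard context package):

type LogEntry struct {
    Term    int
    Command []byte
}

// Mirrors the AppendEntries RPC from the Raft paper; in the project these
// messages are defined in Protocol Buffers and generated into Go.
type AppendEntriesRequest struct {
    Term         int
    LeaderID     int
    PrevLogIndex int
    PrevLogTerm  int
    Entries      []LogEntry // many entries batched into a single RPC
    LeaderCommit int
}

type AppendEntriesResponse struct {
    Term    int
    Success bool
}

// Transport hides the gRPC client behind an interface so the consensus
// logic can be tested without a real network (see the testing section).
type Transport interface {
    AppendEntries(ctx context.Context, peer int, req *AppendEntriesRequest) (*AppendEntriesResponse, error)
}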

Storage Layer

  • In-memory data store with periodic snapshots
  • Write-ahead logging for durability (sketched below)
  • Configurable persistence options
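
A minimal sketch of the write-ahead log idea, reusing the LogEntry type sketched above and the standard os and encoding/json packages (not the project's actual storage code): every entry is serialized and fsynced before it is acknowledged, so committed writes survive a crash and the in-memory store can be rebuilt on restart.

type WAL struct {
    f *os.File
}

func OpenWAL(path string) (*WAL, error) {
    f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
    if err != nil {
        return nil, err
    }
    return &WAL{f: f}, nil
}

// Append durably records a log entry before it is applied to the
// in-memory key-value store.
func (w *WAL) Append(e LogEntry) error {
    b, err := json.Marshal(e)
    if err != nil {
        return err
    }
    if _, err := w.f.Write(append(b, '\n')); err != nil {
        return err
    }
    return w.f.Sync() // flush to disk so the entry survives a crash
}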

Testing Strategy

Unit Tests

  • Individual component testing
  • Mock network layer for isolation
  • Edge case coverage (example test below)
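
To illustrate the style of these tests, here is a table-driven example against a simplified voting rule, using Go's standard testing package. The helper is a stand-in for the real RequestVote handler and omits the log up-to-dateness check:

// shouldGrantVote is a simplified stand-in for the RequestVote handler:
// grant only if the candidate's term is not stale and we have not already
// voted for a different candidate in this term.
func shouldGrantVote(currentTerm, candidateTerm int, votedFor *int, candidateID int) bool {
    if candidateTerm < currentTerm {
        return false
    }
    return votedFor == nil || *votedFor == candidateID
}

func TestShouldGrantVote(t *testing.T) {
    other := 2
    cases := []struct {
        name          string
        currentTerm   int
        candidateTerm int
        votedFor      *int
        want          bool
    }{
        {"stale term rejected", 5, 4, nil, false},
        {"first vote in term granted", 5, 5, nil, true},
        {"already voted for another node", 5, 5, &other, false},
    }
    for _, c := range cases {
        if got := shouldGrantVote(c.currentTerm, c.candidateTerm, c.votedFor, 1); got != c.want {
            t.Errorf("%s: got %v, want %v", c.name, got, c.want)
        }
    }
}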

Integration Tests

  • Multi-node cluster testing
  • Network partition simulation
  • Leader failure scenarios

Chaos Engineering

  • Random node failures
  • Network latency injection
  • Message dropping simulation (sketched below)
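
One way to inject such faults is to wrap the transport in a decorator that randomly drops RPCs. This is a hypothetical sketch building on the Transport interface above, using the standard errors and math/rand packages; the project's chaos harness may differ:

// flakyTransport wraps a real Transport and loses a configurable fraction
// of messages, simulating an unreliable network.
type flakyTransport struct {
    inner    Transport
    dropRate float64 // probability in [0, 1] that an RPC is dropped
}

var errDropped = errors.New("chaos: message dropped")

func (f *flakyTransport) AppendEntries(ctx context.Context, peer int, req *AppendEntriesRequest) (*AppendEntriesResponse, error) {
    if rand.Float64() < f.dropRate {
        return nil, errDropped // the caller sees this as a lost packet
    }
    return f.inner.AppendEntries(ctx, peer, req)
}

Latency injection works the same way, sleeping for a random duration before delegating to the inner transport.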

Performance Metrics

  • Throughput: 10,000+ operations/second
  • Latency: <10ms for committed writes
  • Recovery Time: <5 seconds after leader failure
  • Consistency: no consistency violations observed in any tested scenario

Technologies Used

  • Go for high-performance concurrent programming
  • gRPC for RPC communication
  • Protocol Buffers for serialization
  • Docker for deployment and testing
  • Prometheus for metrics

Challenges Overcome

Timing Issues

Timing-dependent behavior makes distributed systems notoriously difficult to debug:

  • Implemented comprehensive logging
  • Built visualization tools for state machine
  • Created a deterministic test framework (clock abstraction sketched below)
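
The core trick behind deterministic tests is abstracting time so that timeouts are fired explicitly instead of waiting on real sleeps. A hypothetical sketch of that approach (names are illustrative, not the project's actual API):

// Clock abstracts time so tests can trigger timeouts deterministically.
type Clock interface {
    Now() time.Time
    After(d time.Duration) <-chan time.Time
}

// realClock is used in production.
type realClock struct{}

func (realClock) Now() time.Time                         { return time.Now() }
func (realClock) After(d time.Duration) <-chan time.Time { return time.After(d) }

// fakeClock lets a test fire a pending timer on demand by sending on the
// channel it hands out, e.g. to force an election timeout at a known point.
type fakeClock struct {
    now   time.Time
    fired chan time.Time
}

func (f *fakeClock) Now() time.Time                         { return f.now }
func (f *fakeClock) After(d time.Duration) <-chan time.Time { return f.fired }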

Network Partitions

Handling split-brain scenarios:

  • Implemented proper majority-quorum logic (sketched below)
  • Added fencing mechanisms
  • Thorough testing of edge cases
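
The quorum rule itself is small, but it is what rules out split-brain: two disjoint partitions can never both contain a majority. A minimal sketch:

// quorum returns the number of nodes that must agree before an entry is
// committed or a leader is elected: a strict majority of the cluster.
// For a 5-node cluster this is 3, so the system tolerates 2 failures.
func quorum(clusterSize int) int {
    return clusterSize/2 + 1
}

// hasQuorum reports whether the acknowledgements received so far (the
// leader counts itself) are enough to commit.
func hasQuorum(acks, clusterSize int) bool {
    return acks >= quorum(clusterSize)
}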

Performance Optimization

Balancing consistency and performance:

  • Batching of log entries (sketched after this list)
  • Pipelining of network operations
  • Efficient state machine implementation
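
As a sketch of the batching optimization (channel and helper names are illustrative, not the project's actual API, and LogEntry is the type sketched earlier): instead of one RPC per log entry, the leader drains whatever has accumulated and ships it in a single AppendEntries call, with a ticker bounding the extra latency.

func replicateLoop(newEntries <-chan LogEntry, send func([]LogEntry), maxBatch int, flushEvery time.Duration) {
    ticker := time.NewTicker(flushEvery)
    defer ticker.Stop()

    var batch []LogEntry
    flush := func() {
        if len(batch) > 0 {
            send(batch) // one AppendEntries RPC carrying the whole batch
            batch = nil
        }
    }

    for {
        select {
        case e, ok := <-newEntries:
            if !ok {
                flush()
                return
            }
            batch = append(batch, e)
            if len(batch) >= maxBatch {
                flush()
            }
        case <-ticker.C:
            flush() // bound the latency added by batching
        }
    }
}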

Learning Outcomes

This project provided deep insights into:

  • Consensus algorithms and their practical challenges
  • Trade-offs in distributed systems (CAP theorem)
  • Importance of testing in distributed environments
  • Performance optimization in concurrent systems

Future Enhancements

  • Multi-Raft for sharding
  • Read-only replicas for scaling reads
  • Dynamic membership changes
  • Compression and more efficient snapshot mechanisms

Code Repository

The project is open-source and available on GitHub with comprehensive documentation and examples.