Redundancy
Redundancy is the duplication of parts of a system with the intention of increasing its reliability. If you had only one copy of a critical key, you would want a duplicate (redundant) copy available in case it ever got lost. Similarly, we should deploy our systems with multiple replicas so that traffic and workloads can still be processed if a single node fails.
tl;dr Distributed systems should keep multiple replicas of their components to remove single points of failure and provide backups in a crisis.
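To make the idea concrete, here is a minimal sketch of how redundancy removes a single point of failure from the client's point of view. It assumes each replica can be modelled as a plain Python callable and that a failed node raises a hypothetical `ReplicaDown` exception; in a real deployment the replicas would be network endpoints, usually behind a load balancer.

```python
"""Minimal sketch of redundancy: a client that fails over between replicas.

Assumption (illustrative only): each replica is a plain Python callable and a
failed node raises the hypothetical ReplicaDown exception.
"""
from typing import Callable, List, Sequence


class ReplicaDown(Exception):
    """Raised by a (hypothetical) replica that cannot serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each redundant replica in turn and return the first successful answer."""
    errors: List[ReplicaDown] = []
    for replica in replicas:
        try:
            return replica(request)
        except ReplicaDown as exc:
            errors.append(exc)  # remember the failure and move on to the next copy
    # The request only fails if every redundant copy has failed.
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


def healthy(request: str) -> str:
    return f"handled {request}"


def crashed(request: str) -> str:
    raise ReplicaDown("node unreachable")


if __name__ == "__main__":
    # One failed node does not take the service down: the second replica answers.
    print(call_with_failover([crashed, healthy], "GET /orders"))
```

With a single replica, the same crash would surface as an error to the caller; adding a second copy is what turns a node failure into a non-event.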
Replication
Replication is a related but slightly different concept. This Stack Overflow answer puts it well:
Replication is the synchronization of state between redundant nodes.
Synchronizing data between redundant nodes helps eliminate single points of failure in stateful systems. When a network partition occurs, however, we must choose whether to design for consistency or for availability, the trade-off described by the CAP theorem.
- Consistency: The service stops accepting or forwarding traffic until the network partition is resolved. This guarantees that all nodes hold exactly the same state, a property known as strong consistency.
- Availability: The service keeps accepting traffic and replicating data to the nodes it can reach, without waiting for the unreachable node to reconnect. In the meantime, nodes may hold different states, and the unreachable node in particular falls behind. Once it is reachable again, the up-to-date data is replicated to it, so all nodes converge over time. This is known as eventual consistency.
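The two choices above can be sketched as a toy in-memory cluster. This is an illustration under simplifying assumptions, not how any particular database behaves: every node stores a plain dict, a partition is simulated by marking a node as unreachable, and (in availability mode) the writes a cut-off node misses are queued and replayed when the partition heals. The class and method names are hypothetical.

```python
"""Toy model of the consistency-vs-availability choice during a partition.

Assumptions (for illustration only): every node stores a plain dict, the
partition is simulated with a set of unreachable node ids, and writes missed
by an unreachable node are queued and replayed when it reconnects.
"""
from typing import Dict, List, Set, Tuple


class Unavailable(Exception):
    """Raised in consistency mode while the cluster is partitioned."""


class Cluster:
    def __init__(self, node_count: int, mode: str) -> None:
        if mode not in ("consistency", "availability"):
            raise ValueError("mode must be 'consistency' or 'availability'")
        self.mode = mode
        self.stores: List[Dict[str, str]] = [{} for _ in range(node_count)]
        self.unreachable: Set[int] = set()
        # Writes each cut-off node has missed, replayed when the partition heals.
        self.missed: Dict[int, List[Tuple[str, str]]] = {}

    def partition(self, node_id: int) -> None:
        self.unreachable.add(node_id)

    def write(self, key: str, value: str) -> None:
        if self.unreachable and self.mode == "consistency":
            # Refuse the write rather than let the nodes diverge.
            raise Unavailable("partition in progress; try again later")
        for node_id, store in enumerate(self.stores):
            if node_id in self.unreachable:
                self.missed.setdefault(node_id, []).append((key, value))
            else:
                store[key] = value

    def heal(self) -> None:
        # The partition is resolved: replay missed writes, in order, on lagging nodes.
        for node_id in self.unreachable:
            for key, value in self.missed.pop(node_id, []):
                self.stores[node_id][key] = value
        self.unreachable.clear()

    def in_sync(self) -> bool:
        return all(store == self.stores[0] for store in self.stores)


if __name__ == "__main__":
    ap = Cluster(3, "availability")
    ap.partition(2)
    ap.write("user:1", "alice")  # accepted, but node 2 is now stale
    print(ap.in_sync())          # False
    ap.heal()
    print(ap.in_sync())          # True: the nodes have converged

    cp = Cluster(3, "consistency")
    cp.partition(2)
    try:
        cp.write("user:1", "alice")
    except Unavailable as exc:
        print("rejected:", exc)
    print(cp.in_sync())          # True: nothing was written anywhere
```

The availability cluster answers every request but is briefly out of sync; the consistency cluster never diverges but turns clients away until the partition is resolved.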
Replication is a surprisingly deep topic. When designing scalable systems, you may need to consider which replication model suits your use case, and how the nodes in a system reach consensus on the correct state. To learn more, check out my articles on distributed caching and consensus algorithms!
Figure: Redundancy vs. replication (top: redundancy, bottom: replication)
Replication is widely used in database management systems (DBMS), usually with a primary-replica relationship between the original and its copies. The primary server receives all the updates, which then ripple through to the replica servers. Each replica sends back an acknowledgment once it has received an update successfully, which allows the primary to send subsequent updates.
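The primary-replica flow can be sketched in a few lines. This is a simplified, synchronous model under assumptions of our own, not the protocol of any specific DBMS: replicas are in-process objects, each update carries a sequence number, and the "acknowledgment" is simply the sequence number the replica has just applied. The `Primary` and `Replica` classes are hypothetical.

```python
"""Minimal sketch of primary-replica replication with acknowledgments.

Assumptions (illustrative, not any particular DBMS): replication is
synchronous, replicas are in-process objects, and an acknowledgment is just
the sequence number the replica has applied.
"""
from typing import Dict, List


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, str] = {}
        self.last_applied = 0  # sequence number of the newest update applied

    def apply(self, seq: int, key: str, value: str) -> int:
        # Updates must arrive in order; a gap would mean a missed update.
        if seq != self.last_applied + 1:
            raise ValueError(f"{self.name}: expected update {self.last_applied + 1}, got {seq}")
        self.data[key] = value
        self.last_applied = seq
        return seq  # the acknowledgment sent back to the primary


class Primary:
    def __init__(self, replicas: List[Replica]) -> None:
        self.data: Dict[str, str] = {}
        self.replicas = replicas
        self.seq = 0

    def write(self, key: str, value: str) -> None:
        # All updates go to the primary first...
        self.seq += 1
        self.data[key] = value
        # ...and then ripple out to every replica, one acknowledged step at a time.
        for replica in self.replicas:
            ack = replica.apply(self.seq, key, value)
            # Only move on once the replica has confirmed this exact update.
            if ack != self.seq:
                raise RuntimeError(f"{replica.name} did not acknowledge update {self.seq}")


if __name__ == "__main__":
    replicas = [Replica("replica-1"), Replica("replica-2")]
    primary = Primary(replicas)
    primary.write("balance:42", "100")
    primary.write("balance:42", "75")
    print([(r.name, r.last_applied, r.data) for r in replicas])
```

Waiting for every acknowledgment keeps the replicas in lockstep with the primary at the cost of write latency; many databases also offer asynchronous modes in which the primary does not wait, trading the durability of the most recent writes for speed.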