Global State and Gossip

CS 240: Computing Systems and Concurrency, Lecture 6
Marco Canini
Credits: Indranil Gupta developed much of the original material.

Today
1. Global snapshot of a distributed system
2. Chandy-Lamport's algorithm
3. Gossip

Distributed snapshot
- Let's think of this as a picture of all servers and their states comprising a distributed system.
- How do you calculate a global snapshot in a distributed system?
- What does a global snapshot even mean?
- Why is the ability to obtain a global snapshot important?

Some uses of a global system snapshot
- Checkpointing: can restart the distributed system on failure.
- Garbage collection of objects: reclaim objects at servers that don't have any other objects (at any servers) with references to them.
- Deadlock detection: useful in database transaction systems.
- Termination of computation: useful in batch computing systems.
- Debugging: useful to inspect the global state of the system.

What's a global snapshot?
- Global Snapshot = Global State = individual state of each process in the distributed system + individual state of each communication channel in the distributed system.
- Capture the instantaneous state of each process.
- And the instantaneous state of each communication channel, i.e., the messages in transit on the channels.

A strawman solution
- Synchronize the clocks of all processes.
- Ask all processes to record their states at a known time t.
- Problems?
  - Time synchronization always has error. Your bank might inform you, "We lost the state of our distributed cluster due to a 1 ms clock skew in our snapshot algorithm."
  - It also does not record the state of messages in the channels.
- Again: synchronization is not required; causality is enough!

Example
[Figure: two processes Pi and Pj, connected by channel Cij (from Pi to Pj) and channel Cji (from Pj to Pi).]

Global Snapshot 0: Pi = [$1000, 100 iPhones]; Cij = [empty]; Cji = [empty]; Pj = [$600, 50 Androids]

Global Snapshot 1: Pi = [$701, 100 iPhones]; Cij = [$299, Order Android]; Cji = [empty]; Pj = [$600, 50 Androids]

Global Snapshot 2: Pi = [$701, 100 iPhones]; Cij = [$299, Order Android]; Cji = [$499, Order iPhone]; Pj = [$101, 50 Androids]

Global Snapshot 3: Pi = [$1200, 1 iPhone order from Pj, 100 iPhones]; Cij = [$299, Order Android]; Cji = [empty]; Pj = [$101, 50 Androids]

Global Snapshot 4: Pi = [$1200, 99 iPhones]; Cij = [($299, Order Android), (1 iPhone)]; Cji = [empty]; Pj = [$101, 50 Androids]

Global Snapshot 5: Pi = [$1200, 99 iPhones]; Cij = [(1 iPhone)]; Cji = [empty]; Pj = [$400, 1 Android order from Pi, 50 Androids]

Global Snapshot 6: Pi = [$1200, 99 iPhones]; Cij = [empty]; Cji = [empty]; Pj = [$400, 1 Android order from Pi, 50 Androids, 1 iPhone]

... and so on.
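A small check, offered as an illustration rather than as part of the original slides: if the dollar amounts from Global Snapshots 0 to 6 are tabulated, the total money in the system stays constant only when the channel contents are counted. The tuple layout and function names below are assumptions made for this sketch; the dollar figures are the ones in the example.

```python
# Money held at (Pi, channel Cij, channel Cji, Pj) in Global Snapshots 0..6.
# The amounts come from the example; the tuple layout is an assumption of this sketch.
snapshots = [
    (1000, 0, 0, 600),     # Global Snapshot 0
    (701, 299, 0, 600),    # Global Snapshot 1: $299 in transit on Cij
    (701, 299, 499, 101),  # Global Snapshot 2: $499 in transit on Cji
    (1200, 299, 0, 101),   # Global Snapshot 3
    (1200, 299, 0, 101),   # Global Snapshot 4
    (1200, 0, 0, 400),     # Global Snapshot 5
    (1200, 0, 0, 400),     # Global Snapshot 6
]

def total_with_channels(snapshot):
    """Money at both processes plus money in transit on both channels."""
    return sum(snapshot)

def total_processes_only(snapshot):
    """Money at the two processes only, ignoring the channels."""
    pi, _cij, _cji, pj = snapshot
    return pi + pj

for number, snapshot in enumerate(snapshots):
    print(number, total_with_channels(snapshot), total_processes_only(snapshot))
```

Counting only process states makes money appear to vanish in snapshots 1 to 4, which is one way to see why a global snapshot has to include channel state.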

Moving from State to State
- Whenever an event happens anywhere in the system, the global state changes:
  - Process receives a message
  - Process sends a message
  - Process takes a step
- State-to-state movement obeys causality.
- Next: a causal algorithm for Global Snapshot calculation.

Today
1. Global snapshot of a distributed system
2. Chandy-Lamport's algorithm
3. Gossip

System Model
- Problem: record a global snapshot (state for each process, and state for each channel).
- System model:
  - N processes in the system.
  - There are two unidirectional communication channels between each process pair: Pj → Pi and Pi → Pj.
  - Communication channels are FIFO-ordered (first in, first out).
  - No failures.
  - All messages arrive intact and are not duplicated.
- Other papers later relaxed some of these assumptions.

Requirements
- The snapshot should not interfere with normal application actions, and it should not require the application to stop sending messages.
- Each process is able to record its own state.
  - Process state: application-defined state or, in the worst case, its heap, registers, program counter, code, etc. (essentially the coredump).
- Global state is collected in a distributed manner.
- Any process may initiate the snapshot.
  - We'll assume just one snapshot run for now.

Chandy-Lamport Global Snapshot Algorithm
- First: initiator Pi records its own state.
- The initiator process creates special messages called "Marker" messages.
  - A Marker is not an application message and does not interfere with application messages.
- for j = 1 to N except i:
    Pi sends out a Marker message on outgoing channel Cij ((N-1) channels)
- Pi starts recording the incoming messages on each of its incoming channels Cji (for j = 1 to N except i).

Chandy-Lamport Global Snapshot Algorithm (2)
Whenever a process Pi receives a Marker message on an incoming channel Cki:
  if (this is the first Marker Pi is seeing):
    Pi records its own state first
    Pi marks the state of channel Cki as "empty"
    for j = 1 to N except i:
      Pi sends out a Marker message on outgoing channel Cij
    Pi starts recording the incoming messages on each of its incoming channels Cji (for j = 1 to N except i and k)
  else (Pi has already seen a Marker message):
    Pi marks the state of channel Cki as all the messages that have arrived on it since recording was turned on for Cki

Chandy-Lamport Global Snapshot Algorithm (3)
The algorithm terminates when:
- All processes have received a Marker (to record their own state).
- All processes have received a Marker on all of their (N-1) incoming channels (to record the state of all channels).
Then, if needed, a central server collects all these partial state pieces to obtain the full global snapshot.
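The three rules above (initiator, first Marker, duplicate Marker) can be sketched as a small single-threaded simulation. This is a minimal sketch under the slides' assumptions (FIFO channels, no failures, one snapshot run), not a reference implementation: the `Network` and `Process` classes, the bank-balance application state, and the seeded random delivery order are all hypothetical choices made for illustration.

```python
import random
from collections import deque

MARKER = object()  # sentinel for Marker messages (never an application message)

class Network:
    """One FIFO queue per ordered pair of processes; no loss, no duplication."""
    def __init__(self, pids):
        self.channels = {(s, d): deque() for s in pids for d in pids if s != d}

    def send(self, src, dst, msg):
        self.channels[(src, dst)].append(msg)

    def deliver(self, src, dst, procs):
        msg = self.channels[(src, dst)].popleft()
        procs[dst].receive(src, msg, self)

class Process:
    def __init__(self, pid, peers, balance):
        self.pid, self.peers, self.balance = pid, peers, balance
        self.recorded_state = None   # local snapshot, once taken
        self.recording = {}          # src -> messages seen while recording C[src][pid]
        self.channel_state = {}      # src -> final recorded state of C[src][pid]

    def transfer(self, dst, amount, net):
        self.balance -= amount
        net.send(self.pid, dst, amount)

    def _record(self, net, marker_from=None):
        self.recorded_state = self.balance
        for src in self.peers:
            if src == marker_from:
                self.channel_state[src] = []   # Marker came first: channel is empty
            else:
                self.recording[src] = []       # start recording this incoming channel
        for dst in self.peers:
            net.send(self.pid, dst, MARKER)

    def initiate(self, net):
        self._record(net)

    def receive(self, src, msg, net):
        if msg is MARKER:
            if self.recorded_state is None:    # first Marker seen
                self._record(net, marker_from=src)
            else:                              # duplicate Marker: close the channel
                self.channel_state[src] = self.recording.pop(src)
        else:
            self.balance += msg                # normal application step
            if src in self.recording:
                self.recording[src].append(msg)

    def done(self):
        return (self.recorded_state is not None
                and len(self.channel_state) == len(self.peers))

def run_demo(seed=0):
    rng = random.Random(seed)
    pids = [1, 2, 3]
    net = Network(pids)
    procs = {p: Process(p, [q for q in pids if q != p], 100) for p in pids}
    procs[2].transfer(1, 10, net)   # already in flight when the snapshot starts
    procs[1].initiate(net)
    procs[3].transfer(2, 5, net)    # the application keeps running
    while True:
        ready = [ch for ch, q in net.channels.items() if q]
        if not ready:
            break
        src, dst = rng.choice(ready)   # arbitrary interleaving, FIFO per channel
        net.deliver(src, dst, procs)
    in_processes = sum(p.recorded_state for p in procs.values())
    in_channels = sum(sum(m) for p in procs.values() for m in p.channel_state.values())
    return in_processes, in_channels, all(p.done() for p in procs.values())

print(run_demo())
```

In this toy setting the recorded process balances plus the recorded in-channel transfers should always add up to the initial 300, whatever delivery order the seed produces, while the application transfers are never paused.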

Example
[Figure: timelines of three processes. P1 has events A, B, C, D, E; P2 has events E, F, G; P3 has events H, I, J. Points on a timeline are instructions or steps; arrows between timelines are messages.]

Steps of the run:
1. P1 is the initiator: it records its local state S1, sends out Markers, and turns on recording on channels C21 and C31.
2. P3 receives its first Marker (on C13): it records its own state as S3, marks the C13 state as empty (C13 = < >), turns on recording on its other incoming channel C23, and sends out Markers.
3. P1 receives a duplicate Marker on C31: state of channel C31 = < >.
4. P2 receives its first Marker (on C32): it records its own state as S2, marks the C32 state as empty (C32 = < >), turns on recording on C12, and sends out Markers.
5. P2 receives a duplicate Marker on C12: C12 = < >.
6. P1 receives a duplicate Marker on C21: C21 = <message G→D>.
7. P3 receives a duplicate Marker on C23: C23 = < >.
8. The algorithm has terminated. Recorded pieces: S1, S2, S3, C12 = < >, C13 = < >, C21 = <message G→D>, C23 = < >, C31 = < >, C32 = < >.
9. Collect the global snapshot pieces.

Next
- The Global Snapshot calculated by the Chandy-Lamport algorithm is causally correct.
- What?

Cuts
- Cut = time frontier at each process and at each channel.
- Events at the process/channel that happen before the cut are "in the cut".
- Events happening after the cut are "out of the cut".

Consistent Cuts
- Consistent cut: a cut that obeys causality.
- Cut C is a consistent cut if and only if:
  for (each pair of events e, f in the system)
    such that event e is in the cut C, and f → e (f happens-before e),
    then event f is also in the cut C.

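Example
[Figure: timelines for P1 (A, B, C, D, E), P2 (E, F, G) and P3 (H, I, J), with two cuts drawn across them. One is a consistent cut. The other is an inconsistent cut: G → D, but only D is in the cut.]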
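The consistent-cut definition translates almost directly into a check over happens-before edges. The sketch below is illustrative only: the function name and the reduced event set (P1's A to D and P2's F, G) are assumptions; the message edge G → D is the one named in the example. Checking direct edges is enough, because if every direct predecessor of an in-cut event is in the cut, all transitive predecessors are too.

```python
def is_consistent_cut(cut, happens_before):
    """Return True if, for every edge f -> e with e in the cut, f is in the cut too."""
    return all(f in cut for (f, e) in happens_before if e in cut)

# Direct happens-before edges: program order on each process plus one message.
edges = {
    ("A", "B"), ("B", "C"), ("C", "D"),  # program order on P1
    ("F", "G"),                          # program order on P2
    ("G", "D"),                          # message sent at G, received at D
}

consistent = {"A", "B", "C", "F", "G"}    # G is in the cut, D is not: message in transit
inconsistent = {"A", "B", "C", "D", "F"}  # D is in the cut, but its cause G is not

print(is_consistent_cut(consistent, edges))    # True
print(is_consistent_cut(inconsistent, edges))  # False
```

The first cut corresponds to a snapshot in which the message G → D is recorded as channel state; the second would record a receive whose send was never recorded.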

Our Global Snapshot Example ... is causally correct
[Figure: the timelines for P1, P2 and P3 annotated with the recorded snapshot: S1, S2, S3, C12 = < >, C13 = < >, C21 = <message G→D>, C23 = < >, C31 = < >, C32 = < >. The frontier through S1, S2 and S3 is the consistent cut captured by our Global Snapshot example.]

In fact
Any run of the Chandy-Lamport Global Snapshot algorithm creates a consistent cut.

Chandy-Lamport Global Snapshot algorithm creates a consistent cut
Let's quickly look at the proof.
- Let e_i and e_j be events occurring at Pi and Pj, respectively, such that e_i → e_j (e_i happens before e_j).
- The snapshot algorithm ensures that if e_j is in the cut then e_i is also in the cut.
- That is: if e_j → <Pj records its state>, then it must be true that e_i → <Pi records its state>.

Chandy-Lamport Global Snapshot algorithm creates a consistent cut (contd.)
Claim: if e_j → <Pj records its state>, then it must be true that e_i → <Pi records its state>.
- By contradiction, suppose e_j → <Pj records its state> and <Pi records its state> → e_i.
- Consider the path of app messages (through other processes) that goes from e_i to e_j.
- Due to FIFO ordering, Markers on each link in the above path will precede regular app messages.
- Thus, since <Pi records its state> → e_i, it must be true that Pj received a Marker before e_j.
- Thus e_j is not in the cut => contradiction.

Summary
- The ability to calculate global snapshots in a distributed system is very important.
- But we don't want to interrupt the running distributed application.
- The Chandy-Lamport algorithm calculates a global snapshot.
- It obeys causality (creates a consistent cut).

Distributed snapshot algorithm summary
- Chandy & Lamport, 1985: algorithm to select a consistent cut.
  - Any process may initiate a snapshot at any time.
  - Processes can continue normal execution (send and receive messages).
- Assumes:
  - no failures of processes and channels
  - strong connectivity (at least one path between each process pair)
  - unidirectional, FIFO channels
  - reliable delivery of messages

Today
1. Global snapshot of a distributed system
2. Chandy-Lamport's algorithm
3. Gossip

Multicast problem

Fault-tolerance and Scalability
Needs:
1. Reliability (atomicity): 100% receipt
2. Speed

Centralized

Tree-Based

Tree-based Multicast Protocols
- Build a spanning tree among the processes of the multicast group.
- Use the spanning tree to disseminate multicasts.
- Use either acknowledgments (ACKs) or negative acknowledgments (NAKs) to repair multicasts not received.
- SRM (Scalable Reliable Multicast)
  - Uses NAKs
  - But adds random delays, and uses exponential backoff to avoid NAK storms
- RMTP (Reliable Multicast Transport Protocol)
  - Uses ACKs
  - But ACKs are only sent to designated receivers, which then retransmit missing multicasts
- These protocols still cause an O(N) ACK/NAK overhead [Birman99].

A Third Approach


Epidemic Multicast (or "Gossip")

Push vs. Pull
- So that was "Push" gossip:
  - Once you have a multicast message, you start gossiping about it.
  - Multiple messages? Gossip a random subset of them, or recently-received ones, or higher-priority ones.
- There's also "Pull" gossip:
  - Periodically poll a few randomly selected processes for new multicast messages that you haven't received.
  - Get those messages.
- Hybrid variant: Push-Pull
  - As the name suggests.

Properties
Claim that the simple Push protocol:
- Is lightweight in large groups
- Spreads a multicast quickly
- Is highly fault-tolerant
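A quick way to get a feel for the "spreads a multicast quickly" claim is to simulate push gossip in synchronous rounds. This is a simplified sketch, not the protocol from the slides in full detail: it assumes every infected node pushes to `b` targets chosen uniformly at random per round (possibly itself or an already infected node), with no message loss; the function name and parameters are hypothetical.

```python
import random

def push_gossip_rounds(n, b, seed=0):
    """Count synchronous rounds until all n nodes hold the multicast.

    Assumption of this sketch: each infected node pushes to b uniformly
    random targets per round, and no messages are lost.
    """
    rng = random.Random(seed)
    infected = {0}          # node 0 is the original sender
    rounds = 0
    while len(infected) < n:
        newly_reached = set()
        for _ in infected:
            for _ in range(b):
                newly_reached.add(rng.randrange(n))
        infected |= newly_reached
        rounds += 1
    return rounds

for n in (100, 1000, 10000):
    print(n, push_gossip_rounds(n, b=2))
```

Under these assumptions the round count should grow roughly with log(n) rather than with n, while each node sends only `b` messages per round.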

Analysis
From the old mathematical branch of Epidemiology [Bailey75]:
- Population of (n+1) individuals mixing homogeneously.
- Contact rate between any individual pair is β.
- At any time, each individual is either uninfected (numbering x) or infected (numbering y).
- Then, x_0 = n, y_0 = 1, and at all times x + y = n + 1.
- Infected-uninfected contact turns the latter infected, and it stays infected.

Analysis (contd.)
- Continuous-time process.
- Then $\frac{dx}{dt} = -\beta x y$ (why?)
- with solution: $x = \frac{n(n+1)}{n + e^{\beta(n+1)t}}$, $y = \frac{n+1}{1 + n e^{-\beta(n+1)t}}$ (can you derive it?)
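One possible route to the stated solution, given as a sketch: here β denotes the pairwise contact rate, x the number of uninfected and y the number of infected individuals, with x + y = n + 1, x(0) = n and y(0) = 1. Substituting y = n + 1 - x gives a separable equation that partial fractions solve.

```latex
\begin{align*}
\frac{dx}{dt} &= -\beta x y = -\beta x\,(n+1-x) \\
\frac{dx}{x\,(n+1-x)} &= -\beta\,dt,
\qquad \frac{1}{x\,(n+1-x)} = \frac{1}{n+1}\left(\frac{1}{x} + \frac{1}{n+1-x}\right) \\
\frac{1}{n+1}\,\ln\frac{x}{n+1-x} &= -\beta t + K,
\qquad K = \frac{\ln n}{n+1} \quad \text{(from } x(0) = n\text{)} \\
\frac{x}{n+1-x} &= n\,e^{-\beta(n+1)t} \\
x &= \frac{n(n+1)\,e^{-\beta(n+1)t}}{1 + n\,e^{-\beta(n+1)t}}
   = \frac{n(n+1)}{n + e^{\beta(n+1)t}} \\
y &= (n+1) - x = \frac{n+1}{1 + n\,e^{-\beta(n+1)t}}
\end{align*}
```

As a sanity check, t = 0 gives x = n and y = 1, and as t grows x tends to 0 while y tends to n + 1.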

Epidemic Multicast

Epidemic Multicast Analysis
- $\beta = \frac{b}{n}$ (why?)
- Substituting, at time $t = c\log(n)$, the number of infected is $y \approx (n+1) - \frac{1}{n^{cb-2}}$ (correct? can you derive it?)
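A sketch of how the approximation can be obtained from the logistic solution for y, assuming the logarithm is natural, n is large (so (n+1)/n ≈ 1), and cb > 2 (so the correction term is small). Here β is the pairwise contact rate and b is the number of gossip targets per node per round.

```latex
\begin{align*}
\beta (n+1)\, t &= \frac{b}{n}\,(n+1)\, c \ln n \;\approx\; c\,b \ln n
\quad\Longrightarrow\quad e^{-\beta(n+1)t} \approx n^{-cb} \\
y &= \frac{n+1}{1 + n\,e^{-\beta(n+1)t}} \;\approx\; \frac{n+1}{1 + n^{1-cb}} \\
  &\approx (n+1)\left(1 - n^{1-cb}\right)
   \;\approx\; (n+1) - n^{2-cb}
   \;=\; (n+1) - \frac{1}{n^{cb-2}}
\end{align*}
```

The step from the second to the third line uses 1/(1+ε) ≈ 1 - ε for small ε, and the last step uses (n+1) n^{1-cb} ≈ n^{2-cb}.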

Analysis (contd.)
- Set c, b to be small numbers independent of n.
- Within $c\log(n)$ rounds [low latency]:
  - all but $\frac{1}{n^{cb-2}}$ of the nodes receive the multicast [reliability]
  - each node has transmitted no more than $cb\log(n)$ gossip messages [lightweight]
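To see what these bounds look like numerically, the sketch below evaluates the logistic solution for the number of infected nodes at time t = c·ln(n) with contact rate b/n, and compares the number of missed nodes against 1/n^(cb-2). The function names are hypothetical, the logarithm is assumed to be natural, and the parameter values in the usage lines are arbitrary examples.

```python
import math

def infected_at(n, b, c):
    """Logistic solution y(t) at t = c*ln(n), with contact rate beta = b/n."""
    beta = b / n
    t = c * math.log(n)
    return (n + 1) / (1 + n * math.exp(-beta * (n + 1) * t))

def missed_exact(n, b, c):
    """Nodes (out of n+1) still uninfected at t = c*ln(n)."""
    return (n + 1) - infected_at(n, b, c)

def missed_approx(n, b, c):
    """The slide's approximation: 1 / n^(cb - 2)."""
    return 1 / n ** (c * b - 2)

def messages_per_node(n, b, c):
    """Upper bound on gossip messages sent by one node: c*b*ln(n)."""
    return c * b * math.log(n)

n, b, c = 1000, 2, 2
print(missed_exact(n, b, c), missed_approx(n, b, c), messages_per_node(n, b, c))
```

For these example values the expected number of missed nodes is far below one, while each node sends only a few dozen messages.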

Why is log(n) low?
- log(n) is not constant in theory.
- But pragmatically, it is a very slowly growing number.
- Base 2:
  - log(1000) ~ 10
  - log(1M) ~ 20
  - log(1B) ~ 30
  - log(all IPv4 addresses) = 32

Fault-tolerance
- Packet loss
  - 50% packet loss: analyze with b replaced with b/2.
  - To achieve the same reliability as with 0% packet loss, it takes twice as many rounds.
- Node failure
  - 50% of nodes fail: analyze with n replaced with n/2 and b replaced with b/2.
  - Same as above.

Fault-tolerance (contd.)
- With failures, is it possible that the epidemic might die out quickly?
- Possible, but improbable:
  - Once a few nodes are infected, with high probability the epidemic will not die out.
  - So the analysis we saw in the previous slides is actually behavior with high probability [Galey and Dani 98].
- Think: why do rumors spread so fast? Why do infectious diseases cascade quickly into epidemics? Why does a virus or worm spread rapidly?

Pull Gossip: Analysis
- In all forms of gossip, it takes O(log(N)) rounds before about N/2 processes get the gossip.
  - Why? Because that's the fastest you can spread a message: a spanning tree with constant fanout (degree) needs O(log(N)) levels to reach N nodes.
- Thereafter, pull gossip is faster than push gossip.
- After the i-th round, let p_i be the fraction of non-infected processes, and let each round have k pulls. Then $p_{i+1} = (p_i)^{k+1}$.
- This is super-exponential.
- The second half of pull gossip finishes in time O(log(log(N))).
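Iterating the recurrence p_{i+1} = (p_i)^(k+1) shows how quickly the uninfected fraction collapses once half the processes are infected. The sketch below is illustrative only: the function name is hypothetical, and it assumes the "second half" starts at p_0 = 1/2 and ends when fewer than one of the n processes is expected to be uninfected.

```python
def pull_rounds_to_finish(n, k, p0=0.5):
    """Rounds of pull gossip until the expected uninfected count drops below 1.

    Applies p_{i+1} = p_i ** (k + 1), starting from an uninfected fraction p0.
    """
    p = p0
    rounds = 0
    while p * n >= 1:
        p = p ** (k + 1)
        rounds += 1
    return rounds

print(pull_rounds_to_finish(1024, k=1))     # 4
print(pull_rounds_to_finish(2 ** 20, k=1))  # 5
```

Going from about a thousand to about a million processes adds a single round, which is the O(log(log(N))) behavior claimed for the second half.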

Summary
- Multicast is an important problem.
- Tree-based multicast protocols.
- When concerned about scale and fault-tolerance, gossip is an attractive solution.
- Also known as epidemics.
- Fast, reliable, fault-tolerant, scalable, topology-aware.

Next topic: Primary-backup replication (pre-reading: VM replication)