Learning via Delayed Knowledge: A Case of Jamming
SaiDhiraj Amuru and R. Michael Buehrer
Why do we need an Intelligent Jammer?
- Dynamic environment conditions in electronic warfare scenarios; failure of predict-then-adapt approaches
- Lack of knowledge of the optimal jamming strategy
- A naïve always-on jamming strategy is sub-optimal: energy is wasted, it is easy to detect, and it is easy to neutralize
- Cognitive capabilities are necessary to survive in harsh environments
- We explore the learning capabilities of a jammer with delayed environment knowledge, studied using an 802.11-type network that uses the RTS-CTS protocol
State of the art
- Attacks at various protocol layers: PHY / MAC / Network layers
- Naive jamming strategies: continuous jamming, periodic jamming, partial-band, single/multi-tone jamming
- Sensing-based jamming: deceptive jamming, reactive jamming
- Perfect and instantaneous knowledge of the adversary: jamming control packets, jamming synchronization signals
Common Techniques to Address Jamming
- Optimization framework assuming knowledge of certain parameters: maximize BER/SER/PER
- Game theory: one-shot zero-sum games, repeated games, minimax formulations, mutual-information games
- Information theory: channel capacity under jamming, saddle-point solutions, DoF, mutual information
- Problem: a lot of prior knowledge is required
- Solution: employ learning techniques
What is Learning?
- Adaptation
  - See the data and change the strategy
  - Adapt in order to survive in the environment
  - No memory in the system
- Learning
  - More than an adaptive system
  - Ability to detect patterns in the data
  - Understand what is happening in the data and adapt
  - Remember the strategies used and relate them to the data
  - Evaluate the outcomes of the decisions taken, and gather knowledge to be exploited in the future
Formal Definition of Learning
- A system is said to learn from experience/feedback E with respect to some class of tasks T and performance measure C (similar to a cost function), if its performance at tasks in T, as measured by C, improves with experience E.
- Learned hypothesis: model of the problem/task T
- Model quality: performance measured by C
Different Types of Learning
- Supervised Learning: teacher-student type learning
- Unsupervised Learning: the student is left on his own
- Semi-supervised Learning: mixture of the above two learning techniques
- Reinforcement Learning / Online Learning: learn by experimenting; experience is the only teacher
Image courtesy: blog.bigml.com, www.infiniteai.com
Intro to RL
- Reinforcement Learning: a radio/agent learns the optimal strategy (for example, a survival strategy) by repeatedly interacting with the environment
- The agent receives feedback indicating whether the actions performed were good or bad
- It learns to take actions which yield higher rewards
Fig: Agent-environment loop - the agent (with prior information, goals/metrics, and past experience) receives observations from the environment and applies actions to it (a minimal sketch of this loop follows below)
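A minimal sketch of this interaction loop; the two-state environment, action names, and reward values here are hypothetical placeholders for illustration, not the 802.11 model discussed later:

```python
import random

# Hypothetical two-state environment: the transmitter is either sending or idle.
STATES = ["Tx", "NoTx"]
ACTIONS = ["Jam", "DontJam"]

def step(state, action):
    """Return (next_state, reward) for this illustrative environment."""
    reward = 1.0 if (state == "Tx" and action == "Jam") else -0.1
    next_state = random.choice(STATES)         # placeholder dynamics
    return next_state, reward

state = "NoTx"
for t in range(5):
    action = random.choice(ACTIONS)            # a learning agent would choose here
    next_state, reward = step(state, action)   # feedback from the environment
    print(t, state, action, reward)
    state = next_state
```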
Framework for RL
- Sequential decision model
Fig: At each time t, the present state and the chosen action produce a reward and the next state at time t+1
- Decision rule: at each time, the system state is used to choose an action
- Policy: set of rules mapping states to actions
- A sequence of decision rules generates rewards
- Commonly modeled as a Markov Decision Process
Markov Decision Process (MDP)
- Something more than a Markov chain; think of it as a controlled Markov chain
- MDP = {States, Actions, Transition Probabilities, Rewards} = {S, A, P, R}
- E.g., from a jammer's perspective, the environment states could be Tx/No Tx, and the actions of the jammer could be Jam/Don't Jam (a toy instantiation of this tuple is sketched below)
- P is the S x A x S state transition probability matrix; it governs the dynamics of the environment, p(s'|s,a)
- R is the S x A reward matrix; r(s,a) = reward obtained in state s when action a is executed
- π = policy, a mapping between states and actions
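As a concrete illustration of the {S, A, P, R} tuple for the Tx/No-Tx example, with made-up probabilities and rewards (not values from this work):

```python
import numpy as np

S = ["Tx", "NoTx"]          # environment states
A = ["Jam", "DontJam"]      # jammer actions

# P[a][s, s'] = p(s' | s, a): an S x A x S transition model (illustrative values)
P = {
    "Jam":     np.array([[0.2, 0.8],
                         [0.5, 0.5]]),
    "DontJam": np.array([[0.7, 0.3],
                         [0.5, 0.5]]),
}

# R[s, a] = r(s, a): an S x A reward matrix (illustrative values)
R = np.array([[ 1.0, -1.0],    # state Tx:   jamming pays off, idling does not
              [-0.1,  0.0]])   # state NoTx: jamming only wastes energy

# A deterministic policy maps each state to an action.
policy = {"Tx": "Jam", "NoTx": "DontJam"}
```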
Goals of RL
- Maximize the cumulative discounted reward, where 0 ≤ γ ≤ 1 is the discount factor: it controls how much you value the future
- For a finite time horizon, γ = 1 is used (un-discounted MDP)
- The goal of the decision-maker is to choose a behavior that maximizes the expected return, irrespective of how the process started (initial state)
- A decision process that achieves the optimal values in all states is optimal
- For a given policy π, the value function V^π (standard definition shown below)
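In standard notation (a restatement of the textbook definitions, standing in for the slide's equations), the discounted return and the value of a policy π are:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}, \qquad 0 \le \gamma \le 1,
\qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r(s_k, a_k) \,\middle|\, s_0 = s\right].
```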
Finding the Optimal Policy
- Notice the similarity to Dynamic Programming
- The optimal value function satisfies the Bellman equation (reproduced in standard form below)
- Bellman operator: an affine linear operator; for γ < 1, it is a contraction mapping
- By the Banach fixed point theorem, a unique solution exists
- 1) A function T: X -> X is a contraction mapping if d(T(x), T(y)) ≤ q·d(x, y) for some 0 ≤ q < 1, where d is a distance measure
- 2) Banach fixed point theorem: T admits a unique fixed point x* with T(x*) = x*; it can be found by starting with x_0 and defining the sequence x_n = T(x_{n-1}); then x_n converges to x*
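The Bellman optimality equation referred to on the slide, in standard form:

```latex
V^{*}(s) \;=\; \max_{a \in A} \Big[\, r(s,a) \;+\; \gamma \sum_{s' \in S} p(s' \mid s,a)\, V^{*}(s') \,\Big].
```

A minimal value-iteration sketch that applies this operator repeatedly on the toy {S, A, P, R} from the MDP slide (illustrative numbers, not the paper's model); the contraction property guarantees convergence to the unique fixed point V*:

```python
import numpy as np

# Toy {S, A, P, R} (illustrative values, echoing the MDP slide)
S = ["Tx", "NoTx"]
A = ["Jam", "DontJam"]
P = {"Jam":     np.array([[0.2, 0.8], [0.5, 0.5]]),
     "DontJam": np.array([[0.7, 0.3], [0.5, 0.5]])}
R = np.array([[ 1.0, -1.0],
              [-0.1,  0.0]])
gamma = 0.9

V = np.zeros(len(S))
for _ in range(500):                               # value iteration
    TV = np.array([max(R[i, j] + gamma * P[a][i] @ V for j, a in enumerate(A))
                   for i in range(len(S))])
    if np.max(np.abs(TV - V)) < 1e-9:              # contraction => geometric convergence
        V = TV
        break
    V = TV

# Greedy policy with respect to the (near-)optimal value function
policy = {S[i]: A[int(np.argmax([R[i, j] + gamma * P[a][i] @ V for j, a in enumerate(A)]))]
          for i in range(len(S))}
print(V, policy)
```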
More Information about MDPs
- If P is known a priori (known as indirect learning / planning), we can evaluate various policies and find the best one
  - Note: works for small-size MDPs only
- MDPs in general work well for small sizes of S and A (more about this at the end of the talk: Multi-Armed Bandits)
- Online learning techniques are used when P is not known
  - Exploration versus exploitation dilemma; a common algorithm is ε-greedy
  - Q-Learning and SARSA are other online learning techniques (a generic Q-learning sketch follows below)
Images courtesy: Microsoft Research
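When P is unknown, a model-free learner such as Q-learning with ε-greedy exploration can be used. This is a generic sketch, not the authors' code; the environment dynamics below are hypothetical:

```python
import random
import numpy as np

S = ["Tx", "NoTx"]
A = ["Jam", "DontJam"]

def env_step(s, a):
    """Hypothetical environment dynamics and rewards (for illustration only)."""
    r = 1.0 if (s == "Tx" and a == "Jam") else -0.1
    s_next = "Tx" if random.random() < 0.6 else "NoTx"
    return s_next, r

Q = np.zeros((len(S), len(A)))
alpha, gamma, eps = 0.1, 0.9, 0.1
s = "NoTx"
for t in range(10000):
    i = S.index(s)
    # epsilon-greedy: explore with probability eps, otherwise exploit the current Q
    j = random.randrange(len(A)) if random.random() < eps else int(np.argmax(Q[i]))
    s_next, r = env_step(s, A[j])
    i_next = S.index(s_next)
    # Q-learning update toward the bootstrapped target
    Q[i, j] += alpha * (r + gamma * np.max(Q[i_next]) - Q[i, j])
    s = s_next
print(Q)
```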
Can we have instantaneous knowledge?
- "As tasks and environments grow more complex, an agent's observations of its environment are more often than not delayed" - Littman 2009
- E.g., direct control of the Mars rover from Earth is limited by the communication latency
- The delay may not be limited to a single time step
- When a jammer disrupts a DATA packet, it is not aware whether the jamming was successful or not until an ACK packet is sent by the receiver
- A 'wait' agent is sub-optimal in such scenarios; it is better to utilize the time by taking some actions
How do we handle delayed state observations?
- A new MDP framework was developed to handle delayed learning scenarios - Altman 1992
- {S, A, P, R, k}, where k = observation delay
- {I_k, a, p, r} = equivalent augmented MDP
- I_k: augmented state space of size |S| x |A|^k
- Since the state s_{t-k+1} is not known perfectly, rewards and transitions are evaluated in expectation over the possible current states (see the sketch below)
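One common way to write the augmented state for a constant observation delay k, following the general idea attributed to Altman 1992 (the notation here is our paraphrase of the standard construction, not a quote of the slide's equation): the augmented state collects the last observed state and the actions taken since, and rewards are averaged over the unobserved current state.

```latex
I_t = \big(s_{t-k},\, a_{t-k},\, a_{t-k+1},\, \dots,\, a_{t-1}\big), \qquad |I_k| = |S|\,|A|^{k},
\qquad
\tilde{r}(I_t, a_t) \;=\; \mathbb{E}\big[\, r(s_t, a_t) \mid I_t \,\big]
\;=\; \sum_{s \in S} \Pr\big(s_t = s \mid s_{t-k}, a_{t-k}, \dots, a_{t-1}\big)\, r(s, a_t).
```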
Transition-based Rewards
- But again, these frameworks assume state-based rewards; what if there are transition-based rewards?
- We developed a new framework to handle this: a Delayed Learning Framework with Transition-based Rewards
- Bellman's optimality rules still hold true
- P^π and R^π are now |I_k| x |I_k| matrices and can handle transition-based rewards (a jamming example will be shown soon)
Jamming via Delayed Learning
- We consider an 802.11 wireless network with one user
- A MAC-layer jamming attack is studied
Fig: Basic 802.11 protocol (RTS = Request to Send, CTS = Clear to Send, ACK = Acknowledgement)
Fig: Model for the victim
Jammer's Model
- Assumptions
  - The MAC protocol is known to the jammer
  - It can identify the ACK/NACK packets
  - The jamming success probability ρ is unknown
- The 802.11 packets form the MDP states; the jammer can jam any of them, so it must find the optimal policy among 16 candidates
- Feedback = energy expended and throughput allowed
  - Cost to jam RTS, CTS, or ACK = -E; cost to jam DATA = -10E
  - Throughput allowed = -T (WAIT followed by ACK indicates this)
So what is delayed?
- Just to make things clear:
  - The jammer cannot identify a packet before its transmission happens; the packet type is known perfectly only after 1 time slot
  - The energy cost is known instantaneously, based on the actions taken
  - The throughput cost is known only when the ACK-to-WAIT transition happens
- Notice that the reward is based on transitions and not on the states themselves (a schematic cost function is sketched below)
- Objective: minimize costs and deny any communication exchange
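A schematic of how such a transition-based cost signal might be computed. The cost constants follow the slides, but the reward function itself is our illustrative sketch of the idea, not the exact framework of the paper:

```python
# Cost constants from the slides (here as positive magnitudes): jamming RTS/CTS/ACK
# costs E, jamming a DATA packet costs 10E, and an allowed exchange costs T.
E, T = 10.0, 100.0

def transition_reward(packet, action, next_packet):
    """Illustrative transition-based reward (negative cost) for the delayed jammer.

    'packet' is the packet type revealed (one slot late) for the slot in which
    'action' was taken; 'next_packet' is the following observed state.
    """
    cost = 0.0
    if action == "JAM":
        cost += 10 * E if packet == "DATA" else E   # energy spent on jamming
    if packet == "ACK" and next_packet == "WAIT":
        cost += T                                   # exchange completed: throughput allowed
    return -cost                                    # rewards are negative costs

# Example: the jammer stayed idle and the victim's exchange completed
print(transition_reward("ACK", "IDLE", "WAIT"))     # -> -100.0 (throughput penalty)
```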
Optimal Performance (benchmark result)
- Assume ρ is known; E = -10, T = -100
- The optimal theoretical policy follows from the novel delayed learning framework
- Why is jamming a CTS packet better than jamming RTS or ACK packets?
Which policy to use?
Fig: Jamming policy as a function of energy and throughput costs, ρ = 0.5
What effect does delay have?
- True ρ = 0.3; learn ρ by jamming all states and observing the environment
- 1 episode = 1000 time slots; one policy is evaluated per episode
- ε-greedy is used to balance exploration vs. exploitation
What effect does delay have? (unknown model)
- True ρ = 0.5
- Unknown model = learn the re-transmission (ReTx) limit and average contention window (CW) sizes by jamming all states
- 1 episode = 1000 time slots; one policy is evaluated per episode
- ε-greedy is used to balance exploration vs. exploitation
So in this work
- We explored whether a jammer can learn its surroundings or not
- Instantaneous knowledge is not readily available in most practical systems; we need to deal with delay (recall that states are known only with a delay)
- A Delayed Reinforcement Learning framework was developed to address such delayed cognitive learning scenarios
- An example 802.11 framework was considered and optimal jamming policies against this network were obtained; the optimal policies match intuition
- To be done: varying ρ, errors in the feedback
What did we learn from this problem?
- Small time delays can be modeled easily using the MDP framework
- MDPs work well for small sizes of S and A; finite-time guarantees can be given
- MDPs model single-user scenarios very well
- Our experience with multi-user MDPs: not so good, especially when the MDPs are coupled (as in the 802.11 framework)
- Alternative learning algorithms are being explored
Multi-armed Bandits
- Another widely explored learning approach
- Can be related to MDP theory and the creation of bandit processes: Gittins indices
- An alternative definition based on a regret formulation
- Learn to intelligently explore and exploit, and choose the best arm
- A widely used algorithm: Upper Confidence Bound (UCB1); a generic sketch follows below
Image courtesy: Daniel Jakubisin
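A generic UCB1 sketch (the standard algorithm, with made-up Bernoulli arms standing in for candidate jamming strategies; not the authors' implementation):

```python
import math
import random

# Hypothetical success probabilities of K candidate jamming strategies (arms)
p_arms = [0.2, 0.5, 0.7, 0.4]
K = len(p_arms)

counts = [0] * K          # number of times each arm was played
values = [0.0] * K        # empirical mean reward of each arm

def pull(k):
    """Simulated feedback: 1 if the jamming attempt succeeded (e.g., no ACK seen)."""
    return 1.0 if random.random() < p_arms[k] else 0.0

# Play each arm once to initialize
for k in range(K):
    counts[k], values[k] = 1, pull(k)

for t in range(K, 10000):
    # UCB1 index: empirical mean plus an exploration bonus
    ucb = [values[k] + math.sqrt(2 * math.log(t) / counts[k]) for k in range(K)]
    k = ucb.index(max(ucb))
    r = pull(k)
    counts[k] += 1
    values[k] += (r - values[k]) / counts[k]   # incremental mean update

print("best arm estimate:", values.index(max(values)))
```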
What did we do with MAB?
- Learn the optimal physical-layer jamming strategies
- Actions = {signaling scheme, P_J, ON-OFF duration}
- Only needs ACK/NACK as feedback
- Can give theoretical guarantees for the jamming performance: cumulative and one-step regret
Convergence to the optimal strategy
Tracking adaptive users