Learning via Delayed Knowledge: A Case of Jamming. SaiDhiraj Amuru and R. Michael Buehrer


Why do we need an intelligent jammer?
- Dynamic environment conditions in electronic warfare scenarios lead to the failure of predict-then-adapt approaches.
- The optimal jamming strategy is not known in advance. A naive always-on jamming strategy is sub-optimal: energy is wasted, it is easy to detect, and it is easy to neutralize.
- Cognitive capabilities are necessary to survive in harsh environments.
- We explore the learning capabilities of a jammer with delayed environment knowledge, studied using an 802.11-type network that uses the RTS-CTS protocol.

State of the art
- Attacks at various protocol layers: PHY / MAC / network layers.
- Naive jamming strategies: continuous jamming, periodic jamming, partial-band jamming, single/multi-tone jamming.
- Sensing-based jamming: deceptive jamming, reactive jamming.
- Strategies that assume perfect and instantaneous knowledge of the adversary: jamming control packets, jamming synchronization signals.

Common techniques to address jamming
- Optimization frameworks assuming knowledge of certain parameters: maximize BER/SER/PER.
- Game theory: one-shot zero-sum games, repeated games, minimax formulations, mutual-information games.
- Information theory: channel capacity under jamming, saddle-point solutions, DoF, mutual information.
Problem: a lot of knowledge is required. Solution: employ learning techniques.

What is learning?
Adaptation
- See the data and change the strategy.
- Adapt in order to survive in the environment.
- No memory in the system.
Learning
- More than an adaptive system.
- Ability to detect patterns in the data; understand what is happening in the data and adapt.
- Remember the strategies used and relate them to the data.
- Evaluate the outcome of the decisions taken, and gather knowledge to be exploited in the future.

Formal definition of learning
A system is said to learn from experience/feedback E with respect to some class of tasks T and performance measure C (similar to a cost function) if its performance at tasks in T, as measured by C, improves with experience E.
- Learned hypothesis: model of the problem/task T.
- Model quality: performance measured by C.

Different types of learning
- Supervised learning: teacher-student type learning.
- Unsupervised learning: the student is left on its own.
- Semi-supervised learning: a mixture of the two techniques above.
- Reinforcement learning / online learning: learn by experimenting; experience is the only teacher.

Intro to RL
- In reinforcement learning, a radio/agent learns the optimal strategy (for example, a survival strategy) by repeatedly interacting with the environment.
- The agent receives feedback indicating whether the actions performed were good or bad.
- The agent learns to take actions which yield higher rewards.
Fig: agent-environment loop (the agent uses prior information, goals/metrics, and past experience; it makes observations of the environment and takes actions).

Framework for RL: sequential decision model
Fig: at time t, the present state and chosen action produce a reward and the next state at time t+1.
- Decision rule: at each time, the system state is used to choose an action.
- Policy: a set of decision rules mapping states to actions.
- A sequence of decision rules generates rewards.
Commonly modeled as a Markov Decision Process.

Markov Decision Process (MDP)
- Something more than a Markov chain; think of it as a controlled Markov chain.
- MDP = {States, Actions, Transition Probabilities, Rewards} = {S, A, P, R}.
- E.g., from a jammer's perspective, the environment states could be Tx/No Tx, and the actions of the jammer could be Jam/Don't Jam.
- P is the |S| x |A| x |S| state transition probability tensor that governs the dynamics of the environment, p(s'|s,a).
- R is the |S| x |A| reward matrix; r(s,a) = reward obtained in state s when action a is executed.
- π = policy, a mapping between states and actions.
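
To make the {S, A, P, R} notation concrete, here is a minimal Python sketch of the two-state jammer MDP mentioned above. All transition probabilities and rewards are hypothetical placeholders chosen only to show the data structures, not values from the talk.

```python
import numpy as np

# Hypothetical two-state, two-action jammer MDP: {S, A, P, R}.
states = ["Tx", "NoTx"]          # environment states
actions = ["Jam", "DontJam"]     # jammer actions

# P[s, a, s'] = p(s'|s, a): |S| x |A| x |S| transition tensor (made-up numbers).
P = np.array([
    [[0.2, 0.8],    # in "Tx", action "Jam": transmission likely disrupted
     [0.9, 0.1]],   # in "Tx", action "DontJam": transmission likely continues
    [[0.3, 0.7],    # in "NoTx", action "Jam"
     [0.5, 0.5]],   # in "NoTx", action "DontJam"
])

# R[s, a] = r(s, a): |S| x |A| reward matrix (negative costs, made-up numbers).
R = np.array([
    [-1.0, -10.0],  # jamming costs energy; letting a transmission through costs throughput
    [-1.0,   0.0],
])

assert np.allclose(P.sum(axis=2), 1.0)  # each (s, a) slice is a probability distribution
```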

Goals of RL
- Maximize the cumulative discounted reward; 0 <= γ <= 1 is the discount factor (how much you value the future). For a finite time horizon, γ = 1 is used (undiscounted MDP).
- The goal of the decision-maker is to choose a behavior that maximizes the expected return, irrespective of how the process started (the initial state).
- A policy that achieves the optimal values in all states is optimal.
- For a given policy π, the value function is V^π (the standard definitions are recalled below).
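
For reference, the standard formulation of the discounted return and the value function of a policy π, which the slide refers to but does not reproduce, is:

```latex
% Discounted return from time t, and the value of a policy \pi (standard definitions).
G_t \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, r(s_{t+k}, a_{t+k}), \qquad 0 \le \gamma \le 1,
\qquad
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, G_t \mid s_t = s \,\right].
```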

Finding the optimal policy
- Notice the similarity to dynamic programming; the resulting fixed-point equation is known as the Bellman equation.
- The Bellman operator is an affine linear operator; for γ < 1 it is a contraction mapping, so by the Banach fixed-point theorem a unique solution exists.
1) A function T: X -> X is a contraction mapping if d(T(x), T(y)) <= q d(x, y) for some 0 <= q < 1, where d is a distance measure.
2) Banach fixed-point theorem: T admits a unique fixed point T(x*) = x*. It can be found by starting with x_0 and defining the sequence x_n = T(x_{n-1}); then x_n converges to x*.
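
The fixed-point iteration described above corresponds to value iteration. The sketch below is a minimal illustration (not the authors' implementation) that repeatedly applies the Bellman optimality operator to P and R arrays shaped like those in the earlier MDP sketch.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality operator T until the fixed point V* is reached.

    P: |S| x |A| x |S| transition tensor, R: |S| x |A| reward matrix.
    For gamma < 1, T is a contraction, so the iteration converges to a unique V*.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) V(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)   # greedy policy with respect to the fixed point
    return V_new, policy
```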

More information about MDPs
- If P is known a priori (known as indirect learning or planning), one can evaluate various policies and find the best one. Note: this works for small MDPs only.
- MDPs in general work well for small sizes of S and A (more about this at the end of the talk, under multi-armed bandits).
- Online learning techniques are used when P is not known; this raises the exploration versus exploitation dilemma. A common algorithm is ε-greedy; Q-learning and SARSA are other online learning techniques.
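
As a minimal illustration of the online setting where P is unknown, here is a tabular Q-learning sketch with ε-greedy exploration. The env_step stand-in, learning rate, and episode structure are assumptions for the example, not the talk's simulation setup.

```python
import numpy as np

def q_learning(env_step, n_states, n_actions, episodes=500, steps=1000,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.

    env_step(s, a) -> (next_state, reward) is a stand-in for the unknown
    environment (the learner never sees P directly).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = int(rng.integers(n_states))
        for _ in range(steps):
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[s].argmax())
            s_next, r = env_step(s, a)
            # Q-learning update toward the bootstrapped target
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q
```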

Can we have instantaneous knowledge?
- "As tasks and environments grow more complex, an agent's observations of its environment are more often than not delayed" - Littman 2009.
- E.g., direct control of a Mars rover from Earth is limited by the communication latency; the delay may not be limited to a single time step.
- When a jammer disrupts a DATA packet, it does not know whether the jamming was successful until an ACK packet is (or is not) sent by the receiver.
- A 'wait' agent is sub-optimal in such scenarios; it is better to utilize the time by taking actions.

How do we handle delayed state observations?
- An MDP framework for delayed learning scenarios was developed by Altman (1992): {S, A, P, R, k}, where k is the observation delay.
- {I_k, A, P, R} is the equivalent augmented MDP, where I_k is the augmented state space of size |S| * |A|^k: an augmented state consists of the last perfectly observed state and the k actions taken since, (s_{t-k}, a_{t-k}, ..., a_{t-1}).
- Since the state s_{t-k+1} is not known perfectly, transition probabilities and rewards are evaluated in expectation over the unobserved states.
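
A small sketch of the augmented state space I_k, assuming the packet-type states and jammer actions used later in the talk; the labels and the delay k = 2 are illustrative only.

```python
from itertools import product

# Hypothetical labels for the delayed-observation construction.
states = ["RTS", "CTS", "DATA", "ACK"]
actions = ["JAM", "WAIT"]
k = 2  # observation delay in time slots

# Augmented state: the last perfectly observed state plus the k actions taken
# since then, (s_{t-k}, a_{t-k}, ..., a_{t-1}).  |I_k| = |S| * |A|^k.
augmented_states = [(s, *hist) for s in states
                    for hist in product(actions, repeat=k)]

print(len(augmented_states))   # 4 * 2**2 = 16 augmented states
print(augmented_states[0])     # ('RTS', 'JAM', 'JAM')
```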

Transition-based rewards
- But again, these frameworks assume state-based rewards. What if the rewards are transition-based?
- We developed a new framework to handle this: a delayed learning framework with transition-based rewards.
- Bellman's optimality conditions still hold. P^π and R^π are now |I_k| x |I_k| matrices, which can accommodate transition-based rewards (a jamming example is shown next).

Jamming via delayed learning
- We consider an 802.11 wireless network with one user; a MAC-layer jamming attack is studied.
- RTS = Request to Send, CTS = Clear to Send, ACK = Acknowledgement.
Fig: basic 802.11 protocol. Fig: model for the victim.

Jammer's model
Assumptions:
- The MAC protocol is known to the jammer, which can identify the ACK/NACK packets.
- The jamming success probability ρ is unknown.
- The 802.11 packets form the MDP states; the jammer can jam any of them, so the optimal policy is found among 16 policies (see the enumeration sketch below).
- Feedback = energy expended and throughput allowed.
- Cost to jam RTS, CTS, or ACK = -E; cost to jam DATA = -10E; throughput allowed = -T (a WAIT followed by an ACK indicates this).
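
To show where the count of 16 policies comes from, here is a small sketch that enumerates all deterministic mappings from the four 802.11 packet states to the two jammer actions. The state and action labels are taken from the slides; evaluating each policy's cost would additionally require ρ and the cost values, which is omitted here.

```python
from itertools import product

# The four 802.11 packet types are taken as the MDP states; in each state the
# jammer either jams or waits.
packet_states = ["RTS", "CTS", "DATA", "ACK"]
jammer_actions = ["JAM", "WAIT"]

# 2 actions in each of 4 states -> 2**4 = 16 deterministic policies.
policies = [dict(zip(packet_states, choice))
            for choice in product(jammer_actions, repeat=len(packet_states))]

print(len(policies))   # 16
print(policies[5])     # {'RTS': 'JAM', 'CTS': 'WAIT', 'DATA': 'JAM', 'ACK': 'WAIT'}
```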

So what is delayed? Just to make things clear:
- The jammer cannot identify a packet before its transmission happens; the packet type is known perfectly only after one time slot.
- The energy cost is known instantaneously based on the actions taken.
- The throughput cost is known only when an ACK-to-WAIT transition happens. Notice that the reward is based on transitions and not on the states themselves.
Objective: minimize costs and deny any communication exchange.

Optimal performance (benchmark result)
- Assume ρ is known; E = -10, T = -100.
- The optimal theoretical policy follows from the novel delayed learning framework.
- Why is jamming a CTS packet better than jamming RTS or ACK packets?

Which policy to use?
Fig: jamming policy as a function of energy and throughput costs, ρ = 0.5.

What effect does delay have?
- True ρ = 0.3. Learn ρ by jamming all states and observing the environment.
- 1 episode = 1000 time slots; one policy is evaluated per episode.
- ε-greedy is used to balance exploration and exploitation.

What effect does delay have? (unknown model)
- True ρ = 0.5. Unknown model = learn the retransmission (ReTx) limit and average contention window (CW) sizes by jamming all states.
- 1 episode = 1000 time slots; one policy is evaluated per episode.
- ε-greedy is used to balance exploration and exploitation.

So in this work
- We explored whether a jammer can learn its surroundings.
- Instantaneous knowledge is not readily available in most practical systems; we need to deal with delay (recall that the states are known only with a delay).
- A delayed reinforcement learning framework was developed to address such delayed cognitive learning scenarios.
- An example 802.11 network was considered and optimal jamming policies against this network were obtained; the optimal policies match intuition.
To be done: varying ρ, errors in the feedback.

What did we learn from this problem?
- Small time delays can be modeled easily using the MDP framework.
- MDPs work well for small sizes of S and A; finite-time guarantees can be given.
- MDPs model single-user scenarios very well. Our experience with multi-user MDPs: not so good, especially when the MDPs are coupled (as in the 802.11 framework).
- Alternative learning algorithms are being explored.

Multi-armed bandits
- Another widely explored learning framework.
- Can be related to MDP theory through the creation of bandit processes (Gittins indices).
- An alternative definition is based on a regret formulation.
- Learn to intelligently explore and exploit, and choose the best arm; a widely used algorithm is the Upper Confidence Bound (UCB1) algorithm (a sketch follows below).
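
A minimal sketch of UCB1, assuming the arms are generic reward sources in [0, 1] (for instance, candidate jamming strategies with unknown success probabilities); the Bernoulli reward model in the usage example is an assumption for illustration.

```python
import math
import random

def ucb1(pull_arm, n_arms, horizon=10000):
    """UCB1: play each arm once, then pick the arm maximizing
    empirical mean + sqrt(2 ln(t) / n_pulls). pull_arm(i) returns a reward in [0, 1]."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for i in range(n_arms):                      # initialization: pull every arm once
        sums[i] += pull_arm(i)
        counts[i] = 1
    for t in range(n_arms + 1, horizon + 1):
        scores = [sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])
                  for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: scores[i])
        sums[arm] += pull_arm(arm)
        counts[arm] += 1
    best = max(range(n_arms), key=lambda i: sums[i] / counts[i])
    return best, counts

# Usage example: three hypothetical arms with Bernoulli success rates.
rates = [0.2, 0.5, 0.35]
best_arm, pull_counts = ucb1(lambda i: 1.0 if random.random() < rates[i] else 0.0,
                             n_arms=len(rates))
```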

What did we do with MABs?
- Learn the optimal physical-layer jamming strategies; actions = {signaling scheme, jamming power P_J, ON-OFF duration}.
- Only ACK/NACK feedback is needed.
- Theoretical guarantees can be given for the jamming performance (cumulative and one-step regret).

Convergence to the optimal strategy

Tracking adaptive users