Reinforcement Learning for Fuzzy Control with Linguistic States


Journal of Uncertain Systems, Vol. 2, No. 1, pp. 54-66, 2008
Online at: www.jus.org.uk

Reinforcement Learning for Fuzzy Control with Linguistic States

Mohammad Hossein Fazel Zarandi 1,*, Javad Jouzdani 1, Maryam Fazel Zarandi 2

1 Department of Industrial Engineering, Amirkabir University of Technology, Tehran, Iran, P.O. Box 15875-4413
2 Department of Computer Science, King's College Road, University of Toronto, Toronto, ON, Canada M5S 3G4

zarandi@aut.ac.ir, javad@aut.ac.ir, mfazel@cs.toronto.edu

* Corresponding author. Email: zarandi@aut.ac.ir (M. H. F. Zarandi)

Received 9 September 2007; Accepted October 2007

Abstract

This paper presents a generalized reinforcement learning methodology for tuning fuzzy logic controllers with nonlinear dynamic behaviors. To this aim, the Generalized Approximate Reasoning-Based Intelligent Controller (GARIC) model in [3] is modified to handle vagueness in control states. The proposed architecture has a self-tuning capability even when only a weak reinforcement signal, such as a binary failure signal, is available. The controller is tested and validated on the well-known Cart-Pole control problem. Compared to similar models, the proposed controller exhibits better performance with regard to learning speed and robustness to changes in the controlled system dynamics, even in the presence of uncertainty in the observed states. © 2008 World Academic Press, UK. All rights reserved.

Keywords: fuzzy control, fuzzy reinforcement learning, fuzzy system modeling

1 Introduction

Fuzzy logic has proved to be effective in solving many nonlinear control problems, where the nonlinear behavior of the system makes it difficult, if not impossible, to build an analytical model of the system. Nevertheless, building a fuzzy controller has its own difficulties that should be resolved through the implementation of suitable techniques.

There are two approaches to the development of a fuzzy model. The first approach is based on describing the rules governing the system linguistically, using terms of natural language, and then transforming them into fuzzy rules. In this approach, which is referred to as the direct approach to fuzzy modeling, the linguistic descriptions are constructed subjectively according to a priori knowledge of the system. This makes the process highly dependent on the expert's knowledge, and if the expert's knowledge about the system is faulty, the result is a non-robust model of the system. Thus, the rules should be fine-tuned in order to be used for control purposes. The second approach uses input-output data in the development of the fuzzy model and is called the indirect approach to fuzzy modeling. The problem of extracting fuzzy rules from data arose in the early years after the birth of fuzzy modeling concepts. Since in this approach the fuzzy rules are constructed based on data, if the data are faulty, damaged, or noisy, the obtained rules may not be reliable; i.e., crude rules may be obtained that need to be fine-tuned.

When input-output training data are available, supervised learning techniques perform well on the task of tuning the controller. However, when such data are not available, unsupervised methods such as reinforcement learning can be used to solve the problem. In reinforcement learning, it is assumed that the equations describing the system are not well known to the controller, and the only information available is the states of the system and a reinforcement signal evaluating the performance via a failure or success signal. The controller is expected to learn the best policy through trial-and-error interaction with the dynamic environment.

Reinforcement learning can be used to fine-tune a fuzzy logic controller, either for structure identification (e.g., [1,2]) or for parameter identification (e.g., [3,4]). This paper concentrates on the latter issue. A number of previous studies exist in the literature. In [9], Mustapha and Lachiver present a model called the Generalized Reinforcement
Learning Fuzzy Controller (GRLFC), which is similar to the model proposed in this paper. The architecture proposed by Berenji and Khedkar [4], called the Generalized Approximate-Reasoning-Based Intelligent Controller (GARIC), is the main inspiration for GRLFC. However, one shortcoming of their model is that it needs a large number of trials to be tuned. Seo et al. [10] notice and model vagueness both in states and goals, but they ignore the generalization issue. Feng [11] introduces a model in which the learning rates are variable; however, the vagueness in the states is ignored. In [6], Jouffe proposes a model which uses eligibility traces to enhance the speed of learning, and investigates both continuous and discrete actions.

In this paper we extend the GARIC architecture in several respects. The vagueness in the input states is modeled by adding a new component called the Fuzzifier. In addition, the learning mechanism in the Critic is different from that of the Action Evaluation Network, which plays the role of the critic in GARIC. Furthermore, variable learning rates are used for updating the parameters of the Actor and the Critic. This increases the speed of learning and leads to satisfactory performance of GRLFC, even when vagueness exists in the input variables. Also, the Explorer in GRLFC is an extension of the Explorer in GARIC and uses episode information fed back from the plant to perform better exploration and exploitation. GRLFC not only captures the vagueness in the input states, but also has superior performance in comparison with similar GARIC-type models.

The rest of this paper is organized as follows. In Section 2, the fundamentals of the fuzzy inference system used in this paper are reviewed. Section 3 presents a brief review of reinforcement learning and its relation to fuzzy inference systems. The proposed Generalized Reinforcement Learning Fuzzy Controller (GRLFC) is discussed in Section 4. In Section 5, simulation results are presented to show the efficiency and superiority of the proposed model in comparison with similar models like GARIC. Finally, Section 6 concludes the paper with a discussion of the contributions and areas of future work.

2 Fuzzy Inference System

A fuzzy inference system (FIS) can be defined as a set of IF-THEN rules that maps an input space onto an output space. Therefore, it can be considered as a method for generalization and functional approximation.

Fig. 1. Triangular membership functions, their parameters, and the MaxMin point determining the degree of matching.

The input variables of an FIS are usually considered to have crisp values. However, in most real-world situations, input variables (states) are vague. The proposed FIS can capture the vagueness in the input variables by considering them as triangular fuzzy numbers (Fig. 1). In this research, triangular membership functions are used in both the antecedent and consequent parts of the FIS, for the sake of simplicity. These membership functions are determined by three parameters: centre (C), right spread (RS), and left spread (LS). Any kind of matching operator (t-conorm/t-norm) can be used to calculate the degree of matching between the input variables and their corresponding antecedents; here we use the MaxMin operator since it is the most widely used. In future work, we will try other operators in order to justify and select the most suitable ones for our problem. Of course, if parametric t-conorms and t-norms are used, an optimization module should be added to tune the parameters of the t-conorm/t-norm used for calculating the degree of matching. Under the above assumptions, the degree of matching between an input variable and its corresponding antecedent label can be easily calculated by using the MaxMin operator:
\[
\mathrm{MaxMin}(A, B) =
\begin{cases}
0, & \text{if } RS_a = 0 \text{ or } LS_b = 0 \\
1, & \text{if } RS_a = 0 \text{ and } LS_b = 0 \\
\psi\!\left(\dfrac{C_a - C_b + RS_a + LS_b}{RS_a + LS_b}\right), & \text{otherwise}
\end{cases}
\tag{1}
\]

where \(\psi\) is a function defined as

\[
\psi(x) =
\begin{cases}
0, & \text{if } x \le 0 \\
x, & \text{if } x > 0
\end{cases}
\tag{2}
\]

The degree of applicability of a rule can be determined by applying a t-norm to the degrees of matching between each input variable of the rule and its corresponding antecedent label. GRLFC inherits its t-norm from GARIC, in which SoftMin, defined in equation (3), is used as the t-norm; we assume the same value of k as in [3]. Since we do not tune the antecedent labels and we do not need differentiability of the t-norm, any other kind of t-norm would also be appropriate for our system. However, the results should be validated and justified based on the situation of the problem domain. Using a parametric t-conorm and t-norm would also require a new optimization module to tune the parameters of the union and intersection operators; we usually use neural networks together with an evolutionary algorithm for this purpose. In future work, we will use other kinds of t-conorms and t-norms and compare the results obtained; we consider this necessary in developing adaptive systems.

\[
\mathrm{SoftMin}(x_1, x_2, \ldots, x_n) = \frac{\sum_{i=1}^{n} x_i \, e^{-k x_i}}{\sum_{i=1}^{n} e^{-k x_i}}
\tag{3}
\]

Thus, using the SoftMin operator and the degrees of matching, the degree of applicability, or degree of firing, of rule \(R_i\) can be calculated by

\[
w_i = \mathrm{SoftMin}\big(\mathrm{MaxMin}(A_1, B_{i1}), \ldots, \mathrm{MaxMin}(A_n, B_{in})\big)
\tag{4}
\]

where \(w_i\) is the degree of firing of rule \(R_i\), \(A_j\) is the j-th input variable, and \(B_{ij}\) is the j-th antecedent label in the i-th rule, \(R_i\). By applying the degree of firing of a rule to its consequent part, the output of the rule can be calculated. The defuzzified output is calculated by

\[
z_i = \mu_{C_i}^{-1}(w_i)
\tag{5}
\]

where \(\mu^{-1}\) is a defuzzification method, \(C_i\) is the consequent label of the i-th rule, and \(z_i\) is the defuzzified output for the i-th rule. In this paper, we use the Local Mean of Maximum (LMOM) method for defuzzification [3]. Combining the outputs of all rules, a crisp control action in the form of a weighted average is obtained, using the following equation:

\[
F = \frac{\sum_{i=1}^{m} w_i z_i}{\sum_{i=1}^{m} w_i}
\tag{6}
\]

where m is the number of rules. This can also be extended to multiple output variables.
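As an illustration of equations (1)-(6), the complete forward pass of such an FIS can be sketched in a few lines of Python. The sketch below is illustrative only: the Tri container, the rule encoding, and the default k = 10 for SoftMin are assumptions made for the example rather than prescriptions from the text.

```python
import math
from dataclasses import dataclass

@dataclass
class Tri:
    """Triangular fuzzy number/label: centre c, left spread ls, right spread rs."""
    c: float
    ls: float
    rs: float

def psi(x):
    # Eq. (2): clip negative values to zero.
    return x if x > 0.0 else 0.0

def maxmin(a, b):
    # Eq. (1): degree of matching between fuzzy input a and antecedent label b.
    # As written, it measures the overlap of a's right side with b's left side.
    if a.rs == 0.0 and b.ls == 0.0:
        return 1.0
    if a.rs == 0.0 or b.ls == 0.0:
        return 0.0
    return psi((a.c - b.c + a.rs + b.ls) / (a.rs + b.ls))

def softmin(values, k=10.0):
    # Eq. (3): SoftMin t-norm; approaches min(values) as k grows.
    den = sum(math.exp(-k * x) for x in values)
    return sum(x * math.exp(-k * x) for x in values) / den

def lmom(label, w):
    # Eqs. (5)/(11): Local Mean of Maximum of the consequent label clipped at height w.
    return label.c + 0.5 * (label.rs - label.ls) * (1.0 - w)

def fis_output(rules, fuzzy_inputs, k=10.0):
    # Eqs. (4)-(6): firing degrees, per-rule outputs, crisp weighted-average action.
    ws, zs = [], []
    for antecedents, consequent in rules:       # antecedents: list of (input index, Tri label)
        match = [maxmin(fuzzy_inputs[j], lbl) for j, lbl in antecedents]
        w = softmin(match, k)
        ws.append(w)
        zs.append(lmom(consequent, w))
    return sum(w * z for w, z in zip(ws, zs)) / max(sum(ws), 1e-12)

# Example: one two-input rule "IF x1 is ZERO and x2 is POSITIVE THEN force is POSITIVE SMALL".
rules = [([(0, Tri(0.0, 0.3, 0.3)), (1, Tri(1.0, 1.0, 1.0))], Tri(2.0, 1.0, 1.0))]
print(fis_output(rules, [Tri(-0.05, 0.01, 0.01), Tri(0.8, 0.01, 0.01)]))
```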

3 Reinforcement Learning

Similar to many other techniques in the field of artificial intelligence, reinforcement learning (RL) has its roots in psychology. The idea behind RL is learning from experience and through trial-and-error interactions with a dynamic environment, similar to what any intelligent creature would do during its lifetime.

In many tasks to which we would like to apply reinforcement learning, most states encountered have never been exactly experienced before. This is almost always the case when the state or action spaces include continuous variables or complex and vague sensations. Therefore, a generalization method is needed; more specifically, the kind of generalization we require is often called function approximation, because it takes examples from a desired function (e.g., a value function) and attempts to generalize from them to construct an approximation of the entire function. Fuzzy inference systems (FIS) are appropriate tools for generalization. The use of an FIS, as opposed to a global function approximator such as a neural network, has two major advantages: 1) the inherent locality property of an FIS permits the introduction of human knowledge, and 2) it localizes the learning process to only the implicated parameters [5]. However, the process of fine-tuning the fuzzy controller still remains a difficult task.

It should be noted that supervised learning of a controller requires training data or a teacher for the subject domain. In most real-world applications, training data are often hard to obtain or may not be available at all. An approach to solving this problem is based on reinforcement learning, a paradigm that stems from the desire to make systems that learn from autonomous interactions with their environments. Therefore, reinforcement learning techniques can be effective for the fine-tuning of fuzzy controllers when no training data are available and only a weak reinforcement can be obtained.

This area of research has attracted many researchers. Some have applied RL techniques using a conventional neural network as the critic while the actor is an FIS [3,7]. Others have used fuzzy inference systems to increase the knowledge of the critic about the goodness of the states and consequently enhance the performance of the system [4,5]. However, few researchers have considered the vagueness in the input states. In this paper, we propose a model that captures the uncertainty in the state of the system. In addition, the proposed model demonstrates superior performance in comparison with similar models, even in the presence of uncertainty. The proposed GRLFC is tested and validated on several test cases.

4 The Architecture of GRLFC

In the proposed model, a fuzzy controller is implemented in the form of an FIS which plays the role of the Actor. The Actor implements the knowledge of the expert operator about how to control the system. The Critic, which evaluates the value of the current state, is another FIS, and it incorporates the knowledge about the goodness of the states of the plant. Both of these components simultaneously learn to improve their performance through interaction with a dynamic environment and by receiving a reinforcement signal.

Fig. 2. The architecture of GRLFC.

The architecture of GRLFC is shown in Fig. 2. The system has five components: Fuzzifier, Actor, Critic, Explorer and Updater. The current state of the plant is fed into the Fuzzifier, which captures the vagueness of that state. The Actor uses this fuzzy state to determine the action, and the Critic evaluates its value. Combining the value determined by the Critic and the reinforcement signal, an internal reinforcement is generated, which is used for fine-tuning both the Actor and the Critic. Learning in both the Actor and the Critic is done by tuning the parameters of the consequent-part labels. The Explorer perturbs the action suggested by the Actor in order to provide a better search of the state space. To accomplish this task, the Explorer uses the current episode information, the internal reinforcement, and the action suggested by the Actor. In what follows, we present each component in more detail.
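To make the interplay of the five components concrete, the following is a minimal sketch of one learning episode written against hypothetical component interfaces (plant, fuzzifier, actor, critic, explorer, updater); the names and method signatures are assumptions introduced only for this sketch, and they simply mirror the data flow of Fig. 2 and the equations of the following subsections.

```python
def run_episode(plant, fuzzifier, actor, critic, explorer, updater, gamma=0.95):
    """One GRLFC episode: Fuzzifier -> Actor -> Explorer -> plant, with the Critic
    producing the internal reinforcement (TD error) used by the Updater."""
    state = plant.reset()
    fuzzy_state = fuzzifier(state)                 # crisp sensor reading -> triangular fuzzy state
    v_prev = critic.value(fuzzy_state)
    steps, failed = 0, False
    while not failed:
        f_suggested = actor.act(fuzzy_state)       # action proposed by the Actor FIS, eq. (6)
        f_applied = explorer.explore(f_suggested)  # eq. (7), using the stored TD error and previous episode length
        state, reward, failed = plant.step(f_applied)
        fuzzy_state = fuzzifier(state)
        v = critic.value(fuzzy_state)
        delta = (reward - v_prev) if failed else (reward + gamma * v - v_prev)  # eq. (13)
        s = explorer.perturbation(f_applied, f_suggested, delta)                # eq. (8)
        updater.update_actor(actor, delta, s)      # eq. (9)
        updater.update_critic(critic, delta, s)    # eq. (14)
        explorer.observe(delta)                    # delta drives eq. (7) at the next step
        v_prev = v
        steps += 1
    explorer.end_episode(steps)                    # episode length T feeds the next episode's eq. (7)
    return steps
```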

4.1 Fuzzifier

As already mentioned, in many real-world problems uncertainty exists in the states of the system. This can be caused by many factors, such as uncertainty in sensor readings or the inherently uncertain nature of the states of the system (linguistic input states).

The Fuzzifier accounts for the uncertainty in the input variables by constructing a symmetric or asymmetric triangular fuzzy membership function from the crisp input states. In other words, the crisp input becomes the centre of the triangular membership function, and the spreads are determined by the Fuzzifier according to the specifications of the problem. Thus, the shape of the constructed membership function is determined by the nature and amount of uncertainty in the input states.

Fig. 3. FIS scheme.

4.2 Actor and Critic

Both the Actor and the Critic are fuzzy inference systems as described in Section 2, and have similar architectures, depicted in Fig. 3. In this scheme, the first layer is the input layer, which accepts triangular fuzzy membership functions as fuzzy states. Layer 2 contains the antecedent labels and determines the degree of matching using the MaxMin operator. In layer 3, the degree of firing of each rule is calculated using the SoftMin operator applied to the degrees of matching between the fuzzy input variables and their corresponding antecedent labels. Consequent labels are in layer 4, where defuzzification is performed using the LMOM defuzzification method [3] and the output of each rule is calculated. Layer 5 is the output layer, in which the crisp control action is determined.

4.3 Explorer

This component makes a trade-off between exploration and exploitation of the state space. Since we assume that the knowledge-bases of the Actor and the Critic may be rough and out of tune, in the early steps of the simulation (or the actual run) the state space must be sufficiently explored. As time passes and the Actor learns to suggest more appropriate actions, the Critic learns to correctly predict the state values through trial-and-error interactions with the environment. Therefore, exploration is smoothly substituted by exploitation. The Explorer accomplishes this task by perturbing the action, F, suggested by the Actor, using the TD prediction error [8], δ, given by equation (13), and the length of the previous episode, T. This process is done via the following equation:

\[
F' = F + R\,\sigma(\delta, T)
\tag{7}
\]
where R is a standard uniformly distributed random variable defined on [0, 1], and σ is a function that decreases monotonically with T and depends on the magnitude of δ. In this way, when the magnitude of δ is large (small), there will be a large (small) difference between the Actor's suggested action, F, and what is actually applied to the plant, F'. This provides the exploration of the state space. The Explorer also provides the perturbation term needed by the Updater in updating the parameters of the Actor and the Critic. The perturbation, s, is calculated using the following equation:

\[
s = \exp\!\left(-\big((F' - F)\,\delta\big)^{2}\right)
\tag{8}
\]

4.4 Updater

This component tunes the labels in the consequent parts of the Actor and the Critic using a decaying learning rate, the TD error, and the gradient of each FIS (Actor and Critic) with respect to the parameters of the consequent parts of the corresponding FIS. To be more specific, the centers and the spreads of the labels in the consequent parts of each FIS are tuned.

For the Actor, the parameters are tuned with the objective of maximizing the state value, so that the system ends up in a good state and eventually avoids failure. This is done through equation (9), in which s is the perturbation term calculated by the Explorer, p is a parameter of the Actor to be tuned, v is the value of the state calculated by the Critic, δ is the TD error, and sgn(·) is the sign function. The reason sgn(δ) is used is that when this term is negative, the current state is worse than the previous one and therefore a step in the opposite direction is needed; on the other hand, when this term is positive, the current state is better than the previous one and therefore the current step should be taken in the same direction as the previous step.

\[
\Delta p = \alpha\,\mathrm{sgn}(\delta)\,\frac{\partial v}{\partial p}\,s
         = \alpha\,\mathrm{sgn}(\delta)\,\mathrm{sgn}\!\left(\frac{\partial v}{\partial F}\right)\frac{\partial F}{\partial p}\,s
\tag{9}
\]

To calculate equation (9), we need the two derivatives on its right-hand side. The derivative ∂v/∂F is approximated using (10):

\[
\frac{\partial v}{\partial F} \approx \frac{dv}{dF} \approx \frac{v_t - v_{t-1}}{F_t - F_{t-1}}
\tag{10}
\]

Since this approximation ignores the change in state between successive time steps, it is a very crude estimator of the derivative, and thus we consider only the sign of this estimator and not its magnitude. The existence of the derivative is an implicit assumption [4].

Calculation of ∂F/∂p is not difficult. C_i, a label in the consequent part of the Actor, is parameterized by p, which may be its center, its left spread, or its right spread, and R_i is the rule whose consequent label is C_i. In addition, z_i is the defuzzified output of the rule, obtained with the LMOM defuzzification method [4], and w_i is the degree of firing of rule R_i. Thus,

\[
z_i = C_i + \tfrac{1}{2}\,(RS_i - LS_i)\,(1 - w_i)
\tag{11}
\]

In (11), C_i is the center of the label and RS_i and LS_i are its right and left spreads, respectively. The derivative needed for tuning the Actor can then be calculated from (6) and (11):

\[
\frac{\partial F}{\partial p} = \frac{w_i}{\sum_{j=1}^{m} w_j}\,\frac{\partial z_i}{\partial p}
\tag{12}
\]

In the above equation, i ∈ {1, 2, …, m}, where m is the number of rules in the knowledge-base of the Actor.
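The following sketch shows one way equations (7)-(12) could be realized in Python for a single time step. The functional form chosen for σ(δ, T) and the attribute names on the hypothetical actor object (its consequent labels and last firing degrees) are assumptions made for this example; only the update arithmetic follows the equations above.

```python
import math
import random

def explore(f_suggested, delta, T_prev, sigma0=1.0, tau=200.0):
    # Eq. (7). The schedule sigma0 * exp(-T/tau) * |delta|/(1+|delta|) is an assumed choice:
    # it decays with the previous episode length T and grows with the TD-error magnitude.
    sigma = sigma0 * math.exp(-T_prev / tau) * abs(delta) / (1.0 + abs(delta))
    return f_suggested + random.random() * sigma

def perturbation(f_applied, f_suggested, delta):
    # Eq. (8): modulation term passed to the Updater.
    return math.exp(-((f_applied - f_suggested) * delta) ** 2)

def update_actor(actor, delta, s, f_hist, v_hist, alpha):
    """Eqs. (9)-(12): tune centre and spreads of every consequent label of the Actor.
    Assumes actor.consequents is a list of labels with attributes c, rs, ls, and
    actor.w holds the firing degrees of the last inference."""
    dF = f_hist[-1] - f_hist[-2]
    dv = v_hist[-1] - v_hist[-2]
    dv_dF_sign = math.copysign(1.0, dv / dF) if dF != 0.0 else 0.0     # eq. (10), sign only
    w_sum = sum(actor.w)
    for label, w in zip(actor.consequents, actor.w):
        dF_dz = w / w_sum                                              # from eq. (6)
        # partial derivatives of eq. (11) with respect to centre, right spread, left spread
        for name, dz_dp in (("c", 1.0), ("rs", 0.5 * (1.0 - w)), ("ls", -0.5 * (1.0 - w))):
            dF_dp = dF_dz * dz_dp                                      # eq. (12)
            step = alpha * math.copysign(1.0, delta) * dv_dF_sign * dF_dp * s   # eq. (9)
            setattr(label, name, getattr(label, name) + step)
```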

For the Critic, the objective is to minimize the TD prediction error, δ, given by the following equation:

\[
\delta_t =
\begin{cases}
0, & \text{starting state} \\
r_{t+1} - v_t(s_t), & \text{failure state} \\
r_{t+1} + \gamma\,v_t(s_{t+1}) - v_t(s_t), & \text{otherwise}
\end{cases}
\tag{13}
\]

where v_t(s_j) is the Critic's estimate of the value of state s_j at time step t. The term r_{t+1} + γ v_t(s_{t+1}) is actually an estimate of v_t(s_t), and therefore δ_t is the error of that estimate. γ is the reward discount rate, which determines the importance of future time steps in the current learning cycle. If γ is set to 0, only the immediate reward is considered; if it is set to 1, the values of all future states and the effects of the applied actions are taken into account in the learning process. In our experiments we use γ = 0.95, since this value is commonly used by researchers and yields better results.

The learning rule for the Critic is similar to that of the Actor and is given by (14), in which β is the learning rate and, like α, it is a small positive variable:

\[
\Delta p = \beta\,\mathrm{sgn}(\delta)\,\frac{\partial \delta}{\partial p}\,s
         = \beta\,\mathrm{sgn}(\delta)\,\frac{\partial \delta}{\partial v}\,\frac{\partial v}{\partial p}\,s
\tag{14}
\]

In (14), the term ∂v/∂p can easily be calculated in the same way as ∂F/∂p in (12), because the Critic, like the Actor, is an FIS. The other term, ∂δ/∂v, is approximated using (15), assuming that the derivative does not depend on r:

\[
\frac{\partial \delta}{\partial v} \approx \frac{d\delta}{dv} \approx -(1 - \gamma) + \gamma\,dv
\tag{15}
\]

where, similar to the Actor learning rules, only the sign of ∂δ/∂v is used in the calculations, and dv is given by the finite difference v_{t+1} − v_t.

An episode begins when the simulation (or the run of the actual system) starts and ends when a failure occurs. At the beginning of each episode, the learning rates are set to a relatively large value. During an episode, the learning rates first decay quickly to small values; from that point forward they decay after every N time steps, to provide more exploitation of the good policy found by the controller.
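A compact sketch of the Critic side of the Updater, following equations (13)-(15), is given below; as with the Actor sketch, the attributes assumed on the critic object (its consequent labels, last firing degrees, and successive state values) are hypothetical names introduced only for illustration.

```python
import math

def td_error(r_next, v_now, v_next, failed, starting, gamma=0.95):
    # Eq. (13): TD prediction error.
    if starting:
        return 0.0
    if failed:
        return r_next - v_now
    return r_next + gamma * v_next - v_now

def update_critic(critic, delta, s, beta, gamma=0.95):
    """Eqs. (14)-(15): tune the Critic's consequent labels using the sign of d(delta)/dv."""
    dv = critic.v_next - critic.v_now                                  # finite difference v_{t+1} - v_t
    ddelta_dv_sign = math.copysign(1.0, -(1.0 - gamma) + gamma * dv)   # eq. (15), sign only
    w_sum = sum(critic.w)
    for label, w in zip(critic.consequents, critic.w):
        dv_dz = w / w_sum                                              # analogue of eq. (12) for the Critic
        for name, dz_dp in (("c", 1.0), ("rs", 0.5 * (1.0 - w)), ("ls", -0.5 * (1.0 - w))):
            step = beta * math.copysign(1.0, delta) * ddelta_dv_sign * dv_dz * dz_dp * s
            setattr(label, name, getattr(label, name) + step)
```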

5 Experiments

To show the efficiency and superiority of the proposed system, we applied GRLFC to the well-known Cart-Pole control problem. In this problem, a pole is hinged to a cart which moves along a track. The control objective is to apply an appropriate force, F, to keep the pole in a vertical position and the cart within the track boundaries. The state of the system is determined by (x, ẋ, θ, θ̇), in which x and ẋ are the displacement and velocity of the cart, and θ and θ̇ are the angular displacement and angular velocity of the pole, respectively. A failure occurs when either |θ| > 12°, |x| > 2.4 m, or |F| > 10 N, whereas success is declared when the pole stays balanced for 100,000 time steps. In our experiments, δ is calculated using γ = 0.95. In addition, the half-pole length is l_p = 0.5 m, the pole mass is m_p = 0.1 kg, and the cart mass is m_c = 1.0 kg. The dynamics of the cart-pole system are modeled by the following nonlinear differential equations [3]:

\[
\ddot{\theta} = \frac{g\sin\theta + \cos\theta\,\dfrac{-F - m_p l_p \dot{\theta}^2 \sin\theta + \mu_c\,\mathrm{sgn}(\dot{x})}{m_c + m_p} - \dfrac{\mu_p \dot{\theta}}{m_p l_p}}
                {l_p\!\left(\dfrac{4}{3} - \dfrac{m_p \cos^2\theta}{m_c + m_p}\right)},
\qquad
\ddot{x} = \frac{F + m_p l_p\!\left(\dot{\theta}^2 \sin\theta - \ddot{\theta}\cos\theta\right) - \mu_c\,\mathrm{sgn}(\dot{x})}{m_c + m_p}
\tag{16}
\]

One of the important strengths of the proposed model is its capability of capturing the uncertainty in the state of the system. In our particular experiment with the cart-pole problem, vagueness may be caused by uncertainty in sensor readings. The Fuzzifier captures this uncertainty by constructing a triangular fuzzy membership function that has the crisp state as its center, C, with C − LS and C + RS as its left and right end points, respectively. In this experiment, we consider symmetric membership functions with RS = LS = 0.01. Moreover, the Actor has the same 9+4 rules as the Action Selection Network (ASN) and the Critic has the same 5+5 rules as the Action Evaluation Network (AEN) in GARIC [4]. The Actor rules are as follows.

1. If Theta is POSITIVE and ThetaDot is POSITIVE then Force is POSITIVE LARGE.
2. If Theta is POSITIVE and ThetaDot is ZERO then Force is POSITIVE MEDIUM.
3. If Theta is POSITIVE and ThetaDot is NEGATIVE then Force is ZERO.
4. If Theta is ZERO and ThetaDot is POSITIVE then Force is POSITIVE SMALL.
5. If Theta is ZERO and ThetaDot is ZERO then Force is ZERO.
6. If Theta is ZERO and ThetaDot is NEGATIVE then Force is NEGATIVE SMALL.
7. If Theta is NEGATIVE and ThetaDot is POSITIVE then Force is ZERO.
8. If Theta is NEGATIVE and ThetaDot is ZERO then Force is NEGATIVE MEDIUM.
9. If Theta is NEGATIVE and ThetaDot is NEGATIVE then Force is NEGATIVE LARGE.
10. If Theta is VERY SMALL and ThetaDot is VERY SMALL and x is POSITIVE and xDot is POSITIVE then Force is POSITIVE SMALL.
11. If Theta is VERY SMALL and ThetaDot is VERY SMALL and x is POSITIVE and xDot is POSITIVE SMALL then Force is POSITIVE VERY SMALL.
12. If Theta is VERY SMALL and ThetaDot is VERY SMALL and x is NEGATIVE and xDot is NEGATIVE then Force is NEGATIVE SMALL.
13. If Theta is VERY SMALL and ThetaDot is VERY SMALL and x is NEGATIVE and xDot is NEGATIVE SMALL then Force is NEGATIVE VERY SMALL.

The Critic has the following rules:

1. If Theta is POSITIVE and ThetaDot is POSITIVE then State is BAD.
2. If Theta is POSITIVE and ThetaDot is NEGATIVE then State is OK.
3. If Theta is ZERO and ThetaDot is ZERO then State is GOOD.
4. If Theta is NEGATIVE and ThetaDot is POSITIVE then State is OK.
5. If Theta is NEGATIVE and ThetaDot is NEGATIVE then State is BAD.
6. If x is POSITIVE and xDot is POSITIVE then State is BAD.
7. If x is POSITIVE and xDot is NEGATIVE then State is OK.
8. If x is ZERO and xDot is ZERO then State is GOOD.
9. If x is NEGATIVE and xDot is POSITIVE then State is OK.
10. If x is NEGATIVE and xDot is NEGATIVE then State is BAD.

The Actor and the Critic knowledge-bases are depicted in Fig. 4 and Fig. 5, respectively. Fig. 6 depicts the 11th rule of the Actor as an example. Fig. 7 and Fig. 8 summarize the rules of the Actor and the Critic, respectively. Figures 9-15 show the simulation results of experiments in which various labels of the Actor or the Critic are damaged and the system manages to tune those parameters back to their appropriate values. In particular, Fig. 12 depicts the success of GRLFC in accomplishing a difficult task in which two of the most important labels of both the Actor and the Critic are damaged and the spreads are set to 0.1, that is, 10 times more vagueness in the input variables; i.e., GRLFC is capable of learning in a few trials even in the presence of vagueness in the input states provided by slower sensor signals. In addition, Fig. 13 depicts the robustness of GRLFC in the case of changes in the dynamics of the plant; Fig. 13 illustrates only the early stages of learning for this task, i.e., learning took more time steps to complete. In Fig. 14, various labels of the Actor are damaged. Furthermore, the simulation time step is increased to 0.06 s (from 0.02 s), i.e., the sensors respond 3 times more slowly, and the starting states are random. Fig. 14 also shows that GRLFC was successful in avoiding failure for 33 minutes, i.e., 33,000 simulation time steps in this experiment.
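All of the simulations above integrate the cart-pole dynamics of equation (16). For reference, a self-contained Python sketch of one integration step with the parameter values used in this section is given below; the gravitational constant, the friction coefficients μ_c and μ_p, and the use of simple Euler integration are assumptions corresponding to common settings for this benchmark rather than values stated in the text.

```python
import math

G = 9.8                          # gravitational acceleration (m/s^2), assumed
M_C, M_P = 1.0, 0.1              # cart mass and pole mass (kg), as in this section
L_P = 0.5                        # half-pole length (m), as in this section
MU_C, MU_P = 0.0005, 0.000002    # cart and pole friction coefficients, assumed

def cart_pole_step(x, x_dot, theta, theta_dot, F, dt=0.02):
    """One Euler integration step of the nonlinear cart-pole dynamics, eq. (16)."""
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    total_m = M_C + M_P
    temp = (-F - M_P * L_P * theta_dot ** 2 * sin_t + MU_C * math.copysign(1.0, x_dot)) / total_m
    theta_acc = (G * sin_t + cos_t * temp - MU_P * theta_dot / (M_P * L_P)) / \
                (L_P * (4.0 / 3.0 - M_P * cos_t ** 2 / total_m))
    x_acc = (F + M_P * L_P * (theta_dot ** 2 * sin_t - theta_acc * cos_t)
             - MU_C * math.copysign(1.0, x_dot)) / total_m
    x, x_dot = x + dt * x_dot, x_dot + dt * x_acc
    theta, theta_dot = theta + dt * theta_dot, theta_dot + dt * theta_acc
    failed = abs(theta) > math.radians(12.0) or abs(x) > 2.4 or abs(F) > 10.0
    return x, x_dot, theta, theta_dot, failed
```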

In both of these difficult tasks (i.e., the ones whose results are shown in Fig. 12 and Fig. 14), GRLFC managed to adapt to the new situation after only a few trials. In Fig. 15, the tolerance of the failure conditions is narrowed, i.e., the failure conditions become |θ| > 6°, |x| > 2.4 m and |F| > 10 N. In addition, the half-length of the pole is decreased from 0.5 m to 0.4 m. GRLFC adapts itself to the new situation after only 2 trials.

Fig. 4. The Actor's knowledge base.

Fig. 5. The Critic's knowledge base.

Fig. 6. The 11th rule of the Actor: "theta" is VERY SMALL, "thetadot" is VERY SMALL, "x" is POSITIVE, "xdot" is POSITIVE SMALL, "force" is POSITIVE VERY SMALL.

Journl of Uncertn Systems, ol., No., pp.54-66, 8 63 NE ZE PO S NE NL NS ZE ZE NM ZE PM S PO ZE PS PL NE ZE PO S NE NS ZE S NS PS PO PS Fg. 7. 9+4 rules used n Actor. PO ZE NE PO BAD OK ZE GOOD NE OK BAD PO ZE NE PO BAD OK ZE GOOD NE OK BAD Fg. 8. 5+5 rules used n Crtc..5. crt poston (m) -.5 - -.5 - pole poston (rd) -. -.4 -.6 -.8 -. -.5 5 5 5 3 35 tme steps () Crt poston -. 5 5 5 3 35 tme steps (b) Pole poston Fg. 9. The center of ZE force lbel ws shfted to +5. The system shfted t bck to bout. Ech epsode strted n non-rndom stte. The lernng took only 3 trls..5.6.4.5. crt poston (m) -.5 - pole poston (rd) -. -.4 -.5 -.6 - -.8 -.5 3 4 5 6 tme steps () Crt poston -. 3 4 5 6 tme steps (b) Pole poston Fg.. The center of ZE force lbel ws shfted to +3. The system shfted t bck to bout. Ech epsode strted n rndom stte. The lernng took only 4 trls.

Fig. 11. (a) Cart position (m) and (b) pole position (rad) versus time steps. The center of the GOOD label was shifted to 0.5; the system shifted it back toward its appropriate value. Each episode started in a non-random state. The learning took only 3 trials.

Fig. 12. (a) Cart position and (b) pole position. The center of the GOOD label and the center of the ZE label were damaged (GOOD −0.5 and ZE +3). In addition, the spreads were set to 0.1. The learning took only 4 trials.

Fig. 13. (a) Cart position (m) and (b) pole position (rad). The mass of the cart is changed from its nominal value of 1.0 kg. The system adapted to the new situation without failure.

Fig. 14. (a) Cart position and (b) pole position. Various force labels were damaged (the centers of ZE, PL, PM, PS, and NM were shifted, e.g., PL by +5 and PM by +3). In addition, the simulation time step was increased to 0.06 s from 0.02 s. Each episode started in a random state. The learning took 6 trials, and GRLFC then avoided failure for 33 minutes.

Fig. 15. (a) Cart position and (b) pole position. The tolerance of the failure conditions is changed and the half-length of the pole is decreased from 0.5 m to 0.4 m. The system adapted to the new situation with only one failure (2 trials).

Table 1. Learning method comparison on the cart-pole problem.

System   Proposed by             # of Trials   Vagueness in Input States
--       O. Esogbue et al.       --            No
--       S. M. Mustapha et al.   5             No
GARIC    H. R. Berenji et al.    4             No
FACL     L. Jouffe               3.7           No
GRLFC    M. H. Fazel et al.      3.35          Yes

6 Conclusions and Future Works

This research has modified the GARIC model [3]. The proposed model extends GARIC in several ways. The model is capable of capturing the vagueness in input variables. Furthermore, the learning speed is increased, and the number of failures that occur before success is decreased, by using variable learning rates. The learning strategy for the Critic is different from that of the Action Evaluation Network (AEN). This new strategy stems from viewing the so-called internal reinforcement from another point of view; more specifically, from considering the internal reinforcement to be the temporal-difference error (TD error). The Explorer component also extends the Stochastic Action Modifier (SAM) in GARIC to provide better exploration of the state space. Simulation results show the superiority and efficiency of the proposed GRLFC in comparison with other models. Table 1 summarizes the comparison between the learning models, regarding the speed of learning and the features of each.

An interesting topic for future work is to use a linguistic output instead of a crisp one. This would capture the vagueness in the output variables. In this case, instead of applying the TSK-like method used in our model, approaches such as the Mamdani or Logical methods, which are able to generate and process linguistic outputs, should be incorporated. Using these methods, the performance of each of these techniques and their combinations can be investigated. Considering different spreads for the triangular fuzzy membership function of each element of the input vector is another topic for future work. Investigating the effect of the shape of these membership functions, including their symmetry and spreads, can also be an interesting study. Moreover, instead of using the gradient descent method, applying a more efficient technique may improve the quality of the solution and further decrease the number of time steps needed to reach it.

References

[1] Lin, C. and Y. Xu, A novel genetic reinforcement learning for nonlinear fuzzy control problems, Neurocomputing, vol. 69, nos. 16-18, pp. 2078-2089, 2006.
[2] Lin, C. and C. Lee, Reinforcement structure/parameter learning for neural-network-based fuzzy logic control systems, IEEE Trans. Fuzzy Systems, vol. 2, no. 1, pp. 46-63, 1994.
[3] Berenji, H. R. and P. S. Khedkar, Learning and tuning fuzzy logic controllers through reinforcements, IEEE Trans. on Neural Networks, vol. 3, no. 5, pp. 724-740, 1992.
[4] Berenji, H. R. and P. S. Khedkar, Using fuzzy logic for performance evaluation in reinforcement learning, International Journal of Approximate Reasoning, pp. 131-144, 1997.
[5] Jouffe, L., Actor-critic learning based on fuzzy inference system, Proc. of the IEEE International Conference on Systems, Man, and Cybernetics, vol. 1, pp. 339-344, 1996.
[6] Jouffe, L., Fuzzy inference system learning by reinforcement methods, IEEE Trans. Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 28, no. 3, pp. 338-355, 1998.
[7] Chiang, C., H. Chung and J. Lin, A self-learning fuzzy logic controller using genetic algorithms with reinforcements, IEEE Trans. Fuzzy Systems, vol. 5, no. 3, pp. 460-467, 1997.
[8] Sutton, R. S. and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[9] Mustapha, S. M. and G. Lachiver, A modified actor-critic reinforcement learning algorithm, Proc. of the Canadian Conference on Electrical and Computer Engineering, Halifax, NS, Canada, pp. 66-69, 2000.
[10] Seo, H. S., S. J. Youn and K. W. Oh, A fuzzy reinforcement function for the intelligent agent to process vague goals, Proc. of the 19th International Conference of NAFIPS, Atlanta, GA, USA, July 13-15, 2000.
[11] Feng, H. M., A self-tuning fuzzy control system design, Proc. of the Joint Conference of IFSA and NAFIPS, Vancouver, BC, Canada, July 25-28, 2001.