
SOME STUDIES IN MACHINE LEARNING USING THE GAME OF CHECKERS

A. L. Samuel

Introduction

The studies reported here have been concerned with the programming of a digital computer to behave in a way which, if done by human beings or animals, would be described as involving the process of learning. While this is not the place to dwell on the importance of machine-learning procedures, or to discourse on the philosophical aspects,1 there is obviously a very large amount of work, now done by people, which is quite trivial in its demands on the intellect but does, nevertheless, involve some learning. We have at our command computers with adequate data-handling ability and with sufficient computational speed to make use of machine-learning techniques, but our knowledge of the basic principles of these techniques is still rudimentary. Lacking such knowledge, it is necessary to specify methods of problem solution in minute and exact detail, a time-consuming and costly procedure. Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.

1 Some of these are quite profound and have a bearing on the questions raised by Nelson Goodman in Fact, Fiction and Forecast, Cambridge, Mass.: Harvard, 1954.

General Methods of Approach

At the outset it might be well to distinguish sharply between two general approaches to the problem of machine learning. One method, which might be called the Neural-Net Approach, deals with the possibility of inducing learned behavior into a randomly connected switching net (or its simulation on a digital computer) as a result of a reward-and-punishment routine. A second, and much more efficient approach, is to produce the equivalent of a highly organized network which has been designed to learn only certain specific things. The first method should lead to the development of general-purpose learning machines. A comparison between the size of the switching nets that can be reasonably constructed or simulated at the present time and the size of the neural nets used by animals suggests that we have a long way to go before we obtain practical devices.2 The second procedure requires reprogramming for each new application, but it is capable of realization at the present time. The experiments to be described here were based on this second approach.

Choice of Problem

For some years the writer has devoted his spare time to the subject of machine learning and has concentrated on the development of learning procedures as applied to games.3 A game provides a convenient vehicle for such study as contrasted with a problem taken from life, since many of the complications of detail are removed. Checkers, rather than chess (Shannon, 1950; Bernstein and Roberts, 1958b; Kister et al., 1957; Newell, Shaw, and Simon, 1958b), was chosen because the simplicity of its rules permits greater emphasis to be placed on learning techniques. Regardless of the relative merits of the two games as intellectual pastimes, it is fair to state that checkers contains all of the basic characteristics of an intellectual activity in which heuristic procedures and learning processes can play a major role and in which these processes can be evaluated. Some of these characteristics might well be enumerated. They are:

(1) The activity must not be deterministic in the practical sense. There exists no known algorithm which will guarantee a win or a draw in checkers, and the complete explorations of every possible path through a checker game would involve perhaps 10^40 choices of moves which, at 3 choices per millimicrosecond, would still take 10^21 centuries to consider.
(2) A definite goal must exist (the winning of the game) and at least one criterion or intermediate goal must exist which has a bearing on the achievement of the final goal and for which the sign should be known. In checkers the goal is to deprive the opponent of the possibility of moving, and the dominant criterion is the number of pieces of each color on the board. The importance of having a known criterion will be discussed later.

(3) The rules of the activity must be definite and they should be known. Games satisfy this requirement. Unfortunately, many problems of economic importance do not. While in principle the determination of the rules can be a part of the learning process, this is a complication which might well be left until later.

(4) There should be a background of knowledge concerning the activity against which the learning progress can be tested.

(5) The activity should be one that is familiar to a substantial body of people so that the behavior of the program can be made understandable to them. The ability to have the program play against human opponents (or antagonists) adds spice to the study and, incidentally, provides a convincing demonstration for those who do not believe that machines can learn.

2 Warren S. McCulloch (1949) has compared the digital computer to the nervous system of a flatworm. To extend this comparison to the situation under discussion would be unfair to the worm, since its nervous system is actually quite highly organized as compared with the random-net studies by Farley and Clark (1954), Rochester, Holland, Haibt, and Duda (1956), and by Rosenblatt (1958).

3 The first operating checker program for the IBM 701 was written in 1952. This was recoded for the IBM 704 in 1954. The first program with learning was completed in 1955 and demonstrated on television on February 24, 1956.

Having settled on the game of checkers for our learning studies, we must, of course, first program the computer to play legal checkers; that is, we must express the rules of the game in machine language and we must arrange for the mechanics of accepting an opponent's moves and of reporting the computer's moves, together with all pertinent data desired by the experimenter. The general methods for doing this were described by Shannon in 1950 as applied to chess rather than checkers. The basic program used in these experiments is quite similar to the program described by Strachey in 1952. The availability of a larger and faster machine (the IBM 704), coupled with many detailed changes in the programming procedure, leads to a fairly interesting game, even without any learning. The basic forms of the program will now be described.

The Basic Checker-playing Program

The computer plays by looking ahead a few moves and by evaluating the resulting board positions much as a human player might do.
Board positions are stored by sets of machine words, four words normally being used to represent any particular board position. Thirty-two bit positions (of the 36 available in an IBM 704 word) are, by convention, assigned to the 32 playing squares on the checkerboard, and pieces appearing on these squares are represented by 1's appearing in the assigned bit positions of the corresponding word.

"Looking ahead" is prepared for by computing all possible next moves, starting with a given board position. The indicated moves are explored in turn by producing new board-position records corresponding to the conditions after the move in question (the old board positions being saved to facilitate a return to the starting point) and the process can be repeated. This look-ahead procedure is carried several moves in advance, as illustrated in Fig. 1. The resulting board positions are then scored in terms of their relative value to the machine.

Figure 1. A "tree" of moves which might be investigated during the look-ahead procedure. The actual branchings are much more numerous than those shown, and the "tree" is apt to extend to as many as 20 levels.

The standard method of scoring the resulting board positions has been in terms of a linear polynomial. A number of schemes of an abstract sort were tried for evaluating board positions without regard to the usual checker concepts, but none of these was successful.4 One way of looking at the various terms in the scoring polynomial is that those terms with numerically small coefficients should measure criteria related as intermediate goals to the criteria measured by the larger terms. The achievement of these intermediate goals indicates that the machine is going in the right direction, such that the larger terms will eventually increase. If the program could look far enough ahead we need only ask, "Is the machine still in the game?"5 Since it cannot look this far ahead in the usual situation, we must substitute something else, say the piece ratio, and let the machine continue the look-ahead until one side has gained a piece advantage. But even this is not always possible, so we have the program test to see if the machine has gained a positional advantage, et cetera. Numerical measures of these various properties of the board positions are then added together (each with an appropriate coefficient which defines its relative importance) to form the evaluation polynomial.

4 One of the more interesting of these was to express a board position in terms of the first and higher moments of the white and black pieces separately about two orthogonal axes on the board. Two such sets of axes were tried, one set being parallel to the sides of the board and the second set being those through the diagonals.

More specifically, as defined by the rules for checkers, the dominant scoring parameter is the inability for one side or the other to move.6 Since this can occur but once in any game, it is tested for separately and is not included in the scoring polynomial as tabulated by the computer during play. The next parameter to be considered is the relative piece advantage. It is always assumed that it is to the machine's advantage to reduce the number of the opponent's pieces as compared to its own. A reversal of the sign of this term will, in fact, cause the program to play "giveaway" checkers, and with learning it can only learn to play a better and better giveaway game. Were the sign of this term not known by the programmer it could, of course, be determined by tests, but it must be fixed by the experimenter and, in effect, it is one of the instructions to the machine defining its task. The numerical computation of the piece advantage has been arranged in such a way as to account for the well-known property that it is usually to one's advantage to trade pieces when one is ahead and to avoid trades when behind.
Furthermore, it is assumed that kings are more valuable than pieces, the relative weights assigned to them being three to two.7 This ratio means that the program will trade three men for two kings, or two kings for three men, if by so doing it can obtain some positional advantage.

The choice for the parameters to follow this first term of the scoring polynomial and their coefficients then becomes a matter of concern. Two courses are open: either the experimenter can decide what these subsequent terms are to be, or he can arrange for the program to make the selection. We will discuss the first case in some detail in connection with the rote-learning studies and leave for a later section the discussion of various program methods of selecting parameters and adjusting their coefficients.

5 This apt phraseology was suggested by John McCarthy.

6 Not the capture of all the opponent's pieces, as popularly assumed, although all games end in this fashion.

7 The use of a weight ratio rather than this, conforming more closely to the values assumed by many players, can lead into certain logical complications, as found by Strachey (1952).

Figure 2. Simplified diagram showing how the evaluations are backed up through the "tree" of possible moves to arrive at the best next move. The evaluation process starts at (3). [Level labels in the figure: "Machine chooses branch with largest score"; "Opponent expected to choose branch with smallest score"; "Machine chooses branch with most positive score"; "Evaluations made at this level".]

It is not satisfactory to select the initial move which leads to the board position with the highest score, since to reach this position would require the cooperation of the opponent. Instead, an analysis must be made proceeding backward from the evaluated board positions through the "tree" of possible moves, each time with consideration of the intent of the side whose move is being examined, assuming that the opponent would always attempt to minimize the machine's score while the machine acts to maximize its score. At each branch point, then, the corresponding board position is given the score of the board position which would result from the most favorable move. Carrying this "minimax" procedure back to the starting point results in the selection of a "best move." The score of the board position at the end of the most likely chain is also brought back, and for learning purposes this score is now assigned to the present board position. This process is shown in Fig. 2.

The best move is executed, reported on the console lights, and tabulated by the printer. The opponent is then permitted to make his move, which can be communicated to the machine either by means of console switches or by means of punched cards. The computer verifies the legality of the opponent's move, rejecting8 or accepting it, and the process is repeated.
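The backing-up of scores through the tree can be illustrated with a minimal sketch. The nested-list tree, the function names `back_up` and `best_move`, and the leaf values are all hypothetical and chosen only for illustration; nothing here reproduces the IBM 704 program itself.

```python
from typing import List, Union

# A tree is either a leaf score (int) or a list of child subtrees.
Tree = Union[int, List["Tree"]]


def back_up(node: Tree, machine_to_move: bool) -> int:
    """Return the backed-up score of `node`, always from the machine's viewpoint."""
    if isinstance(node, int):
        return node  # an evaluated board position at the end of the look-ahead
    child_scores = [back_up(child, not machine_to_move) for child in node]
    # The machine maximizes its score; the opponent is assumed to minimize it.
    return max(child_scores) if machine_to_move else min(child_scores)


def best_move(root: List[Tree]) -> int:
    """Index of the machine's initial move with the largest backed-up score."""
    scores = [back_up(child, machine_to_move=False) for child in root]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical 2-ply tree: three machine moves, each with two opponent replies.
tree = [[3, 12], [-5, 40], [7, 9]]
print(back_up(tree, True), best_move(tree))
```

In this toy tree the highest leaf (40) sits under the second move, but reaching it would need the opponent's cooperation; the backed-up scores are 3, -5, and 7, so the third move is selected.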
When the program can look ahead and predict a win, this fact is reported on the printer. Similarly, the program concedes when it sees that it is going to lose.

8 The only departure from complete generality of the game as programmed is that the program requires the opponent to make a permissible move, including the taking of a capture if one is offered. "Huffing" is not permitted.

Ply Limitations

Playing-time considerations make it necessary to limit the look-ahead distance to some fairly small value. This distance is defined as the ply (a ply of 2 consisting of one proposed move by the machine and the anticipated reply by the opponent). The ply is not fixed but depends upon the dynamics of the situation, and it varies from move to move and from branch to branch during the move analysis. A great many schemes of adjusting the look-ahead distance have been tried at various times, some of them quite complicated. The most effective one, although quite detailed, is simple in concept and is as follows. The program always looks ahead a minimum distance, which for the opening game and without learning is usually set at three moves. At this minimum ply the program will evaluate the board position if none of the following conditions occurs: (1) the next move is a jump, (2) the last move was a jump, or (3) an exchange offer is possible. If any one of these conditions exists, the program continues looking ahead. At a ply of 4 the program will stop and evaluate the resulting board position if conditions (1) and (3) above are not met. At a ply of 5 or greater, the program stops the look-ahead whenever the next ply level does not offer a jump. At a ply of 11 or greater, the program will terminate the look-ahead, even if the next move is to be a jump, should one side at this time be ahead by more than two kings (to prevent the needless exploration of obviously losing or winning sequences). The program stops at a ply of 20 regardless of all conditions (since the memory space for the look-ahead moves is then exhausted) and an adjustment in score is made to allow for the pending jump. Finally, an adjustment is made in the levels of the break points between the different conditions when time is saved through rote learning (see below) and when the total number of pieces on the board falls below an arbitrary number.
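The stopping rules just listed can be collected into a single predicate. The sketch below is an illustration only: the `Situation` record, its field names, and the measurement of the lead "in kings" are assumptions, the break points are hard-coded at the values quoted in the text (3, 4, 5, 11, 20), and the score adjustment for a pending jump at ply 20 is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    next_is_jump: bool       # condition (1): the next move is a jump
    last_was_jump: bool      # condition (2): the last move was a jump
    exchange_possible: bool  # condition (3): an exchange offer is possible
    king_lead: float         # assumed measure of one side's lead, in kings


def should_stop(ply: int, s: Situation) -> bool:
    """Return True if the look-ahead should end and the board be evaluated."""
    if ply < 3:
        return False  # always look ahead the minimum distance
    if ply >= 20:
        return True   # look-ahead storage exhausted
    if ply >= 11 and s.king_lead > 2:
        return True   # obviously won or lost, even if a jump is pending
    if ply == 3:
        return not (s.next_is_jump or s.last_was_jump or s.exchange_possible)
    if ply == 4:
        return not (s.next_is_jump or s.exchange_possible)
    return not s.next_is_jump  # ply of 5 or greater


quiet = Situation(False, False, False, 0.0)
after_jump = Situation(False, True, False, 0.0)
print(should_stop(3, quiet), should_stop(3, after_jump), should_stop(4, after_jump))
```

A position reached by a jump is therefore examined one ply deeper than a quiet one before it is scored, which is the stability property the text attributes to this scheme.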
All break points are determined by single data words which can be changed at any time by manual intervention.

This tying of the ply with board conditions achieves three desired results. In the first place, it permits board evaluations to be made under conditions of relative stability for so-called dead positions, as defined by Turing (Bowden, 1953). Secondly, it causes greater surveillance of those paths which offer better opportunities for gaining or losing an advantage. Finally, since branching is usually seriously restricted by a jump situation, the total number of board positions and moves to be considered is still held down to a reasonable number and is more equitably distributed between the various possible initial moves.

As a practical matter, machine playing time usually has been limited to approximately 30 seconds per move. Elaborate table look-up procedures, fast sorting and searching procedures, and a variety of new programming tricks were developed, and full use was made of all of the resources of the IBM 704 to increase the operating speed as much as possible. One can, of course, set the playing time at any desired value by adjustments of the permitted ply; too small a ply results in a bad game and too large a ply makes the game unduly costly in terms of machine time.

Other Modes of Play

For study purposes the program was written to accommodate several variations of this basic plan. One of these permits the program to play against itself, that is, to play both sides of the game. This mode of play has been found to be especially good during the early stages of learning.

The program can also follow book games presented to it either on cards or on magnetic tape. When operating in this mode, the program decides at each point in the game on its next move in the usual way and reports this proposed move. Instead of actually making this move, the program refers to the stored record of a book game and makes the book move. The program records its evaluation of the two moves, and it also counts and reports the number of possible moves which the program rates as being better than the book move and the number it rates as being poorer. The sides are then reversed and the process is repeated. At the end of a book game a correlation coefficient is computed, relating the machine's indicated moves to those moves adjudged best by the checker masters.9

It should be noted that the emphasis throughout all of these studies has been on learning techniques. The temptation to improve the machine's game by giving it standard openings or other man-generated knowledge of playing techniques has been consistently resisted. Even when book games are played, no weight is given to the fact that the moves as listed are presumably the best possible moves under the circumstances.
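The book-game correlation coefficient is defined in the paper's footnote as C = (L - H)/(L + H), with L the total number of legal moves the machine judged poorer than the book moves and H the total number it judged better. A minimal sketch of that arithmetic follows; the function name `move_correlation` and the per-move counts are made up for illustration.

```python
from typing import Iterable, Tuple


def move_correlation(counts: Iterable[Tuple[int, int]]) -> float:
    """Compute C = (L - H) / (L + H) over one book game.

    `counts` holds one (poorer, better) pair per book move: how many legal
    alternatives the machine rated poorer, and how many better, than the
    book move actually played.
    """
    pairs = list(counts)
    L = sum(poorer for poorer, _ in pairs)
    H = sum(better for _, better in pairs)
    if L + H == 0:
        raise ValueError("no alternative moves were rated")
    return (L - H) / (L + H)


# Hypothetical tallies for three book moves.
game = [(6, 0), (4, 2), (7, 1)]
print(move_correlation(game))
```

C reaches +1 when the machine never rates any alternative above the book move and falls to -1 when it rates every alternative above it.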
For demonstration purposes, and also as a means of avoiding lost machine time while an opponent is thinking, it is sometimes convenient to play several simultaneous games against different opponents. With the program in its present form the most convenient number for this purpose has been found to be six, although eight have been played on a number of occasions.

Games may be started with any initial configuration for the board position so that the program may be tested on end games, checker puzzles, et cetera. For nonstandard starting conditions, the program lists the initial piece arrangement. From time to time, and at the end of each game, the program also tabulates various bits of statistical information which assist in the evaluation of playing performance.

9 This coefficient is defined as C = (L - H)/(L + H), where L is the total number of different legal moves which the machine judged to be poorer than the indicated book moves, and H is the total number which it judged to be better than the book moves.

Numerous other features have also been added to make the program convenient to operate (for details see Appendix A), but these have no direct bearing on the problem of learning, to which we will now turn our attention.

Rote Learning and Its Variants

Perhaps the most elementary type of learning worth discussing would be a form of rote learning in which the program simply saved all of the board positions encountered during play, together with their computed scores. Reference could then be made to this memory record and a certain amount of computing time might be saved. This can hardly be called a very advanced form of learning; nevertheless, if the program then utilizes the saved time to compute further in depth it will improve with time.

Fortunately, the ability to store board information at a ply of 0 and to look up boards at a larger ply provides the possibility of looking much farther in advance than might otherwise be possible. To understand this, consider a very simple case where the look-ahead is always terminated at a fixed ply, say 3. Assume further that the program saves only the board positions encountered during the actual play with their associated backed-up scores. Now it is this list of previous board positions that is used to look up board positions while at a ply level of 3 in the subsequent games. If a board position is found, its score has, in effect, already been backed up by three levels, and if it becomes effective in determining the move to be made, it is a 6-ply score rather than a simple 3-ply score. This new initial board position with its 6-ply score is, in turn, saved and it may be encountered in a future game and the score backed up by an additional set of three levels, et cetera. This procedure is illustrated in Fig. 3.
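How a stored score raises the effective ply can be seen in a small sketch. Everything concrete here is assumed for illustration: the toy "game" whose positions are integers with two successors each, the arbitrary `evaluate` function, and a memory keyed by position and side to move (the paper instead standardizes the stored boards). Scores are always taken from the machine's viewpoint.

```python
from typing import Callable, Dict, List, Tuple

Key = Tuple[int, bool]                 # (position, machine_to_move)
Memory = Dict[Key, Tuple[int, int]]    # key -> (backed-up score, effective ply)


def moves(n: int) -> List[int]:
    """Toy successor function: every position has exactly two replies."""
    return [2 * n + 1, 2 * n + 2]


def evaluate(n: int) -> int:
    """Arbitrary static score standing in for the scoring polynomial."""
    return (7 * n) % 11 - 5


def search(pos: int, depth: int, machine_to_move: bool,
           memory: Memory) -> Tuple[int, int]:
    """Fixed-depth minimax returning (score, effective ply of that score)."""
    if depth == 0:
        if (pos, machine_to_move) in memory:
            return memory[(pos, machine_to_move)]  # score already backed up
        return evaluate(pos), 0
    results = [search(child, depth - 1, not machine_to_move, memory)
               for child in moves(pos)]
    pick: Callable = max if machine_to_move else min
    score, ply = pick(results, key=lambda r: r[0])
    return score, ply + 1


# Positions 7..14 lie three plies below position 0. Suppose earlier play
# saved each of them with its own 3-ply backed-up score.
memory: Memory = {(p, False): search(p, 3, False, {}) for p in range(7, 15)}

print(search(0, 3, True, {}))      # plain 3-ply score, effective ply 3
print(search(0, 3, True, memory))  # same search depth, effective ply 6
```

With every terminal position found in memory, the 3-ply search from position 0 returns the same score as a full 6-ply search would, at the cost of a 3-ply search plus look-ups.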
The incorporation of this variation, together with the simpler rote-learning feature, results in a fairly powerful learning technique which has been studied in some detail.

Figure 3. Simplified representation of the rote-learning process, in which information saved from a previous game is used to increase the effective ply of the backed-up score. [Labels in the figure: "Typical board position found in memory with score from previous look-ahead search"; "Ply number 1, 2, 3".]

Several additional features had to be incorporated into the program before it was practical to embark on learning studies using this storage scheme. In the first place, it was necessary to impart a sense of direction to the program in order to force it to press on toward a win. To illustrate this, consider the situation of two kings against one king, which is a winning combination for practically all variations in board positions. In time, the program can be assumed to have stored all of these variations, each associated with a winning score. Now, if such a situation is encountered, the program will look ahead along all possible paths and each path will lead to a winning combination, in spite of the fact that only one of the possible initial moves may be along the direct path toward the win while all of the rest may be wasting time. How is the program to differentiate between these?

A good solution is to keep a record of the ply value of the different board positions at all times and to make a further choice between board positions on this basis. If ahead, the program can be arranged to push directly toward the win while, if behind, it can be arranged to adopt delaying tactics. The most recent method used is to carry the effective ply along with the score by simply decreasing the magnitude of the score a small amount each time it is backed up a ply level during the analyses. If the program is now faced with a choice of board positions whose scores differ only by the ply number, it will automatically make the most advantageous choice, choosing a low-ply alternative if winning and a high-ply alternative if losing. The significance of this concept of a direction sense should not be overlooked. Even without "learning," it is very important. Several of the early attempts at learning failed because the direction sense was not properly taken into account.

Cataloging and Culling Stored Information

Since practical considerations limit the number of board positions which can be saved, and since the time to search through those that are saved can easily become unduly long, one must devise systems (1) to catalog boards that are saved, (2) to delete redundancies, and (3) to discard board positions which are not believed to be of much value.

The most effective cataloging system found to date starts by standardizing all board positions, first by reversing the pieces and piece positions if it is a board position in which White is to move, so that all boards are reported as if it were Black's turn to move. This reduces by nearly a factor of two the number of boards which must be saved. Board positions in which all of the pieces are kings can be reflected about the diagonals with a possible fourfold reduction in the number which must be saved. A more compact board representation than the one employed during play is also used so as to minimize the storage requirements.

After the board positions are standardized, they are grouped into records on the basis of (1) the number of pieces on the board, (2) the presence or absence of a piece advantage, (3) the side possessing this advantage, (4) the presence or absence of kings on the board, (5) the side having the so-called "move," or opposition advantage, and finally (6) the first moments of the pieces about normal and diagonal axes through the board.

During play, newly acquired board positions are saved in the memory until a reasonable number have been accumulated, and they are then merged with those on the "memory tape" and a new memory tape is produced. Board positions within a record are listed in a serial fashion, being sorted with respect to the words which define them. The records are arranged on the tape in the order that they are most likely to be needed during the course of a game; board positions with 12 pieces to a side coming first, et cetera. This method of cataloging is very important because it cuts tape-searching time to a minimum.

Reference must be made, of course, to the board positions already saved, and this is done by reading the correct record into the memory and searching through it by a dichotomous search procedure.
Usually five or more records are held in memory at one time, the exact number at any time depending upon the lengths of the particular records in question. Normally, the program calls three or four new records into memory during each new move, making room for them as needed, by discarding the records which have been held the longest.

Two different procedures have been found to be of value in limiting the number of board positions that are saved; one based on the frequency of use, and the second on the ply. To keep track of the frequency of use, an age term is carried along with the score. Each new board position to be saved is arbitrarily assigned an age. When reference is made to a stored board position, either to update its score or to utilize it in the look-ahead procedure, the age recorded for this board position is divided by two. This is called refreshing. Offsetting this, each board position is automatically aged by one unit at the memory merge times (normally occurring about once every 20 moves). When the age of any one board position reaches an arbitrary maximum value this board position is expunged from the record. This is a form of forgetting. New board positions which remain unused are soon forgotten, while board positions which are used several times in succession will be refreshed to such an extent that they will be remembered even if not used thereafter for a fairly long period of time. This form of refreshing and forgetting was adopted on the basis of reflections as to the frailty of human memories. It has proven to be very effective.

In addition to the limitations imposed by forgetting, it seemed desirable to place a restriction on the maximum size of any one record. Whenever an arbitrary limit is reached, enough of the lowest-ply board positions are automatically culled from the record to bring the size well below the maximum.

Before embarking on a study of the learning capabilities of the system as just described, it was, of course, first necessary to fix the terms and coefficients in the evaluation polynomial. To do this, a number of different sets of values were tested by playing through a series of book games and computing the move correlation coefficients. These values varied from 0.2 for the poorest polynomial tested, to approximately 0.6 for the one finally adopted. The selected polynomial contained four terms (as contrasted with the use of 16 terms in later experiments). In decreasing order of importance these were: (1) piece advantage, (2) denial of occupancy, (3) mobility, and (4) a hybrid term which combined control of the center and piece advancement.

Rote-learning Tests

After a scoring polynomial was arbitrarily picked, a series of games was played, both self-play and play against many different individuals (several of these being checker masters). Many book games were also followed, some of these being end games. The program learned to play a very good opening game and to recognize most winning and losing end positions many moves in advance, although its midgame play was not greatly improved.
This program now qualifies as a rather better-than-average novice, but definitely not as an expert.

At the present time the memory tape contains something over 53,000 board positions (averaging 3.8 words each) which have been selected from a much larger number of positions by means of the culling techniques described. While this is still far from the number which would tax the listing and searching procedures used in the program, rough estimates, based on the frequency with which the saved boards are utilized during normal play (these figures being tabulated automatically), indicate that a library tape containing at least 20 times the present number of board positions would be needed to improve the midgame play significantly. At the present rate of acquisition of new positions this would require an inordinate amount of play and, consequently, of machine time.10

The general conclusions which can be drawn from these tests are that:

(1) An effective rote-learning technique must include a procedure to give the program a sense of direction, and it must contain a refined system for cataloging and storing information.

(2) Rote-learning procedures can be used effectively on machines with the data-handling capacity of the IBM 704 if the information which must be saved and searched does not occupy more than, roughly, one million words, and if not more than one hundred or so references need to be made to this information per minute. These figures are, of course, highly dependent upon the exact efficiency of cataloging which can be achieved.

(3) The game of checkers, when played with a simple scoring scheme and with rote learning only, requires more than this number of words for master caliber of play and, as a consequence, is not completely amenable to this treatment on the IBM 704.

(4) A game, such as checkers, is a suitable vehicle for use during the development of learning techniques, and it is a very satisfactory device for demonstrating machine learning procedures to the unbelieving.

Learning Procedure Involving Generalizations

An obvious way to decrease the amount of storage needed to utilize past experience is to generalize on the basis of experience and to save only the generalizations. This should, of course, be a continuous process if it is to be truly effective, and it should involve several levels of abstraction. A start has been made in this direction by having the program select a subset of possible terms for use in the evaluation polynomial and by having the program determine the sign and magnitude of the coefficients which multiply these parameters. At the present time this subset consists of 16 terms chosen from a list of 38 parameters.
The piece-advantage term needed to define the task is computed separately and, of course, is not altered by the program.

After a number of relatively unsuccessful attempts to have the program generalize while playing both sides of the game, the program was arranged to act as two different players, for convenience called Alpha and Beta. Alpha generalizes on its experience after each move by adjusting the coefficients in its evaluation polynomial and by replacing terms which appear to be unimportant by new parameters drawn from a reserve list. Beta, on the contrary, uses the same evaluation polynomial for the duration of any one game. Program Alpha is used to play against human opponents, and during self-play Alpha and Beta play each other.

10 This playing-time requirement, while large in terms of cost, would be less than the time which the checker master probably spends to acquire his proficiency.

At the end of each self-play game a determination is made of the relative playing ability of Alpha, as compared with Beta, by a neutral portion of the program. If Alpha wins, or is adjudged to be ahead when a game is otherwise terminated, the then current scoring system used by Alpha is given to Beta. If, on the other hand, Beta wins or is ahead, this fact is recorded as a black mark for Alpha. Whenever Alpha receives an arbitrary number of black marks (usually set at three) it is assumed to be on the wrong track, and a fairly drastic and arbitrary change is made in its scoring polynomial (by reducing the coefficient of the leading term to zero). This action is necessary on occasion, since the entire learning process is an attempt to find the highest point in multidimensional scoring space in the presence of many secondary maxima on which the program can become trapped. By manual intervention it is possible to return to some previous condition or make some other change if it becomes apparent that the learning process is not functioning properly. In general, however, the program seeks to extricate itself from traps and to improve more or less continuously.

The capability of the program can be tested at any time by having Alpha play one or more book games (with the learning procedure temporarily immobilized) and by correlating its play with the recommendations of the masters or, more interestingly, by pitting it against a human player.

Polynomial Modification Procedure

If Alpha is to make changes in its scoring polynomial, it must be given some trustworthy criteria for measuring performance. A logical difficulty presents itself, since the only measuring parameter available is this same scoring polynomial that the process is designed to improve. Recourse is had to the peculiar property of the look-ahead procedure, which makes it less important for the scoring polynomial to be particularly good the further ahead the process is continued.
This means that one can evaluate the relative change in the positions of two players, when this evaluation is made over a fairly large number of moves, by using a scoring system which is much too gross to be significant on a move-by-move basis. Perhaps an even better way of looking at the matter is that we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board position of the chain of moves which most probably will occur during actual play. Of course, if one could develop a perfect system of this sort it would be the equivalent of always looking ahead to the end of the game. The nearer this ideal is approached, the better would be the play. 11

11 There is a logical fallacy in this argument. The program might save only invariant terms which have nothing to do with goodness of play; for example, it might count the squares on the checkerboard. The forced inclusion of the piece-advantage term prevents this.

In order to obtain a sufficiently large span to make use of this characteristic, Alpha keeps a record of the apparent goodness of its board positions as the game progresses. This record is kept by computing the scoring polynomial for each board position encountered in actual play and by saving this polynomial in its entirety. At the same time, Alpha also computes the backed-up score for all board positions, using the look-ahead procedure described earlier. At each play by Alpha the initial board score, saved from the previous Alpha move, is compared with the backed-up score for the current position. The difference between these scores, defined as delta, is used to check the scoring polynomial. If delta is positive it is reasonable to assume that the initial board evaluation was in error and terms which contributed positively should have been given more weight, while those that contributed negatively should have been given less weight. A converse statement can be made for the case where delta is negative. Presumably, in this case, either the initial board evaluation was incorrect, or a wrong choice of moves was made, and greater weight should have been given to terms making negative contributions, with less weight to positive terms. These changes are not made directly but are brought about in an involved way which will now be described.

A record is kept of the correlation existing between the signs of the individual term contributions in the initial scoring polynomial and the sign of delta. After each play an adjustment is made in the values of the correlation coefficients, due account being taken of the number of times that each particular term has been used and has had a nonzero value. The coefficient for the polynomial term (other than the piece-advantage term) with the then largest correlation coefficient is set at a prescribed maximum value with proportionate values determined for all of the remaining coefficients. Actually, the term coefficients are fixed at integral powers of 2, this power being defined by the ratio of the correlation coefficients. More precisely, if the ratio of two correlation coefficients is equal to or larger than n but less than n + 1, where n is an integer, then the ratio of the two term coefficients is set equal to $2^n$. This procedure was adopted in order to increase the range in values of the term coefficients. Whenever a correlation-coefficient calculation leads to a negative sign, a corresponding reversal is made in the sign associated with the term itself.

Instabilities

It should be noted that the span of moves over which delta is computed consists of a remembered part and an anticipated portion. During the remembered play, use had been made of Alpha's current scoring polynomial to determine Alpha's moves but not to determine the opponent's moves,

while during the anticipation play the moves for both sides are made using Alpha's scoring polynomial. One is tempted to increase the sensitivity of delta as an indicator of change by increasing the span of the remembered portion. This has been found to be dangerous since the coefficients in the evaluation polynomial and, indeed, the terms themselves, may change between the time of the remembered evaluation and the time at which the anticipation evaluation is made. As a matter of fact, this difficulty is present even for a span of one move pair. It is necessary to recompute the scoring polynomial for a given initial board position after a move has been determined and after the indicated corrections in the scoring polynomial have been made, and to save this score for future comparisons, rather than to save the score used to determine the move. This may seem a trivial point, but its neglect in the initial stages of these experiments led to oscillations quite analogous to the instability induced in electrical circuits by long delays in a feedback loop.

As a means of stabilizing against minor variations in the delta values, an arbitrary minimum value was set, and when delta fell below this minimum for any particular move no change was made in the polynomial. This same minimum value is used to set limits for the initial board evaluation score to decide whether or not it will be assumed to be zero. This minimum is recomputed each time and, normally, has been fixed at the average value of the coefficients for the terms in the currently existing evaluation polynomial.

Still another type of instability can occur whenever a new term is introduced into the scoring polynomial. Obviously, after only a single move the correlation coefficient of this new term will have a magnitude of 1, even though it might go to 0 after the very next move. To prevent violent fluctuations due to this cause, the correlation coefficients for newly introduced terms are computed as if these terms had already been used several times and had been found to have a zero correlation coefficient.
This is done by replacing the times-used number in the calculation by an arbitrary number (usually set at 16) until the usage does, in fact, equal this number. After a term has been in use for some time, quite the opposite action is desired so that the more recent experience can outweigh earlier results. This is achieved, together with a substantial reduction in calculation time, by using powers of 2 in place of the actual times used and by limiting the maximum power that is used. To be specific, at any stage of play defined as the Nth move, corrections to the values of the correlation coefficients $C_N$ are made using 16 for N until N equals 32, whereupon 32 is used until N equals 64, et cetera, using the formula:

$$C_N = C_{N-1} - \frac{C_{N-1} \pm 1}{N}$$

and a value for N larger than 256 is never used.
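The correction rule just stated can be illustrated with a minimal sketch. The function names are hypothetical, and the sign convention is an assumption: an agreement between the sign of a term's contribution and the sign of delta is counted as +1, a disagreement as -1, so that the correction amounts to a running average taken over an effective count N.

```python
def effective_n(times_used: int) -> int:
    """Times-used number actually entered into the correction formula.

    16 is used until the usage reaches 32, then 32 until 64, and so on;
    a value larger than 256 is never used.
    """
    n = 16
    while n * 2 <= times_used and n < 256:
        n *= 2
    return n


def update_correlation(c_prev: float, times_used: int, agrees: bool) -> float:
    """One correction C_N = C_{N-1} - (C_{N-1} -/+ 1) / N.

    Assumption: `agrees` is True when the sign of the term's contribution
    matches the sign of delta, which is scored here as +1.
    """
    s = 1.0 if agrees else -1.0
    n = effective_n(times_used)
    return c_prev - (c_prev - s) / n


# A newly introduced term moves only 1/16 of the way toward +1 or -1
# on its first use, instead of jumping to a magnitude of 1.
c = 0.0
for used, agrees in enumerate([True, True, False, True], start=1):
    c = update_correlation(c, used, agrees)
```

Holding N at a power of 2 lets the division be done by a shift, which is presumably the reduction in calculation time referred to above, and capping N keeps recent experience from being swamped by older results.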

After a minimum was set for delta it seemed reasonable to attach greater weight to situations leading to large values of delta. Accordingly two additional categories are defined. If a contribution to delta is made by the first term, meaning that a change has occurred in the piece ratio, the indicated changes in the correlation coefficients are doubled, while if the value of delta is so large as to indicate that an almost sure win or lose will result, the effect on the correlation coefficients is quadrupled.

Term Replacement

Mention has been made several times of the procedure for replacing terms in the scoring polynomial. The program, as it is currently running, contains 38 different terms (in addition to the piece-advantage term), 16 of these being included in the scoring polynomial at any one time and the remaining 22 being kept in reserve. After each move a low-term tally is recorded against that active term which has the lowest correlation coefficient and, at the same time, a test is made to see if this brings its tally count up to some arbitrary limit, usually set at 8. When this limit is reached for any specific term, this term is transferred to the bottom of the reserve list, and it is replaced by a term from the head of the reserve list. This new term enters the polynomial with zero values for its correlation coefficient, times used, and low-tally count. On the average, then, an active term is replaced once each eight moves and the replaced terms are given another chance after 176 moves. As a check on the effectiveness of this procedure, the program reports on the usage which has accrued against each discarded term. Terms which are repeatedly rejected after a minimum amount of usage can be removed and replaced with completely new terms.

It might be argued that this procedure of having the program select terms for the evaluation polynomial from a supplied list is much too simple and that the program should generate the terms for itself. Unfortunately, no satisfactory scheme for doing this has yet been devised.
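The replacement cycle described in this section can be sketched as follows. The names `Term` and `record_move` are hypothetical, and "lowest correlation coefficient" is taken here to mean lowest in magnitude, which is an assumption. With 22 reserve terms and a tally limit of 8, a discarded term waits 22 x 8 = 176 moves before re-entering, in agreement with the figure quoted above.

```python
from collections import deque
from dataclasses import dataclass

TALLY_LIMIT = 8  # "usually set at 8"


@dataclass
class Term:
    name: str
    correlation: float = 0.0
    times_used: int = 0
    low_tally: int = 0


def record_move(active: list, reserve: deque) -> None:
    """Charge a low-term tally to the weakest active term; swap it out at the limit."""
    # Assumption: "lowest" is judged on the magnitude of the correlation.
    i = min(range(len(active)), key=lambda k: abs(active[k].correlation))
    active[i].low_tally += 1
    if active[i].low_tally >= TALLY_LIMIT:
        reserve.append(active[i])        # to the bottom of the reserve list
        fresh = reserve.popleft()        # from the head of the reserve list
        fresh.correlation, fresh.times_used, fresh.low_tally = 0.0, 0, 0
        active[i] = fresh


# 16 active terms and 22 in reserve, as in the program described.
active = [Term(f"t{i}") for i in range(16)]
reserve = deque(Term(f"t{i}") for i in range(16, 38))
for _ in range(8):
    record_move(active, reserve)
# After eight moves the first weakest term has been retired to the reserve list.
```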
With a man-generated list one might at least ask that the terms be members of an orthogonal set, assuming that this has some meaning as applied to the evaluation of a checker position. Apparently, no one knows enough about checkers to define such a set. The only practical solution seems to be that of including a relatively large number of possible terms in the hope that all of the contributing parameters get covered somehow, even though in an involved and redundant way. This is not an undesirable state of affairs, however, since it simulates the situation which is likely to exist when an attempt is made to apply similar learning techniques to real-life situations.

Many of the terms in the existing list are related in some vague way to the parameters used by checker experts. Some of the concepts which checker experts appear to use have eluded the writer's attempts at definition, and he has been unable to program them. Some of the terms are

quite unrelated to the usual checker lore and have been discovered more or less by accident. The second moment about the diagonal axis through the double corners is an example. Twenty-seven different simple terms are now in use, the rest being combinational terms, as will be described later.

A word might be said about these terms with respect to the exact way in which they are defined and the general procedures used for their evaluation. Each term relates to the relative standings of the two sides, with respect to the parameter in question, and it is numerically equal to the difference between the ratings for the individual sides. A reversal of the sign obviously corresponds to a change of sides. As a further means of insuring symmetry the individual ratings of the respective sides are determined at corresponding times in the play as viewed by the side in question. For example, consider a parameter which relates to the board conditions as left after one side has moved. The rating of Black for such a parameter would be made after Black had moved, and the rating of White would not be made until after White had moved. During anticipation play, these individual ratings are made after each move and saved for future reference. When an evaluation is desired the program takes the differences between the most recent ratings and those made a move earlier. In general, an attempt has been made to define all parameters so that the individual-side ratings are expressible as small positive integers.

Binary Connective Terms

In addition to the simple terms of the type just described, a number of combinational terms have been introduced. Without these terms the scoring polynomial would, of course, be linear. A number of different ways of introducing nonlinear terms have been devised but only one of these has been tested in any detail. This scheme provides terms which have some of the properties of binary logical connectives. Four such terms are formed for each pair of simple terms which are to be related.
This is done by making an arbitrary division of the range in values for each of the simple terms and assigning the binary values of 0 and 1 to these ranges. Since most of the simple terms are symmetrical about 0, this is easily done on a sign basis. The new terms are then of the form $A \cdot B$, $A \cdot \bar{B}$, $\bar{A} \cdot B$, and $\bar{A} \cdot \bar{B}$, yielding values either of 0 or 1. These terms are introduced into the scoring polynomial with adjustable coefficients and signs, and are thereafter indistinguishable from the other terms.

As it would require some 1404 such combinational terms to interrelate the 27 simple terms originally used, it was found desirable to limit the actual number of combinational terms used at any one time to a small fraction of these and to introduce new terms only as it became possible to retire older ineffectual terms. The terms actually used are given in Appendix C.
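The construction of the four connective terms, and the count of 1404, can be checked with a short sketch. The function names are hypothetical, and the exact placement of the dividing point (a strictly positive value maps to 1, zero and negative values to 0) is an assumption; the text says only that the division is made on a sign basis.

```python
from math import comb


def binarize(value: int) -> int:
    """Assumed sign-based split of a simple term: 1 if positive, else 0."""
    return 1 if value > 0 else 0


def connective_terms(a_value: int, b_value: int) -> tuple:
    """The four terms A.B, A.not-B, not-A.B, not-A.not-B for one pair of simple terms."""
    a, b = binarize(a_value), binarize(b_value)
    return (a & b, a & (1 - b), (1 - a) & b, (1 - a) & (1 - b))


# Counting check: 27 simple terms give comb(27, 2) = 351 unordered pairs,
# and four connective terms per pair gives 4 * 351 = 1404.
n_pairs = comb(27, 2)
n_terms = 4 * n_pairs
```

Exactly one of the four terms is 1 for any pair of values, so together they partition the sign combinations of the two simple terms.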

Preliminary Learning-by-generalization Tests

An idea of the learning ability of this procedure can be gained by analyzing an initial test series of 28 games 12 played with the program just described. At the start an arbitrary selection of 16 terms was chosen and all terms were assigned equal weights. During the first 14 games Alpha was assigned the White side, with Beta constrained as to its first move (two cycles of the seven different initial moves). Thereafter, Alpha was assigned Black and White alternately. During this time a total of 29 different terms was discarded and replaced, the majority of these on two different occasions.

Certain other figures obtained during these 28 games are of interest. At frequent intervals the program lists the 12 leading terms in Alpha's scoring polynomial with their correlation coefficients and a running count of the number of times these coefficients have been altered. Based on these samplings, one observes that at least 20 different terms were assigned the largest coefficient at some time or other, some of these alternating with other terms a number of times, and two even reappearing at the top of the list with their signs reversed. While these variations were more violent at the start of the series of games and decreased as time went on, their presence indicated that the learning procedure was still not completely stable. During the first seven games there were at least 14 changes in occupancy at the top of the list involving 10 different terms. Alpha won three of these games and lost four. The quality of the play was extremely poor. During the next seven games there were at least eight changes made in the top listing involving five different terms. Alpha lost the first of these games and won the next six. Quality of play improved steadily but the machine still played rather badly. During Games 15 through 21 there were eight changes in the top listing involving five terms; Alpha winning five games and losing two.
Some fairly good amateur players who played the machine during this period agreed that it was "tricky but beatable." During Games 22 through 28 there were at least four changes involving three terms. Alpha won two games and lost five. The program appeared to be approaching a quality of play which caused it to be described as "a better-than-average player." A detailed analysis of these results indicated that the learning procedure did work and that the rate of learning was surprisingly high, but that the learning was quite erratic and none too stable.

12 The games averaged 68 moves (34 to a side) of which approximately 20 caused changes to be made in the scoring polynomial.

Second Series of Tests

Some of the more obvious reasons for this erratic behavior in the first series of tests have been identified. The program was modified in several

respects to improve the situation, and additional tests were made. Four of these modifications are important enough to justify a detailed explanation.

In the first place, the program was frequently fooled by bad play on the part of its opponent. A simple solution was to change the correlation coefficients less drastically when delta was positive than when delta was negative. The procedure finally adopted for the positive delta case was to make corrections to selected terms in the polynomial only. When the scoring polynomial was positive, changes were made to coefficients associated with the negatively contributing terms, and when the polynomial was negative, changes were made to the coefficients associated with positively contributing terms. No changes were made to coefficients associated with terms which happened to be zero. For the negative delta case, changes were made to the coefficients of all contributing terms, just as before.

A second defect seemed to be connected with the too frequent introduction of new terms into the scoring polynomial and the tendency for these new terms to assume dominant positions on the basis of insufficient evidence. This was remedied by the simple expedient of decreasing the rate of introduction of new terms from one every eight moves to one every 32 moves.

The third defect had to do with the complete exclusion from consideration of many of the board positions encountered during play by reason of the minimum limit on delta. This resulted in the misassignment of credit to those board positions which permitted spectacular moves when the credit rightfully belonged to earlier board positions which had permitted the necessary ground-laying moves. Although no precise way has yet been devised to ensure the correct assignment of credit, a very simple expedient was found to be most effective in minimizing the adverse effects of earlier assignments.
This expedient was to allow the span of remembered moves, over which delta is computed, to increase until delta exceeded the arbitrary minimum value, and then to apply the corrections to the coefficients as dictated by the terms in the retained polynomial for this earlier board position. In this case, the difficulty which was mentioned in the section on instabilities in connection with an arbitrary increase in span, does not occur after each correction, since no changes are made in the coefficients of the scoring polynomial as long as delta is below the minimum value. Of course, whenever delta does exceed the minimum value the program must then recompute the initial scoring polynomial for the then current board position and so restart the procedure with a span of a single remembered move pair. This over-all procedure rectifies the defect of assigning credit to a board position that lies too far along the move chain, but it introduces the possibility of assigning credit to a board position that is not far enough along.
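The first and third modifications can be sketched together. This is an illustration under stated assumptions, not the program's actual code: the names `terms_to_correct` and `RememberedSpan` are hypothetical, delta is taken as the backed-up score minus the retained initial score, the minimum is applied to the magnitude of delta, and the caller is assumed to apply the corrections and then call `restart` with the recomputed score.

```python
def terms_to_correct(delta: float, poly_score: float, contributions: dict) -> list:
    """Names of the terms whose correlation records are corrected for this delta."""
    nonzero = {k: v for k, v in contributions.items() if v != 0}
    if delta < 0:
        return list(nonzero)                      # all contributing terms
    if poly_score > 0:
        return [k for k, v in nonzero.items() if v < 0]
    if poly_score < 0:
        return [k for k, v in nonzero.items() if v > 0]
    return []                                     # assumption: nothing selected at zero


class RememberedSpan:
    """Retains an earlier board evaluation until delta exceeds the minimum."""

    def __init__(self, min_delta: float):
        self.min_delta = min_delta
        self.score = 0.0
        self.contributions = {}
        self.span = 0

    def restart(self, score: float, contributions: dict) -> None:
        """Save the (recomputed) polynomial for the current board position."""
        self.score = score
        self.contributions = dict(contributions)
        self.span = 0

    def check(self, backed_up_score: float):
        """Return (delta, terms) if a correction is due, else None (span grows)."""
        self.span += 1
        delta = backed_up_score - self.score
        if abs(delta) < self.min_delta:
            return None
        return delta, terms_to_correct(delta, self.score, self.contributions)


span = RememberedSpan(min_delta=4)
span.restart(2, {"mobility": 3, "center": -1, "back_row": 0})
first = span.check(3)    # delta = 1, below the minimum: no correction yet
second = span.check(9)   # delta = 7: correct against the retained position
```

Because no coefficient changes while delta stays below the minimum, the retained polynomial remains consistent with the one in use, which is why the growing span does not reintroduce the instability noted earlier.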

As a partial expedient to compensate for this newly introduced danger, a change was made in the initial board evaluation. Instead of evaluating the initial board positions directly, as was done before, a standard but rudimentary tree search (terminated after the first nonjump move) was used. Errors due to impending jump situations were eliminated by this procedure, and because of the greater accuracy of the evaluation it was possible to reduce the minimum delta limit by a small amount.

Finally, to avoid the danger of having Beta adopt Alpha's polynomial as a result of a chance win on Alpha's part (or perhaps a situation in which Alpha had allowed its polynomial to degenerate after an early or midgame advantage had been gained), it was decided to require a majority of wins on Alpha's part before Beta would adopt Alpha's scoring polynomial.

With these modifications, a new series of tests was made. In order to reduce the learning time, the initial selection of terms was made on the basis of the results obtained during the earlier tests, but no attention was paid to their previously assigned weights. In contrast with the earlier erratic behavior, the revised program appeared to be extremely stable, perhaps at the expense of a somewhat lower initial learning rate. The way in which the character of the evaluation polynomial altered as learning progressed is shown in Fig. 4.

The most obvious change in behavior was in regard to the relative number of games won by Alpha and the prevalence of draws. During the first 28 games of the earlier series Alpha won 16 and lost 12. The corresponding figures for the first 28 games of the new series were 18 won by Alpha, and four lost, with six draws. In all cases the games were terminated, if not finished, in 70 moves and a judgment made in terms of the final positions.
Unfortunately, these figures are not strictly comparable because of the decreased frequency with which Beta adopted Alpha's polynomial during the second series, both by design and because a programming error immobilized the adoption procedure during part of the tests. Nevertheless, the great decrease in the number of losses and the prevalence of draws seemed to indicate that the learning process was much more stable. Some typical games from this second series are given in Appendix B.

As learning proceeds, it should become harder and harder for Alpha to improve its game, and one would expect the number of wins by Alpha to decrease with time. If secondary maxima in scoring space are encountered, one might even find situations in which Alpha wins less than half of the games. With Beta at such a maximum any minor change in Alpha's polynomial would result in a degradation of its play, and several oscillations about the maximum might occur before Alpha landed at a point which would enable it to beat Beta. Some evidence of this trend is discernible in the play, although many more games will have to be played before it can be observed with certainty.
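The Alpha and Beta bookkeeping applied at the end of each self-play game, in the form used for the first series of tests, can be sketched as follows. The class and attribute names are hypothetical. Several points are assumptions rather than statements from the text: the "leading term" is taken to be the term with the largest coefficient in magnitude, the black-mark tally is assumed to restart after the drastic change, and the dictionaries are assumed to hold only the adjustable terms (the piece-advantage term is excluded). The majority-of-wins requirement introduced for the second series is not modeled.

```python
from dataclasses import dataclass

BLACK_MARK_LIMIT = 3  # "usually set at three"


@dataclass
class SelfPlay:
    alpha: dict            # term name -> coefficient in Alpha's polynomial
    beta: dict             # term name -> coefficient in Beta's polynomial
    black_marks: int = 0

    def after_game(self, alpha_ahead: bool) -> None:
        """Apply the end-of-game rule given the neutral judgment of the game."""
        if alpha_ahead:
            # Alpha's then current scoring system is given to Beta.
            self.beta = dict(self.alpha)
            return
        self.black_marks += 1
        if self.black_marks >= BLACK_MARK_LIMIT:
            # Drastic change: reduce the coefficient of the leading term to zero.
            leading = max(self.alpha, key=lambda k: abs(self.alpha[k]))
            self.alpha[leading] = 0.0
            self.black_marks = 0   # assumption: the tally starts over


play = SelfPlay(alpha={"mobility": 4.0, "center": 2.0},
                beta={"mobility": 1.0, "center": 1.0})
play.after_game(alpha_ahead=True)     # Beta adopts Alpha's polynomial
for _ in range(3):
    play.after_game(alpha_ahead=False)
# Three black marks: Alpha's leading coefficient has been set to zero.
```

Zeroing the leading coefficient moves Alpha well away from its current point in scoring space, which is the intended escape from a secondary maximum.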

Figure 4. Second series of learning-by-generalization tests. Coefficients assigned by the program to the more significant parameters of the evaluation polynomial plotted as a function of the number of games played. Two regions of special interest might be noted: (1) the situation after 13 or 14 games, when the program found that the initial signs of many of the terms had been set incorrectly, and (2) the conditions of relative stability which are beginning to show up after 31 or 32 games.

The tentative conclusions which can be drawn from these tests are:
(1) A simple generalization scheme of the type here used can be an effective learning device for problems amenable to tree-searching procedures.
(2) The memory requirements of such schemes are quite modest and remain fixed with time.
(3) The operating times are also reasonable and remain fixed, independent of the amount of accumulated learning.
(4) Incipient forms of instability in the solution can be expected but, at least for the checker program, these can be dealt with by quite straightforward procedures.
(5) Even with the incomplete and redundant set of parameters which have been used to date, it is possible for the computer to learn to play a better-than-average game of checkers in a relatively short period of time.

As a final precautionary note, it should be stated that these experiments have not encompassed a sufficiently large series of games to demonstrate unambiguously that the learning procedure is completely stable or that it will necessarily lead to the best possible choice of parameters and coefficients.

Rote Learning vs. Generalization

Some interesting comparisons can be made between the playing style developed by the learning-by-generalization program and that developed by the earlier rote-learning procedure. The program with rote learning soon

learned to imitate master play during the opening moves. It was always quite poor during the middle game, but it easily learned how to avoid most of the obvious traps during end-game play and could usually drive on toward a win when left with a piece advantage. The program with the generalization procedure has never learned to play in a conventional manner and its openings are apt to be weak. On the other hand, it soon learned to play a good middle game, and with a piece advantage it usually polishes off its opponent in short order. Interestingly enough, after 28 games it had still not learned how to win an end game with two kings against one in a double corner.

Apparently, rote learning is of the greatest help either under conditions when the results of any specific action are long delayed or in those situations where highly specialized techniques are required. Contrasting with this, the generalization procedure is most helpful in situations in which the available permutations of conditions are large in number and when the consequences of any specific action are not long delayed.

Procedures Involving Both Forms of Learning

The next obvious step is to combine the better features of the rote-learning procedure with a generalization scheme. This must be done with some care, since it is not practical to update the previously saved information after every change in the evaluation polynomial. A compromise solution might be to save only a very limited amount of information during the early stages of learning and to increase the amount as warranted by the increasing stability of the evaluation coefficient with learning. For example, the program could be arranged to save only the piece-advantage term at the start. At some stage in the learning process the next term could be added, perhaps when no change had been made in the parameter used for this term during some fairly long period, say for three complete games. If and when the program is able to play an additional period without changes in the next parameter, this could also be added, et cetera.
Whenever a change does occur in a parameter previously assumed to be stable, the entire memory tape could be reviewed, all terms involving the changed parameter and those lower on the list could be expunged, and the program could drop back to the earlier condition with respect to its term-saving schedule.

Another solution would be to utilize the generalization scheme alone until it had become fairly stable and to introduce rote learning at this time. It is, of course, perfectly feasible to salvage much of the learning which has been accumulated by both of the programs studied to date. This could be done by appending an abridged form of the present memory tape to the generalization scheme in its present stage of learning and by proceeding from there in accordance with the first solution proposed above.