The vslam Algorithm for Navigation in Natural Environments

로봇기술및동향 The vslam Algorihm for Navigaion in Naural Environmens Evoluion Roboics, Inc. Niklas Karlsson, Luis Goncalves, Mario E. Munich, and Paolo Pirjanian Absrac This aricle describes he Visual Simulaneous Localizaion and Mapping (vslam ) algorihm, a novel algorihm for simulaneous localizaion and mapping (SLAM). The algorihm is vision- and odomery-based, and enables low-cos navigaion in cluered and populaed environmens. The algorihm creaes visual landmarks ha are highly disincive and ha can be reliably deeced, virually eliminaing he daa associaion problem presen in oher landmark schemes. No iniial map is required, and dynamic changes in he environmen, such as lighing changes, moving objecs, and/or people are gracefully handled by he algorihm. Typically, vslam recovers quickly from dramaic disurbances, such as kidnapping. Keywords: SLAM, Vision-based landmarks, SIFTfeaures, Landmark deecion, Pose esimaion, Paricle filer, Kalman filer, Mixed proposal disribuion. 1. Inroducion Simulaneous localizaion and mapping (SLAM) is one of he mos fundamenal, ye mos challenging problems in mobile roboics. To achieve full auonomy a robo mus possess he abiliy o explore is environmen wihou userinervenion, build a reliable map, and localize iself in he map. In paricular, if exernal beacons, or e.g. global posiioning sensor (GPS) daa, are unavailable, he robo mus auonomously deermine appropriae naural reference poins (or landmarks) on which o build a map. Only wihin he las 4-8 years has he research communiy made significan progress owards soluions o he SLAM problem. Such progress was enabled by he availabiliy of beer algorihms, as well as, beer sensors. In paricular, we should menion he inroducion of probabilisic roboics [1] and he usage of he SICK Laser Range Finder (LRF) and as major milesones in he developmen of soluions o he SLAM problem. Iniially, he mos promising algorihm for solving he SLAM problem was based on Expecaion Maximizaion (EM) echniques [2]. The idea behind EM algorihms is o sar wih an iniial guess of robo pose, followed by compuing he mos likely map. Nex, he esimae of where he robo really is locaed is improved, based on he compued map. Using he improved esimae of robo pose, he previously compued map is made beer, and so forh. Unforunaely, since he EM algorihm is no recursive, i is in mos pracical siuaions impossible o use in real-ime. The excellen survey [3] describes oher proposed SLAM soluions, as well as, he sae-of-ar of SLAM in year 2. The recenly mos promising SLAM algorihm was published in [4], where i was realized ha if he localizaion and mapping problem is described as a problem of pah esimaion raher han pose esimaion, hen i can be The vslam Algorihm for Navigaion in Naural Environmens 51

separaed via a clever facorizaion. The SICK LRF provides high resoluion 2D disance scans, which makes i possible o reliably deec and localize corners and oher edge feaures in he environmen. However, here are a leas hree problems in using he SICK LRF: 1) I is difficul o reliably associae deeced feaures wih previously deeced feaures; 2) he oal number of feaures o mainain in a map is large, poenially causing a prohibiive compuaional overhead; and 3) i is difficul o exrac and recognize corners and edge feaures in cluered and highly populaed environmens. The above hree problems are shared by oher radiional sensors, such as, he sonar and IR sensors. Bu hese sensors have he addiional problem of lower resoluion and accuracy. To use he sonar or IR sensors in SLAM i is impossible o rely on a single sensor reading, bu insead i is necessary o inegrae sensor daa over ime o reliably deec e.g. corners or edges. One way o address he above problems is o use a camera and compuer vision based echniques o selec appropriae reference poins. This was unil recenly prohibiive because of he price of he camera, he big compuaional requiremen, and he lack of appropriae algorihms. Today, however, cameras have become a commodiy and, hanks o Moore's law, high-performing compuers are available a low cos. Also, very powerful mehods in compuer vision [5] and probabilisic roboics now offer he necessary ools for vision-based SLAM algorihms. In he pas, he bes vision algorihms were only able o deec and rack very simple feaures (using e.g. he Harris corner deecor), or mach arbirary images wihou providing locaion daa. Tracking corners enabled he creaion of rich 3-D srucure, bu once he corners were no longer viewed, hey could never be found again. Objec recogniion based on correlaion, emplaes, eigen-images, or oher similar schemes provide image-based landmark recogniion, bu hese schemes are no robus o lighing or viewpoin changes, and hey also do no enable he esimaion of a pose wih respec o he landmark. An early aemp of using vision in navigaion of mobile robos was presened in [6]. However, his aemp addressed only localizaion, and no mapping, hence, does no qualify as a SLAM algorihm. The recen developmen of SIFT-feaures [5] has enabled wide baseline racking, hus enabling he use of virually unique naural landmarks ha can be idenified afer an arbirary robo moion. These landmarks also allow relaive robo pose esimaes o be made. Various SLAM algorihms based on SIFT feaures have been proposed by he invenor of he SIFT-feaures, David Lowe, [7,8]. In his sysems, a rinocular sensor is used o generae 3-D landmark poins from a single robo pose. Each single 3-D poin (wih associaed SIFT feaures in he hree rinocular views) is a landmark, and he landmark generaion and he mapping algorihm are ighly coupled. The vslam algorihm, presened in his paper, has a SIFT-feaure [5] based visual fron-end. Bu, in conras o previously proposed algorihm, he vslam sysem generaes, deecs, and esimaes he relaive pose o a landmark uilizing a single camera. Furhermore, he vslam algorihm uilizes he facorizaion in [4] o build and mainain he map of virually unique visual landmarks. By using a localizaion scheme wih a paricle filer [9] and an adapive mixed proposal disribuion [1,11], vslam enables navigaion wih good accuracy in a large variey of real-world environmens. The adapive mixed proposal disribuion also enables vslam o recover from kidnapping scenarios; i.e., siuaions where he robo is lifed up and moved wihou being noified. The aricle is organized as follows. Secion 2 formulaes he problem solved by vslam. We lis he sensors ha are used in he algorihm, bu we also highligh some of he challenges o solve he problem. Thereafer, Secion 3 gives an overview of he vslam sysem. The inenion is o show 52 로봇과인간제 2 권제 1 호

he basic idea of vslam, bu also o illusrae he flexibiliy of he sysem as a resul of is modulariy. In Secion 4 he Visual Fron-end is described. The procedure for creaing landmarks and compuing relaive pose esimaes is derived. Thereafer, in he shor Secion 5, he Pre-filer is presened. The Pre-filer provides a convenien way o une he vslam sysem for use in applicaions wih differen desired rade-offs beween accuracy and robusness. Secion 6 describes how he map of visual landmarks is mainained and how he robo-pose is esimaed relaive his map. In Secion 7 we show some experimenal resuls of using vslam for mapping and localizaion in various environmens. Finally, in Secion 8 we draw some conclusions and make some final remarks. 2. Problem Formulaion Consider a mobile robo ha mus localize iself in a previously unknown environmen. The objecive is o choose a sensor configuraion and an algorihm o process he sensor daa, which accuraely and robusly accomplish he localizaion in real-world environmens. Assume i is no acceped o aler he environmen by insalling beacons or oher exernal equipmen. Tha is, he design choices only apply o he robo iself, and he robo mus deermine is locaion based only on daa colleced by he seleced onboard sensors. Also, realize ha since he environmen is unknown, he robos mus also map he environmen. Now, suppose he mobile robo is equipped wih an odomery sensor providing dead reckoning daa, and a single camera providing images of he environmen. The odomery sensor may consis of sandard wheel encoders aached o he driving wheels of a differenial drive sysem. The objecive of he vslam sysem is o fuse image and odomery daa in a way ha enables robus map-building and localizaion. Being robus is key since he sensor daa acquired from he mobile robo (odomery and images) conains pleny of difficul-o-model noise. The odomery daa is incremenal, and herefore i will accumulae error over ime. Furhermore, he odomery sensor can never be perfecly calibraed. Bu, since he robo may slip or may be lifed and moved, he odomery is also exposed o discree evens of dramaic errors. The image daa is challenging o process because of he exisence of image blur, occlusions, limied image resoluion, imperfec camera calibraion, variable lighing condiions, limied processing power, ec. For example, if he robo is operaing in a highly populaed environmen, he field of view of he camera is frequenly filled by non-saionary objecs, such as moving people ha can no be used as reference poins in a map. Also, he lighing condiions may change from very dark o ligh, even sauraing he capured image. The remainder of his aricle will describe he vslam algorihm, which is a sysem ha addresses he above challenges and provides an accurae and robus way of doing localizaion and mapping in real-world environmens. 3. Overview of vslam From a sysem poin of view, vslam can be described as in [Fig. 1]. The inpus o he sysem are odomery daa and images, and he oupu is he robo pose and an absrac vslam map. Wheel Encoders Camera Odomery Images Visual Fron-end Prefiler Landmark Daabase SLAM [Fig. 1] Block diagram of vslam Robo pose & Map The vslam Algorihm for Navigaion in Naural Environmens 53

The basic componens of he vslam map are he visual landmarks ha are creaed along he pah of he robo. A visual landmark is a collecion of unique feaures exraced from an image. Each landmark is associaed wih a landmark pose, defined as he rue robo pose (x, y, and θ) when he landmark was creaed. In he firs module of vslam, he Visual Fron-end, he images are processed and compared wih previously creaed landmarks. If a landmark is recognized, an esimae is compued on where he robo is locaed relaive o where i was locaed when he landmark was creaed. Noe ha hese relaive esimaes do no give informaion on where in he global map he robo is since ha depends on where in he global map he landmark is locaed, which is unknown o he Visual Fron-end. The relaive pose esimaes are hroughou he paper called visual measuremens. Visual measuremens are described in deail in Secion 4.2. If a landmark is no recognized, he module aemps o creae a new visual landmark. The creaion of new landmarks is described in deail in Secion 4.1. If a landmark is creaed, some unique feaures (he SIFT-feaures) of he image corresponding o he landmark are saved for laer recogniion (he feaures are added o he Landmark Daabase). If a visual landmark was recognized in he Visual Fronend and a visual measuremen is compued, hen he visual measuremen is passed o he second module, he Pre-filer. The Pre-filer, assess he reliabiliy of he measuremen. If a measuremen is considered unreliable, i is rejeced and hrown away. I will hereafer no be used in any furher processing. However, if i is acceped, i is used as an inpu o he SLAM Back-end. The SLAM Back-end is a feedback sysem aking odomery daa and relaive pose measuremens as inpus, o be fused and o generae and mainain a map of many landmarks. Firs, he robo pose is esimaed based on he mos up-o-dae map and he wo inpus. Then, he map is updaed o reflec he new informaion. The SLAM Backend associaes a global landmark pose o he landmark and saves his pose in he Landmark Daabase. Thereafer, every ime he landmark pose is updaed, his pose is refined. 4. Visual Fron-end The inpus o he Visual Fron-end are images wih ime samps plus ime-samped odomery. The oupu depends on wheher - The fron-end creaed a new landmark - The fron-end compued a visual measuremen - The fron-end neiher compued a visual measuremen, nor creaed a landmark If a landmark was creaed, a landmark ID is provided o be used as a reference for fuure landmark recogniions. Noe ha no landmark pose is provided by he fron-end. The locaion of individual landmarks in a global map is esimaed in he SLAM module. If a visual measuremen was compued, he oupu is a relaive pose y = [ x, y, θ] T and a landmark ID n. The oupu indicaes where landmark n is esimaed o be locaed relaive he curren robo-pose. This relaive pose is described in he sandard, local robo coordinae sysem. The origin of his coordinae sysem is a he floor level, direcly below he drive wheels' wheelbase. The posiive x-axis exends forward in he direcion he robo is facing. The posiive y-axis exends o he lef assuming you sand sraigh up and face along he posiive x-axis. To allow he Pre-filer o assess he reliabiliy of he measuremen, he following hree saisics are also compued and aached o each measuremen: - The Roo Mean Square projecion error (ProjErrRMS) of he esimaion. For a perfecly consisen measuremen, his error should be zero. - The number of feaure poins (NPoins) ha have been recognized beween he curren image and he image 54 로봇과인간제 2 권제 1 호

corresponding o a landmark. For a reliable esimaion his number should be as large as possible. - The esimaed Slope of he robo, which under he assumpion ha he robo is operaing on a leveled floor, should be approximaely zero. Remark: If muliple landmarks are recognized in an individual image, hey may for simpliciy be reaed as separae measuremens and be reaed one afer anoher. This is how hey are reaed in his paper. However, for improved performance i is possible o process he measuremens corresponding o one image simulaneously. 4.1 Landmarks The creaion of a 3-D visual landmark is accomplished as follows: 1. Acquire 3 images while raveling approximaely 2 cm beween images. 2. Use he SIFT-based objec recogniion algorihm o find feaure correspondences. 3. Solve for pose and srucure of he hree views. 4. Sore he 3-D srucure and he associaed images and feaures. 4.1.1 Acquiring Images In mos applicaions, here is only one camera on he robo (due o cos consrains), and he camera is facing forward (he mos convenien orienaion for oher asks, such as elepresence, eleoperaion, and obsacle avoidance). This means ha consecuive views ypically consis of nearly pure ranslaion along he opical axis of he camera. This is he mos difficul case for which o aemp o do 3-D riangulaion, since here is very lile dispariy beween images. For his reason, we uilize hree views in order o consruc a 3-D landmark. Wih hree views, hree-way image correspondences can be deeced more reliably, and he 3-D geomery compued more accuraely. In ypical usage, images aken 2 cm apar while he robo is moving forward are used o build landmarks. A baseline of 2 cm allows esimaion of 3-D srucure up o 5 m away wih sufficien accuracy for subsequen pose esimaion. [Fig. 2] shows an example of acquired images for landmark creaion. 4.1.2 Finding Image Feaure Correspondence Given hree consecuive views of he environmen, he SIFT-based objec recogniion algorihm is used o find hree-way correspondences amongs he views. This is done in he following way: 1. Find he correspondences beween views 1 and 2, and beween views 1 and 3 by raining he objec recogniion algorihm on view 1 and finding all he maching feaures in view 2 and view 3, respecively. 2. Find he correspondences beween views 2 and 3 by raining he recogniion algorihm on view 2 and 1 1 1 3 3 3 [Fig. 2] Three images used o generae a 3-D landmark. The robo is facing a corner of a room where a feaure-rich image has been projeced wih an LCD projecor (o illusrae he mehod in an easily inerpreable srucure). The robo moved forward 2 cm beween image acquisiions, and he corner was 3.3 m away. The vslam Algorihm for Navigaion in Naural Environmens 55

finding all he maching feaures in view 3. 3. Find he se of consisen hree-way maches by, for each maching pair beween views 1 and 2, Checking if he feaure from view 1 exiss in he lis of correspondences of views 1 and 3. Checking if he feaure from view 2 exiss in he lis of correspondences of views 2 and 3. Checking ha he corresponding feaure in view 3 is he same in he above wo seps. A ypical 32 x 24 image has approximaely 6 feaures. A ypical wo-way image maching has on he order of 3 correspondences, and he resuling hree-way mach has on he order of feaures. [Fig. 3] shows an example, where here were 33, 285, and 331 maches beween views 1-2, 1-3, and 2-3, respecively. This resuled in 185 hree-way maches. Besides enabling he use of 3- view based 3-D srucure esimaion, he hree-way maching also helps o eliminae false correspondences. In he Fig., he plos of maches beween views 1-2, 1-3, and 2-3 all show spurious correspondences, none of which exis in he plo of hree-way maches. 2 2 4.1.3 Solving for Pose and Srucure of Three Views Given a se of hree-way feaure correspondences in hree views, we can now esimae he 3-D srucure of he scene. Since he images were acquired by a mobile robo, one could ry o use odomery informaion o define he relaive poses of he hree views, and use ha informaion o perform geomeric riangulaion of he correspondences and generae 3-D srucure. However, he odomery informaion is no accurae enough (and here may also be vibraions, slippage, and mis-calibraions), so ha poor 3-D srucure is generaed in his way. Insead, an algorihm ha esimaes boh he pose of he hree views as well as he srucure mus be used o generae accurae 3-D srucure. Our algorihm consiss of he following seps: Compue un-warped homogeneous coordinaes for he feaures. Since we ypically use a wide angle lens (>6 degrees horizonal field of view) wih significan barrel disorion, he firs sep is o compue unwarped feaure coordinaes. This is done by firs calibraing he camera wih a lens model ha includes radial and angenial disorion parameers [12]. 1 2 1 1 2 3 3 1 2 3 3 1 2 1 1 2 3 3 1 2 3 3 [Fig. 3] Resul of hree-way maching. Top Fig.s and boom lef are he wo-way correspon-dences amongs views 1-2, 1-3, and 2-3, respecively. The boom righ Fig. is he resuling hree-way maching. Compue an iniial esimae of he poses of he views and he 3-D srucure using wo-view algorihms. Firs he algorihm of Longue-Higgins [13] is applied o obain a closed form esimae of he pose beween view 1 and view 2. This esimae is furher refined using a nonlinear minimizaion algorihm o minimize he residual of he epipolar consrain [14]. During each ieraion of he non-linear minimizaion, robus oulier deecion is used o remove any false correspondences. The same wo algorihms are hen applied o view 1 and view 3. This procedure gives us he relaive poses of view 2 and view 3 wih respec o view 1, up o scale. To esimae he relaive scale of he moion beween views 1-2 and views 56 로봇과인간제 2 권제 1 호

1-3, we compue esimaes of he 3-D srucure by riangulaing views 1-2, and by riangulaing views 1-3. Then, he inverse of he median raio of corresponding riangulaed poin dephs is used as an esimae of he relaive scale of he moions. Refine he pose and srucure esimaes. Given he iniial esimaes of he moion beween views 1-2 and views 1-3, as well as an iniial esimae of he 3-D srucure from he riangulaion of views 1-2, a non-linear minimizaion algorihm similar o ha described in [14] is used o refine he moion and srucure soluion. In his algorihm, he moion beween he hree views is represened wih he minimal number of parameers, eleven. Since we use a calibraed camera, we can compue Euclidean ransformaions (rue roaions and ranslaions) and esimae meric srucure. Thus, he pose beween wo views is expressed by 6 degrees of freedom 3 ranslaional degrees of freedom, and 3 roaional degrees of freedom. However, he enire soluion is known up o one global scale parameer. Thus, we ake he pose beween views 1 and 2 o be uni norm ranslaion, resuling in 5 DOFs beween view 1 and 2, and 6 DOFs beween view 1 and 3. The algorihm opimizes he moion and he 3-D srucure so as o minimize he reprojecion error of he 3-D srucure ono he hree views. A each ieraion, robus oulier rejecion is performed by deecing hose poins for which any of he hree view reprojecions has an error 4 imes larger han he robusly esimaed sandard deviaion of all he reprojecions. A each ieraion, a sparse Jacobian of he reprojecion error wih respec o boh he 11 moion and 3*N srucure parameers (where N is he number of poins being considered) is compued in order o refine he esimaes using a gradien descen sep. The algorihm ypically converges in 5 o 1 ieraions, which ake a oal of on he order of.6 seconds of compuaion on a PIII 1GHz processor. Typically, here are 3 o 7 resuling poins wih good 3-D srucure. [Fig. 3] shows an example of he 3-D reconsrucion, where wo verical walls wih a feaure-rich image projeced on o hem wih an LCD projecor were imaged. The robo moved 2 cm forward beween image capures. Shown is he resuling 3-D srucure, along wih is uncerainy (compued by muliplying he rms reprojecion error wih he sensiiviy of he locaion of each 3-D srucure poin o he reprojecion error). Poins near he cener of expansion of he (almos) purely ranslaional moion have very large deph uncerainy. The righ plo of he Fig. shows only hose poins wih deph uncerainy less han 1 cm. Due o he fac ha he moion is almos purely ranslaional along he opical axis of he camera, he mos difficul case for riangulaion, he poins on he wo walls have deph errors of approximaely 1 cm. If wo cameras were o be used as a sereo pair, his Fig. could be improved by an order of magniude, bu in his paper we focus on mehods ha work wih a single camera. 4 4 3 3 2 1 4 4 3 3 2 1 [Fig. 4] The compued 3-D srucure, wih uncerainy, and esimaed locaions of he hree views. Unis are cenimeers. The righ Fig. shows he srucure afer poins wih deph uncerainy larger han 1 cm removed. The vslam Algorihm for Navigaion in Naural Environmens 57

4.1.4 Soring 3-D Srucure and Associaed Feaures Afer successfully esimaing 3-D srucure, i is added as a new enry ino he daabase of visual landmarks. The iems ha need o be sored in he daabase are: - The 3-D srucure. This will be used o esimae he robo pose from new viewpoins. - One of he original hree views of he landmark. This will be used o idenify landmarks from new viewpoins. - The lis of feaures in he original view of he landmark, along wih a link o he corresponding 3-D srucure poin of he landmark. Noe ha he original images mus no be sored. 4.2 Visual Measuremens Given a daabase of previously generaed 3-D visual landmarks, and given a new image acquired by he robo afer i has raveled hroughou he environmen an arbirary amoun, we can aemp o localize he robo wih respec o one or more of he landmarks according o he following procedure: 1. Mach he new image o he daabase of landmark images 2. For each mach, esimae he relaive pose o ha landmark given he 3-D srucure and he feaure correspondences beween he new view and he landmark Maching a newly acquired image o he daabase of landmark images is done using Lowe's SIFT-based objec recogniion algorihm. This resuls in very reliable maches, wih virually no false posiives. The maching is also very robus, able o recognize he correc landmark(s) over a wide variaion of poses and lighing condiions. The algorihm also provides us wih a very reliable lis of corresponding feaures in he new image and he landmark image, which is wha we need in order o esimae he relaive pose o he landmark. Since he landmark daabase associaes a 3-D poin o he feaures in he landmark image, he correspondences beween images found in he above procedure give us a se of correspondences beween he new image's feaures and 3- D landmark poins. These 2-D o 3-D correspondences are used o esimae pose in he following way: Firs, he unwarped homogeneous coordinaes of he feaures of he new image are calculaed, using he camera calibraion parameers. Then, he fairly robus POSIT mehod [15] of DeMenhon and Davis is used o generae an iniial esimae of he pose. The algorihm is called wihin a loop, where robus oulier rejecion is performed. This iniial esimae is hen used wih a non-linear minimizaion rouine, where he 6 DOFs of he pose (3 ranslaional, 3 roaional DOF) are found ha minimize he projecion error of he rigidly ransformed 3-D landmark srucure ono he corresponding poins in he new image. Typically, 1 o 4 poins of a landmark are seen in a new image and are used o esimae he new pose. The pose is esimaed wih an accuracy of approximaely 1 cm and 2~degrees, and successful esimaions occur when he new view is wihin a 1 m radius of where he landmark was originally seen and creaed. [Fig. 5] shows an example of he accuracy of localizaions. A landmark was creaed (in an albei challenging area, so ha only 3 landmark 3-D poins exis). The landmark 3-D poins had an average deph of 2 m. The robo was driven back and forh around he locaion of he landmark creaion (black square), and a he same ime i's ground ruh pose was recorded using a NAV SICK\TM laser rangefinder wih reflecive markers placed in he environmen. The Fig. shows he ground ruh posiion (solid dos) and orienaion, as well as he esimaed pose (aserisks conneced by a line o he respecive ground ruh do). Wihin a.4 m radius of he landmark creaion, pose esimaion is very frequen and accurae (less han 2 cm error). If he robo is furher away from he iniial pose, pose esimaions are less frequen, 58 로봇과인간제 2 권제 1 호

because here are less maching feaures o he landmark (from he change in viewpoin). The pose esimae also becomes less accurae. This is mainly due o he error in he deph of he 3-D landmark srucure, which cause a noiceable bias in he esimaed pose. Beyond 1 m from he landmark creaion poin, oo few feaures of he landmark are deeced, and/or he pose esimaion rouine fails o converge due o he larger viewable disorion of he landmark srucure. 2.5 2 1.5 1.5.5 1 1.5.5 1 [Fig. 5] Esimaed poses wih ground ruh. Black dos are 3- D srucure of landmark. Red dos and emanaing dashes are ground ruh poses of robo, and conneced green aseriks and dashes are he esimaed poses. Unis are meers. [Fig. 6] shows he ground ruh pah of he robo and locaions and orienaions of he landmarks generaed in a larger scale experimen where a robo was driven around a house for 32 minues. During ha ime, 82 landmarks were creaed. 2 1 1 3 4 [Fig. 6] The pah of he robo during a large-scale experimen of various loops inside a single-floor home. Red squares and blue arrows are he locaion and orienaion of he robo when landmark are generaed. Mos of hese landmarks were creaed during he firs loop around he house, and hen re-observed on subsequen loops. In he Fig., where here appears o be more han one landmark a he same locaion, i is because he robo was moving in a differen direcion (and hus viewing a differen par of he environmen, corresponding o a new landmark). There were a oal of 1127 visual measuremens made, a an average rae of one measuremen every 1.7 seconds. Applying some selecion crieria as o which visual measuremens o accep as reliable and which o rejec as unreliable (described in more deail in he nex secion), 966 of he visual measuremens (85.7% of hem) would be used by he SLAM module. For hese measuremens, he mean, 2 2 median and rms disance (disance defined as x + y ) error of he visual measuremens (compared o he ground ruh value) are 6.68, 1.19, and 13.87 cm, respecively. The mean, median, and rms values of he magniude of he orienaion error are 1.26, 1.66, and 2.31 degrees, respecively. Table 1 shows a disribuion of he accuracy of he visual measuremens. The vslam Algorihm for Navigaion in Naural Environmens 59

[Table 1] Disribuion of visual measuremen accuracy. xy err acceped rejeced Acceped/oal (cm) xy err (cm) acceped rejeced Acceped/oal <1 627 48 92.89% -1 627 48 92.89% <2 843 99 89.49% 1-2 216 51 8.9% <3 98 114 88.85% 2-3 65 15 81.25% <4 946 121 88.66% 3-4 38 7 84.44% < 964 127 88.36% 4-18 6 75.% <7 966 135 87.74% -7 2 8 2.% < 966 136 87.66% 7-1.% <1 966 14 87.34% -1 4.% < 966 142 87.18% 1-2.% > 19.% 5. Pre-Filer The Pre-filer provides a convenien way o modify he vslam sysem o accommodae for differen desired radeoffs beween accuracy and robusness. In paricular, differen applicaions may have differen crieria for wha is a reliable measuremen. The basic rule is ha ProjErrRMS and Slope should be very small and NPoins should be large for a measuremen o be considered reliable. A simple scheme is hence o define a region over hese hree variables, and o rejjec any measuremen ha falls ouside ha region. If he boundaries of he region are defined by some hreshold values, and hese are se conservaively (he region is oo small), a big porion of all measuremens are rejeced resuling in very few updaes of he map. On he oher hand, if oo many of he measuremens are acceped (he region is oo large), he error variance of he visual measuremens may be higher han is accepable. Experimenally, we have deermined ha a good compromise beween accurae acceped, ye few rejeced measuremens for mos applicaions is obained if he following crieria are used: - if ProjErrRMS>3., hen rejec he measuremen - if NPoins<1, hen rejec he measuremen - if Slope>.1, hen rejec he measuremen Noe ha a measuremen is rejeced as soon as one of he above hree inequaliies is saisfied. The concep of a Pre-filer allows he user o selec heir own accepance region. E.g., in some applicaions he environmen may be poor on exure, which is needed o creae and recognize landmarks. If, in he same applicaion, he over-all requiremen on accuracy is fairly low, he robusness of he localizaion is improved by enlarging he accepance region. More visual measuremens are hen acceped making sure ha frequen feedback is achieved in he SLAM module for map updae and localizaion. This comes wih he price of measuremens ha on average are somewha less accurae. Based on experimens, i is deermined ha he measuremens remaining afer he above filering fairly well can be described by he rue relaive pose plus a sochasic, uncorrelaed error vecor [ε x, ε x, ε Ө ] T belonging o he Gaussian disribuion of mean zero and covariance Σ vis, where Σ 81cm = 2 81cm 2.19 deg vis 2. In realiy, he error vecors do no belong o a Gaussian disribuion, bu o some heavier-ailed disribuion. Bu 6 로봇과인간제 2 권제 1 호

saisfacory resuls are obained when we use a Gaussian approximaion. 6. SLAM Back-end The SLAM Back-end mainains a map of all landmarks and esimaes he locaion of he robo wihin his map. In paricular, i esimaes he global locaion of each landmark and he robo. Noe ha he global landmark locaion is compleely unknown o he Visual Fron-end. The inpus o his module consis of - visual measuremens passed on from he Pre-filer, - dead reckoning daa from he odomery sensor, and - he previous esimae of map and robo pose (locaion and heading). The oupus consis of a refined esimae of he map and an updaed robo pose. To describe he algorihm in he SLAM Back-end, le us denoe he map by Φ. The map consiss of references o he landmarks creaed in he Visual Fron-end, each of which will be denoed φ n. The oal number of landmarks a a given ime is denoed N. This variable is iniially se o zero, bu will over ime increase as landmarks are creaed and added o he map. The robo pose is denoed s, where is a discree number corresponding o he curren ime. The pose is defined by he 3 x 1 vecor [x, y, θ ] T, where x and y is he locaion and θ is he heading, a ime. To implemen vslam, one needs o know wo sochasic equaions. The firs equaion is he moion model, which is a sae-space model s + + 1 = f ( s, u ) w (1) where u denoes he odomery colleced from ime o +1 and w is a noise process represening he error in odomery. The specific moion model depends on he robo's odomery, which on he oher hand depends on he robo's kinemaics and of he floor surface. In his paper, we use a moion model derived under he assumpion of a nonholonomic drive sysem and parameerized by wo noise parameers; however, he precise model is irrelevan for he developmen of he algorihm. Our model resuls in E( w ) T E( w w ) = = g( θ, u, σ, σ ) where σ T and σ R characerize he ranslaional and roaional odomery uncerainy. The second model is he measuremen model n y = h( s, ϕ ) + v (2) where n is he observed landmark a ime and v is a noise process represening he error in he relaive visual measuremens. If he rue robo pose is s, hen he errorfre visual measuremen is given by h(s,φ n ), E(v )= and E(v v T )=Σ vis (see also (2)). The sequence s =s 1,s 2,,s denoes he pah of he robo up o ime. While s denoes a single robo pose, s denoes a sequence of poses from ime 1 o ime. The ulimae goal is o esimae he robo pose s based on 1. an esimae of s -1, 2. measuremens u -1 and y, 3. he landmark pose φ n However, also φ n is uncerain and mus be esimaed. Therefore, le φ n, denoe he esimae of φ n a ime. The idea proposed in [4] o esimae s and Φ (raher han s and Φ) is hen adoped. In paricular, we consider he poserior disribuion p(s,φ n,y,u ). Using Bayesian calculus and sandard assumpions employed in probabilisic roboics, i is sraigh-forward o show ha p( s, Φ n, y 1, u ) = p( s p( s s 1 T 1, Φ n T, u ) 1, y 1, u 1 ) The vslam Algorihm for Navigaion in Naural Environmens 61

Furhermore, he following imporan facorizaion is used in he vslam algorihm: p( s, Φ n, y, u ) = p( s N Π= i 1 n, y, u ) p( ϕ s, n, y ) This facorizaion decomposes he problem of esimaing N+1 poses, and heir cross-correlaions, simulaneously; ino esimaing one robo pah and N landmark poses (separaely). The facorizaion is jusified by expressing he problem as a Bayesian nework. In paricular, knowing he robo pah d- separaes [16] he individual landmark esimaion problems and makes hem independen of each oher. The vslam updae now proceeds by firs updaing he robo pah, and hen he map of visual landmarks. 6.1 Simple Localizaion Similarly o in [17], vslam implemens he robo pah updae in (3) as a paricle filer. Some advanages of his choice for he robo pose updae is ha he paricle filer does no assume he pose disribuion o be neiher Gaussian, nor uni-modal. The landmark pose updaes are implemened as Kalman filers because of he simple dynamic of he landmark poses (hey are ypically consan). The basic paricle filer is implemened by firs compuing some number, e.g., of hypoheical pahs (so called paricles) based on he moion model (1), odomery daa, and he odomery noise model [18]. These iniial paricles represen a prior disribuion of he robo pah. Thereafer, so called imporance facors are compued based on he prior disribuion and he visual measuremens. By resampling, wih replacemen, from he paricles wih probabiliies equal o he imporance facors, i can be shown ha he disribuion of he resuling paricles (represening he poserior disribuion) indeed approximaes he rue disribuion p(s n,y,u ). i (3) 6.2 Robus Localizaion As recognized by many researchers, he basic paricle filer performs poorly if he moion model generaes oo few hypoheical pahs in regions where p(s n,y,u ) is large. This flaw of he paricle filer is in vslam addressed by using a mixure proposal disribuion insead of he moion model o compue he prior disribuion [1,11]. In paricular, a subse of he paricles, say 18, in he prior disribuion are generaed based on he moion model, while he remainder of hem, say 2, are generaed based on he measuremen model (2). The paricles generaed according o he basic filer are called nominal paricles, while he ones generaed based on he measuremen model are called dual paricles. The imporance facors for each nominal paricle is in principle compued based on he measuremen model, while he imporance facors for each dual paricle in principle is compued based on he moion model. However, grea care mus be exercised o ensure ha he imporance facors of he nominal and dual paricles are consisen and can be compared in he resampling process. A very useful feaure in vslam enabled by he mixure proposal disribuion is ha he raio of nominal versus dual paricles can be changed o deal wih exreme siuaion, such as mal-funcioning sensors, or kidnapping. More specifically, if a kidnapping scenario is deeced or suspeced, hen one should generae only dual paricles, and compue he imporance facors from a uniform disribuion. Noe ha simply modifying he odomery noise model o accommodae for he large odomery error ypically will no work if a paricle filer is employed since, in a normal-sized environmen, i would require a very large number of paricles o approximae a uniform disribuion. Also, if no prior paricle is in some neighborhood of he rue pose, hen all imporance facors are likely o become less han he machine precision of he compuer; i.e., all imporance facors will become zero making i impossible o make any 62 로봇과인간제 2 권제 1 호

inference from he measuremens. green pah (odomery only) is obviously incorrec, since, according o his pah, he robo is raversing hrough walls 6.3 Mapping and furniure. The red pah (he vslam correced pah), on he oher hand, is consisenly following he reference pah. Now, proceed o he landmark pose esimaion. I is The vslam pah, which uses a combinaion of visual ineresing o realize ha since each Kalman filer updae of measuremens and odomery, provides a robus and accurae a landmark pose mus obey p(φi s,n,y ), i is necessary o posiion deerminaion for he robo. associae a separae bank of Kalman filers (one filer per landmark) o each robo pah hypohesis s (each paricle). In oher words, if here are paricles and landmarks, hen he oal number of Kalman filers is imes ; i.e., 1,. On he oher hand, each Kalman filer conains only hree sae variables (he recursively esimaed landmark pose), and hree measuremen variables (he curren landmark pose esimae); and only he Kalman filers associaed o he measured landmark is updaed in any given ieraion of he vslam filer. Consequenly, he updae scheme becomes compuaionally very efficien. [Fig. 7] Example resul of SLAM using vslam. Red pah: vslam esimae of robo rajecory. Green pah: odomery esimae of robo rajecory. Blue circles: vslam landmarks creaed during operaions. 7. Experimens To illusrae he performance of vslam, we shall invesigae is localizaion capabiliy as i is inegraed on a robo operaing in a ypical home environmen. 7.1 vslam in Home Environmen (Robo follows a pre-defined pah) Firs, consider [Fig. 7], which shows he resul of vslam [Fig. 8] The curren view from he camera (lef), which is successfully mached wih he landmark image (righ). afer he robo has raveled around along a reference pah in a wo-bedroom aparmen. The reference pah, on which he [Fig. 8] shows ha also a moion blurred image can be used by vslam. Indeed, he landmark on he righ is robo drove, was unknown o he vslam algorihm. To help inerpre he resul, he layou of he aparmen is superimposed in he Fig.. This layou was neiher generaed, recognized in he blurred curren image on he lef in spie i is only parly visible in he curren view. Average localizaion error using vslam is in his nor used, by vslam. The vslam algorihm builds a map consising of example abou 1-15 cm, while he median error is 9 cm. landmarks, which are marked as circles in he Fig.. The The vslam Algorihm for Navigaion in Naural Environmens 63

7.2 vslam in Home Environmen (Robo follows a random pah) In many applicaions, i is no reasonable o assume ha a robo ravels on a reference pah. Le us herefore now invesigae vslam localizaion in a scenario where he robo ravels in an arbirary pah (unknown o vslam). The experimen ook place in a large living room of 7m x 4m on hardwood floor. As a reference, we acquired ground ruh pose daa wih a Nav SICK laser. The speed of ravel was abou 15 cm/sec. The localizaion error as a funcion of ime is shown in [Fig. 9]. As a comparison, also he pure odomery resuls are displayed. The localizaion accuracy using vslam is in his paricular example somewha lower han in he previous example. In paricular, he average error is abou 2 o 25 cm, while he median is abou 14 cm. I is also of ineres o invesigae he disribuion of visual measuremens among he landmarks. As seen in [Fig. 1], i urns ou ha a small porion of all landmarks resul in he big majoriy of all visual measuremen. There are wo reasons o his behavior. Firs of all, some landmarks were creaed lae in he run, and did no have ime o resul in many measuremens. Bu, a second imporan reason is ha landmarks differ in qualiy. Some landmarks correspond o scenes wih pleny of exure and many unique feaures. Oher landmarks correspond o scenes for which he visual fron-end barely managed o creae a landmark. Poor landmarks, are difficul o recognize and generae visual measuremens from.. 8. Conclusions and Fuure Work [Fig. 9] Posiion localizaion error using only odomery (green curve) and using vslam (red pah). [Fig. 1] Number of visual measuremens per landmark. We have described vslam, a novel vision- and odomery-based SLAM algorihm ha enables low-cos and robus navigaion in cluered and populaed environmens. The vslam algorihm does no require an iniial map, and he algorihm is ypically good a deecing and correcing for slippage and collisions. The key characerisics of vslam are 1. The low cos, which is enabled by basing he algorihm on a low-cos camera and an odomeer. 2. The visual landmarks ha are creaed and recognized in he fron-end. They are essenially unique, which virually eliminaes he daa associaion problem presen in oher landmark schemes. 3. The updae scheme for robo pose and map. I is based on a paricle filer and a Kalman filer bank and 64 로봇과인간제 2 권제 1 호

enables an efficien real-ime implemenaion. 4. The mixed proposal disribuion and he dynamic mixure raio, which dramaically improves localizaion recovery from kidnapping and oher large disurbances. While no explored in deail in his aricle, i can be shown ha vslam is highly robus o kidnapping and dynamic changes in he environmen, for example, changes in lighing, moving objecs and/or people. vslam recovers quickly from kidnapping once he map has become dense wih landmarks. Early in he mapping procedure, is i imporan o avoid kidnapping and disurbances in general. A problem wih he curren implemenaion of vslam is he landmark daa base, which in large environmens may become prohibiively large. In paricular, an individual landmark occupies anywhere beween 4 kb and kb depending on he number of visual feaures associaed wih he landmark, and he number of paricles used in he SLAM module. For mos pracical implemenaions, i is herefore essenial o inegrae a Landmark Daabase Manager wih he basic vslam sysem (see Fig. 1). Such a daabase manager ypically includes a pruning mechanism, which removes poor landmarks. Bu i also includes a segmenaion scheme ha pariions he full vslam map ino sub regions o limi he required primary memory for he map. In some applicaions, i may also be appropriae o use more han one camera (see Fig. 11). For example, one forward-poining and one backward-poining camera. Noe ha hese are independen sensors, and images acquired by each one of hem are processed separaely adding landmarks o a common daabase, bu generaing visual measuremens independenly. The sysem as described in his paper already accommodaes for he use of muliple cameras. An example of when muliple cameras migh be appropriae is when he environmen is poor on exure and where lowes possible price is no crucial. I is useful o undersand ha he use of muliple cameras does no necessarily increase he compuaional requiremens since some alernaing image acquisiion scheme can be employed o opimize he use of visual informaion. Finally, noe ha if he environmen does no conain a sufficien amoun of visual feaures, hen landmarks will no be creaed and visual measuremens will no be compued. vslam will sill produce pose esimaes, bu he esimaes will be exacly wha is obained from wheel odomery. This scenario is rarely experienced, bu is mos likely o happen in environmens ha are free from furniure, decoraions, and exure. References Wheel Encoders Camera 1 Camera 2 Visual Fron-end Visual Fron-end Prefiler Prefiler Landmark Daabase Manager Landmark Daabase SLAM Robo pose & Map [Fig. 11] Block diagram of a proposed exended vslam sysem. [1] S. Thrun, Probabilisic algorihms in roboics, Arificial Inelligence, no. 4, pp. 93 19,. [2] S. Thrun, D. Fox., and W. Burgard, A probabiliic approach o concurren mapping and localizaion for mobile robos, Machine Learning, vol. 31, pp. 29 53, 1998. [3] S. Thrun, Roboic mapping: A survey. Carnegie Mellon Universiy, Tech. Rep. CMU-CS-2-111, February 2. [4] M. Monemerlo, S. Thrun, D. Koller, and B. Wegbrei, Fasslam: A facored soluion o he simulaneous The vslam Algorihm for Navigaion in Naural Environmens 65

localizaion and mapping problem, in 2 IEEE ICRA, Workshop W4 Noes, 2. [5] D. Lowe, Disincive image feaures from scale-invarian keypoins, Inernaional Journal of Compuer Vision, vol. 6, no. 2, pp. 91 11, 4. [6] J. Wolf, W. Burgard, and H. Burkhard, Robus visionbased localizaion for mobile robos using an image rerievel sysem based on invarian feaures, in Proc. of In. Conf. on Roboics and Auomaion (ICRA), 2. [7] S. Se, D. Lowe, and J. Lile, Mobile robo localizaion and mapping wih uncerainy using scale-invarian visual landmarks, The Inernaional Journal of Roboics Research, vol. 21, no. 8, pp. 735 758, 2. [8] S. Se, D. Lowe, and J. Lile, Global localizaion using disincive visual feaures, in Proceedings of he 2 IEEE/RSJ Inernaional Conference on Inelligen Robos and Sysems, EPFL, Lausanne, Swizerland, Ocober 2. [9] A. Douce, N. de Freias, and N. Gordon, Eds., Sequenial Mone Carlo Mehods in Pracice. Springer-Verlag, New York, 1. [1] D. Fox, S. Thrun, W. Burgard, and F. Dellaer, Paricle filers for mobile robo localizaion, in Sequenial Mone Carlo Mehods in Pracice, A. Douce, N. de Freias, and N. Gordon, Eds. Springer-Verlag, New York, 1, ch. 19. [11] D. Hinkley, Boosrap Mehods and heir Applicaion. Cambridge Series in Saisical and Probabilisic Mehemaics, 1997. [12] J. Heikkil and O. Silvn, Calibraion procedure for shor focal lengh off he-shelf ccd cameras, Proc. 13h Inernaional Conference on Paern Recogniion, pp. 166 17, 1996. [13] H. C. Longue-Higgins, A compuer algorihm for reconsrucing a scene from wo projecions, Naure, vol. 293, pp. 133 135, 1981. [14] R. I. Harley and A. Zisserman, Muliple View Geomery in Compuer Vision. Cambridge Universiy Press, ISBN: 52162349,. [15] D. DeMenhon and L. S. Davis, Model-based objec pose in 25 lines of code, Inernaional Journal of Compuer Vision, vol. 15, pp. 123 141, June 1995. [16] G. R. Cowell, A. P. Dawid, S. L. Laurizen, and D. J. Spiegelhaler, Probabilisic Neworks and Exper Sysems. Springer-Verlag, New York, 1999. [17] M. Monemerlo, S. Thrun, D. Koller, and B. Wegbrei, FasSLAM: A facored soluion o he simulaneous localizaion and mapping problem, in Proceedings of he AAAI Naional Conference on Arificial Inelligence. Edmonon, Canada: AAAI, 2. [18] S. Thrun, Probabilisic algorihms in roboics, Carnegie Mellon Universiy, Tech. Rep. CMU-CS--126, April. Niklas Karlsson, Ph.D. E-mail : niklas@evoluion.com LuisGoncalves, Ph.D E-mail : luis@evoluion.com 66 로봇과인간제 2 권제 1 호

E-mail : mario@evoluion.com Mario E. Munich, Ph.D. E-mail : paolo@evoluion.com Paolo Pirjanian, Ph.D. The vslam Algorihm for Navigaion in Naural Environmens 67