Architectural Support for Efficient Large-Scale Automata Processing

Hongyuan Liu, Mohamed Ibrahim, Onur Kayiran, Sreepathi Pai, and Adwait Jog
College of William & Mary; Advanced Micro Devices, Inc.; University of Rochester
Email: {hliu08,mibrahim}@email.wm.edu, onur.kayiran@amd.com, sree@cs.rochester.edu, ajog@wm.edu

Abstract

The Automata Processor (AP) accelerates applications from domains ranging from machine learning to genomics. However, as a spatial architecture, it is unable to handle larger automata programs without repeated reconfiguration and re-execution. To achieve high throughput, this paper proposes for the first time architectural support for the AP to efficiently execute large-scale applications. We find that a large number of existing and new Non-deterministic Finite Automata (NFA) based applications have states that are never enabled but are still configured on the AP chips, leading to their underutilization. With the help of careful characterization and profiling-based mechanisms, we predict which states are never enabled and hence need not be configured on the AP. Furthermore, we develop SparseAP, a new execution mode for the AP to efficiently handle the mis-predicted NFA states. Our detailed simulations across 26 applications from various domains show that our newly proposed execution model for the AP can obtain a 2.1× geometric mean speedup (up to 47×) over the baseline AP execution.

I. INTRODUCTION

Many applications from domains such as genomics, malware detection, machine learning, and data analytics exhibit high levels of parallelism and are being accelerated through the use of spatial architectures that can exploit higher levels of parallelism than CPUs and can also significantly reduce data movement [1]-[9]. Spatial architectures usually consist of many interconnected processing elements that expose a very high degree of parallelism. Field-programmable gate arrays (FPGAs) are a classic example; the systolic-array-based Matrix Multiply Unit in Google's Tensor Processing Unit [10] is also a spatial architecture.
One of the fundamental challenges with spatial architectures is that program size is a first-order concern: there are a fixed number of states available, and a spatial program must fit completely to begin execution. Otherwise, execution may be impossible, or in the best case multiple rounds of reconfiguration and re-execution may be required, which can incur significant performance penalties [11]. On traditional von Neumann architectures, these issues can typically be handled by traditional mechanisms such as context switching and virtualization. However, the large size of the spatial program state means that these techniques do not transfer directly. Some of these issues also affect traditional architectures like Graphics Processing Units (GPUs), whose massive parallelism also means that the amount of state is often prohibitively large to support efficient multitasking [12]-[15].

In this paper, we focus on providing architectural support for executing large-scale tasks on a special class of spatial architectures, known as automata processors (APs) [16]. These architectures accelerate the processing of Non-deterministic Finite Automata (NFA), a widely used representation of Finite State Machines (FSMs). FSMs are foundational in a wide range of application domains such as DNA sequence matching, network intrusion detection, and machine learning [17]-[21]. Although many existing approaches [22]-[26] accelerate NFA processing on CPUs or GPUs, none of them completely solves the problem of data movement caused by irregular accesses due to NFA transition table lookups. In comparison, the AP executes NFAs natively and achieves significant performance speedup [27], [28] primarily because of: a) the AP's massive parallelism, where NFA states are mapped to columns in DRAM and can be activated independently and simultaneously in a given cycle; and b) the AP's in-memory processing capability, which handles NFA transitions without data movement between processor and memory.

An AP half-core (the basic processing unit of the AP) can hold up to 24K states. However, in the future, we expect that NFA-based applications are going to scale both in terms of the number of NFAs per application and the number of states in an NFA.
We expect this scaling from at least two aspects. First, in the era of big data, new applications will likely be mining even larger databases. For example, ClamAV [29], an anti-virus application, uses a variant of regular expressions to specify each virus signature in an ever-enlarging database. The number of NFA states constructed from these signature regular expressions is consequently larger, and state-of-the-art AP chips can no longer hold all the states at once. Second, a number of existing and newly proposed techniques enhance the throughput of FSM processing, but only by increasing the number of states. For example, the existing AP supports duplicating NFAs to run multiple input symbol streams in parallel [30]; the newly proposed Parallel Automata Processor [31] duplicates NFAs for parallel enumeration; and the multi-stride NFA transformation [32], [33] increases the number of transitions for processing multiple symbols in one step.

Current AP chips execute these applications with a large number of NFAs/states by making independent batches of NFAs and executing each batch on the entire input while reconfiguring the AP between batches. To address the performance inefficiencies from repeated re-executions, we propose hardware and software support for large-scale NFA-based applications that currently do not fit in the AP chips. Our mechanisms are based on our key observation that not all states of an NFA are enabled during execution, and hence need not be configured to the AP. Specifically, a large fraction of states unnecessarily take space in the AP chip but are not part of any state transitions. We refer to such never-enabled states as cold states and the remaining (enabled) states as hot states. Figure 1 quantitatively shows our observation across 26 diverse applications [7], [4] sorted in increasing order of their percentage of hot states (across all NFAs in an application). We find that on average 59% of states are cold, and the share can be up to 99% in some applications.

Fig. 1: A large portion of NFA states are cold (never-enabled) but are still configured on the AP, leading to its underutilization. (Bar chart; y-axis: percentage of states, split into hot (enabled) and cold (never-enabled); x-axis: applications.)

These observations can be explained by revisiting the way NFAs process inputs. NFA behavior is highly input dependent. A state can attempt to match a symbol of input only if it is enabled. In the most general case, a state is enabled only if at least one of its predecessor states matched a symbol of input (the exceptions being starting states, which are always enabled). A match indicates that the current input string is plausibly still a valid prefix of the regular language recognized by the NFA. States stop matching as soon as the input string is definitely not in the language. However, the AP must still process all input symbols as long as there is one state enabled (which is always true for an NFA with at least one starting state that is always enabled), thus leaving many states never enabled. Section III shows that this is indeed the case for the NFAs running on the AP.

Based on the above key insight, we first develop a software-based mechanism to predict which states are cold and hence need not be configured on the AP. Next, we propose changes in the AP hardware to efficiently execute the mis-predicted cold states. To the best of our knowledge, this is the first work that proposes architectural support for efficiently executing large-scale NFA-based applications on the AP.
In summary, this paper makes the following contributions:

- We demonstrate that a large number of NFA states are cold during execution but are still configured on the AP. This leads to its severe underutilization.
- We develop a prediction mechanism to classify the NFA states into predicted hot and predicted cold sets. We use properties of NFA execution to develop a simple and effective partitioning scheme based on a state's topological order and profiling information.
- We develop efficient hardware mechanisms to execute predicted cold states using a new sparse execution mode for the AP (called SparseAP).
- Our detailed evaluation shows that we can achieve a 2.1× geometric mean speedup (up to 47×) over the baseline AP execution across a wide range of 26 applications.

II. BACKGROUND AND TERMINOLOGY

In this section, we provide a brief background on NFAs and their processing on the AP.

A. NFA-based Pattern Matching

An NFA is represented by a 5-tuple $(Q, \Sigma, \delta, q_0, F)$, where $Q$ is a set of states, $\Sigma$ is the alphabet (set of input symbols), $\delta$ is a transition function which maps $Q \times \Sigma$ pairs to a new set of states, $q_0$ is the set of starting states, and $F$ is a set of accepting or reporting states. Because there can be more than one possible state on a transition, such an FSM is called non-deterministic. The NFAs used by APs are homogeneous. These NFAs can be visualized as a directed graph where each node represents a state and each edge represents a state transition. Each state in the NFA has a symbol-set that represents what symbols can be accepted by this state. Each state has one or multiple successors connected by directed edges.

Fig. 2: A homogeneous NFA that accepts the regular expression a((bc)|(cd)+)f: the double circle represents a starting state and the hexagon represents a reporting state. (Diagram: states S1 to S6.)

In each step, the NFA has a number of enabled states. The starting states are enabled prior to the execution. The matching process is driven by a stream of input symbols. Each cycle, an enabled state compares the input symbol with its symbol-set for matching; when the symbol matches, the state is activated, and all its successor states are enabled in the next cycle.
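The enable/activate cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the AP's implementation: the `State` class, the `run_nfa` function, and the `FIG2` table are hypothetical names, the state table is an assumed reading of the Fig. 2 NFA for a((bc)|(cd)+)f, and starting states are assumed to be enabled only before the first symbol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    symbols: frozenset        # symbol-set: input symbols this state accepts
    successors: tuple = ()    # names of successor states
    start: bool = False       # enabled before the first input symbol
    reporting: bool = False   # generates a report when activated


def run_nfa(states, stream):
    """Simulate a homogeneous NFA; return (position, state name) reports."""
    enabled = {name for name, st in states.items() if st.start}
    reports = []
    for pos, symbol in enumerate(stream):
        # An enabled state is activated when the symbol is in its symbol-set.
        activated = {n for n in enabled if symbol in states[n].symbols}
        enabled = set()
        for name in sorted(activated):
            if states[name].reporting:
                reports.append((pos, name))
            # Successors of activated states are enabled for the next cycle.
            enabled.update(states[name].successors)
    return reports


# Assumed reading of the Fig. 2 NFA (state names and edges are illustrative).
FIG2 = {
    "S1": State(frozenset("a"), ("S2", "S4"), start=True),
    "S2": State(frozenset("b"), ("S3",)),
    "S3": State(frozenset("c"), ("S6",)),
    "S4": State(frozenset("c"), ("S5",)),
    "S5": State(frozenset("d"), ("S4", "S6")),
    "S6": State(frozenset("f"), reporting=True),
}

print(run_nfa(FIG2, "abcf"))  # [(3, 'S6')]
```

On the input abcf, the only report comes from S6 at position 3. States S4 and S5 are enabled or never reached but never contribute to the match, which is the kind of behavior the hot/cold distinction captures.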
When a reporting state is activated, it generates a report showing that a relevant pattern has been observed in the input symbol stream. Figure 2 shows the NFA of the regular expression a((bc)|(cd)+)f. At first, the starting state S1 is enabled. Let abcf be the input symbol stream. The symbol a activates state S1, resulting in the successors of S1 (i.e., S2 and S4) being enabled in the next cycle. The symbol b activates state S2 (S4 is not activated since it does not accept symbol b), and then the successor of S2 (i.e., S3) is enabled. The process repeats until all input symbols are consumed. In this case, since the reporting state S6 is activated by input symbol f, a report is generated indicating a successful match.

Footnote (on homogeneous NFAs): In homogeneous NFAs [6], [5], [6], all incoming transitions to any given state must accept the same set of input symbols (symbol-set). In the rest of this paper, we treat homogeneous NFA as synonymous with NFA, because they have the same computational ability and time complexity.

B. Baseline Automata Processor (AP)

Figure 3 shows a schematic of the considered baseline AP chip. The AP is a DRAM-based spatial architecture in which each state of an NFA is stored in a memory column of the DRAM, namely a state transition element (STE). A bit in the column represents whether the STE can accept the corresponding input symbol represented by each row. The maximum size of the alphabet is 256, as this is the width of the address decoder in the current AP architecture. Therefore, there are 256 rows in total. An AP chip consists of two half-cores. State transitions cannot cross half-cores due to the limitation of the interconnect. The state transitions are compiled to the reconfigurable interconnecting network, namely the routing matrix.

Fig. 3: The first execution cycle of an AP configured with the NFA shown in Figure 2. S1 is enabled when input symbol a arrives, which activates S1 and enables S2 and S4 in the next cycle. Downward arrows represent the enable signal being fed to the routing matrix in the current cycle. Upward arrows enable successor states for the next cycle. The physical connections between STEs and the routing matrix are bi-directional, which are represented by the dashed arrows. (Schematic labels: input symbol, 8-256 decoder, 97th row, state bits, STE columns S1 to S6, state vector, routing matrix.)

The entire input stream is processed sequentially at the rate of one symbol per cycle. Each cycle, one input symbol is fed into the address decoder, which selects a whole row (out of 256) of the DRAM (the orange shaded part in Figure 3). Each STE column has a bit that represents whether the STE is enabled or not, namely the state bit. The state bits for all STEs are combined as the state vector. This information is available from the previous cycle. An AND operation is performed between the selected row (e.g., the shaded part) and the state vector, resulting in a vector that determines the activated states. This activation information is sent to the routing matrix, which updates the state vector with the enabled states for processing the next symbol. This process is repeated until the entire input symbol stream is processed.

To understand the working of the AP, we illustrate the execution of the previously considered NFA (Figure 2) via Figure 3. We previously observed in Figure 2 that S1 accepts symbol a. Accordingly, the bit stored in the 97th row (corresponding to the ASCII code of a) and the column of the STE that stores S1 is set to 1 and the others remain 0. The state bit of S1 is 1 and a is in the symbol-set of S1; therefore, S1 is activated and it broadcasts the enable signals to the successor states (S2, S4) via the routing matrix (upward arrows in Figure 3).

III. MOTIVATION AND ANALYSIS

In this section, we analyze why a high percentage of states are cold, which states are more likely to be cold, and how avoiding these states from being configured to the AP can improve performance.

A. Topological Order and Normalized Depth

In general, it is hard to predict which states will be enabled in NFAs [7]. Clearly, all starting states will be enabled at least once and this does not depend on the input. The states that are further away from the starting state, however, depend on the input. Each subsequent state transition in a homogeneous NFA must match a symbol of input (homogeneous NFAs do not have ε-transitions [8]). Intuitively, a state that is further away from the starting state is less likely to be enabled, since each additional state on the path to it increases the chances of a mismatch. To verify if this intuition holds on NFAs from real-world applications executing on the AP, we study whether states are hot or cold with respect to their depths in the NFAs.

For simplicity of exposition, we first consider only NFAs that are also directed acyclic graphs (DAGs). In this case, the depth of a state is simply its topological order (i.e., the maximum number of steps from the starting state to itself in the matching process). Thus, the matching process goes from states with lower topological order to states with higher topological order but cannot go back, as DAGs do not have cycles. Such an NFA can be viewed as a graph with layers, where all starting states are in the first layer (i.e., their topological order is one), states in the second layer (i.e., states with topological order of two) are reachable from the first layer, states in the third layer are reachable from the first and second layers, and so on.

However, NFAs are not always DAGs, because they can contain back edges (i.e., from a later layer to an earlier layer) and cycles. For example, the NFA in Figure 4 contains a cycle between states S4 and S5. Topological sort cannot be performed on such graphs. Therefore, we pre-process an NFA by identifying all its strongly connected components (SCCs) [9].
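A minimal sketch of this pre-processing, under stated assumptions: SCCs are found with Tarjan's algorithm (the paper does not say which SCC algorithm it uses), each SCC is collapsed to a single node, and every state receives the longest-path layer of its SCC as its topological order. The function names `scc_labels` and `topo_order` and the example graph `NFA` (an assumed reading of the example NFA with a cycle between S4 and S5) are hypothetical.

```python
def scc_labels(succ):
    """Tarjan's algorithm: map each state to a representative of its SCC."""
    index, low, comp = {}, {}, {}
    stack, on_stack = [], set()
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp[w] = v
                if w == v:
                    break

    for v in succ:
        if v not in index:
            visit(v)
    return comp


def topo_order(succ):
    """Topological order (layer) of each state after collapsing SCCs."""
    comp = scc_labels(succ)
    preds = {c: set() for c in set(comp.values())}
    for u in succ:
        for v in succ[u]:
            if comp[u] != comp[v]:      # keep only edges between SCCs
                preds[comp[v]].add(comp[u])
    memo = {}

    def layer(c):
        # Longest path from a source; sources (no predecessors) get 1.
        if c not in memo:
            memo[c] = 1 + max((layer(p) for p in preds[c]), default=0)
        return memo[c]

    return {s: layer(comp[s]) for s in succ}


# Assumed example graph: S4 and S5 form a cycle.
NFA = {"S1": ["S2", "S4"], "S2": ["S3"], "S3": ["S6"],
       "S4": ["S5"], "S5": ["S4", "S6"], "S6": []}

order = topo_order(NFA)
max_order = max(order.values())
depth = {s: order[s] / max_order for s in NFA}   # depth relative to the deepest layer
print(order)
print(depth)
```

The recursive form is chosen for brevity; for NFAs with long chains of states an iterative SCC routine would avoid Python's recursion limit.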
Each state s is marked with a connected component number SCC(s), such that the states belonging to the same SCC are marked with the same number. We construct a graph G' from the directed graph G (i.e., the NFA) by treating each SCC in G as a single node in G' (e.g., in Figure 4, the SCC that includes states S4 and S5 is considered as a single node in G'). For each edge (u, v) in G, an edge (SCC(u), SCC(v)) is added in G' if nodes u and v are in different SCCs. The resulting G' is a DAG on which we can run topological sort. Figure 4 shows the results of identifying SCCs and topological sort. The topological order of each state is indicated as a number to the right of the state. Since S4 and S5 belong to the same SCC, they are assigned the same topological order.

Fig. 4: Illustration of topological ordering and normalized depth. (Diagram: the example NFA before and after SCC identification; each state is annotated with its topological order and its normalized depth as a fraction of 4; S4 and S5 form one SCC.)

The absolute topological order or depth of a state is uninformative, as different NFAs can have different numbers of layers, even within the same application. Therefore, we normalize the depth of a state to the maximum depth in the NFA it belongs to, resulting in the normalized depth. For example, in Figure 4, because the maximum topological order is 4 (S6), the normalized depth of each state s is topoorder(s)/4 (e.g., for S4 and S5, it is 2/4 or 0.5), where topoorder is a function that returns the topological order of a state. A normalized depth closer to 1 indicates the state is at the bottom of the NFA (or relatively deep), while a value closer to 0 indicates the state is closer to the top (or relatively shallow).

B. Analysis of Normalized Depth and Enabled NFA States

Figure 5(a) shows the normalized depth distribution of enabled (hot) states for our evaluated applications. Each application is comprised of many NFAs, each representing a different pattern. We find that for the majority of applications, the hot states have low normalized depth (i.e., they are closer to the starting state of the NFAs). Furthermore, for the same set of applications, Figure 5(b) shows the normalized depth distribution of cold (never-enabled) states. We observe that the cold states in the majority of the applications have high normalized depth (i.e., they are in deeper regions of the NFAs). To confirm this conclusion further, we also find that there is a significant negative correlation (average correlation coefficient is 0.8) between normalized depth and percentage of hot states for all applications except one.

Fig. 5: Distribution of normalized depth for NFA states: (a) hot (enabled) states; (b) cold (never-enabled) states. For presentation purposes only, normalized depth is classified as: i) shallow ([0, 0.3)), ii) medium ([0.3, 0.6)), and iii) deep ([0.6, 1]). (Bar charts; y-axis: percentage of states; x-axis: applications.)

We conclude that whether a state is hot or cold is highly correlated with its normalized depth. Overall, shallow states are more likely to be hot while deep states are more likely to be cold.

C. Analysis of Performance Benefits

We analyze the ideal performance benefits when we completely eliminate the cold states from being configured on the AP. We show the potential benefits using a performance model assuming oracular knowledge of which states are cold and not configured on the AP.

Performance Model. Consider the case of the baseline AP execution, where the application has $S$ states (across all NFAs) and the number of states the AP half-core can hold (capacity) is $C_{AP}$. Without loss of generality, we only discuss the case of one AP half-core. If the number of states ($S$) is larger than the size of the AP ($C_{AP}$), it is not possible to configure the entire application at once to the AP, and the AP must be configured multiple times. Each configuration places a set of NFAs that can collectively fit in the AP. Suppose the size of each NFA in the application is less than the size of the AP; then the number of configurations to the AP would be $N_{config} = \lceil S / C_{AP} \rceil$, under the assumption that individual NFAs can be split at state granularity. In the current AP architecture, batches (partitions) usually contain whole NFAs, so the number of configurations may be even higher. To maintain semantics, each configuration batch must see the same input stream. The matching process finishes after all batches of NFAs are executed on the same input stream. Thus, the total number of cycles spent on the same input stream is $N_{config} \times n$, where $n$ is the length of the input stream and $N_{config}$ is the number of batches.

Under a perfect scenario where we can identify cold states ($S_{cold}$) with 100% accuracy, we can reduce $N_{config}$ by not configuring the cold states to the AP. We define the resource saving $p = S_{cold} / S$. Therefore, the speedup over the baseline case is $\lceil S / C_{AP} \rceil \,/\, \lceil (1-p)\,S / C_{AP} \rceil$. If the number of states is sufficiently large, the speedup we can get is proportional to $1/(1-p)$. Thus, the larger the proportion of cold states that can be correctly identified and eliminated, the more speedup we can have over the baseline execution scenario.

Illustrative Example.
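As a small numeric companion to the performance model, the sketch below evaluates the number of configurations, the total cycles, and the ideal speedup. All numbers are hypothetical inputs chosen for illustration (a capacity of 24,576 states, 100,000 states, 59% of them cold); the function names are likewise assumptions, not names from the paper.

```python
def num_configs(num_states, capacity):
    """N_config = ceil(S / C_AP), computed with integer arithmetic."""
    return -(-num_states // capacity)


def total_cycles(num_states, capacity, input_len):
    """Every batch re-reads the whole input stream: N_config * n cycles."""
    return num_configs(num_states, capacity) * input_len


def ideal_speedup(num_states, num_cold, capacity):
    """Baseline configurations divided by configurations for hot states only.

    Assumes at least one hot state (starting states are always enabled).
    """
    hot = num_states - num_cold
    return num_configs(num_states, capacity) / num_configs(hot, capacity)


CAPACITY = 24_576        # hypothetical half-core capacity in states
S, S_COLD = 100_000, 59_000   # hypothetical application, p = 0.59

print(num_configs(S, CAPACITY))             # 5 batches in the baseline
print(num_configs(S - S_COLD, CAPACITY))    # 2 batches with hot states only
print(ideal_speedup(S, S_COLD, CAPACITY))   # 2.5
print(1 / (1 - S_COLD / S))                 # large-S limit 1/(1-p), about 2.44
```

For small applications the ceiling terms dominate, so the speedup can differ noticeably from 1/(1-p); the two converge as the number of states grows relative to the capacity.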
To illustrate the benefits of configuring the AP with only hot states, Figure 6 shows two scenarios: a) the baseline AP execution, and b) the AP that only executes hot states. The execution in both cases considers the same application (A). In the baseline scenario, if the number of total states is more than the AP capacity, the execution will need to be done in batches as discussed before. In this example, the compiler partitions the application into two batches, where each batch can individually fit in the AP (B). Hence, the same input stream is executed twice in a sequential manner (D). However, with the oracular knowledge of cold states, the compiler can generate a perfect partition of the application with only the hot states (C). If this perfect partition fits in the AP, it can execute on it by consuming the same input stream only once (E), resulting in significant savings in the execution cycles.

Fig. 6: An illustrative figure showing that by not configuring cold states on the AP, all the hot states can fit onto an AP at the same time, reducing the number of re-executions over the input and hence saving time. (Diagram: baseline batches versus perfect partitioning, at compile time and at runtime, with the cycles saved.)

In summary, significant speedup can be achieved if cold states are not configured to the AP. In the next section, we propose a simple and effective profiling-based mechanism to identify such states in realistic scenarios and then leverage the profiling information to efficiently partition them from the NFAs.

IV. DESIGN AND IMPLEMENTATION OF NFA PARTITIONING

Any realistic implementation that eliminates cold states from NFAs (i.e., partitions NFAs into cold and hot states, and only configures the AP with hot states) has to deal with at least three challenges. First, although it is not possible to predict cold states with 100% accuracy in general, we need to develop low-overhead techniques to improve the accuracy of prediction as much as possible. Second, in the case of mis-prediction, some transitions may require states that are not configured on the AP. To this end, we need a mechanism working as a safety net to handle a transition from a state on the AP to a state that is not on the AP. Third, to minimize the cost of such mis-predictions, transitions should be unidirectional to avoid re-executions of inputs on the AP.

Our proposed partitioning scheme systematically addresses these challenges. First, we use a profiling-based scheme to identify the topological layer that acts as a partition layer for each NFA in the application. Second, our proposed scheme handles transitions out of the AP by adding intermediate reporting states that piggyback on existing AP reporting hardware. Finally, to ensure unidirectional transitions, we partition the NFA at a specific topological order. Since the matching always proceeds from lower to higher topological order, edges that cross partitions go only in one direction.

A. Profiling-based Hot/Cold State Prediction

We use a small portion of input for each application as profiling input. Basically, at compile time, we run the profiling input on the NFAs of the application and determine whether a state is hot or cold. We assume that this profiling information holds true during the actual execution and hence are able to predict which states will be hot or cold. In the following parts of this sub-section, we evaluate the effectiveness of our profiling-based prediction.

Profiling and Testing Inputs. Each application that we evaluate has a 1MB input. We divide this 1MB input into two equal parts of 512KB.
The first 512KB of input is used for creating different sizes of profiling inputs and the last 512KB is used as the testing input. We create different sizes of profiling inputs by using the first 0.2%, 2%, 20%, and 100% of the symbols of the first 512KB portion, which is essentially 0.1%, 1%, 10%, and 50% of the entire input.

Methodology for Evaluating the Effectiveness of Profiling. In our evaluation, we treat hot as positive (P) and cold as negative (N). Therefore, true positives (TP) are states that are hot both under the profiling input and the testing input. Similarly, false positives (FP) are states that are hot under the profiling input but actually cold under the testing input. True negatives (TN) and false negatives (FN) are defined similarly. We define: a) accuracy $= \frac{TP+TN}{P+N}$, which measures how good the profiling-based prediction is overall; b) recall $= \frac{TP}{TP+FN}$, which measures how complete our prediction is in terms of predicting hot states; and c) precision $= \frac{TP}{TP+FP}$, which measures how well the prediction could realize the resource saving scope ($p$).

TABLE I: The effectiveness of profile-based prediction

Percentage of the entire input | 0.1% | 1%  | 10% | 50%
Accuracy                       | 87%  | 90% | 9%  | 97%
Recall                         | 64%  | 76% | 87% | 97%
Precision                      | 94%  | 9%  | 90% | 9%

Effectiveness of Profiling. Table I shows the average numbers for accuracy, recall, and precision when we use different sizes of profiling inputs. We evaluate all applications except two. Specifically, using only a 2% prefix of the first 512KB (i.e., 1% of the entire input) can achieve 76% recall, which means 76% of hot states under the testing input are also hot with the small profiling input. The results are consistent across 24 applications (recall varies from 49% to 100%). In addition, the prediction also has good results in terms of accuracy and precision. To conclude, only a small profiling input is needed to identify most of the hot states during the actual execution. Therefore, we use 0.1% and 1% of the entire input for profiling and the remainder for the actual evaluation (Section VII).

B. Where to Partition?

In the current AP architecture, the application is split at NFA granularity into batches. In contrast, we partition the NFAs at topological-order granularity.
There are two reasons that we use topological order as our partition granularity. First, our previous analysis (Section III-B) shows there is a correlation between normalized depth and percentage of hot states. Second, partitioning at topological-order granularity can guarantee unidirectional transitions between predicted cold and hot states. In this subsection, we show how we obtain the partition layer $k_U$ for each NFA $U$ of the application. We will show how to partition each NFA at topological-order granularity in Section IV-C.

Footnote: For the two applications excluded from the profiling evaluation, we use the entire input for the actual execution because their starting states are only enabled at position 0 (start-of-data in ANML configuration).

Choosing the Partition Layer. At compile time, we functionally simulate all NFAs of the application using the profiling input and predict whether a state is hot or cold. After simulation, for each NFA $U$, we set $k_U = \max\{topoorder(s)\}$, where $s$ is a hot state in NFA $U$ under the profiling input. We define the predicted hot set $= \{s \mid s \in U \wedge topoorder(s) \le k_U, \forall U\}$. Accordingly, the predicted cold set $= \{s \mid s \in U \wedge topoorder(s) > k_U, \forall U\}$. We divide the predicted hot set at NFA level into batches that can fit in the AP and configure each batch sequentially.

Optimization. As an optimization, to make each batch fill the AP completely, we assign additional states to the predicted hot set from the predicted cold set. This is achieved by incrementing $k_U$, which adds the states of the subsequent partition layers for each NFA $U$. This process terminates when the capacity of the AP is met for each batch.

C. How to Partition?

In this sub-section, we demonstrate how to partition an NFA into two parts at a given partition layer $k$, calculated as described in Section IV-B, and how to handle state transitions when the partitioning is imperfect. For brevity, we describe our partitioning scheme for a single NFA, which then can be separately applied to each NFA in the application. Figure 7 illustrates NFA partitioning: using the partition layer $k$, we cut the edges that connect states with topological order $\le k$ to states with topological order $> k$ (indicated as dashed lines in Figure 7). However, the prediction may not be perfect: a state in the predicted cold set could end up being enabled during matching. Since only states in the predicted hot set are present on the AP, the matching process must transition out of the AP.

Fig. 7: Partitioning an NFA by the partition layer. (Diagram labels: predicted hot set, predicted cold set, intermediate reporting states P1 to P4, translation table.)

To handle such cases, for each edge $(u, v)$ we cut in the original NFA, we introduce an intermediate reporting state $v'$ and an edge $(u, v')$. The state $v'$ matches exactly the same input symbols (symbol-set) as $v$ but is also a reporting state.
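The partitioning steps above can be sketched as follows. This is an illustrative sketch, not the authors' compiler pass: `partition_nfa`, its arguments, and the example NFA, topological orders, and profiled hot set are all hypothetical. It picks the partition layer from the profiled hot states, splits the states by topological order, and adds one intermediate reporting state per cut edge together with a translation-table entry.

```python
def partition_nfa(succ, symbols, topo, profiled_hot):
    """Split one NFA at the partition layer k chosen from profiling.

    succ: state -> list of successor states
    symbols: state -> symbol-set
    topo: state -> topological order (SCC members share one order)
    profiled_hot: non-empty set of states enabled under the profiling input
    """
    k = max(topo[s] for s in profiled_hot)           # partition layer
    hot = {s for s in succ if topo[s] <= k}          # predicted hot set
    cold = set(succ) - hot                           # predicted cold set

    hot_succ = {s: [v for v in succ[s] if v in hot] for s in hot}
    hot_symbols = {s: symbols[s] for s in hot}
    translation = {}   # intermediate reporting state -> cold state to enable

    for u in sorted(hot):
        for v in succ[u]:
            if v in cold:                            # edge (u, v) is cut
                p = f"P{len(translation) + 1}"       # intermediate state v'
                translation[p] = v
                hot_symbols[p] = symbols[v]          # same symbol-set as v
                hot_succ[p] = []                     # reporting only
                hot_succ[u].append(p)
    return k, hot_succ, hot_symbols, translation, cold


# Hypothetical example NFA with precomputed topological orders.
SUCC = {"S1": ["S2", "S4"], "S2": ["S3"], "S3": ["S6"],
        "S4": ["S5"], "S5": ["S4", "S6"], "S6": []}
SYMBOLS = {"S1": {"a"}, "S2": {"b"}, "S3": {"c"},
           "S4": {"c"}, "S5": {"d"}, "S6": {"f"}}
TOPO = {"S1": 1, "S2": 2, "S3": 3, "S4": 2, "S5": 2, "S6": 4}

k, hot_succ, hot_symbols, translation, cold = partition_nfa(
    SUCC, SYMBOLS, TOPO, profiled_hot={"S1", "S2", "S4"})
print(k, sorted(cold), translation)
```

Because every cut edge goes from order at most k to order greater than k, no edge in the example leads from the cold set back into the hot set, which is the unidirectional property the scheme relies on. S5 is placed in the hot set even though it was not hot under profiling, since it shares a topological order with S4.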
During execution, the AP contains these intermediate reporting states along with the predicted hot set. Therefore, when the matching process tries to enable a state that is not on the AP (i.e., in the predicted cold set), it activates the corresponding intermediate reporting state instead. Consequently, an intermediate report is generated that notifies a handler (Section V). The handler will enable the corresponding states in the predicted cold set to continue the matching process. Since we use topological order to partition, after the matching process continues, it will never go back to the predicted hot set. In Figure 7, the intermediate reporting states are P1 through P4. When activated, these states enable their corresponding states in the predicted cold set, as indicated in the translation table (Figure 7).

Fig. 8: Constrained states are cold states but configured on the AP due to the constraints in our topological-order-based partitioning scheme. Consequently, some AP resources are underutilized with a few applications. (Bar chart; y-axis: percentage of states, split into hot, constrained, and cold states; x-axis: applications.)

D. Discussion

The use of SCC and topological-order-based partitioning imposes constraints that lead to more states than necessary being added to the predicted hot set. Specifically, (1) even if only one state in an SCC is hot, the whole SCC must be included in the predicted hot set, and (2) a cold state with topological order less than the partition layer k is still included in the predicted hot set. This might reduce the AP resource savings. To study the extent of this underutilization, Figure 8 shows that for all the 26 evaluated applications, our topological-order-based perfect partitioning constrains only 4% more states on average to the predicted hot set (which in reality are not going to be enabled), compared with perfect partitioning that can cut NFAs at arbitrary edges. Two exceptions are LV and one other application, whose large SCCs prevent effective partitions. In summary, we still have significant opportunity for resource savings if we can accurately identify the partition layer for each NFA.

V. HARDWARE SUPPORT FOR INTERMEDIATE REPORT HANDLING AND PARTITIONED NFA PROCESSING

In this section, we discuss how to efficiently handle the intermediate reports generated from the execution of the predicted hot set. To this end, we propose to: a) enable the states that an intermediate reporting state directs to, and b) continue the matching process from the cycle (i.e., the input position) where the intermediate report was generated. Although both steps can be performed on a CPU, doing so incurs a significant performance slowdown (Section VII); therefore, we propose a new execution mode for the AP.

A. Analysis of New Modes for AP

In order to support the aforementioned steps, we propose an augmented AP which supports two modes: BaseAP mode and SparseAP (SpAP) mode. The BaseAP mode execution is similar to the baseline AP execution; however, the AP in this mode is configured with only the predicted hot set. Once the execution of BaseAP mode finishes, the generated intermediate reports are handled in the SpAP mode. In the SpAP mode, the AP is configured with the predicted cold set. The AP in this mode not only consumes input symbols but is also driven by the intermediate reports. In this context, we develop two major operations for the SpAP mode: enable and jump. The enable operation allows each intermediate report to enable the appropriate state in the predicted cold set. The jump operation skips over the input symbols that are not necessary for handling the intermediate reports. Since no back edge exists from predicted cold states to predicted hot states (discussed in Section IV), no back-and-forth switching between BaseAP and SpAP modes is required.

Fig. 9: Illustration of performance benefits under realistic partitioning: because of the jump operation, only a portion of input symbols are executed in the SpAP mode execution. (Diagram: perfect versus realistic partitioning over time, showing the predicted hot set in BaseAP mode, the remaining states in SpAP mode, the intermediate reports, the jumps, and the cycles saved in each case.)

Each intermediate report in the list of intermediate reports (L) is represented by a tuple of input position and state ID, (c, sid), denoting that the intermediate report is generated at input position c (i.e., cycle c in the BaseAP mode execution) and the state to be enabled is sid. Algorithm 1 shows the pseudo code for the SpAP mode execution. In each cycle, if no state is enabled (Line 4), it performs a jump operation, setting the current input position i to the input position where the next intermediate report was generated. The enable operation (Line 9 to Line 11) is performed in either of two scenarios: the current input position i reaches the input position in the next intermediate report, or the current input position i was just set to L[j].c by the jump operation. The remaining functionality of the SpAP mode is the same as the BaseAP mode.
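The jump and enable operations described above can be mirrored in software as a rough sketch. The code below is an assumption-laden illustration rather than the hardware design: `run_spap` and its arguments are hypothetical names, the cold sub-NFA and report list are made up for the example, and the reports are assumed to be sorted by input position with every position inside the input.

```python
def run_spap(succ, symbols, reporting, reports_in, stream):
    """Software model of SpAP mode over the predicted cold set.

    reports_in: intermediate reports as (c, sid) tuples sorted by c.
    Returns (final reports, number of input symbols actually processed).
    """
    enabled = set()
    out_list = []
    executed = 0
    i = j = 0                      # i indexes the input, j indexes reports_in
    while i < len(stream):
        if not enabled:
            if j < len(reports_in):
                i = reports_in[j][0]           # jump operation
            else:
                break                          # nothing left to handle
        while j < len(reports_in) and reports_in[j][0] == i:
            enabled.add(reports_in[j][1])      # enable operation
            j += 1
        activated = {s for s in enabled if stream[i] in symbols[s]}
        enabled = set()
        for s in sorted(activated):
            if s in reporting:
                out_list.append((i, s))
            enabled.update(succ[s])
        executed += 1
        i += 1
    return out_list, executed


# Hypothetical predicted cold set: S3 accepts 'c' and leads to reporting S6.
COLD_SUCC = {"S3": ["S6"], "S6": []}
COLD_SYMBOLS = {"S3": {"c"}, "S6": {"f"}}
REPORTING = {"S6"}

# Two intermediate reports, as if produced by an earlier BaseAP pass.
L = [(2, "S3"), (7, "S3")]
print(run_spap(COLD_SUCC, COLD_SYMBOLS, REPORTING, L, "abcxxabcf"))
```

In this example only 4 of the 9 input symbols are processed, because the model jumps from position 0 to 2 and later from position 4 to 7. The bounds check on `j` is evaluated before the report is read, so the loop is safe once all reports are consumed.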
Algorithm 1 Functionality of SpAP mode
Input: L, the list of intermediate reports. Each element in L contains (c, si), showing the input position where the report was generated and the state ID to be enabled.
Input: input, the input symbol stream.
Output: out_list, the list of reports.
 1: i ← 0
 2: j ← 0    ▷ i is the index (input position) of input, j is the index of L.
 3: while i < input.length do
 4:     if E is ∅ then    ▷ E is the set of enabled states.
 5:         if j < L.length then
 6:             i ← L[j].c    ▷ Jump operation.
 7:         else
 8:             break
 9:     while j < L.length and L[j].c = i do
10:         enable L[j].si    ▷ Enable operation.
11:         j ← j + 1
12:     A ← {states in E that accept input[i]}
13:     ▷ A is the set of activated states.
14:     E ← ∅
15:     for all s in A do
16:         if s is reporting state then
17:             append (i, s.id) to out_list
18:         E ← E ∪ {successors of s}
19:     i ← i + 1

We describe next how these operations are used to handle realistic partitioning scenarios with the help of an illustrative example.

Illustrative Example. Figure 6 earlier discussed the performance benefits of perfect partitioning. Under realistic partitioning, inaccurate predictions of cold states require intermediate report handling. Figure 9 shows an illustrative example demonstrating the benefits of executing AP in BaseAP and SpAP modes. The execution starts in the BaseAP mode ( ) that is configured with the predicted hot set. During its execution, two intermediate reports are generated at input position 5 and input position 4, respectively, and are stored (a, b). Once all the input symbols are consumed, the SpAP mode begins ( ), which is driven by both the input stream and the intermediate reports. If no state is enabled, SpAP mode jumps to the input position where the next intermediate report was generated. In this example, initially, it jumps directly to input position 5 of the first intermediate report (c). During the execution, when there is no enabled state (at input position 8), the SpAP jumps to input position (4) of the next intermediate report ( ). Therefore, under SpAP, only a portion of the input symbols are executed (green shaded part in ).

B. Implementation Details

We describe the required hardware implementation supporting SpAP mode by implementing the jump and enable operations on top of the current AP architecture. We start with the implementation of the SpAP operations. Then we estimate the execution time overhead of these operations. Finally, we demonstrate the storage requirements for the intermediate reports.

Jump Operation. The jump operation modifies a register that tracks the current input position. Specifically, if no STE is enabled, the jump operation updates the register value with the input position from the next intermediate report. Since no state configured to SpAP is always enabled, the enabled states in the next cycle are determined only by the activated states in the current cycle. Therefore, given that the routing matrix routes the enable signal from the activated states, we assume that the routing matrix provides a flag that is set if no STE is enabled.

Enable Operation. Given an intermediate report, we use the state ID information to enable the corresponding STE. Since STEs are connected to the routing matrix, and the routing matrix follows a hierarchical design (blocks, rows, and STEs) [6], we utilize this hierarchy to perform the enable operation. The routing matrix consists of 96 blocks per half core. Each block is a group of 16 rows, and each row is a group of 16 STEs. Since a state ID is represented by 16 bits, we divide these bits to enable the required STE in a hierarchical manner. We use the first 8 bits to select the block, the middle 4 bits to select the row, and the last 4 bits to select the required STE within the row. We use a total of three decoders to select the required block, row, and STE, respectively. Specifically, a 7×128 decoder is used to select the block. Then, a 4×16 decoder selects the row. Finally, a 4×16 decoder enables the required STE. The enable operation works in parallel with the processing of input symbols during SpAP mode.

Enable Operation Overhead. We can overlap the enable operation of only one intermediate report with the processing of the input symbols in SpAP mode. Thus, if multiple intermediate reports were generated at the same input position during BaseAP mode, the input processing is stalled until all the states in the simultaneous intermediate reports are enabled.
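The hierarchical selection used by the enable operation can be sketched in Python as follows. This is only an illustration: the function name `decode_state_id` is hypothetical, the most-significant-bits-first field order is an assumption, and the 16 rows per block and 16 STEs per row are the values implied by the 4-bit row and STE fields.

```python
BLOCKS_PER_HALF_CORE = 96   # from the text
ROWS_PER_BLOCK = 16         # assumed: implied by the 4-bit row field
STES_PER_ROW = 16           # assumed: implied by the 4-bit STE field


def decode_state_id(state_id):
    """Split a 16-bit state ID into (block, row, ste) indices.

    Assumed layout: bits 15..8 select the block, bits 7..4 the row,
    and bits 3..0 the STE within the row.
    """
    if not 0 <= state_id < (1 << 16):
        raise ValueError("state ID must fit in 16 bits")
    block = (state_id >> 8) & 0xFF
    row = (state_id >> 4) & 0xF
    ste = state_id & 0xF
    if block >= BLOCKS_PER_HALF_CORE:
        raise ValueError("block index exceeds the 96 blocks of a half-core")
    return block, row, ste


block, row, ste = decode_state_id(0x5A3C)   # block 0x5A, row 0x3, STE 0xC
```

With these sizes a half-core exposes 96 × 16 × 16 = 24,576 STEs, which matches the 24K half-core capacity used in the evaluation.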
To detect simultaneous intermediate reports in SpAP mode, we compare the input position of the head intermediate report with the next input position (current input position + 1). Similarly, we compare the input position of the second intermediate report with the next input position. If both of these comparisons are set, we pause the processing of the input symbols. After enabling the states in all simultaneous intermediate reports, the input processing resumes. The cycles spent to enable the simultaneous intermediate reports are considered overhead to the overall SpAP mode execution and are accounted for in our evaluation methodology.

Intermediate Reports Storage Overhead. The list of intermediate reports is stored in the off-chip device memory. Only a portion of the reports is loaded to the on-chip memory to be consumed during the SpAP mode. We use a queue of 8 entries to store the loaded intermediate reports. Because each intermediate report is an (input position, state ID) tuple, we need 6 bytes per intermediate report (4 bytes for the input position and 2 bytes for the state ID). Thus, the overall storage required for the intermediate reports queue is 8 × 6 bytes.

VI. EVALUATION METHODOLOGY

A. Applications

We evaluate our mechanisms with all applications in the ANMLZoo benchmark suite [7] and the Regex benchmark suite [4]. Table II shows that these applications have states ranging from approximately K to 00K, and several of them have more than 24K states, which is the size of our baseline AP half-core. In order to evaluate applications with an even larger number of states, we generate multiple applications based on three sources: ClamAV [9], Hamming [40], and Snort [4].

ClamAV4k (). We convert the regular expressions in main.cvd of the Q 08 ClamAV Virus Database to ANML format. We select the first 4,000 patterns from the virus database. We use the same input as ClamAV in ANMLZoo [7].

Hamming. We generate Hamming automata using the same approach as the ANMLZoo benchmark suite [40]. To keep it consistent with Hamming in ANMLZoo, we also create the automata in the BMIA (Bounded Mismatch Identification Automaton) format. We create three different workloads from Hamming that contain different numbers of NFAs, namely, and. For each workload we generate, we create a mix of different expected pattern lengths (8,, 0, 0), each with a distance of to 0% of the pattern length (e.g., 0. 0 = 6). Similar to Hamming in ANMLZoo [7], we generate the inputs randomly.

L. Our application includes ,6 rules from both the community rules and the registered rules of the Snort network intrusion detector [4]. We convert the regular expressions to ANML format. We use the same network traffic input as the Snort application in ANMLZoo.

We consider a total of 6 applications and divide them into three groups based on the number of states they contain. The high resource requirement (high) group contains applications with more states than the capacity of an AP chip (49K). The medium resource requirement (medium) group contains applications with more states than the capacity of an AP half-core (24K). The rest of the applications are grouped into the low resource requirement (low) group.

B. Experimental Setup

We build our mechanisms on top of the open-source virtual automata simulator VASim [4]. As we mentioned in Section V, we evaluate both AP-CPU and BaseAP/SpAP execution. In the AP-CPU execution, the states that are executed in the SpAP mode are instead executed on the CPU. Table III shows a summary of the evaluated scenarios. We model different timing mechanisms for AP-CPU and BaseAP/SpAP in the simulator as detailed below.

Timing AP-CPU. We record the total amount of time that the CPU spends handling the intermediate reports by using std::chrono from the C++ standard library. Therefore, we use the real time when we calculate the speedup in the AP-CPU execution. We run our experiments on a machine with an Intel(R) Xeon(R) CPU E5-68 v. We use 7.5 ns as the cycle time per symbol [] for the BaseAP execution.
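As a rough illustration of mixing simulated cycles with measured wall-clock time, the sketch below converts AP cycles to seconds using the 7.5 ns symbol cycle and adds a measured CPU time. The combination formula, the function names, and the use of Python's `time.perf_counter` in place of std::chrono are assumptions made for this example; the text does not spell out the exact AP-CPU speedup expression.

```python
import time

CYCLE_TIME_NS = 7.5  # cycle time per input symbol on the AP, from the text


def ap_seconds(cycles):
    """Convert a simulated AP cycle count to seconds."""
    return cycles * CYCLE_TIME_NS * 1e-9


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def ap_cpu_speedup(baseline_cycles, baseap_cycles, cpu_seconds):
    """Assumed form: baseline AP time over (BaseAP time + measured CPU time)."""
    return ap_seconds(baseline_cycles) / (ap_seconds(baseap_cycles) + cpu_seconds)


# Hypothetical numbers: 4M baseline cycles, 1M BaseAP cycles, 7.5 ms on the CPU.
example_speedup = ap_cpu_speedup(4_000_000, 1_000_000, 0.0075)
```

A value below 1 from `ap_cpu_speedup` would correspond to a slowdown, which is the case the CPU handling time can cause when it dominates.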

TABLE II: List of evaluated applications. RStates stands for reporting states and MaxTopo stands for maximum topological order across NFAs. Grp. stands for resource requirement groups: High (H), Medium (M), Low (L).

Application | Abbr. | Grp. | #States | #NFAs | MaxTopo | #RStates
ClamAV4000 [9] | | H | | | |
Hamming500 [40] | | H | | | |
Hamming000 [40] | | H | | | |
big [4] | L | H | | | |
Hamming500 [40] | | H | | | |
[7] | | H | | | |
Dotstar [7] | | H | | | |
EntityResolution [7] | | H | | | |
RandomForest [7] | | H | | | |
[7] | | H | | | |
ClamAV [7] | | H | | | |
[7] | | M | | | |
Protomata [7] | | M | | | |
[7] | | M | | | |
PowerEN [7] | | M | | | |
RandomForest [7] | | M | | | |
TCP [4] | TCP | L | | | |
Dotstar06 [4] | 06 | L | | | |
Ranges05 [4] | Rg05 | L | | | |
Ranges [4] | Rg | L | | | |
ExactMath [4] | EM | L | | | |
Dotstar09 [4] | 09 | L | | | |
Dotstar0 [4] | 0 | L | | | |
Hamming [7] | HM | L | | | |
Levenshtein [7] | LV | L | | | |
Bro7 [4] | Bro7 | L | | | |

TABLE III: Summary of Scenarios

System | Software | Hardware (entire NFAs) | Hardware (predicted hot set) | Hardware (predicted cold set)
AP | Partition (at NFA granularity) | BaseAP Mode | N/A | N/A
AP-CPU | Partition (hot/cold set) | N/A | BaseAP Mode | CPU
BaseAP/SpAP | Partition (hot/cold set) | N/A | BaseAP Mode | SpAP mode

Recording the Cycles in BaseAP/SpAP. In the BaseAP/SpAP execution, we record the execution cycles via the simulator. The number of cycles in BaseAP/SpAP execution is the sum of cycles spent on BaseAP mode and SpAP mode. Therefore,

$$\text{Speedup}_{\text{BaseAP/SpAP}} = \frac{\text{Number of cycles on AP baseline execution}}{\text{Number of cycles on BaseAP mode} + \text{Number of cycles on SpAP mode}}.$$

Performance per STE. We define a metric called performance per STE to show how much throughput each STE can provide on average. Specifically,

$$\text{performance per STE} = \frac{\text{throughput}}{C_{AP}}, \quad \text{where} \quad \text{throughput} = \frac{\text{number of input symbols}}{\text{number of cycles}}.$$

This allows us to compare APs with different capacities while also considering techniques that improve performance solely by increasing the AP size. Because each STE in the AP occupies die area, we can also consider this metric as a proxy for performance/area.

Overheads. In this paper, we focus on reducing the re-execution overhead, as we found it is the major performance bottleneck in AP. The new SpAP mode incurs stall cycles due to simultaneous intermediate reports (Section V-B). Our final results include these stall cycles.
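The two cycle-based metrics defined above can be written out as a short Python sketch. The function and parameter names are hypothetical, and `ap_capacity` is assumed to be the number of STEs of the AP under study (the role played by C_AP in the formula); the numbers in the example are made up for illustration.

```python
def speedup_baseap_spap(baseline_cycles, baseap_cycles, spap_cycles):
    """Baseline AP cycles divided by the sum of BaseAP and SpAP cycles."""
    return baseline_cycles / (baseap_cycles + spap_cycles)


def performance_per_ste(num_input_symbols, num_cycles, ap_capacity):
    """Throughput (symbols per cycle) divided by the AP capacity in STEs."""
    throughput = num_input_symbols / num_cycles
    return throughput / ap_capacity


# Made-up example: the baseline needs 3 passes over a 1000-symbol input,
# while BaseAP takes 1000 cycles and SpAP takes 500 cycles.
s = speedup_baseap_spap(3000, 1000, 500)

# The same throughput on an AP with twice the STEs halves performance per STE.
small = performance_per_ste(1000, 1500, 24576)
large = performance_per_ste(1000, 1500, 49152)
```

The last two lines mirror why a larger device can show lower performance per STE for the same application: the throughput is unchanged while the capacity in the denominator doubles.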
There are two more generic overheads, related to output and reconfiguration. In our evaluations, we do not include the output overhead [0] and rely on existing work [4] that proposes both hardware and software techniques to address it. We also do not include the reconfiguration overhead (50 ms [44], [45] for reconfiguring a full AP board) in our results, as we believe it can be amortized over AP execution, especially when it executes very large inputs.

VII. EXPERIMENTAL RESULTS

Effect on Performance. To show the benefits of our schemes, we evaluate the speedup for applications in the high and medium groups. Our mechanisms do not change the throughput of AP for applications in the low category since the sizes of these applications are smaller than our baseline AP with 24K STEs. Figure 10(a) shows the performance results of our proposal, from which we draw four major observations. First, the AP-CPU execution shows a significant geometric mean slowdown of 9.8 and .9 under 0.% and % profiling input, respectively. However, five applications out of 6 applications (,,,, ) achieve 4. geometric mean speedup at no cost of hardware modification. Second, we find that BaseAP/SpAP execution shows speedup in the majority of evaluated applications. It can achieve .8 and . geometric mean speedup using 0.% and % of input as profiling input, respectively. Third, BaseAP/SpAP execution can be slower than the AP in a few applications (e.g., ), since these applications generate many simultaneous intermediate reports, leading to lengthy enable stalls in the SpAP mode (shown in Table IV). Fourth, in applications with large SCCs that prevent efficient partitioning (e.g., , see Figure 8), our scheme configures all the states to the BaseAP mode execution with no change in execution time.

Effect on Performance per STE. In order to evaluate the efficiency of our schemes across a wider set of system sizes and configurations, we show performance per STE in Figure 11, from which we draw two major observations. First, although different sizes of AP chips can execute the same application with the same performance (e.g., an application in the low group fits and runs on both an AP chip and an AP half-core), larger AP chips have lower performance/STE, because a smaller fraction of the STEs in the larger AP is utilized for the same application size. Such underutilization leads to lower performance/STE. Second, on average, our scheme not only increases performance/STE by .% under the scenario of an AP half-core and using % profiling input, but also consistently achieves better performance/STE under different sizes of AP. There are two major reasons: (1) we predict cold states and eliminate them from being configured, which increases AP utilization; (2) we use fewer cycles in the SpAP mode for mis-prediction handling than re-execution by batches, hence increasing the throughput.

Resource Savings and Speedup. We show the results of resource savings in Figure 10(b). By comparing it with Figure 10(a), we make three observations. First, generally, the applications with high resource savings also have good speedups. Second, shows a slowdown although it has good resource savings. This is because its SpAP mode execution has many enable stalls due to a large number of simultaneous intermediate reports (Table IV). Third, although the resource savings may be the same under different profiling inputs, the speedup may be different (e.g., ). This is because the original size of the predicted hot set was different, but due to the optimization in Section IV-B, each batch was extended with part of the predicted cold states to match the capacity of AP. Consequently, this leads to the same resource savings.

However, since a larger profiling input has higher recall for hot states (Section IV-A), the speedup is also higher. To conclude, the speedup is generally related to resource savings as we explained in Section III-C, but the speedup also depends on other factors such as the quality of prediction and the number of enable stalls.

Fig. 10: Speedup and Resource Savings on AP. (a) Speedup with AP-CPU and BaseAP/SpAP execution using 0.% and % profiling input (capacity = 24K). (b) Resource savings (i.e., the portion of states that are not configured in the BaseAP mode).

Fig. 11: Performance per STE of various AP sizes with BaseAP/SpAP execution considering % profiling input.

TABLE IV: Runtime statistics for AP and BaseAP/SpAP (under % profiling input). The first three columns show the number of executions on the AP, BaseAP mode and SpAP mode, respectively. EStalls stands for the stalls caused by enable operations for handling simultaneous intermediate reports. JumpRatio is defined as the proportion of cycles skipped in the SpAP mode.

App | #Baseline: AP | #BaseAP/SpAP: BaseAP Mode | #BaseAP/SpAP: SpAP Mode | #Intermediate Reports | #EStalls | JumpRatio

Intermediate Reporting States. The addition of intermediate reporting states increases the total number of states, which could increase the total number of configurations and executions (e.g., in Table IV). Figure 12 shows the effect on the number of reporting states in BaseAP mode normalized to that of the baseline. In the BaseAP/SpAP mode, the total number of reporting states includes both original reporting states and intermediate reporting states (stacked bars in the figure). We make two observations.
First, the total number of reporting states in BaseAP mode could be more than in the baseline AP execution, which only contains original reporting states. For example, the number of reporting states in increases by .6, because it has a large number of crossing edges between the predicted hot set and the predicted cold set. Second, the number of reporting states could decrease (e.g., and ) in the BaseAP mode execution because the number of crossing edges is smaller than the number of original reporting states. Although our scheme may increase the number of reporting states, we are aware that an effective software-based reporting state compression technique [4] could be applied on top of our scheme.

Fig. 12: Comparison of number of reporting states (normalized to the baseline). IM stands for intermediate reporting states. True stands for original reporting states in BaseAP mode. P stands for profiling.

Effect of Jump Operations. In Table IV, although for some applications (e.g., , ) the number of executions of BaseAP/SpAP may be greater than or equal to the baseline, we still obtain speedups on them because SpAP mode execution can reduce the total number of cycles due to the jump operations. To show the effect of jump operations, we define JumpRatio as the proportion of cycles skipped in the SpAP mode. Formally,

$$\text{JumpRatio} = 1 - \frac{\text{Total cycles on SpAP mode}}{\text{Number of batches on SpAP mode} \times \text{Length of input stream}}.$$

A higher JumpRatio indicates a better effect of jump operations. We show JumpRatio in Table IV for the applications that use SpAP mode. To conclude, the majority of the applications only execute a few percent of input symbols with the help of jump operations.
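As a small worked example of the JumpRatio definition, the Python sketch below computes the proportion of skipped cycles. The function name and the sample numbers are hypothetical, and the "one minus executed fraction" form is the reading implied by the definition "proportion of cycles skipped".

```python
def jump_ratio(spap_cycles, spap_batches, input_length):
    """Proportion of SpAP cycles skipped by jump operations.

    Without jumps, each SpAP batch would consume the whole input stream,
    i.e. spap_batches * input_length cycles in total.
    """
    if spap_batches <= 0 or input_length <= 0:
        raise ValueError("batches and input length must be positive")
    return 1.0 - spap_cycles / (spap_batches * input_length)


# Hypothetical run: 2 SpAP batches over a 1000-symbol input, 50 cycles executed.
ratio = jump_ratio(50, 2, 1000)
```

Here only 2.5% of the possible cycles are executed, so the JumpRatio is 97.5%.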

Fig. 13: Sensitivity on the different capacities of AP chip. (a) Speedup on small AP (capacity = K). (b) Speedup on an AP chip (capacity = 49K).

Sensitivity of speedup on capacity of AP. The applications in the low resource requirement group require fewer states than the capacity of an AP half-core. Figure 13(a) shows the speedup achieved by our schemes when the capacity of AP is K. Similar observations still hold as discussed in Figure 10(a). Specifically, BaseAP/SpAP achieves .9 and . speedup using 0.% and % profiling input, respectively. In addition, we demonstrate another sensitivity study on AP with 49K STEs for the applications in the high group. Figure 13(b) shows BaseAP/SpAP execution achieves .9 and . speedup using 0.% and % profiling input in the applications of this group.

VIII. RELATED WORK

To the best of our knowledge, this is the first work that designs efficient architectural support for large-scale NFA applications on AP.

Spatial Architectures. Multitasking on spatial architectures is usually carried out through the use of multiple contexts [46], which can consume extra memory. In contrast, our BaseAP/SpAP proposal relies on the ability to eliminate dynamically unused states from NFAs to improve AP utilization. We rely on a mechanism to transfer control to a spatially distinct partition to accommodate larger-than-device NFAs, though these could be implemented as multiple contexts. Recently, gate removal has been proposed to eliminate unused logic gates from general purpose processor IPs to customize processors to specific applications [47]. In our approach, we only eliminate states from the NFA (i.e., the program), and not the hardware. There are also alternative implementations of AP [48] [50]. For example, cache automaton [49] re-purposes the last-level cache for automata processing.
We believe our techniques are complementary, as we propose hardware/software mechanisms to make the automata processing itself more efficient.

DFA and NFA Acceleration. Deterministic finite automata (DFA) have been characterized previously with respect to implementing special machines [5] and for parallelization [7], [5] [55]. Parallel execution of NFAs on the AP processor has been proposed by trading AP resources for higher throughput []. However, our characterization of the dynamic execution properties of NFAs specific to the AP execution model is, to our knowledge, the first of its kind. Our elimination of dynamically unused states can free up AP resources to complement parallel execution.

FSM Decomposition. FSM decompositions [56] [59] could reduce the complexity of placement and routing in the routing matrix by simplifying the layout. While cascade decompositions are the closest to our studies, they are often static, for deterministic machines only, and are mostly not based on dynamic state behavior (i.e., predicted hot vs. predicted cold states). In contrast, our proposed approach (which uses graph-theoretic techniques, rather than sequential machine theory) is focused on increasing the AP throughput by allowing only predicted hot states to be configured to the AP. We believe both approaches are complementary and can be applied to different bottlenecks in the AP execution pipeline. For example, FSM decomposition can make the reconfiguration process efficient while our technique can accelerate the NFA execution on AP by reducing the number of re-executions of the input symbol stream.

IX. CONCLUSIONS

Automata processors (AP) are very efficient in executing Non-deterministic Finite Automata (NFAs). However, like other types of spatial architectures, AP faces major challenges in its execution model to efficiently execute very large tasks. In this paper, we make use of the inherent properties of NFAs to avoid using compute resources for states that are never used during execution by a low-cost software/hardware-coordinated approach.
Consequently, this results in a new execution model for APs that enables efficient and high-performance processing for large-scale tasks. We believe this work will be helpful towards wider adoption of APs and will open up new research directions for enabling efficient NFA processing.

ACKNOWLEDGMENT

The authors thank the anonymous reviewers and members of the Insight Computer Architecture Lab at the College of William and Mary for their feedback. This material is based upon work supported by the National Science Foundation (NSF) grants (#6576, #775, and #750667). This work was performed in part using computing facilities at the College of William and Mary which were provided by contributions from the NSF, the Commonwealth of Virginia Equipment Trust Fund and the Office of Naval Research. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
