Communication Analysis



Chapter 5 Communication Analysis

5.1 Introduction

The previous chapter introduced the concept of late integration, whereby systems are assembled at run-time by instantiating modules in a platform architecture. The instantiated modules interact using the communication resources supplied by the architecture. Where several virtual communication channels are mapped to a shared medium, the available bandwidth must be allocated to the channels appropriately. Determining how resources should be allocated, and ensuring that real-time performance requirements are met, is challenging because it must be done at run-time. This precludes the use of the simulation-based or trace-based methods described in the background. Instead, following a similar theme to the construction of the architectural platform, this thesis advocates the application of appropriate constraints to the communication system, such that its behaviour becomes predictable and analysable.

In the Sonic-on-a-Chip architecture, the modular processing elements interface to the communication infrastructure through routers of fixed design. In addition, the processing elements perform image processing tasks which are typically highly repetitive. As will be demonstrated in this chapter, these attributes enable the communication behaviour of the processing elements to be characterised and parameterised.

This chapter develops an analysis of a shared bus scheme using statistical time division multiplexing (STDM), as used by Sonic-on-a-Chip. The objective of the analysis is a method for determining whether a given mapping of channels to shared media can meet pre-determined resource and real-time performance requirements.

An outline of the original contributions of this chapter is as follows.

- Details of the parameterisation of processing elements necessary for the purpose of the analysis. This is included in the description of the analysis scenario in Section 5.2.
- A thorough analysis of a bus shared using STDM arbitration. This starts in Section 5.3 by assuming channels are buffered by FIFOs of unlimited size. The analysis is then extended to size-limited saturating buffers in Section 5.4.
- As the outcome of the analysis, a method for determining bandwidth allocation amongst channels sharing a bus, and a means of estimating the maximum required channel buffer sizes and the maximum induced latency (Section 5.5). The method is summarised in Section 5.6.
- A verification of the analytically-derived results against a cycle-accurate simulation model. The simulation results are presented in Section 5.7.

It should be noted that while the analysis is applied to the Sonic-on-a-Chip platform, it is not limited to this architecture, or even to video processing, but may be applied to any communication system designed with similar constraints.

5.2 Scenario and Assumptions

We start with the assumption that the processing system is a process network comprising a number of processing nodes (PEs) connected by a series of communication channels, which is to be mapped to a system of buses connected by bridges, as in the template described in Chapter 4. The components of a channel are illustrated in Figure 5.1. Note the use of the stream buffers that were described in Section 4.5.

Figure 5.1: Components in a communication channel mapped to a shared medium. The channel is buffered on the producer side by a FIFO of depth $\beta_{k,p}$ and on the consumer side by a stream buffer of depth $\beta_{k,c}$. The channel interfaces regulate the access to the communication medium.

Assume that each node has been assigned to a bus. By using bridges which buffer data, the behaviour of each bus can be isolated and studied separately. We will ignore channels which are assigned to the ChainBus connections, as they are of no interest in this analysis. Therefore, for a particular bus we need to set the size of the time-slot for each channel to ensure throughput is met, as well as determining the maximum latency and the required buffer sizes.

In the analysis which follows, the processing nodes are assumed to have a common pattern of behaviour: one or more streams of data are stored in input buffers; the engine performs a number of accesses on the stored data and outputs results to the output buffer; input data which are no longer needed are discarded from the front of the input buffer. This pattern is repeated indefinitely, such that the processing node has a baseline periodicity. Note that in some cases a node may exhibit different input and output periodicity; for example, a node which computes a histogram of the intensity values of an image may access and discard pixels one at a time (input periodicity of 1 pixel), whereas the results are only presented to the output buffer once per frame (output periodicity of 1 frame).

The behaviour pattern is illustrated by the following example of a motion vector estimator (MVE). Motion vector estimation is a key computation in MPEG video compression algorithms as well as machine vision applications.

Example 5.1: Motion vector estimation. In motion vector estimation, a reference frame is divided into a number of non-overlapping reference blocks. The best match for each reference block is found within the next video frame (the search frame). The area of the search frame to scan for a given reference block is limited to a search window. The search windows overlap. Take an MVE processing node which scans search windows of size 44 × 44 pixels for a match to a reference block of 16 × 16 pixels. The processing node has two inputs (the search window and the reference block) and one output (the location of the best match). A full search is computationally expensive, and can be avoided with intelligent techniques such as the three-step coarse-fine search [138]. After completing a search, a new search is started for the next reference block. Since the search windows overlap, only part of the current search window is discarded from the input buffer. The buffering behaviour is illustrated in Figure 5.2.

Table 5.1 lists some sample video processing algorithms and shows how they may be parameterised in pixels. All algorithms (with the exception of the motion vector estimator) process non-interlaced raster-scanned images of width c and height r.

For reference in this chapter, Figure 4.8(a) from Section 4.4 is reproduced in Figure 5.3. The diagram illustrates the statistical time division multiplexing (STDM) protocol used for the majority of the data transfer on the shared buses.

Figure 5.2: Motion vector estimation buffering behaviour. At the top, the graph shows the number of words in the buffer and the buffer location accessed by the engine. If the location accessed does not hold valid data the engine is stalled (middle graph). The wait signal (on the router side) is asserted when the buffer is full (lower graph).

Figure 5.3: The STDM communication protocol.

Table 5.1: Characteristics of video processing algorithms operating on frames of width c pixels and height r pixels. Columns: algorithm; periodicity (input, output); advance; stored data. Rows: window function, e.g. median filter (5 × 5 window); 2D convolution (3 × 3 kernel); histogram (3 colour channels, 256 points); DFT (parallel); motion vector estimation (16 × 16 macro-block).

5.3 First Approximation

In this section, a first approximation analysis is formulated by assuming unlimited buffer sizes. The aim is to determine, for a given mapping of channels to a bus, what size time-slot to allocate to each channel, and the required amount of channel buffering.

Consider a bus with maximum bandwidth $\Gamma$ supporting $N$ channels. Each channel $k$ has a required average throughput $\phi_k$, and is allocated $\omega_k$ consecutive bus cycles for each data transfer (at one word per cycle), excluding the STDM overhead of $h$ cycles. Clearly the average bandwidth required must be less than that available, so

\[ \sum_{k=1}^{N} \phi_k = \Phi < \Gamma \tag{5.1} \]

must be satisfied. If data are produced and consumed at a constant rate ($\phi_k$ for each channel $k$) and there are no buffer overflows, then the service period (the time taken for all channels to have completed one transfer each, see Figure 5.3) is:

\[ \tau = \frac{1}{\Gamma} \sum_{k=1}^{N} (\omega_k + h) = \frac{1}{\Gamma} \left( \sum_{k=1}^{N} \omega_k + Nh \right) \tag{5.2} \]

During this time, $\phi_k \tau$ data are produced for channel $k$. In steady state:

\[ \phi_k \tau = \omega_k \tag{5.3} \]

Then we can solve for $\tau$:

\[ \tau = \frac{1}{\Gamma} \left( \tau \Phi + Nh \right) \tag{5.4} \]

\[ \tau = \frac{Nh}{\Gamma - \Phi} \tag{5.5} \]

Also:

\[ \omega_k = \frac{\phi_k N h}{\Gamma - \Phi} \tag{5.6} \]

This value $\omega_k$ is the time-slot size for channel $k$. This parameter is used as the cycle count limit in the programming of the arbitration unit. The minimum source buffer size $\beta_{k,p}$ for each channel is the number of words which need to be stored during the time the channel does not have control of the bus:

\[ \beta_{k,p} = \omega_k \left( 1 - \frac{\phi_k}{\Gamma} \right) \tag{5.7} \]
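The first approximation can be sketched in a few lines of Python. This is only an illustration of Eqs. (5.5) to (5.7); the function name and the channel figures are hypothetical and are not taken from the thesis.

```python
def first_approximation(phi, gamma, h):
    """Service period, time-slot sizes and source buffer sizes (Eqs. 5.5-5.7).

    phi   -- mean throughput of each channel (words per unit time)
    gamma -- bus bandwidth (words per unit time, one word per cycle)
    h     -- STDM overhead per channel, in bus cycles
    """
    n = len(phi)
    total = sum(phi)                      # Phi in Eq. (5.1)
    if total >= gamma:
        raise ValueError("mean demand must be below the bus bandwidth (Eq. 5.1)")
    tau = n * h / (gamma - total)         # Eq. (5.5)
    omega = [p * tau for p in phi]        # Eq. (5.6): omega_k = phi_k * tau
    beta_p = [w * (1 - p / gamma) for w, p in zip(omega, phi)]  # Eq. (5.7)
    return tau, omega, beta_p


# Hypothetical figures: a 50 Mw/s bus, three channels, h = 3 cycles.
tau, omega, beta_p = first_approximation([10.0, 20.0, 5.0], gamma=50.0, h=3)
```

With bandwidths in Mw/s the service period comes out in microseconds, and the slot sizes satisfy Eq. (5.2) by construction.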

However, avoiding buffer saturation comes at the cost of more consumer-side buffering than is necessary; this is highly undesirable as on-chip memory is limited, particularly in FPGAs. The following example illustrates this point.

Example 5.2: Simple input buffer sizing. Consider the motion vector estimation buffer case of Example 5.1. Each search window comprises 44 × 44 = 1936 pixels, with an overlap of 44 × 28 = 1232 pixels between adjacent search windows. To ensure that valid data are always available to the engine, a simple method for determining the buffer size is to store 1936 pixels (for the current search window) and an additional 44 × 16 = 704 pixels (the non-overlapping part of the subsequent search window), totalling 2640 pixels. The actual memory consumed is a power of 2, 4096 pixels, and is therefore 112% larger than required for storing just the current search window.
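The arithmetic of Example 5.2 can be checked with a short Python sketch. The helper name is hypothetical; the pixel counts are those quoted in the example.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


window = 1936        # pixels in the current search window
new_part = 704       # non-overlapping part of the next search window
needed = window + new_part                        # pixels to be stored
allocated = next_power_of_two(needed)             # memory actually consumed
overhead = 100 * (allocated - window) / window    # percent above one window
```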

5.4 Size-limited Buffers

In order to account for limited buffer sizes, we will modify the assumptions and allow some buffers to saturate. Again, we will determine the time-slot sizes to be used in the arbitration of the bus, and the required buffer sizes for the buffers which do not saturate. In this case, channel throughput is no longer constant, but has inactive periods (when the consumer-side buffer is full), which must be balanced by periods where the throughput is higher than average. This is shown in Figure 5.4 for the case of Example 5.2. The graph shows the number of words in the input stream buffer (the fill-level) for the case where there is no buffer saturation (upper black line) and where the buffer does saturate (lower blue line). The rate at which the buffer is filled is slightly higher in the saturating case.

Figure 5.4: Graph of motion vector estimation search window buffering. The address pattern shows all possible address accesses over one fundamental period. Buffer fill levels for non-saturating and saturating buffer conditions are shown.

There are several important features to be noted:

1. In the analysis of the saturating buffer case, we also take into account the pattern of locations accessed in the buffer. In Figure 5.4 all possible accesses are plotted for the MVE example. The buffer must be filled sufficiently quickly such that data accesses are all within the available buffered data.
2. To simplify the calculation of the required fill rate for a given access pattern, not all addressed locations need to be considered. The required fill rate can be quickly calculated from the envelope of possible accessed locations.
3. It is assumed that the engine processing rate will be at least as fast as the overall required throughput rate, and potentially faster. This is accounted for by introducing an allowable stall time per fundamental period when determining the required fill rate. This can be seen to be 1000 cycles in the example of Figure 5.4.

We now divide the $N$ channels into $M$ channels whose activity is time-variant (T-V), due to buffer saturation, and $N - M$ time-invariant (T-I) channels. One can visualise the activity of the time-variant channels as being a cyclic pattern, with a period where there is a burst of activity followed by a period where there is no bandwidth demand once the consumer-side buffer has saturated.

The time-variant channels have a required peak throughput $\phi'_k$, $1 \le k \le M$. The bus must be able to support concurrent peak demands of the time-variant channels:

\[ \Phi_{\text{T-V}} = \sum_{k=1}^{M} \phi'_k < \Gamma \tag{5.8} \]

If the peak demand on the bus, including the time-invariant channels, is less than the bus capacity:

\[ \Phi_{\text{peak}} = \sum_{k=1}^{M} \phi'_k + \sum_{k=M+1}^{N} \phi_k < \Gamma \tag{5.9} \]

then Eq. (5.6) can be used to calculate the STDM time-slot parameters $\omega_k$ by substituting $\phi'_k$ for $\phi_k$ and $\Phi_{\text{peak}}$ for $\Phi$.

If the inequality of Eq. (5.9) does not hold, let us term the bus usage critical. In a bus with a critical level of usage, bandwidth demands vary over time. During periods of peak activity by the time-variant channels the remaining time-invariant channels are starved of bandwidth. This is compensated for during off-peak times. As a result the time-invariant channels have increased buffering requirements.

Consider the case where $M = 1$: there is one time-variant, saturating channel $b$. The peak demand on bus bandwidth is:

\[ \Phi_{\text{critical}} = \Phi_{\text{T-V}} + \hat{\Phi}_{\text{T-I}} \tag{5.10} \]

where $\hat{\Phi}_{\text{T-I}}$ is the reduced total bandwidth available to the $N - 1$ time-invariant channels. Rearranging Eq. (5.6) and substituting variables ($b$ is the time-variant channel):

\[ \Phi_{\text{critical}} = \Gamma - \frac{N h \phi'_b}{\omega_b} \tag{5.11} \]

and also:

\[ \omega_k = \frac{\hat{\phi}_k N h}{\Gamma - \Phi_{\text{critical}}}, \quad M < k \le N \tag{5.12} \]

for the time-invariant channels. The variable $\hat{\phi}_k$ is the bandwidth available to channel $k$ during the peak demand times, and is given by:

\[ \hat{\phi}_k = \phi_k \frac{\hat{\Phi}_{\text{T-I}}}{\Phi_{\text{T-I}}}, \quad M < k \le N \tag{5.13} \]

From these equations it will be possible to determine the time-slot size ($\omega_k$) to allocate to each of the time-invariant channels, provided a value can be found for $\Phi_{\text{critical}}$ first. Using Eqs. (5.10), (5.11), (5.12) and (5.13) we can derive:

\[ \omega_k = \frac{\phi_k N h \hat{\Phi}_{\text{T-I}}}{\Phi_{\text{T-I}} \left[ \Gamma - \left( \Gamma - \frac{\phi'_b N h}{\omega_b} \right) \right]} \tag{5.14} \]

\[ \omega_k = \frac{\phi_k \omega_b \hat{\Phi}_{\text{T-I}}}{\Phi_{\text{T-I}} \phi'_b} \tag{5.15} \]

\[ \omega_k = \frac{\phi_k \omega_b}{\Phi_{\text{T-I}} \phi'_b} \left( \Gamma - \frac{\phi'_b N h}{\omega_b} - \Phi_{\text{T-V}} \right) \tag{5.16} \]

Now, the service period $\tau$ in the critical bus usage case also varies over time. During peak activity periods by the time-variant channels the service time will be longer than when these same channels are idle. For the $M = 1$ case, during the off-peak period (when channel $b$ is idle) the service period is given by:

\[ \tau_{\text{offpeak}} = \frac{1}{\Gamma} \left( \sum_{k=M+1}^{N} \omega_k + Nh \right) = \frac{Nh}{\Gamma - \check{\Phi}_{\text{T-I}}} \tag{5.17} \]

where $\check{\Phi}_{\text{T-I}}$ is the off-peak bandwidth demand of the time-invariant channels. Now substitute Eq. (5.16) in Eq. (5.17), and simplify, noting that $\Phi_{\text{T-V}} = \phi'_b$ in this case and $\Phi_{\text{T-I}} = \sum_{k=M+1}^{N} \phi_k$. Solve for $\omega_b$:

\[ \omega_b = \frac{\phi'_b N h \Gamma}{\left( \Gamma - \phi'_b \right) \left( \Gamma - \check{\Phi}_{\text{T-I}} \right)} \tag{5.18} \]

An expression for $\check{\Phi}_{\text{T-I}}$ must now be found. The channel $b$ will be active for $\phi_b / \phi'_b$ of the time, during which each time-invariant channel $k$ has bandwidth $\hat{\phi}_k$. In order for the average bandwidth requirement for channel $k$ to be met:

\[ \frac{\phi_b}{\phi'_b} \hat{\phi}_k + \left( 1 - \frac{\phi_b}{\phi'_b} \right) \check{\phi}_k = \phi_k, \quad M < k \le N \tag{5.19} \]

Therefore:

\[ \check{\Phi}_{\text{T-I}} = \sum_{k=M+1}^{N} \check{\phi}_k \tag{5.20} \]

\[ \check{\Phi}_{\text{T-I}} = \frac{1}{1 - \frac{\phi_b}{\phi'_b}} \sum_{k=M+1}^{N} \left( \phi_k - \frac{\phi_b}{\phi'_b} \hat{\phi}_k \right) \tag{5.21} \]

\[ \check{\Phi}_{\text{T-I}} = \frac{1}{1 - \frac{\phi_b}{\phi'_b}} \left( \Phi_{\text{T-I}} - \frac{\phi_b}{\phi'_b} \hat{\Phi}_{\text{T-I}} \right) \tag{5.22} \]

\[ \check{\Phi}_{\text{T-I}} = \frac{1}{1 - \frac{\phi_b}{\phi'_b}} \left( \Phi_{\text{T-I}} - \frac{\phi_b}{\phi'_b} \left( \Gamma - \frac{\phi'_b N h}{\omega_b} - \Phi_{\text{T-V}} \right) \right) \tag{5.23} \]

\[ \check{\Phi}_{\text{T-I}} = \frac{1}{1 - \frac{\phi_b}{\phi'_b}} \left( \Phi - \frac{\phi_b}{\phi'_b} \Gamma + \frac{\phi_b N h}{\omega_b} \right) \tag{5.24} \]

So finally:

\[ \omega_b = \left( \frac{\phi'_b N h}{\Gamma - \Phi} \right) \left( \frac{\Gamma - \phi_b}{\Gamma - \phi'_b} \right) \tag{5.25} \]

This can be generalised for situations where $M > 1$:

\[ \omega_b = \left( \frac{\phi'_b N h}{\Gamma - \Phi} \right) \left( \frac{\Gamma - \sum_{i=1}^{M} \phi_i}{\Gamma - \sum_{i=1}^{M} \phi'_i} \right) \tag{5.26} \]

All the variables in this equation are known, so $\omega_b$ can be calculated for all time-variant channels $b \le M$. One of these channels is then used to find $\Phi_{\text{critical}}$ using Eq. (5.11).
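As a rough illustration, the calculation for the critical case can be sketched in Python as follows. The function name, argument layout and channel figures are assumptions made for this sketch; it assumes at least one time-variant and one time-invariant channel, and that the time-variant channels are listed first.

```python
def critical_time_slots(phi_mean, phi_peak_tv, gamma, h):
    """Time-slot sizes for a bus with critical usage.

    phi_mean    -- mean throughput of all N channels; the first M entries
                   are the time-variant (saturating) channels
    phi_peak_tv -- peak throughput of the M time-variant channels
    gamma       -- bus bandwidth; h -- STDM overhead per channel (cycles)
    """
    n, m = len(phi_mean), len(phi_peak_tv)
    total_mean = sum(phi_mean)                 # Phi
    tv_mean = sum(phi_mean[:m])
    tv_peak = sum(phi_peak_tv)                 # Phi_T-V of Eq. (5.8)
    ti_mean = sum(phi_mean[m:])                # Phi_T-I
    if not (total_mean < gamma and tv_peak < gamma):
        raise ValueError("Eq. (5.1) or Eq. (5.8) is violated")
    # Eq. (5.26): slots of the time-variant channels
    scale = n * h / (gamma - total_mean) * (gamma - tv_mean) / (gamma - tv_peak)
    omega_tv = [p * scale for p in phi_peak_tv]
    # Eq. (5.11): critical bandwidth from any one time-variant channel
    phi_critical = gamma - n * h * phi_peak_tv[0] / omega_tv[0]
    # Eq. (5.10): bandwidth left for the time-invariant channels at peak
    ti_reduced = phi_critical - tv_peak
    # Eqs. (5.13) and (5.12): slots of the time-invariant channels
    omega_ti = [(p * ti_reduced / ti_mean) * n * h / (gamma - phi_critical)
                for p in phi_mean[m:]]
    return omega_tv, phi_critical, omega_ti


# Hypothetical bus: 50 Mw/s, h = 3, two saturating channels out of four.
omega_tv, phi_critical, omega_ti = critical_time_slots(
    [10.0, 8.0, 15.0, 5.0], [20.0, 16.0], gamma=50.0, h=3)
```

In this made-up case the peak demand (56 Mw/s) exceeds the bus capacity, so the usage is critical in the sense of Eq. (5.9). With all channels active the service period gives channel 1 exactly its peak rate of 20 Mw/s, which is a useful sanity check on the slot sizes.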

This can be used to find the values of $\omega_k$ for the remaining time-invariant channels $M < k \le N$. This is illustrated in the following example.

Table 5.2: Channel characteristics for Example 5.3. Columns: channel; mean bandwidth $\phi_k$ (Mwords/s); peak bandwidth $\phi'_k$ (Mwords/s); $\omega_k$. The final row gives the totals.

Example 5.3: Calculation of arbitration parameters. A system comprises two of the motion vector estimation process nodes of Example 5.2, processing 640 × 480 images (SDTV/EDTV standards as well as VGA) at different frame rates (22 and 18 frames-per-second). Each node has two input channels (the reference block and the search window) and one output channel (the vectors), making six channels in total, with an overall mean bandwidth of 46.1 Mw/s, mapped to a bus with capacity 50 Mw/s. On inspection of the address patterns, buffer sizes and consumption behaviour of the channels it is determined that two of the destination channel buffers will saturate, increasing the peak bandwidth demand to 52.5 Mw/s, as shown in Table 5.2.

Using Eq. (5.26), and $N = 6$, $h = 3$, $\Gamma = 50$ Mw/s, $\Phi = 46.1$ Mw/s, $M = 2$, we find $\omega_1$ and $\omega_2$. The critical bandwidth demand is $\Phi_{\text{critical}} = 47.9$ Mw/s from Eq. (5.11), and from Eq. (5.10) we find that $\hat{\Phi}_{\text{T-I}} = 7.74$ Mw/s. Therefore, using Eq. (5.13) and Eq. (5.12) we find the values $\omega_k = \{35.9, 29.4, 0.1, 0.1\}$ for $k = \{3, 4, 5, 6\}$. These are rounded up to integer values, while ensuring that the ratio $\omega_k : (\sum_{i=3}^{6} \omega_i + Nh)$ for each $k$ does not decrease in the process, giving $\omega_k = \{40, 33, 1, 1\}$ for $k = \{3, 4, 5, 6\}$. After similarly rounding and adjusting $\omega_1$ and $\omega_2$ we obtain the values for $\omega_k$ as shown in the right column of Table 5.2.
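The rounding step of Example 5.3 is described only in words. One possible reading, sketched below in Python, rounds every slot up and then keeps incrementing any slot whose share of the service period has fallen below its unrounded share. The function name and the iteration strategy are assumptions of this sketch rather than the procedure used in the thesis; it requires a non-zero overhead to terminate.

```python
import math


def round_time_slots(omega, overhead):
    """Round slots up to integers without lowering any channel's share.

    omega    -- real-valued slot sizes of the channels sharing the period
    overhead -- total STDM overhead N*h in cycles (must be positive)
    """
    base = sum(omega) + overhead                  # unrounded period, in cycles
    slots = [max(1, math.ceil(w)) for w in omega]
    changed = True
    while changed:
        changed = False
        period = sum(slots) + overhead
        for k, w in enumerate(omega):
            # slots[k] / period must not be less than w / base
            if slots[k] * base < w * period:
                slots[k] += 1
                period += 1
                changed = True
    return slots


print(round_time_slots([35.9, 29.4, 0.1, 0.1], overhead=6 * 3))
# prints [40, 33, 1, 1]
```

Applied to the unrounded values of Example 5.3 with $Nh = 18$, this reproduces the rounded slots quoted there.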

5.5 Buffer Sizing and Latency

We have so far found a method for determining the time-slot sizes to use in the bus arbitration, including situations in which limited buffering causes time-variant behaviour on some channels. We now must determine the required buffer space on the producer side of these channels and the effect on the size of the buffers for the remaining channels in the system.

If the bus usage is critical, channels which do not exhibit time-variant behaviour require extra buffer space to compensate for periods where their bandwidth is temporarily restricted. Consider a channel $k$ which is a time-invariant channel: its bandwidth demand is constant. Due to the changes in bandwidth demands by time-variant channels, the actual throughput of channel $k$ will be time-varying: $\phi_k(t)$. On the consumer side of the channel, there must be extra buffer space $\beta_{k,c}$ sufficient to supply the processing node engine without causing stalls during deviations from the average throughput rate $\phi_k$. Thus:

\[ \beta_{k,c} \ge \int_{t_1}^{t_2} \left( \phi_k - \phi_k(t) \right) dt \quad \forall \{t_1, t_2\} : t_1 < t_2 \tag{5.27} \]

Determining the buffer sizes requires finding the worst cases for Eq. (5.27). This occurs when the throughput $\phi_k(t)$ reduces, due to all time-variant channels being active concurrently. We assume that the active channels are not source-limited and therefore (using the STDM protocol) consume the maximum amount of bandwidth available to them when active. The time-variant channels, when idle due to buffer saturation, consume a single bus cycle of their allocated time-slot before releasing the bus for arbitration. By inspection of Figure 5.3, one can observe that the time-varying throughput of channel $k$ is therefore:

\[ \phi_k(t) \approx \frac{\Gamma \omega_k}{\left[ \sum_{i=1}^{M} \alpha_i(t) \right] + \left[ \sum_{i=M+1}^{N} \omega_i \right] + Nh} \tag{5.28} \]

where for time-variant channel $i$

\[ \alpha_i(t) = \begin{cases} \omega_i & \text{channel active} \\ 1 & \text{channel inactive} \end{cases} \tag{5.29} \]

Therefore, the evaluation of the integral of Eq. (5.27) is computationally not difficult, since the worst-case (approximate) $\phi_k(t)$ is piecewise constant. However, it is necessary to determine the active and inactive times for each channel, and the interval $(t_1, t_2)$.

Assume that all $M$ burst channels become active at time $t_1 = 0$, and each channel $i$ has periodicity $T_i$, determined by the periodicity of the node it supplies. The procedure for determining the channel active times is relatively straightforward:

1. At time $t = 0$, each active channel $i \le M$ starts with a number of words $r_i(0) = \phi_i T_i$ to be transferred before the channel will become inactive again.
2. For each channel calculate the transfer bandwidth $\phi_k(0^+)$ from Eq. (5.28).
3. Determine the time for the first channel to become inactive:

\[ d_1 = \min_{i \le M} \left( T_i \frac{\phi_i}{\phi_i(0^+)} \right) \tag{5.30} \]

This channel is marked as inactive for $t > d_1$.
4. Record the number of words remaining to be transferred at time $t = d_1$ in the other channels:

\[ r_i(d_1) = r_i(0) - \phi_i(0^+) d_1 \tag{5.31} \]

5. For each subsequent stage $n = \{1, 2, \ldots\}$, the duration of the stage, $d_{n+1} - d_n$, is given by:

\[ d_{n+1} - d_n = \min \begin{cases} \dfrac{r_i(d_n)}{\phi_i(d_n^+)} & \text{active channels} \\ q_i T_i - d_n & \text{inactive channels} \end{cases} \tag{5.32} \]

where $\phi_i(d_n^+)$ can be calculated from Eq. (5.28) and

\[ r_i(d_n) = r_i(d_{n-1}) - \phi_i(d_{n-1}^+) (d_n - d_{n-1}) \tag{5.33} \]

In the term $q_i T_i - d_n$, $q_i$ is an integer value that is incremented each time channel $i$ becomes inactive. At each stage, one channel becomes active or inactive, depending on which term in Eq. (5.32) is minimum.

For each channel $k$ there will be a time $d_p$ at which $\phi_k(d_p^+) > \phi_k$. The integral of Eq. (5.27) is therefore calculated between $(0, d_p)$.

On the source side, the equation is slightly different. The producer FIFOs must be large enough to contain data generated by the node without causing a stall, even when the data generation rate is not constant. If the producer for channel $k$ is node $n$ and generates data at a rate $p_n(t)$, then the equation for the buffer space required is:

\[ \beta_{k,p} \ge \int_{t_1}^{t_2} \left( p_n(t) - \phi_k(t) \right) dt \quad \forall \{t_1, t_2\} : t_1 < t_2 \tag{5.34} \]
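The first steps of the procedure can be sketched in Python as follows: Eq. (5.28) with the activity rule of Eq. (5.29), and the first stage boundary of Eq. (5.30). The function names, the boolean activity list and the numbers are hypothetical; later stages (Eqs. 5.32 and 5.33) are not implemented here.

```python
def throughput(k, omega, active, gamma, h):
    """Approximate throughput of channel k, Eq. (5.28).

    omega  -- slot sizes of all N channels (cycles)
    active -- active[i] is False for a time-variant channel that is idle;
              an idle channel uses one cycle of its slot (Eq. 5.29)
    """
    used = sum(w if a else 1 for w, a in zip(omega, active))
    return gamma * omega[k] / (used + len(omega) * h)


def first_inactive(phi_mean, periods, omega, m, gamma, h):
    """Time d_1 and index of the first burst channel to finish, Eq. (5.30)."""
    active = [True] * len(omega)          # all burst channels start at t = 0
    times = [periods[i] * phi_mean[i] / throughput(i, omega, active, gamma, h)
             for i in range(m)]
    d1 = min(times)
    return d1, times.index(d1)


# Hypothetical bus: 50 Mw/s, h = 3, channels 0 and 1 are time-variant.
omega = [40, 30, 20, 10]
d1, first = first_inactive([10.0, 8.0], [40.0, 50.0], omega, m=2,
                           gamma=50.0, h=3)
rate_before = throughput(2, omega, [True, True, True, True], 50.0, 3)
rate_after = throughput(2, omega, [False, True, True, True], 50.0, 3)
```

With bandwidths in Mw/s and periods in microseconds, `d1` is in microseconds. The throughput of the time-invariant channel 2 rises once channel 0 goes idle, which is the piecewise-constant behaviour exploited when evaluating Eq. (5.27).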

To simplify this, we will compute a conservative estimate for the upper bound, by setting $p_n(t)$ to a periodic function:

\[ p_n(t) = \begin{cases} p'_n & 0 < t < \dfrac{\phi_k T_n}{p'_n} \\ 0 & \dfrac{\phi_k T_n}{p'_n} < t < T_n \end{cases} \tag{5.35} \]

Here $p'_n$ is the peak rate at which node $n$ can produce data, and $T_n$ is the periodicity of the node. The integrand of Eq. (5.34) is now a piecewise constant function and the integral can be computed in a similar way to Eq. (5.27).

For a time-variant channel $b$, the source-side buffer must be sufficiently large to hold the data produced while the consumer-side buffer has saturated. Again, Eq. (5.34) must be evaluated; however, in this case we find the worst-case conditions by assuming that the destination buffer saturates at time $t = 0$ and $\phi_k(t)$ is the periodic function:

\[ \phi_k(t) = \begin{cases} 0 & 0 < t < T_k \left( 1 - \dfrac{\phi_k}{\phi'_k} \right) \\ \phi'_k & T_k \left( 1 - \dfrac{\phi_k}{\phi'_k} \right) < t < T_k \end{cases} \tag{5.36} \]

In addition to the buffer space required resulting from variations in throughput, the buffer levels also ripple up and down over the duration of each service period $\tau$. The height of this ripple is given by:

\[ \gamma_k \approx \frac{\phi_k}{\Gamma} \left( \sum_{i \ne k} \omega_i + Nh \right) \tag{5.37} \]

The total spare buffer space required is found by adding $\beta_k$ and $\gamma_k$. Finally, the maximum latency introduced by the channel can be approximated by the buffer size and the average throughput rate:

\[ l_k \approx \frac{\beta_k + \gamma_k}{\phi_k} \tag{5.38} \]

Example 5.4: Buffer sizing and latency. We now calculate the required buffer space for each channel from Example 5.3. Assume that data are fed into channels 1 to 4 at a constant rate, such that for input node $n$ and corresponding channel $k$, $p_n = \phi_k$. Thus, $\beta_{k,c} = \beta_{k,p} = \beta_k$. Channel 1 has periodicity $T_1 = 37.9\,\mu$s, and for channel 2, $T_2 = 46.3\,\mu$s. We find $d_1 = 28.3\,\mu$s (channel 1 becomes inactive) and $d_2 = 37.3\,\mu$s (channel 2 becomes inactive). For channel 3, from Eq. (5.28),

\[ \phi_3(t) = \begin{cases} 4.23\ \text{Mw/s} & 0 < t < d_1 \\ 8.37\ \text{Mw/s} & d_1 < t < d_2 \end{cases} \tag{5.39} \]

Therefore:

\[ \beta_3 = \int_{0}^{d_1} \left( \phi_3 - \phi_3(t) \right) dt \tag{5.40} \]

\[ \beta_3 = d_1 \left( \phi_3 - \phi_3(0^+) \right) \tag{5.41} \]

\[ \beta_3 = 72 \tag{5.42} \]

The ripple for channel 3:

\[ \gamma_3 = \frac{\phi_3}{\Gamma} \left( \omega_1 + \omega_2 + \omega_4 + \omega_5 + \omega_6 + Nh \right) \tag{5.43} \]

\[ \gamma_3 = 59 \tag{5.44} \]

Other values for the buffers are listed in Table 5.3. Note that for channels 1 and 2 (where consumer-side buffers saturate) the buffering values are the minimum producer-side buffer sizes. For channels 3 to 6, the totals are the minimum producer-side buffer sizes, and the total spare capacity required before saturation of the consumer-side buffers.

Table 5.3: Spare buffer space and latency for Example 5.4. Columns: channel; $\gamma_k$; $\beta_k$; total (words); latency ($\mu$s).
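The buffer and latency estimates can be sketched in Python along the following lines. The helper names and the numbers are hypothetical (the channel values of Table 5.2 are not used here); the sketch assumes the worst-case throughput is given as a list of piecewise-constant segments starting at $t = 0$.

```python
def spare_buffer(phi_mean, segments):
    """Worst-case deficit of Eq. (5.27) for piecewise-constant throughput.

    segments -- list of (duration, throughput) pairs starting at t = 0;
                integration stops once the throughput exceeds the mean
    """
    beta = 0.0
    for duration, rate in segments:
        if rate > phi_mean:
            break
        beta += (phi_mean - rate) * duration
    return beta


def ripple(k, phi_mean, omega, gamma, h):
    """Ripple height of Eq. (5.37) for channel k."""
    others = sum(w for i, w in enumerate(omega) if i != k)
    return phi_mean / gamma * (others + len(omega) * h)


def latency(beta, gamma_k, phi_mean):
    """Maximum latency estimate of Eq. (5.38)."""
    return (beta + gamma_k) / phi_mean


# Hypothetical channel: mean 8 Mw/s, throttled to 5 Mw/s for 20 us,
# then served at 12 Mw/s for 10 us, on a 50 Mw/s bus with h = 3.
beta = spare_buffer(8.0, [(20.0, 5.0), (10.0, 12.0)])
gamma_k = ripple(2, 8.0, [40, 30, 20, 10], gamma=50.0, h=3)
l_k = latency(beta, gamma_k, 8.0)
```

With bandwidths in Mw/s and durations in microseconds, the buffer figures are in words and the latency in microseconds.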

5.6 Method Summary

The aim of the analysis is to show how the system designer can ensure that derivative designs constructed at run-time will achieve the required performance when sharing communication media. The process is summarised as follows.

At design time, the system designer collects the following information about each processing node:

1. The envelope of the address pattern, including the base-line repeat period.
2. The magnitude of the consume delta function.
3. The maximum theoretical processing throughput, assuming no stalling due to lack of data or output buffer space.
4. The input and output buffer sizes.

Algorithms are created from communicating clusters of nodes. The designer determines a set of possible mappings of nodes to platform buses for each application.

At run-time, the necessary algorithms and the associated performance requirements are determined by supervisory application software. The run-time system software selects a mapping for each algorithm and then verifies that the performance requirements can be fulfilled by executing the following steps:

1. Calculate the required throughput for each node and channel, based on algorithmic throughput requirements.
2. Verify mean demand on each bus does not exceed available bandwidth.
3. Determine for each node the allowable stall time for the engine.
4. From the stall time, the required throughput, and the address pattern envelope, determine for each channel if the destination buffer will saturate, or if not, the spare capacity in the buffer.
5. Based on the buffer saturation, divide channels into those with time-variant bandwidth demand and those with time-invariant demand.
6. Calculate the peak bandwidth demand of the time-variant channels, and verify the aggregate peak bandwidth demand is less than the bandwidth available.
7. Calculate the time-slot size allocation for each channel ($\omega_k$).
8. Calculate the required spare buffer capacity for all source-side buffers and all time-invariant channel consumer-side buffers, and verify this is less than the available capacity.
9. Verify the latency of each channel is acceptable.

If each verification step in the process is successful, the required performance will be achieved. If any step fails, the selected mapping is not acceptable; at this point a number of options are available to the system. A different algorithm mapping may be selected, if the bandwidth requirements of different buses are mismatched. An alternative algorithm may be used with lower performance requirements (with a lower quality of service for example), or an alternative algorithm implementation. The high-level decision made here is system-dependent, and is not covered in the scope of this chapter.

It is emphasised that the run-time evaluation of communication parameters involves calculations with low computational intensity, and moreover the evaluation is performed infrequently relative to the operation time of the algorithms (which process continuous streams of video data). Thus the overhead incurred is slight.
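The bandwidth checks in steps 2 and 6 can be sketched in Python as below. The function name and the three return labels are hypothetical. The figures in the usage line are those of Example 5.3; the time-variant peak of 40.16 Mw/s is inferred here from Eq. (5.10) as 47.9 − 7.74 and is not quoted directly in the text.

```python
def classify_bus(gamma, mean_total, tv_peak_total, peak_total):
    """Apply the bandwidth checks of Eqs. (5.1), (5.8) and (5.9).

    Returns "reject" if the mapping cannot be supported, "critical" if the
    time-invariant channels are starved at peak times, otherwise "normal".
    """
    if mean_total >= gamma:        # Eq. (5.1) fails: mean demand too high
        return "reject"
    if tv_peak_total >= gamma:     # Eq. (5.8) fails: burst channels too greedy
        return "reject"
    if peak_total >= gamma:        # Eq. (5.9) fails: critical bus usage
        return "critical"
    return "normal"


usage = classify_bus(gamma=50.0, mean_total=46.1,
                     tv_peak_total=40.16, peak_total=52.5)
```

A "critical" result means the time-slot sizes should come from Eq. (5.26) rather than from Eq. (5.6) with peak values substituted.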

20 5.7. Experimental Results Experimental Results To verify empirically the validity of the analysis presented above, cycle-accurate simulations of the communication system have been developed. The performance parameters of the simulation model are based on the implementation of a prototype Sonic-on-a-Chip system as described in Section 4.6. The processing subsystem achieved a clock rate of 50MHz. The example system described in Example 5.3, comprising two motion vector estimation nodes, is used again in the simulations. The address patterns for these nodes have been extracted from real data (a carphone video sequence), and the parameters calculated in Example 5.3 and Example 5.4 and listed in Table 5.2 and Table 5.3 are used as nominal values. The simulated system has four input nodes; the rate at which these node supply data to the system is independently adjustable. Figure 5.5 is a graph of the time-averaged bandwidth of the four main channels (1 to 4) and the overall bus bandwidth usage over a period of 2ms (10 5 bus cycles). For this simulation all input nodes were set to supply data at the maximum output rate (one word per cycle). The graph shows that the STDM arbitration scheme is able to cope with x 107 Mean bandwidth 4 Total 3.5 Bandwidth Channel 1 Channel 2 Channel Channel Cycles x 10 4 Figure 5.5: Bandwidths of the channels in the simulated system, averaged over 2000 cycles. The arbitration scheme is effective at high bus utilisation (92% in this case), and copes with time-varying demand while maintaining the required average throughput for each channel (dashed lines). Note that channels 5 and 6 are not visible on the scale of this graph.

21 5.7. Experimental Results Throughput of motion vector estimation Frames/s Cycles x 10 4 Figure 5.6: The macro-block by macro-block throughput of the motion vector estimators from the simulation. The instantaneous throughput rates are variable, as expected from the variable channel latency. The average throughputs (solid lines) meet or exceed the designed rates of 22 and 18 fps, demonstrating the validity of the communication system analysis. high overall bandwidth utilisation and allocate bandwidth to each channel appropriately despite the time-variable demands of the channels. Note the mean bandwidth for each channel is as expected from Table 5.2. The block-by-block frame rates achieved by the motion vector estimators are plotted in Figure 5.6; again all input nodes supply data at an unmoderated rate. Variability in the throughput rate is expected due to the variation in latency in communication. The latency for the channels supplying the 22 fps MVE processing node (channels 1 and 3) was calculated to be between 0µs and 19.4µs (in Table 5.3), and confirmed in the simulation (see Figure 5.7). Compared with the mean processing time of a single macro-block (at 22 fps and 1200 macro-blocks per frame) of 37.9µs, the instantaneous throughput is expected to vary by up to ±26%; this is confirmed in the graph of Figure 5.6. Note that the time-averaged throughputs of the two MVE nodes meets the designed rates of 22 and 18 fps. In a real system, data would not necessarily be supplied to the system faster than they

22 5.7. Experimental Results Minimum and maximum latencies, channels 1 and Latency, us Cycles x 10 4 Figure 5.7: The maximum (solid lines) and minimum (dashed lines) latencies of channels 1 (circles) and 3 (squares). The latencies match the values listed in Table 5.3.

can be processed. Instead, the rate at which data are available for processing may be limited, and the system is required to process the data at the supplied rate. The communication system must cope with scenarios in which some channels are supplied by a rate-limited source, while on other channels the data are available at an unmoderated rate. Regardless of the source data-rate, the communication parameters can be calculated based on the desired system throughput.

To test this, the performance of the communication system has been measured in simulations where source data-rate throttling is applied to all combinations of the four input nodes (supplying channels 1 to 4). The data-rate of a rate-limited channel is set such that the throughput of the corresponding motion vector estimator is limited to slightly below the original designed rate of 18 or 22 frames per second. This expected MVE rate was calculated and compared to the actual achieved throughput, averaged over 15 macro-blocks. The results are listed in Table 5.4. The achieved throughputs match or exceed the expected throughputs to within measurement error in all cases.

Table 5.4: System throughput with rate-controlled channels. (Columns: rate-controlled channels; MVE throughput for the low-rate and high-rate estimators, in %.)

The buffer sizing calculations listed in Table 5.3 were verified by varying the size of the
buffers by ±50% of the nominal value and then measuring the corresponding throughput for the affected motion vector estimator. Rate-limiting of the source data channels was applied where it resulted in lower performance. Moreover, the address pattern used was set to the worst-case values (i.e. the addresses closest to the address envelope of Figure 5.4). The outcomes are plotted in Figure 5.8, normalised to the buffer sizes of Table 5.3. It can be seen that the calculated required buffer sizes are sufficient to avoid degrading the system throughput. In addition, the calculated buffer sizes are not significantly larger than necessary in this instance, with the exception of the source buffer for channel 1, which appears to be oversized by around 50%. In general, the calculated required buffer sizes are based on worst-case conditions, which may never occur in a given system, and therefore the calculations result in conservative estimates.

Figure 5.8: Effect of varying buffer sizes on system throughput. Buffers smaller than the sizes calculated in Example 5.4 cause the system throughput to drop below the target rate. (Plot: normalised throughput against buffer size as a percentage of nominal, for the source buffers of channels 1 to 4 and the destination buffers of channels 3 and 4.)

Thus far, for consistency, all examples and experiments have been based on a single node type, namely motion vector estimators. To demonstrate the applicability of our approach to a wider variety of node types, four other sample systems (sys2 to sys5) were designed and simulated in addition to the two-MVE system described above (from now on denoted sys1). As shown in Table 5.5, each system has a different mixture of processing node types providing different communication behaviour.

Table 5.5: Characteristics of the simulated systems.
Columns: sys1, sys2, sys3, sys4, sys5.
Subsystem rows (node throughput rate, fps): MVE; MVE; Block 2D DCT; Foreground separation; Median filter (20.0); Histogram.
Bus parameter rows: Channels (N); Time-variant channels (M); Average total bandwidth Φ (Mw/s); Peak total bandwidth Φ_peak (Mw/s); Critical (yes, no, yes, yes, no).

Figure 5.9: A graph of the maximum fill levels of every channel buffer in each of the five simulated systems, normalised to the expected values calculated using the method of Section 5.6. In 4 of the 56 cases the expected values were exceeded due to transient (start-up) behaviour; further simulations with limited buffers demonstrated that this did not affect throughput. (Plot: normalised maximum buffer level against system number.)

Arbitration values (ω_k) and buffer sizes were calculated for all systems as per the method given in Section 5.6. In these simulations, the buffer size estimates were verified by oversizing all non-saturating buffers and recording the maximum utilisation point for every source and destination buffer. The results are plotted in Figure 5.9. In 52 of the 56 buffers the fill level never exceeded the calculated expected buffer size. For the buffers where the expected maximum level was exceeded, further investigation of the simulations revealed that this was due to transient behaviour during the start-up of the system. When these buffers were restricted in size, their temporary saturation did not affect the steady-state throughput of the system.
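The normalisation used for this check can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, buffer names and word counts below are hypothetical placeholders, not values from the simulations or from the thesis's tooling.

```python
def normalised_fill(max_fill, expected):
    """Return {buffer: recorded maximum fill / calculated expected size}."""
    return {name: max_fill[name] / expected[name] for name in expected}


def exceeded(normalised, limit=1.0):
    """Sorted names of buffers whose normalised fill level is above the limit."""
    return sorted(name for name, level in normalised.items() if level > limit)


# Hypothetical example data, in words (NOT taken from the thesis).
expected = {"ch1_src": 64, "ch1_dst": 32, "ch2_src": 48}  # calculated sizes
max_fill = {"ch1_src": 60, "ch1_dst": 36, "ch2_src": 48}  # recorded maxima

levels = normalised_fill(max_fill, expected)
print(exceeded(levels))  # ['ch1_dst']
```

A buffer that exactly reaches its calculated size has a normalised level of 1.0 and is not flagged; only levels above 1.0 would call for the kind of follow-up investigation described above.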

5.8 Summary

One challenging aspect of the run-time assembly of modular systems is determining the impact on system performance of communication over shared media. This chapter presented an analysis of the statistical time division multiplexing communication scheme used in Sonic-on-a-Chip. The analysis demonstrated that, by applying constraints to the communication interfaces, system behaviour can be reliably predicted. Thus, real-time throughput and latency requirements can be guaranteed.

The key to this analysis is the characterisation of the behaviour patterns of the modular processing elements which are used to construct the run-time system. The characterisation exploits the fact that image processing tasks exhibit a high degree of repetition. The outcome of the analysis is a method for determining, for each channel mapped to a given shared bus: (a) the time-slot size to allocate in the arbitration protocol, (b) the required producer-side and consumer-side buffer sizes, and (c) the maximum expected latency. The analysis accounts for the limited on-chip memory by allowing for the saturation of limited-sized buffers.

The theoretical work has been verified against simulations using a cycle-accurate model of the communication. The simulations demonstrated two points: that the predictions of the analysis are accurate, and that the communication system is stable under a variety of different channel demands.
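To make the role of the per-channel time-slot size concrete, the following is a minimal sketch of one possible form of statistical time division multiplexing: a weighted round-robin in which channels with no pending data are skipped at no cost in bus cycles. It is an assumption for illustration only and is not claimed to be the Sonic-on-a-Chip arbitration protocol; the function name, the fixed backlog model (no new arrivals) and the example values are all hypothetical.

```python
def stdm_schedule(pending, slot_sizes, cycles):
    """Return the channel granted the bus on each cycle (None when idle).

    pending    -- words waiting on each channel at the start (no new arrivals)
    slot_sizes -- maximum consecutive words per turn for each channel
                  (assumed here to play the role of the slot size)
    cycles     -- number of bus cycles to simulate
    """
    if min(slot_sizes) < 1:
        raise ValueError("every slot size must be at least 1")
    pending = list(pending)  # work on a copy
    grants = []
    k = 0
    while len(grants) < cycles:
        if not any(pending):
            grants.append(None)  # nothing to send anywhere: bus idles
            continue
        # Channel k's turn: send up to its slot size; an empty channel
        # is skipped without consuming a bus cycle.
        burst = min(pending[k], slot_sizes[k], cycles - len(grants))
        grants.extend([k] * burst)
        pending[k] -= burst
        k = (k + 1) % len(pending)
    return grants


# Hypothetical example: channel 1 has no data, so its slots are never wasted.
print(stdm_schedule([5, 0, 3], [2, 2, 1], 10))
# [0, 0, 2, 0, 0, 2, 0, 2, None, None]
```

In this sketch a larger slot size gives a channel a larger share of the bus while it has data, and bandwidth left unused by an idle channel passes to the others, which is the property that distinguishes statistical from fixed time division multiplexing.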
