WHITEPAPER MULTICORE SOFTWARE DESIGN FOR AN LTE BASE STATION

Executive summary

This white paper details the results of running the parallelization features of SLX to quickly explore the HHI/Fraunhofer reference HEVC codec implementation. The codec allows developers to trade off coding efficiency against effective use of the parallel computation resources of the target platform. SLX is primarily used to assess the benefits of any exposed parallelism in the current implementation. LTE (Long Term Evolution), also known as E-UTRAN (Evolved Universal Terrestrial Radio Access Network), introduced in 3GPP R8, is the access part of the Evolved Packet System (EPS). The presented results are based on the unedited implementation of the code as downloaded, on a virtual hexacore target platform based on ARM Cortex-A7 processors.

Fig. Analysis Results for up to 16 cores on Generic Baseband Processing Multi-DSP Platform

In a test period of 4 hours of computer run time, the speedup results are shown with the default implementation. For an additional 32 loops, opportunities for further parallelism, with the associated potential speedup on the target platform, were identified, including the inhibiting reason. Overall, SLX quickly identifies the places where manual effort should be spent and the ones that should be avoided. This case study on unchanged, real-life application source code shows the significant productivity increase gained by using automation instead of a costly manual investigation by parallelism experts. For further information, more white papers, other resources and your free trial of SLX, please visit www.silexica.com

Fig. Gantt charts detail set-up and process times throughout this white paper.

Fig. Peak power performance is analyzed on 4 and 10 cores.

2018 Silexica GmbH. All rights reserved worldwide.

Table of Contents

1 INTRODUCTION
2 LTE OVERVIEW
  2.1 LTE FRAME
  2.2 PHYSICAL CHANNELS
    2.2.1 PBCH (PHYSICAL BROADCAST CHANNEL)
    2.2.2 PCFICH (PHYSICAL CONTROL FORMAT INDICATOR CHANNEL)
    2.2.3 PDCCH (PHYSICAL DOWNLINK CONTROL CHANNEL)
    2.2.4 PHICH (PHYSICAL HYBRID-ARQ INDICATOR CHANNEL)
    2.2.5 PDSCH (PHYSICAL DOWNLINK SHARED CHANNEL)
3 APPLICATION OVERVIEW
  3.1 PARALLELIZATION STRATEGY
4 MAPPING ANALYSIS WITH SLX
  4.1 GENERIC BASEBAND PROCESSING MULTI-DSP PLATFORM
  4.2 GENERIC DSP ARCHITECTURE ENHANCED WITH HARDWARE ACCELERATORS
  4.3 EXPLORING DIFFERENT PROCESSOR FREQUENCIES AND POWER ANALYSIS
5 CONCLUSION

1 Introduction

This paper presents a typical base station design scenario in which system architects must decide early on about HW/SW partitioning, the number of processing elements, and operational system parameters, among other things. SLX determines the impact of these design decisions and parameter selections while exploring different target architecture configurations, and checks whether application constraints can be met. In the following, an LTE downlink application is used as the application driver.

2 LTE Overview

LTE (Long Term Evolution), also known as E-UTRAN (Evolved Universal Terrestrial Radio Access Network), introduced in 3GPP R8, is the access part of the Evolved Packet System (EPS). The main requirements for the new access network are high spectral efficiency, high peak data rates, short round-trip time, and flexibility in frequency and bandwidth. The case tested here is the implementation of the downlink (DL) physical layer (PHY) of an LTE eNodeB (base station). The LTE PHY specification accommodates bandwidths from 1.4 MHz to 20 MHz. Orthogonal frequency division multiplexing (OFDM) is the selected modulation. The DL is composed of physical channels that convey information from the higher layers, plus some synchronization signals. Here, we consider the following channels and signals:

- PBCH: Physical Broadcast Channel
- PCFICH: Physical Control Format Indicator Channel
- PHICH: Physical Hybrid-ARQ Indicator Channel
- PDSCH: Physical Downlink Shared Channel
- PDCCH: Physical Downlink Control Channel
- P-SS: Primary Synchronization Signal
- S-SS: Secondary Synchronization Signal

2.1 LTE Frame

LTE frames are each 10 ms in duration. They are divided into 10 subframes of 1 ms each. Each subframe is further divided into two slots of 0.5 ms each. Slots consist of either 7 or 6 OFDM symbols, depending on whether the normal or the extended cyclic prefix is employed.
A symbol corresponds to a certain time span of the signal and carries one point of the modulation scheme's I/Q constellation diagram. Figure 1 shows the structure of a frame. In OFDMA (Orthogonal Frequency Division Multiple Access), users are allocated a specific number of subcarriers for a predetermined amount of time. These allocations are referred to as Physical Resource Blocks (PRBs). PRB allocations are handled by higher protocol layers in the eNodeB. The available number of PRBs is determined by the bandwidth of the transmission (e.g. 100 PRBs in 20 MHz).
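
The frame arithmetic described above can be written down as a short Python sketch. The constants come from the text, except for the 12 subcarriers per PRB, which is the standard LTE value (consistent with the 6 resource blocks spanning 72 subcarriers mentioned for PBCH). The function names are illustrative only and are not part of SLX or of the application under analysis.

```python
# Illustrative sketch of LTE frame arithmetic; names are hypothetical.
FRAME_MS = 10.0
SUBFRAMES_PER_FRAME = 10
SLOTS_PER_SUBFRAME = 2
SYMBOLS_PER_SLOT = {"normal": 7, "extended": 6}
SUBCARRIERS_PER_PRB = 12  # standard LTE value: 6 PRBs = 72 subcarriers


def symbols_per_frame(cyclic_prefix: str) -> int:
    """Number of OFDM symbols in one 10 ms frame."""
    return SUBFRAMES_PER_FRAME * SLOTS_PER_SUBFRAME * SYMBOLS_PER_SLOT[cyclic_prefix]


def subcarriers(num_prbs: int) -> int:
    """Number of subcarriers spanned by the given number of PRBs."""
    return num_prbs * SUBCARRIERS_PER_PRB


def slot_duration_ms() -> float:
    """Duration of one slot, derived from the frame duration."""
    return FRAME_MS / (SUBFRAMES_PER_FRAME * SLOTS_PER_SUBFRAME)


if __name__ == "__main__":
    print(symbols_per_frame("normal"))    # 140
    print(symbols_per_frame("extended"))  # 120
    print(subcarriers(100))               # 1200 subcarriers in 20 MHz
    print(slot_duration_ms())             # 0.5
```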

Figure 1: LTE frame structure (normal cyclic prefix)

2.2 Physical Channels

All physical channels convey information provided by the higher LTE stack layers. Each physical channel undergoes the following data processing before transmission:

- Bit scrambling
- Modulation
- Layer mapping
- Precoding
- Resource element assignment

Layer mapping and precoding are related to the Multiple Input Multiple Output (MIMO) antenna operation of the base station. For simple MIMO configurations (e.g. 1x1 and 2x2), there is a one-to-one relationship between the layers and the transmitting antennas. For other, more complex MIMO configurations, there are only two spatially multiplexed data streams, and layer mapping defines how the additional antennas are used. Precoding is used in connection with spatial multiplexing. The next subsections present the 5 relevant physical channels in detail.

2.2.1 PBCH (Physical Broadcast Channel)

- Carries only the MIB (Master Information Block).
- Uses QPSK as the modulation scheme.
- Is mapped onto 6 Resource Blocks (72 subcarriers), centered around the DC subcarrier in subframe 0.
- Is mapped to resource elements that are not reserved for transmitting reference signals, PDCCH or PCFICH.
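
The first two processing steps listed above, bit scrambling and modulation, can be illustrated with a minimal Python sketch. The QPSK mapping follows the usual LTE convention, in which each bit pair (b0, b1) maps to ((1 - 2*b0) + j(1 - 2*b1)) / sqrt(2). The scrambling sequence is an assumption made for illustration: it is a seeded pseudo-random bit stream, not the Gold sequence defined by the LTE specification, and the function names are hypothetical.

```python
import math
import random


def demo_scrambling_sequence(length: int, seed: int) -> list[int]:
    """Hypothetical stand-in for the LTE scrambling sequence (NOT the Gold sequence)."""
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(length)]


def scramble(bits: list[int], sequence: list[int]) -> list[int]:
    """Bit scrambling: XOR each data bit with the scrambling sequence."""
    return [b ^ c for b, c in zip(bits, sequence)]


def qpsk_map(bits: list[int]) -> list[complex]:
    """Map bit pairs to QPSK constellation points with unit average power."""
    if len(bits) % 2 != 0:
        raise ValueError("QPSK needs an even number of bits")
    scale = 1.0 / math.sqrt(2.0)
    return [
        complex(scale * (1 - 2 * bits[i]), scale * (1 - 2 * bits[i + 1]))
        for i in range(0, len(bits), 2)
    ]


if __name__ == "__main__":
    data = [0, 1, 1, 0, 1, 1, 0, 0]
    seq = demo_scrambling_sequence(len(data), seed=42)
    symbols = qpsk_map(scramble(data, seq))
    print(symbols)
```

Because scrambling is an XOR, applying the same sequence a second time restores the original bits, which is how a receiver would undo the operation.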

2.2.2 PCFICH (Physical Control Format Indicator Channel)

- Carries the number of symbols that can be used for control channels (PDCCH and PHICH).
- Is mapped onto the first OFDM symbol in each downlink subframe, occupying 16 data subcarriers of the symbol.
- Is decoded by the user equipment (UE) to determine how many OFDM symbols are assigned to the control channels (PDCCH and PHICH).
- Its exact position is determined by cell ID and bandwidth.

2.2.3 PDCCH (Physical Downlink Control Channel)

- Is mapped to the first OFDM symbols in each downlink subframe.
- The number of symbols used for PDCCH can be 1, 2, or 3 and is specified by PCFICH.
- Carries downlink control information (DCI).
- Multiple PDCCHs can be assigned in a single subframe. The UE then performs blind decoding of all PDCCHs.
- Uses QPSK as the modulation scheme.

2.2.4 PHICH (Physical Hybrid-ARQ Indicator Channel)

- Carries hybrid automatic repeat request (H-ARQ) feedback for the received PUSCH (Physical Uplink Shared Channel).
- Depending on the configuration, several PHICHs constitute a PHICH group using the same resource elements.

2.2.5 PDSCH (Physical Downlink Shared Channel)

- Carries user-specific data (DL payload).
- Carries the random access response message.
- Uses Adaptive Modulation and Coding (AMC) with QPSK, 16QAM, 64QAM or 256QAM modulation schemes. The actual modulation scheme is determined by the modulation and coding scheme (MCS) carried within the DCI specified by PDCCH.

3 Application Overview

The application under analysis implements the DL transmission procedure performed by an LTE eNodeB. It constructs a test FDD subframe and transmits it. The application performs all physical layer (PHY) tasks necessary for encoding, modulating and transmitting the PBCH, PDSCH, PHICH, PDCCH and PCFICH physical channels. Since the transmission of an LTE subframe takes 1 ms, all prior processing must fulfill a very tight latency constraint (tl).
To fulfill tl, the processing for a subframe must take at most 1 ms (excluding some initial base station setup time). If the processing time is greater than tl, the next subframe will not be ready when the previous subframe completes transmission.

Our application is implemented as a data flow model containing 73 processing tasks of varying complexity. Tasks communicate with each other via first-in first-out (FIFO) channels, which are initially not mapped to any platform communication means. That is, all channels are treated in an abstract manner, without defining whether they are mapped onto shared memories, for instance, or handled via DMAs (Direct Memory Access devices). Overall, there are 186 inter-task communication channels. Figure 2 presents the graphical view of an application fragment as shown by SLX. Tasks implementing LTE physical channels and the data flow among them can be observed.

Figure 2: LTE application diagram in SLX

3.1 Parallelization Strategy

The application follows a coarse-grain parallelization strategy that exploits mostly Task-Level Parallelism (TLP) and Pipeline-Level Parallelism (PLP). There is one parallel pipeline for each channel flow, in which the data of that channel is processed. Once the channel data is encoded, it enters a parallel processing flow that is replicated according to the number of antennas in the system. Figure 3 shows a high-level overview of the parallelism exploited in the application.

Figure 3: Parallelism exploited in application
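
The data flow model described above, with tasks connected by abstract FIFO channels and arranged as pipelines, can be sketched in a few lines of Python. This is only an illustration of the concept under simplifying assumptions (one input and one output channel per task, threads standing in for processing elements). It does not reflect how the application or SLX is implemented, and all names are hypothetical.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the data stream


def run_task(func, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """A task: read tokens from its input FIFO, process, write to its output FIFO."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)  # propagate end-of-stream downstream
            return
        outbox.put(func(item))


def run_pipeline(stages, inputs):
    """Connect the stages with FIFO channels and run each stage in its own thread."""
    channels = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_task, args=(func, channels[i], channels[i + 1]))
        for i, func in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for item in inputs:
        channels[0].put(item)
    channels[0].put(_END)

    results = []
    while True:
        item = channels[-1].get()
        if item is _END:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Two toy stages standing in for, e.g., an encoder and a mapper.
    print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))  # [4, 6, 8]
```

Since every stage consumes and produces tokens in FIFO order, the output order matches the input order even though the stages run concurrently.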

To better understand the concept of a channel pipeline, consider the PDSCH pipeline shown in Figure 4. A controller task called PDSCH_ctrl (1) drives the entire channel flow after receiving encoding parameters and input data from higher layers. The PDSCH_cb task (2) segments the input data into code blocks for encoding. The PDSCH_cb_encod task (3) performs turbo coding and rate matching on the code blocks. The PDSCH_concat task (4) concatenates the resulting encoded code blocks, which are then scrambled by the PDSCH_scrambler task (5). Afterwards, the PDSCH_modmapper task (6) performs the channel-specific modulation mapping, which results in a stream of modulation symbols that have to be mapped to layers (PDSCH_lmapper, 7) and precoded (PDSCH_precoder, 8) before transmission, depending on the current transmit diversity and spatial multiplexing settings.

Figure 4: Channel pipeline example

4 Mapping Analysis with SLX

In the following, we use SLX to map the application tasks to processing elements for different target architectures. SLX has built-in mapping and scheduling algorithms that can find high-performance mappings for different optimization goals, such as power, load, and execution time. The resulting mapping shows whether the required application constraints are met on a specific target architecture. With SLX, varying combinations of system parameters, such as the number of processors, memory sizes, and processor frequency, can be evaluated automatically. In this way, these features help to answer critical design questions and to perform power/cost/performance trade-off experiments at early design stages.
The LTE test application will be analyzed on the following platforms:

- A generic baseband processing multi-DSP architecture with 16 TI C66x DSP cores operating at 1 GHz
- An enhanced version of the generic baseband processing multi-DSP architecture that includes 6 hardware accelerators

These architectures represent the baseband processing systems typically used today by most wireless providers. Application mapping/distribution analyses for off-the-shelf Systems on Chip (SoCs) can be performed with SLX in the same way as for these generic architectures, but this is outside the scope of this paper.

4.1 Generic Baseband Processing Multi-DSP Platform

We compute multiple automatic mappings of the application onto the generic baseband processing multi-DSP platform by considering subsets of cores of increasing size. In this way, the optimal platform size for an application can be estimated. We computed automatic mappings using 1, 2, 4, 6, 8, 10, 12, 14 and 16 cores of the platform. Figure 5 presents the Mapping Analysis Results Chart generated by SLX. The mapping strategy used in these experiments tries to balance load and data throughput across the system. The Mapping Analysis Results Chart helps keep track of the performance of the different configurations.
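
The idea of sweeping over platform sizes can be illustrated with a toy Python sketch. It uses a simple greedy load-balancing heuristic (longest task first, always onto the least loaded core) as a stand-in for a mapping algorithm. This is explicitly not the algorithm used by SLX: it ignores task dependencies, communication and scheduling, and the task costs are made-up numbers.

```python
import heapq


def balanced_makespan(task_costs, num_cores: int) -> float:
    """Greedy load balancing: assign each task (longest first) to the least loaded core."""
    loads = [0.0] * num_cores
    heapq.heapify(loads)
    for cost in sorted(task_costs, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + cost)
    return max(loads)


def sweep(task_costs, core_counts):
    """Estimate the completion time for each candidate platform size."""
    return {n: balanced_makespan(task_costs, n) for n in core_counts}


if __name__ == "__main__":
    hypothetical_costs_ms = [4.0, 3.0, 2.0, 1.0]  # made-up task execution times
    for cores, makespan in sweep(hypothetical_costs_ms, [1, 2, 4, 8]).items():
        print(f"{cores} cores: {makespan:.1f} ms")
```

Even in this toy model, the estimate stops improving once the longest single task dominates, which mirrors the diminishing returns that a sweep over core counts is meant to expose.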

Figure 5: Mapping Analysis Results for up to 16 cores on Generic Baseband Processing Multi-DSP Platform

For the evaluated LTE application, increasing the number of cores consistently decreases the execution time. However, beyond a certain number of cores, adding more cores produces only marginal improvements. The spatial and temporal mapping (scheduling) computed by SLX can be seen in the Gantt Chart in Figure 6.

Figure 6: Gantt Chart of the application on 10 cores on Generic Baseband Processing Multi-DSP Platform

By placing time markers (cursors) in the Gantt Chart, the processing time for each process and the underlying timing constraints can be monitored. Using this information, the mappings generated by SLX can be further refined, e.g. by manually setting additional mapping constraints. This is especially useful when applications with time-critical processes (or sections) are mapped onto multi-core platforms with limited resources. Figures 7, 8, and 9 show the application setup time, subframe processing time, and antenna transmission time in more detail.

Figure 7: The application setup time highlighted in the Gantt Chart

Figure 8: The application subframe processing time highlighted in the Gantt Chart

Figure 9: Antenna transmission time highlighted in the Gantt Chart

The scheduling state (ready, running, read-blocked, write-blocked) of each process over time can be seen in the Task State Graph, which allows users to easily spot bottlenecks associated with read or write blocks. Figure 10 shows the Task State Graph for the 10-core mapping.

Figure 10: Task State Graph

Even with 16 cores, the best execution time we were able to achieve was 13.95 ms, which is far from acceptable for a base station configuration. Using the Profiler view, one can check the execution time of all application tasks on all processor types in the platform, as shown in Figure 11. In the figure, the reported times are for the TI C66x DSP core running at 1 GHz. It can be seen that the processes that take the most time are the FFT/IFFT, turbo coding and bit-level processing tasks in the PDSCH channel.

Figure 11: Estimated process execution times for each process

4.2 Generic DSP Architecture Enhanced with Hardware Accelerators

Next, we add hardware accelerators to the platform to see how this changes the application execution time. The enhanced architecture comprises 6 additional hardware accelerators: 4 for FFT/IFFT computation, one for turbo encoding and one for PDSCH bit-code processing. Additionally, we now add to the application specification a real-time latency constraint that specifies the processing time required for a subframe. The most important constraint in LTE is that a subframe is encoded within 1 ms, so as to allow the uninterrupted transmission of subsequent subframes. Transmission itself takes an additional 1 ms, but it is performed only by the antennas, once the channel data has been prepared. Therefore, task mapping decisions have a significant influence on the first 1 ms (processing time) of a subframe, but no influence at all on the actual transmission time.

Apart from the 2 ms subframe time mentioned above (i.e. 1 ms for processing, 1 ms for transmission), a certain initial setup time must be accounted for when booting the system and preparing the first frames and subframes. During this time, processes such as computing the user RB scheduling (performed in higher layers), encoding unique cell identifiers and preparing synchronization signals for the rest of the transmission scenario take place. Adding these together, our selected subframe time constraint is 2.6 ms, corresponding to 1.6 ms for setup and processing and 1 ms for transmission. Specifying a constraint causes SLX to automatically search for the best system configuration that fulfills it.

Although SLX uses cross-target performance estimation tools that can estimate the execution time of tasks on processors and DSPs, adding HW accelerators to the system requires the user to provide manual timing annotations for the affected tasks.
This is a straightforward procedure, as only a couple of lines have to be added to the application source code. The actual timing numbers come from data sheets, existing implementations or the user's experience. As in the previous experiments, the system is analyzed while gradually increasing the number of cores until the imposed timing constraint is met. Figure 12 shows the Mapping Analysis Result comparison for 4 and 6 cores against our previous generic baseband processing multi-DSP architecture.

Figure 12: Mapping Analysis Result comparison for 4 and 6 cores

A significant improvement in the execution time can be seen with the enhanced platform. Offloading the most time-consuming processes reduces the execution time by up to 7x, i.e. by roughly 85%. Although the constraint is not met using the enhanced platform with 4 cores, the application setup time and subframe processing time are significantly reduced compared to the results for the previous architecture (see Figures 7 and 8). Figures 13, 14 and 15 show the respective Gantt charts for 4 cores with hardware accelerators.

Figure 13: Gantt chart for 4 cores and hardware accelerators

Figure 14: Gantt chart for application setup time

Figure 15: Gantt chart for subframe processing time

From the figures, the subframe setup and processing time is 1.25 ms + 0.78 ms = 2.03 ms. Added to the 1 ms required for transmission, this gives 3.03 ms, which still does not fulfill the selected constraint of 2.6 ms. As the current architecture has additional DSP cores, one can continue mapping the application onto more cores to see the effect. Figure 16 shows the Mapping Analysis Results Chart for 1, 2, 4, 6, 8, 10, 12 and 14 DSP cores and the 6 hardware accelerators.
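
The constraint check for the 4-core configuration can be reproduced with a few lines of Python. The timing values are the ones quoted above; the function name is illustrative and not part of SLX.

```python
def check_subframe_constraint(setup_ms, processing_ms, transmission_ms, limit_ms):
    """Return the total subframe time and whether it meets the constraint."""
    total_ms = setup_ms + processing_ms + transmission_ms
    return total_ms, total_ms <= limit_ms


if __name__ == "__main__":
    # Values read from the Gantt charts for 4 cores with hardware accelerators.
    total, ok = check_subframe_constraint(
        setup_ms=1.25, processing_ms=0.78, transmission_ms=1.0, limit_ms=2.6
    )
    print(f"total = {total:.2f} ms, constraint met: {ok}")
    print(f"overshoot = {total - 2.6:.2f} ms")
```

With these numbers the total is 3.03 ms, so the 4-core configuration overshoots the 2.6 ms constraint by about 0.43 ms.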

Figure 16: Mapping result for up to 14 cores on generic DSP architecture with hardware accelerators

This time, one can see that using 10 or more DSP cores fulfills the 2.6 ms constraint. The final decision on whether to use a platform with 10, 12 or 14 cores needs to follow a cost-benefit analysis performed by the system architects. Faster execution times would leave room to add other tasks later (e.g. protocol improvements), but the additional cores would also increase the SoC area, which could affect the design's cost effectiveness.

4.3 Exploring Different Processor Frequencies and Power Analysis

At this point, many other questions can easily be answered using SLX. For instance, can the number of cores be further reduced by increasing the processor, bus and memory frequencies? To demonstrate how to answer such questions, we focus next on the cases of 4 and 10 cores, which yielded execution times of 3.04 ms and 2.51 ms respectively. This corresponds to an execution time difference of about 18%. By quickly modifying the SLX high-level architecture model, we increased the frequency of all DSPs from 1 GHz to 1.3 GHz and mapped the application again onto only 4 cores. The Mapping Analysis Results Chart in Figure 17 shows the comparison between 10 cores operating at 1 GHz and 4 cores operating at 1.3 GHz.
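
As a rough plausibility check, one can estimate the effect of the frequency increase with a back-of-the-envelope Python sketch. It assumes that execution time scales inversely with the DSP clock frequency, which is an idealization: memory, bus and accelerator times would not necessarily scale this way, and the numbers below are not SLX results.

```python
def scale_execution_time(time_ms: float, old_freq_ghz: float, new_freq_ghz: float) -> float:
    """Idealized estimate: execution time is inversely proportional to clock frequency."""
    return time_ms * old_freq_ghz / new_freq_ghz


if __name__ == "__main__":
    four_cores_1ghz_ms = 3.04   # from the text: 4 cores at 1 GHz
    ten_cores_1ghz_ms = 2.51    # from the text: 10 cores at 1 GHz
    constraint_ms = 2.6

    estimate = scale_execution_time(four_cores_1ghz_ms, 1.0, 1.3)
    print(f"estimated 4 cores at 1.3 GHz: {estimate:.2f} ms")
    print("faster than 10 cores at 1 GHz:", estimate < ten_cores_1ghz_ms)
    print("meets 2.6 ms constraint:", estimate <= constraint_ms)
```

Under this assumption the estimate is about 2.34 ms, which is below both the 10-core result at 1 GHz and the 2.6 ms constraint.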

Figure 17: Mapping analysis result for cores with different operating frequencies

As can be seen, mapping onto 4 cores running at 1.3 GHz yields a significantly better execution time than using 10 cores at 1 GHz. Once again, these results and subsequent cost-benefit analyses can help system architects make critical early design decisions.

So far, increasing the operating frequency may sound like a sure-fire way to improve the overall execution time of the application. However, it may also have implications for the power consumed by the base station. For instance, peak power plays an important role in the design of base stations, as power sources at deployed antenna sites are usually constrained. By relying on system-level power estimation strategies, SLX can also provide Power Profile Charts that help to analyze this dimension of the system. Figure 18 shows the resulting Power Profile chart after mapping and scheduling the application on 4 cores at 1 GHz. The figure depicts how much power and energy is consumed in total and by each individual core, and at which points in time. This allows the user to spot, for instance, critical application sections where the maximum allowed peak power is exceeded. The power analysis features in SLX are enabled by providing switching capacitance values for the processors' functional units, memories and buses, and by specifying the system's voltage and frequency domains/settings.

Figure 18: Power Profile chart
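
To see why switching capacitance, voltage and frequency are the inputs to such a power analysis, consider the textbook first-order model of dynamic power, P = C * V^2 * f. The Python sketch below applies this model and flags time slots in which a peak power limit is exceeded. It is a simplified illustration, not the estimation method used by SLX, and all numeric values and names are hypothetical.

```python
def dynamic_power_w(switched_capacitance_f: float, voltage_v: float, freq_hz: float) -> float:
    """First-order dynamic power model: P = C * V^2 * f."""
    return switched_capacitance_f * voltage_v ** 2 * freq_hz


def peak_power_w(active_cores_per_slot, core_power_w: float) -> float:
    """Peak total power over a timeline given the number of active cores per slot."""
    return max(n * core_power_w for n in active_cores_per_slot)


def overshoot_slots(active_cores_per_slot, core_power_w: float, limit_w: float):
    """Indices of time slots in which the total power exceeds the allowed peak."""
    return [
        i for i, n in enumerate(active_cores_per_slot)
        if n * core_power_w > limit_w
    ]


if __name__ == "__main__":
    # Hypothetical values: 1 nF effective switched capacitance, 1.0 V, 1 GHz.
    per_core = dynamic_power_w(1e-9, 1.0, 1e9)
    activity = [1, 4, 4, 2, 1]  # made-up number of busy cores per time slot
    print(f"per-core power: {per_core:.2f} W")
    print(f"peak power: {peak_power_w(activity, per_core):.2f} W")
    print("slots above 3 W:", overshoot_slots(activity, per_core, 3.0))
```

In this model a higher clock frequency raises power linearly at a fixed voltage, which is why a frequency increase has to be weighed against peak power limits.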

By comparing the Power Profile charts of the mapping scenarios for 4 cores at 1.3 GHz and 10 cores at 1 GHz, one can find out which system yields the better power efficiency. Figures 19 and 20 show fragments of the Power Profile charts of these two mapping configurations.

Figure 19: Power Profile chart for 4 cores @ 1.3 GHz operating frequency

Figure 20: Power Profile chart for 10 cores @ 1 GHz operating frequency

In this case, it can be seen that mapping the application onto only 4 cores at 1.3 GHz causes a peak power increase of 15% compared to the case of 10 cores at 1 GHz. However, computing the system's power efficiency in terms of subframes per Watt-second shows that the solution with 4 cores at the higher frequency is superior, resulting in a 13% gain.

5 Conclusion

In this report, we took a complex LTE base station application and mapped it onto two different architectures using SLX. We not only mapped and ran the application on these architectures, but also experimented with different hardware configurations to determine whether different constraints (i.e. peak power and total execution time) are met. All in all, Silexica's SLX provides system architects with detailed insights into their target multicore system and facilitates quick HW/SW partitioning, mapping and scheduling analyses, which help them answer design questions and make informed decisions early in the design process.