WiMAX Basestation: Software Reuse Using a Resource Pool
Cory Modlin, Wireless Systems Architect, cmodlin@ti.com
L. N. Reddy, Wireless Software Manager, lnreddy@tataelxsi.co.in
Arnon Friedmann, SW Product Manager, arnon@ti.com

Outline
- Overview of Problem
- Traditional Approaches
- Resource Pool Approach
- Realization

Our Goals
- Mobile WiMAX (802.16e) PHY baseband base station demonstration
- Single scalable architecture
  - Single to multiple sectors
  - From pico to macro base station
  - Single antenna to multiple antennas
- Multiple processors
  - C6455 DSPs: add more DSPs as more processing is needed
  - FPGA

WiMAX Brief Overview (1)
- OFDM downlink / OFDMA uplink
- 5 to 20 MHz bandwidth
- 512 to 2048 OFDM subcarriers
- 23/6 Mbit/s (downlink/uplink) at 10 MHz
- TDD or FDD
- 5 ms frames = ~50 OFDM symbols
- Advanced features
  - DL beamforming/MIMO
  - UL MIMO
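The "~50 OFDM symbols" per 5 ms frame can be checked with a short calculation. The sketch below assumes the common Mobile WiMAX numerology for a 10 MHz channel (1024-point FFT, sampling factor 28/25, cyclic prefix of 1/8); none of these parameters appear on the slide, and the function name is illustrative. Under these assumptions the result is about 48.6 symbol periods per frame, before any TDD guard gaps are subtracted.

```python
# Assumed Mobile WiMAX numerology for a 10 MHz channel (not stated on the slide).
BANDWIDTH_HZ = 10e6
FFT_SIZE = 1024
SAMPLING_FACTOR = 28 / 25   # assumed oversampling factor for a 10 MHz channel
CP_RATIO = 1 / 8            # assumed cyclic prefix, as a fraction of the useful symbol
FRAME_S = 5e-3              # 5 ms frame, from the slide


def symbol_periods_per_frame(bandwidth_hz, fft_size, sampling_factor, cp_ratio, frame_s):
    """Return how many OFDM symbol periods (useful part + cyclic prefix) fit in a frame."""
    sampling_rate_hz = bandwidth_hz * sampling_factor
    useful_symbol_s = fft_size / sampling_rate_hz
    symbol_s = useful_symbol_s * (1 + cp_ratio)
    return frame_s / symbol_s


if __name__ == "__main__":
    n = symbol_periods_per_frame(BANDWIDTH_HZ, FFT_SIZE, SAMPLING_FACTOR, CP_RATIO, FRAME_S)
    print(f"{n:.1f} symbol periods per frame")
```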

WiMAX Brief Overview (2)
[Figure: WiMAX TDD frame (5 ms) showing DL bursts #1 to #7 and UL bursts #1 to #5. Figure from Mobile WiMAX Part I: A Technical Overview and Performance Evaluation; WiMAX Forum, April 2006.]

Unique Challenges for Advanced Wireless OFDM Base Stations
- High-complexity MIMO algorithms require more than one DSP
- A single user can consume anything from a small fraction of a frame to the entire frame
  - Cannot statically divide processing by user
- Processing load can vary substantially from frame to frame
  - MIMO receiver >> non-MIMO receiver
  - Turbo decoder >> convolutional decoder
  - Beamformed user >> single DL antenna
- A system designed for the worst case could be substantially overdesigned
  - Worst case for one burst might not be sustainable over an entire subframe
  - In TDD, worst-case UL and worst-case DL cannot coexist
- MAC scheduler controls allocation of users and can keep resource requirements under control

Outline
- Overview of Problem
- Traditional Approaches
- Resource Pool Approach
- Realization

Two General Approaches for Taking Advantage of Multiple Processors
- Compiler/programming language that abstracts the hardware from the software designer
  - Application software is not aware of the physical topology
  - Example: remote procedure call (RPC), CORBA...
- Static allocation of resources among processors
  - Software architecture places functional blocks on specific processors
  - Placement of functional blocks is an integral part of the application
[Diagram labels: processor 1, processor 2]

Disadvantages of the Compiler Approach for Our Application
- Physical location of function calls is determined at compile time
  - Does not allow dynamic flexibility at run-time
  - Run-time flexibility is desirable for efficient use of resources and for failure recovery
- Host/client thread is blocked while the remote function is called
  - Waits for the function to return
- A lot of data movement between processors
  - There is overhead for moving data
  - Interprocessor link can have high latency, so we do not want to require that results come back
- Overhead of a generic approach like RPC/IDL (interface definition language)
  - Need to balance the desire for a generic interface with real-time processing requirements
[Diagram labels: processor 1, processor 2]

Common Communications Infrastructure Architectures
- Parallel concurrency for uplink/downlink split
- Pipelined concurrency for symbol rate vs chip rate
[Diagram labels: DSL, CDMA; downlink chain (FEC encoder/interleaver, modulation, pulse shaping) on DSP0; uplink chain (FEC decoder, demodulation, equalizer) on DSP1 and DSP2]

Problems With Static Allocation of Resources
[Diagram: DSP0 runs FEC encoder/interleaver, modulation, pulse shaping, FEC decoder, demodulation, equalizer]
- Single antenna, single sector, simple FEC code: all fits on a single DSP
- Add turbo encoder and decoder
  - Need to split uplink and downlink
  - Need hardware accelerator/FPGA for turbo decoder

Problems With Static Allocation of Resources
[Diagram: DSP0 runs FEC encoder/interleaver, modulation, pulse shaping; DSP1 runs FEC decoder pre/post processor, demodulation, equalizer; FEC decoder on FPGA]
- Add support for multiple antennas or multiple sectors
- Add MIMO
  - Need to split uplink into parts

Problems With Static Allocation of Resources
[Diagram: DSP0 runs FEC encoder/interleaver, modulation, pulse shaping; DSP1 runs FEC decoder pre/post processor, demodulation; DSP2 runs MIMO equalizer; FEC decoder on FPGA]
- Add support for multiple antennas or multiple sectors; add MIMO
  - Need to split uplink into parts
- Implement design and discover that DSP2 is 101% loaded
  - Need to split processing done on DSP2 into parts

Problems With Static Allocation of Resources
[Diagram: DSP0 runs FEC encoder/interleaver, modulation, pulse shaping; DSP1 runs FEC decoder pre/post processor, demodulation; MIMO equalizer split across DSP2 and DSP3; FEC decoder on FPGA]
- Implement design and discover that DSP2 is 101% loaded
  - Need to split processing done on DSP2 into parts

Problems With Static Allocation of Resources
- Limited re-use
  - For a common hardware platform, division of resources among processors will be different for different standards
  - Changes in complexity (e.g. more antennas) or addition of features (e.g. beamforming) require substantial redesign
  - Even a small change to a function on a heavily loaded processor could completely change the architecture
- Inefficient
  - Worst-case loading on each DSP might be well under 100% because of the way functions are divided
  - Each DSP must be provisioned for its own worst case even if the combined worst case is never possible
  - Example in time division duplexing (TDD)
    - Worst-case downlink is when most of the frame is dedicated to downlink
    - Worst-case uplink is when most of the frame is dedicated to uplink
    - But both worst cases can never occur in the same frame
  - Headroom for an unexpected worst case is limited and cannot be distributed among the processors
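The TDD inefficiency can be made concrete with a small numeric sketch. All numbers below are hypothetical (per-symbol costs, frame length, and the range of DL/UL splits the scheduler may choose); they are not measurements from the demonstration. A static design budgets the downlink and uplink processors separately for their own worst cases, while a pooled design only has to cover the worst total load of any single frame.

```python
# Hypothetical processing costs in arbitrary "DSP load" units per OFDM symbol.
DL_COST_PER_SYMBOL = 1.0
UL_COST_PER_SYMBOL = 2.0
SYMBOLS_PER_FRAME = 48
# Hypothetical range of DL symbol counts the MAC scheduler may choose per frame.
DL_SYMBOL_CHOICES = range(12, 37)  # 12 to 36 DL symbols, remainder is UL


def frame_load(dl_symbols):
    """Return (downlink load, uplink load) for one frame with the given DL/UL split."""
    ul_symbols = SYMBOLS_PER_FRAME - dl_symbols
    return dl_symbols * DL_COST_PER_SYMBOL, ul_symbols * UL_COST_PER_SYMBOL


def static_budget():
    """Static allocation: DL and UL processors are each sized for their own worst case."""
    loads = [frame_load(dl) for dl in DL_SYMBOL_CHOICES]
    return max(dl for dl, _ in loads) + max(ul for _, ul in loads)


def pooled_budget():
    """Resource pool: sized for the worst total load that can occur in a single frame."""
    return max(dl + ul for dl, ul in (frame_load(d) for d in DL_SYMBOL_CHOICES))


if __name__ == "__main__":
    print("static:", static_budget(), "pooled:", pooled_budget())
```

With these made-up numbers the static budget is 108 units and the pooled budget is 84 units, because the frame that maximizes downlink load is never the frame that maximizes uplink load.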

Outline
- Overview of Problem
- Traditional Approaches
- Resource Pool Approach
- Realization

Ideal Resource Pool
- Pool of processors is abstracted from the application
- Total processing power is equal to the sum of the individual processing powers
- Resources are configurable at run-time
- Both parallel and pipelined division of resources are possible
[Diagram: resource pool]
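One way to picture "both parallel and pipelined division" is as two different run-time mappings from processing stages to DSPs. The sketch below is illustrative only: the stage names, the map layout, and the round-robin choice of DSP per work item are assumptions, not the format used in the demonstration.

```python
# Hypothetical stage names for a transmit chain.
STAGES = ["fec_encode", "modulate", "ifft"]


def pipelined_map(stages, dsps):
    """Pipelined division: each stage runs on one DSP and data flows DSP to DSP."""
    return {stage: [dsps[i % len(dsps)]] for i, stage in enumerate(stages)}


def parallel_map(stages, dsps):
    """Parallel division: every DSP can run every stage on its own share of the data."""
    return {stage: list(dsps) for stage in stages}


def dsp_for(resource_map, stage, work_item):
    """Pick the DSP for one work item (e.g. a codeword index) of a given stage."""
    candidates = resource_map[stage]
    return candidates[work_item % len(candidates)]


if __name__ == "__main__":
    print(pipelined_map(STAGES, [0, 1, 2]))
    print(parallel_map(STAGES, [0, 1]))
```

Because the mapping is plain data, switching between the two divisions (or mixing them) is a configuration change rather than a change to the signal processing code.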

Resource Pool
[Diagram labels: MAC (host), Management Entity (host), Physical Layer Controller (DSP1), Resource Pool Controller (DSP1), Connectivity Layer, Signal Processing blocks on DSP1, DSP2, and FPGA, Resource Pool]
- Frame-by-frame configuration from the MAC is used as input to the resource pool controller
  - Coding, block sizes, and memory locations change from frame to frame
- Management entity configures the resource pool controller
  - Number of processors
  - Allocation of functions to processors

Guiding Principles
- There can be from 1 to n DSPs
- MAC/management communicates with only one DSP
- Same code image runs on all DSPs
- Same data structures reside on all DSPs
- Processors can talk to each other
- Definition of jobs/functional blocks is done manually
  - No attempt to automate division of resources
[Diagram labels: MAC/management, PHY/resource pool controller]

Architecture Layers with Connectivity Layer
- WiMAX processing is divided into jobs/functions
- Jobs are called through the Connectivity Layer API
  - commit (copy) data
  - post next job
- Connectivity Layer determines the physical location of the destination
  - Knows where each job runs
  - Allows abstraction of multiple processors from the application
- PHY driver (RapidIO in our case), used if the destination is on a remote DSP
  - Transfers the data
  - Generates interrupts/notification
[Diagram labels: DSP 1, DSP 2; WiMAX Signal Processing, PHY Controller, Resource Pool Controller; Connectivity Layer; SRIO (PHY) driver; SRIO PHY; paths marked "if on same DSP" and "if on a remote DSP"]

PHY/Resource Pool Controller
- Calculate job descriptors for all jobs
  - For example, FEC coding parameters per codeword and memory location for each codeword
- Calculate resource descriptor to designate on which physical processor each job will run
- Send job descriptors and DSP assignments to all DSPs
- Job distribution is dynamic and configurable at run-time
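The controller steps above can be sketched as plain data structures. Everything named here is hypothetical: the fields of `JobDescriptor`, the dictionary used as a resource descriptor, and the round-robin assignment policy are placeholders chosen for illustration, not the formats used in the demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobDescriptor:
    job_index: int
    function: str        # hypothetical job name, e.g. "fec_encode"
    codeword_bytes: int  # stands in for the per-codeword FEC parameters
    buffer_offset: int   # memory location of the codeword, identical on every DSP


def build_job_descriptors(codeword_sizes, function="fec_encode"):
    """Create one job per codeword and lay the codewords out back to back in memory."""
    jobs = []
    offset = 0
    for index, size in enumerate(codeword_sizes):
        jobs.append(JobDescriptor(index, function, size, offset))
        offset += size
    return jobs


def build_resource_descriptor(jobs, num_dsps):
    """Map job_index -> DSP id. Round-robin is an assumed placeholder policy."""
    return {job.job_index: job.job_index % num_dsps for job in jobs}


def frame_configuration(codeword_sizes, num_dsps):
    """Per-frame step: compute the descriptors and hand the same copy to every DSP."""
    jobs = build_job_descriptors(codeword_sizes)
    resources = build_resource_descriptor(jobs, num_dsps)
    return {dsp: (jobs, resources) for dsp in range(num_dsps)}


if __name__ == "__main__":
    for dsp, (jobs, resources) in frame_configuration([60, 36, 60], num_dsps=2).items():
        print(dsp, resources, [job.buffer_offset for job in jobs])
```

Sending identical descriptors to every DSP matches the principle that the same data structures reside on all processors; only the resource descriptor decides which DSP actually runs a given job.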

Connectivity Layer API
- Commit (copy) data (source address, destination address, data size, pointer to appropriate resource descriptor)
  - All memories involved in resource sharing reside on all processors
  - Connectivity Layer knows the starting memory location on all processors
  - Connectivity Layer knows which processor to copy to (could be the current processor) based on which application function called it
- Notify/post (job to be posted, job index, pointer to appropriate job and resource descriptor)
  - Connectivity Layer causes an interrupt on the relevant processors (we use a RapidIO doorbell on a remote processor)
  - Connectivity Layer then posts the next job
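A minimal simulation of the two calls is sketched below. It is not the actual API: the class, method signatures, and the explicit `job_index` argument to `commit` are assumptions (the slide says the layer infers the target from the calling function). A bytearray per DSP stands in for the identically laid out memories, a slice copy stands in for a local copy or a RapidIO transfer, and a logged tuple stands in for a doorbell interrupt.

```python
from collections import deque


class ConnectivityLayer:
    """Toy model: hides which DSP a job runs on behind commit() and post()."""

    def __init__(self, num_dsps, memory_bytes):
        # Same memory layout on every DSP, so one offset is valid everywhere.
        self.memory = [bytearray(memory_bytes) for _ in range(num_dsps)]
        self.job_queues = [deque() for _ in range(num_dsps)]
        self.doorbells = []  # (from_dsp, to_dsp) log of simulated remote interrupts

    def commit(self, local_dsp, src, dst, size, resource_descriptor, job_index):
        """Copy data to the DSP that will run job_index (possibly the local one)."""
        target = resource_descriptor[job_index]
        data = bytes(self.memory[local_dsp][src:src + size])
        # Stand-in for a local copy or an SRIO transfer to a remote DSP.
        self.memory[target][dst:dst + size] = data
        return target

    def post(self, local_dsp, job_index, resource_descriptor):
        """Notify the target DSP and queue the next job there."""
        target = resource_descriptor[job_index]
        if target != local_dsp:
            self.doorbells.append((local_dsp, target))  # simulated doorbell
        self.job_queues[target].append(job_index)
        return target


if __name__ == "__main__":
    layer = ConnectivityLayer(num_dsps=2, memory_bytes=16)
    layer.memory[0][0:4] = b"abcd"
    resource_descriptor = {0: 0, 1: 1}  # job 1 runs on DSP 1
    layer.commit(0, src=0, dst=8, size=4, resource_descriptor=resource_descriptor, job_index=1)
    layer.post(0, job_index=1, resource_descriptor=resource_descriptor)
    print(bytes(layer.memory[1][8:12]), list(layer.job_queues[1]), layer.doorbells)
```

The calling code is the same whether the target is local or remote; only the resource descriptor changes, which is what lets the application stay unaware of the physical topology.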

Outline
- Overview of Problem
- Traditional Approaches
- Resource Pool Approach
- Realization

WiMAX Transmitter
[Diagram: MAC buffer and CRC calculation feed a resource pool (per codeword) of parallel "FEC and modulation" blocks (buffer, randomizer, FEC encoder + interleaver, modulation, buffer), followed by a resource pool (per sector) of parallel "permutation + IFFT" blocks (buffer, permutation, IFFT, buffer); SRIO is labeled as the interconnect]

GUI to Configure the Resource Pool

Hardware Platform
- AMC70k2000 (STx)
  - 4 C6455 DSPs
  - IDT RapidIO switch
- Tundra RapidIO switch
- Lab development AMC carrier board (STx)
- General Purpose Processor with RapidIO

end