High-performance Parallel Concatenated Polar-CRC Decoder Architecture

JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.18, NO.5, OCTOBER, 2018
ISSN(Print) 1598-1657, ISSN(Online) 2233-4866
https://doi.org/10.5573/jsts.2018.18.5.560

Seunghun Oh and Hanho Lee

Manuscript received Feb. 20, 2018; accepted May 7, 2018. Dept. of Information and Communication Eng., Inha University, Incheon, 22212, Korea. E-mail: hhlee@inha.ac.kr

Abstract
In this paper, a novel parallel encoding and decoding method is proposed, which uses concatenated polar-cyclic redundancy check (polar-CRC) codes for high-throughput polar decoder implementation. Compared with previous works, the proposed method considerably reduces latency and improves throughput. A parallel concatenated polar-CRC decoder architecture based on the proposed method is presented and synthesized using 65-nm CMOS process technology. Synthesis results show that the proposed architecture has 4.9 times the data throughput and 4.5 times the hardware efficiency of a conventional SC polar decoder architecture.

Index Terms
Polar codes, CRC codes, successive cancellation decoding, concatenated

I. INTRODUCTION

Assuming a sufficiently long code length, polar codes [1] can achieve the channel capacity of a binary-input discrete memoryless channel (B-DMC). Compared with low-density parity-check (LDPC) codes or turbo codes, polar codes also have the advantage of lower decoding complexity [2]. List decoding with polar codes is currently under consideration for potential adoption in future 5G standards. Polar codes and LDPC codes have been accepted by 3GPP as channel coding schemes for control and data channels, respectively.

Polar codes have low computational complexity and a simple structure that is easy to implement in comparison with other channel codes. The first algorithm for decoding polar codes was the successive-cancellation (SC) algorithm proposed by Arikan (2009), which is a soft-decision decoding algorithm [1].
Then, the belief propagation (BP) algorithm [3] and the successive-cancellation list (SCL) algorithm, which applies list decoding to the SC algorithm [4], were proposed. A polar decoder using the SC algorithm has an advantage over one using the BP algorithm in that it has better error correction performance. However, due to the serial nature of the SC algorithm, its delay tends to be higher than that of the BP algorithm, which processes operations in parallel. On the other hand, a polar decoder based on the BP algorithm requires many processing elements, so it is less useful for practical applications. Recently, it has been found that polar codes using SC list decoding have higher error correction capability than LDPC codes and turbo codes, but they have the disadvantage that their latency is still high [2].

In this paper, a novel parallel encoding and decoding method using concatenated polar-cyclic redundancy check (CRC) codes is proposed to reduce latency and improve throughput for polar decoder implementation. In the conventional SC polar decoder, the processing time required for decoding increases as the code length increases. The proposed parallel SC polar decoder architecture reduces processing time and improves the throughput of the decoding process by dividing the whole codeword into shorter codewords and using a sub-decoder on each of them. To compensate for bit error rate (BER) degradation, the proposed method uses a CRC code concatenated with polar codes. CRC codes are used to check for errors in a

decoded codeword, and if uncorrected errors are found, the parallel SC polar decoder requests retransmission of the codeword. Through the retransmission process, the proposed parallel SC polar decoder can compensate for the BER degradation. Therefore, with the proposed parallel SC polar decoder, it is possible to decode the received message without BER degradation in a specific $E_b/N_0$ range.

The rest of this paper is organized as follows: Section II introduces polar codes. In Section III, the proposed parallel concatenated polar-CRC encoding and decoding method is presented, together with the proposed parallel concatenated polar-CRC decoder architecture. The implementation results are discussed in Section IV along with a comparison to previous works. Finally, conclusions are drawn in Section V.

II. POLAR CODE

1. Polar Encoding

Polar coding uses the channel polarization phenomenon that occurs when channel combining and channel splitting are repeated. After applying channel combining and channel splitting to $N$ independent copies of a channel $W$, we obtain the successive synthesized binary-input channels $W_N^{(i)}$, $i = 1, 2, \ldots, N$, with transition probabilities $W_N^{(i)}(y_1^N, u_1^{i-1} \mid u_i)$. Polar codes are written as $(N, k)$, where $N$ denotes the length of the whole codeword and $k$ denotes the number of information bits included in the codeword. The remaining $N - k$ bits of the codeword, i.e. those left after excluding the information bits, are called frozen bits. The information bits are assigned to the more reliable channels, and the frozen bits are set to fixed values. Polar encoding can be expressed as $x = uG$, where $G$ is the generator matrix and $u$ and $x$ are the input vector and codeword, respectively.

Fig. 1. Polar encoding with N = 8.
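As an illustration of the encoding operation for a short code, the following Python sketch applies the bit-reversal permutation followed by the XOR butterfly stages of the kernel F = [[1, 0], [1, 1]]. It is a minimal software sketch under the row-vector convention x = uG, not the hardware encoder of this paper; the function names `bit_reverse_permute` and `polar_encode` are hypothetical.

```python
def bit_reverse_permute(u):
    """Reorder u so that position i holds the element at the bit-reversed index of i."""
    n = len(u).bit_length() - 1          # number of index bits, N = 2**n
    return [u[int(format(i, f"0{n}b")[::-1], 2)] for i in range(len(u))]


def polar_encode(u):
    """Compute x = u * B * F^(kron n) over GF(2) for a list of 0/1 values.

    Assumes len(u) is a power of two (illustrative sketch only).
    """
    x = bit_reverse_permute(u)
    step = 1
    while step < len(x):
        # One butterfly stage: the upper branch becomes the XOR of both branches.
        for start in range(0, len(x), 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]
        step *= 2
    return x


if __name__ == "__main__":
    u = [0, 0, 0, 1, 0, 1, 1, 0]         # arbitrary 8-bit input vector
    x = polar_encode(u)
    print("codeword:", x)
    # G is its own inverse over GF(2), so encoding twice returns the input.
    print("round trip ok:", polar_encode(x) == u)
```

Because the generator matrix is an involution over GF(2), encoding a codeword a second time recovers the input vector, which gives a simple self-check for the sketch.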
The generator matrix is defined as $G = B F^{\otimes n}$, where $B$ is the bit-reversal permutation matrix, $F^{\otimes n}$ is the $n$-th Kronecker power of $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$, and $N = 2^n$. For example, Fig. 1 shows the polar encoding process when the codeword length $N = 8$.

2. SC Polar Decoding

The SC decoding algorithm [1], which is based on the likelihood ratio (LR), was proposed for polar decoding. Let $\hat{u}_i$ denote the decoded estimate of bit $u_i$. Then, the LR is defined as follows:

$$ LR(y_1^N, \hat{u}_1^{i-1}) = \frac{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \mid u_i = 0)}{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \mid u_i = 1)} \tag{1} $$

Fig. 2. SC polar decoding process with N = 8.

For the received value $y_i$, if $\hat{u}_i$ is a frozen bit, it is set to 0, and if $\hat{u}_i$ is an information bit, then the decoding process proceeds as follows:

$$ \hat{u}_i = \begin{cases} 0, & \text{if } LR(y_1^N, \hat{u}_1^{i-1}) \ge 1 \\ 1, & \text{otherwise} \end{cases} \tag{2} $$

In order to reduce hardware complexity, an SC decoding algorithm using the log-likelihood ratio (LLR) was proposed; the LLR is found by calculating the LR value in the logarithm domain [5]. For the LLR, the analogous version of Eq. (1) is defined in Eq. (3).

$$ LLR(y_1^N, \hat{u}_1^{i-1}) \triangleq \ln LR(y_1^N, \hat{u}_1^{i-1}) \tag{3} $$

In the log domain, the decision rule is defined as follows:

$$ \hat{u}_i = \begin{cases} 0, & \text{if } LLR(y_1^N, \hat{u}_1^{i-1}) \ge 0 \\ 1, & \text{otherwise} \end{cases} \tag{4} $$

Fig. 2 shows the SC polar decoding process when the code length $N = 8$. The decision unit shown in Fig. 2 uses Eq. (4). The f-function and g-function are expressed by the following equations:

$$ f(L_a, L_b) = \operatorname{sign}(L_a)\,\operatorname{sign}(L_b)\,\min(|L_a|, |L_b|) \tag{5} $$

$$ g(L_a, L_b, \hat{u}_s) = (-1)^{\hat{u}_s} L_a + L_b \tag{6} $$

Evaluating the g-function requires the partial sum $\hat{u}_s$ of previously decoded bits; that is, the SC polar decoding process progresses sequentially.

III. PROPOSED PARALLEL CONCATENATED POLAR-CRC CODES AND ARCHITECTURES

1. Proposed Encoding and Decoding Scheme

In the conventional SC polar decoder architecture, the latency required for the decoding process increases as the code length increases. Zhang et al. [6] proposed a pre-computation look-ahead SC polar decoder architecture to reduce latency. The pre-computation look-ahead SC polar decoder reduces the latency of the decoding process by pre-computing the results of the processing elements. When the length of the codeword is $N$, the decoding latency of the pre-computation look-ahead SC polar decoder is $(N - 1)$ clock cycles. As the length of the codeword increases, the decoding latency increases. This increase in decoding latency is a major obstacle to improving data throughput when designing a decoder.

Fig. 3. Parallel concatenated polar-CRC encoding and decoding scheme.
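The sequential dependence behind this latency can be seen in a minimal software sketch of LLR-based SC decoding built from Eqs. (4)-(6). The sketch below is illustrative only: it assumes the natural-order encoding x = u F^(kron n) without bit reversal, uses a hypothetical frozen-bit pattern for N = 8, and the names `f`, `g`, `encode`, and `sc_decode` are chosen here for illustration rather than taken from the hardware architecture.

```python
import math


def f(la, lb):
    """Eq. (5): sign(La) * sign(Lb) * min(|La|, |Lb|)."""
    sign = math.copysign(1.0, la) * math.copysign(1.0, lb)
    return sign * min(abs(la), abs(lb))


def g(la, lb, u_s):
    """Eq. (6): (-1)^u_s * La + Lb, where u_s is a partial sum (0 or 1)."""
    return (1 - 2 * u_s) * la + lb


def encode(u):
    """Natural-order polar encoding (no bit reversal; an assumption of this sketch)."""
    if len(u) == 1:
        return list(u)
    half = len(u) // 2
    left, right = encode(u[:half]), encode(u[half:])
    return [a ^ b for a, b in zip(left, right)] + right


def sc_decode(llr, frozen):
    """Recursive SC decoding; returns (decoded bits, re-encoded partial sums)."""
    if len(llr) == 1:
        bit = 0 if frozen[0] or llr[0] >= 0 else 1      # decision rule, Eq. (4)
        return [bit], [bit]
    half = len(llr) // 2
    # The upper half can start immediately using the f-function.
    upper = [f(llr[i], llr[i + half]) for i in range(half)]
    u_left, s_left = sc_decode(upper, frozen[:half])
    # The lower half must wait for the partial sums of the upper half.
    lower = [g(llr[i], llr[i + half], s_left[i]) for i in range(half)]
    u_right, s_right = sc_decode(lower, frozen[half:])
    sums = [a ^ b for a, b in zip(s_left, s_right)] + s_right
    return u_left + u_right, sums


if __name__ == "__main__":
    frozen = [True, True, True, False, True, False, False, False]  # hypothetical pattern
    u = [0, 0, 0, 1, 0, 1, 1, 0]                                   # frozen positions are 0
    llr = [4.0 if bit == 0 else -4.0 for bit in encode(u)]         # noiseless channel LLRs
    decoded, _ = sc_decode(llr, frozen)
    print("decoded correctly:", decoded == u)
```

The g-function inputs for the second half depend on decisions from the first half, so the two halves cannot be evaluated at the same time; this is the serial bottleneck that grows with the code length.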
The proposed concatenated polar-CRC decoder divides the codeword and performs the decoding process using sub-decoders. As a result, the proposed SC polar decoder reduces the decoding latency and improves data throughput. In order to compensate for the BER performance degradation caused by the sub-decoders, a polar code concatenated with a CRC code is used. The degradation is compensated for by checking whether the decoded codeword is free of errors and, if it is not, performing retransmission. Fig. 3 shows the proposed parallel concatenated polar-CRC encoding and decoding scheme. The (256, 136) SC polar sub-decoder is based on the architecture proposed

in [6]. The processing element in the polar decoder is based on the simplified merged processing element (SMPE) architecture proposed in [7]. The codeword length $N$ of the proposed parallel SC scheme in Fig. 3 is 1024 bits, and the length of the information bits $k$ in the codeword is 512 bits. The code rate ($k/N$), defined as the ratio of the number of information bits to the length of the codeword, is 0.5. In Fig. 3, the left side of the center channel $W$ is the encoding process in a transmitter, and the right side is the decoding process in a receiver.

Fig. 4 illustrates the encoding method of the proposed parallel concatenated polar-CRC code. The 512-bit input data is divided into four parts (4 × 128 bits of data) for parallel processing. Each 128-bit part is then passed through an 8-bit CRC encoder to generate an 8-bit CRC code. The (256, 136) polar encoder generates a 256-bit codeword from 136 bits of information, obtained by concatenating the 128 bits of original data and the 8-bit CRC code. Then, a 1024-bit codeword composed of the four 256-bit codewords is transmitted through the channel.

Fig. 4. Encoding method of parallel concatenated polar-CRC code.

2. Proposed Decoder Architecture

Fig. 5 shows the proposed parallel concatenated polar-CRC decoder architecture. The 1024 LLR values received over the channel are divided into four groups of 256 LLR values for parallel processing. The four groups of LLR values are input to four sub-decoders. Each (256, 136) SC polar sub-decoder then decodes its 256-bit codeword. When the decoding step is completed, the 136 bits of information, excluding the frozen bits, are separated from the 256-bit codeword.

Fig. 5. Proposed parallel concatenated polar-CRC decoder architecture.

Fig. 6. Proposed concatenated polar-CRC sub-decoder architecture.

Fig. 7. Timing diagram of the proposed concatenated polar-CRC sub-decoder.
These 136 bits are then separated into 128 bits of information and 8 bits of CRC code. The CRC decoder generates its own 8-bit CRC code from the 128 bits of information and compares it with the CRC code from the transmitter. From this comparison, the decoder determines whether an error has occurred in the received data.
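The CRC attachment and check described above can be sketched in Python as follows. This is a minimal bit-serial illustration, assuming the generator polynomial x^8 + x^7 + x^4 + x^3 + x + 1 listed with Table 1, a zero initial register, and most-significant-bit-first ordering; the last two choices, as well as the names `crc8`, `split_and_attach`, `crc_ok`, and `needs_retransmission`, are assumptions made for this example and are not specified in the text.

```python
CRC_POLY = 0x19B   # x^8 + x^7 + x^4 + x^3 + x + 1, including the x^8 term
CRC_BITS = 8


def crc8(bits):
    """Remainder of bits(x) * x^8 divided by the generator, as 8 bits (MSB first)."""
    reg = 0
    for b in list(bits) + [0] * CRC_BITS:
        reg = (reg << 1) | b
        if reg & 0x100:            # degree reached 8: subtract (XOR) the generator
            reg ^= CRC_POLY
    return [(reg >> i) & 1 for i in range(CRC_BITS - 1, -1, -1)]


def split_and_attach(data, parts=4):
    """Split the data into equal parts and append an 8-bit CRC to each part."""
    size = len(data) // parts
    blocks = [list(data[i * size:(i + 1) * size]) for i in range(parts)]
    return [block + crc8(block) for block in blocks]


def crc_ok(block):
    """Receiver side: regenerate the CRC from the data bits and compare."""
    return crc8(block[:-CRC_BITS]) == block[-CRC_BITS:]


def needs_retransmission(blocks):
    """Retransmit if the CRC check fails in at least one sub-decoder output."""
    return not all(crc_ok(block) for block in blocks)


if __name__ == "__main__":
    data = [(i * i + i // 3) % 2 for i in range(512)]   # arbitrary 512-bit payload
    blocks = split_and_attach(data)                     # four 136-bit polar inputs
    print([len(block) for block in blocks], needs_retransmission(blocks))
    blocks[2][5] ^= 1                                   # inject a single bit error
    print("retransmit after error:", needs_retransmission(blocks))
```

Each 136-bit block produced here corresponds to the information part of one (256, 136) polar codeword; a single flipped bit in any block changes the regenerated CRC and triggers the retransmission condition.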

Fig. 8. Retransmission timing diagram of the proposed parallel concatenated polar-CRC decoder.

If an error occurs in at least one sub-decoder, the codeword is retransmitted and the decoding process is performed again. Fig. 6 shows the proposed concatenated polar-CRC sub-decoder architecture. The proposed parallel concatenated polar-CRC decoder consists of four sub-decoders, and each sub-decoder consists of a (256, 136) SC polar decoder, a CRC decoder for generating a CRC code, and a CRC code checker for comparing the transmitted CRC code with the CRC code generated by the receiver.

Fig. 7 shows the timing diagram of the proposed concatenated polar-CRC sub-decoder. The (256, 136) SC polar decoder performs the decoding process using the received LLR values and takes 255 clock cycles to complete it, followed by a further three clock cycles to check the CRC codes. Therefore, the latency of the sub-decoder is 258 clock cycles, whereas the conventional (1024, 512) SC polar decoder requires 1023 clock cycles to process a 1024-bit codeword. Hence, the proposed parallel SC polar decoder significantly reduces the latency required for the decoding process.

When evaluating the performance of the proposed parallel concatenated polar-CRC decoder, it is important to note that there is a maximum number of retransmissions allowed. Because the performance of the proposed decoder is compensated through retransmission, the delay of the output changes according to the number of retransmissions required. Each decoding attempt of the proposed parallel concatenated polar-CRC decoder takes 258 clock cycles to produce an output. When retransmission occurs three or more times, the delay reaches at least 1032 clock cycles (4 × 258), which is more than that of the conventional (1024, 512) SC polar decoder.
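The latency budget described above can be checked with a few lines of Python. The sketch assumes, as the timing discussion does, that every retransmission costs one additional full sub-decoder pass of 258 clock cycles and that no other delay is counted; the function names are hypothetical.

```python
SUB_DECODE_CYCLES = 255        # (256, 136) SC polar sub-decoder
CRC_CHECK_CYCLES = 3           # CRC generation and comparison
CONVENTIONAL_CYCLES = 1023     # conventional (1024, 512) SC polar decoder


def total_latency(retransmissions):
    """Clock cycles until output after the given number of retransmissions."""
    attempts = retransmissions + 1
    return attempts * (SUB_DECODE_CYCLES + CRC_CHECK_CYCLES)


def max_retransmissions():
    """Largest retransmission count that does not exceed the conventional latency."""
    count = 0
    while total_latency(count + 1) <= CONVENTIONAL_CYCLES:
        count += 1
    return count


if __name__ == "__main__":
    for r in range(4):
        print(r, "retransmission(s):", total_latency(r), "cycles")
    print("maximum allowed:", max_retransmissions())
```

Under these assumptions the totals are 258, 516, 774, and 1032 cycles for zero to three retransmissions, so only the last case exceeds the 1023-cycle latency of the conventional decoder.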
Therefore, the maximum number of retransmissions allowed is two. Fig. 8 is a retransmission timing diagram of the proposed parallel concatenated polar-CRC decoder when retransmission occurs twice. Referring to Fig. 8, since it takes 774 clock cycles to produce an output even after two retransmissions, the delay is still less than that of the conventional (1024, 512) SC polar decoder.

IV. IMPLEMENTATION RESULTS AND COMPARISON

1. BER Performance

Fig. 9 shows the BER performance of the proposed parallel concatenated polar-CRC decoder. The BER performance is compared at the same $E_b/N_0$ value, taking retransmission into account. Fig. 10 shows the average number of retransmissions as a function of $E_b/N_0$. Simulation results show that the average number of retransmissions is 1.5 at 1.5 dB and that it decreases gradually as $E_b/N_0$ increases. In particular, when $E_b/N_0$ is 3 dB or more, almost no retransmission occurs. When $E_b/N_0$ is smaller than 3 dB, many retransmissions occur and more energy is used in the transmission. Compared under the same $E_b/N_0$ condition, the proposed decoder therefore has poorer BER performance in the low $E_b/N_0$ region, where retransmissions occur frequently. However, in the case of $E_b/N_0$ > 3 dB, the proposed parallel

concatenated polar-CRC decoder has the same BER performance as the conventional (1024, 512) SC polar decoder.

Fig. 9. BER performance of the proposed parallel concatenated polar-CRC decoder.

Fig. 10. Average number of retransmissions for the proposed parallel concatenated polar-CRC decoder.

2. VLSI Implementation

The proposed parallel SC polar decoder architecture can improve data throughput by increasing the number of internal parallel sub-decoders. Table 1 lists the hardware complexity and performance according to the number of sub-decoders in the proposed parallel SC polar decoder. In Table 1, P denotes the number of internal parallel sub-decoders. When the number of parallel sub-decoders increases, the data throughput increases proportionally because the latency is constant (N + 2).

Table 1. Hardware complexity of the proposed parallel concatenated polar-CRC decoder

Number of parallel sub-decoders      P = 4        P = 8        P = 12
Polar decoder   No. of PE            1,020q       2,040q       3,060q
                No. of Register      3,048q       6,096q       9,144q
                No. of MUX           2,036q       4,072q       6,108q
CRC             No. of XOR           2,560        5,120        7,680
                No. of Register      32           64           96
Latency                              N + 2        N + 2        N + 2
Throughput                           4N/(N + 2)   8N/(N + 2)   12N/(N + 2)

q: quantization bits; N: code-length; generator polynomial: $x^8 + x^7 + x^4 + x^3 + x + 1 = 0$.

The proposed concatenated polar-CRC decoder was synthesized using the Synopsys Design Compiler and a TSMC 65-nm CMOS process with appropriate timing and area constraints. The estimated total number of NAND gates, determined from the synthesis results, is 291,800, and the clock speed of the proposed polar decoder is 833 MHz. To compare designs scaled to the same technology (65-nm CMOS), the technology-scaled normalized throughput (TSNT) metric is used, which is defined in terms of throughput per thousand gates (Kgate) [8]. Table 2 lists the implementation results for the proposed decoder and the reported (1024, 512) SC polar decoders.
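As a rough consistency check of these synthesis figures, the throughput and the technology-scaled normalized throughput of the proposed decoder can be recomputed from the latency, clock speed, and gate count. The sketch assumes that throughput is counted as codeword bits delivered per decoding latency, and it uses the normalization given with Table 2 (throughput times technology/65 nm, divided by the gate count in Kgates); the function names are hypothetical.

```python
def throughput_mbps(codeword_bits, latency_cycles, clock_mhz):
    """Codeword bits per microsecond of decoding latency, i.e. Mbps."""
    return codeword_bits * clock_mhz / latency_cycles


def scaled_normalized_throughput(throughput, technology_nm, gate_count):
    """Throughput per Kgate (Mbps/Kgate), scaled to 65-nm technology."""
    return throughput * (technology_nm / 65.0) / (gate_count / 1000.0)


if __name__ == "__main__":
    # Proposed decoder: 1024-bit codeword, 258-cycle latency, 833 MHz, 291,800 gates.
    rate = throughput_mbps(1024, 258, 833)
    efficiency = scaled_normalized_throughput(rate, 65, 291_800)
    print(f"throughput = {rate:.0f} Mbps, normalized = {efficiency:.1f} Mbps/Kgate")
```

Under these assumptions the result is about 3306 Mbps and about 11.3 Mbps/Kgate, which agrees with the reported values to within rounding.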
The total number of NAND gates is approximately 9% larger than that of the conventional SC polar decoder in [7], while the clock speed is 833 MHz, which is an improvement of 24% compared to the

conventional SC polar decoder in [7] using the same CMOS process technology. The throughput is 3.3 Gbps, which is approximately 4.9 times higher than in [7]. The normalized hardware efficiency (TSNT) of the proposed parallel concatenated polar-CRC decoder is 11.3, which is 4.5 times higher than in [7]. Therefore, the proposed parallel concatenated polar-CRC decoder improves data throughput and hardware efficiency significantly.

Table 2. Implementation result of the proposed parallel concatenated polar-CRC decoder

                            Proposed   [7]       [5]       [9]
Technology                  65-nm      65-nm     65-nm     45-nm
Gate count (NAND)           291,800    268,200   214,370   338,500
Clock speed (MHz)           833        670       500       750
Decoding Latency (cycle)    258        1,023     2,080     767
Decoding Latency (ns)       310        1,535     4,160     1,028
Throughput (Mbps)           3,307      670       246       1,000
TSNT (Mbps/Kgate)           11.3       2.5       1.15      2.04

TSNT = [(throughput) × (technology/65 nm)] / (total gate count) [8]. TSNT is scaled to 65-nm technology.

V. CONCLUSIONS

In this paper, a parallel concatenated polar-CRC decoder with high data throughput using concatenated polar-CRC codes was proposed. The proposed parallel concatenated polar-CRC decoder divides the codeword and performs decoding through sub-decoders. This method significantly reduces the latency required for decoding and improves data throughput. Based on this method, a high-throughput concatenated polar-CRC decoder architecture can be implemented. Implementation results show that the proposed architecture has significant advantages with respect to throughput and hardware efficiency compared to previous SC polar decoder architectures.

ACKNOWLEDGMENTS

This work was supported by Inha University Research Grant.

REFERENCES

[1] E. Arikan, "Channel polarization: a method for constructing capacity-achieving codes for symmetric binary-input memoryless channels," Information Theory, IEEE Transactions on, Vol. 55, No. 7, pp. 3051-3073, Jul. 2009.
[2] K. Niu, K. Chen, J. Lin and Q. T.
Zhang, "Polar Codes: Primary Concepts and Practical Decoding Algorithms," IEEE Communications Magazine, Vol. 52, No. 7, pp. 192-203, Jul. 2014.
[3] B. Yuan and K. K. Parhi, "Early Stopping Criteria for Energy-Efficient Low-Latency Belief-Propagation Polar Code Decoders," Signal Processing, IEEE Transactions on, Vol. 62, No. 24, pp. 6496-6506, Dec. 2014.
[4] I. Tal and A. Vardy, "List Decoding of Polar Codes," Information Theory, IEEE Transactions on, Vol. 61, No. 5, pp. 2213-2226, May 2015.
[5] C. Leroux, A. J. Raymond, G. Sarkis, and W. J. Gross, "A semi-parallel successive-cancellation decoder for polar codes," Signal Processing, IEEE Transactions on, Vol. 61, No. 2, pp. 289-299, Jan. 2013.
[6] C. Zhang, B. Yuan, and K. K. Parhi, "Reduced-latency SC polar decoder architectures," Communications (ICC), 2012 IEEE International Conference on, pp. 3471-3475, Jun. 2012.
[7] H. Yun and H. Lee, "Simplified merged processing element for successive-cancellation polar decoder," IET Electronics Letters, Vol. 52, No. 4, pp. 270-272, Feb. 2016.
[8] H.-Y. Hsu, A.-Y. Wu and J.-C. Yeo, "Area-efficient VLSI design of Reed-Solomon decoder for 10GBase-LX4 optical communication systems," Circuits and Systems II, IEEE Transactions on, Vol. 53, No. 11, pp. 1245-1249, Nov. 2006.
[9] B. Yuan and K. K. Parhi, "Low-latency Successive-Cancellation Polar Decoder Architectures Using 2-Bit Decoding," Circuits and Systems I, IEEE Transactions on, Vol. 61, No. 4, pp. 1241-1254, Oct. 2013.

Seunghun Oh received his B.S. and M.S. degrees, both in Information & Communication Engineering, from Inha University, Incheon, Korea, in 2015 and 2017, respectively. His research interests are VLSI and SoC architecture design for forward error correction and communications.

Hanho Lee received his Ph.D. and M.S. degrees, both in Electrical & Computer Engineering, from the University of Minnesota, Minneapolis, in 2000 and 1996, respectively. From April 2000 to August 2002, he was a Member of Technical Staff at Lucent Technologies (Bell Labs Innovations), Allentown. From August 2002 to August 2004, he was an Assistant Professor at the Department of Electrical and Computer Engineering, University of Connecticut, USA. Since August 2004, he has been with the Department of Information and Communication Engineering, Inha University, Korea, where he is currently a Professor. From August 2010 to August 2011, he was a visiting scholar at Bell Labs, Alcatel-Lucent, Murray Hill, New Jersey, USA. His research interests include VLSI architecture design for forward error correction, cryptography, and communications.