A Two-Stage Radix-4 Viterbi Decoder for Multiband OFDM UWB Systems

Sung-Woo Choi, Kyu-Min Kang, and Sang-Sung Choi

ABSTRACT—This letter presents a power efficient 64-state Viterbi decoder (VD) employing a two-stage radix-4 add-compare-select architecture. A class of VD architectures is implemented, and their hardware complexity, maximum operating speed, and power consumption are compared. Implementation results show that the proposed VD architecture is suitable for multiband orthogonal frequency-division multiplexing (MB-OFDM) ultra-wideband (UWB) systems, which can support the data rate of 480 Mbps even when implemented using 0.18-μm CMOS technology.

Keywords—Viterbi decoder, radix-4, UWB, MB-OFDM.

I. Introduction

Ultra-wideband (UWB) systems have received much attention in recent years, mainly because they offer high data rates at low transmit power [1]-[3]. Specifically, a multiband orthogonal frequency-division multiplexing (MB-OFDM) based UWB system supporting a maximum data rate of 480 Mbps is widely considered [1]. In the MB-OFDM UWB system, a rate-1/3 convolutional code is utilized. The Viterbi algorithm is generally used to achieve near-optimal decoding performance.

Recently, there have been many studies on high-speed Viterbi decoders (VDs). A systolic solution of the $M$-step parallel processing architecture was proposed in [4]. Because an $M$-fold increase in throughput requires a $2^M$-fold increase in hardware complexity, $M$ is usually limited to less than 3. A bit-level pipelined structure was studied in [5]. It meets the timing constraints by retiming the critical path; however, both power consumption and hardware complexity increase when data rates of hundreds of Mbps must be supported. A minimized method and sliding block methods have also been studied for high-speed processing [6]. To greatly increase the data rate, a systolic design has been employed. Because each of the above methods is limited in speed, in hardware complexity, or in both, an architecture appropriate to the target system should be implemented.

II. Two-Stage Radix-4 Viterbi Decoder

In this letter, we present a power efficient two-stage radix-4 VD architecture, which can support the maximum data rate of 480 Mbps in the MB-OFDM UWB system. Figure 1 shows a block diagram of the proposed two-stage radix-4 VD. Input data is composed of 12 input signals (4 input symbol vectors of 3 signals each). The upper 6 input signals are fed into the BM0 module, while the lower 6 input signals are fed into the BM1 module.

![Fig. 1. Two-stage radix-4 Viterbi decoder.](image-url)

1. Two-Stage 64-State Radix-4 Trellis

Figure 2 shows a two-stage 64-state radix-4 trellis diagram for the rate 1/3 convolutional code with the constraint length of 7. Each of the ACS0 and ACS1 modules consists of 64 four-way add-compare-select (ACS) units as shown in Fig. 3.

2. BM Unit

Branch metrics (BMs) for the radix-4 trellis are generated by combining branch metrics of successive iterations of the underlying radix-2 trellis. Three 4-bit soft decision input signals are used to calculate eight sub-branch metrics corresponding to the eight possible encoder outputs in the radix-2 trellis. Because the radix-4 VD utilizes six 4-bit soft decision inputs for two decoded outputs, the maximum branch metric is 90. The branch metrics are calculated using a uniform distance measure equal to the symbol itself when compared to logic-0 and equal to its one’s complement when compared to logic-1 [6].
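The distance measure described above can be illustrated with a minimal sketch. It assumes that a soft value of 0 is the strongest logic-0 and 15 the strongest logic-1, and that the 3-bit code index lists the expected encoder output bits from most to least significant; the function names and this bit ordering are illustrative assumptions, not taken from the letter.

```python
MAX_SOFT = 15  # largest value of a 4-bit soft-decision input


def bit_distance(symbol, expected_bit):
    """Distance of one 4-bit soft symbol from an expected code bit."""
    if expected_bit == 0:
        return symbol              # the symbol itself when compared to logic-0
    return MAX_SOFT - symbol       # 4-bit one's complement when compared to logic-1


def radix2_sub_metrics(symbols):
    """Eight sub-branch metrics for one triple of soft inputs.

    Entry `code` is the distance to the encoder output whose three bits
    are the binary digits of `code` (assumed ordering: MSB first).
    """
    metrics = []
    for code in range(8):
        bits = [(code >> 2) & 1, (code >> 1) & 1, code & 1]
        metrics.append(sum(bit_distance(s, b) for s, b in zip(symbols, bits)))
    return metrics


def radix4_branch_metric(first, second, code_first, code_second):
    """Radix-4 branch metric: sum of two successive radix-2 sub-branch metrics."""
    return (radix2_sub_metrics(first)[code_first]
            + radix2_sub_metrics(second)[code_second])
```

Under these assumptions the worst case is six inputs that each contribute the full distance of 15, which gives 6 × 15 = 90 and agrees with the maximum branch metric stated above.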

3. ACS Unit

An example of a four-way ACS unit for state 0 is given in Fig. 3. Four adders and six comparators are required for the implementation of a four-way ACS unit. The four-way ACS unit updates a new state metric and two survivor bits by using four state metrics and four branch metrics. The state metric $\Gamma_{n+2}^s$ is updated recursively as

$$\Gamma_{n+2}^s = \min_i \{ \Gamma_n^i + \lambda_n^{i,s} \}, \quad s = 0,1, \cdots, 63 , \tag{1}$$

where $i$ is a predecessor state of $s$, and $\lambda_n^{i,s}$ denotes the branch metric on the transition from state $i$ to state $s$ at time $n$.
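The update in (1) for a single target state can be sketched in software as follows. This is only a behavioral illustration: the function name, the argument order, and the tie-breaking rule (lowest predecessor index wins) are assumptions, and plain integers are used instead of fixed-width registers.

```python
def four_way_acs(state_metrics, branch_metrics):
    """Return (new_metric, survivor) for one target state.

    state_metrics  -- metrics of the four predecessor states
    branch_metrics -- metrics of the four branches into the target state
    survivor       -- 2-bit index (0..3) of the winning predecessor
    """
    # Four adders: one candidate path metric per predecessor.
    sums = [g + lam for g, lam in zip(state_metrics, branch_metrics)]

    # Six comparators: one per unordered pair of candidates.
    less_equal = {(a, b): sums[a] <= sums[b]
                  for a in range(4) for b in range(a + 1, 4)}

    # Selection: the winner beats every other candidate in its comparisons.
    for cand in range(4):
        wins = all(
            less_equal[(cand, other)] if cand < other
            else not less_equal[(other, cand)]
            for other in range(4) if other != cand
        )
        if wins:
            return sums[cand], cand
    raise AssertionError("unreachable: exactly one candidate always wins")
```

The six pairwise comparisons mirror the comparator count given above; a full decoder would evaluate this function once per state, that is, 64 times per ACS stage.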

The two-way add-compare circuit of the ACS unit is described in detail in Fig. 4. By employing the modulo normalization algorithm from [7], we can avoid errors due to overflow during the updating of the state metrics and simplify the comparator circuit of the ACS unit as in Fig. 4. Note that the output of the two-way add-compare logic is 0 if A is larger than B; otherwise, the output is 1.
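The idea behind modulo normalization can be sketched as follows. State metrics are allowed to wrap around in a fixed-width register, and two wrapped metrics are compared through the sign bit of their wrapped difference. The register width of 10 bits and all names below are hypothetical; the comparison is valid only while the true difference between the two metrics stays below half the register range.

```python
WIDTH = 10                 # hypothetical metric register width in bits
MASK = (1 << WIDTH) - 1    # keeps a value inside the register
SIGN = 1 << (WIDTH - 1)    # sign bit of a WIDTH-bit two's complement number


def wrap(value):
    """Store a metric in the register, letting it overflow silently."""
    return value & MASK


def modulo_less_than(a, b):
    """True if wrapped metric a represents a smaller path metric than b."""
    return (wrap(a - b) & SIGN) != 0


def add_compare(a, b):
    """Decision bit as described in the text: 0 if A is larger than B, else 1."""
    return 0 if modulo_less_than(b, a) else 1
```

For example, true metrics of 1020 and 1030 are stored as 1020 and 6 in a 10-bit register, yet the sign-bit test still ranks 1020 as the smaller one, so no explicit rescaling of the metrics is needed.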

4. TB Unit

The proposed VD architecture employs the 3-pointer even algorithm for trace-back recursion [8], which is more hardware-efficient than the register exchange algorithm. Figure 5 shows the memory banks, the last-in-first-out (LIFO) buffers, and the block decoding method of the proposed VD.

Table 1. Hardware complexity, maximum operating speed, and power consumption in the ASIC implementation of several ACS architectures.

<table>
<thead>
<tr>
<th></th>
<th>Radix-2</th>
<th>Pipelined radix-4</th>
<th>Two-stage radix-2</th>
<th>Radix-4</th>
<th>Two-stage radix-4</th>
</tr>
</thead>
<tbody>
<tr>
<td>Chip area (μm²)</td>
<td>318,422</td>
<td>171,365</td>
<td>2,409,272</td>
<td>1,114,845</td>
<td>532,463</td>
</tr>
<tr>
<td>No. of gates¹</td>
<td>31,906</td>
<td>33,666</td>
<td>241,410</td>
<td>219,026</td>
<td>53,353</td>
</tr>
<tr>
<td>Operating speed (MHz)</td>
<td>528</td>
<td>264</td>
<td>132</td>
<td>132</td>
<td>132</td>
</tr>
<tr>
<td>Timing pass</td>
<td>No</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<td>Maximum speed (MHz)</td>
<td>277</td>
<td>364</td>
<td>292</td>
<td>329</td>
<td>164</td>
</tr>
<tr>
<td>Power (mW)²</td>
<td>106.8</td>
<td>38.0</td>
<td>307.2</td>
<td>95.8</td>
<td>94.5</td>
</tr>
<tr>
<td>CMOS technology (μm)</td>
<td>0.18</td>
<td>0.13</td>
<td>0.18</td>
<td>0.18</td>
<td>0.13</td>
</tr>
</tbody>
</table>

¹ Based on 2×1 NAND gate, ² Power consumption is estimated by Synopsys’ Power Compiler.

Table 2. Hardware complexity of two-stage radix-4 VD.

<table>
<thead>
<tr>
<th>VD (with memory)</th>
<th>VD (without memory)</th>
<th>ACS only</th>
</tr>
</thead>
<tbody>
<tr>
<td>143,295 gates</td>
<td>131,415 gates</td>
<td>103,257 gates</td>
</tr>
</tbody>
</table>

To conduct the 3-pointer even algorithm, six memory banks and two LIFO buffers are required. The width of each memory bank is 256 bits and the depth is 5. The overall trace-back (TB) length is 40. Note that the 64×2 survivor bits of ACS0 are mapped to the lower half of the memory bank (WT_m), while the 64×2 survivor bits of ACS1 are mapped to the upper half of the memory bank (WT_m) at each iteration.
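How stored survivor bits are turned back into decoded bits can be illustrated with a simplified sketch. It performs a plain single-pointer trace-back, not the 3-pointer even algorithm of [8], and it ignores the bank and LIFO organization. It also assumes a state mapping that the letter does not specify: the 6-bit state holds the six most recent input bits with the newest bit in the most significant position, and the two survivor bits are the two oldest bits of the predecessor state.

```python
def predecessor(state, survivor):
    """Predecessor of `state` one radix-4 step earlier (assumed mapping)."""
    return ((state << 2) & 0x3F) | survivor


def trace_back(survivors, end_state):
    """Recover decoded bits, oldest first, from per-step survivor columns.

    survivors[t][s] is the 2-bit survivor stored for state s at radix-4
    step t; end_state is the state from which the trace-back starts.
    """
    bits = []
    state = end_state
    for column in reversed(survivors):
        pair = state >> 4              # the two newest input bits of this state
        bits.append((pair >> 1) & 1)   # later bit of the pair
        bits.append(pair & 1)          # earlier bit of the pair
        state = predecessor(state, column[state])
    bits.reverse()
    return bits
```

Each radix-4 step yields two decoded bits, and each iteration of the two-stage decoder stores two such columns (one from ACS0 and one from ACS1), which is consistent with the 64 × 2 × 2 = 256-bit bank width given above.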

III. Implementation Results

Table 1 compares the hardware complexity, the maximum operating speed, and the power consumption in the ASIC implementation of a class of ACS architectures. The ACS architectures are implemented and tested using the TSMC 0.13-μm and 0.18-μm CMOS libraries under the slow operating condition. Because the sampling frequency of the MB-OFDM UWB system is 528 MHz [1], the radix-2, radix-4 (or two-stage radix-2), and two-stage radix-4 ACS architectures should be operated at clock speeds of 528 MHz, 264 MHz, and 132 MHz, respectively [6]. Although the pipelined radix-4 architecture satisfies the timing constraints, it requires much greater hardware complexity and power consumption than the radix-4 or two-stage radix-4 architecture [5]. In the radix-4 architecture, only 0.13-μm technology can support the required operating speed. Moreover, the power consumption of the radix-4 architecture is approximately 49% higher than that of the two-stage radix-4 architecture. Table 1 indicates that the proposed two-stage radix-4 VD is the most power-efficient architecture for the MB-OFDM UWB systems. Table 2 summarizes the hardware complexity of the two-stage radix-4 VD implemented using 0.13-μm CMOS technology.
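The required clock speeds quoted above follow from the number of trellis steps each architecture processes per clock cycle. The short sketch below makes that arithmetic explicit; the dictionary of steps per cycle is an interpretation of the architecture names, not a table from the letter.

```python
SAMPLING_MHZ = 528  # sampling frequency of the MB-OFDM UWB system [1]

# Assumed number of radix-2 trellis steps handled per clock cycle.
STEPS_PER_CYCLE = {
    "radix-2": 1,
    "radix-4": 2,
    "two-stage radix-2": 2,
    "two-stage radix-4": 4,
}


def required_clock_mhz(architecture):
    """Clock speed needed to keep up with the 528-MHz sampling rate."""
    return SAMPLING_MHZ / STEPS_PER_CYCLE[architecture]
```

An architecture passes timing when its maximum achievable clock speed is at least this required value.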

IV. Conclusion

We have proposed a power efficient two-stage 64-state radix-4 VD architecture. Implementation results showed that the proposed VD with relatively low hardware complexity and power consumption can support various data rates for MB-OFDM UWB transmission. As ASIC technology evolves, the proposed VD architecture is expected to support data rates of more than 1 Gbps and accordingly is suitable for use in next generation high-speed communication systems.

References