

# FPGA Implementation of Hierarchical Subcarrier Rate and Distribution Matching for up to 1.032 Tb/s or 262144-QAM

Downloaded from: https://research.chalmers.se, 2025-12-08 23:28 UTC

Citation for the original published paper (version of record):

Yoshida, T., Igarashi, K., Konishi, Y. et al (2021). FPGA Implementation of Hierarchical Subcarrier Rate and Distribution Matching for up to 1.032 Tb/s or 262144-QAM. 2021 Optical Fiber Communications Conference and Exhibition, OFC 2021 - Proceedings. http://dx.doi.org/10.1364/OFC.2021.Tu6D.4

N.B. When citing this work, cite the original published paper.



Tsuyoshi Yoshida<sup>1,2</sup>, Koji Igarashi<sup>2</sup>, Yoshiaki Konishi<sup>1</sup>, Magnus Karlsson<sup>3</sup>, and Erik Agrell<sup>3</sup>

<sup>1</sup> Information Technology R&D Center, Mitsubishi Electric Corporation, Kamakura, 247-8501, Japan

<sup>2</sup> Graduate School of Engineering, Osaka University, Suita, 565-0871, Japan

<sup>3</sup> Fiber-Optic Communications Research Center (FORCE), Chalmers University of Technology, Gothenburg, SE-41296, Sweden

Yoshida.Tsuyoshi@ah.MitsubishiElectric.co.jp, yoshida.tsuyoshi@opt.comm.eng.osaka-u.ac.jp

**Abstract:** A novel hierarchical subcarrier rate and distribution matching scheme has been implemented in an FPGA at 1.032 Tb/s. The implemented subsystem achieves seamless data flow among subcarriers at a rate resolution finer than 0.01 bit per channel use. © 2020 The Author(s)

#### 1. Introduction

Probabilistic shaping (PS) is a powerful tool in digital signal processing (DSP) based fiber-optic communications, especially in terms of capacity-approaching performance and rate adaptation [1,2]. In typical PS systems, distribution matching (DM) is placed outside forward error correction (FEC) coding for low-complexity implementation. The use of digital subcarriers enables waterfilling performance [3] or reduces the impact of equalizer-enhanced phase noise [4].

An application-specific integrated circuit (ASIC) for DSP with PS has already been reported, e.g., in [4]. However, there are only a few reports detailing the DM implementation, which is still a challenging issue. Field-programmable gate array (FPGA) implementations were reported in [5–7]. Real-time operation of both DM encoding and decoding was demonstrated only in [5], and later verified at 400 Gb/s throughput in [8]. In contrast, only [6,7] implemented fine rate adaptation, at 0.2 bit/channel use (bpcu) for 16- and 64-ary quadrature amplitude modulation (QAM). More precise rate adaptation and subcarrier rate matching are required in practical systems. To realize such rate adaptation, control of the clock frequency or usage of dummy bits is essential, but it is rarely reported. Per-subcarrier PS was also implemented [4], but its signal processing architecture and granularity of subcarrier rate are still unclear.

Deployable high-throughput and high-baud-rate fiber-optic communications [2,4,8] are constrained by the transceiver signal-to-noise ratio (SNR), e.g., to around 20 dB with 256-QAM [9]. On the other hand, in recent years there have been several reports on hyper-scale constellations far larger than 256-QAM. Physical layer encryption was studied with 4,294,967,296-QAM [10], and SNR-adaptive radio-over-fiber transmission was demonstrated up to 1,048,576-QAM [11]. Ultra-high spectral efficiency was achieved with PS-16384-QAM at 10 Gbaud [12]. A practically implementable DM for hyper-scale constellations will be needed for such alternative applications.

In this work, we propose subcarrier rate and distribution matching (SRDM) and report its implementation, for the first time to the best of our knowledge. The proposed hierarchical SRDM (HiSRDM) with a tree-structured architecture is suitable for implementation in high-throughput subcarrier-multiplexed fiber-optic communications. The achieved rate granularity is < 0.01 bpcu with a single FPGA at a system throughput up to 1.032 Tb/s. The implemented HiSRDM circuitry can also generate PS signals having hyper-scale base constellations up to 262,144-QAM.

#### 2. Principle and architecture

To realize adaptive rates with a fixed circuit, we enable or disable bit lanes as needed and fill the disabled lanes with dummy bits. In each subcarrier, we employ binary-tree-structured hierarchical DM (HiDM) [13] consisting of look-up tables (LUTs) in each layer. Thus the bit lanes are sequentially enabled until data bits enter each LUT. Fig. 1 depicts the (a) signal format and (b) block diagram of HiSRDM. The leftmost sub-figure in Fig. 1(a) shows the total number of lanes w and the number of enabled lanes s (yellow), where $0 \le s \le w$. Dummy bits do not carry any meaningful information (they can be fixed to '0' or treated as don't care). Data bits are placed on the bottom lanes. The bits are demultiplexed into C subcarriers. The second sub-figure from the left in Fig. 1(a) shows the per-subcarrier bus width w/C and the number of enabled lanes $s_j$, where $j \in \{1,2,...,C\}$ is the subcarrier index and $0 \le s_j \le w/C$. The number of enabled lanes $s_j$ can be unequally distributed over subcarriers with nonuniform SNRs. This is the subcarrier rate matching, which is realized by the binary-tree demultiplexer shown in the bottom-right sub-figure in Fig. 1(b). In each binary-tree demultiplexer, the bottom-side output consists of the bottom $v_b$ lanes of the input, having indices from 0 to $v_b - 1$. Among the $v_b$ lanes, the bottom $s_b$ lanes contain data bits. The top-side output consists of $v_t$ consecutive lanes of the input, having indices from $s_b$ to $s_b + v_t - 1$. Among the $v_t$ lanes, the bottom $s_t$ lanes contain data bits.
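The lane selection of the binary-tree demultiplexer described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the top output is taken as the $v_t$ consecutive input lanes starting at index $s_b$, both outputs of a node have equal width, C is a power of two, and dummy lanes are forced to 0 at the leaves. The function names `split_node` and `demux_subcarriers` are hypothetical, and integer labels stand in for data bits so that their routing is visible.

```python
def split_node(lanes, v_b, v_t, s_b):
    """One binary-tree demultiplexer node (illustrative sketch).

    bottom output: input lanes 0 .. v_b - 1 (its bottom s_b lanes carry data)
    top output:    input lanes s_b .. s_b + v_t - 1 (assumed lane range)
    """
    if s_b > v_b or s_b + v_t > len(lanes):
        raise ValueError("lane budget exceeded")
    return lanes[:v_b], lanes[s_b:s_b + v_t]


def demux_subcarriers(lanes, s_list):
    """Recursively split `lanes` into len(s_list) equal-width subcarrier buses.

    s_list[j] is the number of enabled lanes of subcarrier j + 1.
    """
    c = len(s_list)
    if c == 1:
        s_j = s_list[0]
        if s_j > len(lanes):
            raise ValueError("more enabled lanes than bus width")
        # Lanes above s_j are dummy lanes; fix them to 0 for readability.
        return [lanes[:s_j] + [0] * (len(lanes) - s_j)]
    half = c // 2
    v = len(lanes) // 2                 # assumed equal bus width on both sides
    s_b = sum(s_list[:half])            # data lanes routed to the bottom side
    bottom, top = split_node(lanes, v, v, s_b)
    return (demux_subcarriers(bottom, s_list[:half])
            + demux_subcarriers(top, s_list[half:]))


# Hypothetical example: w = 16 lanes, C = 4 subcarriers, unequal s_j.
w, s_list = 16, [3, 1, 4, 2]
s = sum(s_list)
lanes = list(range(1, s + 1)) + [0] * (w - s)   # data on the bottom lanes
for j, bus in enumerate(demux_subcarriers(lanes, s_list), start=1):
    print(f"subcarrier {j}: {bus}")
```

Because every node only selects fixed-width windows of its input, the circuit stays the same for all rates; only the offsets $s_b$ change with the rate setting.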

Bits in the *j*-th subcarrier are demultiplexed into layers 1 to *L* and extra data bits. The bus width for layer $\ell$ is $w_{\text{lyr},\ell}$, and data bits are at the bottom $s_{j,\ell}$ lanes, where $0 \le s_{j,\ell} \le w_{\text{lyr},\ell}$. The bus width for the extra bits is $w_{\text{ex}}$, and data bits are at the bottom $s_{j,\text{ex}}$ lanes, where $0 \le s_{j,\text{ex}} \le w_{\text{ex}}$. The extra data bits will be used after the LUTs in layer 1.

Bits in the $\ell$-th layer of the j-th subcarrier are further demultiplexed into the components for the individual LUTs. The i-th LUT in layer $\ell$ (LUT$\ell$, $i \in \{1,2,...,I_\ell\}$) accepts a variable number of data bits $s_{j,\ell,i}$ on a fixed bus width of $w_{t,\ell}$ for information bits, where $0 \le s_{j,\ell,i} \le w_{t,\ell}$. The number of LUTs in layer $\ell$, $I_\ell$, equals $2^{L-\ell}$ in the HiDM. Such pre-processing helps to realize flexible-rate shaping. In the HiDM, output bits from an LUT in layer $\ell$ are fed into the LUTs in layer $\ell-1$. As the LUT contents for HiDM can be software-defined, no additional memories are required in the hardware. After layer 1, bits for modulation symbols are generated from $u_1$ bits from LUT1 and $w_{t,e}$ (= $w_{\text{ex}}/2^{L-1}$) extra data bits; the subcarriers are then multiplexed before FEC encoding. The extra data bits are non-shaped and can be used for separating a signal point into two or four signal points. The receiver-side processing is the inverse of the transmitter-side processing.
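As a small worked example of the tree dimensions, the sketch below evaluates $I_\ell = 2^{L-\ell}$ and $w_{t,e} = w_{\text{ex}}/2^{L-1}$. The function names are hypothetical, and the value $w_{\text{ex}} = 64$ is chosen purely for illustration; it is not a parameter reported in the paper.

```python
def hidm_lut_counts(num_layers):
    """Number of LUTs per layer, I_l = 2**(L - l), for l = 1 .. L."""
    return {layer: 2 ** (num_layers - layer)
            for layer in range(1, num_layers + 1)}


def extra_lanes_per_lut1(w_ex, num_layers):
    """w_{t,e} = w_ex / 2**(L - 1): extra-bit lanes paired with each layer-1 LUT."""
    n_lut1 = 2 ** (num_layers - 1)
    if w_ex % n_lut1:
        raise ValueError("w_ex must be a multiple of the number of layer-1 LUTs")
    return w_ex // n_lut1


counts = hidm_lut_counts(7)          # L = 7 layers
print(counts)                        # layer 7 is the root, layer 1 the leaves
print("LUTs per subcarrier:", sum(counts.values()))
print("extra lanes per LUT1:", extra_lanes_per_lut1(64, 7))  # w_ex = 64 is hypothetical
```

With $I_\ell$ doubling from each layer to the one below it, the LUTs form a binary tree, which matches the binary-tree-structured HiDM of [13].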



Fig. 1. Principle of HiSRDM; (a) signal formats and (b) block diagram. Paths for sign-bits are not shown because of space limitation.

#### 3. FPGA implementation

We implemented both the Tx and Rx subsystems of HiSRDM on a single Xilinx® Virtex® UltraScale+™ VCU118 XCVU9P FPGA. The number of subcarrier groups and the number of layers in the HiSRDM are flexible. At maximum, the number of subcarrier groups C is 16 (symmetric) and the number of layers L is 7. Processing of the sign bits, which mainly adjusts for the latency, was also implemented.



Fig. 2. Constellation gain G as a function of spectral efficiency  $\beta = 2 + k/n$  for implemented shaping modes, where n denotes DM output block length (2d-symbols).

Fig. 2 shows the constellation gain $G = d_{\min}^2(2^{\beta} - 1)/(6E)$ [14] for the available operation modes, where $d_{\min}$, $\beta$, and E denote the minimum Euclidean distance, spectral efficiency, and average two-dimensional (2d) symbol energy, respectively. The largest base constellation is 262,144-QAM, and the rate granularity $g_R$ is less than 0.01 bpcu (see also Tab. 1) regardless of the chosen base constellation (or its combination among subcarriers). For a 90° rotationally symmetric QAM, $\beta = 2 + s_j/n$, where n denotes the DM output block length (in 2d-symbols). The shaped information bit rates, $s_j/n$ for the j-th subcarrier and s/n in total, are fully flexible. The gain G with a Maxwell-Boltzmann distribution is at most 1.53 dB for a high $\beta$ [14]. For $\beta$ = 9–17 bpcu, even with a reduced n of 32, 4096-QAM and higher-order QAM show a good and flat G of 1.2 dB (a gap of around 0.3 dB to the theoretical limit). For $\beta$ < 9 bpcu, the gap is < 0.3 dB. We have two 256-QAM modes with similar G: one has 5 shaped bits and one extra non-shaped bit (see Sec. 2) per QAM symbol at n = 128 (marked (e) in Tab. 1), and the other is fully shaped at n = 64.
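The gain formula $G = d_{\min}^2(2^{\beta} - 1)/(6E)$ can be evaluated numerically. The following is a minimal sketch, assuming an ideal Maxwell-Boltzmann distribution on a square QAM grid with odd-integer coordinates (so $d_{\min} = 2$) and assuming that $\beta$ equals the entropy of the distribution, i.e. an ideal DM without finite-length rate loss. It therefore does not reproduce the implemented modes of Fig. 2. The function names and the parameter `lam` are hypothetical.

```python
import math


def mb_pmf(amplitudes, lam):
    """Maxwell-Boltzmann probabilities p(a) proportional to exp(-lam * a**2)."""
    weights = [math.exp(-lam * a * a) for a in amplitudes]
    total = sum(weights)
    return [wgt / total for wgt in weights]


def constellation_gain(m_pam, lam):
    """G for a square QAM built from two independent m_pam-ary PAM axes."""
    amps = [2 * i - (m_pam - 1) for i in range(m_pam)]    # ..., -3, -1, 1, 3, ...
    probs = mb_pmf(amps, lam)
    e_1d = sum(q * a * a for q, a in zip(probs, amps))    # energy per dimension
    h_1d = -sum(q * math.log2(q) for q in probs if q > 0) # entropy per dimension
    beta = 2 * h_1d        # assumption: rate equals entropy (ideal DM)
    energy = 2 * e_1d      # average 2d symbol energy E
    d_min = 2.0
    return d_min ** 2 * (2 ** beta - 1) / (6 * energy)


def to_db(x):
    return 10 * math.log10(x)


print("uniform 16-QAM:", to_db(constellation_gain(4, 0.0)), "dB")
print("shaped 64-QAM :", to_db(constellation_gain(8, 0.05)), "dB")
print("limit pi*e/6  :", to_db(math.pi * math.e / 6), "dB")
```

A uniform square QAM gives G = 1 (0 dB) under this definition, which is a convenient sanity check before evaluating shaped distributions.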

The FPGA fitting was successful at a clock frequency of 252 MHz. Parameters of the implemented HiSRDM are shown in Tab. 1, where M, $B_s$, $R_{\text{spg}}$, and $R_{\text{tot}}$ denote the number of QAM constellation points, symbol rate (baud rate), shaped information bit rate, and total information bit rate including sign bits, respectively. The maximum $R_{\text{spg}}$ and $R_{\text{tot}}$ are 774.1 Gb/s and 1.032 Tb/s, respectively. Tab. 2 shows the utilized resource elements and Fig. 3 illustrates the utilized area (cyan) in the FPGA, which consists of three dies. We mainly utilized LUTs as logic and registers in configurable logic blocks (CLBs), and block random access memory (RAM). No DSP slices or ultra RAM were employed. The utilization of block RAM shows a relatively high value of 46.76% (Tab. 2). There are several reasons why this is significantly larger than in [5], where it was only 5%. Firstly, we employed dual-port read LUTs in [5], which reduced the memory by 50%, whereas in this work we did not do so in order to simplify the implementation. Thus, there is room to reduce the utilization in future work. Secondly, the supported information bit rate and base constellation are significantly larger than in [5]. Thirdly, several LUTs in this work are larger than those in [5]. There are 2160 instances of 36 kb RAMs in the FPGA. When an LUT size is less than 18 kb, half of an instance is utilized. Note that all utilized block RAMs were used for DM encoding and decoding, and that the subcarrier rate matching employed CLB logic only.

Tab. 1. Parameters for implemented HiSRDM.

| M      | $B_s$ (Gbaud) | max $R_{\text{spg}}$ (Gb/s) | max $R_{\text{tot}}$ (Gb/s) | $g_R$ (bpcu) |
|--------|---------------|-----------------------------|-----------------------------|--------------|
| 16     | 129.024                   | 516.096                     | 1032.192                     | 9.766E-4              |
| 32     | 64.512                    | 387.072                     | 645.120                      | 1.953E-3              |
| 64     | 64.512                    | 516.096                     | 774.144                      | 1.953E-3              |
| 128    | 64.512                    | 645.120                     | 903.168                      | 1.953E-3              |
| (e)256 | 64.512                    | 774.144                     | 1032.192                     | 1.953E-3              |
| 256    | 32.256                    | 387.072                     | 516.096                      | 3.906E-3              |
| 1024   | 32.256                    | 516.096                     | 645.120                      | 3.906E-3              |
| 4096   | 16.128                    | 322.560                     | 387.072                      | 7.813E-3              |
| 16384  | 16.128                    | 387.072                     | 451.584                      | 7.813E-3              |
| 65536  | 16.128                    | 451.584                     | 516.096                      | 7.813E-3              |
| 262144 | 16.128                    | 516.096                     | 580.608                      | 7.813E-3              |
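The entries of Tab. 1 can be cross-checked with simple arithmetic. The relations used in the sketch below are assumptions inferred from the table values, not formulas stated in the paper: two polarizations, $R_{\text{tot}} = 2 B_s \log_2 M$, two sign bits per 2d symbol so that $R_{\text{spg}} = 2 B_s (\log_2 M - 2)$, and a granularity of one bit lane per clock cycle, $g_R = f_{\text{clk}}/(2 B_s)$, with the 252 MHz clock. The function name `rates` is hypothetical.

```python
F_CLK_GHZ = 0.252   # 252 MHz clock frequency from the text
N_POL = 2           # assumption: dual polarization (inferred from Tab. 1)
SIGN_BITS = 2       # assumption: two unshaped sign bits per 2d symbol


def rates(m_qam, bs_gbaud):
    """Return (max R_spg, max R_tot, g_R) in (Gb/s, Gb/s, bpcu)."""
    bits = m_qam.bit_length() - 1                    # log2(M) for M a power of two
    r_tot = N_POL * bs_gbaud * bits
    r_spg = N_POL * bs_gbaud * (bits - SIGN_BITS)
    g_r = F_CLK_GHZ / (N_POL * bs_gbaud)             # one lane per clock cycle
    return r_spg, r_tot, g_r


# (M, B_s in Gbaud) pairs taken from Tab. 1
modes = [(16, 129.024), (64, 64.512), (256, 32.256), (262144, 16.128)]
for m_qam, bs in modes:
    r_spg, r_tot, g_r = rates(m_qam, bs)
    print(f"M={m_qam:6d}  R_spg={r_spg:8.3f}  R_tot={r_tot:8.3f}  g_R={g_r:.3E}")
```

Under these assumptions the granularity scales inversely with the symbol rate, which is consistent with $g_R$ doubling each time $B_s$ is halved in Tab. 1.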

Tab. 2. Utilized resource elements.

| Classification | Element      | Used/Available | Utilization |
|----------------|--------------|----------------|-------------|
| CLB Logic      | LUT as logic | 364k/1182k     | 30.77%      |
|                | Register     | 781k/2364k     | 33.04%      |
| RAM            | Block RAM    | 1010/2160      | 46.76%      |



Fig. 3. Utilized area in FPGA (cyan).

#### 4. Conclusions

Hierarchical subcarrier rate and distribution matching was proposed and implemented in a single FPGA. Its hardware architecture helped to realize three key features: the highest throughput of 1.032 Tb/s, almost seamless rate granularity, and the largest base constellation of 262,144-QAM.

#### Acknowledgments

This work was supported in part by "Massively Parallel and Slice Optical Network," the commissioned research of the National Institute of Information and Communications Technology (NICT), and by "The research and development of innovative optical network technology as a new social infrastructure" (JPMI00316) of the Ministry of Internal Affairs and Communications, Japan. We also thank Prof. Kyo Inoue for assistance in the research.

#### References

- [1] G. Böcherer et al., "Bandwidth efficient and rate-matched low-density parity-check ...," IEEE Trans. Commun., 63(12), pp. 4651–4665, 2015.
- [2] F. Buchali et al., "Rate adaptation and reach increase by probabilistically shaped ...," J. Lightw. Technol., 34(7), pp. 1599–1609, 2016.
- [3] D. Che and W. Shieh, "Approaching the capacity of colored-SNR optical channels by ...," J. Lightw. Technol., 36(1), pp. 68–78, 2018.
- [4] H. Sun et al., "800G DSP ASIC design using probabilistic shaping and digital ...," J. Lightw. Technol., 38(17), pp. 4744–4756, 2020.
- [5] T. Yoshida et al., "FPGA implementation of distribution matching and dematching," ECOC 2019, Paper M.2.D.2.
- [6] Q. Yu et al., "FPGA implementation of prefix-free code distribution matching for probabilistic constellation ...," OFC 2020, Paper Th1G.7.
- [7] Q. Yu et al., "FPGA implementation of rate-adaptable prefix-free ...," J. Lightw. Technol., DOI: 10.1109/JLT.2020.3035039, 2020.
- [8] M. Binkai et al., "Demonstration of shallow probabilistic ...," IEICE Communications Express, DOI: 10.1587/comex.2020XBL0165, 2020.
- [9] A. Matsushita et al., "41-Tbps C-band WDM transmission with 10-bps/Hz spectral ...," J. Lightw. Technol., 38(11), pp. 2905–2911, 2020.
- [10] X. Chen et al., "Experimental demonstration of 4,294,967,296-QAM based Y-00 ...," Opt. Express, DOI: 10.1364/OE.405390, 2021.
- [11] D. Che, "Digital SNR adaptation of analog radio-over-fiber links carrying up to 1048576-QAM signals," ECOC 2020, Paper Th3B-1.
- [12] X. Chen et al., "16384-QAM transmission at 10 GBd over 25-km SSMF using polarization-multiplexed ...," ECOC 2019, Paper PD.3.3.
- [13] T. Yoshida et al., "Hierarchical distribution matching for probabilistically shaped ...," J. Lightw. Technol., 37(6), pp. 1579–1589, 2019.
- [14] F. R. Kschischang and S. Pasupathy, "Optimal nonuniform signaling Gaussian ...," IEEE Trans. Inf. Theory, 39(3), pp. 913–929, 1993.