# A Pipeline VLSI Design of Fast Singular Value Decomposition Processor for Real-time EEG System Based on On-line Recursive Independent Component Analysis

Kuan-Ju Huang, Wei-Yeh Shih, Jui Chung Chang, Chih Wei Feng, and Wai-Chi Fang, Fellow, IEEE

Abstract—This paper presents a pipeline VLSI design of fast singular value decomposition (SVD) processor for real-time electroencephalography (EEG) system based on on-line recursive independent component analysis (ORICA). Since SVD is used frequently in computations of the real-time EEG system, a low-latency and high-accuracy SVD processor is essential. During the EEG system process, the proposed SVD processor aims to solve the diagonal, inverse and inverse square root matrices of the target matrices in real time. Generally, SVD requires a huge amount of computation in hardware implementation. Therefore, this work proposes a novel design concept for data flow updating to assist the pipeline VLSI implementation. The SVD processor can greatly improve the feasibility of real-time EEG system applications such as brain computer interfaces (BCIs). The proposed architecture is implemented using TSMC 90 nm CMOS technology. The sample rate of EEG raw data adopts 128 Hz. The core size of the SVD processor is 580x580 um<sup>2</sup>, and the speed of operation frequency is 20MHz. It consumes 0.774mW of power during the 8-channel EEG system per execution time.

### I. INTRODUCTION

EEG signals are very week, and may be easily contaminated by artifacts. The independent component analysis (ICA) [1] can clearly separate the clean EEG signal and artifacts. In order to improve the feasibility and convenience of EEG system applications such as BCIs [2], a real-time ICA pre-processing is required. The proposed EEG system is based on the ORICA algorithm [3]. Because this type of ICA has a very fast convergence rate to reach receivable accuracy results, this algorithm is extremely suitable for real-time EEG applications. It produces a result from each independent component at each sample time, so we don't waste additional registers to buffer a segment of EEG raw data. Therefore, the features of this type of ICA not only give a fewer output delay time but also narrow down the hardware area. The ORICA algorithm can improve reaction time of independent component algorithm and make the combination of BCI and ICA more feasible.

The proposed SVD processor is utilized to calculate diagonal, inverse and inverse square root matrices of the target matrices during the whole system process. This processor is based on the coordinate rotation digital computer (CORDIC) algorithm [4]. This work proposes a novel SVD architecture which speeds up the calculated latency. There are two

\*Supported in part by the National Science Council of Taiwan, R.O.C.

The authors are with the Department of Electrical Engineering and Institute of Electronics, National Chiao Tung University, Hsinchu 30010, Taiwan,(e-mail: wfang@mail.nctu.edu.tw).

methods to lower the clock cycle numbers of SVD computations: 1) Reduce the iteration numbers directly of the computations. But unless it considerably revises the architecture of CORDIC, this method will cause the sharp decrease in the accuracy of results. 2) Alter the sequence of the updated data process. This method uses additional hardware resources to avoid structural hazard, and utilizes VLSI design skills to achieve pipeline design. However, this paper adopts the second method to achieve a low-latency and high-accuracy SVD processor.

The remainder of the paper is organized as follows. Section II demonstrates the processor in the EEG system. In section III the architecture of the SVD processor is presented. Section IV describes the details of the pipeline design consideration of the SVD processor. The simulation results are provided in Section V and finally a conclusion is given in Section VI.

## II. THE SVD PROCESSOR IN THE EEG SYSTEM

This section describes the proposed SVD processor used in the EEG system based on ORICA. Fig. 1 shows the flow chart of the EEG system. First of all, the purpose of the Whitening unit is to remove the correlation of each EEG raw data, and the whitening transformation is a linear operation. Vector Z, as shown in (1) and (2), will be uncorrelated as their variances equal to unity after being whitened. And then, vector Z is sent to ORICA Training unit to estimate an unmixing weight W and the independent component Y as shown in (3) to (5). Initially, W is first set to an identity matrix ensuring full rank. Then, the following calculation is applied to update W until  $\Delta$  W is close to zero.



Figure 1. The data flow diagram of an EEG system.

$$Cov (X) = E[X, X^{T}]$$
(1)

$$Z = Cov (X)^{-1/2} \times X$$
 (2)

$$Y = W \times Z \tag{3}$$

$$\Delta W = \frac{\lambda}{1-\lambda} \left[ W - \frac{y \times f^{T} \times W}{1+\lambda (f^{T} \times y - 1)} \right]$$
(4)

$$W = (W + \Delta W)^{-1/2} \times (W + \Delta W)$$
(5)

$$X_{c} = (W \times P)^{-1} \times Y_{c}$$
(6)

Finally, Y is delivered to Auto de-artifact and Inverse ICA unit for artifact rejection and EEG signals reconstruction. After rejection, Y is updated as  $Y_c$ . Then, the inverse ICA is performed to obtain eye blink artifact free EEG signal,  $X_c$  as shown (6).

The following paragraph describes the algorithm adopted in this SVD processor [5]. The Jacobi SVD (JSVD) algorithm has been proven to have very good performance in dealing with pseudo inverse, matrix approximation and other similar ill-posed problems. In linear algebra, any rectangular matrices  $A_{mxn}$  can be decomposed into three special matrices as shown in (7) by JSVD algorithm.  $U_{mxm}$  and  $V_{nxn}^T$  are both unitary matrices, and  $\Sigma_{mxn}$  is a rectangular diagonal matrix with singular values of matrix A on the diagonal.

$$A_{m \times n} = U_{m \times m} \sum_{m \times n} V_{n \times n}^{T}$$
(7)

And using the property of the unitary matrix, equation (7) can be written in the following form:

$$A = U \Sigma V^{T} \Longrightarrow U^{T} A V = \Sigma$$
(8)

A practical JSVD algorithm called the Kogbetlianz method performs two orthogonal side plane rotations to generate matrix  $U_{mxm}$  and  $V^{T}_{nxn}$  [6]. And as the iteration increases,  $A_{i+1}$  becomes more diagonal than  $A_i$ . The rotation equation is shown in (9).  $J_i^{\ l}$  and  $J_i^{\ r}$  are Jacobi rotation matrices and the matrix  $A_n$  will be transformed into the diagonal matrix  $A_0$  after n iterations.

$$A_{n} = (J_{n}^{l})^{T} (J_{n-1}^{l})^{T} ... (J_{0}^{l})^{T} A_{0} J_{0}^{r} ... J_{n-1}^{r} J_{n}^{r}$$
(9)

JSVD, based on CORDIC implementation, can be decomposed into 2-by-2 SVD problems, as shown in (10). To solve each 2-by-2 SVD problem, the two-side rotation matrices composed of  $\theta$  and  $\phi$  are applied to transform the matrix A into a diagonal matrix.

$$\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}^T A \begin{bmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{bmatrix} = \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix}$$
(10)

# III. THE PROPOSED SVD PROCESSOR ARCHITECTURE AND IMPLEMENTATION

The state diagram of the SVD processor flow control unit is shown in Fig. 2. In the beginning, the processor remains in the IDLE state. Once the data is received, the processor turns into State 1. In State 1, matrix L and matrix R will be set to unity matrix in the different single-port SRAMs. After that, the processor turns into Angle\_Fetch state. In this state, the SVD processor will catch four corresponding elements in matrix D. Then  $\theta_n$  and  $\phi_n$  are estimated in State 3. After the calculation of angles is completed, the signal START turns into "1". In State 4, the processor will catch row vectors of matrix L and D, and catch column vectors of matrix R and D. These vectors will be updated by vector CORDIC operation and store these data to the same location of the different SRAMs in State 6.

However, this work uses eight vector CORDIC engines in State 5 for reducing the calculation latency to estimate new vectors in matrix L, R, and D. If the iteration number, Itt\_num, of the vector CORDIC engine is less than Total\_ITER, this work is set to 5, the data flow will turn into State 4 periodically. However, the situation wherein the number of Iteration equals the total number of iteration results in State 5. This situation means that the processor finished the singular value decomposition of the target matrix. Finally, the processor calculates the inverse or inverse square root of the diagonal elements of matrix D in State 7. It then turns into State 0 while waiting for the next data.



Figure 2. State diagram of the SVD processor flow control unit.



Figure 3. The architecture of the vector CORDIC engine circuit



Figure 3. The data flow of the SVD processor

The architecture of the vector CORDIC engine circuit is shown in Fig. 3. The pipeline VLSI design used in this work is implemented in stage 1. The novel circuit is in front of the traditional circuit, stage 2 [4]. Stage 1 contains a distance modulation circuit, sixteen adders, and a comparator circuit to execute CORDIC iteration. And distance modulation circuit consists of sixteen multipliers to finish distance scaling of updated vectors. Although this additional circuit causes the hardware area to increase, the advantage is that the engine can finish comparing operations of about sixteen arc-tangent values in a clock period. Therefore, this way can not only decrease the total latency of SVD operation, but also increase the accuracy of the EEG system results.

# IV. PIPELINE DESIGN OF THE SVD PROCESSOR

Fig. 3 is the data flow during the SVD process. First, Angle\_CORDIC\_1 and Angle\_CORDIC\_2 fetch the red elements, A<sub>11</sub>, A<sub>12</sub>, A<sub>21</sub>, A<sub>22</sub>, in matrix A, and these Angle\_CORDICs calculates  $\theta_1$  and  $\phi_1$  respectively at the same time. Similarly, the Angle\_CORDICs will fetch the blue elements, A<sub>11</sub>', A<sub>13</sub>', A<sub>31</sub>', A<sub>33</sub>, in the updated matrix A to calculate  $\theta_2$  and  $\phi_2$  while on its second operation. After the angles estimation, Vector\_CORDICs catch  $\theta_1$  and the red elements, M<sub>11</sub>, M<sub>12</sub>, M<sub>13</sub>, M<sub>14</sub>, M<sub>21</sub>, M<sub>22</sub>, M<sub>23</sub>, M<sub>24</sub>, in matrices A and U to execute iterative operation. The SVD utilizes eight Vector\_CORDICs to update matrices A and U simultaneously. After the data updating, the latest data will be stored back in the same location of the original ones. In the same way, Vector\_CORDICs finish renewing matrices A and  $\Sigma$ .

$$T_{A} = C_{2}^{channels} \times (4 + 16) \tag{11}$$

$$T_{B} = [(2 + 1) \times channels + (4 + 16)] \times 2 \times T_{A}$$
 (12)

However, Angle\_CORDIC and Vector\_CORDIC spend total clock cycles as shown in (11) and (12). It's obvious that Vector\_CORDIC operation almost wastes latency in the whole SVD operation.

#### V. SIMULATION RESULTS

The pipeline SVD processor is used to solve the diagonal, inverse and inverse square root matrices of the target matrices. The simulation shows that the mean square errors are under 1%. And the execution time improved by the pipeline architecture of the proposed SVD processor is sufficient for the real-time and high accuracy requirements. Moreover, the performance assessment of the EEG system is based on the correlation coefficient, which is defined as the relationship between the EEG source and the ICA results.

Fig.5 shows the variation of mean error with the iteration number increasing. This work compares the result of the SVD processor with eig(A), where eig is MATLAB command for calculating the singular value decomposition of the target matrix. This work adopts five times the iteration number, Itt\_num, based on the simulation results. This setting is the most suitable trade-off between operation latency and accuracy requirements of the SVD processor. It is evident that the proposed SVD processor gives better performance as compared with Reference [7].



Figure 5. The correlation of the SVD processor

| 8            |                                        |               |                                       |                                                                                                                  |          |
|--------------|----------------------------------------|---------------|---------------------------------------|------------------------------------------------------------------------------------------------------------------|----------|
| -2<br>0<br>2 | 200                                    | 400           | 600                                   | 800                                                                                                              | 1000     |
| -20<br>-20   | 200                                    | 400           | 600                                   | 800                                                                                                              | 1000     |
| 20<br>-20    | 200                                    | 400           | 600                                   | 800                                                                                                              | <u> </u> |
| -2<br>0<br>2 | 200                                    | 400           | 600                                   | 800                                                                                                              | 1000     |
| -20<br>-20   | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ~~~~~<br>400  | ۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰۰ | 00800000000000000000000000000000000000                                                                           | 1000     |
| 5 HWWW       |                                        | พพพบนุ่านงงผม | พพการทุ่งพพพทา                        | $\mathcal{T}\mathcal{T}\mathcal{T}\mathcal{T}\mathcal{T}\mathcal{T}\mathcal{T}\mathcal{T}\mathcal{T}\mathcal{T}$ | WWWWW    |
| 20           | 200                                    | 400           | 600                                   | 800                                                                                                              | 1000     |
| \$ ANNWA     | MARIN MARINE MARINE                    | MARINA MARINA | MMM MMM MMM                           | $\sim$                                                                                                           | WWWWW    |
| - <u>_</u> 0 | 200                                    | 400           | 600                                   | 800                                                                                                              | 1000     |
| 0 FLF        | unit                                   |               |                                       | urýru                                                                                                            |          |
| -20          | 200                                    | 400           | 600                                   | 800                                                                                                              | 1000     |
|              |                                        |               |                                       |                                                                                                                  |          |



Figure 6. (a) Original EEG signal, (b) Simulation result of proposed EEG system

Fig. 6 shows the simulation results of the eight-channel EEG system. The original EEG signal is shown in Fig. 6(a), and the treated signal by this work is shown in Fig. 6(b). This work generates a randomly mixing matrix M = rand(ch;ch), where rand is MATLAB command for generating uniformly distributed random numbers. It is used to obtain mixtures matrix, Mix(n). And this matrix is the input data of the multi-channel EEG system based on the SVD processor. It is evident that the channel eight of both Fig. 6(a) and Fig. 6(a) gives faster convergence speed and obviously steadier state source separation.

It is particularly important that the correlation is very high, and this result vigorously proves that the proposed SVD processor is very efficient and feasible in this system. Finally, the core size of the SVD engine based on TSMC 90nm CMOS process library is 580x580 um<sup>2</sup>, and the speed of operation frequency is 20MHz. It consumes 0.774mW of power during the eight-channel EEG system per execution time. The chip layout is shown in Fig. 7.

# VI. CONCLUSION

In this paper, a low-latency and high-accuracy SVD processor is presented. It has also been shown that the proposed parallel architecture can improve 54.8% latency



Figure 7. SVD processor chip layout.

consumption in CORDIC operation. And the proposed SVD processor can actually benefit the EEG system to reach real-time computation and achieve a high accuracy with low mean errors and favorable correlation coefficient. Furthermore, the SVD processor gives faster convergence speed and better steady state during the EEG system process. Therefore, this work can greatly improve feasibility and convenience of the real-time EEG applications such as BCIs.

#### ACKNOWLEDGMENT

This work was supported in part by the National Science Council of Taiwan, R.O.C., under grant NSC101-2220-E-009-049 and NSC101-2221-E-009-169-MY2. The authors would also like to express their sincere appreciation to the National Chip Implementation Center for chip fabrication and testing service.

#### REFERENCES

- A. J. Bell and T. J. Sejnowski, "An information maximization approach to blind separation and blind deconvolution," *Neural Comput.*, vol. 7, no. 6, pp. 1129–1159, Nov. 1995.
- [2] Palumbo, A.; Calabrese, B.; Cocorullo, G.; Lanuzza, M.; Veltri, P.; Vizza, P.; Gambardella, A.; Sturniolo, M.; , "A novel ICA-based hardware system for reconfigurable and portable BCI," Medical Measurements and Applications, 2009. MeMeA 2009. IEEE International Workshop on , vol., no., pp.95-98, 29-30 May 2009.
- [3] Muhammad Tahir AKHTAR, Tzyy-Ping Jung, Scott Makeigy, and Gert Cauwenberghs, "Recursive Independent Component Analysis for online Blind Source Separation," in Proc. IEEE Int. Symp. on Circuits and Systems, May 20-23, 2012.
- [4] Meher, P.K.; Valls, J.; Tso-Bing Juang; Sridharan, K.; Maharatna, K.; , "50 Years of CORDIC: Algorithms, Architectures, and Applications," Circuits and Systems I: Regular Papers, IEEE Transactions on , vol.56, no.9, pp.1893-1907, Sept. 2009
- [5] Jun Ma; Parhi, K.K.; Deprettere, E.F.; , "An algorithm transformation approach to CORDIC based parallel singular value decompositions architectures," Signals, Systems, and Computers, 1999. Conference Record of the Thirty-Third Asilomar Conference on , vol.2, no., pp.1401-1405 vol.2, 24-27 Oct. 1999
- [6] G. Strangman, D. A. Boas, and J. P. Sutton, "Non-invasive neuroimaging using near-infrared light," *Biological Psychiatry*, vol. 52, pp. 679-693, 10/1 2002.
- [7] Shih Kang; Shih-Yang Wu; Yuan-Huang Hsu; Fu, C.; Wai-Chi Fang, "A VLSI design of singular value decomposition processor for portable continuous-wave diffusion optical tomography systems," Life Science Systems and Applications Workshop (LiSSA), 2011 IEEE/NIH, vol., no., pp.100,103, 7-8 April 2011.