**OMica Accelerator Architecture**

A schematic of the OMica architecture proposed here for the acceleration of universal matrix convolution is shown in Fig. 1. The light that carries the information of kernel *A* is replicated into multiple beams with different angular orientations by a 2D beam splitter, so each sub-beam carries all the information of kernel *A* and propagates along its own diffraction angle. Next, these sub-beams pass through a 4F lens system, which comprises a pair of Fourier lenses, L1 and L2, and projects multiple displaced images of kernel *A* at a magnification of ×1 onto the conjugate plane, where the feature map, matrix *B*, is located. Thus, once the sub-beams carrying the information of kernel *A* pass through matrix *B*, element-wise multiplication is realized simultaneously. The kernel’s sliding process is executed automatically with massive parallelism if an appropriate distance *d* between matrix *A* and the beam splitter is selected, making the displacement equal to the element spacing of matrix *B* (see Supplementary Note 1). Finally, each sub-beam carrying the element-wise product of a displaced kernel *A* and matrix *B* is truncated by the aperture of the modulator located at the plane of matrix *B*, and the transmitted field is focused by another convergent lens, L3, to execute the accumulation operation. Moreover, the accumulations for these sub-beams are inherently separated in the Fourier plane of lens L3 because the sub-beams differ in their angular spectra; thus, the final convolution matrix *C* is obtained from the 2D spot array on the Fourier plane, that is,

*C* = *A* ⊙ *B* (1)

where ‘⊙’ denotes the convolution operation. In addition, owing to the object–image conjugate configuration, the OMica accelerator proposed here possesses a large space-bandwidth product22-2433, which makes it possible to realize massive parallelism with sufficiently high accuracy.
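
The shift, multiply, and accumulate steps described above can be modelled numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function names are hypothetical, and the kernel flip is an assumption made so that the result matches the textbook definition of convolution (the text does not state whether the projected image is inverted). Each displaced copy of kernel *A* is multiplied element-wise by the part of *B* it overlaps, and the products are summed into one output spot.

```python
def conv2d_full(a, b):
    """Reference 'full' 2D convolution from the textbook definition."""
    ra, ca = len(a), len(a[0])
    rb, cb = len(b), len(b[0])
    out = [[0] * (ca + cb - 1) for _ in range(ra + rb - 1)]
    for i in range(ra):
        for j in range(ca):
            for p in range(rb):
                for q in range(cb):
                    out[i + p][j + q] += a[i][j] * b[p][q]
    return out


def shift_multiply_accumulate(a, b):
    """Model of the optical pipeline: one output spot per displaced copy of a."""
    ra, ca = len(a), len(a[0])
    rb, cb = len(b), len(b[0])
    # Assumption: the copy of the kernel that lands on B is flipped in both axes.
    flipped = [row[::-1] for row in a[::-1]]
    out = []
    for u in range(-(ra - 1), rb):          # vertical displacement of the copy
        row = []
        for v in range(-(ca - 1), cb):      # horizontal displacement of the copy
            total = 0
            for i in range(ra):
                for j in range(ca):
                    p, q = u + i, v + j
                    # Only light inside the aperture of B contributes.
                    if 0 <= p < rb and 0 <= q < cb:
                        total += flipped[i][j] * b[p][q]   # element-wise product
            row.append(total)               # accumulation into one spot
        out.append(row)
    return out


A = [[1, 2], [3, 4]]
B = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
C = shift_multiply_accumulate(A, B)
```

In this model every displacement is independent of the others, which is the property the optical system exploits to evaluate all of them at once.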

In our proof-of-concept implementation, a homemade 2D even Dammann grating (DG) was inserted into a 4F system (Fig. 1(a)), working as a beam splitter for the generation of multiple displaced images of convolution kernel *A* (see Methods and Supplementary Notes 1 & 5). Here, the even DG is the critical element for the realization of simultaneous translational sliding of kernel *A* on feature map *B*. Two spatial light modulators (SLMs) are located on the object and image planes of the 4F system, where the two convolution matrices, kernel *A* and feature map *B*, are dynamically loaded. In the experiment, the light intensity was used as the information carrier, and two amplitude-only SLMs were used to embed the information of kernel *A* and feature map *B* into the incident uniform light beam. Therefore, in principle, only nonnegative matrices can be loaded and calculated based on this hardware. To overcome this drawback, an arbitrary digital encoding method was developed for hybrid analog–digital optical convolution computing. In a hybrid analog–digital framework, one can easily decompose a negative arbitrary-bit matrix, in the same form as a positive matrix, into one larger-scale or several same-size low-bit matrices in spatial or temporal sequences, respectively25,26. In other words, both positive and negative numbers in the original matrix can be expressed as $\sum_{n=0}^{N-1} c_n \times (-2^k)^n$, where {*c*n} are *N* separate *k*-bit bytes, and each *c*n denotes a *k*-bit number, with *n* = 0, 1, 2, …, *N*−1. After this decomposition, a negative arbitrary-bit matrix is transformed into low-bit non-negative matrices, and it is possible to load these matrices on the SLMs. The principle of this encoding method is shown schematically in Fig. 1(b). Notably, there is a balance between computing precision and computing power, which can be tuned by changing the parameter *k*. 
A small *k* gives higher precision but lower computing power, whereas a large *k* gives higher computing power but relatively low precision. Therefore, compared with analog optical convolutional computing, this encoding method can improve the computing precision to some extent26.
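
As an illustration of this decomposition, the sketch below converts a signed integer into non-negative digits in a negative base. It assumes that the base is −2^*k* and that each digit lies in the range 0 to 2^*k* − 1, which agrees with the *k* = 1 example in Fig. 1(b); the function names `encode` and `decode` are hypothetical and do not come from the paper.

```python
def encode(value, k=1, n_digits=3):
    """Return digits [c_0, ..., c_{N-1}] with value == sum(c_n * (-2**k)**n)."""
    base = -(2 ** k)
    digits = []
    v = value
    for _ in range(n_digits):
        v, r = divmod(v, base)
        if r < 0:
            # Python's remainder takes the sign of the divisor (negative here),
            # so shift it into the non-negative range 0 .. 2**k - 1.
            r -= base
            v += 1
        digits.append(r)
    if v != 0:
        raise ValueError("value does not fit in n_digits digits")
    return digits  # least-significant digit first


def decode(digits, k=1):
    """Inverse of encode: weighted sum of the digits."""
    base = -(2 ** k)
    return sum(c * base ** n for n, c in enumerate(digits))


c0, c1, c2 = encode(-2, k=1, n_digits=3)   # the first element of the worked example
```

Every digit is non-negative, so each digit plane can be displayed on an amplitude-only modulator even when the original value is negative. A larger *k* needs fewer digits per value but demands more grey levels from each pixel.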

Here, as an example, the encoding process is demonstrated step-by-step for a matrix whose elements are quaternary (2-bit) numbers under the condition of *k* = 1. First, the quaternary number for each element of the high-bit matrix to be encoded is expressed as multiple low-bit elements. For example, the first element is written as $-2 = 0 \times (-2)^2 + 1 \times (-2)^1 + 0 \times (-2)^0$. Thus, each element of the matrix to be encoded becomes multiple elements in the encoded matrix. The elements of the matrix are arranged in rows after encoding, denoted as *P*1, *P*2, *P*3, *P*4, *P*5, and each element is encoded in the column direction with three bytes, denoted as *Bit*3, *Bit*2, and *Bit*1, as shown in Fig. 1(b). For example, the first element, −2, is expressed as {010} in the first column of the encoded matrix, that is, *c*2 = 0, *c*1 = 1, and *c*0 = 0. Then, the converted matrices are loaded onto the SLMs for computing in a temporal or spatial sequence. Notably, in a spatial sequence, some zero elements should be inserted into the encoded matrix between two adjacent rows or columns of the original high-bit matrix to avoid aliasing. In this situation, the physical pixels of the SLMs will not be fully utilized because of the redundant zero elements. On the other hand, the encoding method in temporal sequence takes full advantage of the physical pixels of the SLMs. However, the convolution must be executed between all bits of either kernel *A* or matrix *B*, and thus the refresh rate of the system will be greatly reduced. Therefore, a compromise should be struck between high computing power and high computing precision by choosing an appropriate parameter *k* when OMica hardware is used for computing acceleration in the hybrid analog–digital framework described above.
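
The temporal-sequence scheme relies on the linearity of convolution: the convolution of two encoded matrices is the weighted sum of the convolutions of their digit planes. The sketch below illustrates this under stated assumptions, namely that both matrices are encoded in base −2^*k* and that every pair of planes is convolved in its own pass. The helper names are hypothetical, and `conv2d_full` merely stands in for one optical pass.

```python
def conv2d_full(a, b):
    """Textbook 'full' 2D convolution; stands in for one optical pass."""
    ra, ca = len(a), len(a[0])
    rb, cb = len(b), len(b[0])
    out = [[0] * (ca + cb - 1) for _ in range(ra + rb - 1)]
    for i in range(ra):
        for j in range(ca):
            for p in range(rb):
                for q in range(cb):
                    out[i + p][j + q] += a[i][j] * b[p][q]
    return out


def to_digits(value, base, n_digits):
    """Non-negative digits of value in a negative base, least significant first."""
    digits = []
    for _ in range(n_digits):
        value, r = divmod(value, base)
        if r < 0:
            r -= base
            value += 1
        digits.append(r)
    if value != 0:
        raise ValueError("value does not fit in n_digits digits")
    return digits


def digit_planes(matrix, k, n_digits):
    """Split a signed matrix into n_digits non-negative matrices of equal size."""
    base = -(2 ** k)
    coded = [[to_digits(x, base, n_digits) for x in row] for row in matrix]
    return [[[cell[n] for cell in row] for row in coded] for n in range(n_digits)]


def hybrid_convolve(a, b, k=1, n_digits=3):
    """Convolve plane by plane, then recombine with weights (-2**k)**(m + n)."""
    base = -(2 ** k)
    rows = len(a) + len(b) - 1
    cols = len(a[0]) + len(b[0]) - 1
    total = [[0] * cols for _ in range(rows)]
    for m, plane_a in enumerate(digit_planes(a, k, n_digits)):
        for n, plane_b in enumerate(digit_planes(b, k, n_digits)):
            partial = conv2d_full(plane_a, plane_b)   # one low-bit pass
            weight = base ** (m + n)
            for r in range(rows):
                for c in range(cols):
                    total[r][c] += weight * partial[r][c]
    return total


A = [[-2, 1], [3, 0]]
B = [[1, -1], [2, 3]]
C = hybrid_convolve(A, B, k=1, n_digits=3)
```

In this sketch the number of passes is the product of the two digit counts (nine here), which shows why the effective refresh rate drops as the precision per pass is lowered.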

**Hybrid Analog–Digital Coding Matrix Convolution**

As an example, the hybrid analog–digital optical convolution of 10 pairs of random binary matrices, 10 pairs of quaternary matrices, and 10 pairs of quaternary matrices with negative elements is demonstrated. In our proof-of-concept experiment, the maximum matrix size loaded into the OMica hardware is about 10 × 10 because of the finite signal-to-noise ratio caused by the finite contrast of the reflective liquid crystal SLMs available. Fig. 2 compares the experimental results of the optical convolution of a pair of binary matrices and two pairs of quaternary matrices with the theoretical results. The results, typical of these three types of matrix convolutions, are shown in Figs. 2(a)–(c). In each box, the theoretical results obtained by an electronic computer (full precision, 64-bit) are illustrated in the first subfigure of the first row. The light intensity distributions of the spot arrays on the detection plane, denoting the raw results of the convolution, are shown in the second subfigure, and the experimental results before decoding are shown in the third subfigure. The absolute error map, defined as |*C*theo − *C*exp|, is shown in the last subfigure of the first row, where *C*theo and *C*exp are the theoretical and experimental convolutional results, respectively, and “|·|” denotes the absolute-value operation. In addition, the theoretical and experimental results of the convolution after decoding are shown in the first and second subfigures of the second row. The experimental and theoretical convolution results show consistent overall trends.

Fig. 2(a) shows the results of the convolution of two 10 × 10 binary matrices. The mean value of the absolute errors |*C*theo − *C*exp| is 0.240, and the maximum error is below 0.4, indicating that the computing accuracy after digitalization is 100%. Fig. 2(b) shows the results of the convolution of two 3 × 10 2-bit matrices. The mean value of the absolute errors is 0.114, and the maximum value is approximately 0.239 before decoding, which indicates that high precision is achieved by the OMica architecture. Fig. 2(c) shows the results of the convolution of two 2 × 10 2-bit matrices with negative elements. The mean error is 0.080, and the maximum value is approximately 0.145. It should be noted that the mean error of the binary case in Fig. 2(a) is greater than that of the other two cases, mainly due to the increased crosstalk resulting from relatively large convolution elements. Moreover, the other two encoded matrices are filled with zero elements to avoid aliasing, which further reduces their crosstalk and final error. Because the maximum absolute errors are all less than 0.5 for these three cases, the correct convolution results, with an accuracy of 100%, can still be obtained after digitalization. Thus, the experimental light intensity distribution of the three cases precisely reflects the values of the convolution results.

The error distribution of all 30 sets of matrix convolutions is shown in Fig. 2(d), and it is clear that the maximum absolute error is less than 0.5. This means that no errors will occur after the convolution results are digitalized, suggesting good reliability and robustness of the OMica architecture. Notably, the accuracy is related to the stability of the light source, contrast of the modulators, transferring ability of the imaging system, and sensitivity and dynamic range of the detector. In addition, we experimentally demonstrated the convolution results of larger-scale and higher-bit matrices, as shown in Figs. S12, S13, S14, and S15 (see Supplementary Note 6), where the convolution results of 3 × 3 1-bit and 20 × 20 1-bit matrices, two 10 × 10 8-bit matrices, 20 × 20 8-bit matrices, and 180 × 224 8-bit matrices are given, respectively.
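
The 0.5 threshold follows from rounding to the nearest integer: any measured value within 0.5 of an integer result rounds back to that integer. A minimal sketch of this digitalization step is given below. The measured values are invented purely for illustration and are not experimental data, and the function names are hypothetical.

```python
import math


def digitize(measured):
    """Round each analog reading to the nearest integer."""
    return [[math.floor(x + 0.5) for x in row] for row in measured]


def max_abs_error(theory, measured):
    """Largest element of the absolute error map |C_theo - C_exp|."""
    return max(
        abs(t - m)
        for row_t, row_m in zip(theory, measured)
        for t, m in zip(row_t, row_m)
    )


# Hypothetical readings, chosen so that every error stays below 0.5.
C_theo = [[2, 0], [1, 3]]
C_exp = [[2.24, -0.10], [0.61, 3.39]]

worst = max_abs_error(C_theo, C_exp)
recovered = digitize(C_exp)
```

As soon as a single reading deviates by more than 0.5, the rounded value lands on a neighbouring integer, so the maximum error, rather than the mean error, determines whether the digitalized result is exact.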

**CNNs based on MNIST**

Based on the abovementioned hybrid analog–digital coding method, we demonstrate the recognition of handwritten digits based on the OMica architecture. Here, a binary neural network (BNN)27 is implemented as an example to test the robustness and accuracy of the proposed optical hardware. For a BNN, the input signal is a binary (0 or 1) image, and the kernel is a binary matrix28 with weights of −1 or +1. Each kernel of the BNN trained in advance is divided into two sub-matrices: a low-bit (positive) matrix and a high-bit (negative) matrix, as shown in Fig. 3(a). Intuitively, it seems that two convolution operations should be executed in temporal sequence for each kernel. Interestingly, the 10 original kernels only need to be divided into 10 low-bit sub-kernels and one additional uniform high-bit sub-kernel. Furthermore, the first positive kernel and the negative kernel are exactly the same; thus, the total number of convolution kernels after encoding is still 10, which means that no additional computational overhead is incurred. The final convolution result is obtained by adding the positive convolution result to the negative convolution result multiplied by −2. Fig. 3(b) shows the inference process of the CNN based on encoding low- and high-bit kernels. The 10 encoded kernels are sequentially loaded onto the SLM located at the input plane of matrix *A*, and the binary input images with a size of 28 × 28 are loaded sequentially onto the SLM located at the input plane of matrix *B*. When light passes through the two SLMs in sequence and is then focused and separated by the focusing lens, the spot array denoting the convolution results is captured by the detector on the focal plane. Finally, the original convolution results are obtained by decoding the corresponding low- and high-bit convolutions.

Fig. 3(c) shows the error map between the theoretical and experimental results for an input image of the handwritten digit 7 convolved with the first kernel. Compared with the above three examples, which used input matrices of 10 × 10, a standard input image of handwritten digits has a size of 28 × 28, whereas the size of the convolution kernel is almost the same. The average value of the absolute errors is 0.405. This suggests that it is possible to calculate the optical convolution of larger-scale matrices using the OMica architecture with high precision. The subsequent pooling layer, nonlinear operations, and fully connected layers are executed by a classical electronic computer.

To validate the reliability and robustness of the system, we implemented blind testing on the first 1000 MNIST images, with serial numbers from 1 to 1000. The experimental results indicate that a blind-testing accuracy of up to 97.3% was achieved by the OMica convolution accelerator, whereas the recognition accuracy of an electronic computer on the same test dataset was only 96.7%. This is attributed to the fact that the computing error of the optical convolution also carries characteristics of the input images, which further strengthens the feature extraction ability. Notably, the error maps for different handwritten digits are highly correlated with the input image, as shown in Fig. 4. Therefore, the recognition accuracy of the optical convolution system on the MNIST dataset was slightly higher than that of the electronic computer, as shown in Fig. 3(d). By further optimizing the kernel weights of the optical convolution system, direct training of the optical CNN is expected to yield better results than those of an electronic computer. On this basis, the architecture can be effectively used as a hardware accelerator with large computing power in various DNNs.

To the best of our knowledge, the OMica architecture is the only optical parallel acceleration solution capable of achieving both high-precision convolutional computing and AI hardware acceleration with high recognition accuracy. In addition, not only convolution layers but also pooling layers and fully connected layers (all of which are linear convolution calculations) could be realized by the OMica architecture if an appropriate distance *d* (Fig. 1(a)) is chosen. For AI algorithms, it has been shown that very high accuracy is not required29, especially for inference tasks. Inference models work nearly as well with 4–8 bits of precision and can be trained with nearly 8–16 bits of precision per computation30. Our results suggest that the computing precision is close to 8 bits; thus, it is sufficiently accurate for most AI inference applications. Moreover, the computing accuracy could be improved to more than 8 bits if high-contrast modulators, such as DMDs, are employed, and it can be improved further by adopting the hybrid analog–digital encoding method. Thus, the results obtained by this optical accelerator would be adequate for training most AI models. In addition, when the neural network is trained directly in the OMica system, the physical characteristics of the system itself, such as alignment errors and crosstalk, are also trained, which is expected to further improve the performance of the neural network mentioned above.

An OMica accelerator for fully parallel universal convolution computing was proposed, and a hybrid analog–digital encoding scheme with sufficiently high precision was demonstrated. In principle, the convolution of an arbitrary-bit matrix can be calculated efficiently, with massive parallelism and sufficient accuracy, by using a suitable encoding scheme and the OMica architecture. Moreover, the convolution is universal, and the computing results obtained may be easy to transplant to any other computing platform. Our proof-of-concept experimental results demonstrate the feasibility of the optical convolution of 10 × 10 matrices with an accuracy above 8-bit, meaning that the results obtained by this optical accelerator are sufficiently accurate for most AI inference tasks, and even for training some AI models. Furthermore, a BNN for recognizing handwritten digits from the standard MNIST dataset was constructed, and the inference process was demonstrated based on this optical hardware. The results indicate that the blind-testing recognition accuracy is as high as 97.3%, which is even higher than that obtained with a purely electronic network. These proof-of-concept experimental results suggest that the OMica architecture can be used for massively parallel, high-precision, and high-efficiency AI accelerators, and this computing paradigm has potential applicability in the construction of task-specific cloud computing centers or other AI computing centers. By developing high-speed SLMs with higher contrast, optimizing a specially designed projection imaging system, and configuring a dedicated dot array lighting source, it is possible to construct a photonic coprocessor with higher computing power and lower energy consumption than state-of-the-art supercomputers, such as Fugaku, based on the proposed OMica architecture. 
In addition, the characteristics of the imaging system itself imply that the computing power of the system can be further increased by cascading multiple 4F systems and employing extra multiplexing degrees of freedom31,32; thus, a hybrid optical–electrical computer center or data center can be directly constructed. In the future, with the advancement of nonlinear optical elements33-35, a scheme based on the OMica architecture could also be integrated into pure photonic accelerators by combining planar waveguides36,37, metasurfaces38-40, or some other technologies41,42.

In summary, the OMica architecture is expected to be used in self-driving vehicles43, machine vision44, and other fields that require large computing power for real-time or quasi-real-time data processing. This opens the door for increasing the computing power and energy efficiency for convolution by using high-performance devices, such as larger-scale modulators with higher updating frequencies and detectors or detector arrays with wider dynamic range and higher sampling frequencies, which, in the near future, would be superior to the most powerful supercomputers.