Complementary metal-oxide-semiconductor (CMOS) image sensors have become an indispensable part of our data-driven world, where visual information prevails1,2. The front-end silicon photodiode array in a CMOS image sensor converts light into electrical currents. These electrical signals undergo analog-to-digital conversion and are then shuttled to a digital back-end for image processing. While this standard sequence of front-end image capture and back-end processing restricts the photodiode array to sensing, emerging machine vision applications would benefit from data processing within the photodiode array itself3,4. For example, in object tracking for self-driving vehicles, drones, or robots, where only the edges of objects are relevant5–8, extracting edges in the front-end photodiode array would be far more economical in energy expenditure, processing latency, required bandwidth, and memory usage than transferring the whole image, superfluous information included, to the back-end digital processor only to extract the edges there9.
Such in-sensor computing would require electrical modulation, or programming, of photocurrents. Indeed, in-sensor computing has recently been demonstrated with electrostatically doped photodiodes whose photocurrents can be modulated by gate biasing10,11. These pioneering works realized electrostatically doped photodiodes by gating two-dimensional (2D) transition metal dichalcogenide (TMD) layers or their van der Waals (vdW) stacks12–14. In contrast, such in-sensor computing is not possible with present CMOS image sensors, because they employ chemically doped silicon photodiodes, whose photocurrents are not amenable to electrical modulation. Here, we report in-sensor computing with an array of electrostatically doped silicon p-i-n photodiodes, which can replace the chemically doped silicon photodiode array while integrating seamlessly with the rest of the CMOS image sensor electronics. Such a silicon-based approach could expedite real-world applications of in-sensor computing owing to its compatibility with the mainstream CMOS electronics industry3,4,15,16. Concretely, we first demonstrate large-scale device production by fabricating thousands of dual-gate silicon p-i-n photodiodes at the wafer scale. We then perform in-sensor computing on serial optical images using a 3 × 3 network of these electrostatically doped photodiodes, electrically programming the network into 7 different convolutional filters.
Electrostatically doped silicon photodiodes
The photocurrent of a diode, Iph, grows with the power of the incident light, P, with the responsivity, R, being the proportionality constant17, i.e., Iph = R · P. A conventional, chemically doped photodiode exhibits a constant responsivity R, since the parameters that determine R, especially the doping densities of the p and n regions, are fixed. On the other hand, in an electrostatically doped photodiode, where the doping densities can be modified by gate biasing, R is electrically programmable. The electrostatically doped photodiode can thus perform analog multiplication between the incident light power P and the electrically programmed responsivity R. This programmable optoelectronic analog multiplication is the key to in-sensor image processing.
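As a minimal sketch, assuming an idealized, perfectly linear device, the photodiode can be modelled as a multiplier whose weight R is a signed, settable parameter. The class `ProgrammablePhotodiode` and the numbers used with it are hypothetical, and how R maps onto the gate biases is deliberately not modelled; the point is only that Iph = R · P, and that reprogramming R (including reversing its sign) changes the product for the same P.

```python
from dataclasses import dataclass


@dataclass
class ProgrammablePhotodiode:
    """Hypothetical idealized model: photocurrent = responsivity * optical power."""

    responsivity: float  # R in A/W; signed, since gating can reverse the photocurrent

    def program(self, responsivity: float) -> None:
        # In the real device R is set through the gate biases; here it is set directly.
        self.responsivity = responsivity

    def photocurrent(self, power: float) -> float:
        # Iph = R * P, with P in W and Iph in A
        return self.responsivity * power


# Illustrative values only (not measured)
diode = ProgrammablePhotodiode(responsivity=0.02)
i_pos = diode.photocurrent(1e-6)
diode.program(-0.02)  # same magnitude, opposite sign
i_neg = diode.photocurrent(1e-6)
```

A chemically doped photodiode corresponds to never calling `program`: R stays fixed, so the device can only sense, not multiply by a chosen weight.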
Our electrostatically doped photodiode is built on an intrinsic silicon wafer. It contains two contact electrodes (electrodes 1 and 2) that provide the current path, and two top gate metals that, when biased with voltages of equal magnitude and opposite sign, create electrostatically doped p and n regions in the silicon (Figs. 1a and b). The part of the silicon not covered by either gate metal is an intrinsic (i) region that acts as the channel of the device. This channel region is directly exposed to light from above. The contact and gate electrodes are interdigitated to achieve a high channel width/length ratio of 5576 µm/5 µm. Detailed fabrication steps are described in Methods and Supplementary Fig. 1. The resulting p-i-n diode exhibits standard rectifying behavior (Supplementary Fig. 2), which confirms the electrostatic doping. Swapping the signs of the two gate biases flips the polarity of the rectification (Supplementary Fig. 2), further verifying the electrostatic doping.
Illuminating the intrinsic channel region with light whose photon energy exceeds the silicon bandgap (~1.12 eV, corresponding to a wavelength of ~1,100 nm) generates a photocurrent. For this photocurrent generation mode, throughout this work, we bias both contact electrodes at zero voltage and define current flowing from electrode 1 to electrode 2 as positive. The photocurrent originates from light-excited electrons and holes, which are swept in opposite directions by the built-in potential (Vbi) of the diode; Vbi is in turn set by the doping densities of the p and n regions (Fig. 1b). Electrostatically altering these doping densities via the gate voltages changes Vbi, which can modify both the magnitude and the direction of the photocurrent for a given incident light power. In other words, the gate biases tune the responsivity R.
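The bandgap condition can be checked with the relation E = hc/λ, using hc ≈ 1239.84 eV·nm. The short sketch below (the function names are illustrative, not from the source) converts the wavelengths used in this work to photon energies and recovers the ~1,100 nm cutoff quoted for a 1.12 eV gap; 1,300 nm is included only as a below-gap counterexample. It treats absorption as a simple threshold and ignores the weak near-edge absorption of indirect-gap silicon.

```python
# h*c expressed in eV*nm
HC_EV_NM = 1239.841984

SILICON_BANDGAP_EV = 1.12  # approximate room-temperature value


def photon_energy_ev(wavelength_nm):
    """Photon energy E = hc / lambda, in eV."""
    return HC_EV_NM / wavelength_nm


def cutoff_wavelength_nm(bandgap_ev):
    """Longest wavelength whose photons can bridge the gap: lambda_c = hc / Eg."""
    return HC_EV_NM / bandgap_ev


def generates_photocurrent(wavelength_nm, bandgap_ev=SILICON_BANDGAP_EV):
    # Threshold model: only photons at or above the bandgap create electron-hole pairs
    return photon_energy_ev(wavelength_nm) >= bandgap_ev


for wl in (400, 473, 1300):
    print(f"{wl} nm: {photon_energy_ev(wl):.2f} eV, absorbed: {generates_photocurrent(wl)}")
print(f"cutoff: {cutoff_wavelength_nm(SILICON_BANDGAP_EV):.0f} nm")
```

Running it gives a cutoff close to 1,107 nm, consistent with the ~1,100 nm figure in the text.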
We demonstrate this dependence of R on the gate biases by measuring the photocurrent under a fixed-power, red-filtered halogen lamp that is periodically shuttered on and off, while the voltage on the gate above electrode 1 (VG,1) is stepped up from -3 V to 3 V in 0.5 V steps and, simultaneously, the voltage on the gate above electrode 2 (VG,2) is stepped down from 3 V to -3 V in 0.5 V steps (inset of Fig. 1c). The optical power of the light source, Psource, is 15 µW; it differs from, but is proportional to, the power P of the light incident on the device, with the scaling set by the device and/or beam area. The measured photocurrent, shown in Fig. 1c, exhibits the expected modulation of R by the gate bias voltages. Repeating this gate-controlled photocurrent modulation for ~50 min shows that the programming of R is stable (Supplementary Fig. 3).
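The bias schedule above can be sketched as two opposing voltage staircases. This is an illustrative reconstruction of the stated sweep only (the helper `gate_staircase` is hypothetical, and the shutter timing is not modelled); it makes explicit that the two gates always carry equal magnitudes of opposite sign, which is the same antisymmetric configuration later denoted Vp.

```python
def gate_staircase(start, stop, step):
    """Voltages from start to stop inclusive in equal steps (descending if stop < start)."""
    n_steps = round(abs(stop - start) / step)
    direction = 1 if stop >= start else -1
    # Build each level from an integer index so rounding errors do not accumulate
    return [start + direction * k * step for k in range(n_steps + 1)]


vg1 = gate_staircase(-3.0, 3.0, 0.5)  # gate above electrode 1: stepped up
vg2 = gate_staircase(3.0, -3.0, 0.5)  # gate above electrode 2: stepped down

for v1, v2 in zip(vg1, vg2):
    # At every step the two gates carry equal magnitudes of opposite sign
    print(f"VG,1 = {v1:+.1f} V, VG,2 = {v2:+.1f} V")
```

The sweep has 13 levels per gate; the light is shuttered on and off at each level to separate photocurrent from dark current.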
COMSOL Multiphysics simulations also confirm the operating principle of the electrostatically doped p-i-n diode. The gating clearly creates p and n regions, with band bending across the channel (Supplementary Fig. 4a-c), and the responsivity R changes with the gating as expected (Supplementary Fig. 4d; discussed further below in connection with Fig. 2c).
Programmable optoelectronic multiplication
We further investigate the dependence of the photoresponse on the gate voltages and, now, also on the light power (Fig. 2a). Figure 2b shows the photocurrent map with the two gate voltages swept independently, each from -5 V to 5 V in 0.1 V steps, while the photodiode is illuminated by blue laser light (473 nm) with a fixed Psource of 125 µW. When the two gate voltages are identical, i.e., VG,1 = VG,2, whether both are positive (n-n doping) or both negative (p-p doping), no overall potential gradient develops, and thus no photocurrent should be produced. The corresponding zero-current line, running from p-p to n-n, indeed lies close to the ideal positive diagonal; its slight deviation is possibly due to charge-carrier trapping at defects formed during fabrication. On the other hand, when we sweep the two gate voltages with the same magnitude but opposite signs, along the negative diagonal, the photocurrent increases monotonically from its negative maximum to its positive maximum, consistent with the monotonic change of Vbi from its negative maximum to its positive maximum (Fig. 1b). Figure 2c plots this photocurrent response along the negative diagonal as a function of VG,1 = -VG,2, which we denote the programming voltage Vp. This measured dependence of the photocurrent on Vp is also qualitatively consistent with the COMSOL simulation (Supplementary Fig. 4d). From here on, all gate biasing is configured as Vp = VG,1 = -VG,2.
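To make the geometry of the gate map concrete, the sketch below builds the 101-point gate axis (-5 V to 5 V in 0.1 V steps, so 101 × 101 = 10,201 bias pairs) and pulls out the two diagonals discussed above. The indexing convention `current_map[i][j]` with VG,1 = `vg[i]` and VG,2 = `vg[j]` is an assumption, and the map is a placeholder of zeros; measured photocurrents would be substituted in its place.

```python
def gate_axis(v_limit=5.0, steps_per_volt=10):
    """Gate voltages from -v_limit to +v_limit in 1/steps_per_volt V steps.

    Each value is k / steps_per_volt for an integer k, so the axis is exactly
    symmetric about 0 V: vg[n - 1 - i] == -vg[i].
    """
    k_max = round(v_limit * steps_per_volt)
    return [k / steps_per_volt for k in range(-k_max, k_max + 1)]


def diagonal_cuts(vg, current_map):
    """Return the two diagonals of current_map[i][j] (VG,1 = vg[i], VG,2 = vg[j]).

    vp_line:    (Vp, current) pairs along VG,1 = -VG,2 = Vp (negative diagonal)
    pp_nn_line: (V, current) pairs along VG,1 = VG,2 = V (ideally zero current)
    """
    n = len(vg)
    vp_line = [(vg[i], current_map[i][n - 1 - i]) for i in range(n)]
    pp_nn_line = [(vg[i], current_map[i][i]) for i in range(n)]
    return vp_line, pp_nn_line


vg = gate_axis()  # 101 points per gate
# Placeholder map of zeros; a measured 101 x 101 map would go here
placeholder_map = [[0.0] * len(vg) for _ in vg]
vp_line, pp_nn_line = diagonal_cuts(vg, placeholder_map)
```

Computing each voltage from an integer index keeps the axis exactly antisymmetric, so the index `n - 1 - i` lands precisely on VG,2 = -VG,1 without floating-point drift.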
Moreover, we demonstrate that Iph depends linearly on Psource, and therefore on P, for any given R programmed by tuning Vp. This linearity is important for high-fidelity analog multiplication between P and a given R. Figure 2d shows the measured Iph as a function of Psource (red-filtered halogen lamp) for various Vp (and thus R) values. Linear fits yield a high coefficient of determination (0.996, averaged across all Vp values), confirming the linear dependence of Iph on Psource, and thus on P, for each programmed value of R. Linearity is also confirmed for other wavelengths of incident light (Supplementary Fig. 5).
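The linearity figure of merit can be reproduced with an ordinary least-squares fit and its coefficient of determination. The sketch below is a generic implementation; we assume, without confirmation from the source, that the reported 0.996 is the plain average of per-Vp values computed this way. The `curves` mapping (Vp to a pair of power and current lists) is a hypothetical data layout.

```python
def linear_fit_r2(x, y):
    """Least-squares fit y = slope * x + intercept; returns (slope, intercept, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if ss_tot == 0:
        # r2 is undefined when every current is identical (e.g. R programmed to zero)
        raise ValueError("r2 undefined for constant data")
    return slope, intercept, 1.0 - ss_res / ss_tot


def mean_r2(curves):
    """Average r2 over Iph-vs-Psource curves, keyed by programmed Vp."""
    r2_values = [linear_fit_r2(p, i)[2] for p, i in curves.values()]
    return sum(r2_values) / len(r2_values)
```

The fitted slope of each curve is proportional to the programmed responsivity at that Vp, so the same fit yields both the weight and a check that it is applied linearly.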
Wafer-scale characterization of electrostatically doped silicon photodiodes
Electrostatically doped silicon photodiodes may accelerate the real-world realization of in-sensor computing because they are suitable for large-scale integration with CMOS electronics. To demonstrate this, we fabricated 4,900 dual-gate p-i-n silicon photodiodes in-house on a 4-inch silicon wafer (Fig. 3a, left) using a CMOS-compatible process (see Methods). The wafer features 7 × 7 = 49 reticles, each containing 10 × 10 = 100 photodiodes (Fig. 3a, right).
Figure 3b shows photocurrent maps of an example reticle, obtained by serially illuminating its 100 photodiodes, one diode at a time, with 400 nm LED light at a fixed Psource of 170 µW, for various Vp values (-5 V to 5 V in 2 V steps, arranged clockwise from the top-right corner). These maps show high device-to-device uniformity in responsivity programming within the reticle. In a wafer-scale photocurrent measurement of a 5 × 5 reticle array (2,500 photodiodes) using an automated probe station, with Vp varied from -5 V to 5 V in 0.1 V steps, 2,372 devices showed programmable responsivity (~95% yield). Specifically, as Vp was swept, the photocurrents of these 2,372 devices under the same 400 nm LED illumination (Psource = 170 µW) ranged from -380 ± 50 nA to 430 ± 47 nA (Fig. 3c). Figure 3d shows the distribution of the 2,372 photocurrents for selected Vp values (-5 V to 5 V in 1 V steps); device-to-device variations are more pronounced than within the single reticle, as is typical at the wafer scale.
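The device counts and yield quoted above follow from simple arithmetic, sketched below. The helper names are illustrative, and we assume the ± figures are sample standard deviations across devices at a given Vp; the text does not state which spread statistic is used.

```python
import statistics

RETICLES_ON_WAFER = 7 * 7
DIODES_PER_RETICLE = 10 * 10
MEASURED_RETICLES = 5 * 5


def yield_fraction(n_working, n_tested):
    """Fraction of tested devices that showed programmable responsivity."""
    return n_working / n_tested


def mean_and_std(currents_na):
    """Mean and sample standard deviation of photocurrents (nA) at one Vp (assumed statistic)."""
    return statistics.mean(currents_na), statistics.stdev(currents_na)


n_fabricated = RETICLES_ON_WAFER * DIODES_PER_RETICLE  # 4,900 devices on the wafer
n_tested = MEASURED_RETICLES * DIODES_PER_RETICLE      # 2,500 devices probed
print(f"fabricated: {n_fabricated}, tested: {n_tested}")
print(f"yield: {yield_fraction(2372, n_tested):.1%}")
```

The exact yield, 2,372/2,500 = 94.9%, rounds to the ~95% quoted in the text.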
Optoelectronic convolutional image processing in a photodiode network
We connect 9 photodiodes as shown in Fig. 4a so that each photodiode performs analog multiplication between the incident light power and its programmed responsivity, and the resulting 9 photocurrents are summed, or accumulated, via Kirchhoff's current law. The photocurrent sum resulting from this analog multiply-accumulate (MAC) operation is the dot product of the 1 × 9 vector of incident light powers and the 1 × 9 vector of programmed responsivities. Consequently, the 9-photodiode network of Fig. 4a serves as an optoelectronic convolutional processor, with the 1 × 9 vector of programmed responsivities (equivalently, the 3 × 3 map of responsivities programmed across the photodiode array) serving as an image filter kernel. The accumulated photocurrent is converted to an output voltage (Vout) by a transimpedance amplifier on a printed circuit board (PCB). Our measurement system is detailed in Supplementary Fig. 6.
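In idealized form, the network output is the dot product of the flattened 3 × 3 power patch and the flattened 3 × 3 responsivity map. The sketch below assumes ideal diodes and lossless current summation, and it stops at the accumulated current because the transimpedance gain and sign convention of the PCB amplifier are not given here. The function names and example values are hypothetical.

```python
def flatten_3x3(grid):
    """Row-major flattening of a 3 x 3 grid into a length-9 vector."""
    return [value for row in grid for value in row]


def optoelectronic_mac(powers, responsivities):
    """Accumulated photocurrent of the parallel-connected photodiodes.

    Each diode contributes Iph_k = R_k * P_k; Kirchhoff's current law sums the
    contributions at the shared node, giving the dot product sum_k R_k * P_k.
    """
    if len(powers) != len(responsivities):
        raise ValueError("need one responsivity per photodiode")
    return sum(r * p for r, p in zip(responsivities, powers))


# Hypothetical responsivity map (A/W) and incident powers (W)
kernel = [[0.01, 0.0, -0.01]] * 3
patch = [[2e-6, 1e-6, 0.0]] * 3
i_total = optoelectronic_mac(flatten_3x3(patch), flatten_3x3(kernel))
```

Because each weight can be negative, the network can realize kernels with subtractive terms without any separate digital subtraction step.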
With an image filter kernel programmed, the 9-photodiode network not only captures an input scene but also processes it simultaneously. Figures 4b-d show an example in which the network finds the edges of a moving light spot. We program the photodiode network with the responsivity map of Fig. 4b by independently tuning the Vp of each photodiode (see Methods). This filter kernel is designed for edge detection along the x-axis: it produces positive and negative photocurrents when the photodiode network is at the right and left edges of the light spot, respectively, and negligible photocurrents otherwise. Figure 4c shows Vout monitored as a light spot from an LCD projector (intensity level 255 out of 255, green channel only) moves from left to right at a frequency of 4 Hz, demonstrating consistent positive (6 V) and negative (-6 V) responses as the spot passes over the array. We have evaluated this dynamic processing up to a spot movement frequency of 500 Hz (Fig. 4d), the maximum frequency of our optical setup.
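The expected sign pattern can be illustrated with a toy simulation. The kernel below (left column +1, right column -1) is an assumed stand-in chosen to match the described behavior, not the measured responsivity map of Fig. 4b, and the spot is idealized as a uniformly bright vertical bar sliding across a fixed 3 × 3 array occupying columns 0 to 2.

```python
# Assumed x-edge kernel: +1 in the left column, -1 in the right column (hypothetical)
EDGE_X_KERNEL = [
    [1.0, 0.0, -1.0],
    [1.0, 0.0, -1.0],
    [1.0, 0.0, -1.0],
]


def patch_for_spot(spot_left, spot_width, brightness=1.0):
    """3 x 3 intensity patch on a fixed array occupying columns 0, 1, 2.

    The light spot is a vertical bar covering columns spot_left <= x < spot_left + spot_width.
    """
    row = [brightness if spot_left <= c < spot_left + spot_width else 0.0 for c in range(3)]
    return [row[:] for _ in range(3)]


def network_response(patch, kernel=EDGE_X_KERNEL):
    """Multiply-accumulate of the patch with the programmed kernel."""
    return sum(k * p for krow, prow in zip(kernel, patch) for k, p in zip(krow, prow))


# Spot moves left to right: its right edge reaches the array first, then its left edge
responses = [network_response(patch_for_spot(s, 5)) for s in range(-6, 4)]
print(responses)
```

The output is positive while the spot's right edge straddles the array, zero when the array is fully lit or fully dark, and negative as the left edge passes, mirroring the alternating positive and negative Vout of Fig. 4c.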
Building on this simple example, we perform in-sensor processing of a 256 × 256 pixel image (Fig. 5a; 8-bit grayscale) with the contrast inversion filter kernel (Fig. 5b) programmed into the 9-photodiode network (see Methods for programming). Projecting a 3 × 3 patch of the image onto the photodiode network with an LCD projector (green channel only) yields an accumulated photocurrent as the outcome of the optoelectronic convolution. By sliding the 3 × 3 patch across the 256 × 256 image and repeating the optoelectronic convolution at each position, we generate a 254 × 254 matrix of accumulated photocurrents (64,516 in total), which represents the image processed with the contrast inversion filter kernel (Fig. 5c).
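The sliding-window procedure is a stride-1, unpadded ("valid") 2D filtering, which is why a 256 × 256 input yields (256 − 3 + 1)² = 254² = 64,516 outputs. The sketch below is a digital stand-in for that loop; it applies the kernel without flipping (cross-correlation), which we assume matches how the responsivity map is laid over each projected patch. The test image and box-blur kernel are illustrative only; the contrast inversion kernel of Fig. 5b would be substituted.

```python
def optoelectronic_convolution(image, kernel):
    """Slide a kernel over an image with stride 1 and no padding ('valid' mode).

    Each output value is the multiply-accumulate of one patch with the kernel,
    i.e. one accumulated photocurrent of the photodiode network. The kernel is
    not flipped (cross-correlation), as is common in image processing.
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(h - kh + 1):
        out_row = []
        for j in range(w - kw + 1):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    acc += kernel[a][b] * image[i + a][j + b]
            out_row.append(acc)
        out.append(out_row)
    return out


# Synthetic 256 x 256 8-bit test pattern (not the image of Fig. 5a)
image = [[(r * 7 + c * 3) % 256 for c in range(256)] for r in range(256)]
box_blur = [[1 / 9] * 3 for _ in range(3)]
result = optoelectronic_convolution(image, box_blur)
print(len(result), len(result[0]))  # 254 254
```

In the hardware, the two inner loops collapse into a single physical current summation, so only the outer scan over patch positions remains sequential.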
Besides contrast inversion, we have performed in-sensor image processing with 6 other widely used filter kernels11,18–22: difference of Gaussians (DoG), Gaussian blurring, image sharpening, box blurring, horizontal Sobel, and vertical Sobel filters (Supplementary Fig. 7). We compared the 63 programmed photocurrent values (9 per filter for a total of 7 filters), measured with a fixed Psource from the LCD projector (green channel only) and corresponding to the 63 programmed R values, with their target values, which range from -2 µA to 4 µA; the maximum error was 18 nA. Since the ratio of the maximum error to the 6 µA target range, 18 nA/6 µA ≈ 1/333, lies between 1/2⁹ and 1/2⁸, the programming accuracy is 8 bits. The images projected with the LCD projector (green channel only) and processed with these filter kernels are shown in Fig. 5e, bottom; the Sobel-filtered image in Fig. 5e, bottom, is a composite produced by the root sum of squares of the horizontal and vertical Sobel-filtered images (Supplementary Fig. 8)20. The juxtaposition of these images, processed in the analog domain within the photodiode array (Fig. 5e, bottom), with those computed digitally (Fig. 5e, top) verifies our in-sensor computing scheme.
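The bit-accuracy argument and the Sobel composite can both be written in a few lines. In this sketch the criterion "worst-case error no larger than 1/2ⁿ of the target range" is our reading of the argument above, and the function names are illustrative; the numbers are those quoted in the text.

```python
import math


def programming_bits(max_error, target_min, target_max):
    """Largest n such that max_error <= (target_max - target_min) / 2**n."""
    ratio = max_error / (target_max - target_min)
    return math.floor(math.log2(1.0 / ratio))


def sobel_magnitude(gx, gy):
    """Pixel-wise root sum of squares of horizontal and vertical Sobel outputs."""
    return [[math.hypot(x, y) for x, y in zip(row_x, row_y)] for row_x, row_y in zip(gx, gy)]


# 18 nA worst-case error over a -2 uA to 4 uA target range
bits = programming_bits(18e-9, -2e-6, 4e-6)
print(bits)
```

Here 1/ratio ≈ 333, and log2(333) ≈ 8.4, so the floor gives 8 bits: the error is below one part in 2⁸ of the range but not below one part in 2⁹.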