A MEMS based Real-Time Structured Light 3-D Measuring Architecture on FPGA

With its ability to measure the three-dimensional information of objects without contact and its extremely high accuracy at close range, structured light 3-D measurement is widely used in various fields. However, some application scenarios, such as measuring moving objects and performing measurements in confined spaces, impose requirements for high speed and miniaturization in structured light 3-D measurement. Therefore, we propose a real-time structured light 3-D measurement system on FPGA. This system employs a four-step phase-shifting method to compute wrapped phases, complementary Gray code for phase unwrapping, and a cubic polynomial fitting approach for calculating the 3-D coordinates of points. We propose an optimized pipeline structure for each module, as well as an optimized on-chip buffer structure to further improve throughput.


Introduction
With recent trends in various industries moving towards intelligence, 3-D measurement has received much attention. Techniques such as time of flight [5], binocular stereo vision [15], speckle, and structured light have been cutting-edge topics. Because of its ability to measure the three-dimensional information of objects without contact and its extremely high accuracy at close range, structured light three-dimensional reconstruction technology has been widely applied in fields such as electronic hardware production, automotive manufacturing, 3-D printing, and defect detection [6]. With the development of industrial applications, how to improve the speed of structured light 3-D measurement, especially phase-based fringe projection profilometry (FPP), has aroused broad concern.
This paper proposes a MEMS-based real-time structured light 3-D measurement system on FPGA. In this architecture, we have designed hardware modules based on a fully pipelined parallel structure for the steps of the FPP algorithm. We have also optimized the cache structure, significantly enhancing the system's bandwidth and throughput. The entire algorithm has been re-implemented at the logic gate level in Verilog. Additionally, we have accelerated the least squares fitting of the cubic polynomial using the Schur complement method and designed a hardware structure for this fitting. Experimental results further indicate that the accuracy of this system is comparable to that of the bucket algorithm implemented in Python.
Related work

Improve algorithm to reduce patterns
Many researchers have proposed methods to improve the speed of structured light 3-D measurement. First, reducing the number of projected patterns is an obvious and effective approach, since it shortens camera imaging time and leaves less data to process. Transformation methods, such as Fourier transform profilometry (FTP) [14], [13], the windowed Fourier transform [7], and the wavelet transform [12], are typical single-shot FPP methods that use only a single fringe image and extract its wrapped phase map through a bandpass filter. However, traditional FTP methods have low measurement accuracy on non-smooth and geometrically complex surfaces.
In addition, researchers have developed spatial encoding methods based on two-dimensional patterns, utilizing Pseudo-Random Binary Arrays (PRBA) to mark the surface of objects [11], [8]. Another approach arranges a set of miniature patterns in a pseudo-random two-dimensional array, ensuring the uniqueness of each sub-window [2].
However, such spatial encoding methods often have limited resolution and cannot achieve sub-pixel measurement accuracy comparable to traditional phase-shifting methods. Consequently, efforts have been made to enhance traditional phase-shifting techniques, leading to a color-coded digital fringe projection technology for high-speed three-dimensional surface profiling [4]. This color-coded method, however, is susceptible to crosstalk between color channels. The susceptibility is particularly pronounced when the object's surface is itself colored; in such cases the encoding can be affected to the point of becoming impractical or unusable.

Utilizing hardware acceleration
With the rapid advancement of Graphics Processing Units (GPUs), Digital Signal Processors (DSPs), Application-Specific Integrated Circuits (ASICs), and Field-Programmable Gate Arrays (FPGAs) in recent years, hardware acceleration of traditional structured light 3-D measurement algorithms has become another research hotspot. Zhang developed a GPU-assisted phase-shifting fringe projection profilometry (FPP) system based on a 2+1 phase-shifting algorithm [19]. They offloaded the coordinate calculations from the CPU to the GPU, enhancing the speed of coordinate computation and achieving 25.56 fps for images with a resolution of 532*500 pixels. However, GPUs are costly and power-intensive, making them unsuitable for mass production and miniaturization. Zhan designed a phase-shifting measurement hardware architecture based on FPGA, which achieved 3-D reconstruction from 3×4 phase-shifted images of 1024*768 pixels within 21 ms [17]. However, the design of the high-speed circuitry in the FPGA did not take timing stability into account. Hess implemented an entire phase-based FPP system on the FPGA portion of an FPGA+ARM architecture, including image acquisition (BCON interface), lens distortion and image correction, phase unwrapping, phase matching, and the 3-D reconstruction algorithm [3]. Compared to previous methods, the main improvement made by Liu is the polynomial fitting of the three-dimensional coordinates of points against their pixel coordinates and phases [9].
Chen [1] proposed a phase-based FPP algorithm on FPGA, employing a coordinate calculation method similar to that of stereo vision. The approach involved significant parallelization and pipelining, but there is still room for optimization in the design of the cache structure.

The proposed hardware
This article presents optimized hardware for the fringe projection profilometry (FPP) algorithm on FPGA. The three main modules have been designed for pipelining, and the buffer structure has been redesigned to accelerate the entire algorithm implementation.
The images captured by the camera first enter a double buffer, which the memory control module writes to the external DDR. Subsequently, the memory control module transfers the corresponding image data to the caches of the Gray code decoding module, the phase shift decoding module, and the line shift decoding module, respectively. After each decoding module completes its calculations, the results are sent to the coordinate calculation module for the computation of the absolute phase.
At the same time, the memory control module retrieves coefficient data for calculating point cloud coordinates from the DDR and transfers it to the coordinate calculation module. Using the absolute phase and the corresponding coefficient data, the coordinate calculation module calculates the three-dimensional point cloud coordinates and writes them into the DDR.

Fig. 1 The architecture of structured light 3-D measuring

The buffer structure of each module
The buffer structure of each decoding module is shown in Fig. 2. The entire buffer consists of 3-4 blocks of BRAM (determined by the decoding module). Taking the phase unwrapping module's cache as an example, there are a total of 4 BRAM blocks. Each block stores the grayscale data at the same pixel coordinates from one of 4 different Gray code patterns. When the grayscale data of the 5th Gray code pattern arrives, the data for the same pixel coordinates from all 5 images is sent together to the calculation unit for processing. Organizing the cache this way minimizes on-chip BRAM usage, conserves resources, reduces the time data spends in the cache, and improves efficiency.
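As an illustration, this banking scheme can be modeled in software. The following Python sketch is an illustrative model only, not the Verilog implementation; `PatternBuffer` and its methods are hypothetical names. It shows how pixels buffered from four patterns are released as one bundle when the matching pixel of the fifth pattern streams in:

```python
from collections import deque

class PatternBuffer:
    """Software model of the decoder cache: one BRAM-like bank per
    buffered pattern, each holding pixels at the same coordinates."""
    def __init__(self, n_banks=4):
        self.banks = [deque() for _ in range(n_banks)]

    def write(self, pattern_idx, pixel):
        # a pixel of pattern `pattern_idx` arrives and is buffered
        self.banks[pattern_idx].append(pixel)

    def bundle(self, streamed_pixel):
        # when a pixel of the final pattern streams in, pop the pixel at
        # the same coordinates from every bank and release them together
        return tuple(b.popleft() for b in self.banks) + (streamed_pixel,)

buf = PatternBuffer(4)
for p in range(4):
    buf.write(p, 10 + p)      # the same pixel from patterns 1..4
group = buf.bundle(99)        # pattern 5 arrives: (10, 11, 12, 13, 99)
```

Because the bundle is emitted as soon as the last pattern arrives, no pixel waits in the cache longer than necessary, which mirrors the stated goal of reducing buffering latency.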
Fig. 2 The buffer structure of wrapped phase calculation module and phase unwrapping module

Wrapped phase calculation module
At present, a commonly used method for calculating the wrapped phase is the four-step phase-shifting method. Its main steps are as follows. First, four fringe patterns whose intensity varies according to a cosine law are generated, with initial phases of 0, π/2, π, and 3π/2, respectively. The grayscale distribution of these patterns can be represented as:

I_i(x, y) = I′(x, y) + I′′(x, y) cos[ϕ(x, y) + (i − 1)π/2],  i = 1, 2, 3, 4    (2)

where x and y are the row and column pixel coordinates, I_i(x, y) is the grayscale distribution of the i-th phase-shifted pattern, I′(x, y) is the background light intensity of the phase-shifted patterns, I′′(x, y) is the intensity modulation caused by the object, and ϕ(x, y) is the wrapped phase, which can be derived from Eqs. (2). Taking the inverse tangent of combinations of Eqs. (2) gives the wrapped phase value:

ϕ(x, y) = arctan[(I_4(x, y) − I_2(x, y)) / (I_1(x, y) − I_3(x, y))]    (3)

The four-step phase-shifting method needs four pixel grayscale values from four different phase-shifted patterns to obtain one wrapped phase value. Because the input data are not contiguous, memory access time increases. In addition, the arctangent calculation is well suited to parallelization and pipelining on an FPGA to accelerate the computation.
The structure of the wrapped phase calculation module is shown in Fig. 3. The grayscale data at the same pixel coordinates from the four phase-shifted images are input together for computation. Specifically, the grayscale values of the second phase-shifted image are subtracted from those of the fourth, and those of the third image are subtracted from those of the first. The two differences are then input into an arctan IP core for the arctangent calculation, and the computed results are stored in a cache.
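For reference, the arithmetic this module performs can be sketched in Python. This is an illustrative software model of the four-bucket computation, not the Verilog pipeline; `math.atan2` stands in for the arctan IP core, and the synthetic `fringe` generator is our own construction:

```python
import math

def wrapped_phase(i1, i2, i3, i4):
    # four-step phase shift with offsets 0, pi/2, pi, 3pi/2 gives
    #   I1 - I3 = 2*I''*cos(phi),  I4 - I2 = 2*I''*sin(phi)
    return math.atan2(i4 - i2, i1 - i3)

def fringe(phi, bias=128.0, mod=100.0):
    # synthetic intensities of one pixel across the four shifted patterns
    return [bias + mod * math.cos(phi + k * math.pi / 2) for k in range(4)]

i1, i2, i3, i4 = fringe(1.2)
phi = wrapped_phase(i1, i2, i3, i4)   # recovers 1.2 up to float error
```

Note that the background term I′ and the modulation I′′ cancel out of the two differences, which is why the hardware only needs two subtractors in front of the arctan core.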
This structure is replicated eight times to enable parallel computation for 32 pixels of data simultaneously.
Fig. 3 The structure of wrapped phase calculation module

Phase unwrapping module
Commonly used methods for phase unwrapping include the Gray code method and the multi-frequency heterodyne method. The multi-frequency heterodyne method requires more computation and has higher complexity than the traditional Gray code method, but its accuracy is higher. This is because the traditional Gray code method is prone to errors at the jumps of the Gray code: the black and white boundaries of the projected code are not sharp cutoffs and do not form ideal binary distributions. Therefore, after binarizing the image, the decoded level edges may be misaligned with the wrapped phase edges, resulting in phase unwrapping errors. Sun and Zhang proposed a complementary Gray code encoding method that avoids these errors in advance [16]. Compared to traditional Gray code methods, it projects one additional Gray code pattern, so that the codeword width of the last Gray code is half the period of the sinusoidal fringes. When decoding near wrapped phase edges, the middle portions of the codewords of the last two Gray code patterns are used, avoiding errors caused by Gray code edge jumps. The Gray code decoding algorithm mainly involves XOR operations between adjacent bits, which is very suitable for parallel acceleration on an FPGA.
The structure of the phase unwrapping module is shown in Fig. 4. In this module, the grayscale data at the same pixel coordinates in the five Gray code images are sent together to a comparator, where they are compared against a threshold: if the grayscale value is greater than the threshold, the comparator outputs 1; otherwise, it outputs 0. After that, each bit of the first four Gray codes is XORed with the previous bit to obtain the decoded codeword k_1. When k_1 is odd, the codeword of the fifth Gray code is added to k_1; when k_1 is even, adding 1 to k_1 and then subtracting the codeword of the fifth Gray code yields the complementary codeword k_2.
The absolute phase Φ is then obtained by unwrapping the wrapped phase ϕ ∈ [0, 2π) with the following formula:

Φ(x, y) = ϕ(x, y) + 2π k_2,        ϕ(x, y) ≤ π/2
Φ(x, y) = ϕ(x, y) + 2π k_1,        π/2 < ϕ(x, y) < 3π/2
Φ(x, y) = ϕ(x, y) + 2π (k_2 − 1),  ϕ(x, y) ≥ 3π/2
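In software, the decoding and unwrapping steps can be sketched as follows. This Python model assumes the standard complementary Gray code selection rule, which uses k_2 near the period boundaries and k_1 in the middle of the period; the function names are illustrative:

```python
import math

def gray_to_binary(bits):
    # MSB-first XOR cascade: each bit is XORed with the previous decoded
    # bit, matching the comparator-then-XOR chain in the module
    b, value = 0, 0
    for g in bits:
        b ^= g
        value = (value << 1) | b
    return value

def unwrap(phi, k1, k2):
    # fringe-order selection away from codeword edges (phi in [0, 2*pi))
    if phi <= math.pi / 2:
        return phi + 2 * math.pi * k2
    if phi < 3 * math.pi / 2:
        return phi + 2 * math.pi * k1
    return phi + 2 * math.pi * (k2 - 1)

k1 = gray_to_binary([0, 1, 0, 1])   # Gray codeword 0101 decodes to 6
```

Near a period boundary ϕ is close to 0 or 2π, exactly where the edges of the ordinary Gray code lie, so the rule switches to k_2, whose codeword edges sit half a period away.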

3-D coordinate calculation module
A commonly used method for obtaining 3-D coordinates is based on the pinhole imaging principle, which establishes the relationship between the world coordinate system, the image coordinate system, and the projector coordinate system. First, according to the pinhole model, the relationship between the 3-D coordinates of a point P in the world coordinate system and its image coordinates can be described as:

z_c [u_c, v_c, 1]^T = A_c M_c [x_w, y_w, z_w, 1]^T    (4)

The relationship between the world coordinates of P and its projector coordinates can be described as:

z_p [u_p, v_p, 1]^T = A_p M_p [x_w, y_w, z_w, 1]^T    (5)

where A_c and A_p are the internal parameter matrices of the camera and projector, respectively; M_c and M_p are the external parameter matrices of the camera and projector in the same world coordinate system; (u_c, v_c) are the coordinates of P in the image coordinate system; (u_p, v_p) are the coordinates of P in the projector coordinate system; and x_w, y_w, z_w are the coordinates of P in the world coordinate system. Meanwhile, the preceding wrapped phase calculation and unwrapping yield the absolute phase of the point, Φ(u_c, v_c), and the horizontal coordinate of the point in the projector coordinate system can be calculated using Eqs. (6):

u_p = Φ(u_c, v_c) W / (2π N_v)    (6)

where N_v is the number of fringe periods of the phase-shifted patterns and W is the horizontal resolution of the fringe pattern projected by the projector.

Fig. 4 The structure of the phase unwrapping module
The coordinates of the point in the world coordinate system can then be obtained from Eqs. (4), (5), and (6). Obviously, calculating the three-dimensional coordinates of a point in this way requires many matrix operations, including matrix inversion, which is complex and computationally expensive.
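To see why this route is expensive, the following Python sketch triangulates one point from Eqs. (4)-(6): each projective measurement contributes one linear equation, giving a 3×3 system that must be solved per point. The projection matrices `Pc` and `Pp` below are made-up example values, and Cramer's rule stands in for the matrix inversion:

```python
def det3(m):
    # determinant of a 3x3 matrix by cofactor expansion
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(M, b):
    # Cramer's rule: stands in for the per-point matrix inversion
    d = det3(M)
    return [det3([[b[i] if k == j else M[i][k] for k in range(3)]
                  for i in range(3)]) / d for j in range(3)]

def triangulate(Pc, Pp, uc, vc, up):
    # each measurement u gives one linear row: u*(p3 . X) = p_row . X
    rows, rhs = [], []
    for P, coords in ((Pc, (uc, vc)), (Pp, (up,))):
        for axis, val in enumerate(coords):
            rows.append([val * P[2][k] - P[axis][k] for k in range(3)])
            rhs.append(P[axis][3] - val * P[2][3])
    return solve3(rows, rhs)

# illustrative 3x4 projection matrices (A*M products), not calibration data
Pc = [[800.0, 0.0, 320.0, 0.0], [0.0, 800.0, 240.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
Pp = [[700.0, 0.0, 512.0, -140.0], [0.0, 700.0, 384.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
xw, yw, zw = triangulate(Pc, Pp, 400.0, 280.0, 442.0)  # -> (0.1, 0.05, 1.0)
```

Even this small example needs four 3×3 determinants per point, which makes clear why a per-pixel closed-form polynomial is attractive for hardware.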
Zhang proposed a method for calculating three-dimensional coordinates using cubic polynomial fitting in 2021 [18]. His research suggests that the absolute phase of a point in space is related to its coordinates in the camera coordinate system, and that this relationship can be described by a cubic polynomial equation:
x_c = a_0 + a_1 Φ + a_2 Φ² + a_3 Φ³
y_c = b_0 + b_1 Φ + b_2 Φ² + b_3 Φ³    (7)
z_c = c_0 + c_1 Φ + c_2 Φ² + c_3 Φ³

where a_{0..3}, b_{0..3}, and c_{0..3} are constant coefficients that vary from pixel to pixel. These coefficients can be obtained by calibration. While calibrating the internal and external parameters of the camera, structured light patterns can simultaneously be projected onto the calibration board, the absolute phase on the board computed, and the camera coordinates of points in the plane of the calibration board determined from the calibration results. Subsequently, a cubic polynomial fit yields these coefficients. Once the coefficients are obtained through calibration, the subsequent three-dimensional coordinate calculation for objects only requires evaluating the cubic polynomials in Eqs. (7), significantly improving the speed of three-dimensional measurement. However, the cubic polynomial fitting in the calibration step involves a large number of matrix operations, resulting in slow computation, which motivates hardware acceleration.
The three-dimensional coordinate calculation module in this article uses the cubic polynomial fitting method proposed by Zhang in 2021. The coefficients of the cubic polynomial for each point are generated during camera calibration. In this paper, the circle center calibration method is employed on the PC to calibrate the camera's intrinsic and extrinsic parameters. After obtaining the intrinsic and extrinsic parameter matrices, the camera coordinates of each pixel can be calculated using Eqs. (8). Because cubic polynomial fitting is time-consuming on the PC, this paper also explores accelerating it on an FPGA.
The fitting in this paper is based on the least squares method, assuming the cubic polynomial model:

y = a_0 + a_1 x + a_2 x² + a_3 x³    (9)

This paper uses 10 sets of phase and coordinate data (x_i, y_i), i = 1, ..., 10, to fit one set of coefficients. According to the least squares method, the fitting error is

E = Σ_{i=1}^{10} (y_i − a_0 − a_1 x_i − a_2 x_i² − a_3 x_i³)²

Setting the partial derivatives of E with respect to a_0, a_1, a_2, and a_3 to zero, we obtain the following matrix equation:

[ Σ1     Σx_i   Σx_i²  Σx_i³ ] [a_0]   [ Σy_i      ]
[ Σx_i   Σx_i²  Σx_i³  Σx_i⁴ ] [a_1] = [ Σx_i y_i  ]    (10)
[ Σx_i²  Σx_i³  Σx_i⁴  Σx_i⁵ ] [a_2]   [ Σx_i² y_i ]
[ Σx_i³  Σx_i⁴  Σx_i⁵  Σx_i⁶ ] [a_3]   [ Σx_i³ y_i ]

Let X be the 4×10 Vandermonde-type matrix of x_1, x_2, ..., x_10:

X = [ 1     1     ...  1     ]
    [ x_1   x_2   ...  x_10  ]
    [ x_1²  x_2²  ...  x_10² ]
    [ x_1³  x_2³  ...  x_10³ ]

Therefore, Eqs. (10) can be written as

(X X^T) a = X Y

where a = [a_0, a_1, a_2, a_3]^T and Y = [y_1, ..., y_10]^T. Then we obtain a = (X X^T)^{−1} X Y. In this paper, the Schur complement method is used to find (X X^T)^{−1}. The matrix X X^T can be expressed in 2×2 block form as shown in Equation (13):

X X^T = [ A  B ]    (13)
        [ C  D ]

Then the Schur complement of A is S = D − C A^{−1} B, and the inverse of X X^T is:

(X X^T)^{−1} = [ A^{−1} + A^{−1} B S^{−1} C A^{−1}    −A^{−1} B S^{−1} ]
               [ −S^{−1} C A^{−1}                      S^{−1}          ]

The overall structure of the cubic polynomial fitting module is shown in Fig. 5. Fig. 6 is the structural diagram of the X X^T calculation module, which computes the elements of X X^T. The matrix X X^T is composed of elements of the form Σ_i x_i^k; therefore, it is only necessary to calculate the 0th to 6th powers of x_1, x_2, ..., x_10 and accumulate the results to obtain all the elements of X X^T. Fig. 7 is the structural diagram for inverting a 2×2 matrix; the principle is the closed-form 2×2 inverse, in which each element of the adjugate is divided by the determinant of the matrix. Fig. 8 shows the structure of the 2×2 matrix multiplication module.

Fig. 5 The architecture of cubic polynomial fitting module
Fig. 6 The structure of X X^T calculation module
Fig. 7 The structure of 2×2 matrix inversion module
Fig. 8 The structure of 2×2 matrix multiplication module
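The fitting pipeline described above can be prototyped in Python as follows. This is an illustrative model, not the hardware: it forms the normal equations, partitions X X^T into 2×2 blocks, and inverts it via the Schur complement block formula; the function names are hypothetical:

```python
def inv2(m):
    # closed-form 2x2 inverse: adjugate divided by the determinant
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mul2(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def add2(p, q):
    return [[p[i][j] + q[i][j] for j in range(2)] for i in range(2)]

def sub2(p, q):
    return [[p[i][j] - q[i][j] for j in range(2)] for i in range(2)]

def neg2(p):
    return [[-v for v in row] for row in p]

def fit_cubic(xs, ys):
    # normal equations G a = r with G[j][k] = sum x^(j+k), r[j] = sum x^j * y
    G = [[sum(x ** (j + k) for x in xs) for k in range(4)] for j in range(4)]
    r = [sum((x ** j) * y for x, y in zip(xs, ys)) for j in range(4)]
    # 2x2 block partition of G = [[A, B], [C, D]]
    A = [row[:2] for row in G[:2]]
    B = [row[2:] for row in G[:2]]
    C = [row[:2] for row in G[2:]]
    D = [row[2:] for row in G[2:]]
    Ai = inv2(A)
    S = sub2(D, mul2(C, mul2(Ai, B)))   # Schur complement of A
    Si = inv2(S)
    TL = add2(Ai, mul2(mul2(Ai, B), mul2(Si, mul2(C, Ai))))
    TR = neg2(mul2(Ai, mul2(B, Si)))
    BL = neg2(mul2(Si, mul2(C, Ai)))
    Ginv = [TL[0] + TR[0], TL[1] + TR[1], BL[0] + Si[0], BL[1] + Si[1]]
    return [sum(Ginv[i][j] * r[j] for j in range(4)) for i in range(4)]

xs = [0.3 * i for i in range(10)]
ys = [2.0 + x - 0.5 * x**2 + 0.25 * x**3 for x in xs]
coeffs = fit_cubic(xs, ys)   # recovers approx [2.0, 1.0, -0.5, 0.25]
```

The Schur route replaces one 4×4 inversion with two 2×2 inversions plus 2×2 multiplications, which maps well onto the small dedicated inversion and multiplication modules of Figs. 7 and 8.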
Fig. 9 depicts a systolic array for 4th-order matrices, used for pipelined matrix multiplication. The array consists of 16 computing units, each comprising two sets of registers that hold the matrix elements flowing in from the previous unit or from the input, a multiplier, an adder, and a set of registers for the unit's accumulated result. During computation, the elements of the two matrices to be multiplied are streamed into the systolic array by rows and columns. Each computing unit multiplies its input elements and accumulates the result, while simultaneously passing the elements on to the adjacent units. After computation completes, the elements of the result matrix are read out from the result registers of the computing units.

Fig. 9 The structure of systolic array module
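The data movement of such an array can be sketched as a cycle-by-cycle software model. The sketch below assumes a common output-stationary convention (row i of the first matrix delayed by i cycles from the left, column j of the second delayed by j cycles from the top); it is an illustration, not the Verilog design:

```python
def systolic_matmul(A, B, n=4):
    # Output-stationary model: PE(i, j) accumulates sum_k A[i][k] * B[k][j].
    # With the stated skew, A[i][k] and B[k][j] meet at PE(i, j) on
    # cycle i + j + k, so the whole product takes 3n - 2 cycles.
    acc = [[0] * n for _ in range(n)]    # result register of each PE
    a_reg = [[0] * n for _ in range(n)]  # operands flowing rightwards
    b_reg = [[0] * n for _ in range(n)]  # operands flowing downwards
    for t in range(3 * n - 2):
        # sweep back-to-front so each PE reads its neighbour's previous value
        for i in range(n - 1, -1, -1):
            for j in range(n - 1, -1, -1):
                if j > 0:
                    a_in = a_reg[i][j - 1]
                else:                     # skewed injection from the left
                    a_in = A[i][t - i] if 0 <= t - i < n else 0
                if i > 0:
                    b_in = b_reg[i - 1][j]
                else:                     # skewed injection from the top
                    b_in = B[t - j][j] if 0 <= t - j < n else 0
                acc[i][j] += a_in * b_in              # multiply-accumulate
                a_reg[i][j], b_reg[i][j] = a_in, b_in # pass operands on
    return acc

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 3], [4, 0, 1, 0], [0, 5, 0, 1]]
C = systolic_matmul(A, B)   # matches the direct matrix product
```

The model makes the pipelining argument concrete: a 4×4 product finishes in 3n − 2 = 10 cycles, and new matrices can be streamed in behind the first without draining the array.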
At first startup, the camera coordinates and phase data obtained from calibration are input through the Ethernet port and saved into DDR. The data then pass through the various calculation modules to yield the coefficients of the cubic polynomials, which are saved on the SD card. The entire calibration process is performed only once, at first startup. Each subsequent time the FPGA powers up, it performs three-dimensional coordinate calculation by reading the polynomial coefficients from the SD card, eliminating the need for recalibration.
The structure of the three-dimensional coordinate calculation module is shown in Fig. 10. Coordinate calculation is carried out by evaluating the fitted cubic polynomials. The four coefficients of each cubic polynomial are generated during camera calibration and stored in external memory. When extracting point clouds, the memory control module transfers the coefficient data to the cache of the coordinate calculation module. During coordinate calculation, the decoded phase data, along with the corresponding four coefficients, are input into the multiplier-accumulator tree shown in the diagram. After two multiply-add stages, the corresponding coordinate is obtained. Three copies of this structure operate in parallel, allowing the simultaneous calculation of the x, y, and z coordinates of a point. The calculated coordinates are cached and then written together to DDR.

Fig. 10 The structure of 3-D coordinate calculation module
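One plausible reading of the two-stage multiplier-accumulator tree is an Estrin-style evaluation, sketched below in Python. This is an assumption on our part, not a statement of the paper's exact circuit, and the names are illustrative:

```python
def eval_cubic(c, phi):
    # Estrin-style form: (a0 + a1*phi) + (a2 + a3*phi) * phi^2,
    # a shallow tree of two multiply-add levels rather than three
    # sequential Horner steps
    a0, a1, a2, a3 = c
    phi2 = phi * phi        # level 1, alongside the two MACs below
    lo = a0 + a1 * phi      # level 1 MAC
    hi = a2 + a3 * phi      # level 1 MAC
    return lo + hi * phi2   # level 2 MAC

def point_from_phase(cx, cy, cz, phi):
    # three parallel trees yield x, y, and z simultaneously
    return eval_cubic(cx, phi), eval_cubic(cy, phi), eval_cubic(cz, phi)
```

The two level-1 multiply-adds are independent and can run in parallel DSP slices, so the tree's latency is two multiply-add stages, consistent with the description above.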

Experiments and results
This paper uses an ALINX ARTIX-7 series FPGA development board featuring XILINX's XC7A200T-2FBG484I FPGA chip. The camera resolution is 2448*2048, and the system employs a structured light projection module based on one-dimensional MEMS micro-mirrors, with a projection resolution of 1024*1.
We conducted 3-D measurements on a standard sphere and a sculpture of David, as shown in Fig. 12(b). Our method achieves good performance: its mean squared error is 0.42 mm when measuring the standard sphere. Table 1 presents the deviation between the point clouds obtained on FPGA and in Python: the mean absolute deviation is 0.00065 mm and the standard deviation is 0.00235 mm, indicating that the proposed method's accuracy is close to that of the same algorithm implemented in Python. Table 2 presents the resource consumption of the proposed real-time structured light point cloud extraction system and compares it with the work in Ref. [1]. The use of complementary Gray code for phase unwrapping, combined with cubic-polynomial-based three-dimensional coordinate extraction and the redesigned cache structures at various levels, yields savings in LUTs, FFs, and BRAM; however, the larger number of parallel processing units leads to higher DSP consumption. Table 3 compares the running speed of the FPGA against an i5-7500 CPU, and Table 4 compares the running speed of our method with other methods in the literature. Owing to the pipelined design, extensive parallel computing, and the computationally simpler complementary Gray code unwrapping and cubic polynomial evaluation for point cloud generation, the proposed structure achieves a point cloud output speed of 76.9 fps at a clock rate of 100 MHz, much faster than the Python implementation. Table 5 presents the resource consumption of our proposed FPGA acceleration structure for cubic polynomial fitting, and Table 6 compares the speed of cubic polynomial fitting on a CPU and on the FPGA; the proposed structure significantly enhances the speed of cubic polynomial fitting.

Conclusion
This paper presents a real-time structured light point cloud extraction system based on FPGA. It uses the four-step phase-shifting method to compute wrapped phases, complementary Gray code for phase unwrapping, and cubic polynomial fitting to calculate the three-dimensional coordinates of points. To enhance the overall processing speed, a pipelined design is implemented for the various modules on the FPGA. Additionally, to maximize data bandwidth, a multi-frame image buffer structure is proposed. These enhancements enable the entire system to operate at very high speed and meet the requirements of dynamic measurement applications. Furthermore, to address the slow polynomial fitting during calibration, a hardware acceleration structure for cubic polynomial fitting is proposed, improving the speed of the fitting and further enhancing efficiency in actual measurements.

Fig. 11
The photo of the system

Table 1
Deviation between the point clouds obtained by FPGA and Python

Table 2
Utilization of the structured light 3-D measuring system

Table 3
Measuring performance comparison with i5-7500 CPU

Table 4
Performance comparison with other works

Table 5
Utilization of cubic polynomial fitting module

Table 6
Comparison of cubic polynomial fitting with FPGA and i5-7500 CPU