3.2. Lip-based language feature extraction
It can be seen from Fig. 1 that curve fitting describes the lip contour better than the alternative shown alongside it (first and third rows), and that a suitably chosen curve fits better than a rectangle (second and fourth rows). Curve fitting is therefore adopted for the two-dimensional lip model. The outer lip contour is described by three curves: the upper lip from the left mouth corner to the central point and the upper lip from the central point to the right mouth corner are each fitted with a cubic curve, and the lower lip is fitted with a quadratic curve. The expressions of the three curves are as follows:
$$\left\{\begin{array}{l}y={a}_{1}{x}^{3}+{a}_{2}{x}^{2}+{a}_{3}x+{a}_{4}\\ y={b}_{1}{x}^{3}+{b}_{2}{x}^{2}+{b}_{3}x+{b}_{4}\\ y={c}_{1}{x}^{2}+{c}_{2}x+{c}_{3}\end{array}\right.$$
1
Here, ai and bi (i = 1,2,3,4) are the coefficients of the two upper-lip curves, and ci (i = 1,2,3) are the coefficients of the lower-lip curve.
Figure 2 shows the lip model.
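As an illustration of Eq. (1), the three contour segments could be fitted by least squares. The sketch below is not the authors' implementation: the function name `fit_lip_curves` and the input format (arrays of (x, y) contour points per segment) are assumptions, and `numpy.polyfit` is used as one possible fitting method.

```python
import numpy as np

def fit_lip_curves(upper_left, upper_right, lower):
    """Least-squares fit of the three contour segments of Eq. (1).

    Each argument is an (n, 2) array of (x, y) contour points
    (a hypothetical input format, not specified in the text).
    Coefficients are returned in descending powers of x, so that
    a = (a1, a2, a3, a4), b = (b1, b2, b3, b4), c = (c1, c2, c3).
    """
    ul, ur, lo = (np.asarray(p, dtype=float)
                  for p in (upper_left, upper_right, lower))
    a = np.polyfit(ul[:, 0], ul[:, 1], 3)  # cubic: left half of upper lip
    b = np.polyfit(ur[:, 0], ur[:, 1], 3)  # cubic: right half of upper lip
    c = np.polyfit(lo[:, 0], lo[:, 1], 2)  # quadratic: lower lip
    return a, b, c
```

Each cubic needs at least four contour points and the quadratic at least three for the fit to be determined.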
In contrast to previous literature, which uses the length and width of the lips directly, this paper holds that the proportions of the lip shape reflect the speech information and are more robust when the lips move. The following four parameters are used to describe the basic shape of the lips:
(1) The ratio of the height and width of the outer contour of the lips R1
$${R}_{1}=\frac{{H}_{o}}{{W}_{o}}$$
2
(2) The ratio of the upper and lower heights of the outer contour of the lips R2
$${R}_{2}=\frac{{H}_{ou}}{{H}_{od}}$$
3
(3) The ratio of the height and width of the inner contour of the lips R3
$${R}_{3}=\frac{{H}_{i}}{{W}_{i}}$$
4
(4) The ratio of the upper and lower heights of the inner contour of the lips R4
$${R}_{4}=\frac{{H}_{iu}}{{H}_{id}}$$
5
Here, Ho, Wo, Hou, Hod, Hi, Wi, Hiu, and Hid denote, respectively, the height of the outer contour of the lips, the width of the outer contour, the upper height of the outer contour, the lower height of the outer contour, the height of the inner contour, the width of the inner contour, the upper height of the inner contour, and the lower height of the inner contour.
R1, R2, R3 and R4 constitute the first part of the geometric visual features, namely
$${V}_{1}=({R}_{1},{R}_{2},{R}_{3},{R}_{4})$$
6
Height and width ratios alone may omit part of the lip contour, which carries important lip information. In order to include all important information, the coefficients of the three lip curves are selected as the second part of the geometric feature vector, namely:
$${V}_{2}=({a}_{1},{a}_{2},{a}_{3},{a}_{4},{b}_{1},{b}_{2},{b}_{3},{b}_{4},{c}_{1},{c}_{2},{c}_{3})$$
7
So far, the geometric visual feature vector is obtained as follows:
$${V}_{g}=({V}_{1},{V}_{2})$$
8
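Equations (2) to (8) can be summarized in a short sketch. It assumes that the eight contour measurements and the curve coefficients of Eq. (1) are already available; the function name and argument order are illustrative only. A closed mouth (for example Hid = 0) would make a ratio undefined and would need separate handling, which the text does not discuss.

```python
import numpy as np

def geometric_features(H_o, W_o, H_ou, H_od, H_i, W_i, H_iu, H_id, a, b, c):
    """Assemble V_g = (V_1, V_2) following Eqs. (2)-(8)."""
    R1 = H_o / W_o     # Eq. (2): outer height / outer width
    R2 = H_ou / H_od   # Eq. (3): outer upper height / outer lower height
    R3 = H_i / W_i     # Eq. (4): inner height / inner width
    R4 = H_iu / H_id   # Eq. (5): inner upper height / inner lower height
    V1 = np.array([R1, R2, R3, R4], dtype=float)             # Eq. (6)
    V2 = np.concatenate([np.asarray(a, dtype=float),
                         np.asarray(b, dtype=float),
                         np.asarray(c, dtype=float)])        # Eq. (7)
    return np.concatenate([V1, V2])                          # Eq. (8)
```

With four cubic coefficients for each upper-lip curve and three for the lower lip, the resulting vector has 4 + 11 = 15 dimensions.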
(1) Discrete Fourier transform
The discrete Fourier transform can process and analyze images quickly and efficiently. By transforming the lip image into the frequency domain, the different frequency components of the image can be analyzed to obtain information about the lip image. The two-dimensional discrete Fourier transform is:
$${F}_{u,v}=\frac{1}{MN}\sum _{x=0}^{M-1}\sum _{y=0}^{N-1}{P}_{x,y}{e}^{-j2\pi \left(\frac{ux}{M}+\frac{vy}{N}\right)}$$
9
Here, M and N are the height and width of the image, Px,y is the pixel value at coordinates (x, y), Fu,v is the value after the Fourier transform, and u and v represent the frequencies in the horizontal and vertical directions, respectively. The inverse two-dimensional Fourier transform, from the frequency domain back to the image, is:
$${P}_{x,y}=\sum _{u=0}^{M-1}\sum _{v=0}^{N-1}{F}_{u,v}{e}^{j2\pi \left(\frac{ux}{M}+\frac{vy}{N}\right)}$$
10
The result of the Fourier transform of a real function is a complex number:
$${F}_{u,v}={R}_{u,v}+j{I}_{u,v}$$
11
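The transform pair of Eqs. (9) and (10) can be sketched with NumPy's FFT routines. `numpy.fft.fft2` applies no scaling in the forward direction and `numpy.fft.ifft2` scales by 1/(MN), so both are rescaled here to match the normalization used in the text. The function names `dft2` and `idft2` are illustrative.

```python
import numpy as np

def dft2(P):
    """Forward 2-D DFT with the 1/(MN) factor of Eq. (9)."""
    P = np.asarray(P, dtype=float)
    M, N = P.shape
    return np.fft.fft2(P) / (M * N)

def idft2(F):
    """Inverse 2-D DFT without scaling, as in Eq. (10)."""
    F = np.asarray(F)
    M, N = F.shape
    return np.fft.ifft2(F) * (M * N)

# Eq. (11): real and imaginary parts of the spectrum of a real image
P = np.arange(12.0).reshape(3, 4)   # toy 3 x 4 "image"
F = dft2(P)
R, I = F.real, F.imag
```

With this normalization, the zero-frequency term F[0, 0] equals the mean pixel value of the image.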
(2) Discrete Hartley transform
The Hartley transform is closely related to the Fourier transform, but it does not require complex arithmetic. Its advantage is that the forward and inverse transforms perform the same operation. Like the Fourier transform, it represents the image in the frequency domain. The two-dimensional discrete Hartley transform is defined as follows:
$${H}_{u,v}=\frac{1}{N}\sum _{x=0}^{N-1}\sum _{y=0}^{N-1}{P}_{x,y}\left[\cos \left(\frac{2\pi }{N}(ux+vy)\right)+\sin \left(\frac{2\pi }{N}(ux+vy)\right)\right]$$
12
The inverse Hartley transform applies the same operation to the transformed lip image:
$${P}_{x,y}=\frac{1}{N}\sum _{u=0}^{N-1}\sum _{v=0}^{N-1}{H}_{u,v}\left[\cos \left(\frac{2\pi }{N}(ux+vy)\right)+\sin \left(\frac{2\pi }{N}(ux+vy)\right)\right]$$
13
The Hartley transform has a fast algorithm. From the Hartley coefficients Hu,v, the corresponding Fourier transform can also be calculated.
However, unlike the Fourier transform, the Hartley transform is not invariant to shifts of the image, although this problem can be addressed in other ways.
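The relation between the two transforms can be illustrated numerically. For a real image, the unnormalized DFT is the sum of P(cos θ − j sin θ), so the cos + sin kernel of Eq. (12) is obtained as Re(F) − Im(F). The sketch below uses this identity; the function name `dht2` is illustrative, and a square N × N image is assumed, as in Eq. (12).

```python
import numpy as np

def dht2(P):
    """2-D discrete Hartley transform of Eq. (12), computed via the FFT.

    Because the kernel cos + sin is its own inverse up to scaling, the
    same function also implements the inverse transform of Eq. (13).
    """
    P = np.asarray(P, dtype=float)
    N = P.shape[0]
    if P.shape != (N, N):
        raise ValueError("Eq. (12) assumes a square N x N image")
    F = np.fft.fft2(P)            # unnormalized DFT, kernel exp(-j*theta)
    return (F.real - F.imag) / N  # sum of P * (cos(theta) + sin(theta)) / N
```

Applying `dht2` twice returns the original image, which is the "same operation" property noted in the text.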
(3) Discrete cosine transform
The discrete cosine transform (DCT) is a real-valued transform. The two-dimensional DCT of an image is as follows:
$$F(u,v)=\frac{2C\left(u\right)C\left(v\right)}{N}\sum _{x=0}^{N-1}\sum _{y=0}^{N-1}f(x,y)\cos \frac{\left(2x+1\right)u\pi }{2N}\cos \frac{\left(2y+1\right)v\pi }{2N}$$
14
The corresponding inverse transformation formula is:
$$f(x,y)=\frac{2}{N}\sum _{u=0}^{N-1}\sum _{v=0}^{N-1}C\left(u\right)C\left(v\right)F(u,v)\cos \frac{\left(2x+1\right)u\pi }{2N}\cos \frac{\left(2y+1\right)v\pi }{2N}$$
15
where u, v = 0, 1, 2, ..., N-1; x, y = 0, 1, 2, ..., N-1; and
$$C\left(u\right),C\left(v\right)=\left\{\begin{array}{ll}\frac{1}{\sqrt{2}}& u,v=0\\ 1& u,v=1,2,\dots ,N-1\end{array}\right.$$
16
It can be seen from the DCT formula that the DCT has a strong energy-concentration property. After an image is transformed by the DCT, its energy is mainly concentrated in the low-frequency part. As a result, for the same visual quality requirement, the DCT provides a very high compression ratio. In addition, like the FFT, the DCT has a fast algorithm.
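Equations (14) to (16) can be written in matrix form. With D[u, x] = sqrt(2/N) C(u) cos((2x+1)uπ/(2N)), the forward transform is D f Dᵀ and, because D is orthogonal, the inverse is Dᵀ F D. The following is a direct, non-optimized sketch for a square N × N image; the function names are illustrative.

```python
import numpy as np

def dct_matrix(N):
    """N x N DCT basis with D[u, x] = sqrt(2/N) * C(u) * cos((2x+1)u*pi/(2N))."""
    x = np.arange(N)
    u = x.reshape(-1, 1)
    D = np.sqrt(2.0 / N) * np.cos((2 * x + 1) * u * np.pi / (2 * N))
    D[0, :] /= np.sqrt(2.0)   # C(0) = 1/sqrt(2), Eq. (16)
    return D

def dct2(f):
    """Forward 2-D DCT of Eq. (14) for a square image."""
    f = np.asarray(f, dtype=float)
    if f.shape[0] != f.shape[1]:
        raise ValueError("Eq. (14) assumes a square N x N image")
    D = dct_matrix(f.shape[0])
    return D @ f @ D.T

def idct2(F):
    """Inverse 2-D DCT of Eq. (15)."""
    F = np.asarray(F, dtype=float)
    D = dct_matrix(F.shape[0])
    return D.T @ F @ D
```

Since the transform is orthogonal, the total energy (sum of squares) of the coefficients equals that of the image, so energy concentrated in a few low-frequency coefficients is energy removed from the rest.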
There are two ways to perform the DCT on an image: treat the image as a whole for a global DCT, or divide the image into several sub-regions for a block DCT, as shown in Fig. 3.
After obtaining the normalized lip image, divide it into 6 equal sub-regions, each of size 10 × 10, and perform the DCT on each sub-region, as shown in Fig. 4. The coefficients extracted from the sub-regions form the pixel feature vector:
$${V}_{p}=({v}_{s1},{v}_{s2},{v}_{s3},{v}_{s4},{v}_{s5},{v}_{s6})$$
17
where vsi = (ci1, ci2, ci3, ci4, ci5, ci6, ci7), i = 1, 2, 3, 4, 5, 6, and cik (k = 1, 2, 3, 4, 5, 6, 7) are the coefficients extracted from the i-th sub-region. In this way, a total of 7 × 6 = 42 dimensions are obtained as the pixel feature vector of the lip area.
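The block DCT feature of Eq. (17) can be sketched as follows. Two details are not stated in the text and are assumptions here: the normalized lip image is taken to be 20 × 30 pixels (2 × 3 blocks of 10 × 10), and the 7 coefficients per block are taken to be the first 7 in zigzag (low-frequency first) order. All function names are illustrative.

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT basis matching Eqs. (14) and (16)."""
    x = np.arange(N)
    u = x.reshape(-1, 1)
    D = np.sqrt(2.0 / N) * np.cos((2 * x + 1) * u * np.pi / (2 * N))
    D[0, :] /= np.sqrt(2.0)
    return D

def zigzag_indices(n, count):
    """First `count` (u, v) positions of an n x n block in zigzag order."""
    order = sorted(
        ((u, v) for u in range(n) for v in range(n)),
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]),
    )
    return order[:count]

def block_dct_features(img, block=10, n_coef=7):
    """Concatenate n_coef low-frequency DCT coefficients from each block.

    Assumption: blocks are visited row by row; the selection rule for the
    coefficients (zigzag order) is not given in the text.
    """
    img = np.asarray(img, dtype=float)
    rows, cols = img.shape
    if rows % block or cols % block:
        raise ValueError("image size must be a multiple of the block size")
    D = dct_matrix(block)
    idx = zigzag_indices(block, n_coef)
    feats = []
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            F = D @ img[r:r + block, c:c + block] @ D.T  # block DCT
            feats.extend(F[u, v] for u, v in idx)
    return np.array(feats)
```

For a 20 × 30 input this yields 6 blocks × 7 coefficients = 42 values, matching the dimension stated in the text.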
In the previous two subsections, the geometric feature Vg and the pixel feature Vp of the lip area were obtained. Combining Vg and Vp gives the mixed feature vector of the lips:
$${V}_{lip}=({V}_{g},{V}_{p})$$
18
Since the values of the resulting visual feature vector differ greatly from one dimension to another, and even in order of magnitude, the recognition accuracy may suffer. The values of the feature vector therefore need to be normalized to remove the differences between its components. Since V1, V2 and Vp actually have very different value ranges, the three vectors are normalized separately as follows:
$${v}^{*}=\frac{v-{v}_{min}}{{v}_{max}-{v}_{min}}$$
19
$${V}_{1}={V}_{1}^{*}$$
20
$${V}_{2}={V}_{2}^{* }$$
21
$${V}_{p}={V}_{p}^{*}$$
22
Finally, the visual feature vector of the lips is as follows:
$${V}_{s}={V}_{lip}^{*}=\left({V}_{1}^{*},{V}_{2}^{*},{V}_{p}^{*}\right)$$
23
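Equations (19) to (23) amount to min-max scaling of each sub-vector followed by concatenation. The sketch below assumes that v_min and v_max are taken over the components of each sub-vector; the text does not say whether they are instead computed per dimension over a training set. Mapping a constant vector to zeros is also an assumption, made only to avoid division by zero.

```python
import numpy as np

def min_max(v):
    """Eq. (19): scale the components of v to the range [0, 1]."""
    v = np.asarray(v, dtype=float)
    lo, hi = v.min(), v.max()
    if hi == lo:
        return np.zeros_like(v)  # assumption: constant vector maps to zeros
    return (v - lo) / (hi - lo)

def static_feature(V1, V2, Vp):
    """Eq. (23): V_s = (V_1*, V_2*, V_p*), each part normalized separately."""
    return np.concatenate([min_max(V1), min_max(V2), min_max(Vp)])
```

Normalizing the three parts separately keeps a part with large values, such as the DCT coefficients, from compressing the others toward zero.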
Speech is a dynamic process. If only a single image of the lips is considered, some information about the movement of the lips will be lost. The first-order difference is therefore used to express this dynamic information, and its calculation formula is as follows:
$$d\left(n\right)=\frac{1}{\sqrt{\sum _{i=-k}^{k}{i}^{2}}}\sum _{i=-k}^{k}i\,c(n+i)$$
24
Here, c(n) is the visual feature vector of the n-th lip image, d(n) is the corresponding difference vector, and k is the half-width of the difference window. Applying the difference to the static feature Vs gives the dynamic feature:
$${V}_{d}={\Delta }{V}_{s}$$
25
Combining the single-image feature of the lips with the dynamic feature gives the total feature vector:
$$V=({V}_{d},{V}_{s})$$
26
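Equations (24) to (26) can be sketched for a sequence of frames as follows. The text does not specify how frames near the start and end of the sequence are handled or what value of k is used; repeating the first and last frame and the default k = 2 are assumptions, and the function names are illustrative.

```python
import numpy as np

def delta_features(C, k=2):
    """Eq. (24) applied to every frame of a sequence.

    C is a (T, D) array holding one static feature vector V_s per frame.
    Assumption: the sequence is padded by repeating its edge frames.
    """
    C = np.asarray(C, dtype=float)
    T = C.shape[0]
    padded = np.pad(C, ((k, k), (0, 0)), mode="edge")
    norm = np.sqrt(sum(i * i for i in range(-k, k + 1)))
    d = np.zeros_like(C)
    for i in range(-k, k + 1):
        d += i * padded[k + i:k + i + T]   # i * c(n + i) for all n at once
    return d / norm

def total_features(C, k=2):
    """Eq. (26): V = (V_d, V_s) for every frame."""
    C = np.asarray(C, dtype=float)
    return np.hstack([delta_features(C, k), C])
```

The output has twice the dimension of the static feature, with the dynamic part placed first, following the order in Eq. (26).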