AR context
AR originated as a technology for simulating operational data with computer-generated graphics [15]. In recent years, as AR technology has matured, it has been applied across industries such as medical treatment, education, the military, gaming, and culture, and AR products have gradually entered the public eye [16]. The core of the AR context is combining virtual objects with the real environment to create a new language field, as shown in Fig. 1.
Through AR technology, the new domain space is divided into mobile augmented space, immersive augmented space, and semi-immersive augmented space. Different language spaces can satisfy the combined characteristics of virtual and real environments and bring users sensory pleasure and aesthetic enjoyment [17]. The premise, however, is understanding the language of this new field, including basic AR knowledge, operation language, and interaction language, so that AR products can be used without barriers.
In the new AR context, the diversification of information communication is reflected in the diversity of information display methods and communication channels. AR creates a strong sense of immersion and stimulates vision, hearing, smell, and touch, and the design methods of symbols are correspondingly more varied. Information display is inseparable from display equipment. Present AR devices include mobile handheld AR devices, spatial AR devices, and wearable AR devices. Different display devices have different display modes, ranging from multi-dimensional text, image, and audio information to abstract sound and smell information. In addition, AR information communication breaks through the boundaries of time and space with the assistance of Internet communication, so people can receive different information anytime and anywhere.
AR technology diversifies information dissemination, changes how information spreads, and engages people's multi-sensory system, forming a new mode of interaction between humans and textual information. The multi-perception characteristics of the new context are reflected in two aspects. The first is real-time performance: to make virtual information approximate what the user sees and how the user reacts in the real world, real-time interaction is required, reflected in real-time tracking of the user's position and real-time superposition of virtual information. The second is interactivity that matches human behavior: the AR environment gives users an immersive experience in which they can operate products through bodily sensations. The operation settings mostly follow users' habits rather than mere screen taps. Somatosensory interaction conforms better to ergonomics and satisfies people's natural behavior by bringing the experience of virtual information closer to the real world.
Edge cloud computing structure and platform construction
(1) Introduction to edge cloud platform
The centralized cloud platform is a cloud computing network structure based on large data centers. Its popularization-oriented business operation model suffers from high complexity, strong network dependency, and poor robustness [18]. Distributed cloud computing systems, in turn, face limited subsystem capacity, difficult collaboration, and difficult security maintenance [19]. Hence, an edge cloud platform combining the centralized and distributed cloud platforms is proposed to solve these problems. The edge cloud computing architecture consists of the core cloud, edge clouds, and user terminals, as shown in Fig. 2.
Edge clouds and the core cloud cooperate to provide cloud computing services for users. A corresponding edge cloud is established according to the geographical location of resources and services. Each edge cloud is an independent cloud computing system, and the use of resources and services among edge clouds is coordinated through the core cloud. Users log in to individual edge clouds to use cloud computing services, which effectively reduces the network load of any single edge cloud system and eases the dependence of cloud computing on network bandwidth and latency.
The edge cloud network structure combines the advantages of the two traditional network structures while avoiding their shortcomings. The new structure inherits the unified resource composition of the centralized structure (each edge cloud is composed of different resource types but presents a unified resource representation upward to the core cloud), so resource management is simple, resource aggregation is high, and collaborative services across resources are straightforward. Meanwhile, it avoids the disadvantages of the traditional centralized structure, such as strong network dependence and insufficient utilization of existing resources. Compared with the traditional distributed structure, the new structure solves the problems of isolated internal networks, decentralized service resources, difficult management, and difficult resource coordination, while retaining the distributed structure's strengths: high utilization of existing resources, low network dependency, high scalability, and good service robustness.
(2) Experimental environment
In the data collection system of the cultural and creative booth's on-site deployment platform, data are transmitted to the edge cloud platform for transcoding, recording, production, and distribution in the cloud. Within the signal coverage of the 5G edge cloud network, tourist users can perform remote camera control and cloud switching production through connected terminal equipment. Brain-computer users connect PC devices through the brain-computer platform for data transmission and analysis. Figure 3 is a schematic diagram of the system scheme.
Table 1 displays the three types of physical computer specifications used in this test.
Table 1
Cloud platform computer specifications
Attribute | CPU | CPU clock speed | Cores | Memory (GB) | GPU
Edge cloud | Intel Xeon Gold 5320 | 2.2 GHz | 27 | 39 | No
Public cloud | i9-12900K | 3.2 GHz | 16 | 30 | NVIDIA RTX A6000
Local area network (LAN) | Intel Xeon E5-2680 v2 | 2.8 GHz | 10 | 25 | No
The NVE6a LAN server has a low-end configuration and weak decoding capability, so in some LANs the delay exceeds that of Mobile Edge Computing (MEC). An IP encoding, decoding, transcoding, distribution, production, and processing platform is deployed in the cloud. The mobile app uses third-party software and supports the Real Time Messaging Protocol (RTMP) and Secure Reliable Transport (SRT) protocols.
(3) Protocol Configuration
Table 2 shows the encoding/decoding transmission delay of different protocols under the three network architectures.
Table 2
Delay of different protocols
Serial number | Terminal | Protocol | Test bandwidth | MEC monitoring delay | Output delay | LAN monitoring delay | Public cloud monitoring delay
1 | PTZ EYES P200 | NDI Full | 80M ~ 100M | 160 ms | 302 ms | 135 ms | Environment not supported
2 | N2 | NDI HX | 6M64K ~ 8M | 140 ms | 270 ms | 210 ms | Environment not supported
3 | | SRT | | 600 ms | 680 ms | 610 ms | 750 ms
4 | | RTMP | | 1.2 s | 1.26 s | 1.18 s | 1.22 s
5 | P30 | RTMP | 5M | 2.4 s | 2.6 s | 2.0 s | 2.73 s
6 | | SRT | 5M | 1 s | 1.1 s | 1 s | 1.1 s
Brain-computer interface system management
(1) System channel
The brain-computer interface system acquires and preprocesses electroencephalogram (EEG) signals, extracts the features of the EEG signals, classifies and identifies them, and finally converts them into control command output, as shown in Fig. 4.
Figure 4 shows a classic brain-computer interface system framework. An EEG-based brain-computer interface collects the EEG signal from the scalp as the input signal, rapidly processes the collected signal, and finally converts the user data into a control signal that is output to the controlled equipment.
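This pipeline can be summarized as a short skeleton. The Python sketch below is illustrative only: the channel count, sampling rate, stage implementations, and command names are placeholder assumptions, and each stage stands in for the concrete methods discussed in the following subsections.

```python
import numpy as np

def acquire_epoch():
    """Placeholder acquisition stage: one scalp-EEG epoch.

    Random data stand in for a real amplifier feed (8 channels x 1 s
    at an assumed 250 Hz); in practice this would read from the headset.
    """
    return np.random.randn(8, 250)

def preprocess(epoch):
    """Placeholder preprocessing stage (e.g. re-referencing, filtering)."""
    return epoch - epoch.mean(axis=1, keepdims=True)

def extract_features(epoch):
    """Placeholder feature stage (e.g. CSP log-variance, see below)."""
    return np.log(epoch.var(axis=1))

def classify(features):
    """Placeholder classifier mapping features to a control command."""
    return "LEFT" if features[0] > features[-1] else "RIGHT"

# Closed loop: signal in, control command out.
for _ in range(3):
    command = classify(extract_features(preprocess(acquire_epoch())))
    print("control command:", command)
```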
There are many processing methods for evoked potential signals. The most classic is the cumulative average method, which is also one of the most commonly used methods for improving the signal-to-noise ratio in electrophysiological measurement [20]. The procedure is as follows. The body is briefly stimulated several times in the same way, and the potential generated by each stimulation is recorded immediately. The recorded waveforms are then aligned in time, taking the moment the stimulation is applied as the reference point, and cumulatively averaged. The averaged waveform represents the signal to be extracted.
After the \(i\)-th stimulation, the recorded signal \({r}_{i}\left(t\right)\) can be expressed as:
$${r}_{i}\left(t\right)=s\left(t\right)+{n}_{i}\left(t\right)$$
1
\(s\left(t\right)\) represents the evoked potential signal, \({n}_{i}\left(t\right)\) is the noise recorded for the \(i\)-th time, and \(i\) is the record number. The expression of the signal after \(N\) times of cumulative average reads:
$$\frac{1}{N}\sum _{i=1}^{N}{r}_{i}\left(t\right)=\frac{1}{N}\left(\sum _{i=1}^{N}s\left(t\right)+\sum _{i=1}^{N}{n}_{i}\left(t\right)\right)$$
2
If the noise variance is \({\sigma }_{n}^{2}\), after \(N\) cumulative averages, the mean and variance of the noise term are:
$$\text{E}\left(\frac{1}{N}\sum _{i=1}^{N}{n}_{i}\left(t\right)\right)=0$$
3
$$\text{v}\text{a}\text{r}\left(\frac{1}{N}\sum _{i=1}^{N}{n}_{i}\left(t\right)\right)=\frac{{\sigma }_{n}^{2}}{N}$$
4
After \(N\) rounds of cumulative averaging, the power signal-to-noise ratio of the average response is \(N\) times that of a single response, and the amplitude signal-to-noise ratio is increased by a factor of \(\sqrt{N}\).
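As a sanity check on Eqs. (1)-(4), the following Python sketch simulates cumulative averaging. The Gaussian-bump "evoked potential", the 250 Hz sampling rate, the noise level, and the 200 repetitions are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

fs = 250                                        # assumed sampling rate (Hz)
t = np.arange(0, 0.6, 1 / fs)                   # one 600 ms epoch
s = 2.0 * np.exp(-((t - 0.3) ** 2) / 0.002)     # synthetic evoked potential s(t)
sigma = 5.0                                     # noise standard deviation

N = 200                                         # number of stimulus repetitions
# r_i(t) = s(t) + n_i(t): each row is one recorded epoch (Eq. (1)).
epochs = s + rng.normal(0.0, sigma, size=(N, t.size))

# Cumulative average over N aligned epochs (Eq. (2)).
avg = epochs.mean(axis=0)

# Empirical check of Eq. (4): residual noise variance ~ sigma^2 / N.
residual_var = np.var(avg - s)
print(f"single-epoch noise variance: {sigma**2:.2f}")
print(f"averaged noise variance:     {residual_var:.3f} (theory: {sigma**2 / N:.3f})")
```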
(2) Principal Component Analysis (PCA)
PCA uses the idea of dimension reduction to transform many indexes into a few comprehensive indexes [21]. In researching practical problems, many influencing factors must be considered to analyze the problem comprehensively and systematically. In multivariate statistical analysis, these factors are generally called indexes or variables. Each variable reflects some information about the research problem to varying degrees, and there is a certain correlation among the indexes, so the information carried by the statistical data overlaps to some extent. When statistical methods are used to study multivariable problems, too many variables increase the computational load and the complexity of the analysis. The goal is therefore to retain as much of the original information as possible while working with far fewer variables, which is exactly what PCA achieves.
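To make the dimension-reduction idea concrete, here is a minimal NumPy sketch of PCA (center the data, eigendecompose the covariance of the indexes, keep the top components); the six correlated "indexes" in the usage example are synthetic.

```python
import numpy as np

def pca(X, k):
    """Reduce X (n_samples x n_features) to k principal components."""
    Xc = X - X.mean(axis=0)                 # center each index (variable)
    cov = np.cov(Xc, rowvar=False)          # covariance between indexes
    vals, vecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]          # sort descending by variance explained
    components = vecs[:, order[:k]]         # top-k principal directions
    explained = vals[order[:k]] / vals.sum()
    return Xc @ components, explained

# Example: 100 observations of 6 correlated indexes compressed to 2 components.
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 2))
X = base @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(100, 6))
scores, ratio = pca(X, 2)
print("variance explained by 2 components:", ratio.sum())
```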
(3) Feature extraction
The common spatial pattern is usually used for spatial filtering of motor imagery EEG signals. This method finds a set of spatial filters that maximize the variance of one class of signals while minimizing the variance of the other [22]. After band-pass filtering, the variance of an EEG signal equals its energy, so the Common Spatial Pattern (CSP) can be adopted to classify the Event-Related Desynchronization (ERD) / Event-Related Synchronization (ERS) of motor imagery EEG signals. Left-hand / right-hand motor imagery EEG signals are taken as an example. The signals are band-pass filtered. \({X}_{l}\left(i\right)\) and \({X}_{r}\left(i\right)\) represent left / right samples, and their covariance matrices are \({R}_{l}\left(i\right)={{X}_{l}\left(i\right){X}_{l}\left(i\right)}^{T}\) and \({R}_{r}\left(i\right)={{X}_{r}\left(i\right){X}_{r}\left(i\right)}^{T}\), respectively, where \({R}_{l}\left(i\right)\) is the covariance matrix of the \(i\)-th sample in class \(l\) and \({R}_{r}\left(i\right)\) is the covariance matrix of the \(i\)-th sample in class \(r\). The common spatial pattern projection matrix \(W\) that diagonalizes \({R}_{l}\) and \({R}_{r}\) simultaneously is found, so that:
$${W}^{T}{R}_{l}W=D,{W}^{T}\left({R}_{l}+{R}_{r}\right)W=I$$
5
\(I\) is the identity matrix. Then,
$${W}^{T}{R}_{r}W=I-{W}^{T}{R}_{l}W=I-D$$
6
For the other class of samples, the projected variance satisfies \(var\left({W}_{k}^{T}{X}_{r}\right)=1-{d}_{k}\), where \({d}_{k}\) is the \(k\)-th diagonal element of \(D\).
Calculation steps of projection matrix \(W\) are as follows:
The composite covariance matrix \(R={R}_{l}+{R}_{r}\) is calculated and subjected to singular value decomposition:
$$R={U}_{0}{\varLambda }_{C}{U}_{0}^{T}$$
7
The whitening transformation matrix of the covariance matrix is constructed from the eigenvector matrix \({U}_{0}\) and eigenvalue matrix \({\varLambda }_{C}\) obtained by the decomposition:
$$P={\varLambda }_{C}^{-1/2}{U}_{0}^{T}$$
8
\({R}_{l}\) and \({R}_{r}\) are transformed to obtain:
$${S}_{l}=P{R}_{l}{P}^{T}, {S}_{r}=P{R}_{r}{P}^{T}$$
9
The eigenvalue decomposition of \({S}_{l}\) or \({S}_{r}\) is carried out to obtain the common eigenvector matrix \(U\) of \({R}_{l}\) and \({R}_{r}\).
For a single-trial EEG data matrix \(X\left(i\right)\), the projection yields:
$$Z\left(i\right)=WX\left(i\right)$$
10
For each projected matrix \(Z\left(i\right)\), its variance is taken as the feature for classification.
In actual algorithm debugging, the eigenvectors corresponding to the largest and smallest eigenvalues are usually selected to form the final CSP projection matrix. Multi-class CSP and CSP filter banks also exist [23].
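A compact NumPy sketch of the CSP steps above (Eqs. (5)-(10)) might look as follows; the trial data layout (channels x samples) and the trace normalization of the covariance estimates are conventional assumptions rather than specifics from the text.

```python
import numpy as np

def csp(trials_l, trials_r, n_pairs=2):
    """Common Spatial Pattern filters from two classes of EEG trials.

    trials_*: iterable of (channels x samples) band-pass filtered trials.
    Returns the selected rows of the projection matrix W (filters x channels).
    """
    def mean_cov(trials):
        # Trace-normalized covariance, averaged over trials.
        covs = [X @ X.T / np.trace(X @ X.T) for X in trials]
        return np.mean(covs, axis=0)

    Rl, Rr = mean_cov(trials_l), mean_cov(trials_r)
    R = Rl + Rr

    # Whitening transform P = Lambda^{-1/2} U0^T  (Eqs. (7)-(8)).
    vals, U0 = np.linalg.eigh(R)
    P = np.diag(vals ** -0.5) @ U0.T

    # Diagonalize the whitened class covariance (Eq. (9)).
    Sl = P @ Rl @ P.T
    d, U = np.linalg.eigh(Sl)
    U = U[:, np.argsort(d)[::-1]]

    W = U.T @ P                        # full projection matrix; Z = W X (Eq. (10))
    # Keep the filters for the largest and smallest eigenvalues.
    idx = list(range(n_pairs)) + list(range(W.shape[0] - n_pairs, W.shape[0]))
    return W[idx]

def csp_features(W, X):
    """Log-variance features of a projected trial Z = W X, used for classification."""
    Z = W @ X
    var = Z.var(axis=1)
    return np.log(var / var.sum())
```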
(4) Classification and identification
The Fisher linear classifier projects samples from a \(d\)-dimensional space onto a straight line, forming a one-dimensional space. Fisher linear discrimination solves the problem of finding the projection direction along which the projected samples of the different classes are most separable.
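A minimal sketch of the Fisher direction follows, assuming two classes of synthetic \(d\)-dimensional feature vectors: the optimal projection is \(w\propto {S}_{w}^{-1}({m}_{1}-{m}_{2})\), where \({S}_{w}\) is the within-class scatter matrix and \({m}_{1},{m}_{2}\) are the class means.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher projection direction w and midpoint threshold for two classes.

    X1, X2: (n_samples x d) arrays. Samples are classified by
    thresholding the one-dimensional projection X @ w.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrix S_w = S_1 + S_2.
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2
    # Optimal direction: w proportional to S_w^{-1} (m1 - m2).
    w = np.linalg.solve(Sw, m1 - m2)
    threshold = 0.5 * (m1 + m2) @ w    # midpoint of the projected class means
    return w, threshold

# Toy usage with synthetic feature vectors (hypothetical data).
rng = np.random.default_rng(2)
X1 = rng.normal(0.0, 1.0, size=(50, 4))
X2 = rng.normal(1.5, 1.0, size=(50, 4))
w, b = fisher_direction(X1, X2)
print("class-1 accuracy:", ((X1 @ w) > b).mean())
```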
User experience overview
The industry defines user experience as the subjective feeling users build while using a product. The concept was put forward in the 1950s, and its core is "user-centered design" [24]. As the understanding of user experience has deepened, the industry has produced new definitions and explanations, as shown in Fig. 5.
Figure 5 shows that user experience is people's cognitive impression of and response to the products, systems, and services they use or expect to use [25]. This highlights its subjective character: user experience covers all of users' physiological and psychological reactions to a product or system before, during, and after use. The view was re-emphasized at the 2015 conference of the International Experience Design Association [26]. User experience is no longer limited to a single subjective feeling but is an ecological experience in which users participate deeply in a highly immersive state, requiring products to provide comprehensive services and experiences while meeting basic needs. Figure 6 shows the specific framework of user experience.
User experience can be divided into ease of learning, ease of positioning, efficiency, memorability, accessibility, fault tolerance, and satisfaction [27]. It should also be analyzed on five levels: strategy, scope, structure, skeleton, and surface. The strategy level focuses on users' target needs; the scope level emphasizes product functions; the structure level focuses on interaction design; the skeleton level focuses on the design of the user's browsing experience; and the surface level concerns the user's perceptual experience after using the product, centering on visual design.
However the elements of user experience are defined and categorized, the key point is that user experience is subjective, multi-level, and multi-domain. Most experiences comprise physical sensory perception, interactive experience, psychological emotion, and value experience.
The implementation method of the AR-BCI system
(1) Visual interface design
At present, many visual stimulation software packages and toolkits exist. The DSVEP paradigm combines AR with dynamic Steady-State Visual Evoked Potentials (SSVEP). Figure 10 shows the stimulation form of DSVEP.
In DSVEP in a real environment, the flicker stimulus is attached to objects in the environment, with the following characteristics: the stimulus is superimposed on a real-time dynamic background rather than a black or fixed-color one; the visual stimulus (VS) area is not fixed in place but attached to a specified object; and the VS area may change location or size dynamically.
(2) Communication interface implementation
In brain-computer interface research, Bluetooth and other methods can be used to transmit EEG data to the computer. Here, the Lab Streaming Layer (LSL) is used to read and transmit data in real time. LSL supports the unified collection of experimental measurement time series, networking between the two sides, time synchronization, real-time access to the equipment system, optional data collection, visual viewing, and saving of recorded data. Figure 11 is a schematic diagram.
LSL transmission inherits the reliability of the Transmission Control Protocol (TCP) and is message-oriented and type-safe. The library provides automatic fault recovery from application or computer crashes to minimize data loss, so a computer can be replaced in the middle of a recording without restarting data collection. Data are buffered at both sender and receiver to tolerate intermittent network failures, and type conversion is supported when necessary.
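A minimal pylsl sketch of this producer/consumer pattern is given below; the stream name, channel count, sampling rate, and source id are placeholders, and the sender and receiver would normally run in separate processes or on separate machines.

```python
# pip install pylsl  -- Python bindings for the Lab Streaming Layer.
import math
import time

from pylsl import StreamInfo, StreamInlet, StreamOutlet, resolve_stream

# Sender side: declare an 8-channel EEG stream at 250 Hz (example values).
info = StreamInfo(name='AR-BCI-EEG', type='EEG', channel_count=8,
                  nominal_srate=250, channel_format='float32',
                  source_id='headset-001')
outlet = StreamOutlet(info)

# Receiver side: find the stream by type and pull time-stamped samples;
# LSL handles the TCP transport, buffering, and clock synchronization.
streams = resolve_stream('type', 'EEG')
inlet = StreamInlet(streams[0])

for k in range(250):
    outlet.push_sample([math.sin(k / 10.0)] * 8)     # fake EEG sample
    sample, timestamp = inlet.pull_sample(timeout=1.0)
    if sample is not None and k % 50 == 0:
        print(f"t={timestamp:.3f}  ch0={sample[0]:.3f}")
    time.sleep(1 / 250)
```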
(3) Introduction to robot control
The robot used is developed in C, C++, and Python. Many top-level interfaces are open for modification, including robot movement and manipulator planning. The robot structure has 15 degrees of freedom, with dual six-degree-of-freedom arms equipped with an adaptive end gripper offering a large range of motion and high flexibility; the end of one arm can carry 1.5 kg. With reasonable algorithm optimization, the two arms can work cooperatively. For movement and obstacle avoidance, the robot uses hub-motor driving wheels with differential drive and can carry 120 kg. Lidar, ultrasonic sensors, an inertial measurement unit, and multi-sensor fusion allow it to realize autonomous navigation, obstacle avoidance, and motion planning. The robot system has an open control protocol, which facilitates secondary development. The joint servo motor module is the power unit of the robot arm and other motion units and the core component of the intelligent robot.
(4) Real-time target detection and tracking
The system is programmed in Python and designed in multithreading mode to ensure system stability and transmission rate. Flicker visual stimulation, real-time camera transmission, target tracking, and EEG signal processing each run in a separate thread, with data interaction among the threads, and the dynamic stimulus labeling algorithm is realized by deep learning. A density histogram estimation method then matches the detection targets identified in consecutive frames to realize object tracking, as shown in Fig. 12.
The specific implementation process is as follows. Object detection is performed on the image captured by the camera in frame n, m dynamic objects are detected, and the regional density histograms of the m objects are calculated. The calculation results for frame n + 1 are then matched against those for frame n, and so on, realizing the whole object-tracking process. Experimental verification shows that, at speeds below 1 m/s, objects can be detected, tracked, and marked with visual stimulation without recognition or marking failures.
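A simplified OpenCV sketch of the histogram-matching step is shown below. It assumes an external detector has already produced bounding boxes for the m objects in each frame; the HSV histogram bins and the greedy correlation matching are illustrative choices, not the exact algorithm of the text.

```python
import cv2

def region_histogram(frame_bgr, box):
    """HSV color histogram of one detected object's bounding box (x, y, w, h)."""
    x, y, w, h = box
    roi = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([roi], [0, 1], None, [30, 32], [0, 180, 0, 256])
    return cv2.normalize(hist, hist)

def match_objects(hists_prev, hists_next):
    """Greedily pair objects in frame n with frame n+1 by histogram correlation."""
    matches, used = {}, set()
    for i, hp in enumerate(hists_prev):
        candidates = [(cv2.compareHist(hp, hn, cv2.HISTCMP_CORREL), j)
                      for j, hn in enumerate(hists_next) if j not in used]
        if candidates:
            score, j = max(candidates)
            matches[i] = j        # object i in frame n -> object j in frame n+1
            used.add(j)
    return matches
```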
(5) System experimental test
The key feature of DSVEP is that the VS area is attached to a specified object rather than fixed at a specific location. Based on the above research, online experiments are conducted with the same batch of robots, and different objects are flicker-marked at frequencies of 8, 10, 12, and 15 Hz. The average recognition accuracy within 1 s is then analyzed, and the corresponding Information Transfer Rate (ITR) is calculated.
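For reference, the ITR of an SSVEP-based speller of this kind is commonly computed with the Wolpaw formula (stated here for completeness; the original text does not give it), where \(N\) is the number of selectable targets (here the four flicker frequencies), \(P\) is the recognition accuracy, and \(T\) is the time of one selection in seconds, giving bits per minute:

$$ITR=\frac{60}{T}\left[{\log }_{2}N+P{\log }_{2}P+\left(1-P\right){\log }_{2}\frac{1-P}{N-1}\right]$$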
Experimental results and analysis