Mobile robot: automatic speech recognition application for automation and STEM education

Nowadays, robots are widely applied in daily life as well as in industrial production, medicine, rescue, learning, and entertainment. There are many kinds of robots using different modern technologies, such as manipulators, movement robots, and biomimetic robots; each type is applied according to its own role. Robots are also applied in training, especially in STEM training implemented with mobile robots. In this paper, we propose models for designing a mobile robot that combines an Android OS tablet with a user interaction screen, a voice control model, and an AIDL IPC interaction model for remote control. We study a mobile robot using the RockChip RK3399Pro AI processor, focusing on the structure and operating mechanism of the AI processor, along with Android's IPC AIDL architecture and automatic speech recognition technology, to control the robot to turn left, turn right, move forward, move backward, and stop by voice in three languages: Korean, English, and Vietnamese. We experimented on 1,850 recorded audio files, including 1,013 female voices and 837 male voices, with the 'latest_short,' 'latest_long,' and 'command_and_search' models. The results show a fairly high average confidence of 95% for English and Vietnamese and 87% for Korean. The results of this study will serve as a foundation for the research and development of a system that supports students in learning by interacting with mobile robots through multi-language voice. STEM education in the form of a gamified curriculum, and especially extended research on training programs, can then be applied on mobile robots with more flexibility.


Introduction
In the Industry 4.0 era, advanced technologies are developing strongly, including robots, which are applied in fields such as firefighting, healthcare, education, space exploration, and the exploration of dangerous ocean environments. Robots effectively support human production activities and play important roles in many fields, helping to optimize production, business, and trade activities. They have also shown their value in many other areas and, since their appearance, have evolved continuously to better fulfill these roles.
Meanwhile, a manipulator is a type of robot (Kah et al. 2015) with multiple joints in series, each of which can be reciprocating or rotating and is usually driven by a servo motor. Manipulators with clamps or grippers and many degrees of freedom are suitable for applications such as painting, welding, and workpiece picking. Thanks to technological development and practical needs, today's manipulators have many new uses, such as surgical support in medicine, assistance for the disabled, and industrial applications.
A movement robot (Robot 2021) is a robotic system capable of performing tasks in many different locations, moving by wheels, tracks, or legs depending on the terrain. Robots that move underwater or in the air need a propeller or jet engine to create motion. Movement robots have many applications and require solving many new problems; one common research problem is determining the robot's heading.
In addition to movement robots running on wheels and tracks, researchers are currently trying to apply the locomotion mechanisms of living organisms to robots, creating new types of biomimetic robots that can move automatically (Robot 2021).
Around the world, corporations in many countries have developed robots for education, especially for STEM teaching (Nite et al. 2014; Merino et al. 2016), in a wide variety of categories. In the history of robots and robot technology, the most important countries are the USA and Japan: the USA still leads in the comprehensiveness of robot technology, while Japan has the highest number and diversity of robots in the world. However, combining the Android operating system (Griffiths and Griffiths 2015) to let users interact directly and to support programming right on the robot is still at an early stage, for many reasons that may be related to copyright.
Many types of educational robots for STEM are reviewed in (Fachantidis et al. 2017), such as Do It Yourself (DIY) robots, open hardware robots, brick-based robots, pre-assembled robots, robots for simple actions or specific purposes, humanoid robots, and robots based on tangible programming. However, integrating an Android screen that allows users to interact directly is still limited. In (Kaleci and Korkmaz 2018), the authors presented Android OS mobile technologies in robotics as a project for education and STEM enhancement.
In this paper, we continue to research a mobile robot model that combines an interactive screen running the Android operating system with the IPC AIDL architecture, an automatic speech recognition application, and a single TensorFlow model to control movement robots and to support education, entertainment, and open-source application development. The RK3399Pro AI chip, a high-performance, low-power processor, is used to design the mobile robot. The model creates excitement in learning, especially for pre-university students, who can manipulate the robot directly or detach the robot's face to learn, depending on their needs. This model can enrich the educational ecosystem for STEM, and it can also be researched and deployed in application models for robotic business process automation and other robotics applications.
The USA and the Republic of Korea (South Korea) are major powers in robotics (Science Robotics Special Edition Booklet 2019). Vietnam is a developing country where robots are now widely applied in production, business, education, and entertainment. Vietnam has trade cooperation with the USA and Korea, and Korean Studies has developed very well in Vietnam. Therefore, the research team focuses on testing with three different languages, English, Korean, and Vietnamese, to increase business and educational cooperation opportunities in Vietnam with mobile robot products.
The paper structure includes the following sections: Introduction, approach, related studies, model design, system development, experiments and results, conclusions, and development directions.

Research methodology
In this article, we research the structure and function of the components of the RockChip RK3399Pro AI processor, as well as how to use and interact with it (RockChip RK3399Pro Datasheet, Revision 1.0 2018). We research the artificial intelligence technology in Google Cloud Speech (Google Cloud AI Platform 2022) for multilingual voice recognition and authentication, to integrate voice control commands into the mobile robot software. We propose to interact with the mobile robot via an Android tablet using the RockChip AI processor by integrating the IPC AIDL architecture (Android Interface Definition Language (AIDL), https://developer.android.com/guide/components/aidl, 2021) into the system, which receives interactive commands after the voice is converted into the corresponding integer values to control the robot to move left, right, forward, or backward, jump, or stop moving by voice. In addition, we research and integrate a speech processing model built with the Sequential API in TensorFlow Keras, then freeze and convert the trained model into a single TensorFlow .pb file and combine it with tensorflow-android to recognize English speech even without an internet connection.
Finally, we design software on the Android platform and install it on the mobile robot for interaction. This software allows switching among three languages: English, Korean, and Vietnamese. For each language, the user speaks and gives commands to the mobile robot in the selected language; the control commands in this research include turn left, turn right, move forward, move backward, jump, and stop moving. Users can give voice commands or give commands by interacting directly with the face of the mobile robot. In addition, we build the central software as an ecosystem for STEM training, focusing on math, foreign languages, and Blockly drag-and-drop programming.
We created and tested audio data for speech-to-text recognition consisting of 1,850 mp3 files recorded for experimenting with five groups of commands: 'Turn left,' 'Turn right,' 'Move forward,' 'Move backward,' and 'Stop.' This dataset has 1,013 female voices and 837 male voices, the recorded sample rate is 24,000 Hz, and all of these data are published on GitHub (Tran and Huh 2022). The 'latest_short,' 'latest_long,' and 'command_and_search' models were applied to extract features and obtain confidence values, and the results were compared visually. The latest_short model is for short utterances a few seconds in length and is useful for commands or other single-shot directed speech use cases; the command_and_search model can be used in place of latest_short. The latest_long model, on the other hand, is for any kind of long-form content, such as media and conversations, although it can also be used for short content.

RockChip AI processor RK3399Pro
At CES 2018, Fuzhou RockChip Electronics Co., Ltd. released its first AI (artificial intelligence) processor, the RK3399Pro, adopting a CPU + GPU (Steinkraus et al. 2005) + NPU (Amrouch et al. 2020) hardware structure. The chip integrates a dual-core ARM Cortex-A72 MPCore processor and a quad-core ARM Cortex-A53 MPCore processor, both high-performance, low-power, cached application processors. It has two CPU clusters: a big cluster with the dual-core Cortex-A72 optimized for high performance, and a little cluster with the quad-core Cortex-A53 optimized for low power. It fully implements the ARMv8-A instruction set and supports ARM Neon Advanced SIMD (Single Instruction Multiple Data) for accelerating media and signal processing, based on the big.LITTLE architecture (ARM, big 2013; Yu et al. 2013). Equipped with a powerful Neural Network Processing Unit (NPU), it supports mainstream platforms on the market, such as Caffe and TensorFlow. Figure 1 shows the RK3399Pro development tool for AI.
RockChip RK3399Pro provides AI-related APIs for programmers to use such as RKNN API, TensorFlow Lite API, and Android NN API.
The RockChip RK3399Pro AI solution has three important features: high-performance AI hardware, superior platform compatibility, and easy development of a turnkey solution.

High-performance AI hardware
The RK3399Pro adopts an exclusive AI hardware design. Its NPU computing performance reaches 2.4 TOPS, and it leads in both performance and power consumption: performance is 150% higher than other same-type NPU processors, while power consumption is less than 1% of that of a GPU used as an AI computing unit.

Superior platform compatibility
The RK3399Pro NPU supports 8-bit and 16-bit computation and is compatible with various AI software frameworks. Existing AI interfaces support OpenVX and the TensorFlow Lite/Android NN APIs; AI software tools support importing, mapping, and optimizing Caffe/TensorFlow models.

Easily develop a turnkey solution
RockChip provides a one-stop AI solution based on the RK3399Pro, including a hardware reference design and SDK. The solution can significantly improve the speed of AI product development for global developers and greatly shorten the time to market.

Automatic speech recognition (ASR) and feature extraction technique
Speech recognition, also known as automatic speech recognition (ASR), develops methodologies and technologies that enable computers to recognize and translate spoken language into text. A speech recognition system is built from major components that include an acoustic front-end, an acoustic model, a language model, a lexicon, and a decoder (Karpagavalli, Chandra, Evania 2016; Pham et al. 2020; Palogiannidi et al. 2020), as shown in Fig. 2. The acoustic front-end converts the speech signal into appropriate features that provide useful information for recognition: the input audio is converted into a sequence of fixed-size acoustic feature vectors. The parameters of word/phone models are estimated from the acoustic vectors of training data. The decoder operates by searching through all possible word sequences to find the sequence of words most likely to have generated the observed features. The function of an automatic speech recognition system can be described as assigning several speech parameters from the audio signal to each word or sub-word unit. Speech parameters describe the word or sub-word unit by their change over time, and together they form a pattern that characterizes it. During the training phase, the program reads all the words in the current application's vocabulary and stores their patterns; later, when a word needs to be recognized, it is compared with the stored samples and the word that gives the best match is selected.
The purpose of an ASR system (Tulics et al. 2020; Huang, Chen 2020) is to obtain the most likely word sequence given the speaker's audio signal.
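Formally, this decoding objective can be stated with the standard noisy-channel formulation (a textbook identity, not reproduced from the paper): given the sequence of acoustic feature vectors $X$, the recognizer searches for

$$\hat{W} \;=\; \arg\max_{W} P(W \mid X) \;=\; \arg\max_{W} P(X \mid W)\,P(W),$$

where $P(X \mid W)$ is provided by the acoustic model and $P(W)$ by the language model; the decoder described above performs this search over word sequences.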
Munish Kumar et al. provided a systematic survey of research on automatic speech recognition (Munish Kumar et al. 2020), summarizing the best available research on automatic speech recognition of Indian languages from papers published from 2000 to 2018. Figure 3 shows the Google Cloud Speech-to-Text platform. Google Cloud Speech-to-Text accurately converts speech into text using an API powered by Google's AI technologies. This AI technology provides many functions: Speech-to-Text client libraries to get started in your language of choice; the Cloud Speech REST API (v1 REST API reference, non-streaming JSON); the Cloud Speech RPC API (v1 gRPC API reference, streaming and non-streaming Proto 3); language support for the languages recognized by Speech-to-Text, such as Korean, Vietnamese, and English; supported class tokens for speech adaptation by language; and the Cloud Speech-to-Text On-Prem API with the Cloud Speech-to-Text On-Prem solution.
Google Cloud Platform has been researched by many scientists for deploying software (Gupta et al. 2020). In this paper, the speech-to-text model in Google Cloud Platform is applied to the system, using three languages (Korean, Vietnamese, and English) to control the robot.
Deep learning models have also been applied to study and identify entities in different languages. Munish Kumar et al. used the DeepSpacy NER deep learning model (Singh et al. 2022a) for named entity recognition in the Punjabi language. The authors also proposed and released a publicly annotated benchmark corpus for the Gurmukhi script.
Many feature extraction techniques for speech recognition, such as MFCC, MODGDF, and GFCC, are listed in a survey of feature extraction and classification techniques (Singh et al. 2022b) and in work on speech recognition and computational intelligence models (Singh et al. 2022c). Munish Kumar et al. also examined the major challenges of speech recognition in different languages and compared many processing models as well as rich feature extraction techniques.

Android interface definition language (AIDL) and IPC
The Android Interface Definition Language (AIDL) is similar to other IDLs (Shannon and Snodgrass 1989; Lamport 1986) you might have worked with. It allows you to define the programming interface that both the client and the service agree upon in order to communicate with each other using interprocess communication (IPC) (Kashyian et al. 2008; Choi et al. 2018). On Android, one process cannot normally access the memory of another process, so processes need to decompose their objects into primitives that the operating system can understand and marshal the objects across that boundary. The code to do that marshaling is tedious to write, so Android handles it for you with AIDL. Using IPC AIDL allows clients from different applications to access your service, with multithreading handled in the service. Figure 4 shows the IPC AIDL architecture model. Before beginning to design an AIDL interface, be aware that calls to an AIDL interface are direct function calls. An AIDL interface must be defined in a .aidl file using Java programming language syntax, and these are the steps: (1) Create the .aidl file; this file defines the programming interface with method signatures. (2) Implement the interface; the Android SDK tools generate an interface in the Java programming language based on the .aidl file. This interface has an inner abstract class named Stub that extends Binder and implements methods from the AIDL interface; you must extend the Stub class and implement its methods. (3) Expose the interface to clients; implement a Service and override onBind() to return an implementation of the Stub class.
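As a minimal sketch of these three steps (the interface name here is illustrative, patterned on the sendCmd_Int call described later, not the project's actual files):

```java
// Step 1: IRobotService.aidl — define the interface (AIDL uses Java-like syntax).
//   interface IRobotService {
//       void sendCmd_Int(int cmd);  // movement command encoded as an integer
//   }

// Steps 2 and 3: implement the generated Stub and expose it from a Service.
import android.app.Service;
import android.content.Intent;
import android.os.IBinder;

public class RobotService extends Service {

    // Step 2: the SDK generates IRobotService with an inner abstract Stub class.
    private final IRobotService.Stub binder = new IRobotService.Stub() {
        @Override
        public void sendCmd_Int(int cmd) {
            // forward the integer command to the motor control board
        }
    };

    // Step 3: return the Stub implementation so clients can bind to it.
    @Override
    public IBinder onBind(Intent intent) {
        return binder;
    }
}
```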

STEM education on robotics
STEM is an idea-based curriculum that equips learners with knowledge and skills related to science, technology, engineering, and math in an interdisciplinary and human-centered approach, so that learning can be applied to solve problems in everyday life (Sapounidis and Dimitris 2020). The trend of research and publication on STEM education technology is growing, and the article (Yeping et al. 2020) lists seven topic categories related to STEM publications: (1) K-12 teaching, teachers, and teacher education in STEM; (2) post-secondary teachers and teaching in STEM; (3) K-12 STEM learners, learning, and learning environments; (4) post-secondary STEM learners, learning, and learning environments; (5) policy, curriculum, evaluation, and assessment in STEM; (6) culture, social, and gender issues in STEM education; and (7) history, epistemology, and perspectives on STEM and STEM education. This is shown in Fig. 5.
In recent years, the trend of applying robotics to improve STEM education has received increasing attention (Eguchi and Uribe 2017). In developing countries like Vietnam, many STEM education associations have been established and gradually introduced into mass education, and the application of high technologies such as robots in teaching is increasingly interesting. A mobile robot application with an interactive interface makes it easier for learners to get started, and software can be built on Android devices, which are already popular in Vietnam and worldwide. In developing countries, this will promote the demand for applied STEM learning, making students more excited and creative in the learning process.
Moreover, Munish Kumar et al. designed an effective agile knowledge management framework from an empirical study (Singh et al. 2022d), providing both theoretical and practical contributions; the proposal gave practitioners practical aspects of knowledge management and agility. This research is therefore also relevant to the application of STEM education.

Computer and tablet environment
In this paper, we experiment on the Windows operating system (64-bit, 128 GB RAM, Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz), using the .NET Framework, the Java language, the Android Studio 4.2.1 tool, Android OS, and the IPC AIDL architecture to deploy the software. The Android tablet with the RockChip AI processor runs Android OS version 9.0 with 4 GB of RAM.

Proposal and implementation of research model
In this research, we propose two models: the first is an interactive master model that controls the mobile robot using the RockChip AI processor through automatic speech recognition, and the second is a detailed model that converts voice into codes, combined with the IPC AIDL architecture, to control the robot to turn left, turn right, move forward, move backward, dance, or stop moving. Figure 6 shows the general proposed research model for voice control processing that combines the TensorFlow Keras API and Google Cloud AI.
There are five main steps in the general model. In Step 1, the user selects the interactive language, which can be English, Korean, or Vietnamese. The interactive language is detected in Step 2. After detecting the language, in Step 3 the program applies the speech processing model corresponding to the selected language: if English is selected (with or without an internet connection), the system applies the TensorFlow Keras API model; if Korean or Vietnamese is selected (applicable only with an internet connection), the program applies the Google Cloud AI Platform model. Step 3 is explained in detail after the general model. In Step 4, for each model applied in speech recognition, the program receives text commands; text examples are shown in Table 1. Finally, in Step 5, the program receives the text command from Step 4 and analyzes it, converting the text to an integer number to call the commands in the IPC AIDL architecture and control the movement of the mobile robot. Figure 7 shows the detailed architecture of the IPC AIDL interaction model for mobile robot movement.
In Fig. 7, the model has five steps to process mobile robot movement.
Step 1: Embed the .aidl files, including IcSpAidlInterface.aidl and IcSpAidlInterfaceCallback.aidl, into the project; the AIDL tool will generate two Java interfaces. These interfaces provide functions for us to call mobile robot control commands interactively, such as the sendCmd_Int(cmd) function to send a command to the robot.
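Based on the four methods described with the UML model below (registerCallback, unregisterCallback, getAllCmd, and sendCmd_Int), IcSpAidlInterface.aidl might look roughly as follows; this is a hedged reconstruction, not the verbatim project file, and the getAllCmd return type is an assumption:

```java
// IcSpAidlInterface.aidl — reconstructed sketch.
// AIDL files use Java-like syntax; the AIDL tool generates the Java interface from this.
interface IcSpAidlInterface {
    void registerCallback(IcSpAidlInterfaceCallback cb);    // register the result callback
    void unregisterCallback(IcSpAidlInterfaceCallback cb);  // unregister the callback
    String[] getAllCmd();        // fetch the commands the system supports (return type assumed)
    void sendCmd_Int(int cmd);   // send a movement command as an integer
}
```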
Step 2: In this step, we declare interface variables for binding and unbinding and a ServiceConnection reference to support the registration and deregistration of the callback.
Step 3: Register the callback for IcSpAidlInterfaceCallback by implementing the MyServiceConnection class referenced in Step 2. After the service creation process is complete, this step is used by Step 2 to refer to bindService and unbindService.

Step 4: The program filters the text command received from Google Cloud. If the command is in the group of defined commands, the text command is converted to an integer number to control the mobile robot; if the command is incorrect, the program shows an Alert warning. Table 2 shows the integer numbers for mobile robot movement.
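A minimal sketch of this filtering step; the integer codes below are illustrative placeholders, since the actual values are defined in Table 2:

```java
// Map a recognized English text command to an integer code (values are illustrative).
static int textToCommand(String text) {
    switch (text.trim().toLowerCase()) {
        case "turn left":     return 1;
        case "turn right":    return 2;
        case "move forward":  return 3;
        case "move backward": return 4;
        case "stop":          return 5;
        default:              return -1; // unknown command: caller shows the Alert warning
    }
}
```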
Step 5: Invoke commands to control the robot's movement using the codes analyzed in Step 4. The cmdID variable is the integer number used to send control commands to the mobile robot through the IPC AIDL architecture, as shown in Algorithm 3. Meanwhile, Fig. 8 summarizes the UML model of the object classes participating in the mobile robot control interaction. IcSpAidlInterface is a Java interface defined in the IPC AIDL architecture, generated automatically when the system uses the AIDL engine for processing. This interface provides several important methods through which we can interact with the circuit board system: registerCallback (used to register an IcSpAidlInterfaceCallback), unregisterCallback (used to unregister an IcSpAidlInterfaceCallback), getAllCmd (a method that supports getting commands from the system), and sendCmd_Int (a method to send control commands to the robot, with the command formatted as an integer number) (Table 3).
MyServiceConnection is a class that implements the ServiceConnection interface, providing two important functions, onServiceConnected and onServiceDisconnected, to register and unregister the IcSpAidl callback. onServiceConnected is used by the Stub to create an IcSpAidl object and register a callback to receive the system's return results; the bindService task refers to the MyServiceConnection object when it is called. onServiceDisconnected is used to unregister the callback, together with the unbindService task.
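From this description, MyServiceConnection might be sketched as follows (a reconstruction using standard Android APIs; the error handling and the way the callback instance is supplied are assumptions):

```java
import android.content.ComponentName;
import android.content.ServiceConnection;
import android.os.IBinder;
import android.os.RemoteException;

public class MyServiceConnection implements ServiceConnection {
    private final IcSpAidlInterfaceCallback callback; // supplied by the caller (assumption)
    private IcSpAidlInterface service;

    public MyServiceConnection(IcSpAidlInterfaceCallback callback) {
        this.callback = callback;
    }

    @Override
    public void onServiceConnected(ComponentName name, IBinder binder) {
        try {
            // Obtain the AIDL proxy through the generated Stub and register the callback.
            service = IcSpAidlInterface.Stub.asInterface(binder);
            service.registerCallback(callback);
        } catch (RemoteException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void onServiceDisconnected(ComponentName name) {
        // Per the description above, the callback is unregistered and the service
        // unbound when the connection ends.
        service = null;
    }
}
```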
MainActivity is the central class of the software system installed on the mobile robot. This class designs and handles operations such as changing the language used to interact with the robot, transmitting voice, receiving text, and commanding the robot to perform operations such as turning left, turning right, moving forward, moving backward, jumping, or stopping. The bindService and unbindService tasks are also performed in MainActivity.
Thus, we have presented the general proposal and implementation of the research model, including the structure of the mobile robot, the interaction model, the classes, the system operation, and the applied IPC AIDL architecture.
The TensorFlow Keras API is used to handle the English language whether or not the mobile robot has an internet connection. The model is shown in Fig. 9 (implementing a single TensorFlow .pb file in the mobile robot for speech recognition). This model processes speech recognition through TensorFlow Keras even when the robot is not connected to the network; in this case, we perform English speech processing.
Step 1: The audio dataset was reused from tensorflow.org; the original dataset consists of over 105,000 WAV audio files of people saying thirty different words. This data was collected by Google and released under a CC BY license (Warden 2018). Each one-second utterance is stored in the '.wav' file format with a 16 kHz sampling rate, and the dataset contains the words 'up,' 'down,' 'left,' 'right,' and 'stop.' Google provided two versions of the Speech Commands dataset (Speech commands dataset version 1 2017; Speech commands dataset version 2 2018).
Step 2: Preprocessing the dataset includes importing the Speech Commands dataset and reading the audio files and their labels, then converting each waveform into a spectrogram, which shows frequency changes over time and can be represented as a 2D image. This is done by applying the short-time Fourier transform (STFT) (Griffin and Lim 1984) to convert the audio into the time-frequency domain.
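For reference, the discrete STFT applied here follows the standard definition (notation ours): for a window $w$ of length $N$ and hop size $H$,

$$X[m,k] \;=\; \sum_{n=0}^{N-1} x[n+mH]\, w[n]\, e^{-j 2\pi k n / N},$$

and the spectrogram fed to the network is the magnitude $|X[m,k]|$ arranged as a 2D time-frequency image.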
Step 3: Build and train the model using the Sequential API of TensorFlow Keras. A simple convolutional neural network (CNN) (Anirudha et al. 2020) is used, since the audio files have been transformed into spectrogram images. The model also has two additional preprocessing layers: a Normalization layer to normalize each pixel in the image based on its mean and standard deviation, and a Resizing layer to downsample the input so the model trains faster.
Step 4: Freeze and convert the trained model into a single TensorFlow .pb file using the write_graph function. After obtaining this package file, we put it into the Android mobile project for use in Step 5.
Step 5: Use the TensorFlowInferenceInterface class in the tensorflow-android library to load the TensorFlow model from the single TensorFlow .pb file and then predict the speech-recognition text to get the text command.
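A sketch of this inference step with the tensorflow-android library; the model file name, tensor names, and input dimensions are assumptions that depend on the frozen graph:

```java
import org.tensorflow.contrib.android.TensorFlowInferenceInterface;

// Inside an Activity or Service with access to getAssets().
TensorFlowInferenceInterface inference =
        new TensorFlowInferenceInterface(getAssets(), "single_tensorflow.pb"); // file name assumed

int height = 32, width = 32; // spectrogram size assumed (after the Resizing layer)
float[] spectrogram = new float[height * width]; // flattened spectrogram of the utterance

inference.feed("input", spectrogram, 1, height, width, 1); // tensor name and shape assumed
inference.run(new String[] {"output"});                    // output tensor name assumed

float[] scores = new float[5]; // one score per word: up, down, left, right, stop
inference.fetch("output", scores);
// The index with the highest score selects the recognized command word.
```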
Step 6: Deploy the application onto the mobile robot to experiment with the words 'up,' 'down,' 'left,' 'right,' and 'stop.' When received, these strings are analyzed and converted to integers to control the robot's movement through the AIDL architecture. Figure 10 shows the implementation of the Google Cloud AI Platform for speech recognition; when the mobile robot has an internet connection, it supports the Korean and Vietnamese languages.
The model consists of seven main implementation steps, described in detail below.
Step 1 Unique secure identity: a simple, secure, and flexible approach to identity and device management. Identity Platform is a customer identity and access management (CIAM) platform that helps organizations add identity and access management functionality to their applications, protect user accounts, and scale with confidence on Google Cloud.
Step 2 Choose language and voice command. In this step, the user chooses the interactive language on the software installed in the mobile robot, selecting Korean or Vietnamese, and then speaks to the mobile robot in the chosen language. The voice commands for mobile robot movement are shown in Table 1; these voice commands are passed to Step 3 for processing.
Step 3 Speech and language. In this step, the program receives the voice command and calls the GoogleRecognitionListener to obtain text from the user's voice. The program sends the voice to Google Cloud through Google Functions and receives text back after Speech-to-Text API and AutoML Natural Language processing. The resulting text is used for the integer conversion that controls the robot, detailed in the other steps.
Step 4 Transcribe. In this step, the program receives the voice from Step 3 and calls the Speech-to-Text API to convert voice to text. This API supports over 125 languages and offers streaming speech recognition, customizable speech recognition, and Speech-to-Text On-Prem; it can recognize distinct channels in multichannel situations and annotate the transcripts to preserve the order, and it can handle noisy audio from many environments without requiring additional noise cancelation.
Step 5 Intent and entity extraction. After Step 4 is done, the system checks the language using the AutoML Natural Language tool. This library allows building and deploying custom machine learning models that analyze documents, categorize them, identify entities within them, or assess attitudes within them. AutoML Natural Language uses machine learning to analyze the structure and meaning of documents. We can train a custom machine learning model to classify documents, extract information, or understand the sentiment of authors.
Step 6 Text command. This step receives the results returned from the Google Cloud platform after the processing in Steps 4 and 5. The string returned from the voice, after language analysis, is in Korean or Vietnamese.
Step 7 Analyze the text and control the robot.
Step 7 receives the text command from Step 6, analyzes it, and converts the text to an integer number to call the commands in the IPC AIDL architecture that control the movement of the mobile robot, as shown in the general model section. The GoogleRecognitionListener class is used on the client side to receive voice input and return text results from Google Cloud. This class supports many functions, such as onBeginningOfSpeech (marks the beginning of voice processing), onEndOfSpeech (marks voice processing as complete), and onResults (the result of Google Cloud's analysis, stored in a Bundle object from which we extract the data).
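A sketch of such a listener using Android's standard SpeechRecognizer API; the paper's GoogleRecognitionListener is described but not listed, so this reconstruction uses only standard framework classes:

```java
import android.os.Bundle;
import android.speech.RecognitionListener;
import android.speech.SpeechRecognizer;
import java.util.ArrayList;

public class GoogleRecognitionListener implements RecognitionListener {
    @Override public void onBeginningOfSpeech() { /* voice processing starts */ }
    @Override public void onEndOfSpeech() { /* voice processing is complete */ }

    @Override
    public void onResults(Bundle results) {
        // Extract the recognized text returned from Google's analysis.
        ArrayList<String> texts =
                results.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION);
        if (texts != null && !texts.isEmpty()) {
            String command = texts.get(0);
            // pass the text on for integer conversion and robot control
        }
    }

    // Remaining RecognitionListener methods, left empty for brevity.
    @Override public void onReadyForSpeech(Bundle params) {}
    @Override public void onRmsChanged(float rmsdB) {}
    @Override public void onBufferReceived(byte[] buffer) {}
    @Override public void onError(int error) {}
    @Override public void onPartialResults(Bundle partialResults) {}
    @Override public void onEvent(int eventType, Bundle params) {}
}
```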
Thus, we have presented a proposal and implementation of the research model, including the mobile robot interaction model, classes, system operation, and the applied IPC AIDL architecture, combined with the single TensorFlow Keras API and the Google Cloud AI platform. In the next section, we experiment with the functions presented in the model.

Experimental results
We built a software system to experiment with the proposed model. The system includes the following blueprint for interactive screens, shown in Fig. 11.
This screen is the blueprint of the program, using Views such as LinearLayout, TableLayout, ImageView, EditText, TextView, ProgressBar, and Button. Figure 12 shows an illustrative structure for the blueprint, applied to interactive interface design (MacLean et al. 2015; Smith 2015).
The interactive screen structure consists of two parts: the interface and the handling of user interaction. The interface is designed in XML, and the interactive part is written in Java. When the software runs, the Android operating system in the mobile robot's face compiles the XML layout into Java code, from which we can access the elements in the graphical user interface just as in the Java object-oriented model.
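For illustration, a View declared in the XML layout is accessed from Java in the standard way (the layout name, view ID, and handler below are hypothetical):

```java
// In MainActivity.onCreate(), after the XML layout has been inflated.
setContentView(R.layout.activity_main);                      // layout resource name assumed
android.widget.Button btnLeft = findViewById(R.id.btn_turn_left); // view ID assumed
btnLeft.setOnClickListener(v -> sendCmd(1));                 // hypothetical: route to the AIDL command path
```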
The screen program supports switching among several languages, including English, Korean, and Vietnamese. Figure 13 shows the interactive screen in English.
The main functions of the test software include changing the language, commanding the mobile robot by voice, changing the volume, movement functions such as turning left, turning right, going forward, going backward, jumping, or stopping, and exiting the application. The software is compiled, packaged into .apk format, and deployed to the mobile robot, as shown in Fig. 14.
When the user switches the interface to Korean, the program automatically changes the software interface to Korean, and the system then listens only for Korean when the user commands the mobile robot. Figure 15 shows the screen with the interface in Korean. Finally, Fig. 16 shows the interactive screen in Vietnamese: when the user switches the interface to Vietnamese, the program automatically changes the software interface to Vietnamese, and the system then listens only for Vietnamese commands.
Android broadcast receiver technology is applied in the experiment to listen to the internet connection state: when the connection drops, the Korean and Vietnamese languages are disabled automatically and are enabled again when the internet connection returns. This is shown in Fig. 17.
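A sketch of such a receiver with standard Android connectivity APIs (the class and listener names are assumptions; CONNECTIVITY_ACTION, though deprecated in later Android versions, is available on the Android 9 target used here):

```java
import android.content.BroadcastReceiver;
import android.content.Context;
import android.content.Intent;
import android.net.ConnectivityManager;
import android.net.NetworkInfo;

public class ConnectionReceiver extends BroadcastReceiver { // class name assumed
    public interface Listener { void onConnectivityChanged(boolean online); }

    private final Listener listener;

    public ConnectionReceiver(Listener listener) { this.listener = listener; }

    @Override
    public void onReceive(Context context, Intent intent) {
        ConnectivityManager cm = (ConnectivityManager)
                context.getSystemService(Context.CONNECTIVITY_SERVICE);
        NetworkInfo info = cm.getActiveNetworkInfo();
        // Korean and Vietnamese require Google Cloud, so the UI enables them only when
        // online; English keeps working offline through the embedded TensorFlow model.
        listener.onConnectivityChanged(info != null && info.isConnected());
    }

    // Registration, e.g. in MainActivity.onCreate() (helper name hypothetical):
    //   registerReceiver(new ConnectionReceiver(online -> setCloudLanguagesEnabled(online)),
    //                    new IntentFilter(ConnectivityManager.CONNECTIVITY_ACTION));
}
```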
We experimented with speech-to-text processing using the three models 'latest_short,' 'latest_long,' and 'command_and_search' on the 1,850 recorded mp3 audio files described in the previous sections. We summarize the results by comparing the average confidence values of the models for each activity group. The comparison tables are divided into three languages: English, Korean, and Vietnamese (Table 4).
Observing Table 4, the 'latest_long' model gives stable confidence results as high as 95%. The 'command_and_search' model gives the lowest results, varying from 74 to 93%. The 'latest_short' model has results ranging from 87 to 95%. So, for the English-language experiment, the 'latest_long' model basically gives the best results.
Observing Table 5, for Korean the 'latest_short' and 'latest_long' models give equivalent confidence of about 87%. However, the 'command_and_search' model shows strong volatility, with a low of 51% and a high of 92%. The lowest case is the recording group '중지하십시오,' which achieves an average confidence of only 51% with the 'command_and_search' model. So, in terms of stability, we recommend using the 'latest_short' and 'latest_long' models, with an average confidence of around 87%.
For the Vietnamese experimental results shown in Table 6, the 'latest_short' model gives the most stable results, with an average confidence of about 95%. Next is the 'latest_long' model, for which most results give an average confidence of 95%; however, there is one exception, the action group 'Dừng di chuyển,' which achieves an average confidence of only 72%. Finally, the 'command_and_search' model has a large variation in average confidence, from 78 to 97%. So the 'command_and_search' model basically gives higher results than 'latest_long,' but not in the same proportion for each action group. Thus, if high stability is required, we propose using the 'latest_short' model for Vietnamese language processing.
From the above experimental results, Table 7 compares the average confidence among the languages during the experiment on the 1,850 recorded mp3 files, for each action group.
Observing the pivot Table 7, the 'latest_short' and 'latest_long' models, when tested against the 1,850-file mp3 recording dataset, all give fairly stable results. Most English and Vietnamese results have a high average confidence, with a top threshold of 95%, while Korean is stable at 87%. The 'command_and_search' model is quite volatile in the experiments but also reaches the highest average value for the Vietnamese case, at 96.5%. A visual comparison of average confidence for the 'latest_short' model is shown in Fig. 18. As the figure visualizes, with the 'latest_short' model the Vietnamese language gave quite good results on the research team's dataset, with an average confidence of 95%, followed by English, whose highest confidence is for the 'Move backward' action group (average confidence = 95%) and lowest for the 'turn left' and 'stop' groups, at 92.3%. As for Korean, the results are consistent, with an average confidence of 87% for all action groups.
And Fig. 19 shows the result of a visual comparison of average confidence when implementing the 'latest_long' model.
Similar to the 'latest_short' model, the English and Vietnamese experiments with the 'latest_long' model give quite similar results at about 95%, except for the 'Stop' action group of the Vietnamese language (average confidence surprisingly reduced to only about 87.3%). Korean still maintains an average confidence of 87%. Thus, in terms of stability, the 'latest_long' model gives consistent results. After the speech-to-text models are selected and applied to convert voice to text, this text continues to be processed by the model to convert it to integer form; this value is sent directly to the mobile robot for control. Figure 21 shows the synchronized command strings with integer conversion during the experiment.
This model simplifies the analysis process for English, Korean, and Vietnamese after receiving the results returned from Google's machine learning system or the single TensorFlow Keras model. If the command is in the group of defined commands, the text command is converted to an integer number to control the mobile robot through IPC AIDL; if the command is incorrect, the program shows an Alert warning.
Algorithm 4 illustrates sending the command (through the parameter cmdID, which is an integer value).
From this function, we can call mobile robot commands according to the IPC AIDL architecture simply, as in the example shown in Table 3.
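Algorithm 4 itself is not reproduced in this text; a minimal sketch of the call it describes, assuming the icSpAidlInterface proxy was obtained during binding, is:

```java
// Send an integer movement command through the AIDL proxy (sketch of Algorithm 4).
void sendCmd(int cmdID) {
    try {
        if (icSpAidlInterface != null) {
            icSpAidlInterface.sendCmd_Int(cmdID); // e.g., a value produced by textToCommand()
        }
    } catch (android.os.RemoteException e) {
        e.printStackTrace(); // the cross-process binder call failed
    }
}
```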
Thus, we tested the system according to the proposed model and took advantage of Google's artificial intelligence system and the single TensorFlow Keras model to apply multi-language voice control to the mobile robot. With the IPC AIDL architecture applied in Android, the system can control the board through the Java interfaces generated by the AIDL tool.
We also applied a model to build the STEM hub education in one mobile robot. Depending on the educational institution's lesson design, each STEM component can be developed into modules that can be installed and invoked from the STEM hub system, as shown in Fig. 22. In addition to the STEM system, the mobile robot was designed with English learning modules (the Vietnamese market and developing countries have a lot of demand for learning English). At the same time, the system provides modules for entertainment. We recorded a video of the future work for the STEM hub and mobile robot face emotions.
The mobile robot has many application levels. Starting from the model proposed by this research, researchers can control the mobile robot by voice in many different languages and can integrate it into English and STEM programs (Science, Technology, Engineering, Art, Mathematics); entertainment programs can also be integrated into the mobile robot.

Conclusion and future work
We have researched mobile robots using the RockChip AI processor and designed a model for automation and STEM education with an automatic speech recognition architecture. In this research, we have implemented a voice-control model and an IPC AIDL interaction model for the mobile robot, a model class for mobile robot motion, and a synchronized command sequence with integer conversion.
Mobile robot control processing was tested on five action groups, 'stop movement,' 'move forward,' 'move back,' 'turn right,' and 'turn left,' in multiple languages (English, Korean, and Vietnamese). The voice data was tested on 1,850 mp3 files with 1,013 female voices and 837 male voices. Three models were experimented with on this dataset, 'latest_short,' 'latest_long,' and 'command_and_search'; the results obtained a fairly high average confidence of about 95% for the English and Vietnamese languages and about 87% for Korean. The study also provides detailed comparison tables for the models tested on the datasets. These results can help other researchers refer to the experimental outcomes as well as reuse the dataset in similar studies.
In addition, for the case without a network connection, the article proposed a model to integrate the model trained with TensorFlow Keras into the mobile robot by freezing and converting the trained model into a 'single TensorFlow .pb' file to recognize English speech. This is also a promising application direction for the mobile robot: data for different languages can be added for training so that multiple languages can be used in an environment without a network connection, and STEM education applications can be installed on mobile robots and used interactively without an internet connection.
We also built and experimented with the Android application on the mobile robot with the following features: a choice of three interactive languages (English, Korean, and Vietnamese), along with a mobile robot voice control feature with operations to turn left, turn right, move forward, move backward, or stop moving. We recorded a video of the mobile robot experiment with automatic speech recognition (Video abstract 2: Video demo Movement control by Automatic Speech Recognition 2021).
In future work, we will continue to collect more data to improve the quality of the voice processing model that controls the mobile robot, especially to improve the quality of the Korean language, which only has an average confidence of 87%. We will also apply the proposed automatic speech recognition model to provide a learner interaction module for the mobile robot through research on emotion analysis in human-robot interaction and Plutchik's wheel of emotion (Szabóová et al. 2020), and on the future of service: the power of emotion in human-robot interaction (Joanne et al. 2021). We will propose an artificial intelligence system for the mobile robot to display twelve emotions (contempt, sadness, distraction, joy, interest, rage, optimism, admiration, love, surprise, apprehension, and disapproval); these will increase the excitement for learners and create conversations with robots for entertainment. These properties are very useful in deploying STEM applications, especially for primary school students. Figure 20 shows the twelve mobile robot interaction emotions that we will apply in future work (Fig. 23).
In addition, Google Play Services can be integrated into the mobile robot, and we can build on-device machine learning applications. This helps with low latency, keeps data on-device for privacy, and saves cost.

Data availability Enquiries about data availability should be directed to the authors.

Declarations
Conflict of interest The authors declare that they have no competing interests. We confirm we have included a data availability statement in our main manuscript file.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.