Vision-Based Human-Machine Interface for an Assistive Robotic Exoskeleton Glove

This paper presents a vision-based Human-Machine Interface (HMI) for an assistive exoskeleton glove, designed to incorporate force planning capabilities. While Electroencephalogram (EEG) and Electromyography (EMG)-based HMIs allow direct grasp force planning via user signals, voice and vision-based HMIs face limitations. In particular, two primary force planning methods encounter issues in these HMIs. First, traditional force optimization struggles with unfamiliar objects due to lack of object information. Second, the slip-grasp method faces a high failure rate due to inadequate initial grasp force. To address these challenges, this paper introduces a vision-based HMI to estimate the initial grasp forces of the target object. The initial grasp force estimation is performed based on the size and surface material of the target object. The experimental results demonstrate a grasp success rate of 87.5%, marking significant improvements over the slip-grasp method (71.9%).


Introduction
Exoskeleton gloves are used to restore the grasping ability needed to perform Activities of Daily Living (ADLs) for patients with brachial plexus injuries (BPI) (Xu et al., 2020; Jian et al., 2018; Ge et al., 2020) or for post-stroke rehabilitation (Rahman and Al-Jumaily, 2012; Stilli et al., 2018; Sun et al., 2021; Iqbal and Baizid, 2015; Bauer et al., 2021). BPI is usually caused by motorcycle or snowmobile accidents that damage the neural system of the hand, resulting in lost mobility and sensation (Midha, 1997). Stroke, caused by disruption of blood flow to the brain, can damage the area of the brain that controls muscle movement, resulting in reduced mobility and sensation in the hand (Hunter and Crome, 2002). In both cases, an exoskeleton glove is a promising solution to improve the quality of life for patients with hand disabilities.
First author and second author contribute equally to this work.
In recent decades, numerous wearable robotic rehabilitation exoskeleton gloves have been developed to assist patients with hand disabilities (Xu et al., 2020; Jian et al., 2018; Ge et al., 2020; Rahman and Al-Jumaily, 2012; Stilli et al., 2018; Sun et al., 2021; Iqbal and Baizid, 2015; Bauer et al., 2021; Ma and Ben-Tzvi, 2015; Ma and Ben-Tzvi, 2015; Lee and Bae, 2015; Popov et al., 2017; Refour et al., 2019). Unlike robotic hands and grippers, which require a fully automated grasping system, exoskeleton gloves require a semi-human-guided control system. Patients who wear the exoskeleton glove manually aim at the object during grasping. Also, the exoskeleton glove can only provide a limited number of degrees of freedom in terms of mobility, thereby limiting the types of grasps it can exhibit. Thus, human-machine interfaces (HMIs) for robotic exoskeleton gloves only need to determine the grasp type and force.
Various HMIs have been developed to control exoskeletons, including Electroencephalogram (EEG), Electromyography (EMG), vision, and voice-based HMIs, each with its advantages and disadvantages. The EMG-based HMI is the most commonly used method. It can provide real-time motion and force planning directly from the wearer through EMG sensors placed on the forearm (Bronks and Brown, 1987; Artemiadis and Kyriakopoulos, 2008). Most researchers only used EMG sensors to detect gestures due to their good wearability (Chen et al., 2021; Cheon et al., 2020; Li et al., 2021; Yun et al., 2017; Lalitharatne et al., 2013; Huang et al., 2021). However, patients with paralysis of the hand have significantly weaker muscle EMG signals than healthy individuals (Zhou et al., 2021). Therefore, EMG-based approaches are not suitable for patients with extremely weak or no hand function. Researchers have designed several other HMIs to control exoskeleton gloves. An EEG-based HMI can provide a force planning feature (Paek et al., 2015) similar to the EMG approach, but suffers from wearability issues of the EEG sensor (Araujo et al., 2021; Li et al., 2019). A vision-based HMI requires minimal user action, but is low in precision and lacks initial grasp force planning ability (Kim et al., 2019; Pham et al., 2015; Ko et al., 2023; Calandra et al., 2018; Yamaguchi and Atkeson, 2017; Takamuku and Gomi, 2019). A voice-based HMI is known for its high accuracy, but lacks force planning ability (Guo et al., 2020; Wang et al., 2019; Kim et al., 2020). Force planning is critical for exoskeleton HMIs and can only be provided by the user through EMG-based HMIs. The lack of force planning ability results in a slow and unstable grasp. Providing force planning on non-EMG-based HMIs has become one of the most challenging problems in exoskeleton glove control.
In this research, we focus on solving the aforementioned force planning issue by adding a vision-based HMI to a voice-controlled exoskeleton glove. Computer vision techniques are used to estimate the size, weight, and surface material of the target object. The estimated weight and size information is used to estimate the initial grasp force.
The main contributions of this study are summarized as follows. First, transfer learning was applied to state-of-the-art house interior surface material detection techniques, adapting them to identify materials on common objects in constrained contexts. Second, a novel computer vision-based HMI system was created, specifically tailored for assistive robotic exoskeletons. This system tackles challenges in force planning by estimating the dimensions, weight, and surface material of the target object. Lastly, grasp experiments were conducted to demonstrate the effectiveness of the vision-based HMI in approximating the initial grasp force. The results showed a notably higher grasp success rate than that of the traditional slip-grasp method.

Exoskeleton Glove Hardware
This research employs an assistive exoskeleton glove tailored for patients with BPI (Xu et al., 2020, 2023). As individuals with BPI lack control over their muscles, this exoskeleton glove serves as a replacement for hand function. Key features of the exoskeleton glove include the use of Series Elastic Actuators (SEAs) alongside data-driven control and calibration for precise force measurement and control (Guo et al., 2021). The exoskeleton glove incorporates seven SEAs to manage finger extension and contraction, thumb joint rotation, and wrist bending motion, enabling it to perform five rudimentary grasp types: cylinder grasp, sphere grasp, tip grasp, tripod grasp, and lateral grasp (as shown in Fig. 1). Each grasp type has been specifically designed to handle certain types of objects, as illustrated in Fig. 3. For instance, the cylinder grasp is well-suited for grasping water bottles and cups, while the tip grasp is ideal for handling spoons and forks.
The exoskeleton's operation can be outlined in three steps. First, the user interacts with the exoskeleton through a voice-based Human-Machine Interface (HMI) to instruct it on the desired grasp type (Guo et al., 2020, 2022). Second, the user, having a functional arm, selects an appropriate grasp position based on the object's location and places the exoskeleton accordingly. Third, force planning is carried out using a slip-grasp force planning method to adjust the grasp force (Guo et al., 2022; Xu et al., 2022). However, this method encounters challenges due to sensor limitations, as discussed in the related work section (Sec. 2.1). To address this issue, a vision-based HMI is proposed, which estimates the object's size and weight, thus aiding in the force planning of the exoskeleton.

Limitations of Force Planning Methods used for Exoskeleton Gloves
Previous research proposed several methods to solve the force planning problem in non-EMG-based HMIs. However, force planning strategies suffer from two problems, as described below.
First, exoskeleton gloves need to grasp objects with unknown shapes, surface material, and weight. Nevertheless, all force planning algorithms require setting up equations with a precise grasp position, friction coefficient, and weight to calculate the optimal contact forces. Vanteddu et al. developed two methods to satisfy two of the conditions required for a stable grasp: preventing deformation of soft objects and maintaining force and moment equilibrium of the objects being grasped. Like exoskeleton gloves, some robotic hands and grippers also face the same problem. Cheng and Orin used the compact-dual linear programming method to find the force distribution for a robotic grasping system called DIGITs. Youshen Xia et al. proposed using recurrent neural networks for grasp force optimization for multi-fingered robotic hands. Xiong and Xiong used an algorithm based on an artificial neural network to determine the joint torques that must be applied to a multi-fingered robotic hand for a successful grasp. However, during normal usage of assistive exoskeleton gloves, the grasping position, object weight, surface material, and object size are almost impossible to determine accurately, thus making the above algorithms difficult to use.
Second, exoskeleton gloves need to predict the grasp force before lifting the object. Previous researchers designed a slip-grasp method to find the appropriate force through trial and error. Lee et al. proposed a slip detection method using a customized pressure sensor to measure slippage at the fingertips of the SAFER exoskeleton glove. A hybrid slip detection method for an exoskeleton glove was proposed by Xu et al. This method uses both Series Elastic Actuators (SEAs) and pressure sensors to enhance its accuracy. The force controller adds force to the fingertips if the object slips. However, the reinforcement process typically results in a tedious grasping process in which the user must keep searching for the optimal grasp force through failures, which is not practical for exoskeleton glove users. Moreover, slip detection on a robotic exoskeleton glove differs from that on a robotic hand or gripper due to space and size limitations. Previous researchers have designed multiple slip detection sensors for robotic hands and grippers and have achieved good results with the slip-grasp force planning method (Romeo and Zollo, 2020; James and Lepora, 2020). However, there is not enough space for larger and more accurate slip detection sensors to be fitted at the fingertips in an exoskeleton glove application. This sensor limitation causes the slip-grasp method to suffer from accuracy issues.

Vision-Based Force Planning on Exoskeleton Gloves
Researchers have previously performed extensive research on vision-based force planning using robotic grippers. Pham et al. used a computer vision system to estimate the pose of the hand and object to assist in force planning. However, their research assumed that the weight of the object is known. Similarly, most vision-based grasping methods focused on position estimation to assist force planning (Yu et al., 2013; Liu et al., 2019; Zhang et al., 2021). Ko et al. and Takamuku and Gomi used an RGB camera to predict the grasp force based on the motion of the object. Their methods are used mainly to improve the synchronicity between the grasp and load forces. However, their methods do not provide an initial prediction of the grasp force. Calandra et al. and Yamaguchi and Atkeson designed vision-based reinforcement learning methods to predict the optimal initial grasp force. However, their methods share performance issues similar to those of the slip detection methods. Initial estimation of the grasping force remains an ongoing research challenge.
Humans can grasp and lift an object without knowing its exact weight, surface material, and size. Studies have shown that even with restricted haptic feedback, humans can still perform a stable grasp based on visual input (Stone and Gonzalez, 2015). Humans can use vision to estimate the grasp force. If the object's actual size, weight, and surface friction coefficient match the estimation, the predicted force will be close to the optimal grasp force. Haptic feedback is used to detect slippage when the estimated force is inaccurate. Humans can adjust the grasp force according to the haptic feedback.
Humans can perform accurate force planning even with restricted haptic feedback. Researchers working on the development of exoskeletons have attempted to capture these biological signals from force planning using EMG or EEG methods to assist force planning (Bronks and Brown, 1987; Artemiadis and Kyriakopoulos, 2008). However, these methods require conversion of the user's intention to biological signals to create control outputs, which suffer from low signal-to-noise ratios, significant processing time, and long reaction times. This research is inspired by the human force planning method. Instead of capturing the EEG or EMG signal, this paper proposes a computer vision-based HMI that mimics a human grasping procedure to directly estimate the size, weight, and surface material of an object and calculate the initial grasp force based on static force analysis.

Material Recognition in the Wild and MINC-2500 Dataset

HMI System Overview
The vision-based force planning method is designed to grasp an object without the need for detailed measurements in advance. The goal is to find the initial grasp force by estimating the size, shape, weight, and surface material of the object to be grasped. This vision-based initial grasp force estimation method uses voice input from a microphone to initiate grasping and releasing (input voice commands: "grasp" and "release"). Such a voice command system was proposed by Guo et al. (2022). After receiving a grasp command, the camera embedded in the glasses starts to take pictures, and the following three steps are performed on the image to calculate the initial grasp force.
(1) The input images are sent to an object detector trained on the Common Objects in Context (COCO) dataset.This step will help the vision-based force planning method to understand the environment by detecting all objects in the view and extracting the target object using an ARUCO marker on the exoskeleton glove (ARUCO marker is shown in Fig. 2).In this step, the target object category and size are acquired and the grasp type is determined according to the target object's category.
(2) The surface material of the target object is acquired by performing a material classification or material segmentation on the image patch of the target object.Given the object's size and surface material, the object's weight can be estimated.
(3) The initial grasp force is calculated based on the spatial location of the exoskeleton, the surface material of the target object, and the weight.
The initial grasp force is then sent to the exoskeleton. The SEAs and FSRs on the exoskeleton glove detect slip while applying the predicted initial grasp force, and the slip-grasp method adjusts the grasp force as needed. The structure of the vision-based force planning method is shown in Fig. 2. Sample images for the exoskeleton grasping environment, object category, and object material are shown in Fig. 3.

HMI System Characteristics
The proposed HMI has the following characteristics:
(1) The vision HMI is designed specifically for human-guided assistive robotic exoskeleton gloves. In this application, the location of the object relative to the glove is controlled by the user, and the vision HMI generates an initial grasp force to help the exoskeleton grasp target objects.
(2) The initial grasp force generated by the HMI is not the optimal grasp force. For example, a non-transparent plastic cup full of water and an empty plastic cup show no difference in the proposed vision-based estimation system. The estimation system can set a range for the initial grasp force that is not too far from the optimal grasp force to help the system grasp the object.
(3) The vision HMI can detect only objects in the MS COCO dataset because the object detector is trained using MS COCO. Material detection can generalize to the surface material of different objects, but may be limited to certain contexts. The system cannot detect a new material category without additional training.

Object Detection
There are two common approaches for detecting and locating an object in an image: object detection (Gao et al., 2020; Chen et al., 2019; Girshick, 2015; Tan et al., 2020) or instance object segmentation (Liang et al., 2018; Siddique et al., 2021). Object detection requires image annotation using a bounding box during training. The detection result for object detection is a bounding box that contains background information. Thus, object detection is faster during training and inference. Instance object segmentation requires pixel-wise image annotation for training, and the detection result consists of pixels of the object without backgrounds. Object segmentation can better understand the object's shape, but is slower during training and inference than object detection. In this research, object detection was used over object segmentation for two reasons.
(1) Object detection is faster than object segmentation during the inference process. Two-stage object segmentation first detects the object in a bounding box and then extracts the object pixels from the background. Single-stage object segmentation uses an encoder network to find the object and a decoder to propagate the object's pixels. Both methods need additional calculations during inference and are therefore slower than object detection using bounding boxes. The need for speed in this application necessitated the use of object detection instead of object segmentation.
(2) Object detection techniques have better data availability. Object detection does not necessitate pixel-level labeling, and this study may address the difficulty of grasping items that are not included in publicly accessible datasets. To detect uncommon objects in a small-scale project, transfer learning or fine-tuning on a public dataset is usually employed. Therefore, object detection techniques are used in this research as they require less annotation and have better data availability.
The state-of-the-art object detection methods are based on Single Shot Detector (SSD) (Chen et al., 2019), Faster R-CNN (Girshick, 2015), EfficientDet (Tan et al., 2020), and YOLOV4 (Gao et al., 2020). Researchers have previously tested these methods on the COCO dataset (Lin et al., 2014). The inference speed and Mean Average Precision (mAP) at 50% Intersection over Union (IOU) of seven different object detection methods are compared on the collected validation dataset in order to select the most suitable object detection method. Sample images of the collected validation dataset are shown in Fig. 3. The experimental results are shown in Fig. 8. According to the experiments, YOLOV4 was selected as the object detection method used in this research; it balanced speed and mAP better than the other methods.
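To make the 50% IOU criterion concrete, the following is a minimal sketch of how a predicted bounding box could be matched against a labeled box. The function names and the `(x_ul, y_ul, x_lr, y_lr)` box layout are assumptions chosen for illustration, not the evaluation code used in this research.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_ul, y_ul, x_lr, y_lr) in pixels."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, pred_label, gt_box, gt_label, threshold=0.5):
    """A detection counts as correct when the labels agree and IOU >= threshold."""
    return pred_label == gt_label and iou(pred_box, gt_box) >= threshold
```

A detection that passes this check contributes to the precision and recall values from which the average precision of each class is computed.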

Size Estimation for Target Object
The data output from object detection will be an object category vector $\mathbf{c}$, an object bounding box vector $\mathbf{B}$, and an object center vector $\mathbf{S}$. The $n$-th object detected in an image belongs to category $c_n$. For the $n$-th object detected in an image, the object's bounding box ${}^{n}\mathbf{b}$ is the combination of the upper left corner ${}^{n}\mathbf{p}_{ul} = ({}^{n}x_{ul}, {}^{n}y_{ul})$ and the lower right corner ${}^{n}\mathbf{p}_{lr} = ({}^{n}x_{lr}, {}^{n}y_{lr})$. For the $n$-th object detected in an image, the center pixel of the detected object is located at ${}^{n}\mathbf{s}$, calculated from the bounding box ${}^{n}\mathbf{b}$ as the midpoint of the two corners.
The target object is selected on the basis of the distance to the ARUCO marker located on the exoskeleton glove. The output of the ARUCO Application Programming Interface (API) contains the center coordinate of the marker: $\mathbf{s}_m = (x_m, y_m)$.
The exoskeleton glove used in this research is right-handed with the ARUCO marker placed on the index finger linkage (see Fig. 3). The object to be grasped is likely to be on the lower right of the ARUCO marker. A weighted distance function was customized to find the distance $d_n$ between the ARUCO marker center coordinate $\mathbf{s}_m$ and the center ${}^{n}\mathbf{s}$ of the $n$-th detected object. In this function, $w_0$ is the weight that serves as the penalty for an object located on the right of the marker, and $w_1$ is the weight that serves as the penalty for an object located above the marker. $({}^{n}x, {}^{n}y)$ is the coordinate of the center of the object taken from the object center vector ${}^{n}\mathbf{s}$. The grasped object's index $i$ can be found by minimizing the customized distance function:
$$i = \arg\min_{n} d_n$$
The category of the target object is $c_i$, the bounding box is ${}^{i}\mathbf{b}$, and the center coordinate is ${}^{i}\mathbf{s}$.
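The selection of the target object by minimizing a weighted distance to the marker can be sketched as follows. The exact form of the weighted distance is an assumption here: the sketch scales the horizontal offset by one weight when the object lies to the right of the marker and the vertical offset by another when it lies above the marker, following the penalty description in the text. The weight values and function names are hypothetical.

```python
import math


def bbox_center(bbox):
    """Center of a bounding box given as (x_ul, y_ul, x_lr, y_lr)."""
    x_ul, y_ul, x_lr, y_lr = bbox
    return ((x_ul + x_lr) / 2.0, (y_ul + y_lr) / 2.0)


def weighted_distance(marker_center, object_center, w0=2.0, w1=2.0):
    """Assumed weighted distance between marker and object centers (pixels)."""
    dx = object_center[0] - marker_center[0]
    dy = object_center[1] - marker_center[1]
    wx = w0 if dx > 0 else 1.0  # penalty: object to the right of the marker
    wy = w1 if dy < 0 else 1.0  # penalty: object above the marker (image y grows downward)
    return math.hypot(wx * dx, wy * dy)


def select_target(marker_center, object_centers, w0=2.0, w1=2.0):
    """Index of the detected object with the smallest weighted distance."""
    if not object_centers:
        raise ValueError("no detected objects")
    distances = [weighted_distance(marker_center, c, w0, w1) for c in object_centers]
    return min(range(len(distances)), key=distances.__getitem__)
```

With this form, two objects at the same Euclidean distance are ranked differently depending on which side of the marker they fall, which is the effect the penalties are meant to achieve.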

Finding the Target Object Size using ARUCO Marker
Theoretically, it is not possible to obtain the exact size of an object without using a stereoscopic camera. However, it was assumed that the ARUCO marker and the target object are at the same distance from the camera. Thus, the size of the target object can be estimated on the basis of the size of the ARUCO marker.
The marker width and height are 2 centimeters. The coordinates are explained in Fig. 4. The coordinates of the detected object's bounding box ${}^{i}\mathbf{b}$ can be transferred from pixel coordinates to camera coordinates, and then to marker coordinates. The Euclidean distance between the points $e$ and $f$ in the marker coordinates is the length of the object ($l$) in centimeters (the points are shown in Fig. 4). The Euclidean distance between points $f$ and $g$ in the marker coordinates is the height of the object ($h$) in centimeters.
The following method can be used to convert points from pixel coordinates to marker coordinates. The ARUCO API outputs the rotation vector ($\mathbf{r}$) in the axis-angle representation and the center coordinate ($\mathbf{t}$) of the marker in the camera coordinates. To transfer a point $\mathbf{p}_p = (u, v)$ from the pixel coordinates to the camera coordinates $\mathbf{p}_c = (x_c, y_c, z_c)$, the following equations are used:
$$x_c = \frac{(u - c_x)\, z_c}{f_x}, \qquad y_c = \frac{(v - c_y)\, z_c}{f_y}$$
where $z_c$ is the distance from the marker to the camera in the camera coordinates, $c_x$ and $c_y$ are the coordinates of the principal point in the camera coordinates (640 and 360 in this application), and $f_x$ and $f_y$ are the focal lengths of the $x$ and $y$ axes in pixels (1184 and 1249 in this application).
To transfer a point $\mathbf{p}_c = (x_c, y_c, z_c)$ from the camera coordinates to the marker coordinates $\mathbf{p}_m = (x_m, y_m, z_m)$, the following equation is used:
$$\mathbf{p}_m = \mathbf{R}^{-1}(\mathbf{p}_c - \mathbf{t})$$
where the Rodrigues formula was used to build the transformation matrix $\mathbf{R}$ from the axis-angle rotation vector $\mathbf{r}$, and $\mathbf{t}$ is the marker coordinate center represented in the camera coordinates.
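The coordinate conversion described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the implementation used in this research: the standard pinhole back-projection is assumed for the pixel-to-camera step, the marker pose is assumed to map marker coordinates to camera coordinates (so the inverse rotation is applied), the translation is assumed to be expressed in centimeters, and the corners chosen for points e, f, and g (upper left, upper right, lower right) are a guess at the layout of Fig. 4.

```python
import math

FX, FY = 1184.0, 1249.0  # focal lengths in pixels (values from the text)
CX, CY = 640.0, 360.0    # principal point (values from the text)


def pixel_to_camera(u, v, z_c):
    """Back-project pixel (u, v) to camera coordinates at depth z_c."""
    return ((u - CX) * z_c / FX, (v - CY) * z_c / FY, z_c)


def rodrigues(r):
    """Rotation matrix from an axis-angle vector r (Rodrigues formula)."""
    theta = math.sqrt(sum(c * c for c in r))
    if theta < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (c / theta for c in r)
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]


def camera_to_marker(p_c, r, t):
    """Apply p_m = R^T (p_c - t); R^T equals the inverse of a rotation matrix."""
    rot = rodrigues(r)
    d = [p_c[k] - t[k] for k in range(3)]
    return tuple(sum(rot[k][i] * d[k] for k in range(3)) for i in range(3))


def object_size_cm(bbox, r, t):
    """Length and height of a bounding box (x_ul, y_ul, x_lr, y_lr).

    Assumes the object lies at the marker depth t[2], as stated in the text.
    """
    x_ul, y_ul, x_lr, y_lr = bbox
    z = t[2]
    e = camera_to_marker(pixel_to_camera(x_ul, y_ul, z), r, t)  # assumed upper left
    f = camera_to_marker(pixel_to_camera(x_lr, y_ul, z), r, t)  # assumed upper right
    g = camera_to_marker(pixel_to_camera(x_lr, y_lr, z), r, t)  # assumed lower right
    return math.dist(e, f), math.dist(f, g)
```

Because the rotation preserves distances, the estimated length and height depend on the marker pose only through the depth `t[2]`.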

Material Classification
There are two common approaches to detect the surface material of an object: image classification based on center pixels and semantic segmentation on the entire image (Bell et al., 2015; Zhang et al., 2017; Zhao et al., 2017). The most widely used material classification datasets are the Flickr Material Database (FMD), MINC, and OpenSurfaces. Only a limited number of pixel-wise annotated images are provided, and most of these annotated images show furniture from the interior of a house, which is very different from this application. Due to the limited availability of annotated data, a pixel-wise supervised classification method such as UNet (Siddique et al., 2021; Zhao et al., 2017) cannot be used. For this application, the center pixel classification method was used to classify the material of a given object image, and the conditional random field (CRF) (Krähenbühl and Koltun, 2011) method was used for segmentation. Material segmentation is used to visualize the classification result. Since this application focuses on grasping everyday objects as shown in Fig. 3, the number of classes in MINC-2500 was reduced from 23 to 5: ceramic, metal, glass, plastic, and wood.

Material Classification Challenges
Initially, the deep learning material classification method was trained and tested on MINC-2500 and achieved good accuracy. The original MINC dataset material patch classification was trained on VGG-16, AlexNet, and InceptionV1 in 2014. The VGG-16 architecture was used as a performance baseline to test newer networks that achieved high classification accuracy in the ImageNet challenge: InceptionResNetV2 and ResNet152V2. Moreover, networks that achieve similar classification accuracy with faster inference speeds were tested: InceptionV3, ResNet50V2, and MobileNetV2. In addition to different network architectures, the NetVLAD pooling method was tested, which is a clustering-based pooling method commonly used in speaker verification, face detection, and place recognition (Arandjelovic et al., 2016).
The weights of the model are transferred from ImageNet, and the training is terminated if the validation loss does not decrease for ten consecutive epochs. The trained model was tested on a small dataset similar to the use case of this application, which contains images from the FMD dataset and images collected online. Some sample images from this dataset are shown in Fig. 7. The dataset contains 169 images for each of the five categories.
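The stopping rule described above can be sketched as a small helper. This is an illustration only: it assumes that "does not decrease" is measured against the best validation loss seen so far, which is one common convention, and the class and function names are hypothetical.

```python
class EarlyStopping:
    """Stop training after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss):
        """Record one epoch's validation loss and report whether to stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def epochs_run(val_losses, patience=10):
    """Number of epochs executed before the rule fires (or all of them)."""
    stopper = EarlyStopping(patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(val_losses)
```

For example, if the validation loss improves for three epochs and then plateaus, this rule ends training at the thirteenth epoch.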
The training results and model performance comparison are shown in Tab. 1. According to the training results, ResNet50V2, MobileNetV2, and InceptionV3 are the top three networks in terms of balancing inference time and accuracy on the MINC-2500 validation set. However, models trained on MINC-2500 do not generalize well to material classification in this application. The context in the MINC dataset is very different from that of this application, which prevents the network from finding the correct label when tested on the collected dataset. The NetVLAD clustering pooling layer also does not improve accuracy. To solve the generalization issue, transfer learning was performed to retrain the model on the collected dataset. Transfer from both ImageNet and MINC-2500 weights was tested. The results are shown in Tab. 2.
The results show that the transfer from MINC-2500 using ResNet50V2 has the best accuracy when testing on the collected dataset. Due to the low generalization accuracy of the MINC-2500 dataset, the MINC-2500 weights were transferred to the collected dataset using the same architecture. The training and inference procedure is shown in Fig. 5.
When inferring on a sample image, the ResNet50V2 network was modified to output a class probability map ${}^{c}P_{[1 \times 5]}$ and a feature-map-sized class probability map ${}^{f}P_{[12 \times 12 \times 5]}$ using Grad-CAM (Selvaraju et al., 2017). The Grad-CAM map is generated as a weighted combination of the feature maps, where ${}^{n}M$ is the $n$-th feature map and ${}^{n}W$ is the weight of the $n$-th feature map.
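The CRF energy function is defined as:
$$E_c(\mathbf{x}) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j)$$
where $E_c(\mathbf{x})$ is the energy function for class $c$, $\mathbf{x}$ is the set of all pixels in image $I$, and $i$ and $j$ are pixel indexes in set $\mathbf{x}$. The indexes $i$ and $j$ control a nested loop to pair each pixel with all other pixels without repetition. $\psi_u(x_i)$ is the unary energy, which is the negative log probability of a pixel belonging to class $c$. $\psi_p(x_i, x_j)$ is the pairwise energy, which measures the pixels' spatial and color similarity. The unary and pairwise energies are defined in the following equations:
$$\psi_u(x_i) = -\log\left({}^{i}_{p}P_c\right)$$
$$\psi_p(x_i, x_j) = \exp\left(-\frac{\lVert p_i - p_j \rVert^2}{2\theta_\alpha^2} - \frac{\lVert {}^{i}I - {}^{j}I \rVert^2}{2\theta_\beta^2}\right)$$
where ${}^{i}_{p}P_c$ is the pixel-level probability of the $i$-th pixel in the image belonging to class $c$, $p_i$ and $p_j$ are the positions of the $i$-th and $j$-th pixels, and ${}^{i}I$ and ${}^{j}I$ are the RGB values of the $i$-th and $j$-th pixels. Long-range connections were used in the energy calculation. Thus, the pairwise energy contains only the appearance kernel. $\theta_\alpha$ and $\theta_\beta$ are the position similarity and color similarity parameters, respectively. The parameter values $\theta_\alpha$ and $\theta_\beta$ were chosen to be 60 and 10, respectively, based on Krähenbühl and Koltun. The result of the CRF algorithm is an updated pixel-level probability map ${}^{crf}P_{[362 \times 362 \times 5]}$.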
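The unary and pairwise terms can be illustrated with a brute-force energy evaluation on a tiny image. This sketch makes several assumptions that are not confirmed by the text: it evaluates the energy of a full labeling, it uses a Potts compatibility (the pairwise term is counted only when two pixels receive different labels, as in the fully connected CRF of Krähenbühl and Koltun), and all function names are hypothetical. The nested loop over all pixel pairs is quadratic and is only practical for a toy example.

```python
import math

THETA_ALPHA = 60.0  # position similarity parameter (value from the text)
THETA_BETA = 10.0   # color similarity parameter (value from the text)


def appearance_kernel(pos_i, pos_j, rgb_i, rgb_j):
    """Similarity of two pixels in position and color, in (0, 1]."""
    d_pos = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    d_col = sum((a - b) ** 2 for a, b in zip(rgb_i, rgb_j))
    return math.exp(-d_pos / (2 * THETA_ALPHA ** 2) - d_col / (2 * THETA_BETA ** 2))


def crf_energy(labels, probs, positions, colors):
    """Energy of a labeling: unary terms plus Potts-weighted pairwise terms.

    labels[i] is the class index of pixel i, probs[i][c] the probability of
    pixel i belonging to class c, positions[i] its (x, y), colors[i] its RGB.
    """
    n = len(labels)
    unary = sum(-math.log(probs[i][labels[i]]) for i in range(n))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair once
            if labels[i] != labels[j]:  # assumed Potts compatibility
                pairwise += appearance_kernel(
                    positions[i], positions[j], colors[i], colors[j]
                )
    return unary + pairwise
```

In this form, neighboring pixels with similar colors are encouraged to share a label, which is what smooths the coarse probability map into a segmentation.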
The classification result is the class with the maximum value in the ${}^{c}P$ class probability map. The result can be directly used to estimate the grasp force. The segmentation results can be used to perform pixel-wise classification when the target object contains different materials. The sample segmentation results and classification accuracy are available in Sec. 8.4.

Weight Estimation
The estimated size and material of the target object can be obtained based on the methods described in the previous sections. However, this information is insufficient to estimate the weight, and some assumptions need to be made in order to calculate the volume of the target object.
The target object in this application can be classified into four different categories: fork/spoon, bottle/cup/wine glass, sports ball, and apple/cell phone. The weight of an apple or a cell phone does not vary much with size; thus, the average weight of an apple or a cell phone can be used as the weight of the target object. Sports balls are usually very light, so it was assumed that a sports ball weighs 20 grams if it has a diameter less than 5 cm, 100 grams if it has a diameter between 5 and 10 cm, and 250 grams if the diameter is larger than 10 cm.
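The sports ball rule above is a simple lookup on the estimated diameter. A minimal sketch follows; treating the 5 cm and 10 cm boundaries as part of the middle range is an assumption, and the function name is hypothetical.

```python
def sports_ball_weight_g(diameter_cm):
    """Assumed weight of a sports ball in grams from its estimated diameter."""
    if diameter_cm < 5:
        return 20.0
    if diameter_cm <= 10:  # boundaries assigned to the middle range (assumption)
        return 100.0
    return 250.0
```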
The shape of a spoon or fork can be simplified to a plate with a thickness of 0.1 cm. Thus, the weight of a spoon or fork can be estimated using the following:
$$V_s = 0.1\, l\, h$$
$$W_s = \rho\, V_s$$
where $l$ and $h$ are the estimated width and height of the target object, respectively, $\rho$ is the density of the material of the target object, $V_s$ is the volume of the object, and $W_s$ is the weight of the target spoon or fork.
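As a worked illustration of the plate approximation, the sketch below multiplies the plate volume by a material density. The density table holds rough, illustrative values in g/cm³ that are assumptions and are not taken from this research; the function name is also hypothetical.

```python
# Illustrative densities in g/cm^3 (assumed values, not from the source).
DENSITY_G_PER_CM3 = {
    "metal": 7.9,
    "ceramic": 2.4,
    "glass": 2.5,
    "plastic": 1.0,
    "wood": 0.7,
}

PLATE_THICKNESS_CM = 0.1  # plate thickness stated in the text


def flatware_weight_g(width_cm, height_cm, material):
    """Weight of a spoon or fork modeled as a thin plate of the given material."""
    volume_cm3 = PLATE_THICKNESS_CM * width_cm * height_cm
    return DENSITY_G_PER_CM3[material] * volume_cm3
```

With these assumed densities, a 3 cm by 18 cm metal spoon comes out at roughly 43 grams.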
The shape of a bottle, cup, or wine glass can be simplified to a hollow truncated cone. It is assumed that the truncated cone has 2/3 of the volume of a cylinder of the same height. The wall thickness is assumed to be 0.2 cm. Thus, the weight of a bottle when filled with water can be estimated using the following:
$$V_b = V_o - V_i$$
$$W_b = \rho\, V_b + \rho_w\, V_i$$
where $V_b$ is the volume of the material that forms the bottle, $V_o$ is the outer volume, $V_i$ is the inner volume, $W_b$ is the weight of the bottle, $\rho$ is the density of the material of the bottle, and $\rho_w$ is the density of water.
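The hollow truncated cone model can be sketched as follows. The way the inner dimensions are derived from the outer ones is an assumption made for this illustration: the diameter is reduced by twice the wall thickness and the height by one wall thickness (a closed bottom and an open top). The function names are hypothetical.

```python
import math

WALL_THICKNESS_CM = 0.2   # wall thickness stated in the text
WATER_DENSITY = 1.0       # g/cm^3
CONE_FACTOR = 2.0 / 3.0   # truncated cone volume as a fraction of the cylinder


def truncated_cone_volume(diameter_cm, height_cm):
    """Approximate volume: 2/3 of a cylinder with the same diameter and height."""
    return CONE_FACTOR * math.pi * (diameter_cm / 2.0) ** 2 * height_cm


def bottle_weight_g(width_cm, height_cm, material_density, filled=True):
    """Weight of a bottle from its outer size and material density (g/cm^3)."""
    v_outer = truncated_cone_volume(width_cm, height_cm)
    # Assumed inner geometry: walls on both sides, bottom only.
    v_inner = truncated_cone_volume(
        max(width_cm - 2 * WALL_THICKNESS_CM, 0.0),
        max(height_cm - WALL_THICKNESS_CM, 0.0),
    )
    weight = material_density * (v_outer - v_inner)
    if filled:
        weight += WATER_DENSITY * v_inner
    return weight
```

Because the vision system cannot see the fill level of an opaque container, evaluating both `filled=True` and `filled=False` gives a range for the weight rather than a single value.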
The weight of a cup can be estimated similarly to that of a bottle. The only difference is that a cup might have a handle, which makes the volume calculation inaccurate. The handle was assumed to account for 30% of the weight of the cup. Thus, the weight of a cup when full of water can be estimated using a piecewise expression that depends on whether $h \geq l$, where $V_c$ is the volume of the material that forms the cup, $V_o$ is the outer volume, $V_i$ is the inner volume, $W_c$ is the weight of the cup, $\rho$ is the density of the material of the cup, and $\rho_w$ is the density of water. A wine glass is a special cup with a long stem, so it was assumed that the capacity of the glass is 50% of that of a normal cup. The weight of a wine glass when full of water is estimated on that basis.

Initial Grasp Force Calculation
The initial grasp force is calculated based on the predicted weight and the shape of the standard object. Fig. 6 illustrates the coordinate systems for grasping force initialization. The origin of the world coordinates is placed at the center of the object. The exoskeleton glove coordinates are located at the center of the Inertial Measurement Unit (IMU). The IMU is calibrated to align with the world coordinates at the beginning. Assuming that there is no torque applied on the object and the contact forces are normal to the last link of each of the exoskeleton fingers, for an arbitrary object, the force equilibrium equation can be expressed as:
$$\sum_{i} {}^{W}\mathbf{R}_{E}\, {}^{E}\mathbf{R}_{i}\, \mathbf{F}_{i} + m\,\mathbf{g} = \mathbf{0}$$
where $i$ indexes the five fingertips of the glove, ${}^{W}\mathbf{R}_{E}$ is the rotation matrix from the exoskeleton glove coordinates to the world coordinates, which is calculated based on readings from the IMU, and ${}^{E}\mathbf{R}_{i}$ is the rotation matrix from fingertip $i$ to the exoskeleton glove coordinates, which is calculated based on the forward kinematics of the glove (Xu et al., 2020). $\mathbf{F}_{i}$ is the vector of the contact force applied on fingertip $i$, which is measured based on a calibrated Linear Series Elastic Actuator (LSEA) (Guo et al., 2021). $m$ is the mass of the object, and $\mathbf{g}$ is the vector of gravitational acceleration.
Fig. 6 The coordinate systems for initial force estimation. WCS: world coordinate system. ECS: exoskeleton glove coordinate system. ICS: $i$-th fingertip coordinate system.
For the cylinder grasp and the tip grasp, the direction of the friction force on each fingertip is always opposite to gravity. Therefore, the above equation can be simplified to

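As a rough numerical illustration of the simplified case, the sketch below assumes Coulomb friction, identical normal forces at every fingertip contact, and friction acting vertically against gravity. The function name, the friction coefficient of 0.5, the two-contact configuration, and the safety factor are illustrative assumptions and are not taken from this work; only the 512 g bottle mass comes from the grasp experiments.

```python
G = 9.81  # gravitational acceleration in m/s^2


def min_normal_force_per_contact(mass_kg, mu, n_contacts, safety_factor=1.0):
    """Equal normal force (N) per contact so that friction can carry the weight.

    Assumes Coulomb friction (f <= mu * N) with friction opposing gravity at
    every contact. All parameter values used below are hypothetical.
    """
    if mass_kg < 0 or mu <= 0 or n_contacts <= 0 or safety_factor <= 0:
        raise ValueError("mass must be >= 0; mu, n_contacts, safety_factor must be > 0")
    return safety_factor * mass_kg * G / (mu * n_contacts)


# 512 g water bottle, assumed friction coefficient 0.5, two opposing contacts
force = min_normal_force_per_contact(0.512, mu=0.5, n_contacts=2)
print(f"{force:.2f} N per contact")
```

Under these assumptions, a lower friction coefficient or fewer contacts raises the required force proportionally, which is why a surface material estimate is useful before the first grasp attempt.
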
Experimental Results
The experiments comprised three parts. First, the datasets used for object detection validation and material classification are introduced. Second, the performance of object detection, size estimation, and material classification on these datasets is assessed. Finally, the vision-based HMI is integrated as an extension of the slip-grasp force planning method for the exoskeleton glove. The experiments were designed to compare the combined vision and slip-grasp approach against the slip-grasp force planning method alone.

Datasets
Two small datasets were built to verify this application: one for validating the vision-based grasp force planning method (the object detection dataset) and one for transfer learning in material classification (the material classification dataset).
The dataset for vision-based grasp force planning method validation contains 30 images taken with a pair of 1080P SVWSUN video glasses worn by an exoskeleton glove user. Each grasp object is labeled using a bounding box. Sample images are shown in Fig. 3.
The dataset for transfer learning consists of five labels: ceramic, plastic, metal, wood, and glass. Each class has a training set of 119 images, a testing set of 30 images, and a validation set of 20 images. Each image is labeled on the basis of the material at the center of the object. This dataset contains images from an online image search, the FMD dataset, and images taken of the grasp objects used in this research. Sample images are shown in Fig. 7. Images in this dataset have more detail and less context than images in MINC-2500.

Object Detection and ARUCO Marker Detection
The labeled object detection validation dataset was used to test the performance of different networks trained on the COCO dataset. The mean Average Precision (mAP) at 50% Intersection over Union (IoU) was used to quantify object detection performance. The speed was measured as the average inference time over 10 images using the E5-1260 CPU. The results are shown in Fig. 8. Multiple networks were tested, and YOLOv4 with a 0.75 threshold was selected based on mAP and speed.
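The IoU criterion behind the mAP metric can be sketched as follows. This is a minimal illustration, assuming axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; the function names are hypothetical and are not part of the evaluation code used in this work.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches_ground_truth(predicted_box, truth_box, iou_threshold=0.5):
    """A detection counts as correct at 50% IoU when the overlap reaches the threshold."""
    return iou(predicted_box, truth_box) >= iou_threshold


print(matches_ground_truth((0, 0, 10, 10), (1, 1, 11, 11)))
```
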
The successful detection rate of object detection and ARUCO marker detection can be calculated using the following equation: where $TP$ is the number of true positives, meaning that the ARUCO API successfully detects the marker and the object detector successfully identifies the center object; $FP$ is the number of false positives, meaning that the marker detection recognized the wrong marker or the object detector detected the wrong center object; and $N$ is the total number of test images. The successful detection rate on the collected object detection validation dataset was 90%.

Object Size Estimation
The experiment evaluated size estimation on the object detection validation dataset by comparing the detected target object's size with the ground truth size. For this purpose, the images successfully detected by both the YOLOv4 object detector and the ARUCO marker detector were used. This subset comprised 27 images featuring 15 different objects observed from various angles. To obtain the predicted size for each object, the average of the estimated sizes from different angles was taken. The ground truth sizes were determined based on the width and height of the orthographic projection, as illustrated in Fig. 9.
The obtained results are presented in Tab. 3. To quantify the difference between the predicted and actual object sizes, the percentage difference between the products of width ($w$) and height ($h$) was calculated and averaged as the Mean Absolute Percentage Error (MAPE). The MAPE between the predicted and actual object sizes was found to be 26.9%. The main source of this error was the estimation process itself, particularly the use of the bounding box to estimate the object's dimensions. This error tends to occur when the object is placed at an angle during detection.
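The size metric described above can be sketched as follows. This is a minimal illustration, assuming each size is a `(width, height)` pair in centimeters; the function name and the sample values are hypothetical and are not entries from Tab. 3.

```python
def size_mape(predicted_sizes, actual_sizes):
    """MAPE (in percent) between predicted and actual width * height products."""
    if len(predicted_sizes) != len(actual_sizes) or not actual_sizes:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = []
    for (pred_w, pred_h), (true_w, true_h) in zip(predicted_sizes, actual_sizes):
        true_area = true_w * true_h
        errors.append(abs(pred_w * pred_h - true_area) / true_area)
    return 100.0 * sum(errors) / len(errors)


# Made-up sizes in centimeters, for illustration only
predicted = [(10.0, 20.0), (5.0, 5.0)]
actual = [(10.0, 25.0), (5.0, 4.0)]
print(f"MAPE: {size_mape(predicted, actual):.1f}%")
```

Because the error is computed on the width-height product, a modest error in each dimension compounds, which is consistent with larger errors for objects seen at an angle.
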

Object Material Detection
The training and testing results on the proposed material classification dataset are shown in Tab. 2. ResNet50V2 was selected as the material classification network based on its classification accuracy and speed. Its weights were transferred from training on the MINC-2500 dataset.
Material classification validation was also performed on the object detection dataset. The material classification accuracy for all detected objects was 96%. In addition to material classification, material segmentation was performed using the CRF method to visualize the result of material classification. Sample images of material segmentation are shown in Fig. 10.

Object Weight Estimation
The experiments on the object detection validation dataset compared the estimated weight of each target object with its ground truth weight. The 27 successfully detected images from the size estimation experiment were used, and the weight estimates relied on the sizes obtained in the previous section. The materials of the objects had different densities: plastic (0.92 g/cm³), metal (7.85 g/cm³), glass (2.7 g/cm³), ceramic (6 g/cm³), and wood (0.9 g/cm³). The results of these experiments are presented in Tab. 4. It is worth noting that the weight of the containers varied due to differences in the fluid level. For consistency, it was assumed that all containers were full. To assess the accuracy of the weight estimation, the Mean Absolute Percentage Error (MAPE) was employed as the evaluation metric. The MAPE between the predicted and actual object weights was found to be 59.8%. The relatively large weight estimation error can be attributed to the following factors. First, weight estimation is heavily influenced by size estimation, which in turn can be affected by the angle at which the object appears in the camera. Second, the assumption of standard shapes for all objects, such as cylinders or boxes, does not hold in most cases: cups might have handles and wine glasses have long stems, leading to deviations from the standard shapes used in the estimation process. Nevertheless, despite some instances of substantial percentage errors, the overall weight difference remains acceptable. For instance, the metal fork had a weight estimation error of 35 g, representing a 159.1% overestimation compared to its actual weight. The average weight difference across all objects is only 173 g, which still provides meaningful information for initial grasp force planning.
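To illustrate how the listed densities turn an estimated size into a weight, the sketch below treats a container as an open-top hollow cylinder filled with water. This is only an assumed shape model: the wall thickness parameter, the function name, and the example dimensions are hypothetical, and the water density of 1.0 g/cm³ is a standard value rather than one stated in this section.

```python
import math

# Material densities in g/cm^3, as listed in the text above
DENSITY_G_PER_CM3 = {"plastic": 0.92, "metal": 7.85, "glass": 2.7, "ceramic": 6.0, "wood": 0.9}
WATER_DENSITY_G_PER_CM3 = 1.0  # standard value, assumed here


def full_container_weight_g(material, outer_diameter_cm, height_cm, wall_cm):
    """Weight in grams of a water-filled, open-top hollow cylinder (assumed shape)."""
    outer_radius = outer_diameter_cm / 2.0
    inner_radius = outer_radius - wall_cm
    inner_height = height_cm - wall_cm  # the closed bottom has the wall thickness
    if inner_radius <= 0 or inner_height <= 0:
        raise ValueError("wall thickness is too large for the given size")
    outer_volume = math.pi * outer_radius ** 2 * height_cm
    inner_volume = math.pi * inner_radius ** 2 * inner_height
    material_volume = outer_volume - inner_volume
    return (DENSITY_G_PER_CM3[material] * material_volume
            + WATER_DENSITY_G_PER_CM3 * inner_volume)


# Hypothetical plastic bottle: 6 cm diameter, 20 cm tall, 1 mm wall
print(f"{full_container_weight_g('plastic', 6.0, 20.0, 0.1):.0f} g")
```

Since the volume scales with the square of the diameter, a size error propagates strongly into the weight, which helps explain why the weight MAPE exceeds the size MAPE.
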

Grasp Experiments
The experimental procedure involving human subjects in this study received approval from the Carilion Clinic Institutional Review Board. Because the exoskeleton glove used in this research is a rigid linkage exoskeleton, the user cannot apply any force to the fingertips of the exoskeleton linkages when wearing it.
The grasp procedure is as follows. The user initiates the system using a personalized voice command system (Guo et al., 2020) to capture a 1280x760 pixel image. By employing the methods proposed in the previous sections, the size and weight of the grasped object are calculated. The 9-DOF MPU-9250 IMU detects the pitch, yaw, and roll of the exoskeleton glove using an AHRS filter. Using the weight of the object and the IMU data, the initial grasp force is computed, and the exoskeleton glove applies this force to each fingertip (Guo et al., 2021). The slip-grasp system is then used to stabilize the grasp. During the experiment, each of the 15 objects in the object detection dataset was subjected to 2-6 grasping attempts from various angles and with various water levels (for containers), resulting in a total of 64 grasp trials. Among these trials, 6 experienced failure of object detection, while 5 encountered errors in material detection. The grasp success rate is defined as the rate of success in picking up the target object. The overall grasp success rate using the vision-based HMI combined with the slip-grasp method was 87.5%.
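The role of the IMU orientation in this procedure can be sketched by expressing gravity in the glove frame. This is a minimal illustration, assuming a ZYX (yaw, pitch, roll) Euler convention and a world frame with the z-axis pointing up; the actual convention of the AHRS filter is not stated here, and the function names are hypothetical.

```python
import math


def rotation_glove_to_world(roll, pitch, yaw):
    """Rotation matrix R = Rz(yaw) * Ry(pitch) * Rx(roll); angles in radians (assumed convention)."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def gravity_in_glove_frame(roll, pitch, yaw, g=9.81):
    """Gravity vector seen from the glove frame: transpose(R) applied to world gravity."""
    rotation = rotation_glove_to_world(roll, pitch, yaw)
    gravity_world = [0.0, 0.0, -g]  # world z-axis assumed to point up
    return [sum(rotation[k][j] * gravity_world[k] for k in range(3)) for j in range(3)]


# Glove rolled by 90 degrees: gravity now acts along the glove's negative y-axis
print(gravity_in_glove_frame(math.pi / 2, 0.0, 0.0))
```

Knowing the direction of gravity relative to the fingertips is what allows the estimated weight to be distributed over the contact forces before the slip-grasp stage takes over.
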

Comparison Between Vision-based Force Estimation and Slip Grasp Force Planning
To demonstrate the effectiveness of the vision-based force estimation method, we performed 64 experiments using only the slip-grasp force planning method and achieved a grasp success rate of 71.9%, while the vision-based method achieved 87.5%. The success rate for each grasp category is shown in Fig. 11.
Fig. 11 Experimental result of grasping daily used objects using the vision-based initial grasp force prediction method and the slip-grasp method. Blue: number of successful grasps performed using the vision-based initial force estimation with the slip-grasp method. Red: number of successful grasps performed using only the slip-grasp method. Yellow: the total number of grasps for each individual method.
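As a quick arithmetic check, the two reported rates are consistent with 56 and 46 successful grasps out of 64 trials. These counts are inferred from the percentages and are not stated explicitly in the text; the function name is illustrative.

```python
def success_rate_percent(successes, trials):
    """Grasp success rate as a percentage of all trials."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    return 100.0 * successes / trials


TRIALS = 64
vision_rate = success_rate_percent(56, TRIALS)  # inferred count, vision + slip-grasp
slip_rate = success_rate_percent(46, TRIALS)    # inferred count, slip-grasp only

print(f"vision + slip-grasp: {vision_rate:.3f}%")
print(f"slip-grasp only:     {slip_rate:.3f}%")
print(f"difference:          {vision_rate - slip_rate:.3f} percentage points")
```

Under this reading, the slip-grasp-only rate of 71.875% rounds to the reported 71.9%, and the combined method succeeds in ten more trials out of 64.
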
The comparison experiment reveals that combining vision-based force estimation with the slip-grasp system leads to a higher success rate than using only the slip-grasp system. To demonstrate the benefits of the vision-based initial force estimation technique, we carried out an additional set of 20 grasp trials involving four distinct items: a plastic bottle, a wine glass, a plastic spoon, and a metal spoon. These particular objects were chosen based on their notable performance in previous grasp experiments.
For the vision-based method, the initial grasp force was determined using the vision-based force estimation system, and the slip-grasp method was not utilized in this experiment. For the slip-grasp method, a predefined initial grasp force of 2 N and torque of 200 N·mm was used. This method adjusted the grasp force based on slippage to achieve a stable grasp (details can be found in Xu et al. (2022)).
The grasping process was facilitated by six Series Elastic Actuators (SEAs), as depicted in Fig. 12. The force and torque output of the index finger and thumb rotary SEAs, which are the most critical actuators during grasping, were measured and reported in Tab. 5.
The results from the additional 20 grasp experiments, presented in Tab. 5 and Fig. 13, demonstrate that the vision-based force estimation system can produce adequate initial grasp forces for various objects. This offers three main advantages during grasping. First, the initial grasp force estimate helps prevent the application of insufficient thumb torque, which can result in slippage. For example, in Fig. 13 (B), the plastic water bottle could not be lifted by the slip-grasp method due to the insufficient predefined thumb torque. Second, the initial grasp force estimate can prevent the application of excessive force and torque. For example, in Fig. 13 (F), the plastic spoon could not be lifted by the slip-grasp method due to excessive fingertip force and thumb torque. Third, even for objects that can be successfully lifted by the slip-grasp method, incorporating a vision-based force estimation system allows for a reduction in the applied force (as shown in Tab. 5), thereby optimizing the grasping process.

Vision-based HMI System Latency
The image processing runs on a desktop server with an E5-1260 CPU, and no GPU is involved. The estimated size, weight, and surface friction coefficient are sent to the exoskeleton's onboard microcontroller, which generates the initial grasp force using IMU data and operates the exoskeleton. The computation time for processing a single image is around 700 ms. This processing time meets the application's requirements, as only one image needs to go through the complete processing pipeline per grasp. The time consumption for processing one image is shown in Tab. 6.

Conclusion
This paper presented a novel vision-based Human-Machine Interface (HMI) aimed at estimating the initial grasp force required to manipulate a target object using an assistive exoskeleton glove designed for patients with Brachial Plexus Injuries.
The proposed approach employed object detection and material classification techniques to predict the initial grasp force, using information about the weight, size, and material of the object. In the validation dataset, the object size estimation produced a mean absolute percentage error (MAPE) of 26.9%, while the object weight estimation showed a MAPE of 59.8%. Although the MAPE of weight and size estimation was relatively high, vision-based initial grasp force estimation still managed to produce a meaningful result to assist grasping.
Fig. 13 Demonstration of grasping daily used objects using the vision-based initial grasp force prediction method and the slip-grasp method. (A) Successfully grasp a 512 g water bottle with the vision system. (B) Failed to grasp a 512 g water bottle using the slip-grasp method due to inadequate thumb torque. (C) and (D) Successfully grasp a 188 g wine glass with both the vision system and the slip-grasp method. (E) Successfully grasp a 3 g plastic spoon with the vision system. (F) Failed to grasp a 3 g plastic spoon using the slip-grasp method due to excessive force and torque. (G) and (H) Successfully grasp a 48 g metal spoon with both the vision system and the slip-grasp method.
The vision-based HMI successfully distinguished between different materials and predicted usable initial grasp forces for objects of varying weights. When integrated with the slip-grasp method, the combined approach attained an 87.5% success rate, outperforming the standalone slip-grasp method (71.9%). These results highlight the importance of estimating the initial grasp force to prevent grasp failures caused by inadequate or excessive application of force and torque.
In conclusion, the proposed vision-based HMI demonstrated the potential to enhance the grasping capabilities of the exoskeleton glove, contributing to improved functionality and usability for patients with Brachial Plexus Injuries. The findings of this experiment pave the way for future advancements in assistive technologies, facilitating more effective and reliable interactions between users and robotic systems.

Fig. 1 The assistive exoskeleton glove used in this research. This assistive exoskeleton glove is designed for patients with BPI. (A) Overview of the exoskeleton glove. (B) The user grasps a water bottle using voice-based HMI. (C) The user grasps a paper box with a tip grasp. (D) The user grasps a plastic pen with a tripod grasp. (E) The user grasps a ceramic bowl with a lateral grasp. (F) The user grasps a plastic ball with a sphere grasp. (G) The user grasps a plastic marker pen with a tripod grasp. (H) The user grasps a plastic bottle with a cylinder grasp.

Fig. 2 Overview of vision-based initial grasp force prediction procedure.

Fig. 3 Sample images for the exoskeleton grasping environment, object category, and object material.

Fig. 4 Illustration of the camera, marker, and pixel coordinates.

Fig. 5 Training and inference procedure for vision-based material classification and segmentation.

Fig. 7 Sample images used in the material classification training.

Fig. 8 Object detection results. (A) mean Average Precision (mAP) at 50% Intersection over Union (IoU) of 7 different state-of-the-art neural networks. (B) mAP vs. average inference time of each neural network.

Fig. 12 Series Elastic Actuators (SEA) are used to apply force on the exoskeleton glove in the grasp experiment.

Table 1
Results of training on MINC-2500 and testing on the collected dataset

Table 2
Performance comparison between transferring ImageNet and MINC-2500 weights to the collected dataset

Table 3
Size estimation experimental results
Predicted*: The predicted size is defined as the width times the height, in centimeters.

Table 4
Weight estimation experimental results

Table 5
Comparison between vision-based force estimation and slip grasp force planning

Table 6
Inference speed of one 1280x760 pixel image using the vision-based HMI
*: The inference time is measured by averaging the inference time of ten images on an E5-1260 CPU.