Cross-Modal Environment Self-Adaptation During Object Recognition in Artificial Cognitive Systems

The cognitive connection between the senses of touch and vision is probably the best-known case of cross-modality. Recent discoveries suggest that the mapping between both senses is learned rather than innate. This evidence opens the door to a dynamic cross-modality that allows individuals to adaptively develop within their environment. Mimicking this aspect of human learning, we propose a new cross-modal mechanism that allows artificial cognitive systems (ACS) to adapt quickly to unforeseen perceptual anomalies generated by the environment or by the system itself. In this context, visual recognition systems have advanced remarkably in recent years thanks to the creation of large-scale datasets together with the advent of deep learning algorithms. However, such advances have not occurred in the haptic mode, mainly due to the lack of two-handed dexterous datasets that allow learning systems to process the tactile information of human object exploration. This data imbalance limits the creation of synchronized multimodal datasets that would enable the development of cross-modality in ACS during object exploration. In this work, we use a multimodal dataset recently generated from tactile sensors placed on a collection of objects that capture haptic data from human manipulation, together with the corresponding visual counterpart. Using this data, we create a cross-modal learning transfer mechanism capable of detecting both sudden and permanent anomalies in the visual channel while still maintaining visual object recognition performance by retraining the visual mode for a few minutes using haptic information. Here we show the importance of cross-modality in perceptual awareness and its ecological capability to self-adapt to different environments.


Humans perceive the environment through multiple senses. A set of sensory information, acquired through each modality, is integrated and transformed into a supra-modal representation. This process requires cross-modality (also referred to as cross-modal transfer or cross-modal matching): the cognitive ability to associate the sensory features acquired independently through multiple senses. Human manipulation of objects, a natural example of cross-modality, connects the senses of sight and touch from an early age, and this sensory connection is strengthened over the course of child development (1) and stays throughout the lifespan (2). In particular, vision and haptics are complementary to each other, improving the credibility of mental representations of object properties and recognition performance (3, 4, 5, 6). In this article, we present an Artificial Cognitive System (ACS) that builds on a cross-modality ability using human manipulation data, achieving perceptual awareness and a dynamic capacity to adapt to changing environments.

The question of how humans achieve cross-modality was sparked off by the 17th-century natural philosopher William Molyneux. In his letter to John Locke, he questioned whether a congenitally blind person who recently gained vision would be able to visually recognize an object previously known only by touch, or whether he/she would need to learn to make the intermodal transfer from touch to vision (7). To this day, the debate over whether this transfer ability is innate or acquired has led scientists to investigate cross-modality in newborns, animals, and congenitally blind individuals (8, 9, 10). Of these studies, a recent cross-modal matching experiment, conducted on congenitally blind individuals who later gained sight as adults, suggests that cross-modality is acquired, dynamic, and moldable rather than innate and predetermined (9). This dynamic nature of cross-modality allows human observers to modulate the strength of the intermodal connection based on the reliability of the information derived from each modality (5), an ability that has also been observed in various animal species, including capuchin monkeys (11), rodents (12), and even bumblebees (13).

In the past few years, there has been growing interest in the development of cross-modality in artificial agents (14), especially robots, as it may facilitate the creation of systems that autonomously adapt to different environments. In particular, with the advancement of visuo-tactile sensors for haptic capture systems, researchers have started investigating the cross-modal connection between vision and touch in robotics (15, 16, 17). The tactile patterns that result from these sensors can be related to the images obtained through the visual mode, thus creating a framework that enables the establishment of cross-modal relationships. By gathering haptic data in the form of images, the dimensional gap between touch and vision features can be successfully overcome (18). Furthermore, the maturity of image recognition methods (e.g. deep learning algorithms) has allowed some progress in the study of the relationship between these two modes. However, existing haptic data is still a long way from what would be a haptic dexterous robot exploration similar to that of humans. To address this problem, researchers have recently designed a glove that, through a mechanoreceptor network, can provide tactile patterns to the system, as well as information related to the dexterity of the human grasp (19). However, that work does not relate haptic data to the visual mode. In this paper, we use a haptic capture system (20) that also leverages information from the human manipulation of objects, but with the ultimate goal of delving deeper into the design of cross-modality in ACS.

To achieve this objective, we designed and printed novel 3D objects that collect human exploration data with multiple capacitive touch sensors on the object surface. With this dataset, our ACS achieves cross-modality via transfer learning from touch to vision. Unlike other approaches (17), where corrupted inputs are incorporated during training time, we present a new mechanism that allows our ACS to use cross-modality to continuously monitor whether the information received from the visual and haptic modes matches, hence being able to detect anomalies (e.g. blurred vision). Given that the two sensory modalities are independent but collaborative, like those of a human, we examine how our

Haptic data and object recognition

We designed a system that captures haptic information generated by humans during the object manipulation process. The collected data is sufficient to create a haptic recognition system that outperforms humans in a classification task with similar 3D shapes (20), both in accuracy and response time. As illustrated in Fig. 1, the objects are six similar shapes we have digitally created and 3D printed (20, 21).

The external surface of each object is totally covered with 24 copper pads, equally distributed and connected to an electronic board placed inside, which also includes a gyroscope. During object manipulation, this system samples data from all the sensors at 40 Hz and sends it to a computer through

and to make sure the presented algorithm does not use the sensors' order to recognize the objects, the sensors are placed in such a way that every position in the haptic state corresponds to a sensor located in the same spatial location on every object, in the same way that touch receptors on the human hand would receive input from corresponding locations on different objects if the relative orientation of the grasping hand and the objects were held constant.

Our haptic dataset is based on these haptic states and their time evolution. The geometry of the objects affects their handling, and this is reflected in our data. To gather our haptic dataset, one participant was invited to manipulate each object with both hands and perform a random exploration task. Four series, each lasting five minutes, were recorded per object.

To perform automatic haptic object recognition, the dataset is divided into two parts: three series for training (15 minutes) and one for testing (5 minutes). To determine the probability of a set of $n$ consecutive haptic states $(h_1, \ldots, h_n)$ belonging to a specific object $S_i$ ($i \in [1, \ldots, 6]$), i.e., $P(S_i \mid h_1, \ldots, h_n)$, we adopt a naïve Bayes approach as follows:

$$P(S_i \mid h_1, \ldots, h_n) \propto P(S_i) \prod_{j=1}^{n} P(h_j \mid S_i), \qquad \hat{S} = \arg\max_i P(S_i \mid h_1, \ldots, h_n)$$

Here the $n$-product over $P(h_j \mid S_i)$ is the naïve (conditional independence) assumption, $P(S_i) = \tfrac{1}{6}$ is the prior probability of each object, and $\hat{S}$ is the resulting prediction.
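As a concrete illustration, the classifier above can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: it assumes haptic states have been discretized into integer IDs and that the likelihoods $P(h \mid S_i)$ are estimated from training counts with Laplace smoothing.

```python
import numpy as np

N_OBJECTS = 6  # six 3D-printed shapes

def fit_likelihoods(train_states, train_labels, n_states):
    """Estimate log P(h | S_i) from labelled training haptic states."""
    counts = np.ones((N_OBJECTS, n_states))            # Laplace smoothing
    for h, y in zip(train_states, train_labels):
        counts[y, h] += 1
    return np.log(counts / counts.sum(axis=1, keepdims=True))

def predict(log_lik, sequence):
    """argmax_i P(S_i) * prod_j P(h_j | S_i), with uniform prior 1/6."""
    log_post = np.full(N_OBJECTS, np.log(1.0 / N_OBJECTS))  # log prior
    for h in sequence:
        log_post += log_lik[:, h]      # accumulate log-likelihoods
    return int(np.argmax(log_post))
```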

Our haptic object recognition system achieves an average accuracy of 89.63% after just 8 seconds of manipulation (the average time for best accuracy in humans (20)), as shown in Fig. 1.

As stated earlier, the 3D printed objects have been produced from a 3D digital render. Moreover, from the data collected by the gyroscope located inside the objects, and using the 3D renders, we synthesize a video with the movements of the objects caused by human manipulation (Fig. 2). We create one video frame for each haptic state; see Methods for details. We select the 5-minute test data series mentioned above for this purpose. This opens the visual channel to our system. Now, the ACS can receive data from the haptic and visual senses simultaneously (Fig. 2).
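The frame-synthesis step can be sketched as follows. This is a hypothetical illustration, not the pipeline from the Methods: `vertices`, `gyro_quaternions`, and `render` are placeholder names, and we assume the gyroscope provides one orientation quaternion per haptic state.

```python
from scipy.spatial.transform import Rotation

def synthesize_frames(vertices, gyro_quaternions, render):
    """One frame per haptic state: orient the 3D render with the gyroscope."""
    frames = []
    for q in gyro_quaternions:               # one (x, y, z, w) sample per state
        rotation = Rotation.from_quat(q)     # gyroscope orientation
        frames.append(render(rotation.apply(vertices)))  # rotate mesh, rasterize
    return frames
```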

Since the ACS has previously had a haptic experience, it can autonomously tag the visual dataset through the results of the already trained haptic object recognition system, creating a cross-modal relationship.
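A minimal sketch of this auto-labelling step, under the assumption that haptic windows and video frames are already synchronized; `haptic_predict` stands in for the trained haptic recognizer above.

```python
def autolabel_visual_dataset(haptic_windows, video_frames, haptic_predict):
    """Tag each synchronized frame with the haptic classifier's prediction."""
    return [(frame, haptic_predict(window))
            for window, frame in zip(haptic_windows, video_frames)]
```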

Obviously, after this transfer of knowledge, we can train a visual object recognition system from the newly labelled visual dataset. We divide the 5-minute visual dataset into three parts: 60% for training, 20% for testing, and the remaining 20% reserved for a later experiment. Using this data, we train a set of ($v_i > \tau$ and $v_j > \tau$ with $i \neq j$), that is, the visual object recognition system assigns the sample to more than one class.

The Molyneux problem addresses the following question: would a person born blind who later regains sight as an adult be able to visually recognize the shapes of objects previously experienced by touch?

Recent empirical studies have pointed out that, upon recovery of sight, subjects are initially unable to recognize these objects visually. However, after they experience the world with both senses, within a few days a cross-modal link is created, allowing them to pass the Molyneux test (9). In the present study, this connection between the two senses equips the ACS with a cognitive mechanism that allows it to autonomously detect a faulty channel.

The aforementioned mechanism, which we have called the Molyneux mechanism, allows the ACS to

In order to test this mechanism and study the effectiveness of artificial cross-modality for perceptual awareness, we stressed our visual channel by applying a blur filter. Once applied, the accuracy of the visual object recognition system went down to approximately 20% for the 6 classes (see Fig. 3A). The ACS detected this anomaly using the Molyneux mechanism, and did so quickly, with an average delay of only 2.13 seconds (85.33 samples) after applying the blur filter to the visual input. In Fig. 4
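The stress test itself can be sketched as follows. The paper does not specify the blur type or kernel, so a Gaussian blur with an illustrative sigma is assumed here.

```python
from scipy.ndimage import gaussian_filter

def blur_frame(frame, sigma=5.0):
    """Degrade an (H, W, 3) frame with a Gaussian blur to stress the visual channel."""
    return gaussian_filter(frame, sigma=(sigma, sigma, 0))  # no blur across RGB
```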

The Molyneux mechanism defined in this article allows the ACS to check the coherence between two synchronized samples (haptic-visual) (Fig. 3B). In other words, this mechanism allows the ACS to answer the question: is what I touch and what I see the same object? As shown in Fig. 4 and detailed in the Methods section, these sudden anomalies are analyzed to differentiate them from those that persist over time. Filtering visual-haptic classification pairs in real time allows the ACS to realize that the visual object recognition system is not working properly when the blur filter is applied.
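A minimal sketch of this coherence check, folding in the ambiguity condition quoted earlier ($v_i > \tau$ and $v_j > \tau$ with $i \neq j$); the value of $\tau$ and the score format are our assumptions.

```python
def molyneux_match(haptic_label, visual_scores, tau=0.5):
    """Return 1 when touch and vision agree on a single object, else 0."""
    above = [i for i, v in enumerate(visual_scores) if v > tau]
    if len(above) != 1:        # ambiguous (v_i > tau and v_j > tau) or empty
        return 0
    return int(above[0] == haptic_label)
```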

A high and stable accuracy of the haptic classifier over time is assumed for this study, as shown in Fig. 1.

The goal of the filtering process is to identify changes in the visual channel. It can be observed (see Fig. 4) that for objects lat00 and lon05 the change to blurred vision is not detected. This is because the visual object recognition system continues to classify these two objects correctly despite the blur, and therefore there is no incoherence of any kind, only a decrease in visual accuracy for these two objects, i.e., the filter does not detect any change in the visual channel.

Realizing that the environment has changed is the first step in the process of self-adapting to it. The proposed design, with two independent object recognition systems and the cross-modality approach, allows the ACS to autonomously adapt to changes in the environment that affect one of its sensory modes using

With the approach proposed in this article, we are aware that we are simplifying the problem by using synthetic images and avoiding the occlusions caused by human hands during object manipulation.

Nonetheless, even though this work is not focused on solving this issue, this apparent problem could be part of the solution, since these occlusions are strongly correlated with haptic data.

Another limitation of this approach is that, in this very first experiment where we have shown the benefits

Although the current trend is to place touch sensors on the end-effectors of robotic hands, the use of sensors on objects is an equally important field of research, especially in obtaining data for ACS. In fact, it seems reasonable to assume there would be a correlation between the data obtained by the introduced haptic capture system and the point cloud that a robotic hand could generate if it could interact with that system.

We hypothesize that placing sensors directly on objects is equivalent to obtaining data from human-like robotic dexterous hands. This would allow us to integrate our ACS with a robotic hand such as Shadow

Methods

In this section, we provide the methods and procedures used in this research article.

It is normal for some short-duration mismatches between the two channels to appear even when both channels are working properly. In order to provide a stable decision regardless of whether a channel is failing or not, a low-pass filter is applied to the output of the Molyneux mechanism.

The low-pass filter consists of a 6th-order Butterworth filter, offering a flat output for the passband frequencies and avoiding ripples. The first 1000 samples of each test file are used to study the duration of the mismatches when there is no failure in either of the two channels, obtaining a mean duration of $\mu = 3.3$ samples and a standard deviation of $\sigma = 9.1$. Since it is desired to achieve a large attenuation for the mismatch frequencies, the cutoff frequency is set a decade below the frequency corresponding to the $\mu + \sigma$ duration. Taking into account the sample rate of 40 Hz, the cutoff frequency can be calculated as:

$$f_c = \frac{1}{10} \cdot \frac{40\,\text{Hz}}{\mu + \sigma} = \frac{1}{10} \cdot \frac{40}{12.4} \approx 0.32\,\text{Hz}$$
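A minimal sketch of this filter design, assuming SciPy; the variable names are ours, and the match/mismatch signal is taken to be the binary output of the Molyneux mechanism.

```python
import numpy as np
from scipy.signal import butter, lfilter

FS = 40.0             # sample rate of the synchronized streams (Hz)
MU, SIGMA = 3.3, 9.1  # mismatch-duration statistics (samples), from above

# Frequency of a mismatch lasting (mu + sigma) samples, then one decade below.
f_mismatch = FS / (MU + SIGMA)   # ~3.23 Hz
CUTOFF = f_mismatch / 10.0       # ~0.32 Hz

# 6th-order Butterworth low-pass filter (flat passband, no ripples).
b, a = butter(6, CUTOFF, btype="low", fs=FS)

def smooth(match_signal):
    """Low-pass filter the binary match signal (1 = match, 0 = mismatch)."""
    return lfilter(b, a, np.asarray(match_signal, dtype=float))
```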

After applying the filter, the output of the Molyneux mechanism fluctuates between 0 and 1 (see the orange plot in Fig. 4), where 0 corresponds to channels not matching (failure in one channel) and 1 to channels matching. In order to offer a binary response, such as the one shown in Fig. 4, the filter output goes through a hysteresis cycle, where the output goes from 0 to 1 if the input is higher than 0.8 and from 1 to 0 if the input is lower than 0.2.
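A minimal sketch of this hysteresis stage, using the 0.8/0.2 thresholds given above; the initial state (channels assumed to match at start-up) is our assumption.

```python
def hysteresis(filtered, high=0.8, low=0.2, initial=1):
    """Binarize the filtered output: up at > high, down at < low."""
    state, out = initial, []
    for x in filtered:
        if state == 0 and x > high:
            state = 1                 # channels match again
        elif state == 1 and x < low:
            state = 0                 # sustained mismatch: channel failure
        out.append(state)
    return out
```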