Cortical hemodynamic mechanisms of reversal learning using high-resolution functional near-infrared spectroscopy: A pilot study.

OBJECTIVES
Reversal learning is widely used to analyze cognitive flexibility and characterize behavioral abnormalities associated with impulsivity and disinhibition. Recent studies using fMRI have focused on regions involved in reversal learning with negative and positive reinforcers. Although the frontal cortex has been consistently implicated in reversal learning, few studies have focused on whether reward and punishment may have different effects on lateral frontal structures in these tasks.


METHODS
During this pilot study on eight healthy subjects, we used functional near-infrared spectroscopy (fNIRS) to characterize brain activity dynamics and differentiate the involvement of frontal structures in learning driven by reward and punishment.


RESULTS
We observed functional hemispheric asymmetries between punishment and reward processing by fNIRS following reversal of a learned rule. Moreover, the left dorsolateral prefrontal cortex (l-DLPFC) and inferior frontal gyrus (IFG) were activated under the reward condition only, whereas the orbito-frontal cortex (OFC) was significantly activated under the punishment condition, with a tendency towards activation for the right cortical hemisphere (r-DLPFC and r-IFG). Our results are compatible with the suggestion that the DLPFC is involved in the detection of contingency change. We propose a new representation for reward and punishment, with left lateralization for the reward process.


CONCLUSIONS
The results of this pilot study provide insights into the indirect neural mechanisms of reversal learning and behavioral flexibility and confirm the use of fNIRS imaging in reversal-learning tasks as a translational strategy, particularly in subjects who cannot undergo fMRI recordings.


Introduction
Humans must be able to adapt to changes in their environment. This requires rapidly adjusting responses to voluntarily inhibit or alter established behavior (prepotent response) (Ghahremani et al., 2010). Paradigms such as reversal-learning tasks (RLTs) can be used to measure behavioral flexibility (Chamberlain et al., 2008; Ghahremani et al., 2010; Izquierdo et al., 2017; Xue et al., 2013). The RLT paradigm provides an approach to measure a participant's capacity to select an appropriate behavior (i.e. response) when the rules of the environment are modified. For example, participants must first choose one stimulus (e.g. picture or action) associated with the desired outcome (to win money, for example, as positive feedback). Then, the rule is altered: the stimulus associated with the positive feedback changes (a reversal occurs), and participants must select the new correct stimulus related to the desired outcome to appropriately update the response (Ghahremani et al., 2010). Impairments in reversal-learning processes are associated with a wide range of behavioral abnormalities and psychiatric conditions characterized by impulsiveness and disinhibition, such as obsessive-compulsive disorder (OCD) (Chamberlain et al., 2008).
Functional near-infrared spectroscopy (fNIRS) is used non-invasively in a natural environment in human infants and adults to analyze cortical activation (Wilcox and Biondi, 2015; León-Carrión and León-Domínguez, 2012; Mahmoudzadeh et al., 2013) or in pathological situations (Roche Labarbe et al., 2008). Brain activation is associated with neurovascular coupling that induces hemodynamic changes, modifying the optical properties of brain tissue, which can be assessed by fNIRS (León-Carrión and León-Domínguez, 2012; Jobsis et al., 1977). Here, we used the high temporal resolution of fNIRS relative to fMRI (Cui et al., 2011) to investigate the cortical hemodynamic response to neuronal activation by analyzing the changes in oxygenated and deoxygenated hemoglobin concentrations (HbO and HbR) induced by the RLT. We expected to observe rightward activation, at least in the DLPFC, under the punishment condition and leftward activation asymmetry under the reward condition during the RLT. Moreover, we aimed to confirm the involvement of the DLPFC, IFG, and OFC in an RLT, as observed by fMRI (Nagahama et al., 2001; Cools et al., 2002; Remijnse et al., 2005; Ghahremani et al., 2010; Xue et al., 2013; Izquierdo et al., 2017). Thus, we aimed to characterize the involvement of the DLPFC and the IFG during the reversal process and differentiate their activity under reward and punishment learning conditions using an innovative, noninvasive, portable solution based on high-density fNIRS. To date, no one has studied the RLT paradigm using this approach, and replicating some of the fMRI findings using fNIRS is of particular translational value for future studies in subjects who cannot be recorded in fMRI, e.g. subjects with brain electrodes.

Participants
Eight healthy right-handed participants (6 women and 2 men) aged between 22 and 56 years were enrolled in the study. All subjects had normal or corrected to normal vision and no history of neurological disease. They were asked to sit comfortably and limit their head movements as much as possible while performing the experiment. The experiment was conducted in compliance with the Code of Ethics of the World Medical Association (Declaration of Helsinki) and the protocol of the study was approved by the local ethics committee and the Comité de la Protection des Personnes (NO II N°2013-A01297-38). Informed written consent was obtained from each subject before the experiment.

Task
During the study, participants sat in a chair 30 cm from a 23-inch computer monitor in a dark room. They also had a tablet with a keyboard. The task was composed of 400 stimuli (including 80 trials for the training blocks) and lasted approximately 25 min. It was divided into six blocks (2 training blocks followed by 4 task blocks). The two training blocks consisted of one positive and one negative training block with 40 positive and 40 negative stimuli, respectively. Each of the two training blocks had a duration of 180 s. After the two training blocks, the task continued with four "process" blocks. Each process block had a duration of 360 s and was composed of 80 randomized stimuli (40 positive and 40 negative). The organization of the task is shown in Fig. 1. Every trial began with the presentation of a pair of animal pictures from the IAPS (Lang et al., 2008), of neutral valence and intensity, presented on a grey background for a maximum of 300 ms. Each trial was followed by a fixation-cross interstimulus interval lasting 1000-1500 ms (randomly determined). The positions of the animal pictures were randomized (one at the top, the other at the bottom of the screen). During the 300 ms, participants had to decide behind which animal money was hidden. Once the participant had responded, feedback was delivered (Fig. 1). There were two types of trials: positive feedback pairs and negative feedback pairs. For the positive feedback pairs, one picture was associated with symbolic monetary gain feedback (+100$) and the other with neutral feedback (0$). For the negative feedback pairs, one picture was associated with symbolic monetary loss feedback (-100$) and the other with neutral feedback (0$). The order of presentation of the positive and negative trials was randomized. Reversal occurred after four to six correct answers for both the positive and negative feedback trials.
In other words, after a certain period of time, the picture associated with a reward in the positive pair became associated with neutral feedback and the other picture from the positive pair became associated with the reward. Similarly, in the negative pair, the picture associated with the punishment became associated with the neutral feedback and the other picture from the negative pair became associated with the punishment. Thus, the rules established during the acquisition phase (A) (a phase of 4 to 6 trials with the same rule, during which the participants learn the stimuli/feedback associations) are reversed during the reversal phase (R). After 4 to 6 correct responses (CR) for the same pair during the reversal phase, a new reversal occurred. The number of consecutive correct answers (4 to 6) necessary for the switch was randomized. The animal pictures were changed between odd and even blocks to prevent participants from learning the rule of the association between the animals and the feedback. The task was programmed using E-Prime software and synchronized with the fNIRS system using an external trigger sent by E-Prime to the fNIRS system to identify when each acquisition and reversal stimulus was presented to the subject.

Behavioral responses
Behavioral responses were labelled according to Remijnse et al. (2005) (Fig. 1). During the acquisition phase, the correct answers (ACR) and erroneous answers (AE) were differentiated. During the reversal phase, the correct answers (RCR) and reversal errors (RE) were also identified following each switch of the rules. Among all reversal errors (RE) following the switch, the last error just prior to learning the new rule (i.e. the first CR) was called the 'final reversal error (FRE)'. A false response after the first CR, labelled 'no link to the switch (RENS)', was not attributed to the switch because the participant was considered to have learned the new rule.
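The reversal criterion described in the Task section can be sketched in a few lines of Python. This is a minimal illustration, not the E-Prime implementation used in the study: the function name `reversal_points` and its parameters are hypothetical, and we assume that the criterion (4 to 6) is redrawn after every reversal and that an error resets the count of consecutive correct answers.

```python
import random

def reversal_points(correct_flags, rng=None, low=4, high=6):
    """Return the number of completed trials at which the rule for one pair reverses.

    correct_flags: sequence of booleans, True when the response was correct
    under the rule in force on that trial (hypothetical input format).
    """
    rng = rng or random.Random()
    criterion = rng.randint(low, high)   # consecutive correct answers needed (assumed redrawn each time)
    streak = 0
    reversals = []
    for i, ok in enumerate(correct_flags):
        streak = streak + 1 if ok else 0  # assumption: an error resets the streak
        if streak == criterion:
            reversals.append(i + 1)       # the new rule applies from the next trial
            streak = 0
            criterion = rng.randint(low, high)
    return reversals
```

For a participant who never errs, successive reversals are therefore always 4 to 6 trials apart.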
Data acquisition/analysis
fNIRS Recording
fNIRS signals were recorded with a portable continuous wave MEDELOPT® system (Seenel Imaging TM) using a 32-detector and 16-emitter design (Fig. 2A). Optodes covered the prefrontal, frontal, and parietal lobes, with an emitter-detector distance of 2.5 cm. The sampling rate was 2 Hz (500 ms per sample) and data were recorded at two wavelengths (660 and 850 nm). The position of the headgear was checked before and after the experiment; photos were taken to review placement, and optode positions were digitized for each subject using a 3D digitizer (NDI Medical Polaris Vega TM). No subject was excluded for an incorrect fNIRS sensor location. Optode positions (sources and detectors) were defined according to the EEG 10-10 system coordinates to standardize the headgear position among the participants. The lower edge of electrode positions was located over the frontal area, with detector 2 (D2) centered above the highest point of the eyebrow (Fp2) (Fig. 2B). The headgear covered the temporal area, with detector 27 (D27) above the C line (C1).
The sensitivity of various configurations was assessed using the AtlasViewer toolbox for Matlab to evaluate the best configuration of source and detector combinations for scanning the frontal and temporal lobes (Aasted et al., 2005) (Fig. 2A and 2B). The fNIRS sensitivity map (Fig. 2C) was modelled using AtlasViewer and freely available Monte-Carlo photon transport software (tMCimg routine, number of simulated photons = 10^6). The optimal sensitivity configuration of the fNIRS source and detector positions is presented in Fig. 2A.

Data analysis
The response of the participants associated with detection of the unexpected outcome for each trial following a switch of the parameters was assessed by the accuracy (the percentage of correct answers) and the reaction time (the time, in seconds, taken by the participant to choose an animal). Each presentation following a switch was numbered: the first presentation following a new rule was designated 1, the second presentation 2, etc., until the next switch of rules. The average accuracy and reaction time were then computed for each numbered presentation. The various parameters were compared between the first and following presentations for each condition to analyze the effect of repetition. The number of successive errors was also evaluated. These errors, called perseverative reversal errors, indicate that the former discrimination strategy, which is now obsolete, was still used, i.e. that the participant had not yet updated his/her strategy.
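As an illustration of the per-presentation summary described above, the following sketch numbers each trial by its position after the most recent switch and averages accuracy and reaction time per position. The trial format (dictionaries with `switch`, `correct`, and `rt` keys) and the function name are assumptions made for this example; they do not reflect the actual E-Prime output.

```python
from collections import defaultdict
from statistics import mean

def summarize_by_presentation(trials):
    """Average accuracy (%) and reaction time (s) per presentation number.

    trials: ordered list of dicts with hypothetical keys
      'switch'  - True on the first presentation under a new rule
      'correct' - True if the answer was correct
      'rt'      - reaction time in seconds
    """
    position = 0
    accuracy = defaultdict(list)
    reaction = defaultdict(list)
    for trial in trials:
        # Presentation 1 is the first trial after a switch, 2 the next, and so on.
        position = 1 if trial["switch"] else position + 1
        accuracy[position].append(1.0 if trial["correct"] else 0.0)
        reaction[position].append(trial["rt"])
    return {p: (100.0 * mean(accuracy[p]), mean(reaction[p])) for p in sorted(accuracy)}
```

The returned dictionary maps each presentation number to an (accuracy, reaction time) pair, which can then be compared between presentation 1 and the following presentations.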
Several studies have evaluated the effect of reward and punishment feedback on reversal learning by contrasting reversal learning with the acquisition phase (Remijnse et al., 2005; Budhani et al., 2007; Ghahremani et al., 2010; Xue et al., 2017). Hemodynamic analysis was performed according to Remijnse (2002). The reversal effect (i.e. unexpected outcome) was then characterized by contrasting the responses recorded during reversal learning with those recorded during the acquisition phase: (FRE + RE) − (AE + RENS) (Fig. 1). This contrast was computed here for the reward and punishment conditions. Contrasts were performed at the single-subject level.
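The contrast (FRE + RE) − (AE + RENS) can be written as a point-by-point operation on the averaged responses of the four event types. The sketch below assumes that each event type has already been averaged into a time course of equal length for one subject and one condition; the dictionary layout and the function name are illustrative only.

```python
def reversal_contrast(responses):
    """Compute (FRE + RE) - (AE + RENS) sample by sample.

    responses: dict mapping the labels 'FRE', 'RE', 'AE', 'RENS' to
    equally long lists of hemodynamic values (hypothetical layout).
    """
    return [
        (fre + re) - (ae + rens)
        for fre, re, ae, rens in zip(
            responses["FRE"], responses["RE"], responses["AE"], responses["RENS"]
        )
    ]
```

Applied once per subject and per condition (reward, punishment), this yields the single-subject contrasts described above.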
The Homer2 Matlab toolbox was used to analyze the fNIRS signal (Huppert et al., 2009). A band-pass filter between 0.03 and 0.1 Hz was applied to eliminate physiological noise (very low-frequency oscillations, respiration, and heartbeat). A window from −5 to 30 s around the onset of the stimulation (t = 0 s) was used to analyze the hemodynamic response to the RLT (Buxton, 2001). The fixation-cross interstimulus interval throughout the whole task served as the implicit baseline to allow the estimation of the hemodynamic response function by simple averaging (Costantini et al., 2013).
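The epoching and averaging step can be illustrated as follows. This is a simplified Python sketch rather than the Homer2 routine: it assumes an already band-pass-filtered signal sampled at 2 Hz, uses hypothetical function names, skips events whose window would extend beyond the recording, and computes a plain (unweighted) mean across epochs.

```python
FS = 2.0  # sampling rate in Hz, as reported for the recording

def extract_epochs(signal, onsets_s, t_min=-5.0, t_max=30.0, fs=FS):
    """Cut windows from t_min to t_max seconds around each stimulus onset."""
    pre = int(round(-t_min * fs))    # samples before onset (10 at 2 Hz)
    post = int(round(t_max * fs))    # samples after onset (60 at 2 Hz)
    epochs = []
    for onset in onsets_s:
        k = int(round(onset * fs))   # sample index of the onset
        if k - pre < 0 or k + post + 1 > len(signal):
            continue                 # assumption: incomplete windows are dropped
        epochs.append(signal[k - pre:k + post + 1])
    return epochs

def average_epochs(epochs):
    """Simple sample-by-sample mean across epochs."""
    return [sum(column) / len(column) for column in zip(*epochs)]
```

At 2 Hz, each epoch contains 71 samples, from −5 s to 30 s inclusive.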
The weighted arithmetic average cerebral hemodynamic response was then computed for each type of stimulation for all combined blocks and each subject. The data were z-score normalized for each subject to homogenize the data according to the individual characteristics. A baseline correction ([-5, 0] s) was finally applied to the normalized data.
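The normalization and baseline correction can be sketched as below. The exact z-score convention is not specified in the text, so this example assumes the population standard deviation computed over the whole averaged response, and assumes that the baseline interval [-5, 0] s includes the sample at t = 0; both function names are hypothetical.

```python
from statistics import mean, pstdev

def zscore(trace):
    """Z-score a response (assumption: population SD over the whole trace)."""
    m, s = mean(trace), pstdev(trace)
    if s == 0:
        raise ValueError("constant trace cannot be z-scored")
    return [(v - m) / s for v in trace]

def baseline_correct(trace, fs=2.0, t_min=-5.0):
    """Subtract the mean of the [-5, 0] s interval from every sample.

    The trace is assumed to start at t_min; the t = 0 sample is included
    in the baseline (an assumption of this sketch).
    """
    n_base = int(round(-t_min * fs)) + 1   # 11 samples at 2 Hz
    baseline = mean(trace[:n_base])
    return [v - baseline for v in trace]
```

After both steps, the mean of the baseline interval is zero and amplitudes are expressed in standard-deviation units of the subject's own response.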

Region of interest (ROI)
We associated each channel with a cortical ROI using the AtlasViewer toolbox (Aasted et al., 2005) based on the digitized coordinates recorded with a 3D digitizer (Polaris Vega). Based on the MNI coordinates of each optode, our setup covered seven ROIs: the right superior prefrontal cortex (r-PFC), right inferior frontal gyrus (r-IFG), right dorsolateral prefrontal cortex (r-DLPFC), left dorsolateral prefrontal cortex (l-DLPFC), left superior prefrontal cortex (l-PFC), left inferior frontal gyrus (l-IFG), and orbitofrontal cortex (OFC) (Fig. 2D). Channels with an inter-optode distance < 5 cm were then selected (León-Carrión and León-Domínguez, 2012). Several studies have previously tested various inter-optode distances to optimize the sensitivity of the signal, given the dispersion properties of the photons. They observed that the relative contribution of extracranial tissue decreased as the inter-optode distance increased. Thus, we focused on an inter-optode distance < 5 cm (Smielewski, 1995).
The cerebral hemodynamic response per ROI was computed for each subject by averaging the responses of the channels covering the same ROI. Finally, we computed the grand average per ROI by averaging the cerebral hemodynamic responses of all subjects for the same ROI. The shape and peaks of the curves were analyzed to compare the cerebral hemodynamic responses between ROIs under each condition. Three other parameters were also analyzed for various subperiods from 0 to 30 s (every 5 and 10 s): the average, slope, and area under the curve (AUC).
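The two averaging levels (channels within an ROI, then subjects within an ROI) can be sketched as follows. The channel names, the channel-to-ROI mapping, and the function names are placeholders for illustration; the actual mapping was derived with AtlasViewer from the digitized optode positions.

```python
def mean_trace(traces):
    """Sample-by-sample mean of equally long time courses."""
    return [sum(column) / len(column) for column in zip(*traces)]

def roi_responses(channel_traces, channel_to_roi):
    """Average the channels of one subject that cover the same ROI."""
    grouped = {}
    for channel, trace in channel_traces.items():
        grouped.setdefault(channel_to_roi[channel], []).append(trace)
    return {roi: mean_trace(traces) for roi, traces in grouped.items()}

def grand_average(subject_rois, roi):
    """Average one ROI across all subjects that have data for it."""
    return mean_trace([s[roi] for s in subject_rois if roi in s])
```

Each subject contributes one time course per ROI, so every subject carries the same weight in the grand average regardless of how many channels covered that ROI.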
Thus, the 'average' of the Hb values (HbO, HbR, or HbT) was computed for each timepoint of each subperiod. The dynamics of the responses were characterized by calculating the 'slope' for each considered subperiod between t0 and the first maximum absolute amplitude (i.e. either peaks or valleys). We also examined the slope coefficient, which is indicative of the magnitude (and the direction) of the oxygenation responses over the stimulation period. Thus, a higher (and positive) slope value for HbO is associated with greater and faster cortical activation (Mandrick et al., 2013). The cerebral hemodynamic response over each region was characterized by computing the power of the activation, that is, the cumulative sum of each point of the average (AUC) from the beginning to the end of the considered subperiod. Finally, the temporal evolution of the activation was visualized by projecting the cerebral hemodynamic response on the cortex for the selected channels using the
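A possible reading of the three subperiod parameters (average, slope, AUC) is sketched below in Python. Several details are assumptions of this example rather than statements from the study: the trace starts at −5 s and is sampled at 2 Hz, the "first maximum absolute amplitude" is taken as the first sample with the largest absolute value inside the subperiod, the slope is the straight-line rate of change from the start of the subperiod to that sample, and the AUC is the plain sum of the samples.

```python
def subperiod_metrics(trace, t_start, t_end, fs=2.0, t_first=-5.0):
    """Return the average, slope and AUC of a response over [t_start, t_end] s."""
    i0 = int(round((t_start - t_first) * fs))
    i1 = int(round((t_end - t_first) * fs))
    segment = trace[i0:i1 + 1]

    average = sum(segment) / len(segment)
    auc = sum(segment)  # cumulative sum of the points (assumed unscaled by dt)

    # First sample with the largest absolute amplitude (peak or valley).
    peak = max(range(len(segment)), key=lambda i: abs(segment[i]))
    if peak == 0:
        slope = 0.0     # the extremum is the starting point itself
    else:
        slope = (segment[peak] - segment[0]) / (peak / fs)  # units per second

    return {"average": average, "slope": slope, "auc": auc}
```

A positive slope together with a large AUC for HbO then corresponds to the "greater and faster" activation pattern discussed in the text.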

Behavioral analysis
The behavioral data for each repetition (1-8) of pairs on the screen during the acquisition and reversal phases for both the reward and punishment conditions are presented in Fig. 3. We used t-tests to compare the following accuracy and reaction time parameters. During the acquisition phase (Fig. 3A, left panel), between the first and following presentations (2-8), the accuracy increased non-significantly for the reward (accuracy, presentation 1 vs 2-8: +17%, p > 0.1) and punishment (accuracy, presentation 1 vs 2-8: +29%, p = 0.05) conditions. We observed no significant differences for the averaged accuracy between the punishment and reward conditions (p > 0.1).
The reaction time (RT) (Fig. 3B, left panel) decreased significantly between the first and second presentations under both the punishment (RT, presentation 1 vs 2: -0.4 s, p < 0.05) and reward (RT, presentation 1 vs 2: -0.5 s, p < 0.01) conditions. We observed no significant differences between the two conditions (p > 0.1) for the averaged RT.
During the reversal phase (Fig. 3A, right panel), between the first and following presentations (2-8), the accuracy increased to 69% for the punishment (presentation 1 vs 2-8, p < 0.001) and 57% for the reward (presentation 1 vs 2-8, p < 0.001) conditions, suggesting that the participants did not predict the reversal. More precisely, after the first presentation, the subjects were able to immediately reconfigure the stimulus-reward association upon the second presentation [accuracy, presentation 1 vs 2: +46% for the reward (p = 0.002) and +66% for the punishment (p < 0.001) conditions, respectively]. The differences for the averaged accuracy between the reward and punishment conditions (p < 0.01) were significant for all presentations. For RT (Fig. 3B, right panel), we observed a non-significant change between presentation 1 and presentation 2 for the punishment (-0.06 s) and reward (-0.08 s) conditions (p > 0.05). Comparison of the averaged reaction time for all presentations showed the difference between the two conditions to be significant (p < 0.001).
Finally, the increase in the number of perseverative errors (Fig. 3C) between the reward and punishment conditions was not significant (p = 0.07) but suggests that there may be strong cognitive costs when expressing the new associations under the reward condition. Based on the significantly lower accuracy, as well as the tendency towards an increase in the number of perseverative errors, these behavioral results suggest that the reversal affected the reward condition more negatively than the punishment condition.

Cortical hemodynamic response
We analyzed the detection of the unexpected outcome by computing the hemodynamic event contrast as described by Remijnse (2002): (FRE + RE) − (AE + RENS). Cortical activation from fNIRS signals for an expected outcome is characterized by a substantial increase in HbO, with a lower delayed decrease in HbR (Ferrari and Quaresima, 2012; Perrey et al., 2010).

Brain regions differentially involved in reversal under punishment and reward conditions
We determined which ROIs were activated under the two conditions by testing for significant activation relative to baseline ([-5, 0] s) using t-tests. We observed significant activation for the left hemisphere for both conditions (Fig. 4A). There was no significant hemodynamic response in the right hemisphere. Under the reward condition, l-DLPFC activation consisted of a hemodynamic response characterized by a significant increase in HbO (p < 0.036), with a latency to the peak of approximately 7 s (1.35 AU) and a slope coefficient of c = 0.13. The change in HbR was inverted, smaller, and nonsignificant. The l-IFG showed a significant HbO response (p < 0.045), with a similar pattern and a latency to the peak of approximately 3 s (1.3 AU, c = 0.21). There was a second peak at approximately 12 s (0.8 AU). Under the punishment condition, the hemodynamic response in the right hemisphere was not significant, but we observed a tendency towards a hemodynamic activation pattern. Thus, the hemodynamic responses of the r-DLPFC and r-IFG consisted of an increase in HbO (r
T-test comparison of the AUC for the 5-s subperiods to that of the baseline showed involvement of the l-DLPFC (p < 0.01 for t05−15) and l-IFG (p < 0.05 for t00−15) under the reward condition. Under the punishment condition, we also observed the involvement of the OFC (p < 0.05 for t00−15) and a tendency towards involvement of the r-DLPFC (p < 0.08 for t03−08) and r-IFG (p < 0.08 for t00−05 and t10−15).
We then evaluated these differences in activation between the two conditions (reward vs punishment) for the same ROI using a Student's t-test, taking into account the three parameters (slope, AUC, average) across the subperiods.
Statistical analysis (Fig. 4B) of the changes in HbO in the l-DLPFC showed a significant difference between conditions for the AUC (p < 0.02) and average (p < 0.02) for the period t10−15. There were also significant differences for HbT (AUC, p < 0.03; average, p < 0.04). We observed no significant differences in any parameters for HbR. For the r-DLPFC, we observed significant differences between conditions for HbO (slope, p < 0.03) (Fig. 4C) and HbT (slope, p < 0.03), but none for HbR or any other parameters.
We then performed a Fisher post-hoc analysis to define the ROIs and conditions that showed a significant effect. This analysis was first performed for the t05−15 period and, if no significant results were observed, it was performed for the t05−10 and t10−15 periods. For clarity, we present the results in three cross-tables, one for each parameter (slope, AUC, and average) (Fig. 5A). A cross-table summarizes the pairwise comparison of results from a multiple combination test. Thus, we determined whether there was a difference between the column ROI and the row ROI. Two ROIs that were significantly different are marked by one or several asterisks (*), the number of asterisks depending on the p-value (*: < 0.05, **: < 0.01, ***: < 0.001, and t for tendency: < 0.09). The ROIs were organized by condition (reward and punishment) and then hemisphere (right, left, and both for the OFC). For example, the HbO matrix (Fig. 5A) for the average parameter showed a significant difference between the right and left DLPFC under the reward condition and for three ROIs under the punishment condition: OFC and a tendency towards a significant difference for the r-DLPFC and r-IFG. These observations confirm the significant difference between the right and left DLPFC under reward conditions (average, p < 0.05; AUC, p < 0.05) and the OFC between the two conditions (AUC, p < 0.05; average, p < 0.07). There was also a significant difference for the slope parameters between the two conditions for the r-DLPFC and r-IFG (p < 0.05).
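The marker convention used in the cross-tables can be expressed as a small lookup. The function below is only an illustration of the thresholds listed above (the name `significance_label` is hypothetical); it does not perform the Fisher post-hoc test itself.

```python
def significance_label(p):
    """Map a p-value to the cross-table marker described in the text."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    if p < 0.09:
        return "t"   # tendency
    return ""        # not significant, cell left empty
```

For instance, a pairwise comparison with p = 0.07 would be marked as a tendency rather than with an asterisk.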
The statistical analysis is added to the HbO hemodynamic mapping in Fig. 5B. The post-hoc statistical analysis for the average parameters confirmed the first observations: DLPFC activation was significantly stronger (p < 0.04) in the left than in the right hemisphere at t00−10 under the reward condition. Comparison of the two conditions showed a tendency for the r-DLPFC to be less activated under the punishment condition than the reward condition at t00−10 (p < 0.065). This last result was also observed for the r-DLPFC under the punishment condition compared to the r-IFG under the reward condition (p < 0.08).
The fNIRS data showed significant results for the reward condition for the left hemisphere and involvement of the OFC and a tendency towards involvement of the right hemisphere under the punishment condition (Fig. 5C). The involvement of the l-IFG and l-DLPFC under the reward condition and the OFC under the punishment condition showed the usual hemodynamic pattern, characterized by an increase in [HbO] and a decrease in [HbR]. Over the right hemisphere, only a tendency (p < 0.08) towards a typical hemodynamic response was observed over the r-DLPFC and r-IFG. These observations were confirmed by statistical and post-hoc analysis.

Discussion
Previous studies on reversal learning used fMRI (O'Doherty et al., 2001; Cools et al., 2002; Remijnse et al., 2005; Ghahremani et al., 2010; Xue et al., 2013) or EEG (Sobotka et al., 1992). In this study, we investigated the cortical hemodynamic response to a reversal-learning task using fNIRS. fNIRS has better temporal resolution than fMRI (Cui et al., 2011) and better spatial resolution than EEG (Parasuraman and Caggiano, 2005). Thus, fNIRS emerges as an alternative tool for subjects who cannot undergo fMRI recordings.
Here, we reveal distinct neural substrates for reversal learning driven by reward and punishment using feedback of comparable magnitude, as proposed by Xue et al. (2013). The l-DLPFC and l-IFG were involved in the reversal process when receiving unexpected positive feedback (reward condition). Unexpected negative feedback (punishment condition) led to significant hemodynamic changes for the OFC, but only a tendency towards significance was observed for the r-DLPFC and r-IFG.
Behavioral data from our study show that all subjects understood the rules, as they succeeded in accumulating a positive amount of points. Under the reward condition, participants made more errors and were faster to respond (especially after the reversal trials) than under the punishment condition. For the HbO parameter, we observed (1) longer and higher significant activation (p < 0.02) for the l-DLPFC for t10−15 under the reward condition than under the punishment condition, (2) significantly greater activation in the l-DLPFC than r-DLPFC under the reward condition, (3) greater and prolonged activation in the OFC under the punishment condition, (4) significantly faster and a tendency for prolonged and greater activation of the r-DLPFC and greater and faster involvement of the r-IFG under punishment conditions.
Finally, we observed significantly faster involvement of the right than left IFG for t00−10 in terms of the HbR parameter under the punishment condition.

Contribution of the dorsolateral prefrontal cortex (DLPFC) in the reversal-learning task
The DLPFC has been examined in previous neuroimaging studies on RLT, either bilaterally (Chamberlain et al., 2008; Cools et al., 2002; Waegeman et al., 2014) or only in the left (Fellows and Farah, 2003) or right hemisphere (O'Doherty et al., 2001). Functional evidence has also shown the involvement of the OFC in reversal reward processing (Delgado et al., 2000; Cools et al., 2002; Izquierdo et al., 2017), more precisely the l-OFC (Sobotka et al., 1998; Bechara et al., 2005; Xue et al., 2009), whereas the r-OFC has been linked to punishment processing (Sobotka et al., 1992; O'Doherty et al., 2001). Concerning reinforcement learning, Xue et al. revealed distinct mechanisms underlying learning from positive and negative feedback (i.e. reward and punishment, respectively). They observed rightward asymmetry (right lateral OFC and DLPFC) in punishment processing, but nothing was observed for the reward condition. In that study, Xue et al. used different types of feedback (shock and money), whose magnitudes were not comparable between the two conditions. As suggested by Xue et al., we used comparable feedback (gain or loss of money) with an equal level of magnitude (+100$ vs. -100$) to investigate local asymmetries in the treatment of information in RLT specific to each condition (reward and punishment). The prefrontal cortex (PFC) regulates and monitors a number of "executive" cognitive functions (Weinberger, 1993), whereas the DLPFC is involved in various cognitive tasks: working memory (Belger et al., 1998), reasoning (Prabhakaran et al., 1997), changes in attention (Dias et al., 1996), and control (Hampshire and Owen, 2006). In accordance with these previous results, neurovascular coupling was observed over the l-DLPFC under the reward condition during the RLT.
However, as reported in other studies (O'Doherty et al., 2001; Cools et al., 2002; Xue et al., 2013), our results also suggest the possible involvement of the r-DLPFC under punishment conditions. These two observations support the involvement of the DLPFC in the detection of switching (Remijnse et al., 2005; Cools et al., 2002) and the updating of response-outcome relationships and flexible behavior (Ghahremani et al., 2010; O'Doherty et al., 2003). Moreover, its involvement is lateralized according to the condition: l-DLPFC for the reward condition and r-DLPFC for the punishment condition.

Contribution of the inferior frontal gyrus (IFG) in the reversal-learning task
The IFG is involved in inhibitory control (Kawashima et al., 1996; Konishi et al., 1998; Konishi et al., 1999; Garavan et al., 1999; de Zubicaray et al., 2000; Swick et al., 2008; Rygula et al., 2010). The IFG has also been shown to be involved in the RLT (Hampshire and Owen, 2006; Budhani et al., 2007), mostly in the right hemisphere (Cools et al., 2002; Ghahremani et al., 2010; Waegeman et al.). Consistent with these previous results, we observed significant neurovascular coupling for the l-IFG under the reward condition, whereas greater and faster involvement of the r-IFG was observed under the punishment condition. This result confirms the role of the IFG in inhibiting a well-learned association (Ghahremani et al., 2010; 2004; Rygula et al.) and also suggests lateralization according to the condition.

Contribution of the orbitofrontal cortex (OFC) in the reversal-learning task
We observed significant neurovascular coupling for the OFC under the punishment condition. O'Doherty et al. (2001) also measured an increase in lateral OFC activity following the subjects' receipt of punishment and deactivation following reward. Although our mapping did not allow differentiation of the subregions of the OFC, we confirm involvement of the OFC under punishment conditions. The OFC is involved in maintaining the current and expected motivational value of stimuli (O'Doherty and Dolan, 2006; Wallis, 2007) and motivation-related processes (Rothkirch et al., 2012; Spielberg et al., 2012). More precisely, the role of the OFC in reversal is to store the feedback association during reversal learning (Cai and Padoa-Schioppa, 2014; Keiflin et al., 2013; Moorman and Aston-Jones, 2014).
Numerous studies have argued that the right hemisphere plays a dominant role in experiencing unpleasant feelings, whereas the left hemisphere is essential for pleasant feelings (Davidson and Irwin, 1999; Deglin and Kinsbourne, 1996; Overskeid, 2000; Bechara and Damasio, 2005) and positive affects (Baxter et al., 1989; Davidson and Henriques, 2000; Harmon-Jones and Allen, 1997; Herrington et al., 2007). Moreover, such lateralization has also been observed during reversal processing for the OFC, with punishment processing lateralized to the r-OFC, whereas reward has been found to be associated with the left hemisphere (Sobotka et al., 1992; Xue et al., 2009). These results support our hypothesis of lateralization for the DLPFC, IFG and, perhaps, the OFC. According to previous studies, the OFC provides information concerning the value of the stimulus to the DLPFC (Szatkowska et al., 2008), which could then be used to select appropriate goals. This relationship is bidirectional (Spielberg et al., 2012). Indeed, the involvement of the DLPFC (l-DLPFC under the reward and r-DLPFC under the punishment condition) appears to facilitate the updating of response-outcome relationships and flexible behavior. Thus, the DLPFC may modulate value information stored in the OFC to be congruent with the current association (Spielberg et al., 2012; Hare et al., 2009).
Overall cortical network activity in the reversal-learning task
The involvement of the IFG (l-IFG for the reward condition and r-IFG for the punishment condition) was linked to the inhibition of a well-learned association and implementation of behavioral rules or strategies. It is likely that such inhibition co-exists with other cognitive functions required by the RLT (e.g., updating, shifting), making it difficult to establish which structures are involved in updating and inhibition processes. Dosenbach et al. (2007) were able to distinguish between two strongly inter-connected subnetworks that function in parallel. One involved the DLPFC, which is associated with top-down attentional control (Dosenbach et al., 2007), maintaining goals, and updating information (Wager and Smith, 2003), whereas the other involved the anterior cingulate cortex (ACC), which is involved in detecting conflicting responses and monitoring performance (Nelson et al., 2003; Seeley et al., 2007; Cole et al., 2013; Banich, 2019). In our study, involvement of the l-DLPFC is consistent with the involvement of this first sub-network. However, involvement of the ACC cannot be robustly investigated by fNIRS due to the low sensitivity of fNIRS for deep brain tissues.
The causality between the two sub-networks is not clear (Duann et al., 2009; Neubert et al., 2010; Swann et al., 2012; Spechler et al., 2016), but studies have also suggested involvement of the subthalamic nucleus (STN), which is linked to the motor inhibition loop (Aron and Poldrack, 2006; Schmidt et al., 2013). The motor inhibition loop involves, among other elements, a "longer" indirect pathway (DLPFC-caudate-IFG-supplementary motor area-STN-motor area) related to the implementation of proactive modulation (Aron, 2011; Bari and Robbins, 2013; Verbruggen and Logan, 2008; Cunillera et al., 2014; Tops et al., 2014; Cai et al., 2016). In addition, the involvement of the IFG in our study may be indicative of its participation in this inhibitory loop (Jonides and Nee, 2006; Swick et al., 2008). Consistent with this hypothesis, the results concerning the slope coefficient suggest longer and greater activation of the DLPFC, which may then recruit the IFG, which would participate in the initiation of a motor inhibition loop. Based on recent studies, the connection between the IFG and DLPFC can be bidirectional. First, the DLPFC recruits the IFG via an excitatory connection to initiate the longer indirect inhibitory pathway. Second, the IFG may inhibit DLPFC activity during reappraisal once the strategy process is updated (Banich et al., 2019; Morawetz et al., 2016). This second activation of the IFG may explain the second peak observed for the r-IFG under the punishment condition and the l-IFG under the reward condition.
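As an illustration of what a slope coefficient can capture in a hemodynamic time course, the following minimal sketch fits a straight line to concentration samples by ordinary least squares and returns its slope. This section does not describe how the slope coefficient was actually computed, so the least-squares fit, the function name `slope_coefficient`, and the sample values are assumptions made for illustration only.

```python
from typing import Sequence


def slope_coefficient(times: Sequence[float], values: Sequence[float]) -> float:
    """Least-squares slope of `values` against `times`.

    Hypothetical helper: the fitting method is an assumption,
    not the procedure reported in the study.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need two or more paired samples")
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    # Sum of squared deviations of the time axis.
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all time points are identical")
    # Sum of cross-deviations between time and signal.
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return sxy / sxx


# Made-up HbO samples (arbitrary units) at 1 s intervals.
example_times = [0.0, 1.0, 2.0, 3.0]
example_hbo = [0.0, 0.5, 1.0, 1.5]
example_slope = slope_coefficient(example_times, example_hbo)
```

Under this assumption, a larger positive slope over a given window would correspond to a faster rise of the signal in that window; comparing slopes across ROIs or conditions would still require the statistical testing used in the study.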
In summary, we used fNIRS to confirm the differential impact of reversal: the left hemisphere was involved under the reward-guided condition, characterized by early l-DLPFC activation followed by involvement of the l-IFG and OFC, whereas the right hemisphere (r-DLPFC and r-IFG) was involved under the punishment condition. This suggests feedback-specific processing during RLTs.
One limitation of this study was the small number of participants. Indeed, the present study was a preliminary analysis before taking into account participants who had undergone deep brain stimulation.

Conclusion
This study extends our understanding of the neural mechanisms involved in reversal learning and improves our comprehension of risky behaviors in vulnerable populations characterized by impaired flexibility. This approach demonstrates the possibility of using high-density optical imaging tools to study cognitive tasks and can provide insights into neurophysiological mechanisms and facilitate the translation of functional optical imaging into clinical applications, such as in OCD or Parkinson's disease.

Declarations
Ethics approval and consent to participate
The experiment was conducted in compliance with the Code of Ethics of the World Medical Association (Declaration of Helsinki) and the protocol of the study was approved by the local ethics committee and the Comité de la Protection des Personnes (NO II N°2013-A01297-38). Informed written consent was obtained from each subject before the experiment.

Consent for publication
The corresponding author, on behalf of all authors, allows the publication of this manuscript.

Availability of data and materials
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

Competing interests
MM and FW have consulting activities for Seenel Imaging.

Figures
Reversal-learning task. (A) Negative pair: antelope was the incorrect (-100 $) and cow the correct (+0 $) answer. (B) Positive pair: butterfly was the incorrect answer (+0 $), and duck was the correct answer (+100 $). The switch occurred after four to six correct answers (randomized) and the feedback was reversed. Thus, during the reversal phase, antelope became the correct answer (gain: 0 $) and cow the incorrect answer (gain: -100 $) for the negative pair (C) and duck became the incorrect answer (gain: 0 $) and butterfly the correct answer (gain: +100 $) for the positive pair (D). We labelled each trial for each condition. During the acquisition phase, we distinguished between correct (ACR) and incorrect responses (AE). After four to six correct responses, a reversal occurred, unknown to the subject. During the reversal phase, we classified the responses as reversal errors (RE), the final reversal error (FRE) before the first correct response (RCR), and the reversal errors after the first CR and thus not linked to the switch (RENS).
HbO (red) and HbR (blue) responses for both conditions are plotted for seven ROIs, of which the locations are presented over the right and left hemispheres. For clarity, the total-Hb (HbT) curve is not presented. Significant differences relative to baseline t-5-00 are indicated within their specific subperiods in red for HbO and blue for HbR. The contrast indicates the degree of significance (opaque: p < 0.01,
(A) Fisher LSD post hoc analysis for HbO for t05-15. The figure is composed of three cross-tables, one for each parameter. Significant differences between two ROIs (row vs. column) are symbolized by one or several asterisks (*: p < 0.05; **: p < 0.01; ***: p < 0.001), and t indicates a tendency towards significance, with a p-value < 0.09. Values are negative for right < left under the reward condition. Values are positive for reward > punishment when comparing the reward to punishment condition. (B) Hemodynamic average difference, in post hoc analysis for HbO between the reward and punishment conditions. The difference is indicated by a red line with a marker: * indicates a p-value < 0.05 and t indicates a tendency towards significance, with a p-value < 0.09. Area legend: 1: l-PFC, 2: l-DLPFC, 3: l-IFG, 4: OFC, 5: r-PFC, 6: r-DLPFC, 7: r-IFG. (C) Summary of the fNIRS results.
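The trial labels given in the reversal-learning task caption (ACR, AE, RE, FRE, RCR, RENS) can be sketched as a small labelling routine. This is a minimal illustration, not the authors' analysis code. It assumes a single acquisition-then-reversal block in chronological order, labels every correct response in the reversal phase as RCR (the caption does not say whether RCR applies only to the first one), and assigns FRE only when a correct response actually follows. The function name `label_trials` and the tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple

# (phase, is_correct); phase is "acquisition" or "reversal".
# This representation is an assumption made for the sketch.
Trial = Tuple[str, bool]


def label_trials(trials: List[Trial]) -> List[str]:
    """Label each trial of one acquisition-then-reversal block."""
    # Position of the first correct response in the reversal phase, if any.
    first_correct: Optional[int] = next(
        (i for i, (phase, ok) in enumerate(trials) if phase == "reversal" and ok),
        None,
    )
    labels: List[str] = []
    for i, (phase, ok) in enumerate(trials):
        if phase == "acquisition":
            labels.append("ACR" if ok else "AE")
        elif phase == "reversal":
            if ok:
                labels.append("RCR")
            elif first_correct is None or i < first_correct:
                labels.append("RE")  # error before any correct reversal response
            else:
                labels.append("RENS")  # error not linked to the switch
        else:
            raise ValueError(f"unknown phase: {phase!r}")
    # The error immediately preceding the first correct response is the FRE.
    if first_correct is not None and first_correct > 0:
        if labels[first_correct - 1] == "RE":
            labels[first_correct - 1] = "FRE"
    return labels


example_block: List[Trial] = [
    ("acquisition", False),
    ("acquisition", True),
    ("acquisition", True),
    ("acquisition", True),
    ("acquisition", True),
    ("reversal", False),
    ("reversal", False),
    ("reversal", True),
    ("reversal", False),
    ("reversal", True),
]
example_labels = label_trials(example_block)
```

For the made-up block above, the two errors that follow the switch are labelled RE and FRE, and the error that occurs after the first correct reversal response is labelled RENS.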

Supplementary Files
This is a list of supplementary files associated with this preprint. TableS1rvwd1.pdf