The Ethics Committee of the Kyoto University Unit for Advanced Study of Mind (30-P-8) approved the experimental protocol and procedures. All participants provided written informed consent prior to the study and were remunerated equally for their participation after each session.
We recruited 72 males from all SES groups: 36 men without children (CG: mean age = 31.1 ± 6.2 years) and 36 first-time expectant fathers (PG: mean age = 33.1 ± 5.7 years; mean GA of partner = 20.9 ± 6.2 weeks). Approximately four months later, the same men were invited for a second (late-pregnancy session) and a third (after-birth) session four months thereafter. The mean intervals (in days) between first and late-pregnancy sessions were 99.18 ± 36.24 and 125.56 ± 23.53 for PG and CG, respectively. The mean intervals between the first and after-birth sessions were 250.94 ± 38.25 and 250.00 ± 27.44 for PG and CG, respectively. In the PG, the mean interval between child birth and the third session was 119.7 ± 20.3 days. Data for one PG subject at session 1 was considered as late-pregnancy, because data were obtained when the subject’s partner GA was 39 weeks (i.e. the subject was missing the data for session 1; interval between recordings = 132 days). One subject in PG did not participate in the third session (after-birth). Moreover, we excluded one PG subject from all analyses due to the lack of a late-pregnancy session and participation in the after-birth session approximately nine months after birth (250 days), which was considerably later than the rest of the fathers. In the CG, one subject was excluded from all analyses due to the inability to complete the fMRI recording session. Another subject was excluded due to only completing the first session of recording. Finally, data for one subject in the CG during the third session could not be recorded due to machine error. Moreover, we discarded the fMRI data of one PG subject (first session) and one CG subject (second session) due to high movement artefacts. The final sample is as follows: for the first session (early-pregnancy), PG = 33, CG = 34; for the second session (late-pregnancy), PG = 35, CG = 33; and for the third session (after-birth), PG = 34, CG = 33.
For the clustering analysis (see subsection “Exploratory analysis of the individual differences in the paternal brain” below), only subjects with three data points were considered. For this analysis, the subject number is 32 subjects for PG, and 32 subjects for CG.
All subjects from the PG and six from the CG were married or living with partners. Their marital status remained the same across the duration of the project. All subjects were of East Asian ethnicity (Japanese: n = 70; Korean: n = 2), had normal or corrected-to-normal vision and were all right-handed. No participants reported any major medical illnesses, major psychiatric disorders or neurological illnesses.
For each recording session, we collected the following demographic data: (a) socioeconomic status (SES), including household income, educational background and history and number of family members in the household and (b) weekly work/study time.
Additionally, we collected the following psychological characteristics: (c) State–Trait Anxiety Index using the State–Trait Anxiety Inventory (STAI-Trait47) and (d) depression index using Beck’s Depression Inventory (BDI)48. These two psychological traits have been implicated in the heterogeneity of parenting behaviours in mothers and fathers49, as demonstrated by a decrease in positive parenting style and an increase in negative parenting style, respectively50.
Furthermore, we considered the following behavioural data related to family and parenting: (e) positive and (f) negative attitudes towards parenting37, including 12 items for a positive image (e.g. parenting involves pleasure) and 15 items for a negative image (e.g. parenting is difficult). Each item was scored using a five-point Likert scale. Next, we measured (g) partner relationships using the Dyadic Adjustment Scale51 (DAS), which was only answered by married participants and those living with romantic partners with 32 items for assessing marital relationship quality. (h) Foetal–paternal attachment (The Paternal Antenatal Attachment Scale52 was used only for PG subjects) was used to measure the strength of attachment using 16 items. Each item was rated using a five-point Liker-type scale. The scales from (e) to (h) provided scores consistent with those of previous studies. Ceiling/floor effects were observed for several items on the (e), (f) and (h) scales, which were defined by whether the mean score of ±1 standard deviation exceeded the range of possible values. Items indicating a ceiling/floor effect were excluded, and mean scores were used for analysis. To compare the prenatal foetal–paternal attachment and postnatal infant attachment, (h) scores were normalised to the 0–1 range. For the (g) scale (DAS), two items in the scale were rated as 0 or 1, which is in contrast with the rest (i.e. using a five-, six- or seven-point Likert-type scale). Thus, these two items (i.e. ‘feeling of exhaustion when thinking about intercourse with partner’ and ‘expressions of affection’) were excluded from the overall DAS score to avoid uneven effects.
Finally, we performed additional surveys only in postpartum fathers to determine paternal behaviour. We measured fathers’ development and postnatal infant attachment (normalised to the 0–1 range), paternal involvement in caregiving, time together and times at papa school. Fathers’ development, which is based on Morishita53 (2006), measures psychological changes in men through the experiences of parenting (e.g. considering the feelings of parents with children) with 20-items rated using a five-point Likert-type scale. The postnatal infant attachment scale54 was used to measure the strength of attachment with 19 items, which were rated using a five-point Likert-type scale. Paternal involvement in caregiving was developed based on Abraham et al. (2014)12, which measures the frequency of fathers’ performance of 15 daily parenting activities (e.g. bathing children, changing diapers and feeding babies). Each item was rated using a five-point Likert scale. We also recorded the time that the participants spent alone with infants in hours per week and the number of times the participants attended parent–child classes at hospitals or municipalities during their partners’ pregnancy. These frequencies were respectively designated as time together and times at papa school. Moreover, we measured the degree of household chores (e.g. cooking meals, doing the laundry and cleaning the house) performed, which was developed based on Volling & Belsky (1992)55 and Edelstein et al. (2017)24. Eight items were scored on five-point Likert-type scale ranging from 1 (always the wife) to 5 (always the husband).
Table 1 provides a summary of the behavioural data. Several behavioural traits displayed group differences, such as trait anxiety (F = 11.8, p = 0.01, df = 1), weekly worktime (F = 115.7. p < 0.001, df = 1), household income (F = 133.8, p < 0.001, df = 1) and positive attitude towards parenting (F = 24.7, p < 0.001, df = 1). Despite these group differences, controlling for any of the covariates did not significantly change the differences in neural activation across the abovementioned sessions. Several traits exhibited a large correlation with one another (Fig S1). To reduce the number of tests performed across the study and to avoid confusion in our conclusions due to similarity among scores, we excluded a few of the measurements. In particular, fathers’ development displayed a high correlation with several other parent-related traits, such as postnatal infant attachment and positive attitude towards parenting and, to a lesser extent, time together and parental involvement. For this reason, we excluded this trait from analyses, as its inclusion would not provide any useful insights for the study objectives. The depression score, BDI, pointed to a large correlation with trait anxiety. Thus, it was also excluded from the analysis. State anxiety was likewise excluded, because the measurement itself provides no relevant information regarding parenting. Instead, it reflects a number of factors, such as post-recording stress or anxiety and may disrupt our analysis.
Data collection and analysis for hormonal levels
The subjects rinsed their mouths prior to saliva collection and chewed on an oral cotton swab after 10 min (Salimetrics, State College, PA, USA), which was placed sublingually for 3 min. This process was repeated to obtain a backup sample. The first samples were used for the calculation of the hormonal levels of each subject. If the first sample contained insufficient saliva specimen or if the hormonal levels of the first sample were outliers, then the value of the second sample was used as the individual value. The samples were stored at −80°C until assayed. Salivary oxytocin was assayed using a commercial kit (ADI-900-153A-0001; ENZO Co. Ltd., Tokyo, Japan) following the manufacturer’s protocol. In summary, 240 µL of saliva was dried using a Speedvac evaporator at room temperature for 3 h and reconstituted in 240 µL of an assay buffer out of which 100 µL was used for the assay. Salivary testosterone was measured by ELISA following our previous reports56,57. Furthermore, 25 µL of saliva was used in the assay using testosterone-3-CMO-HRP (FKA101; COSMO Bio, Tokyo, Japan) and a specific anti-testosterone serum (FKA102-E; COSMO Bio). All intra- and inter-assay coefficients of variation were less than 15%. We could not detect oxytocin from the samples of two subjects from the PG and two subjects from the CG. As such, these subjects were excluded from oxytocin analysis.
For the analysis of hormonal data, we addressed three nuisance effects, namely, time of day at measurement, seasonal changes and storage time. The first two refer to the fact that levels of oxytocin and testosterone vary across the day and across seasons. To minimise the first confounding effect, all experiments were scheduled at morning between 8:45 AM and 10:45 AM. However, minimising the seasonal effect at the experimental level is complex. The final nuisance effect, which may influence data quality, is the time of storage of the saliva samples before processing. Although we intended to analyse the data as soon as possible, we cannot overlook the possibility that a certain level of nuisance effect was included. Thus, to remove the three nuisance effects, we created individual linear models for each hormone using the three parameters and the raw hormonal measurement. Lastly, we obtained the residuals of the model. Across this work, the residuals were used as the values for the hormones instead of raw data.
For PG and CG from the first session to the second and third sessions, we observed a significant decline in oxytocin levels (F = 9.15, p = 0.002) and a marginally significant increase in testosterone levels (F = 5.21, p = 0.09). However, when residualising according to the time of day at acquisition, day of the year and time of storage before processing, these effects disappeared (session effect on testosterone: F = 2.51, p = 1) or became weaker (session effect on oxytocin: F = 7.5, p = 0.01).
We followed the protocol outlined in Diaz-Rojas et al. (2021)21 for fMRI data acquisition. In summary, inside the MRI scanner, subjects were shown silent videos of a male model performing actions from the first-person perspective with a duration of 30 s (Fig 1) and were instructed to observe the movements of the model’s hands. The videos were two infant-interaction videos (S1: playing with an infant; S2: changing diapers) and two control videos without infant interaction (C1: opening a box and removing a tripod from it; C2: wrapping a box with plastic). The male model and the infant were a father–child dyad of East Asian ethnicity and were unrelated to any of the participants. The control videos were performed by the same male model and were done to roughly match the movements in the corresponding infant-interaction videos. In the infant-interaction videos, the frame was set to exclude the infant’s eyes, because previous studies demonstrated that the facial features of infants may evoke different responses in human males (i.e. own- vs. other-child58,59 due to the increased response due to facial resemblance60,61). Therefore, we opted to minimise the confounding effects in our subjects. The control videos were then framed accordingly. The experiment followed a block design, where each video was presented only once per run followed by 30 s of rest (static grey screen). The order of presentation was pseudo-randomised with the caveat that the infant interaction and its respective control video were always presented consecutively of each other (for example, one run would be S1–C1–S2–C2, whereas the other was S2–C2–C1–S1). Each subject completed two runs of this task. Inside the scanner, the subjects also performed an auditory task immediately after the two video runs. Data for this task were used for another experiment and were, thus, excluded from the present study.
Emotional rating of stimulus
At the end of Session 3, all participants rated each stimulus using a seven-point Likert-type scale for two attributes, namely, emotional valence and arousal. The items for emotional valence range from strong feelings of pleasantness to strong feelings of unpleasantness, whereas those for arousal range from very excited to very calm.
fMRI data acquisition and processing
Neuroimaging was performed using a 3T MRI MAGNETOM Verio (Siemens Healthcare, Erlangen, Germany). Functional T2*-weighted images were collected using a gradient-Echo-Planar Imaging sequence with the following parameters: repetition time = 3000 ms, echo time = 30 ms, field of view = 192 mm, matrix size = 64 × 64, flip angle = 90°, voxel size = 3 × 3 × 3 mm; slice gap = 0; number of slices = 46 axial slices; slice order = interleaved. High-resolution structural T1-weighted images were collected using a three-dimensional magnetisation-prepared rapid acquisition with a gradient echo sequence using the following parameters: repetition time = 2250 ms, echo time = 3.51 ms, field of view = 256 mm, matrix size = 256 × 256, flip angle = 90°, voxel size = 1 × 1 × 1 mm, slice gap = 0; number of slices = 208 axial slices; slice order = interleaved.
Preprocessing of fMRI data was conducted using MATLAB R2018b (MathWorks, Natick, USA) and the SPM software package (SPM12 v7487, https://www.fil.ion.ucl.ac.uk/spm/). The first eight volumes for each run were discarded to allow for signal stabilisation. Functional images were corrected for acquisition time, resliced, realigned and normalised to match the MNI template brain and were spatially smoothed using a Gaussian kernel with a full width at half maximum (FWHM) of 4 mm. Subjects with movement artefacts greater than 3 mm within the runs were discarded (N = 2; one PG subject for first session and one CG subject for second session).
We modelled the response of each subject to the stimuli using a general linear model in which the stimulus blocks were defined as predictors and convolved with the standardised model of hemodynamic response function. For analysis, we defined three contrasts, namely, S1–C1, S2–C2 and the overall S–C (a combination of S1 and S2 and C1 and C2). However, to avoid the unnecessary increase in the number of tests, we focused mainly on the S–C contrast and used the two other contrasts for exploratory analyses.
fMRI whole-brain analysis
Whole-brain analysis for the interactions between group and session effects were measured using a single 2 × 3 flexible factorial model (group × session), in which each participant served as a random effect. Data were drawn from the contrast map for S–C for each session and for all subjects. To verify the cross-sectional and longitudinal effects as well as intra and inter-group effects, we performed the following analysis using individual t- and F-tests.
For each session (session 2 and 3), we conducted t-tests for the null hypothesis that the differences observed in a given voxel cluster between the PG and CG would be negligible. Two t-tests were conducted for each session, one for the possibility that PG > CG and one for the opposite.
Intragroup longitudinal analysis
Intragroup longitudinal comparisons (i.e. session effects) were conducted using an F-test for the overall session effect for each group. We also conducted t-tests for each pair of sessions (e.g., PG session 3 data vs session 1 data) for the null hypothesis that the differences observed in a given voxel cluster were negligible. Additionally, to test the overall development from pregnancy to postpartum, we added another t-test for the session 3 versus the average of session 1 & 2, for both groups.
Inter-group longitudinal analysis
Inter-group longitudinal comparisons (i.e. interaction group-session effects) were conducted using an F-test to determine the overall group-session effect. We also conducted t-tests between the two groups for each pair of sessions (e.g., PG session 3 data vs session 1 data > CG session 3 data vs session 1 data) for the null hypothesis that the differences observed in a given voxel cluster were negligible. Similar to intragroup analysis, we also tested the overall development from pregnancy to postpartum in the PG compared with the CG by adding an additional t-test for the session 3 versus the average of session 1 & 2 between the two groups.
Whole-brain regression analysis
To determine whether the observed neural activation in the PG and CG were related to any of the behavioural or hormonal measurements (covariates) obtained from the participants, we used whole-brain individual regression models with the S–C contrast of each group and session against each of the covariates. We used significance thresholds similar to those for the abovementioned whole-brain analysis. Diaz-Rojas et al. (2021) presented the data for session 1; thus, we did not report these data in the current paper.
Statistics for whole-brain analysis
For all whole-brain analyses, statistical maps were assessed at (a) a family-wise error (FWE)-corrected threshold of p < 0.05 at the voxel level or (b) an uncorrected threshold of p < 0.001 at the voxel level (i.e. cluster-forming threshold), and clusters were considered significant if they passed a cluster level threshold of p < 0.05 after FWE correction. Using the WFU Pick Atlas toolbox for MATLAB62,63 and the AAL atlas of the human brain64, we matched the surviving clusters to known anatomical areas (Supplementary Table S1).
MVPA was implemented using libsvm65. We first extracted the non-averaged and non-spatially smoothed voxel data for each subject for the S–C contrast from the left dmPFC. We focused on this ROI, because this area displayed the most consistent changes from pregnancy to postpartum. This area was anatomically defined using the left frontal superior medial area of the AAL atlas. Due to the large number of features (i.e. voxels) in this anatomical area, we performed dimensionality reduction to improve the performance of the analysis (Mahmoudi et al., 2012). Dimensionality reduction was conducted using the one-way ANOVA F-values ranking of each voxel, with the target labels (group or session) as the levels, selecting the top 125 voxels (equivalent to a 5x5x5 voxels cube) with the highest F-values. Using this data, we trained a support-vector machine (SVM) classifier with a linear kernel to perform a binary classification of a target label. These labels were of two kinds: session (early-pregnancy, late-pregnancy or postpartum), and paternity status (same as group, either PG or CG). To validate the model, we performed leave-one-out cross-validation. In other words, we trained a model using voxel data from all subjects included in the analysis except for one and used the excluded subject as test data. This procedure was repeated for each subject until all subjects served once as test data. Then, we averaged the prediction accuracy for each of the test data to obtain the mean prediction accuracy of the classifier for that analysis. We also displayed the 95% confidence interval, which was calculated using the normal approximation method for the binomial confidence interval. High accuracy implied that the model could separate multivoxel data between subjects belonging to one group/session from the other. In this manner, we established a relationship between neural activation patterns and the target label. Significance was assessed with an upper-tailed binomial test for the null hypothesis that the resultant classification accuracy emerged from a binomial distribution with a mean equal to the chance level (50%). Additionally, we calculated the true positive rate (TPR) and true negative rate (TNR) for each ROI–target label pair.
Exploratory analysis of the individual differences in the paternal brain
Clustering analysis was conducted using a k-means algorithm and the squared Euclidean distance as the distance metric. We set the number of groups as two. We intended to group the subjects based on the development of the paternal brain across pregnancy. Thus, we selected the percentage change between adjacent sessions as data for clustering (i.e. change from session 1 to 2, and from session 2 to 3). Next, we took the absolute values of these percentage change. Negative or positive changes cannot be easily interpreted as improvements or regressions in the paternal brain (for example, a negative change could be associated with decreased response due to familiarity with infant-interaction schemes or to perceived non-familiarity to the shown infant, which triggers own-child/other-child effects). Thus, we focused solely on change as a metric for the development of the parental brain. However, measuring the percentage change may frequently lead to extremely large values, because fMRI data may have very small values. To account for this phenomenon, we took the base 10 logarithm of the absolute percentage change, and then clipped to the outlier threshold any outlier values (defined as those values that are more than three scaled median absolute deviations away from the median). We focused on the neural data from the left dmPFC, which was the brain region that exhibited greater changes from pregnancy to postpartum. This analysis was conducted only with data from the PG.
The k-means algorithm is sensitive to initial conditions and to the initial value of the random number generator. To combat this issue and validate the findings, we first performed the following procedure. We excluded four random subjects from the set (12.5% of the total set) and performed k-means classification with the remaining data. We noted the cluster centres and repeated this procedure 10,000 times. Finally, we averaged the cluster centres over the iterations and performed one final clustering using these centres and the entire data set. For the final clustering, each data point was set as belonging to the group that it is closer to using the square Euclidean distance and without updating the position of the cluster centres, because this process is iteratively conducted in the k-means algorithm.
After dividing the subjects into two groups, we calculated for the mean of each behavioural and hormonal covariate at each time point. Statistical analysis was performed using two-way ANOVA. Supplementary Table S3 presents the results for all tests. Behavioural data that remained stable across sessions (i.e. age, years of education, weekly worktime and household income) were taken from session 3.