Same or Different? Perceptual Learning for Connected Speech Induced by Brief and Longer Experiences

Perceptual learning, defined as long-lasting changes in the ability to extract information from the environment, occurs following either brief exposure or prolonged practice. Whether these two types of experience yield qualitatively distinct patterns of learning is not clear. We used a time-compressed speech task to assess perceptual learning following either rapid exposure or additional training. We report that both experiences yielded robust and long-lasting learning. Individual differences in rapid learning explained unique variance in performance in independent speech tasks (natural-fast speech and speech-in-noise), with no additional contribution of training-induced learning (Experiment 1). Finally, it seems that similar factors influence the specificity of the two types of learning (Experiments 1 and 2). We suggest that rapid learning is key for understanding the role of perceptual learning in speech recognition under adverse conditions, whereas longer learning could serve to strengthen and stabilize learning.


Introduction
Connected speech recognition under adverse conditions (e.g., distortion, background noise) [1] improves substantially following both brief experiences and prolonged practice [2][3][4][5][6][7][8][9]. These improvements reflect perceptual learning, defined as relatively long-lasting changes in the ability to extract information from the environment following experience or practice [10,11]. An open question is whether the learning that follows brief experiences (rapid learning) and the learning that emerges with more intensive training (training-induced learning) reflect the same type of learning. Whereas some view only training-induced learning as true perceptual learning and refer to rapid learning as procedural or task learning, others view rapid learning as perceptual as well, because it shares some characteristics with training-induced learning [12,13]. In the case of speech, the literature portrays a complex picture. Both rapid and training-induced learning of speech stimuli are usually considered perceptual [e.g., 1], but the degree to which they share characteristics like stimulus specificity and contribute to dynamic speech perception is not unanimously agreed on. Furthermore, rapid and training-induced learning were typically studied in different studies, with different training methods, stimuli and learning tests, making it hard to compare outcomes across studies.
Understanding the similarities and differences between rapid and training-induced learning has important implications for the role of perceptual learning in speech perception under challenging conditions. Specifically, if rapid learning is both generalizable and long-lasting, brief experiences or short training episodes could gradually re-shape the perception of distorted speech, leading to a general improvement in perception as a function of experience. On the other hand, if rapid learning is as stimulus specific as training-induced learning, past learning of either type is unlikely to shape future speech perception, because future conditions are unlikely to be an exact replication of the past. Rather, rapid learning could support perception in challenging conditions online, by allowing listeners to quickly adapt to the acoustic characteristics of the current situation [14,15]. Here we focus on rapid learning of distorted (time-compressed) speech. In Experiment 1 we compared learning and retention between rapid and training-induced learning. We also compared how the two types of learning relate to perception in independent challenging speech tasks. In Experiment 2 we compared four protocols of rapid learning to determine whether rapid learning is as stimulus specific as found in previous studies on training-induced learning [16].

The Potential Role of Perceptual Learning in Speech Recognition
Theories of both perceptual learning [17] and speech processing [18,19] suggest that encounters with speech input trigger an implicit and largely automatic process which attempts to match this input to long-held representations. However, in daily listening situations inputs do not always match long-term representations (e.g., due to noise or accent), and the automatic matching process can therefore fail. According to the Reverse Hierarchy Theory [RHT, 17], such failure can trigger a learning process that gradually allows listeners to resolve finer-grained acoustic details and helps them recognize previously unrecognizable input. However, because learning is triggered by a specific input, learning is at least partially specific to the acoustics of that input [7,17,18]. This specificity probably constrains the role of learning in complex communication environments. One option is that intensive experience is required to yield learning that supports speech recognition. However, training-induced learning of challenging speech is often quite specific to the trained stimuli [20][21][22][23]. Therefore, it can support future speech perception only to the extent that newly encountered situations replicate the conditions encountered in training, which is unlikely. Consequently, intensive training studies are not a good analogue for real-life conditions, in which a practice period is unlikely and the acoustics can change rapidly (e.g., in a multi-talker conversation). Consistent with this view, training in groups of listeners who need it most (e.g., due to hearing impairment) often fails to yield quantifiable benefits in any untrained conditions, despite good learning on the trained ones [24][25][26]. Studies on learning new speech categories [e.g., 27,28] are also not a good approximation of daily environments because they usually do not use connected speech.
Another potential role of perceptual learning, which we pursue here, is based on rapid learning: if learning occurs rapidly, it could serve as a skill listeners can recruit whenever they encounter new acoustic challenges. Accordingly, specific learning could afford optimal adaptation to the particulars of a new acoustic challenge without more general and undesirable changes in speech perception. Rapid learning studies are more representative of real-world challenges than training studies, because they often include little stimulus repetition and connected speech materials [4,5,[29][30][31][32][33]. Therefore, this account is more ecological than accounts based on the generalization of past learning. Consistent with the idea that perceptual learning is a general resource, recent findings show that learning is correlated across different tasks and even across modalities [34][35][36].

Rapid and Training-Induced Learning of Distorted Speech
Direct comparisons of learning that follows different training or exposure durations have been rare and did not include the conditions required to determine whether differences in outcomes are quantitative or qualitative [37][38][39]. On the one hand, improvements that follow either brief exposure or training are both maintained over time, as required by the definition of perceptual learning [16,27,40]. On the other hand, one of the hallmarks of perceptual learning is its exquisite specificity to the physical attributes of the trained stimuli [41,42]. Whereas speech learning following training appears quite specific to the acoustics of the trained stimuli [20,22,24,33,37,38,43], learning following brief exposure is thought to generalize more broadly [32,44,45]. For example, whereas rapid learning of time-compressed speech resulted in improved recognition of natural-fast speech [46], no such transfer was observed after more intensive training [16,33]. Methodological differences make the outcomes hard to compare across studies. Therefore, one goal of the current study was to test the talker specificity of rapid learning of time-compressed speech across different learning protocols (short and long) and test times (immediate and delayed).

Overview of the Current Study
We conducted two experiments using a time-compressed speech task to elicit learning. In Experiment 1, we compared learning and retention between rapid and training-induced learning of time-compressed speech. We also asked whether the two types of learning are differentially correlated with speech recognition in independent tasks: speech in noise and natural-fast speech. We report that learning of time-compressed speech is associated with the perception of natural-fast speech and speech in noise, with no apparent differences between rapid and training-induced learning. Experiment 2 explored the effects of stimulus repetition and talker variability on rapid perceptual learning of time-compressed speech. Outcomes were compared to those of previous studies on learning following longer training protocols, suggesting that the pattern of learning and specificity does not change between brief and prolonged training.

Experiment 1
Methods
Participants. 160 university students or recent graduates (ages 18-35 years, Mean = 26, SD = 3; 91 female, 69 male) participated in this experiment. Participants were volunteers who reported that they were native Hebrew speakers with normal hearing, no history of attention, learning or language deficits, and no experience with time-compressed speech. The study was performed in accordance with the Declaration of Helsinki.
All aspects of the study were approved by the ethics committee of the Faculty of Social Welfare and Health Sciences at the University of Haifa (permit #199/12). Informed consent was obtained from all participants. Participants were tested as described below; no other tests were conducted.
Participants were divided randomly into two groups: an exposure group (to assess rapid learning) and a training group (to assess training-induced learning), as explained below. Both groups completed two test sessions on separate days, in which they performed a time-compressed speech recognition task. At the end of the first session, the training group completed an additional training phase as described below. We note that parts of the data from the exposure group were previously published as part of a conference proceedings [47] and re-analyzed for the purpose of the current study. One participant had missing data and was not included in the data analysis, so we report data from 79 listeners in the exposure group (ages: Mean = 26, SD = 4; 38 female, 41 male) and 80 listeners in the training group (ages: Mean = 26, SD = 3; 52 female, 28 male).

Overall Design
The experiment comprised two sessions, 5 to 9 days apart. In each session, participants from both groups completed three speech recognition tests (time-compressed speech, natural-fast speech and speech-in-noise) in a counterbalanced order, as described below. The training group received additional training on time-compressed speech at the end of the first session. Participants completed the experiment in a quiet room on campus or in their homes. Stimuli were delivered diotically through headphones (Sennheiser HD-205 or HD-215) at a comfortable listening level, using custom software [22]. The time-compressed speech task was used to assess learning both within (rapid learning) and between (retention or consolidation) sessions. Comparison between the exposure and training groups was used to assess differences between rapid learning induced by the time-compressed speech task and training-induced learning. The other two tasks were used to determine whether perceptual learning of one type of speech is related to recognition of other types of challenging speech.

Stimuli and Tasks
Stimuli. 290 simple sentences in Hebrew [based on 48] were used. Sentences were five to six words long and had a subject-verb-object grammatical structure. Half of the sentences were semantically plausible (e.g., "the talented poet wrote a poem") and half were implausible (e.g., "the angry shopkeeper fired the rabbit").
Stimuli for the speech-in-noise and time-compressed speech tests were recorded by talker 1, a female native speaker of Hebrew with an average speech rate of 111 words/min (SD = 17). Stimuli for the natural-fast speech test were recorded by talker 2 at an average natural-fast rate of 214 words/min (SD = 26), because pilot testing suggested that natural-fast speech by talker 1 was not fast enough to challenge university students who are native speakers of Hebrew. Sentences were recorded in a sound-attenuating room at a sampling rate of 44.1 kHz with a standard microphone, and edited in Audacity® 2.1.3 to remove remaining noise and equate root-mean-square (RMS) amplitude across sentences.
Speech Recognition Tests. Sentences were randomly divided across the different tests such that in each test half the sentences were plausible and half were implausible. Different sentences were used in each test and session. Order of presentation was random but fixed across participants, with no sentence repetition. Sentence delivery was self-paced. Participants were asked to transcribe the sentences as accurately as they could, and the number of correctly transcribed words was counted for each sentence. Only perfectly transcribed words (ignoring homophonic spelling errors) were counted as correct. The proportion of correct words per sentence was used as an index of recognition accuracy.
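For illustration, the word-level scoring rule described above can be sketched as follows (a minimal Python sketch; the function name and the handling of repeated words are our assumptions, and the lenience for homophonic spelling errors is omitted):

```python
def score_sentence(target: str, transcript: str) -> float:
    """Proportion of target words transcribed correctly.

    A word is credited only on an exact (case-insensitive) match, and each
    transcribed word can be credited at most once.
    """
    remaining = transcript.lower().split()
    target_words = target.lower().split()
    correct = 0
    for word in target_words:
        if word in remaining:
            correct += 1
            remaining.remove(word)  # consume the matched response word
    return correct / len(target_words)
```

Under this rule, transcribing five of the six words of a six-word sentence yields a sentence score of 5/6.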
Speech-in-Noise Tests. On each session participants had to transcribe 25 different sentences. Sentences produced by talker 1 were mixed with 4-talker babble noise [taken from 22] at a signal-to-noise ratio of -6 dB.
Natural-fast speech tests. On each session participants had to transcribe 20 different sentences produced by talker 2.
Time-Compressed Speech Tests. On each session participants transcribed 10 sentences produced by talker 1. To afford isolation of the rapid learning effects, we used the minimal number of sentences thought to yield rapid learning in the majority of participants based on previous work [15]. Sentences were compressed to 30% of their natural duration using a WSOLA algorithm [49].
Training. Three blocks of 60 sentences each, produced by talker 1, were delivered. In the first block, participants had to transcribe sentences compressed to 30% of their natural duration, as described above.
The additional two blocks were adaptive. For each sentence, participants had to determine whether it was semantically plausible or not. Initial compression was 50% of the natural duration. Subsequently, a 2-down/1-up staircase procedure was used to adjust compression based on participants' responses. This procedure was used to give participants extra training without overburdening them. Because the main goal of the study was to determine whether training-induced and rapid learning differed in their relationships with other types of challenging speech, data from the training phase itself were not analyzed.
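The adaptive rule can be sketched as follows (a Python sketch under stated assumptions: the step size, floor and ceiling are illustrative values of our own, as the text does not report them; only the 2-down/1-up logic and the 50% starting point come from the description above):

```python
def update_compression(ratio, streak, is_correct, step=0.05,
                       floor=0.2, ceiling=1.0):
    """One trial of a 2-down/1-up staircase on the compression ratio.

    Two consecutive correct plausibility judgments make the task harder
    (smaller ratio = shorter duration); any error makes it easier.
    Returns the ratio for the next trial and the updated correct streak.
    """
    if is_correct:
        streak += 1
        if streak == 2:          # 2-down: harder after two correct in a row
            ratio = max(floor, ratio - step)
            streak = 0
    else:                        # 1-up: easier after every error
        ratio = min(ceiling, ratio + step)
        streak = 0
    return ratio, streak
```

Starting from the 50% compression reported above, two correct responses in a row would lower the ratio (e.g., to 0.45 with the illustrative step), and a subsequent error would raise it again.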

Data Analysis
Recognition accuracy data were analyzed in R [50] with a series of generalized linear mixed models using the lme4 package [51].
Learning Analysis. We used data from the time-compressed speech tests to assess rapid perceptual learning within and between sessions, as well as training-induced learning (see Results). Learning between the two test sessions was our main index of learning. To this end, for each participant the proportion of words correctly transcribed was averaged across all sentences within a session, and the difference between the averages of the two sessions served as a learning index. For the exposure group, this is an index of the rapid learning induced by completing the tests. For the training group, the value is a mixture of the rapid learning that occurred during the tests and the additional contribution of training-induced learning. Group effects in the statistical models described in the Results were used to statistically separate rapid and training-induced learning. Within-session learning across sentences was also modeled to further assess rapid learning and how it may interact with training-induced learning.
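The between-session learning index amounts to a simple difference of per-session means (a minimal Python sketch; the function and variable names are ours):

```python
def learning_index(session1_acc, session2_acc):
    """Between-session learning index for one participant.

    Each argument is the list of per-sentence proportions of correctly
    transcribed words from one session. The index is the session 2 mean
    minus the session 1 mean; positive values indicate improvement.
    """
    mean1 = sum(session1_acc) / len(session1_acc)
    mean2 = sum(session2_acc) / len(session2_acc)
    return mean2 - mean1
```

For example, a participant averaging 0.20 in session 1 and 0.33 in session 2 would receive a learning index of 0.13.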

Rapid, Exposure-Induced Learning Conforms to the Definition of Perceptual Learning
Time-compressed speech recognition in the two groups and sessions is shown in Figure 1. In the exposure group, mean recognition accuracy was 0.20 (SD = 0.14) in session 1 and 0.33 (SD = 0.21) in session 2. In the training group, mean accuracy was 0.26 (SD = 0.18) in session 1 and 0.47 (SD = 0.22) in session 2. Our first goal was to determine whether learning of time-compressed speech occurred between the two sessions and whether it differed between the two groups. Learning, defined as the improvement in time-compressed speech recognition accuracy between the two sessions, is also shown in Figure 1. The figure suggests that the recognition accuracy of the majority of participants in both groups improved between the two sessions.
To determine whether this learning was significant, and whether it was modulated by additional practice, mixed modelling was conducted. Random effects included a random intercept for participants, as well as a sentence-by-participant random slope to account for the possibility that rapid learning rates (changes in accuracy over sentences) vary across participants. Fixed effects included group (exposure dummy-coded as 0 and training coded as 1), sentence number (coded 1 to 10) and session (session 1 coded as 0 and session 2 coded as 1). A binomial regression with a logistic link function was used (as recommended by Dunn & Smyth, 2018, for proportion data). Three models were constructed: a model that included the random effects only (AIC = 11485), a model with additional main effects for each of the three fixed factors (AIC = 10670), and a "full" model that included all possible interaction terms between the fixed factors (AIC = 10558). Model comparison (using anova) suggested that the model with main effects fit the data significantly better than the model with random effects only (χ²(3) = 821, p ≤ 0.001), and that the full model fit the data better than the model with only main effects (χ²(4) = 120, p ≤ 0.001).
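The nested-model comparisons above rest on standard likelihood-ratio machinery of the kind lme4's anova() applies, which can be illustrated as follows (a Python sketch using scipy; the log-likelihood values in the example are arbitrary illustrations, not the fitted values from our models):

```python
from scipy.stats import chi2


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2*logL (lower is better)."""
    return 2 * n_params - 2 * loglik


def likelihood_ratio_test(loglik_reduced, loglik_full, df_diff):
    """Compare two nested models.

    Twice the log-likelihood difference is referred to a chi-square
    distribution with df equal to the number of parameters added by the
    fuller model; returns the test statistic and its p-value.
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2.sf(stat, df_diff)


# Illustrative values only: a fuller model adding 3 parameters
stat, p = likelihood_ratio_test(-5300.0, -5290.0, 3)  # stat = 20.0
```

A fuller model is preferred when the chi-square test is significant and its AIC is lower, which is the decision rule used throughout the Results.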
The effects in the full model (see Table 1) were used to determine whether learning occurred, whether it was maintained over time and whether it differed between the two groups. As expected from previous studies, a significant main effect of sentence was present, confirming that rapid learning of time-compressed speech occurred within session. Furthermore, overall performance was more accurate in the second session (main effect of session), and between-session improvements were even larger in the training group (significant group by session interaction), suggesting that training yielded additional learning between sessions. On the other hand, the effect of rapid learning itself (that is, change over sentences within a session) was smaller in the second session (significant sentence by session interaction). Although Figure 2 suggests that the magnitude of the decline in rapid learning between sessions could have been larger in the training group, the group by session by sentence interaction was not significant. To help interpret the effects from the statistical model, within-session learning is presented in Figure 2. Each listener transcribed 10 (different) time-compressed sentences in each session, and for the purpose of visualization, learning was defined as the difference in transcription accuracy between the final and first 5 sentences in a session. The figure suggests that, consistent with the sentence by session interaction, rapid learning rates were similar in the two groups during the first session (Mean = 0.14 and 0.15; SD = 0.14 and 0.15 in the exposure and training groups, respectively), and that rapid learning in the second session was reduced in both groups.
Furthermore, while not statistically significant in the full model, inspection of the within-session learning data suggests that whereas rapid learning during the second session was observed in 56/79 participants in the exposure group (with a median of 0.06 and an interquartile range from 0 to 0.13), only 41/80 participants in the training group continued to improve during session 2 (Median = 0.003, IQR = -0.087 to 0.087; χ² = 6.44, p = 0.011).
Taken together, the time-compressed speech data suggest that rapid learning occurred during the first test session, and to a lesser extent during the second session; additional training resulted in additional learning. Furthermore, rapid learning was maintained between sessions, conforming to the definition of perceptual learning.

Rapid Learning is Responsible for the Relationships between Perceptual Learning and Speech Recognition
A second goal was to determine whether perceptual learning of time-compressed speech was associated with speech perception in independent tasks (natural-fast speech and speech in noise), and if so, whether rapid and training-induced learning differed in this respect.
Speech perception in the two groups and sessions is shown in Figure 3. These data suggest that learning on time-compressed speech did not generalize to the other tasks, and thus that any associations between between-session learning on time-compressed speech and session 2 speech perception (i.e., natural-fast speech and speech in noise) could only arise due to rapid learning. Therefore, speech perception data in each of the tasks were modelled as a function of group, session and the group by session interaction as fixed effects, with random intercepts for participants and individual sentences.
Model comparison suggests that the model with all fixed effects (AIC = 13832) is a better fit to the natural-fast speech data than the model with random effects only (AIC = 13939; χ²(3) = 112, p ≤ 0.001). The fixed effects (see Table 2) suggest that natural-fast speech recognition was more accurate in session 2, but as both the group effect and the session by group interaction were not significant, this is not due to generalization of training-induced learning in the training group. Similarly, for speech in noise, model comparison suggests that the model with fixed effects (AIC = 25334) fit the data better than the model with random effects only (AIC = 25959; χ²(3) = 631, p ≤ 0.001). Although speech-in-noise recognition was poorer in session 2 than in session 1, there was no indication that this was due to training (see Table 2). Therefore, session 2 speech perception data were used in the following analyses to assess the association between perception and learning.

Table 2. Natural-fast speech and speech-in-noise perception as a function of group and session

Speech recognition is plotted in Figure 4 as a function of perceptual learning. To determine how perceptual learning contributed to speech recognition in the two tasks, the data were again modelled with mixed-effects binomial regression with a logistic link function. For each speech task, the following models were constructed: (1) a "random" model with random intercepts for participant and sentence; (2) a "main effects" model which included three additional main effects: group (exposure coded as 0 and training coded as 1), perceptual learning (the difference between session 2 and session 1, as plotted in Figure 1) and baseline recognition of time-compressed speech (mean of the first 5 sentences from session 1), with the two continuous predictors scaled; and (3) an "interaction" model in which the group by learning interaction was also included.
Model comparisons (anova) were used to determine whether the "main effects" model fit the speech data better than the model with random effects only. Then the "main effects" and "interaction" models were compared to determine whether the contribution of perceptual learning to speech perception differed between the training and the exposure groups. For natural-fast speech, the "main effects" model fit the data better than the model with random effects only, whereas adding the group by learning interaction did not improve the fit (see Table 3 for the parameters of the best-fit model).
For speech-in-noise, the "main effects" model (AIC = 11538) fit the data significantly better than the model with random effects only (AIC = 11586, χ²(3) = 54, p ≤ 0.001). Addition of the group by learning interaction had no significant impact on the fit (AIC = 11539, χ²(3) = 0.77, p = 0.381; see Table 3 for the parameters of the best-fit model). Therefore, it seems that additional training did not significantly change the contribution of rapid perceptual learning of time-compressed speech to either natural-fast speech or speech-in-noise recognition. Taken together, the current data replicate our previous findings [15] in showing that rapid perceptual learning contributes to speech recognition in independent tasks. Furthermore, the current findings suggest that this contribution does not change with training and is not attributable to the generalization of learning. Because learning was assessed across sessions, the present findings also suggest that the learning/perception correlations reflect "true" perceptual learning and not a transient effect.

Experiment 2
Both talker variability and stimulus repetition were previously suggested to influence the specificity of perceptual learning for speech [52][53][54]. In previous training studies on time-compressed speech, learning was specific, and neither of these factors influenced it [16,38]. Experiment 2 therefore explored the effects of repetition and talker variability on rapid learning of time-compressed speech and its talker specificity.

Methods
Participants. 255 native Hebrew speakers (ages 18-35 years, Mean = 27, SD = 4; 153 female, 102 male) participated in this study. All other details are as in Experiment 1. Participants were randomly divided into five groups and tested as described below; no other tests were conducted.

Overview of the Experiment and Exposure Groups
Participants were assigned randomly to one of five groups: a 'no exposure' control group and four exposure groups. The exposure groups transcribed 20 time-compressed sentences in one of the following conditions (see Table 4): 'baseline' (20 different sentences presented by a single talker), 'multi-talker' (the same 20 sentences presented by 5 different talkers, such that each talker delivered 4 different sentences), 'multi-repetition' (four sentences presented by a single talker, each repeated five times), and 'single sentence' (one sentence presented 20 times by a single talker). In the first session, participants in the exposure groups transcribed 20 time-compressed sentences as described below (see Table 4). After the exposure, all five groups were tested on the time-compressed and natural-fast speech tests described below, in a fixed order. In the second session (~7 days after session 1), they were again tested on the same tests (with different sentences), again in a fixed order. They then completed another natural-fast speech test, a speech-in-noise test, and the matrices, digit-span and similarities subtests from the WAIS-IV [55], in counterbalanced order.

Baseline. 20 different sentences were presented by talker 1. This condition is similar to those used in past studies to document rapid learning of time-compressed speech [15,33,40].
Multi-talker. The same sentences as in the baseline condition were presented by five different talkers, including talker 1 and talker 2. Although we found no effect of talker variability on the perceptual learning of time-compressed speech and its generalization in the past, this condition was included because past literature on other types of challenging speech suggests that talker variability can influence the transfer of learning (for a review of past studies and our previous attempt, see Tarabeih-Ghanayim et al. [16]).
Multi-repetition. Four sentences were selected randomly from the baseline condition and presented five times each by talker 1, in a pseudo-random order such that a single sentence could not repeat on two successive trials.
Single sentence. To further probe the effects of stimulus repetition, a single sentence was randomly selected from the baseline condition and presented 20 times by talker 1.
Test Conditions. On each test session participants had to transcribe 10 time-compressed sentences presented by talker 1, 10 time-compressed sentences presented by talker 2 and 10 natural-fast sentences presented by talker 2 (Table 4). These tests were presented in a fixed order. In session 2, after completion of those tests, two other tests were carried out: in one test, 20 natural-fast sentences recorded by talker 3 were presented; in the other, 20 sentences recorded by talker 1 and embedded in background noise (as in Experiment 1) were presented. In addition, three subtests from the WAIS-IV were administered (see Table 5). These were presented in counterbalanced order.
Sentence Transcription. Across exposure and testing, presentation was self-paced. Listeners heard each sentence, transcribed it and continued to the next sentence by pressing a "continue" button on screen using custom software [22]. Each sentence was played once, and no feedback was provided.

Data Analysis
For each sentence, the proportion of correctly transcribed words was computed as in Experiment 1 and submitted for further analysis. Data were analyzed using mixed-effects generalized linear modelling (using lme4 in R), with random intercepts for sentence and participant. Proportions of correct responses on each test were the dependent variables. Exposure condition (coded 0, 1, 2, 3, 4 for the no-exposure, baseline, multi-talker, multi-repetition and single-sentence conditions, respectively) and test session (session 1, session 2) were fixed effects for the time-compressed and talker 2 natural-fast speech tests.
Exposure condition was the only fixed factor for talker 3 natural-fast speech and for the speech-in-noise test, which were conducted in session 2 only. For each dependent variable (talker 1, talker 2, etc.), model comparison was used to determine whether the inclusion of each of the fixed effects improved the fit of the model significantly.

Test Performance as a function of Exposure
Time-compressed speech recognition accuracy is shown in Figure 5 (left and mid panels) and Table 6. For talker 1, each successive model fit the data better than the previous one. The model with exposure condition (AIC = 15397) fit the data better than the model with random effects only (AIC = 15399, χ²(4) = 9.54, p = 0.049). Adding session reduced the AIC to 15392 (χ²(1) = 7.31, p = 0.007). The model with interactions fit the data best (AIC = 15344, χ²(4) = 55.78, p ≤ 0.001). However, this model was hard to interpret because, as shown in Figure 5, performance in the no-exposure group (grey rectangles) improved between the two sessions, whereas changes in the exposure groups were variable: performance decayed in the 'baseline', 'multi-talker' and 'single-sentence' groups and somewhat increased in the 'multi-repetition' group. An inspection of the model parameters (Table 7) suggests that the condition by session interaction stems from a decrease in the group differences between the baseline and the no-exposure groups, the multi-talker and the no-exposure groups, and the single-sentence and the no-exposure groups, from session 1 to session 2. For talker 2, adding session (χ²(1) = 1135, p < 0.001) and the exposure condition by session interaction (AIC = 15089, χ²(4) = 22.32, p < 0.001) improved the fit. However, as all groups improved from session 1 to session 2, it seems that learning during the tests was sufficient for learning the time-compressed speech produced by talker 2, regardless of previous exposure (see Table 7).
Finally, there were no group differences in the recognition of either natural-fast speech (talker 3) or speech in noise in the final tests in session 2 (Figure 6). For natural-fast speech, adding exposure condition had no significant effect on the fit compared to a model with random effects for item and participant only (AIC random = 10681, AIC condition = 10688, χ²(4) = 0.67, p = 0.955). The same is true for speech in noise (AIC random = 13941, AIC condition = 13945, χ²(4) = 3.64, p = 0.456). Given the group data shown in Table 6 and Figure 6 (top panel), we decided not to model the natural-fast speech data from talker 2 for group differences.

Discussion
Active listening to 10 time-compressed sentences was sufficient for robust and long-lasting perceptual learning (Experiment 1), consistent with the available literature. This rapid learning was specific to the acoustic characteristics of the speech used to elicit learning (Experiment 2). Although additional practice resulted in more learning, the associations between perceptual learning and speech recognition were driven by rapid learning (Experiment 1). In the context of previous works, these data tentatively suggest that additional practice does not change the nature of the resulting perceptual learning. If this is the case, rapid learning is key to understanding the function of perceptual learning in speech recognition, as we discuss in the following sections.
Long-Lasting and Specific: The Outcomes of Rapid Learning are Consistent with the Characteristics of Perceptual Learning

In the current study (Experiment 1), the duration of the practice phase had quantitative but not qualitative effects on perceptual learning. Consistent with previous studies [16,40], this rapid learning was relatively long-lasting, but also quite specific to the acoustics of the stimuli that elicited learning. Although natural-fast speech recognition improved between the two test sessions, this improvement cannot be attributed to transfer of time-compressed speech learning: learning itself was stronger in the training group than in the exposure group, yet improvements in the recognition of natural-fast speech did not depend on group. The improvement is therefore more likely to reflect relatively rapid learning of natural-fast speech rather than transfer. Likewise, in Experiment 2, even when rapid learning occurred (in the baseline group), it was not reflected in the recognition of time-compressed speech produced by a new talker, similar to findings on training-induced learning of time-compressed speech [37,38].
Second, if learning were not stimulus specific, increasing the number of talkers or reducing the number of different sentences in Experiment 2 should not have interfered with learning. Yet these manipulations prevented rapid learning, in line with previous reports on the effects of talker variability [16,56,57]. For example, when listening to speech produced by talkers with atypical /s/ or /sh/ pronunciations, adaptation to the unusual sounds was faster when each talker was presented alone than when the two were interleaved [56]. Talker variability during learning is thought to support the transfer of learning by providing listeners with a better sample of the systematic variability in the target speech [7,58]. However, this is not necessarily true for time-compressed speech, for which talker variability was found to slow training-induced learning with no effect on learning transfer [16]. We therefore suggest that rapid and training-induced learning are similarly specific or general, and thus that rapid learning of speech reflects true perceptual learning rather than merely procedural or task learning. Similar conclusions were reported for non-verbal auditory and visual learning [12,13,42]. If learning emerges once experience with novel speech has provided sufficient familiarity with the characteristics of the target speech, both brief and prolonged practice could yield specific or general learning, depending on the characteristics of the input. For time-compressed speech, we demonstrate that learning is quite talker specific, as discussed above. On the other hand, learning of noise-vocoded speech seems to generalize more broadly across talkers and stimuli [32,45].
If more training does not change the nature of learning, what does it do? One option is that multi-session training provides further opportunities for learning to stabilize and consolidate without changing its overall nature [38,59,60]. This is consistent with the outcomes of both lab-based [37,61,62] and rehabilitation-oriented [22,60] studies. For example, in speech category learning, listeners accumulate information about the acoustic characteristics of the talker over time [61,63], so additional experience with a talker is likely to result in additional gains. Gradual accumulation of information about the talkers and the listening context could similarly support learning of perceptually challenging speech beyond the single-word level. Furthermore, additional experience gives slower-learning listeners the opportunity to 'catch up'. Unfortunately, most social, educational, and professional environments are unlikely to provide those opportunities. Therefore, added to the relative specificity of learning already discussed, it seems that rapid learning is key for understanding the role of perceptual learning in speech perception under challenging 'real-world' conditions.

Rapid Learning and Individual Differences in Speech Perception
Individual differences in rapid perceptual learning of speech account for unique variance in speech perception in independent tasks [15,33,64]. The current data essentially replicate this association between rapid learning of time-compressed speech and the perception of natural-fast speech and speech-in-noise.
However, we focused on between-session learning rather than on within-session learning, which was the focus of previous studies. This allowed us to assess whether the contributions of rapid and training-induced learning to speech perception on the other tasks differ (Figure 4). Overall, individuals with good learning were more likely to accurately recognize both natural-fast speech (odds ratio = 1.52) and speech-in-noise (odds ratio = 1.36). Additional practice on time-compressed speech by the training group made no significant contribution to speech-in-noise recognition and a negative contribution to the recognition of natural-fast speech. In the absence of cross-task generalization following training, these findings suggest that in the current study rapid learning accounts for the bulk of the speech/learning associations, consistent with previous findings on rapid within-session learning [15,33].
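To make these odds ratios concrete: in a logistic model, an odds ratio multiplies the odds of correct recognition, not the probability itself. The short sketch below (an illustration of the arithmetic only; the function name and the 50% baseline are our own assumptions, not values from the study) converts the reported odds ratios into probability shifts:

```python
def shift_by_odds_ratio(p_baseline, odds_ratio):
    """Apply a fixed odds ratio to a baseline probability of correct
    recognition and return the shifted probability.

    odds = p / (1 - p); the odds ratio scales the odds, and the
    result is mapped back to a probability.
    """
    odds = (p_baseline / (1.0 - p_baseline)) * odds_ratio
    return odds / (1.0 + odds)


# For a hypothetical listener at 50% recognition, an odds ratio of
# 1.52 (good rapid learners, natural-fast speech) corresponds to
# roughly 60% recognition; 1.36 (speech-in-noise) to roughly 58%.
print(round(shift_by_odds_ratio(0.50, 1.52), 2))  # → 0.6
print(round(shift_by_odds_ratio(0.50, 1.36), 2))  # → 0.58
```

Note that the same odds ratio yields a smaller absolute probability change the further the baseline is from 50%, which is why odds ratios rather than raw accuracy differences are reported.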
The idea that rapid perceptual learning plays an independent role in individual differences in speech perception has merit only to the extent that (rapid) perceptual learning is a general ability or capacity of an individual, at least within a domain. A few recent auditory [35] and visual [34,36] learning studies suggest that a common factor could explain learning across different tasks. Using visual and auditory discrimination tasks, Yang et al. [36] reported that despite large differences in learning rates across tasks, a common learning factor accounted for more than 30% of the variance across different learning tasks. Roark et al. [35] studied the learning of non-speech auditory and visual categories. They found that although learning rates were faster for visual than for auditory categories, categorization accuracy at the end of training was correlated between the auditory and the visual task, suggesting that individual differences in category learning are correlated across the auditory and visual modalities. As for associations across speech learning tasks, accuracy data of the type we normally collect might be insufficient to address this issue, given the analytical methods used in the studies that reported cross-task associations. Furthermore, although the rapid rates of learning in some speech tasks might make it difficult to separate "perception" from "learning", the consistent replication of the contribution of rapid learning of time-compressed speech to the perception of natural-fast speech and speech-in-noise suggests that this is not an incidental finding. Future studies should nevertheless test the hypothesis that different speech learning conditions cluster around a common factor.

Limitations & Implications
First, sample sizes were not based on a formal power analysis because it was not obvious how to conduct one given previous data and the rapid rates of time-compressed speech learning. Nevertheless, our previous studies of time-compressed speech learning (with similar but not identical conditions) yielded significant group differences as a function of the training protocol with sample sizes of 10 to 24 per group [e.g., 16,37,38,43]. The current sample size was therefore sufficient to uncover similar or larger effects. Furthermore, for the learning/recognition associations reported here (Table 3), the effect sizes for learning (expressed as odds ratios) were similar to those reported by Rotman et al. [15] (1.52 vs. 1.44-1.68 for natural-fast speech and 1.36 vs. 1.49 for speech-in-noise) with similar group sizes.
Second, our findings suggest that perceptual learning for speech is largely acoustically specific. However, this is not to say that longer training can never be useful. Instead, training-based studies or interventions should consider the specificity of learning in their design and expected outcomes. A recent review of perceptual learning of dysarthric speech [65] suggests that this is feasible. Since learning of dysarthric speech is constrained by the dysarthria characteristics of individual patients [66], and even experienced clinicians still benefit from talker-specific training [67], it has been proposed that communication partners train to improve the intelligibility of specific patients (e.g., a family member), accounting for learning specificity.
Third, speech perception under challenging conditions incorporates both stimulus-related (e.g., talker, input distribution; [18]) and listener-related (e.g., age, language, and cognition [68,69]) factors. We now suggest that rapid learning is another meaningful listener-related factor that could determine how well individual listeners adapt to new or changing auditory environments. Determining whether individual differences are associated across different learning tasks or with performance in other situations requires further studies with different learning and perception tasks. Still, the finding that individual differences in rapid learning with one type of challenging speech predict individual differences in the processing of a different type of challenging speech is telling, despite the correlational nature of our work.

Figure 5 caption: Time-compressed speech recognition by exposure condition in the immediate (session 1) and delayed (session 2) tests. For each exposure group, the mean (across all sentences per condition) and 95% confidence interval are shown. The grey rectangles mark the 95% confidence interval of the no-exposure group, who participated in testing only. The right-most section of the plot shows session 2 data for the exposure groups and session 1 data for the no-exposure group.

Figure 6 caption: Natural-fast speech (top) and speech-in-noise (bottom) recognition. For each exposure group, the mean (across all sentences per condition) and 95% confidence interval are shown. The grey rectangles mark the 95% confidence interval of the no-exposure group, who participated in testing only.