Description of Included Studies
In total, 342 studies met our inclusion criteria (see Supplementary C and N for a complete list). In two cases, two publications were based on the same data but reported different outcome measures; each pair was therefore treated as a single study (Chao et al., 2016, 2017; Machů, 2015; Machů & Lukeš, 2019).
Most studies were conducted in North America (k = 205), followed by Europe (k = 68). The included studies comprised 158 journal articles, 166 dissertations, 12 project reports, and 6 conference papers. In total, the primary studies included 158,713 participants: 62,729 students, 77,787 teachers, and 18,197 members of class and school teams from different professions, such as teaching assistants and administrative staff. Sample sizes in the primary studies ranged from 3 to 31,000 participants, with a median of 77. Twenty-three studies were conducted with preschool teachers, 115 with primary school teachers, and 76 with secondary school teachers; 128 studies did not specify the school type where teachers were employed. Moreover, 188 studies applied a cross-sectional design, 99 a single-group pre-posttest design, 37 an independent-group pre-posttest design, and 18 an independent-group posttest design. Most studies focused on inclusive education for students with special educational needs (k = 288), 76 of which focused on specific special educational needs (e.g., autism, learning disabilities). The remaining 54 studies focused on other diversity features, such as second-language learners and gifted students, or addressed multiple categories of heterogeneity.
The professional development programs investigated in the 154 intervention studies ranged from 2 to 750 hours, with a median of 20 hours, and lasted between half a day and three school years, with a median of three months. Most programs addressed a specific topic (k = 112; primarily specific types of special educational needs) but usually did not target a specific subject (k = 126). Most intervention studies assessed the professional development program’s impact immediately after its end (k = 98); 24 programs offered a certificate to the participants after completing the program, and 51 offered coaching in addition to the training.
In total, 1,123 effect sizes were calculated and distributed as follows among the outcome categories: 88 effect sizes for knowledge, 371 for skills, 461 for assessed beliefs regarding inclusive education, and 203 for influences on student behavior (Fig. 2). No differences were observed for effect sizes calculated from means and standard deviations compared to those calculated from reported test statistics (all F < 3.8, all ps > .05, see Supplement F).
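The equivalence of the two computation routes can be illustrated with a short sketch. This is not the analysis code used in the meta-analysis; it assumes independent groups and the standard small-sample correction for Hedges' g, once computed from means and standard deviations and once recovered from a reported t statistic.

```python
import math

def hedges_g_from_means(m1, m2, sd1, sd2, n1, n2):
    """Standardized mean difference with small-sample correction (Hedges' g)."""
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sd_pooled
    j = 1 - 3 / (4 * (n1 + n2) - 9)  # correction factor J for df = n1 + n2 - 2
    return j * d

def hedges_g_from_t(t, n1, n2):
    """Recover g from a reported independent-samples t statistic."""
    d = t * math.sqrt(1 / n1 + 1 / n2)
    j = 1 - 3 / (4 * (n1 + n2) - 9)
    return j * d
```

For the same underlying data, both routes yield the same g, which is why pooling across reporting formats is unproblematic when no systematic differences are detected.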
Summary Effects
We calculated summary effects for each outcome category and investigated whether the subcategories within a category differed from each other. We observed significant positive effects in all four outcome categories (Fig. 3). The analysis of knowledge showed a large effect (g = 0.93 [0.76; 1.10]), with no difference between self-rated knowledge (g = 0.96 [0.70; 1.22]) and knowledge assessed using tests (g = 0.91 [0.68; 1.15]; F(1, 86) = 0.50, p = .48). A moderate effect was observed on skills to implement inclusive education (g = 0.49 [0.41; 0.56]) and on its subcategories (see Fig. 3); the subcategories (implementation quality, use of inclusive methods, self-efficacy for inclusive teaching) did not differ from each other (F(2, 368) = 0.15, p = .86).
We observed a small but significant positive effect on beliefs toward inclusive education (g = 0.23 [0.17; 0.28]), again with no differences between its subcategories (F(2, 457) = 2.34, p = .10). Positive effects of professional development participation were observed for attitude (g = 0.23 [0.18; 0.28]) and perception of inclusive teaching methods (g = 0.27 [0.16; 0.39]), whereas no significant effect was observed for concerns about inclusive education (g = 0.08 [-0.14; 0.30]). A small-to-moderate effect of teachers' participation in professional development was observed on students' behavior (g = 0.37 [0.23; 0.51]), with no difference between student achievement (g = 0.41 [0.22; 0.61]) and other student behavior (g = 0.29 [0.11; 0.46]; F(1, 201) = 0.001, p = .98).
Next, we investigated heterogeneity within each outcome category; the Q-test was significant for all four (Table 1; for more detailed information, see Supplement E). Most variance was located at the between-study level, except in the beliefs category, where variance was mainly present at the within-study level (53.21%). These results warranted moderator analyses in all four outcome categories.
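The variance split reported in Table 1 can be approximated with a short sketch. This is a simplified illustration of a three-level decomposition (following the common multilevel I² approach with a "typical" sampling variance), not the exact model used here; the variance components and sampling variances below are hypothetical.

```python
def variance_distribution(sigma2_within, sigma2_between, sampling_vars):
    """Approximate three-level variance split: shares of total variance
    attributable to sampling error, within-study heterogeneity, and
    between-study heterogeneity."""
    k = len(sampling_vars)
    w = [1 / v for v in sampling_vars]
    # "typical" sampling variance (Higgins-Thompson style)
    typical_v = (k - 1) * sum(w) / (sum(w) ** 2 - sum(x**2 for x in w))
    total = typical_v + sigma2_within + sigma2_between
    return {
        "sampling (%)": 100 * typical_v / total,
        "within-study (%)": 100 * sigma2_within / total,
        "between-study (%)": 100 * sigma2_between / total,
    }
```

A category such as beliefs, where the within-study share exceeds the between-study share, indicates that effect sizes vary more within studies (e.g., across instruments) than between studies.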
Table 1
Overview of Summary Effects and Variance
| Outcome category | k | N | Effect size | 95% CI | Q | p(Q) | Within-study variance (%) | Between-study variance (%) |
|---|---|---|---|---|---|---|---|---|
| Knowledge | 50 | 88 | 0.926 | 0.76–1.10 | 1482.96 | < .0001 | 10.24 | 83.35 |
| Skills | 153 | 371 | 0.485 | 0.41–0.56 | 7446.14 | < .0001 | 34.19 | 63.76 |
| Beliefs | 200 | 461 | 0.230 | 0.18–0.28 | 2745.30 | < .0001 | 53.21 | 32.59 |
| Student behavior | 51 | 203 | 0.372 | 0.23–0.51 | 3339.81 | < .0001 | 39.36 | 58.51 |

Note. k indicates the number of studies reporting data in the corresponding outcome category; N indicates the number of effect sizes per category.
Moderator Analyses of Study Characteristics
The results of the moderator analyses for the four outcome categories are summarized in Table 2 (see Supplements J–M for more detailed information). Publication year, years since the legal anchoring of inclusive education, and the continent where the studies were conducted did not influence the observed effects. Effects on knowledge and beliefs were not influenced by the control variables describing the study characteristics. Effects on skills differed between intervention studies and cross-sectional studies, with the former reporting significantly larger effect sizes (g = 0.56 [0.46; 0.66]) than the latter (g = 0.36 [0.27; 0.46]).
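Conceptually, each moderator test regresses effect sizes on a study characteristic, weighting each effect by its precision. A minimal fixed-effect sketch with a single continuous moderator (deliberately ignoring the three-level structure and the F-tests reported in Table 2) might look like this; all inputs are hypothetical.

```python
def weighted_meta_regression(effects, variances, moderator):
    """Minimal fixed-effect meta-regression: weighted least squares of
    effect sizes on one moderator, with weights 1/variance.
    Returns (intercept, slope)."""
    w = [1 / v for v in variances]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, moderator)) / sw
    my = sum(wi * y for wi, y in zip(w, effects)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, moderator))
    sxy = sum(wi * (x - mx) * (y - my)
              for wi, x, y in zip(w, moderator, effects))
    slope = sxy / sxx          # change in g per unit of the moderator
    intercept = my - slope * mx
    return intercept, slope
```

The slope corresponds to the B coefficients reported below (e.g., B = 0.11 for active learning), i.e., the expected change in g per unit increase in the moderator.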
Table 2
Overview of Moderator Analyses in the Outcome Categories
| Moderator | Knowledge F | df | p | Skills F | df | p | Beliefs F | df | p | Student behavior F | df | p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Study design | | | | | | | | | | | | |
| Publication year | 1.06 | 1, 86 | .31 | 3.22 | 1, 368 | .07 | 0.08 | 1, 458 | .77 | 2.56 | 1, 200 | .11 |
| Years since legal anchoring | 0.22 | 1, 86 | .64 | 0.12 | 1, 368 | .74 | 0.91 | 1, 458 | .34 | 0.60 | 1, 200 | .44 |
| Intervention studies | 3.73 | 1, 86 | .06 | 8.13 | 1, 369 | .005 | 0.04 | 1, 458 | .84 | 0.16 | 1, 201 | .69 |
| Continent | 0.67 | 6, 81 | .68 | 1.03 | 8, 362 | .41 | 1.60 | 8, 451 | .12 | 2.27 | 4, 198 | .06 |
| Data collection | | | | | | | | | | | | |
| Time between training and data collection | 0.38 | 1, 75 | .54 | 1.63 | 1, 257 | .20 | 0.41 | 1, 210 | .52 | 2.72 | 1, 194 | .10 |
| Instrument focus: diversity feature | 6.57 | 1, 86 | .01 | 0.00 | 1, 369 | .95 | 6.21 | 1, 458 | .01 | 1.08 | 1, 201 | .30 |
| Instrument focus: method | 0.03 | 1, 86 | .86 | 0.86 | 1, 369 | .36 | 6.34 | 1, 458 | .01 | 0.15 | 1, 201 | .70 |
| Instrument type | 1.99 | 2, 85 | .14 | 1.40 | 5, 365 | .22 | 1.57 | 3, 456 | .20 | 8.53 | 4, 198 | < .001 |
| Participant characteristics | | | | | | | | | | | | |
| Mean age | 1.66 | 1, 34 | .21 | 0.84 | 1, 89 | .36 | 0.03 | 1, 170 | .87 | 0.07 | 1, 41 | .79 |
| Inclusive teaching experience | 3.89 | 1, 39 | .06 | 2.16 | 1, 214 | .14 | 0.08 | 1, 281 | .77 | - | - | - |
| School type | 0.57 | 3, 84 | .64 | 1.57 | 3, 366 | .20 | 2.66 | 3, 456 | .048 | 2.58 | 3, 198 | .06 |
| Professional development | | | | | | | | | | | | |
| Content focus | 0.01 | 1, 75 | .94 | 0.02 | 1, 257 | .90 | 0.38 | 1, 210 | .54 | 2.55 | 1, 194 | .11 |
| Active learning | 0.00 | 1, 75 | .98 | 7.84 | 1, 368 | .005 | 2.14 | 1, 210 | .15 | 0.02 | 1, 194 | .90 |
| Coherence | 0.70 | 1, 75 | .41 | 2.10 | 1, 368 | .15 | 0.91 | 1, 210 | .34 | 1.68 | 1, 194 | .20 |
| Duration | 0.02 | 1, 75 | .89 | 0.56 | 1, 245 | .46 | 1.05 | 1, 209 | .31 | 0.01 | 1, 188 | .92 |
| Collective participation | 1.89 | 1, 75 | .17 | 0.07 | 1, 257 | .79 | 0.85 | 1, 210 | .36 | 3.86 | 1, 194 | .05 |
| Certification | 8.27 | 1, 75 | .005 | 0.74 | 1, 257 | .39 | 0.93 | 1, 209 | .34 | 0.07 | 1, 194 | .79 |

Note. Degrees of freedom differ based on the information available in the studies.
Moderator Analyses of Data Collection Characteristics
No differences were observed based on the number of weeks between the last session of the professional development program and post-data collection (Table 2; for more detailed information, see Supplements J–M). Effects on knowledge were influenced by the type of instrument used: Studies applying instruments focusing on a specific diversity feature reported smaller knowledge gains (g = 0.75 [0.47; 1.02]) than studies applying instruments focusing on addressing diversity features in inclusive education (g = 1.04 [0.83; 1.24]). Effects on tested knowledge were larger when assessed with (mainly self-developed) surveys (g = 1.10 [0.80; 1.39]) than with questionnaires and single items (g = 0.63 [0.28; 0.98]; F(2, 48) = 3.54, p = .04).
Effects on skills were not influenced by variables describing the data collection. Regarding beliefs, effects differed based on the instruments applied, with larger effect sizes observed when instruments focused on specific teaching methods (g = 0.51 [0.41; 0.60]) than when they focused on the implementation of inclusive education in general (g = 0.22 [0.17; 0.27]). The type of measurement influenced effects on student behavior, with studies applying observational measures reporting larger effects (g = 0.69 [0.34; 1.03]) than studies using teachers' self-reports (g = 0.16 [-0.09; 0.40]).
Moderator Analyses of Participant Characteristics
None of the variables describing participant characteristics influenced the observed effect sizes at the level of the outcome categories (Table 2). At the subcategory level, school type influenced effect sizes for attitudes toward inclusive education (F(3, 333) = 3.29, p = .02): We observed positive effects for primary (g = 0.24 [0.17; 0.32]) and secondary (g = 0.37 [0.25; 0.48]) school teachers but no effect for kindergarten teachers (g = -0.03 [-0.55; 0.49]). Two further subcategories, tested knowledge and use of inclusive teaching methods, were influenced by participant characteristics: The higher the mean age, the smaller the effect observed for tested knowledge (F(1, 23) = 8.50, p = .01, B = -0.11, SE = 0.04), and the more inclusive teaching experience the teachers reported, the smaller the effects on the use of inclusive teaching methods (F(1, 88) = 10.91, p = .001, B = -0.02, SE = 0.007).
Moderator Analyses of Professional Development Design
Content focus did not influence any outcome category, but it did influence a subcategory of student behavior: Changes in student achievement were positively influenced by higher content focus (F(1, 90) = 5.66, p = .02, B = 0.25, SE = 0.11). Active learning was a significant moderator of skills (F(1, 368) = 7.84, p = .005): Programs with more active learning opportunities reported larger effect sizes (B = 0.11, SE = 0.04). Specifically, the indicator alternating versus blocked design explained variance in effect sizes reflecting changes in the use of teaching methods (F(1, 88) = 9.03, p = .004), with larger changes reported in programs alternating input and practice phases (g = 0.78 [0.57; 0.99]) than in programs with a blocked design (g = 0.03 [-0.40; 0.46]).
Coherence with other learning activities had no influence at the category level; however, changes in the perception of inclusive teaching methods were positively influenced by additional coherent learning activities (F(1, 60) = 4.80, p = .03, B = 0.11, SE = 0.07). Analyses of the coherence indicators showed that programs requiring teachers to fulfill prerequisites for participation reported larger changes in the subcategories perception of teaching methods (g = 0.60 [0.33; 0.87]; F(1, 59) = 13.58, p = .005) and student achievement (g = 0.71 [0.18; 1.24]; F(1, 90) = 5.20, p = .03) than programs open to all teachers (g = 0.13 [0.02; 0.24] and g = 0.25 [0.09; 0.42], respectively).
Because of the large differences between programs (range 2–750 hours), we limited the duration analyses to programs lasting up to 200 hours, representing about two-thirds of all effect sizes (64.6%), to reduce the influence of extreme programs. Following this reduction, we did not observe any influence of training duration on the outcome categories or subcategories (Table 2 and Supplements J–M). Collective participation did not influence any outcome category but negatively influenced the subcategory student achievement (F(1, 90) = 7.71, p = .01, B = -0.21, SE = 0.08). When all school personnel participated, no effects on student achievement were observed (g = 0.06 [-0.06; 0.17]); however, small-to-moderate effects were observed when class teams participated (g = 0.30 [0.10; 0.50]), and moderate effects were noted when teachers participated with one colleague (g = 0.66 [0.16; 1.15]) or without colleagues (g = 0.74 [0.10; 1.39]). Studies offering certification after successful completion of the program observed larger effect sizes for knowledge gain (g = 1.39 [1.07; 1.72]) than programs without certification (g = 0.86 [0.65; 1.08]), but certification did not influence the other outcome categories.
Study Quality
The risk of bias in the included studies was generally high (M = 3.74, SD = 1.36, Min = 1, Max = 8.5). Although moderator analyses indicated that study quality did not influence the observed effect sizes in the four outcome categories (all F < 1.4, all ps > .2; see Supplement G), an influence was observed in the subcategories self-rated knowledge (F(1, 35) = 11.06, p = .002, B = -0.25, SE = 0.08) and use of inclusive teaching methods (F(1, 133) = 4.89, p = .03, B = -0.12, SE = 0.05), where lower risk of bias was related to smaller effect sizes.
Publication Bias
Regarding the presence of publication bias, visual inspection of the contour-enhanced funnel plots indicated that the individual effects were roughly symmetrical (see Supplements F, H, and I). Most effects fell within the 99% confidence interval, and outliers stemmed from both published and unpublished studies. Egger regression tests suggested symmetry of the funnel plots for all outcome categories and subcategories (all F < 1.3, all ps > .2). The power-enhanced funnel plots (Fig. 4) illustrate substantial differences between studies in the power to detect the estimated effect sizes. A few studies with very low power were included, but these reported effect sizes within the normal range. Most studies had low power, especially those assessing beliefs (median power = 21.5%) and student behavior (median power = 41.8%); power was moderate in studies assessing skills (median power = 69.2%) and sufficient in studies assessing knowledge (median power = 90.2%). Publication status did not predict effects on knowledge (F(1, 86) = 0.88, p = .35) or skills (F(1, 368) = 2.07, p = .15), but it moderated effects on beliefs (F(1, 457) = 6.85, p = .01) and student behavior (F(1, 200) = 4.77, p = .03): Larger effects were reported in published studies (beliefs: g = 0.30 [0.22; 0.38]; student behavior: g = 0.49 [0.28; 0.71]) than in unpublished studies (g = 0.19 [0.13; 0.24] and g = 0.17 [0.07; 0.28], respectively).
Taken together, these analyses suggest that publication bias is present to a varying degree across the outcome categories and subcategories of the current meta-analysis. However, two typical sources of publication bias do not exert a large influence on the estimated effect sizes, as studies with small sample sizes and low power report effect sizes within the normal range. Moreover, because more than half of the included effect sizes stem from unpublished studies (53%), the risk of publication bias was reduced by design.
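An Egger-type asymmetry check like the one reported above can be sketched as a regression of standardized effects on precision, where an intercept far from zero suggests funnel-plot asymmetry. This simple ordinary-least-squares version ignores the multilevel structure of the data and is purely illustrative.

```python
import statistics

def egger_test(effects, ses):
    """Classic Egger regression: regress z = g / SE on precision = 1 / SE.
    Returns (intercept, slope); a nonzero intercept indicates asymmetry."""
    z = [g / se for g, se in zip(effects, ses)]
    precision = [1 / se for se in ses]
    mean_x = statistics.mean(precision)
    mean_y = statistics.mean(z)
    sxx = sum((x - mean_x) ** 2 for x in precision)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(precision, z))
    slope = sxy / sxx            # estimate of the underlying effect
    intercept = mean_y - slope * mean_x  # asymmetry indicator
    return intercept, slope
```

With a perfectly symmetric funnel (identical true effect at every precision level), the intercept is zero and the slope recovers the common effect, which matches the nonsignificant Egger results reported for all categories here.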