Machine Learning Uncovers Aerosol Size Information From Chemistry and Meteorology to Quantify Potential Cloud‐Forming Particles

Cloud condensation nuclei (CCN) are mediators of aerosol‐cloud interactions, which contribute to the largest uncertainty in climate change prediction. Here, we present a machine learning (ML)/artificial intelligence (AI) model that quantifies CCN from model‐simulated aerosol composition, atmospheric trace gas, and meteorological variables. Comprehensive multi‐campaign airborne measurements, covering varied physicochemical regimes in the troposphere, confirm the validity of and help probe the inner workings of this ML model: revealing for the first time that different ranges of atmospheric aerosol composition and mass correspond to distinct aerosol number size distributions. ML extracts this information, important for accurate quantification of CCN, additionally from both chemistry and meteorology. This can provide a physicochemically explainable, computationally efficient, robust ML pathway in global climate models that only resolve aerosol composition; potentially mitigating the uncertainty of effective radiative forcing due to aerosol‐cloud interactions (ERFaci) and improving confidence in assessment of anthropogenic contributions and climate change projections.

Obtaining agreement of CCN predictions with observations is crucial toward mitigating the uncertainty associated with aerosol-cloud interactions. Two factors play the largest role in determining CCN (at a given water supersaturation): aerosol particle number size distributions (PNSD) and aerosol chemical composition (speciation; Fitzgerald, 1973; Junge & McLaren, 1971). While the debate continues (Crosbie et al., 2015; Dusek et al., 2006; Hudson, 2007; Twohy & Anderson, 2008) as to which factor plays a larger role, the more predominant effect is arguably that of PNSD due to the third order dependence on size for the solute effect that permits water vapor condensation as well as the greater variability of PNSD than that of speciation, except in polluted regions. However, most global climate models (GCMs) use simplified prescriptions to estimate aerosol numbers or CCN from speciation while assuming a fixed PNSD (Boucher & Lohmann, 1995; Menon et al., 2002; Menon & Rotstayn, 2006). This is due to current computational constraints, which limit the incorporation into GCMs of size-resolved microphysics models with a detailed treatment of processes pertinent to a more accurate representation of PNSD and hence CCN number concentrations.
Machine learning (ML) is a subset of artificial intelligence (AI) where computers are trained on a large number of scenarios to acquire knowledge by statistical learning and without explicit instructions. While ML has been in use for the last several decades (Dramsch, 2020; Reichstein et al., 2019), in recent years, novel techniques and rapid advances in ML have led to its emergent applications in the atmospheric sciences (e.g., Grange et al., 2018; Jin et al., 2019; Nair & Yu, 2020; Su et al., 2020), especially in grappling with ordinal, nonlinear, complex, and massive amounts of data. It is key, however, that these increasingly black-box ML/AI techniques remain grounded in reality for trustworthiness and generalizability.
We, therefore, set out to probe the inner workings of our recently proposed ML model (Nair & Yu, 2020) trained on a chemical transport model with detailed size-resolved microphysics for deriving CCN number concentrations, that is, why CCN can be predicted from aerosol speciation (and other commonly available atmospheric variables) without size information. Comprehensive multi-campaign airborne measurements over varied physicochemical regimes across the tropospheric extent are used to explore the key parameters determining [CCN].

Machine Learning Model
Random forest (Breiman, 2001) is an ML technique that can be used for regression analysis and for understanding the dependence of an outcome on other variables (its predictors). It is an ensemble (to reduce overfitting) of several decision trees (Breiman et al., 1984), each obtained on random subsets (Breiman, 1996) of the training data. To generalize well, this ML model needs to be trained on a large number of scenarios, for which presently available measurements are scant (see Text S2). Here, the RFRM (Random Forest Regression Model) is trained on 30 yr simulations by GEOS-Chem-APM: a state-of-the-science chemical transport model with detailed size-resolved microphysics (Yu & Luo, 2009). The present study uses the RFRM-ShortVars configuration (Nair & Yu, 2020), built with a fast implementation (Wright & Ziegler, 2017) of random forest models (Breiman, 2003) in the statistical computing language R (R Core Team, 2020). RFRM-ShortVars, which was developed to use PM2.5 (mass of Particulate Matter (PM) with particle diameter ≤ 2.5 μm) speciation as predictors for number concentrations of CCN at 0.4% supersaturation ([CCN0.4]), is retrained to use airborne measurements of PM1 speciation (in lieu of PM2.5 speciation measurements). Henceforth referred to as RFRM, this model derives [CCN0.4] from the following 9 commonly measured variables of atmospheric state and composition as input predictors: (Meteorology) temperature (T) and relative humidity (RH), (Gas-phase chemistry) SO2, NO, and O3, and (Aerosol composition and mass) NH4, SO4, NO3, and OA (organic aerosol). The present analysis focuses on [CCN0.4] for the purpose of demonstration and can be extended to the full CCN spectrum in future work.
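The ensemble idea described above (decision trees fitted on random subsets of the training data, with their outputs averaged) can be illustrated with a minimal sketch. This is a toy, pure-Python illustration and not the RFRM itself, which uses a fast R implementation trained on GEOS-Chem-APM output. All function names, default parameters, and the stopping rules below are assumptions chosen for brevity.

```python
import random
from statistics import mean

def build_tree(rows, targets, n_try, min_leaf, rng, depth=0, max_depth=6):
    """Grow one regression tree; a leaf is a number, a split is a tuple."""
    if len(rows) < 2 * min_leaf or depth >= max_depth or len(set(targets)) == 1:
        return mean(targets)
    best = None
    # Consider only a random subset of predictors at each split.
    for f in rng.sample(range(len(rows[0])), n_try):
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            ml, mr = mean(left), mean(right)
            sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold)
    if best is None:
        return mean(targets)
    _, f, threshold = best
    li = [i for i, r in enumerate(rows) if r[f] <= threshold]
    ri = [i for i, r in enumerate(rows) if r[f] > threshold]
    return (
        f,
        threshold,
        build_tree([rows[i] for i in li], [targets[i] for i in li],
                   n_try, min_leaf, rng, depth + 1, max_depth),
        build_tree([rows[i] for i in ri], [targets[i] for i in ri],
                   n_try, min_leaf, rng, depth + 1, max_depth),
    )

def predict_tree(node, row):
    """Follow the splits until a leaf value is reached."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node

def fit_forest(rows, targets, n_trees=25, n_try=1, min_leaf=2, seed=0):
    """Fit each tree on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx],
                                 n_try, min_leaf, rng))
    return forest

def predict_forest(forest, row):
    """Ensemble prediction: average of the individual tree predictions."""
    return mean([predict_tree(tree, row) for tree in forest])
```

In use, each row would hold the predictor values for one sample (for example the nine predictors listed above) and each target the corresponding [CCN0.4]. Averaging over many trees is what reduces the overfitting of any single tree.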

Multi-Campaign Airborne Measurements
Comprehensive (global scope, tropospheric vertical extent, varied seasons, and high temporal resolution) airborne measurements of atmospheric state and composition variables provide an unparalleled opportunity to probe the inner workings of the ML derivation of [CCN0.4] in varied atmospheric environments. Seven airborne campaigns were identified (Table S1) with simultaneous measurements of the 9 predictors as well as [CCN0.4]; their spatial domain is shown in Figure S1, instrumentation details are given in Table S2, and further details of the [CCN0.4] measurements are given in Text S1. PNSD presented here are for particle diameters up to 1,000 nm, above which aerosol numbers sharply taper off and negligibly contribute to [CCN0.4]. For the ATom1-4 campaign, PNSD is measured using the aerosol microphysical properties (AMP) package, and for the other campaigns using a scanning mobility particle sizer (SMPS; and nano-SMPS for WE-CAN) and either an Ultra-High Sensitivity Aerosol Spectrometer (UHSAS: ARCTAS, DISCOVER-AQ TX, and WE-CAN) or a Laser Aerosol Spectrometer (LAS: DC3, KORUS-AQ, and SEAC4RS). To increase data coverage, if a measurement was missing and there were measurements one second prior and/or after, it was imputed with their mean value. For DC3 [SO2] (0.1 Hz) and WE-CAN HR-ToF-AMS (0.2 Hz), measurements were assumed constant for 10 and 5 s, respectively.
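The two gap-handling rules described above can be sketched as follows. This is a minimal illustration that assumes a 1 Hz series stored as a Python list with None marking missing samples; the function names are hypothetical and the actual processing code of the study is not shown in the text.

```python
from typing import List, Optional

def impute_one_second_gaps(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill a missing 1 Hz sample with the mean of the measured values
    one second before and/or after it; leave it missing otherwise."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        neighbours = []
        # Only original measurements count as neighbours, not imputed values.
        if i > 0 and series[i - 1] is not None:
            neighbours.append(series[i - 1])
        if i + 1 < len(series) and series[i + 1] is not None:
            neighbours.append(series[i + 1])
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled

def hold_constant(samples: List[Optional[float]], hold_seconds: int) -> List[Optional[float]]:
    """Expand a slower series to 1 Hz by repeating each sample,
    e.g. hold_seconds=10 for 0.1 Hz data or 5 for 0.2 Hz data."""
    expanded: List[Optional[float]] = []
    for sample in samples:
        expanded.extend([sample] * hold_seconds)
    return expanded
```

Using only the original neighbours keeps a run of consecutive gaps from being filled by values that were themselves imputed.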

Statistical Estimators to Quantify RFRM Performance
In the present study, we use the following statistical estimators for model-observation comparison: the Kendall rank correlation coefficient (τ) to quantify correlation and %-Good to quantify agreement. The rationale and advantages of using these statistical metrics to evaluate model-observation comparisons are described in detail elsewhere (Nair et al., 2019). In the definitions of these estimators, n is the sample size, x is the value, t_i is the number of tied ranks in the i-th group of tied ranks, and superscripts o and m denote observed and modeled values, respectively.
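Because the defining equations are not reproduced in this text, the following is a hedged sketch of how the two estimators could be computed rather than the authors' exact formulation. It assumes the tie-corrected form of the Kendall coefficient (often called tau-b), a fractional bias FB = 2(m − o)/(m + o), and a good-agreement threshold of |FB| ≤ 0.6. The FB form is consistent with the later remark that |FB| < 0.2 corresponds to roughly 22% deviation, and the threshold is inferred from the over- and underestimation limits used later in the text.

```python
import math
from collections import Counter
from typing import Sequence

def _sign(value: float) -> int:
    return (value > 0) - (value < 0)

def kendall_tau_b(observed: Sequence[float], modeled: Sequence[float]) -> float:
    """Tie-corrected Kendall rank correlation (assumed tau-b form), O(n^2)."""
    if len(observed) != len(modeled):
        raise ValueError("observed and modeled must have the same length")
    n = len(observed)
    score = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            score += _sign(observed[i] - observed[j]) * _sign(modeled[i] - modeled[j])
    n_pairs = n * (n - 1) / 2
    # Each group of t tied values removes t(t-1)/2 comparable pairs.
    ties_o = sum(t * (t - 1) / 2 for t in Counter(observed).values())
    ties_m = sum(t * (t - 1) / 2 for t in Counter(modeled).values())
    denominator = math.sqrt((n_pairs - ties_o) * (n_pairs - ties_m))
    if denominator == 0:
        raise ValueError("correlation undefined: all values tied in one series")
    return score / denominator

def fractional_bias(observed: float, modeled: float) -> float:
    """FB = 2 (m - o) / (m + o); bounded by -2 and +2 for positive values."""
    return 2.0 * (modeled - observed) / (modeled + observed)

def percent_good(observed: Sequence[float], modeled: Sequence[float],
                 threshold: float = 0.6) -> float:
    """Percentage of pairs with |FB| <= threshold (threshold is an assumption)."""
    good = sum(1 for o, m in zip(observed, modeled)
               if abs(fractional_bias(o, m)) <= threshold)
    return 100.0 * good / len(observed)
```

The pairwise loop is quadratic in sample size, which is fine for illustration but slow for second-resolution flight data; a library routine would be preferable in practice.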

Machine Learning Successfully Derives CCN Number Concentrations
We compare three approaches to deriving [CCN0.4]: LinReg (linear regression on the measured aerosol speciation), RFRM-PM (random forest using only PM1 speciation as predictors), and RFRM (random forest using all nine predictors). For the RFRM, 80% (Figure 1d) of the derived values are within the corridor of good agreement between the dashed light red and dashed light blue lines. While the RFRM is overall robust, we examine the cases where it deviates from airborne measurements. When these model-observation disagreements (absolute fractional bias, |FB| > 1) do occur, they are rare (5.9%) and in a regime where their effect on cloud properties will be smallest (Martin et al., 1994; Ramanathan, 2001), that is, the sensitivity of cloud droplet numbers to changes in aerosol numbers is reduced at their high concentrations. For high (above 3 × 10³ cm⁻³) measured [CCN0.4], the RFRM low bias (FB < −1) is largely associated with the wildfire plume measurements during the ARCTAS and WE-CAN campaigns. It must be noted here that the low likelihood of the RFRM being exposed to these scenarios of high [CCN0.4] and predictor values in its training (on the GEOS-Chem-APM global simulations) may contribute to this observed low bias. Ultimately, however, this scenario is infrequent: ARCTAS (8.7% of its measurements), WE-CAN (8.3%), SEAC4RS (2.7%), and other campaigns (0.5%). The high bias (FB > +1) of RFRM-derived [CCN0.4] occurs mainly during SEAC4RS (14%) and WE-CAN (7.1%). While the reason for this remains to be determined, measurement uncertainties may play a part; for instance, in Figure S4a, [CCN0.4] measured directly and inferred separately are in large disagreement for SEAC4RS during these instances of apparent RFRM high bias.
While the Random Forest Regression Models demonstrate a high degree of predictive performance overall, we examine their performance in greater detail, leveraging the high temporal resolution of airborne measurements, in Figure 2. For illustration, we select a day (June 10, 2016 from the KORUS-AQ campaign) with large variability in altitude (surface to 8.5 km) as well as in the 9 predictors. Shown is the time series of the measurements of these variables during this day, with measured [CCN0.4] in black in Figure 2a (see also Figures S9 and S10). For WE-CAN (4-6 km) and ARCTAS (1-3 km), the earlier noted tendency of the RFRM to underpredict [CCN0.4] is seen in the splitting and leftward skewing of the violin distribution (Figures S9 and S10). Examining this in further detail, for observations with PM1 OA > 40 μg·m⁻³, the mean fractional bias (MFB) for ARCTAS (WE-CAN) is −1.3 (−0.6), as compared to −0.03 (+0.2) otherwise (PM1 OA ≤ 40 μg·m⁻³). This suggests that the RFRM underestimation is due mostly to the high organic mass (likely in biomass burning plumes) not experienced by the RFRM during its training, to the underestimation of the potential contribution of organic aerosol to CCN numbers in current models, or to a combination of these factors.
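The stratified comparison above (mean fractional bias for samples above and below an organic aerosol mass threshold) can be sketched as follows. The record layout and function names are assumptions, and the fractional bias form FB = 2(m − o)/(m + o) is assumed rather than quoted from the study.

```python
from typing import Iterable, List, Optional, Tuple

def fractional_bias(observed: float, modeled: float) -> float:
    """Assumed form: FB = 2 (m - o) / (m + o)."""
    return 2.0 * (modeled - observed) / (modeled + observed)

def mean_fb_by_oa(records: Iterable[Tuple[float, float, float]],
                  oa_threshold: float = 40.0
                  ) -> Tuple[Optional[float], Optional[float]]:
    """Return (MFB for OA > threshold, MFB for OA <= threshold).

    Each record is (observed CCN, modeled CCN, PM1 OA mass in ug m^-3).
    A group with no samples yields None.
    """
    high: List[float] = []
    low: List[float] = []
    for observed, modeled, oa_mass in records:
        group = high if oa_mass > oa_threshold else low
        group.append(fractional_bias(observed, modeled))

    def mean_or_none(values: List[float]) -> Optional[float]:
        return sum(values) / len(values) if values else None

    return mean_or_none(high), mean_or_none(low)
```

With this definition, an MFB of −1.3 corresponds to modeled values that are on average far below the observations, whereas values near zero indicate little systematic bias.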

Aerosol Mass Speciation Contains Size Distribution Information as Revealed by Machine Learning
In GCMs that do not resolve particle size distributions, proxies for aerosol numbers or cloud droplet numbers are obtained from aerosol mass speciation alone, assuming a fixed aerosol number size distribution. In this study, LinReg is an effective representation of the aerosol mass-to-number prescription in GCMs, because it is obtained by linearly regressing measured [CCN0.4] on all the measured aerosol speciation variables. Therefore, by virtue of overfitting, there can be no better aerosol mass-to-number prescription for the airborne measurements used in this study. Despite this, LinReg is demonstrated to be inadequate (Figure 1a). A potential improvement, RFRM-PM, employs one of the most accurate ML approaches for regression and appreciably (%-Good: 38 → 68%) improves the degree of agreement with CCN measurements. The importance of considering 19 predictor variables of atmospheric state and composition (not limited to aerosol mass speciation) for accurate RFRM derivation of [CCN0.4] has been demonstrated (Nair & Yu, 2020). Considering observational limitations, reduction to nine important predictors including T, RH, [SO2], [NO], and [O3] is possible without significant deterioration of model performance. RFRM, which considers these variables in addition to aerosol speciated mass, is in agreement with measured [CCN0.4] to a much greater degree (%-Good: 38 → 68 → 80%; Figures 1 and 2, and Figures S9 and S10). With the significant amount of measurement data that these airborne campaigns provide, we search for the reasons why consideration of predictors beyond PM1 speciation helps improve the machine-learning model derivation of [CCN0.4].
The RFRM-PM performs better than LinReg for deriving [CCN0.4] when only the PM1 speciated masses are used as input (Figure 1). To examine the reason for this, Figure 3 shows how the PM1 mass contains information about the aerosol number size distribution (PNSD; P: particle/aerosol) that the random forest approach can leverage. The average airborne measured PNSD, normalized to ≈60 nm (the rough cut-off size for CCN0.4), is shown in Figure 3. Figure 3a shows that the PNSD profile varies between two different total PM1 mass ranges. While the linear regression implicitly assumes a fixed average PNSD (black curve), the RFRM derives [CCN0.4] using decisions in the subspace corresponding to the PM1 total mass, which defines more representative variations of PNSD. In addition, Figure 3b demonstrates that the aerosol composition (speciated mass fractions of aerosol mass) also carries PNSD information. The four panels correspond to distinct clusters of aerosol composition; within each cluster, the speciated composition of the total PM1 mass lies within a range of ±2.5% to ensure in-cluster homogeneity, and each cluster spans the entire range of PM1 total mass. The clusters are determined with the aid of an unsupervised ML technique (k-means clustering), described in Text S2 and illustrated in Figures S12 and S13. Thus aerosol mass and composition confer on the RFRM-PM the ability to implicitly consider the PNSDs pertinent to PM1 mass and speciation in its derivation of [CCN0.4], enhancing its skill compared to linear regression with an assumed mean PNSD.
Figure 3 (caption, partially recovered). Cluster 1 (…: 0%-5%, and NH4: 4.5%-9.5%), Cluster 2 (SO4: 19%-24%, OA: 37%-42%, NO3: 22%-27%, and NH4: 12%-17%), Cluster 3 (SO4: 47.5%-52.5%, OA: 37%-42%, NO3: 0%-5%, and NH4: 6%-11%), Cluster 4 (SO4: 0.5%-5.5%, OA: 91%-96%, NO3: 0%-5%, and NH4: 0%-5%), and (black) respective cluster-wise average. Typical aerosol composition for each cluster is illustrated by the inset pie charts.
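The comparison of PNSD profiles across PM1 mass ranges can be sketched as a normalize-then-average step. This is a minimal illustration under stated assumptions: each spectrum is a list of number concentrations on a shared diameter grid, normalization uses the bin nearest 60 nm, and the mass range edges are placeholders rather than the ranges used in Figure 3a. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

def normalize_to_reference(diameters_nm: Sequence[float],
                           spectrum: Sequence[float],
                           reference_nm: float = 60.0) -> Optional[List[float]]:
    """Divide a size distribution by its value in the bin nearest reference_nm."""
    ref_index = min(range(len(diameters_nm)),
                    key=lambda i: abs(diameters_nm[i] - reference_nm))
    reference_value = spectrum[ref_index]
    if reference_value <= 0:
        return None  # cannot normalize an empty reference bin
    return [value / reference_value for value in spectrum]

def mean_profile_by_mass(diameters_nm: Sequence[float],
                         spectra: Sequence[Sequence[float]],
                         pm1_mass: Sequence[float],
                         mass_edges: Sequence[float]) -> Dict[int, List[float]]:
    """Average the normalized spectra within each [edge_k, edge_k+1) mass range."""
    groups: Dict[int, List[List[float]]] = {k: [] for k in range(len(mass_edges) - 1)}
    for spectrum, mass in zip(spectra, pm1_mass):
        normalized = normalize_to_reference(diameters_nm, spectrum)
        if normalized is None:
            continue
        for k in range(len(mass_edges) - 1):
            if mass_edges[k] <= mass < mass_edges[k + 1]:
                groups[k].append(normalized)
                break
    # Column-wise mean over the spectra that fell into each mass range.
    return {k: [sum(column) / len(column) for column in zip(*rows)]
            for k, rows in groups.items() if rows}
```

If the averaged profiles differ between mass ranges, total PM1 mass carries information about the shape of the size distribution, which is the property a tree-based model can exploit and a single linear fit cannot.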

Further Size Information Can Be Machine-Learned From Additional Chemistry and Meteorology
To examine why RFRM is more robust than RFRM-PM in its derivation of [CCN0.4], we consider the subset of the data where RFRM-derived [CCN0.4] is in good agreement with airborne measurements. Counterintuitively, RFRM-PM overestimates (FB > 0.6) mostly (83.6%) when higher [CCN0.4] is measured and underestimates (FB < −0.6) mostly (82.4%) when lower [CCN0.4] is measured. This indicates that the RFRM-PM bias arises from the non-consideration of the predictors other than PM speciation rather than from a general bias in the RFRM. In Figure 4, RFRM-PM-derived [CCN0.4] is classified into excellent agreement (|FB| < 0.2; roughly 22% deviation from airborne measurement of [CCN0.4]; black), overestimation (orange), and underestimation (purple). The percentages corresponding to these classes are noted in each campaign's panel. Illustrated are the typical PNSD normalized to the ∼60 nm diameter, corresponding roughly to the cut-off size of CCN0.4. Across all campaigns, differences in these size distributions with respect to the degree of estimation remain consistent. More detailed differences in PNSD across the vertical extent of the troposphere are also illustrated in Figure S11. In the scenario of a more typical PNSD, with high Aitken and low accumulation mode, both RFRM and RFRM-PM are in agreement with measurements. When the accumulation mode is much higher and the Aitken mode is much lower than average, RFRM is in agreement but RFRM-PM overestimates. This is because aerosol mass distributed toward the larger diameters results in less numerous particles than a mean size distribution would suggest. When the Aitken mode is much higher and the accumulation mode much lower than average, the corollary follows.
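The mass-to-number argument above can be made concrete with a short calculation. The sketch assumes monodisperse spherical particles and a particle density of 1.5 g cm⁻³; both are illustrative assumptions and not values taken from the study.

```python
import math

def number_concentration_cm3(mass_ug_m3: float, diameter_nm: float,
                             density_g_cm3: float = 1.5) -> float:
    """Particles per cm^3 needed to carry a given mass concentration,
    assuming identical spheres of the given diameter (assumed density)."""
    diameter_cm = diameter_nm * 1e-7                      # 1 nm = 1e-7 cm
    volume_cm3 = math.pi / 6.0 * diameter_cm ** 3         # sphere volume
    particle_mass_ug = density_g_cm3 * volume_cm3 * 1e6   # g to micrograms
    particles_per_m3 = mass_ug_m3 / particle_mass_ug
    return particles_per_m3 / 1e6                         # m^-3 to cm^-3

# Same mass, two different sizes: doubling the diameter cuts the number eightfold.
small = number_concentration_cm3(1.0, 100.0)
large = number_concentration_cm3(1.0, 200.0)
```

Because particle mass scales with the cube of diameter, a fixed mass-to-number conversion overestimates particle number when the mass sits in larger particles and underestimates it when the mass sits in smaller ones, which matches the direction of the RFRM-PM biases described above.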
The additional consideration of the chemical species SO2, NO, and O3 and of meteorology (T and RH), which are important for chemistry and gas-to-particle conversion (including new particle formation and growth) and hence PNSD, enables RFRM to contain more discerning subspaces for its decision making than RFRM-PM. With regard to the PNSD, these additional predictors carry rich information about the air mass history, sources of primary aerosols, the occurrence of atmospheric new particle formation and growth, and photochemical processing toward secondary aerosol formation. Future investigations will focus on comprehensive assessment of the individual contributions of each predictor variable, consideration of all variables in the full RFRM pertinent to the improved reflection of the ambient PNSD, and delineation of the physicochemical processes that determine CCN (spectrum) number concentrations.

Conclusions
This work demonstrates, using comprehensive airborne multi-campaign measurements encompassing the varied physicochemical conditions across the troposphere, the overall success of ML in deriving CCN number concentrations. Importantly, ML can extract aerosol size information from aerosol composition and additionally from atmospheric chemical and meteorological variables; this demonstrates that the statistical learning of ML/AI algorithms is emergent from the underlying physical (and chemical) laws. This physicochemically explainable and robust ML model can provide a computationally efficient pathway for a more accurate representation of CCN in GCMs. This may potentially reduce the uncertainties associated with aerosol-cloud interactions in the assessment of anthropogenic forcing and climate change projection.

Data Availability Statement
Data from the following aircraft campaigns were used in this study: ARCTAS (Jacob et al., 2010): ARCTAS Team (2020) [...] (Uin et al., 2017a, 2017b) data were obtained from the Atmospheric Radiation Measurement (ARM) user facility, a U.S. Department of Energy (DOE) Office of Science User Facility managed by the Biological and Environmental Research program, and are publicly available at the ARM Discovery Data Portal (https://www.archive.arm.gov/discovery/).