Regression with race-modifiers: towards equity and interpretability

The pervasive effects of structural racism and racial discrimination are well-established and offer strong evidence that the effects of many important variables on health and life outcomes vary by race. Alarmingly, standard practices for statistical regression analysis introduce racial biases into the estimation and presentation of these race-modified effects. We introduce abundance-based constraints (ABCs) to eliminate these racial biases. ABCs offer a remarkable invariance property: estimates and inference for main effects are nearly unchanged by the inclusion of race-modifiers. Thus, quantitative researchers can estimate race-specific effects “for free”, without sacrificing parameter interpretability, equitability, or statistical efficiency. The benefits extend to prominent statistical learning techniques, especially regularization and selection. We leverage these tools to estimate the joint effects of environmental, social, and other factors on 4th end-of-grade reading scores for students in North Carolina (n = 27,638) and identify race-modified effects for racial (residential) isolation, PM2.5 exposure, and mother’s age at birth.


D R A F T
The race-modified model allows for race-specific x-effects. Fit to 4th end-of-grade reading scores in North Carolina, the main-only model obscures important race-specific differences in the effects of racial isolation (RI) that are uncovered by the race-modified model. The negative RI effect observed globally in (a) is driven by the negative RI effect for non-Hispanic Black (NHB) students, which does not persist for Hispanic (Hisp) or non-Hispanic White (NHW) students in (b).
To provide context for race-modified effects, we present a regression of 4th end-of-grade reading scores on racial (residential) isolation (RI) and race (Figure 1). The dataset, detailed and reanalyzed subsequently, includes n = 27,638 students in North Carolina (58% non-Hispanic (NH) White, 36% NH Black, 6% Hispanic). RI measures the geographic separation of NH Black individuals and communities from other race groups, and thus is an important measure of structural racism (5, 6, 28-30). The main-only (or ANCOVA) model (Figure 1a) includes race only as an additive effect, which restricts the RI effect to be common across race groups. This widely-used model reports the same adverse effect of RI on reading scores for all students. However, the race-modified model (Figure 1b) provides the essential context: the RI effect is significantly negative for NH Black students, but not for other race groups. Thus, a race-modified model is necessary to uncover and quantify these racial discrepancies in the effects of structural racism on educational outcomes.

Despite these benefits, there are significant racial biases that occur in commonplace estimation, inference, and presentation of results for regression analysis with race as a covariate. Both the main-only and race-modified models (Figure 1) are overparametrized: neither $\{\alpha_0, \beta_r\}_r$ nor $\{\alpha_1, \gamma_r\}_r$ are identifiable without further constraints. Any constant could be added to $\alpha_0$ and subtracted from each $\{\beta_r\}$, and similarly for $\alpha_1$ and $\{\gamma_r\}$, which alters each parameter but leaves the model unchanged.
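The non-identifiability can be checked numerically. The sketch below assumes the single-covariate race-modified form $\mu(x, r) = \alpha_0 + \beta_r + (\alpha_1 + \gamma_r)x$, which is inferred from the parameter definitions in the text; all parameter values, the shift constants `c0` and `c1`, and the function name `mu` are hypothetical and chosen only for illustration.

```python
import numpy as np

def mu(x, r, alpha0, alpha1, beta, gamma):
    """Assumed race-modified mean: alpha0 + beta[r] + (alpha1 + gamma[r]) * x."""
    return alpha0 + beta[r] + (alpha1 + gamma[r]) * x

# Hypothetical parameter values for three groups (not estimates from the paper).
alpha0, alpha1 = 1.0, -0.5
beta = np.array([0.2, -0.1, 0.4])
gamma = np.array([0.0, -0.3, 0.1])

# Arbitrary constants: add to the main parameters, subtract from the group terms.
c0, c1 = 3.7, -1.2

x = np.linspace(-2.0, 2.0, 5)
r = np.array([0, 1, 2, 1, 0])  # group index for each observation

original = mu(x, r, alpha0, alpha1, beta, gamma)
shifted = mu(x, r, alpha0 + c0, alpha1 + c1, beta - c0, gamma - c1)

# Every parameter changed, yet the model expectations are identical.
same_fit = np.allclose(original, shifted)
print(same_fit)
```

Because the two parameter sets give the same fitted values for every observation, the data alone cannot distinguish them, which is why an identifying constraint is required.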
Critically, neither the main nor the race-specific parameters can be estimated or interpreted without additional constraints. Undoubtedly, the most common approach is reference group encoding (RGE): a reference group is selected, typically NH White, and removed ($\beta_{\mathrm{NHW}} = \gamma_{\mathrm{NHW}} = 0$). This is the default for all major statistical software implementations of (generalized) linear regression, including R, SAS, Python, MATLAB, and Stata. However, RGE output is racially biased (31), difficult to interpret, and obscures important main and race-modified effects. We categorize these significant limitations into presentation bias and statistical bias.
Presentation bias. Table 1 (left) displays standard output for a race-modified model. Under RGE, the RI effect (red) actually refers to the RI effect only for NH White individuals, $\alpha_1 = \alpha_1 + \gamma_{\mathrm{NHW}} = \mu'_x(\mathrm{NHW})$. Similarly, Intercept refers to the NH White intercept, $\alpha_0 = \alpha_0 + \beta_{\mathrm{NHW}} = \mu(0, \mathrm{NHW})$. We emphasize that the presentation format in Table 1 (left) is predominant in scientific journals. Among recent publications in social science journals, it was found that 92% of such tables used NH White as the reference group, while less than half explicitly stated the reference group (31).
First, this output is inequitable: it elevates a single race group above others. Further, all other race-specific effects are presented relative to the NH White group. For instance, RI:NH Black refers to the difference between the RI effects for NH Black students and NH White students: $\gamma_{\mathrm{NHB}} = \mu'_x(\mathrm{NHB}) - \mu'_x(\mathrm{NHW})$. This framing presents NH White as "normal" and other race groups as "deviations from normal", which is known to bias interpretations of results (32).
Second, this output is unclear: it is nowhere indicated that the intercept and RI effects are specific to NH White students. A cursory inspection of this output might result in a mistaken interpretation of the RI effect as a global effect, rather than a NH White effect. Finally, this output is misleading: the RI effect is reported to be small and insignificant, despite clear evidence to the contrary (Figure 1). Under RGE, the addition of the race-modifier substantially alters the estimates and reduces the statistical power for the RI main effect ($\alpha_1$).

(33). Broadly, these regularization strategies seek to stabilize (i.e., reduce the variance of) estimators, typically by "shrinking" coefficients toward zero. This approach is particularly useful in the presence of a moderate to large number of covariates that may be correlated. However, under RGE, shrinking or setting coefficients to zero introduces racial bias to the estimation. Critically, shrinkage or selection of the race-specific terms, $\gamma_r \to 0$, does not innocuously shrink toward a global slope; rather, it implies that the coefficient on x for race r is pulled toward that of the NH White group, $\mu'_x(r) = \alpha_1 + \gamma_r \to \alpha_1$, and $\alpha_1 = \mu'_x(\mathrm{NHW})$. Not only is this estimator racially biased, but it also attenuates the estimated differences between the x-effects for each race and NH White individuals.
Identification and quantification of such race-modified effects are precisely the goals of race-modified models. Furthermore, RGE cannot distinguish between shrinkage toward a global, race-invariant x-effect and shrinkage toward the NH White x-effect: both require $\gamma_r \to 0$ for all r. A fundamental goal of penalized estimation and selection in this context is to remove unnecessary race-modifiers. However, with RGE, the cost is racial bias in the shrinkage and selection. Thus, default RGE cannot fully and equitably leverage the state-of-the-art in statistical learning.
Although RGE is used in the overwhelming majority of regression analyses, there are several alternatives. Data disaggregation subsets the data by (race) groups and fits separate regression models (12, 25, 30, 34). This approach produces race-specific intercepts and slopes, and thus implicitly acknowledges the importance of race-modifiers. However, these separate models do not produce global (race-invariant) x-effect estimates or inference, cannot incorporate information-sharing or regularization across race groups (often leading to variance inflation and reduced power), and require separate model diagnostics. Sum-to-zero (STZ) constraints address the inequities in RGE, but the resulting model parameters are difficult to interpret and the estimators do not offer any of the appealing statistical properties provided by the proposed approach. Finally, overparametrized estimation omits any identifying constraints and relies on regularized regression to produce unique estimators. But the model parameters remain nonidentified, thus making the estimates extremely difficult to interpret. Again, the estimates fail to offer the useful statistical properties discussed subsequently.
The primary goal of this paper is to introduce, study, and validate alternative statistical methods that eliminate these racial biases. By carefully reframing the regression model, we ensure equitable and interpretable parameters with accompanying estimators that offer unique and appealing statistical properties. We apply these tools to identify and quantify the race-modified effects of multiple environmental, social, and other factors on 4th end-of-grade reading scores for students in North Carolina. Although we focus on race, the proposed methods remain applicable for other categorical covariates including sex, national origin, religion, and other protected groups.

Results
Abundance-Based Constraints (ABCs) for Linear Regression. We update the race-modified model (Figure 1b) for multivariable regression with $p$ covariates $\boldsymbol{X} = (X_1, \ldots, X_p)^\top$, where the effect of each variable may be modified by race:

$$\mu(\boldsymbol{x}, r) = \alpha_0 + \boldsymbol{x}^\top \boldsymbol{\alpha} + \beta_r + \boldsymbol{x}^\top \boldsymbol{\gamma}_r \qquad [1]$$

where $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_p)^\top$ are the main x-effects and $\boldsymbol{\gamma}_r = (\gamma_{r,1}, \ldots, \gamma_{r,p})^\top$ are the race-modifier effects. The main-only version omits all interactions ($\gamma_{r,j} = 0$). The intercepts are race-specific, $\mu(\boldsymbol{0}, r) = \alpha_0 + \beta_r$, while the race-modified model yields race-specific slopes for each variable $j = 1, \ldots, p$: $\mu'_{x_j}(r) = \alpha_j + \gamma_{r,j}$.

The parameters $\{\alpha_0, \beta_r\}_r$ and $\{\alpha_j, \gamma_{r,j}\}_{r,j}$ must be further constrained to enable unique estimation and meaningful inference. Linear constraints of the form $\sum_r c_r \beta_r = 0$ and $\sum_r c_r \gamma_{r,j} = 0$ are most common: RGE sets $c_1 = 1$ and $c_r = 0$ for $r > 1$, while STZ uses $c_r = 1$ for all $r$. However, the equitability, interpretability, and statistical properties of the parameters and estimators depend critically on the choice of $\{c_r\}$.
We propose abundance-based constraints (ABCs) that use the race group abundances:

$$\sum_r \pi_r \beta_r = 0, \qquad \sum_r \pi_r \gamma_{r,j} = 0 \ \text{ for } j = 1, \ldots, p, \qquad \pi_r = \text{proportion in (race) group } r$$

or equivalently, $E_\pi(\beta_R) = 0$ and $E_\pi(\gamma_{R,j}) = 0$ for all $j$, where the expectation is taken over a categorical random variable $R$ with $P(R = r) = \pi_r$. If known, the population proportions may be used for $\{\pi_r\}$; otherwise, we use the sample proportions. Historically, ABCs were considered for main-only models (35), but lacked sufficiently compelling advantages over alternative approaches (RGE, STZ, etc.) in that class of models to gain widespread adoption. Here, we advocate for ABCs in race-modified models based on equitability, interpretability, and special statistical properties.
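To illustrate how the constraint changes the meaning of the coefficients without changing the fitted race-specific slopes, the following sketch re-expresses one set of slopes under RGE and under ABCs. The abundances are the sample proportions reported for the NC cohort; the three slope values are hypothetical placeholders, not estimates from the paper.

```python
import numpy as np

groups = ["NH White", "NH Black", "Hispanic"]
pi = np.array([0.58, 0.36, 0.06])         # group abundances reported in the text
slopes = np.array([-0.01, -0.08, 0.02])   # hypothetical race-specific slopes

# RGE with NH White as reference: the "main effect" is the NH White slope,
# and each modifier is a deviation from the NH White slope.
alpha1_rge = slopes[0]
gamma_rge = slopes - alpha1_rge

# ABCs: the main effect is the abundance-weighted (race-averaged) slope,
# and each modifier is a deviation from that average.
alpha1_abc = pi @ slopes
gamma_abc = slopes - alpha1_abc

for g, s, g_rge, g_abc in zip(groups, slopes, gamma_rge, gamma_abc):
    print(f"{g:9s} slope={s:+.3f}  gamma_RGE={g_rge:+.3f}  gamma_ABC={g_abc:+.3f}")
print(f"main effect: RGE={alpha1_rge:+.4f}, ABC={alpha1_abc:+.4f}")
print("ABC constraint sum_r pi_r * gamma_r =", pi @ gamma_abc)
```

Both parametrizations reproduce the same race-specific slopes; only the ABC modifiers average to zero under the group abundances.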
To evaluate equitability and interpretability, we consider the meaning of each parameter in the race-modified model. Under ABCs, the race-modified model satisfies $E_\pi\{\mu(\boldsymbol{x}, R)\} = \alpha_0 + \boldsymbol{x}^\top \boldsymbol{\alpha}$, which produces a global (race-invariant) linear regression. As a consequence, each main x-effect may be expressed as the race-averaged slope for the $j$th variable:

$$\alpha_j = E_\pi\{\mu'_{x_j}(R)\}$$

Unlike with RGE, where $\alpha_j = \mu'_{x_j}(\mathrm{NHW})$, ABCs do not anchor each main x-effect to the NH White group and instead provide a global interpretation for these key parameters. The benefits cascade down to the other parameters:

$$\gamma_{r,j} = \mu'_{x_j}(r) - E_\pi\{\mu'_{x_j}(R)\}$$

which is the difference between the race-specific slope and the race-averaged slope for variable $j$. The intercept also retains a convenient, more equitable interpretation. Suppose that each continuous covariate is centered, $\bar{x}_j = 0$. Then the intercept parameter is a marginal expectation:

$$\alpha_0 = E\{\mu(\boldsymbol{X}, R)\}$$

where the expectation is taken (separately) over $\boldsymbol{X} \sim p_x$ for $p_x$ the empirical distribution of $\{\boldsymbol{x}_i\}_{i=1}^n$ and $R \sim \pi$. The race-specific intercept coefficients proceed similarly:

$$\beta_r = E\{\mu(\boldsymbol{X}, r)\} - E\{\mu(\boldsymbol{X}, R)\}$$

Again, unlike for RGE, the parameters $\alpha_0$ and $\beta_r$ no longer elevate the NH White group. Instead, ABCs define all parameters as 1) global, race-averaged main effects or 2) race-specific deviations.
Estimation and Inference. Given data $\{y_i, \boldsymbol{x}_i, r_i\}_{i=1}^n$, the race-modified model with ABCs is estimated by applying linearly-constrained ordinary least squares (OLS) estimation. Standard errors, confidence intervals, and hypothesis testing are available as in traditional OLS estimation. Options for regularized (ridge, lasso, etc.) regression are provided (see Methods). Because the estimators satisfy ABCs, they retain the same properties and interpretations as the parameters above.
Statistical Properties. A central obstacle with race-modified models is that, for default approaches (RGE, STZ, etc.), the inclusion of these interaction terms fundamentally alters the interpretations, estimates, and standard errors for the main x-effects. We observe this empirically (Table 1, left): compared to the main-only model, the race-modified model under RGE attenuates the RI main effect ($\hat\alpha_1^M = -0.042$ vs. $\hat\alpha_1 = -0.013$) and inflates the standard error ($\mathrm{SE}(\hat\alpha_1^M) = 0.007$ vs. $\mathrm{SE}(\hat\alpha_1) = 0.011$). These results are not contradictory: the RI effect is weaker for the NH White group (Figure 1b) than for the aggregate (Figure 1a), while NH White students represent a subset (58%) of the full sample. The broader implication is that analysts may be reluctant to include race-modifiers. However, omitting race-modifiers can produce misleading results (Figure 1).

ABCs resolve these problems. The key result is estimation invariance: under ABCs, the OLS estimates of the main x-effects are nearly identical between the main-only model and the race-modified model, under appropriate conditions. For $p = 1$ (Figure 1), we establish the following:

$$\hat\alpha_1^M = \hat\alpha_1 \quad \text{whenever} \quad \hat\sigma^2_{x[r]} = \hat\sigma^2_{x[1]} \ \text{ for all race groups } r \qquad [2]$$

where $\hat\sigma^2_{x[r]}$ is the sample variance of $x$ within race group $r$ (Theorem 1). Similar results are available for general $p > 1$ under suitable modifications of the equal-variance condition (Theorem 2).
The equal-variance condition in Eq. (2) requires that the scale of x is approximately the same for each race group. Otherwise, a one-unit change in x is not comparable across race groups. In that case, race-specific slopes are necessary, and the global slope from the main-only model ($\hat\alpha_1^M$) is not a suitable summary. However, the estimation invariance of ABCs is empirically robust to violations of the equal-variance condition. This condition is strongly violated for RI in Table

With ABCs, the analyst may include race-modifiers "for free": the estimated main x-effects are nearly unchanged by the addition of race-modifiers (x:race). This result is unique to ABCs and makes no assumptions about the true relationship between Y, X, and race. Notably, arbitrary dependencies are permitted between X and race, including varying means and distributions of X by race group, as long as the equal-variance condition holds. Thus, this result is distinct from classical estimation invariance results with OLS that require uncorrelatedness (36).
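The invariance can be checked on simulated data. The sketch below is illustrative only: the group probabilities mirror the cohort abundances, while the slopes, mean shifts, sample size, and seed are hypothetical. To make the equal-variance condition hold exactly, x is standardized within each group (an assumption of this example, not a step described in the paper) and then shifted so that its mean still depends on group.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 600, 3
r = rng.choice(K, size=n, p=[0.58, 0.36, 0.06])

# Covariate whose mean depends on group but whose within-group sample
# variance is exactly one in every group (equal-variance condition).
x = rng.normal(size=n)
for g in range(K):
    m = r == g
    x[m] = (x[m] - x[m].mean()) / x[m].std() + 0.8 * g

true_slopes = np.array([0.0, -0.6, 0.1])   # hypothetical race-specific slopes
y = 0.5 * r + true_slopes[r] * x + rng.normal(size=n)

D = np.eye(K)[r]            # one-hot group indicators
pi_hat = D.mean(axis=0)     # sample abundances

# Main-only model: group-specific intercepts plus one common slope.
coef_main, *_ = np.linalg.lstsq(np.column_stack([D, x]), y, rcond=None)
alpha1_main = coef_main[-1]

# Race-modified model: group-specific intercepts and slopes.
coef_mod, *_ = np.linalg.lstsq(np.column_stack([D, D * x[:, None]]), y, rcond=None)
slopes = coef_mod[K:]

alpha1_abc = pi_hat @ slopes   # main effect under ABCs (race-averaged slope)
alpha1_rge = slopes[0]         # "main effect" under RGE (reference-group slope)

print(f"main-only: {alpha1_main:.4f}  ABC: {alpha1_abc:.4f}  RGE: {alpha1_rge:.4f}")
```

In this construction the main-only slope and the ABC main effect agree up to floating-point error, whereas the RGE value reports only the reference-group slope.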
Sparsity. Sparsity is often prioritized to remove extraneous parameters, reduce estimation variability, and simplify interpretations. Regularized regression can produce sparse estimates, but depends critically on the parametrization. Importantly, sparsity of the race-modifiers, $\gamma_{r,j} = 0$, is meaningful under ABCs: it implies that the race-specific slope equals the race-averaged slope, $\mu'_{x_j}(r) = \alpha_j$. This eliminates the racial bias and inequity under RGE, where the same sparsity instead implies that the race-specific slope equals the NH White slope, $\mu'_{x_j}(r) = \mu'_{x_j}(\mathrm{NHW}) + \gamma_{r,j} = \mu'_{x_j}(\mathrm{NHW})$.

An especially concerning case arises when the race-modifier is nonzero ($\gamma_{r,j} \neq 0$), but the main x-effect is zero ($\alpha_j = 0$). Statistical approaches often eschew this scenario, and instead require that interactions are nonzero only if a main effect is nonzero (37, 38). Such restrictions are not necessary for ABCs: it is plausible that some race-specific x-effects are nonzero, $\mu'_{x_j}(r) = \gamma_{r,j} + \alpha_j = \gamma_{r,j} \neq 0$, while the race-averaged x-effect is zero, $\alpha_j = E_\pi\{\mu'_{x_j}(R)\} = 0$. Alarmingly, fitting a main-only model would produce misleading results. Applying Eq. (2) (Theorems 1-2), the estimated x-effect would be near zero, $\hat\alpha_j^M \approx 0$, when in fact the x-effect is both significant and race-specific. Thus, it is possible that existing quantitative analyses based on main-only models (i.e., without race-modifiers) obscure both important and race-specific effects of certain variables (Figures 1 and 2).
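As a hypothetical numerical illustration of this case (the values are invented for exposition and are not estimates from the paper), consider two groups with equal abundances $\pi = (0.5, 0.5)$ and race-specific slopes $\mu'_{x_j}(1) = 0.3$ and $\mu'_{x_j}(2) = -0.3$. Under ABCs,

$$\alpha_j = \sum_r \pi_r \, \mu'_{x_j}(r) = 0.5(0.3) + 0.5(-0.3) = 0, \qquad \gamma_{1,j} = 0.3 - 0 = 0.3, \qquad \gamma_{2,j} = -0.3 - 0 = -0.3,$$

so the constraint $\sum_r \pi_r \gamma_{r,j} = 0.5(0.3) + 0.5(-0.3) = 0$ holds while both race-specific effects are nonzero. A main-only fit would report a slope near zero here, even though the variable matters for each group.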
NC Education Data Analysis. We apply the proposed methods to study the effects of multiple environmental, social, and other factors on educational outcomes, and assess whether, and how, these effects vary by race. Using ABCs, we fit equitable and interpretable race-modified models, empirically evaluate estimation and inference invariance properties, and study regularized (lasso) regression solution paths under competing parametrizations.
Data overview. We construct a cohort of n = 27,638 students in North Carolina (NC) by linking three administrative datasets:

NC Detailed Birth Records include maternal and infant characteristics for all documented live births in NC. We compute maternal covariates (mother's race, age (mAge), education level, marital status, and smoking status) and child covariates (sex and birthweight percentile for gestational age (BWTpct)). RI is computed using residential addresses at birth.

NC Blood Lead Surveillance includes blood lead level (BLL) measurements for each child. Lead is an adverse environmental exposure with well-known effects on cognitive development and educational outcomes (39, 40).

exposure (PM2.5) over the year prior to the test, which is an adverse environmental exposure linked to educational outcomes (41).

Data characteristics are in Table A.1; additional details are provided elsewhere (30, 42, 43). Data management, access, and analysis are governed by data use agreements and an Institutional Review Board-approved research protocol at the University of Notre Dame.
Race-modified regression with ABCs. We estimate a multivariable linear regression for 4th end-of-grade reading scores that includes these environmental, social, and other factors, as well as race-modifiers (Table 2). Each continuous covariate (BLL, PM2.5, RI, mAge, and BWTpct) is centered and scaled and each categorical variable (mother's race, child's sex, mother's education level, mother's marital status, mother's smoking status, and economically disadvantaged) is identified using ABCs (see Methods). Race-modifiers are included for BLL, PM2.5, RI, mAge, and BWTpct. Standard model diagnostics confirm linearity, homoskedasticity, and Gaussian error assumptions.
ABCs generate output for all main effects, all race-modifier effects, and each group in every categorical variable, which eliminates the presentation bias that would otherwise accompany each categorical variable under RGE. There are highly significant (p < 0.01) negative effects for BLL and RI, where the adverse RI effect doubles for NH Black students ($\hat\mu'_{\mathrm{RI}}(\mathrm{NHB}) = \hat\alpha_{\mathrm{RI}} + \hat\gamma_{\mathrm{RI:NHB}} = -0.020 + (-0.020) = -0.040$). This critical result for RI expands upon the previous model fit (Table 1): here, the model adjusts for many additional factors, yet the effect persists. Significantly lower test scores also occur for students who are NH Black, Male, or economically disadvantaged, and whose mothers are less educated, unmarried, or smokers at time of birth. Significant positive effects are observed for the opposite categories, which is a byproduct of ABCs (e.g., the Male and Female proportions are identical, so the estimated effects must be equal and opposite), as well as for mAge and BWTpct. Finally, PM2.5 is not identified as a significant main effect (p = 0.403), yet the race-specific effects are significant. Alarmingly, a fitted main-only model (without race-modifiers) conveys an insignificant PM2.5 effect (Figure 2), which oversimplifies and misleads.

2). ABCs exhibit invariance: despite the additional race-modifier parameters, the point and interval estimates for the main effects (blue) are nearly indistinguishable from those in the main effects-only model (black), thus effectively allowing the inclusion of race-modifiers "for free". In contrast, the RGE terms (red) correspond to the x-effects for the NH White group and deviate substantially for PM2.5, RI, and mAge, including shifts in location and much wider intervals.

expanded, race-modified model. Evidently, ABCs allow estimation and inference for numerous race-specific effects (Table 2, right column) "for free": the inferential summaries for the main effects are unchanged by the expansion of the model to include race-modifiers. This result empirically confirms the multivariable extension of Eq. (2) (Theorem 2), despite moderate violations of the equal-variance condition (Table A.2). Unsurprisingly, no such invariance holds for RGE (red): the point and interval estimates are substantially different, with uniformly wider intervals and conflicting conclusions about nonzero coefficients (PM2.5, RI). These concerning discrepancies occur because the RGE "main effects" are exclusively for NH White students.
Regularized regression with ABCs. We assess regularized regression and variable selection with ABCs using lasso regression, including all variables from Table 2. We report estimates across tuning parameter values $\lambda$ for the model coefficients $\{\hat\alpha_j, \hat\gamma_{r,j}\}_{r,j}$ and the race-specific slopes $\{\hat\mu'_{x_j}(r) = \hat\alpha_j + \hat\gamma_{r,j}\}_{r,j}$; $\lambda \to 0$ yields OLS estimates, while $\lambda \to \infty$ yields sparse estimates. Since the penalized estimates depend critically on the parameterization, we compare ABCs and RGE. The estimated $\lambda$-paths for RI are in Figure 3; results for the remaining race-modified effects (BLL, PM2.5, mAge, and BWTpct) are in Figures A.2-A.5. RGE fixes the reference-group modifier $\hat\gamma_{\mathrm{NHW},j} = 0$ for all $\lambda$, which results in 1) racially-biased shrinkage of the race-specific effects toward the NH White-specific effect and 2) attenuation of the RI effect $\hat\alpha_j$ (Figure 3, top right). ABCs resolve these issues. First, the model parameters are separately and equitably pulled toward zero (Figure 3

2. Small $\lambda$ approximately corresponds to OLS, while increasing $\lambda$ yields sparsity. Under RGE, the estimates are pulled toward the reference (NH White) estimate, inducing statistical bias by race, and the RI effect is attenuated. By comparison, ABCs offer more equitable shrinkage toward a global RI effect, which is nonzero and detrimental for 4th end-of-grade reading scores.

attenuated, and preserves its magnitude until $\log \lambda \approx 5$ (Figure 3, top left). Finally, the race-specific RI effects merge at a global, and negative, RI effect estimate, and this variable is selected by the one-standard-error rule (33) for choosing $\lambda$ (Figure 3, bottom left).
These themes persist for the remaining race-modified effects (Figures A.2-A.5). We supplement the ABC and RGE lasso paths by including the lasso paths for overparametrized estimation (Over), which does not include any identifiability constraints. The parameters cannot be estimated uniquely by OLS, but can be estimated by lasso regression with $\lambda > 0$. In most cases, Over sets one of the coefficients $\{\alpha_j, \gamma_{r,j}\}_r$ to zero immediately (small $\lambda$) for each variable $j$. This effect reproduces RGE and thus Over inherits the same racial biases in estimation and selection. When this implicit selection sets $\hat\gamma_{j:\mathrm{NHW}} = 0$, then the Over paths resemble those for RGE (RI, not shown; BWTpct,

1) and thus contain extraneous race-modifiers. ABCs (gold) outperform both RGE (light gray) and Over (dark gray) within each estimation method (ridge, lasso, OLS). By definition, the OLS race-specific slopes and fitted values are invariant to the constraints (ABCs or RGE), and Over cannot be computed for OLS.
Estimation and predictive accuracy for simulated data. We evaluate estimation and prediction for ABCs, RGE, and Over across several estimation methods: OLS, ridge, and lasso regression (Figure 4). Data are simulated from a Gaussian main-only model with $p = 10$ covariates and a categorical variable with four levels. For fair comparisons, the data-generating process satisfies both RGE and ABCs. To mimic the challenges of real data analysis, the fitted models are misspecified as Eq. (1), and thus contain extraneous race-modifiers. Root mean squared errors are computed for the regression coefficients $\{\alpha_0, \alpha_j, \beta_r, \gamma_{r,j}\}_{r,j}$, the race-specific slopes $\{\alpha_j + \gamma_{r,j}\}_{r,j}$, and the model expectations $\mu(\boldsymbol{x}, r)$ across 500 simulated datasets. In each case, ABCs are substantially more accurate within each estimation method (OLS, ridge, lasso). The estimation invariance of ABCs offers a plausible explanation: whereas each fitted model includes extraneous variables (the race-modifiers), only ABCs reproduce the main effect estimates from the main-only model, which here is the ground truth. This unique statistical property of ABCs is not only convenient for interpreting race-modified models, but also provides more accurate estimates and predictions under both OLS and regularized regression.

Discussion
The path to more equitable decision-making and policy requires a precise and comprehensive understanding of the links between race and health and life outcomes. Alarmingly, the primary statistical tool for this task (regression analysis with race as a covariate and a modifier) in its current form propagates racial bias in both the presentation of results and the estimation of model parameters. We introduced an alternative approach, abundance-based constraints (ABCs), with several unique benefits. First, ABCs eliminate these racial biases in both presentation and statistical estimation of linear regression models. Second, ABCs produce more interpretable parameters for race-modified models. Third, estimation with ABCs features an appealing invariance property: the estimated main effects are approximately unchanged by the inclusion of race-modifiers. Thus, analysts can include and estimate race-specific effects "for free", without sacrificing parameter interpretability, equitability, or statistical efficiency. Finally, ABCs are especially convenient for regularized regression and variable selection, with meaningful and equitable notions of parameter sparsity and efficient computational algorithms.
Using this new approach, we estimated the effects of multiple environmental, social, and other factors on 4th end-of-grade reading scores for a large cohort of students (n = 27,638) in North Carolina. In aggregate, this analysis 1) identified significant race-specific effects for racial (residential) isolation, PM2.5 exposure, and mother's age at birth; 2) showcased the racial biases and potentially misleading results obtained under default approaches; and 3) provided more equitable and interpretable estimates, uncertainty quantification, and selection, both for main effects and race-modified effects. Simulation studies demonstrated substantially more accurate estimates and predictions with OLS, ridge, and lasso regression compared to alternative approaches.
We acknowledge that the interpretation of any "race" effect requires great care (44).Race encompasses a vast array of social and cultural factors and life experiences, with effects that vary across time and geography (27,45).In some settings, race data are unreliable or partially missing (46,47).These overarching challenges are not addressed in this paper.
We extend ABCs based on the joint distribution of the categorical variables $\boldsymbol{R}$. Specifically, let $\boldsymbol{\pi} = \{\pi_{r_1,\ldots,r_L}\}$, where $\pi_{r_1,\ldots,r_L} = P(R_1 = r_1, \ldots, R_L = r_L)$. If known, the population proportions may be used for $\boldsymbol{\pi}$; otherwise, we use the sample proportions based on the observed data $\{\boldsymbol{r}_i\}_{i=1}^n$, i.e., $\pi_{r_1,\ldots,r_L} = n^{-1} \sum_{i=1}^n I\{r_{i,1} = r_1, \ldots, r_{i,L} = r_L\}$. Concisely, the generalized ABCs are

$$E_{\boldsymbol{\pi}}(\boldsymbol{\beta}_{\boldsymbol{R}}) = \boldsymbol{0}_L, \qquad E_{\boldsymbol{\pi}}(\boldsymbol{\gamma}_{\boldsymbol{R},j}) = \boldsymbol{0}_L \ \text{ for } j = 1, \ldots, p \qquad [4]$$

where $\boldsymbol{\beta}_{\boldsymbol{R}} = (\beta_{1,R_1}, \ldots, \beta_{L,R_L})^\top$, $\boldsymbol{\gamma}_{\boldsymbol{R},j} = (\gamma_{1,R_1,j}, \ldots, \gamma_{L,R_L,j})^\top$, and $\boldsymbol{0}_L$ is an $L$-dimensional vector of zeros. Eq. (4) may be equivalently represented via separate marginal expectations for the $L$ sets of categorical covariate parameters: for instance,

ABCs in Eq. (4) provide interpretable parameter identifications with equitable presentation and estimation. These interpretations are unchanged if some or all interaction terms are omitted from Eq. (3), which may occur if multiple categorical variables (e.g., sex, education level) are included as covariates, but only race is included as a modifier. ABCs imply that $E_{\boldsymbol{\pi}}\{\mu(\boldsymbol{x}, \boldsymbol{R})\} = \alpha_0 + \boldsymbol{x}^\top \boldsymbol{\alpha}$, so that averaging the regression Eq. (3) over all categorical variables (jointly) yields a multivariate regression with only continuous variables. Individually, each $x_j$-effect satisfies

$$\alpha_j = E_{\boldsymbol{\pi}}\{\mu'_{x_j}(\boldsymbol{R})\} \qquad [5]$$

where $\mu'_{x_j}(\boldsymbol{r}) = \mu(x_j + 1, \boldsymbol{x}_{-j}, \boldsymbol{r}) - \mu(x_j, \boldsymbol{x}_{-j}, \boldsymbol{r})$ is the slope in the $j$th direction. To further simplify the interpretation, the expectation under $\boldsymbol{\pi}$ in Eq. (5) need only be taken with respect to the categorical variables that are interacted with $x_j$ (e.g., race). By comparison, the RGE parametrization yields $\alpha_j = \mu'_{x_j}(r_1 = 1, \ldots, r_L = 1)$, which is the group-specific slope for $x_j$ with each group set to its reference category (e.g., NH White, Male, etc.). Clearly, this representation compounds inequity across each categorical variable and fails to deliver a global interpretation of the $x_j$-effect.
Interpretation of group-specific slopes and the parameters $\gamma_{\ell,r_\ell,j}$ proceeds by considering partial expectations under $\boldsymbol{\pi}_{-\ell}$, which is analogous to the joint distribution $\boldsymbol{\pi}$ but omits the $\ell$th categorical variable. Here, as with Eq. (5), this expectation need only consider the categorical variables that are interacted with $x_j$; if the $\ell$th categorical variable is the only interaction term, then no expectation is needed at all. Then the $x_j$-effect when the $\ell$th categorical variable has level $r_\ell$, averaged over the remaining categorical variables, is

$$E_{\boldsymbol{\pi}_{-\ell}}\{\mu'_{x_j}(r_\ell, \boldsymbol{R}_{-\ell})\} = \alpha_j + \gamma_{\ell,r_\ell,j} \qquad [6]$$

or equivalently, $\gamma_{\ell,r_\ell,j} = E_{\boldsymbol{\pi}_{-\ell}}\{\mu'_{x_j}(r_\ell, \boldsymbol{R}_{-\ell})\} - E_{\boldsymbol{\pi}}\{\mu'_{x_j}(\boldsymbol{R})\}$. The interpretation is simpler than the notation: Eq. (6) directly extends the usual notion of race-specific slopes to average over any other categorical variables that modify $x_j$.
Estimation. Statistical estimation with ABCs requires solving a linearly-constrained least squares problem given data $\{\boldsymbol{x}_i, \boldsymbol{r}_i, y_i\}_{i=1}^n$. Define $\boldsymbol{\theta}$ to be the model parameters $\{\alpha_0, \boldsymbol{\alpha}, \beta_{\ell,r_\ell}, \boldsymbol{\gamma}_{\ell,r_\ell}\}_{r_\ell,\ell}$ and $\tilde{\boldsymbol{x}}_i$ to include the intercept, covariates, race variable indicators (i.e., "dummy variables"), and covariate-race interactions such that Eq. (3) may be written $\mu(\boldsymbol{x}_i, \boldsymbol{r}_i) = \tilde{\boldsymbol{x}}_i^\top \boldsymbol{\theta}$. Let $\boldsymbol{C}$ encode ABCs such that $\boldsymbol{C}\boldsymbol{\theta} = \boldsymbol{0}$ enforces Eq. (4).

Eq. (7) is equivalently solved using unconstrained OLS: The QR-decomposition has minimal cost due to the efficiency of Householder rotations and the low dimensionality of $\boldsymbol{C}$ (48).
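One standard way to carry out such a reparametrization is sketched below for a single covariate and a single categorical variable. This is a minimal illustration under stated assumptions rather than the paper's implementation: it assumes the constraint matrix `C` has the two ABC rows shown, takes a full QR-decomposition of its transpose, and uses the trailing columns (`Q2`) as a basis for the constrained parameter space. The simulated data, seed, column ordering, and variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 500, 3
r = rng.choice(K, size=n, p=[0.58, 0.36, 0.06])
x = rng.normal(loc=0.5 * r, scale=1.0)            # x depends on group
y = 1.0 + 0.3 * r - 0.5 * x + 0.2 * (r == 1) * x + rng.normal(size=n)

# Overparametrized design; columns: alpha0, alpha1, beta_1..K, gamma_1..K.
D = np.eye(K)[r]
X = np.column_stack([np.ones(n), x, D, D * x[:, None]])

# ABCs as linear constraints C @ theta = 0 using sample abundances.
pi_hat = D.mean(axis=0)
C = np.zeros((2, X.shape[1]))
C[0, 2:2 + K] = pi_hat        # sum_r pi_r * beta_r = 0
C[1, 2 + K:] = pi_hat         # sum_r pi_r * gamma_r = 0

# Full QR of C^T: the trailing columns of Q span {theta : C @ theta = 0}.
Q, _ = np.linalg.qr(C.T, mode="complete")
Q2 = Q[:, C.shape[0]:]

# Unconstrained OLS in the reduced coordinates, then map back to theta.
Z = X @ Q2
zeta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
theta_hat = Q2 @ zeta_hat

print("C @ theta_hat =", C @ theta_hat)
print("race-averaged slope (alpha1):", theta_hat[1])
print("race-specific slopes:", theta_hat[1] + theta_hat[2 + K:])
```

Because the constraints only identify the parameters, the implied race-specific intercepts and slopes match those from fitting each group separately; only their decomposition into main and group-specific terms depends on the constraint.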
Although alternative computing strategies are available, the reparametrization in Eq. (8) is especially convenient for generalizations to regularized (lasso, ridge, etc.) estimation. Let $P(\boldsymbol{\theta})$ denote a complexity penalty on the regression coefficients. The penalized least squares estimator under ABCs is

$$\hat{\boldsymbol{\theta}}(\lambda) = \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^n (y_i - \tilde{\boldsymbol{x}}_i^\top \boldsymbol{\theta})^2 + \lambda P(\boldsymbol{\theta}) \quad \text{subject to } \boldsymbol{C}\boldsymbol{\theta} = \boldsymbol{0} \qquad [9]$$

where $\lambda \geq 0$ controls the tradeoff between goodness-of-fit and complexity (measured via $P$). Following Eq. (8), we instead compute which requires the solution to an unconstrained penalized least squares problem.
We focus on complexity penalties of the form where $\omega_j > 0$ are known weights, $\gamma = 1$ produces sparse coefficients (adaptive lasso regression) and $\gamma = 2$ guards against collinearity (adaptive ridge regression). Under ridge regression ($\gamma = 2$), the solution is The lasso version ($\gamma = 1$) can be solved efficiently using the genlasso package in R (49).
For practical use, we set $\omega_j$ to be the sample standard deviation of the $j$th column of $\boldsymbol{X}$ (with $\omega_1 = 1$ for the intercept). This strategy applies a standardized penalty to each covariate, which is especially important for ABCs. In particular, the magnitudes of the race-specific coefficients vary according to the abundance of the group: by construction, low abundances in group $r$ will correspond to larger group $r$-specific coefficients. The standardized penalty adjusts for this effect to avoid overpenalization of group-specific coefficients for groups with low abundance.
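A weighted ridge fit under ABCs can be sketched with the same null-space idea. The example below is an assumption-laden illustration, not the paper's stated solution: it assumes a penalty of the form $\sum_j \omega_j \theta_j^2$, substitutes $\boldsymbol{\theta} = \boldsymbol{Q}_2 \boldsymbol{\zeta}$ with `Q2` a basis for the constrained space, and solves the resulting normal equations. The helper `abc_ridge`, the simulated data, and the value of `lam` are hypothetical; the weights follow the rule described above (column standard deviations, with weight one for the intercept).

```python
import numpy as np

def abc_ridge(X, y, C, lam, omega):
    """Weighted ridge estimate of theta subject to C @ theta = 0.

    Assumes the penalty lam * sum_j omega_j * theta_j**2 and solves it in the
    reduced coordinates zeta, where theta = Q2 @ zeta.
    """
    Q, _ = np.linalg.qr(C.T, mode="complete")
    Q2 = Q[:, C.shape[0]:]                      # basis for {theta : C theta = 0}
    Z = X @ Q2
    penalty = Q2.T @ (omega[:, None] * Q2)      # Q2' diag(omega) Q2
    zeta = np.linalg.solve(Z.T @ Z + lam * penalty, Z.T @ y)
    return Q2 @ zeta

rng = np.random.default_rng(2)
n, K = 400, 3
r = rng.choice(K, size=n, p=[0.58, 0.36, 0.06])
x = rng.normal(size=n)
y = 1.0 - 0.5 * x + 0.3 * (r == 1) * x + rng.normal(size=n)

D = np.eye(K)[r]
X = np.column_stack([np.ones(n), x, D, D * x[:, None]])
pi_hat = D.mean(axis=0)
C = np.zeros((2, X.shape[1]))
C[0, 2:2 + K] = pi_hat
C[1, 2 + K:] = pi_hat

# Weights: sample standard deviation of each column, with 1 for the intercept.
omega = X.std(axis=0, ddof=1)
omega[0] = 1.0

lam = 10.0
theta_ridge = abc_ridge(X, y, C, lam, omega)
print("C @ theta_ridge =", C @ theta_ridge)
print("shrunken race-specific slopes:", theta_ridge[1] + theta_ridge[2 + K:])
```

Because the penalty acts on ABC-identified coefficients, increasing `lam` pulls each race-specific slope toward the race-averaged slope rather than toward a reference group.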
Inference.The reparametrization strategy in Eq. ( 8 where I is the Fisher information and ζ, θ are the true parameter values.When the regression model is paired with independent and identically distributed Gaussian errors with variance ‡ 2 , the unconstrained OLS estimator satisfies ζ ≥ N {ζ, ‡ 2 ( Z€ Z) ≠1 } and thus even in finite samples.This sampling distribution for the OLS estimator under ABCs ensures unbiasedness and efficiency, and provides the means to compute standard errors, hypothesis tests, and confidence intervals. Theory.
Theorem 1. Under ABCs, the OLS estimates for the main-only model (Figure 1a) and the race-modified model (Figure 1b) satisfy

Theorem 2. Consider the multivariable race-modified model Eq. (1) and the multivariable main-only model [10]

Empirical verification of estimation invariance with ABCs. To empirically verify Theorem 1, we generate 500 synthetic datasets that mildly violate the equal-variance condition. Iteratively, we sample a categorical variable $R$ with groups {A, B, C, D} and respective probabilities $\pi = (0.55, 0.20, 0.10, 0.15)^\top$ and then sample a continuous variable $X$ whose distribution is determined by the group (Eq. (11)). By design, $X$ depends on $R$ in both mean and distribution. The $R$-specific population variances of $X$ are each one, but Theorem 1 requires that the $R$-specific sample variances are identical, which will not be satisfied for any simulated dataset. Thus, Eq. (11) includes a mild deviation from the equal-variance condition of Theorem 1.
We fit the main-only and race-modified models and record the estimated x-effects $\hat{\alpha}_1^M$ and $\hat{\alpha}_1$, respectively, for each simulated dataset. These estimated coefficients depend on the constraints: we compare ABCs, RGE (with reference group A), and STZ constraints for $\{\beta_r, \gamma_r\}$ (Figure A.1).
Although the conditions of Theorem 1 are not satisfied, the ABC estimates lie along the 45-degree line with $\hat{\alpha}_1 = \hat{\alpha}_1^M$; the estimated x-effect is nearly unchanged by the addition of the race-modifier. This invariance is not satisfied for RGE or STZ. The estimated x-effects under RGE or STZ vary considerably between the main-only and race-modified models, with greater discrepancies as the magnitude of the race-modifier effect increases. By comparison, the estimation invariance of ABCs is robust to the magnitude of the race-modifier effect. The mild deviations from the equal-variance condition of Theorem 1 are most impactful when $\gamma = 1.5$, which represents the unusual setting in which the interaction effect is much larger than the main effect. Even in this challenging case, the ABC estimates remain nearly invariant between the main-only and race-modified models, especially when compared to the RGE and STZ counterparts.
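The comparison can be sketched in pure Python using standard closed forms for simple regression: the main-only x-effect is the pooled within-group slope; under ABCs (assuming the constraint is an abundance-weighted sum to zero) the race-modified x-effect is the abundance-weighted average of the group-specific slopes; and under RGE it is the slope of the reference group. The group probabilities come from the text, while `X_MEANS`, `MODIFIER`, and the outcome model are hypothetical stand-ins for the design in Eq. (11).

```python
import random
from statistics import fmean

PROBS = {"A": 0.55, "B": 0.20, "C": 0.10, "D": 0.15}    # group probabilities from the text
X_MEANS = {"A": 0.0, "B": 1.0, "C": -1.0, "D": 2.0}      # hypothetical group means of X
MODIFIER = {"A": -0.5, "B": 0.5, "C": 1.0, "D": 1.0}     # hypothetical race-modifier pattern

def group_sums(x, y):
    """Within-group centered sums of squares (x) and cross-products (x, y)."""
    xb, yb = fmean(x), fmean(y)
    sxx = sum((a - xb) ** 2 for a in x)
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    return sxx, sxy

def x_effects(data):
    """data maps group -> (x values, y values); returns (main_only, abc, rge) x-effects."""
    n = sum(len(x) for x, _ in data.values())
    sums = {g: group_sums(x, y) for g, (x, y) in data.items()}
    # Main-only (ANCOVA) slope: pooled within-group slope.
    main_only = sum(s[1] for s in sums.values()) / sum(s[0] for s in sums.values())
    slopes = {g: sxy / sxx for g, (sxx, sxy) in sums.items()}
    # ABCs: abundance-weighted average of the group-specific slopes.
    abc = sum(len(data[g][0]) / n * slopes[g] for g in data)
    # RGE with reference group A: the slope for group A.
    rge = slopes["A"]
    return main_only, abc, rge

def simulate(seed, n=2000, gamma=1.0):
    """Simulate one dataset; gamma scales the size of the race-modifier effect."""
    rng = random.Random(seed)
    data = {g: ([], []) for g in PROBS}
    for _ in range(n):
        g = rng.choices(list(PROBS), weights=list(PROBS.values()))[0]
        x = rng.gauss(X_MEANS[g], 1.0)
        y = 1.0 + x + gamma * MODIFIER[g] * x + rng.gauss(0.0, 1.0)
        data[g][0].append(x)
        data[g][1].append(y)
    return x_effects(data)

main_only, abc, rge = simulate(seed=0, gamma=1.5)
```

When the within-group sample variances of x are exactly equal, the pooled slope and the abundance-weighted average coincide, which is the invariance the simulation probes under mild violations.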

Contrast coding. For OLS estimation, identifiability constraints may be imposed using contrasts.
In this approach, the linear model is fit under any minimally sufficient identification (RGE, STZ, ABCs, etc.) and the categorical variable coefficients are post-processed using linear contrast matrices.
Examples include dummy coding (akin to RGE), effects coding (akin to STZ), weighted effects coding (WEC; akin to ABCs), and Helmert coding (for ordered categories).However, contrasts are typically reserved for main-only models and are difficult to combine with regularized regression and variable selection.Further, these previous approaches do not consider or resolve the inequities of reporting or estimating race-specific effects.In particular, WEC has been advocated only in cases when "a categorical variable has categories of different sizes, and if these differences are considered relevant" (50) or "certain types of unbalanced data that are missing not at random" (51), with regression output that suffers from the same presentation bias that afflicts RGE (52).We do not agree with such restrictions for ABCs, and instead argue that this approach offers an equitable and interpretable parametrization with unique and appealing statistical properties, including both estimation invariance and regularized regression.These estimation invariance results and regularized regression analyses are notably absent from previous contrast coding approaches.
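For readers less familiar with these codings, the sketch below builds the design columns for dummy, effects, and weighted effects coding in plain Python. It uses the common definition of weighted effects coding, in which the reference group is coded as minus the ratio of group counts; the function name `contrast_columns` and the toy labels are hypothetical.

```python
from collections import Counter

def contrast_columns(labels, reference, scheme):
    """Return {group: column} for each non-reference group under a coding scheme."""
    counts = Counter(labels)
    columns = {}
    for group in sorted(counts):
        if group == reference:
            continue
        if scheme == "dummy":          # reference rows coded 0
            ref_value = 0.0
        elif scheme == "effects":      # reference rows coded -1
            ref_value = -1.0
        elif scheme == "weighted":     # reference rows coded -n_group / n_reference
            ref_value = -counts[group] / counts[reference]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        columns[group] = [
            1.0 if lab == group else ref_value if lab == reference else 0.0
            for lab in labels
        ]
    return columns

labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1     # hypothetical unbalanced sample
dummy = contrast_columns(labels, reference="A", scheme="dummy")
effects = contrast_columns(labels, reference="A", scheme="effects")
wec = contrast_columns(labels, reference="A", scheme="weighted")
```

Each weighted-effects column sums to zero over the sample, which is why the intercept under that coding refers to the sample-wide mean rather than to a reference group.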

Fig. 1. Linear regression models for an outcome variable $Y$ with a continuous covariate $X$ and a categorical (or nominal) covariate race. The models parameterize the expected outcome, $E(Y \mid X = x, \text{race} = r) = \mu(x, r)$, with corresponding x-effect (or slope) $\mu'_x(r) := \mu(x + 1, r) - \mu(x, r)$. (a) The main-only model assumes a global (race-invariant) x-effect. (b) The race-modified model allows for race-specific x-effects. Fit to 4th end-of-grade reading scores in North Carolina, the main-only model obscures important race-specific differences in the effects of racial isolation (RI) that are uncovered by the race-modified model.
Data contains 4th end-of-grade standardized reading scores, economic disadvantage status (determined by participation in the National Lunch Program), and residential address at time-of-test. The reading scores, standardized by the year of test (2010, 2011, or 2012), serve as the outcome variable $Y$. The residential information is used to estimate the average PM2.5

Fig. 3. Estimated lasso paths for RI across varying sparsity levels ($\log \lambda$) for the model coefficients $\alpha_{RI}$, $\gamma_{RI:r}$ (top) and the race-specific slopes $\mu'_{RI}(r) = \alpha_{RI} + \gamma_{RI:r}$ (bottom) under ABCs (left) or RGE (right); vertical lines identify $\lambda$ for the minimum cross-validated error (solid) and one-standard-error rule (dot-dashed). The outcome is 4th end-of-grade reading score and the covariates include all variables in Table 2. Small $\lambda$ approximately corresponds to OLS, while increasing $\lambda$ yields sparsity. Under RGE, the estimates are pulled toward

Figure A.5); when the selection corresponds to the smallest $|\gamma_{r,j}|$ among race groups $r$ from ABCs, then the Over and ABC paths are similar (BLL, Figure A.2; BWTpct, Figure A.5). However, when this selection sets the main effect to zero, $\alpha_j = 0$ (PM2.5, Figure A.3), or overshrinks multiple coefficients toward zero (mAge, Figure A.4), then the Over paths differ substantially from both the RGE and ABC paths and demonstrate erratic behavior (Figure A.4).

Fig. 4. Estimation and prediction accuracy for the regression coefficients (left), the race-specific slopes (center), and the fitted values (right) for $n = 250$ (top) and $n = 10{,}000$ (bottom) across 500 simulated datasets; nonoverlapping notches indicate significant differences between medians. Data are generated from a Gaussian main-only model with $p = 10$ covariates and a categorical variable with symmetric proportions $\pi = (0.15, 0.35, 0.15, 0.35)^\top$; both RGE and ABCs are satisfied in the true data-generating process. All fitted models use the race-modified model Eq. (1) and thus contain extraneous race-modifiers. ABCs (gold) outperform both RGE (light gray) and Over (dark gray) within each estimation method (ridge, lasso, OLS). By definition, the OLS race-specific slopes and fitted values are invariant to the constraints (ABCs or RGE), and Over cannot be computed for OLS.

..., so $C$ has $m = L(1 + p)$ rows corresponding to the number of constraints. The OLS estimator under ABCs is

$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} (y_i - x_i^\top \theta)^2 \quad \text{subject to } C\theta = 0. \qquad [7]$$

To compute $\hat{\theta}$, and subsequently provide inference and penalized estimation, we reparametrize the problem into an unconstrained space with $m$ fewer parameters. Let $C^\top = QR$ be the QR decomposition of the transposed constraint matrix with columnwise partitioning of the orthogonal matrix $Q = (Q_{1:m} : Q_{-(1:m)})$ with $R^\top = (R_{1:m,1:m} : 0)$, since $C^\top$ has rank $m$. It is straightforward to verify that $\theta = Q_{-(1:m)} \zeta$ satisfies $C\theta = 0$ for any $\zeta$. Then, using the adjusted covariate matrix $Z = X Q_{-(1:m)}$ with $X = (x_1, \ldots, x_n)^\top$, the solution to Eq. (7) is

$$\hat{\theta} = Q_{-(1:m)} \hat{\zeta}, \qquad \hat{\zeta} = (Z^\top Z)^{-1} Z^\top y. \qquad [8]$$
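The QR-based reparametrization can be sketched in NumPy as follows. The function name `constrained_ols`, the toy race-modified model with a single covariate, and the explicit form of the constraint rows (abundance-weighted sums of the group main effects and of the group interaction effects set to zero) are assumptions made for illustration.

```python
import numpy as np

def constrained_ols(X, y, C):
    """Least squares subject to C @ theta = 0, via the QR decomposition of C^T."""
    m = C.shape[0]
    Q, _ = np.linalg.qr(C.T, mode="complete")    # C^T = Q R with Q orthogonal (d x d)
    Qn = Q[:, m:]                                 # Q_{-(1:m)}: basis for the null space of C
    Z = X @ Qn                                    # adjusted covariate matrix
    zeta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return Qn @ zeta_hat                          # satisfies C @ theta_hat = 0 by construction

# Toy race-modified model (hypothetical data): intercept, x, k group effects, k interactions.
rng = np.random.default_rng(0)
n, k = 400, 3
g = rng.choice(k, size=n, p=[0.6, 0.3, 0.1])
D = np.eye(k)[g]                                  # group indicator matrix
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x, D, D * x[:, None]])
y = 1.0 + np.array([0.5, 1.0, -1.0])[g] * x + rng.normal(size=n)

pi_hat = D.mean(axis=0)                           # sample abundances
zeros = np.zeros(k)
C = np.vstack([
    np.concatenate([[0.0, 0.0], pi_hat, zeros]),  # assumed: sum_r pi_r * beta_r = 0
    np.concatenate([[0.0, 0.0], zeros, pi_hat]),  # assumed: sum_r pi_r * gamma_r = 0
])

theta_hat = constrained_ols(X, y, C)
alpha1_hat = theta_hat[1]                         # main x-effect
gamma_hat = theta_hat[2 + k:]                     # race-modifier coefficients
```

In this toy setting, `alpha1_hat + gamma_hat` reproduces the slopes from separate regressions within each group, and `alpha1_hat` is their abundance-weighted average.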

Table 1. Linear regression output under default reference group encoding (RGE; left) and abundance-based constraints (ABCs; right): race-modified effects of racial isolation (RI) on 4th end-of-grade reading scores for students in North Carolina (y ~ 1 + RI + race + RI:race).
Statistical bias. The racial inequity in RGE also permeates statistical estimation and inference. Modern statistical learning commonly features penalized regression, variable selection, and Bayesian inference

Table 2. Linear regression output (under ABCs) for the race-modified effects of environmental, social, and other factors on 4th end-of-grade reading scores for students in North Carolina.
Data restricted to individuals with 37-42 weeks gestation, mother's age 15-44 years old at birth, BLL ≤ 80 µg/dL (and capped at 10 µg/dL), birth order ≤ 4, no current limited English proficiency, and residence in NC at the time of birth and time of 4th end-of-grade test. "Economically disadvantaged" is determined by participation in the National Lunch Program.
Estimation invariance with ABCs. Figure 2 presents the estimates and 95% confidence intervals for the main effects that are modified by race. We compare the main-only model (variables only in the left column of Table 2) to the race-modified model (all terms in Table 2), including both ABCs and RGE output for the race-modified models. Remarkably, under ABCs, the estimates and uncertainty quantification for the simpler, main-only model are nearly indistinguishable from those for the race-modified model.

Fig. 2. Estimates and 95% confidence intervals for the main effects in the multivariable regression without race-modifiers (black) and the multivariable regression with race-modifiers under ABCs (blue) and RGE (red). Results are presented for blood lead level (BLL), PM2.5 exposure (PM2.5), racial isolation (RI), mother's age (mAge), and birthweight percentile for gestational age (BWTpct), each of which is interacted with race in the expanded model (blue, red); additional covariates include sex, mother's education level, mother's marital status, mother's smoking status, and economically disadvantaged status (Table 2).

The reparametrization strategy in Eq. (8) allows direct application of classical inference theory to the ABC OLS estimator: $\hat{\theta}$ is a known, linear function of the (unconstrained) OLS estimator $\hat{\zeta}$. Thus, it is straightforward to derive the (Gaussian) sampling distribution of the ABC OLS estimator, which can be used to compute standard errors, hypothesis tests, and confidence intervals, and to establish unbiasedness and efficiency of the estimator. Under minimal assumptions, $\sqrt{n}(\hat{\zeta} - \zeta) \xrightarrow{d} N(0, I(\zeta)^{-1})$, and thus the ABC OLS estimator, as a linear function of $\hat{\zeta}$, is also asymptotically Gaussian.
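Because the constrained estimator is a linear map of the unconstrained one, its covariance follows by pre- and post-multiplying the usual OLS covariance by the null-space basis. The sketch below illustrates this under Gaussian errors; the function name `abc_ols_inference`, the residual degrees of freedom used for the variance estimate, the normal-quantile interval, and the one-way toy data are assumptions, not the authors' code.

```python
import numpy as np

def abc_ols_inference(X, y, C):
    """Constrained OLS estimate, standard errors, and covariance under C @ theta = 0."""
    n, d = X.shape
    m = C.shape[0]
    Q, _ = np.linalg.qr(C.T, mode="complete")
    Qn = Q[:, m:]                                  # null-space basis of C
    Z = X @ Qn
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    theta_hat = Qn @ (ZtZ_inv @ Z.T @ y)           # theta_hat = Qn @ zeta_hat
    resid = y - X @ theta_hat
    sigma2_hat = resid @ resid / (n - (d - m))     # assumed: usual unbiased variance estimate
    cov = sigma2_hat * Qn @ ZtZ_inv @ Qn.T         # covariance of the linear map of zeta_hat
    se = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return theta_hat, se, cov

# One-way toy example (hypothetical): intercept plus three group indicators.
rng = np.random.default_rng(2)
n = 250
g = rng.choice(3, size=n, p=[0.5, 0.3, 0.2])
D = np.eye(3)[g]
X = np.column_stack([np.ones(n), D])
y = np.array([0.0, 1.0, 3.0])[g] + rng.normal(size=n)
C = np.concatenate([[0.0], D.mean(axis=0)])[None, :]   # assumed: sum_r pi_r * beta_r = 0

theta_hat, se, cov = abc_ols_inference(X, y, C)
ci_lower = theta_hat - 1.96 * se                   # approximate 95% intervals
ci_upper = theta_hat + 1.96 * se
```

In this one-way layout the intercept equals the overall sample mean of `y`, so its standard error reduces to the familiar square root of the estimated error variance divided by `n`.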