2.1 Study design
We designed a two-sample Mendelian randomization (MR) study to investigate the relationships among cholesterol levels, lipoprotein particle sizes, and gp41 C34 expression (a marker of HIV cell entry). Genetic instrumental variables (IVs) were selected on the basis of stringent criteria, and the primary analysis used the random-effect inverse variance weighted (IVW) method, supported by sensitivity analyses (MR‒Egger, weighted median, simple mode, weighted mode, and leave-one-out analysis). The results were then filtered by significance thresholds, corrected for outliers, and screened to remove associations attributable to reverse causation.
Fig. 1 The flowchart of the study.
2.2 Data Sources
Genetic data on C34 gp41 exposure (a marker of HIV viral entry) were acquired from the Cooperative Health Research in the Region of Augsburg (KORA) cohort in Germany, which comprises 997 individuals. Within this cohort, 509,946 common autosomal single nucleotide polymorphisms (SNPs) were analyzed for their association with gp41 C34 peptide levels in blood plasma. Linear additive genetic regression models, adjusted for relevant covariates, were used to identify significant associations[24].
The genetic instruments for 39 lipoprotein particle traits, encompassing cholesterol levels and the sizes of lipid particles such as HDL, IDL, LDL, VLDL, and chylomicrons, were derived from genome-wide association studies (GWASs) conducted within the UK Biobank (UKB). Genotyping quality control excluded individuals with sex mismatch, sex chromosome aneuploidy, or non-European descent, resulting in a sample size of 115,082[25]. Measurements were taken from nonfasting EDTA plasma samples. All analyses were conducted under UKB application #15825[26].
The exposure and outcome data come from different cohorts in different countries, so we can reasonably assume that there is no participant overlap between the two samples.
2.2 Selection of Genetic Instrumental Variables
2.2.1 Filtering of exposure genetic variants
SNPs were selected at the genome-wide significance threshold of P < 5 × 10⁻⁸, so that only variants robustly associated with the lipid traits were included in the analysis.
2.2.2 Linkage Disequilibrium (LD) Considerations
Linkage disequilibrium (LD) measures the nonrandom association of alleles at two or more loci. High LD means that a few tag SNPs can represent a larger genomic region, simplifying genetic analyses. In this study, the LD analysis was based on a European population. We used the r² metric to describe allele correlation between loci: an r² value close to 1 indicates high correlation, whereas a value close to 0 indicates low correlation. To ensure SNP independence, we implemented a clumping procedure that excluded SNPs within a 10,000 kb window that exceeded an LD threshold of r² > 0.001. This minimizes redundancy among the IVs and reduces the bias that correlated instruments would introduce.
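The significance filter and the clumping rule can be sketched as a greedy procedure. This is a minimal illustration and not the pipeline used in the study: the record fields (`snp`, `chrom`, `pos`, `pval`), the SNP names, and the pairwise `r2` lookup are hypothetical, and in practice the r² values would be taken from a European reference panel.

```python
P_THRESHOLD = 5e-8        # genome-wide significance threshold from the text
WINDOW_BP = 10_000_000    # 10,000 kb clumping window
R2_THRESHOLD = 0.001      # LD threshold

def clump(snps, r2):
    """Greedy clumping: keep the most significant SNP, drop its LD partners.

    snps: list of dicts with keys snp, chrom, pos, pval (hypothetical layout).
    r2:   dict mapping frozenset({snp_a, snp_b}) to r^2; missing pairs count as 0.
    """
    # Keep only genome-wide significant SNPs, most significant first.
    candidates = sorted(
        (s for s in snps if s["pval"] < P_THRESHOLD),
        key=lambda s: s["pval"],
    )
    kept = []
    for s in candidates:
        in_ld = any(
            k["chrom"] == s["chrom"]
            and abs(k["pos"] - s["pos"]) <= WINDOW_BP
            and r2.get(frozenset((k["snp"], s["snp"])), 0.0) > R2_THRESHOLD
            for k in kept
        )
        if not in_ld:
            kept.append(s)
    return kept

# Toy input (invented values, for illustration only).
snps = [
    {"snp": "rsA", "chrom": 1, "pos": 1_000_000, "pval": 1e-12},
    {"snp": "rsB", "chrom": 1, "pos": 1_500_000, "pval": 1e-9},
    {"snp": "rsC", "chrom": 2, "pos": 1_200_000, "pval": 3e-10},
    {"snp": "rsD", "chrom": 1, "pos": 2_000_000, "pval": 1e-5},
]
r2 = {frozenset(("rsA", "rsB")): 0.4}
independent = [s["snp"] for s in clump(snps, r2)]
```

In this toy input, rsD fails the significance filter and rsB is removed because it is in LD with the more significant rsA.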
2.2.3 Outcome Data Retrieval
We retrieved outcome data for the identified SNPs and verified that these SNPs were not significantly associated with the outcome. When direct SNP data were unavailable in the outcome dataset, no proxy SNPs were used.
2.2.4 Eliminating SNPs with intermediate allele frequencies and palindromic sequences
A minor allele frequency (MAF) threshold of > 0.03 was applied to exclude extremely low-frequency variants, reducing false positives and sequencing noise[27]. Allele coding was aligned between the exposure and outcome datasets so that each effect estimate refers to the same allele. Palindromic SNPs, whose alleles are identical on the forward and reverse strands and can therefore be misassigned, were excluded to avoid potential bias. This alignment ensures that the effect estimates are comparable and accurately reflect the genetic associations.
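The alignment and exclusion rules above can be sketched as follows. This is a simplified illustration under stated assumptions: the record layout (`effect_allele`, `other_allele`, `beta`, `maf`) is hypothetical, all palindromic SNPs are dropped, and strand flips are not resolved.

```python
MAF_MIN = 0.03  # minimum minor allele frequency from the text

def is_palindromic(a1, a2):
    """A/T and C/G SNPs read the same on both strands."""
    return {a1, a2} in ({"A", "T"}, {"C", "G"})

def harmonise(exposure, outcome):
    """Align the outcome effect to the exposure effect allele.

    Returns (beta_exposure, beta_outcome), or None if the SNP is dropped.
    Both arguments are dicts with hypothetical keys
    effect_allele, other_allele, beta, and (exposure only) maf.
    """
    ea, oa = exposure["effect_allele"], exposure["other_allele"]
    if exposure["maf"] <= MAF_MIN or is_palindromic(ea, oa):
        return None
    out_alleles = (outcome["effect_allele"], outcome["other_allele"])
    if out_alleles == (ea, oa):
        return exposure["beta"], outcome["beta"]
    if out_alleles == (oa, ea):
        # Effect alleles are swapped, so the outcome effect changes sign.
        return exposure["beta"], -outcome["beta"]
    return None  # alleles do not match; not handled in this sketch

# Toy records (invented values).
exp_snp = {"effect_allele": "A", "other_allele": "G", "beta": 0.12, "maf": 0.25}
out_snp = {"effect_allele": "G", "other_allele": "A", "beta": 0.03}
aligned = harmonise(exp_snp, out_snp)
```

Here the outcome dataset reports the effect for the other allele, so its sign is flipped before the two estimates are compared.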
2.2.5 Assessment of IV strength
The strength of the IVs was assessed via the F statistic, which is calculated as F = ((N − k − 1)/k) ⋅ (R²/(1 − R²)), where N is the sample size, k is the number of IVs, and R² represents the proportion of variance in the exposure explained by the SNPs. R² is computed as R² = ∑{SE²⋅N/2⋅(1−MAF)⋅MAF⋅β²}, where MAF denotes the minor allele frequency, β is the genetic effect size, and SE is the standard error. An F statistic greater than 10 was taken to indicate that the instruments are strong enough to limit weak-instrument bias.
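As a worked illustration of the F statistic, the sketch below evaluates the formula for a given R². The function name and the input values are invented for the example, and R² is taken as an input rather than derived from summary statistics.

```python
def f_statistic(r2, n, k):
    """F = ((N - k - 1) / k) * (R^2 / (1 - R^2))."""
    return ((n - k - 1) / k) * (r2 / (1.0 - r2))

# Invented example: 10 instruments explaining 10% of the variance
# in a sample of 1,001 individuals.
f_value = f_statistic(r2=0.10, n=1001, k=10)
is_strong = f_value > 10  # conventional threshold from the text
```

With these inputs the statistic is approximately 11, just above the threshold of 10.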
2.3 Mendelian randomization analysis
2.3.1 Random-effect inverse variance weighting (random-effect IVW)
The inverse variance weighted (IVW) method estimates causal relationships in MR studies. In our analysis, we observed significant heterogeneity (IVW Q p value < 0.05) for traits such as cholesterol levels in IDL, the concentration of small HDL particles, and cholesterol levels in small HDL particles. This indicates that the assumption of homogeneity is violated, making the random-effect IVW model preferable[28]. The causal effect is calculated in the same way as in ordinary IVW, but the model includes an additional variance component that accounts for heterogeneity among SNPs. Because it allows the causal effect to vary across SNPs, the random-effect model is more robust to heterogeneity and potential pleiotropy.
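A minimal sketch of the random-effect IVW calculation is given below. It assumes the multiplicative random-effects form, in which the fixed-effect standard error is inflated by the square root of Q/(n − 1) whenever that ratio exceeds 1; the function name and input values are illustrative only.

```python
import math

def ivw_random_effects(beta_exp, beta_out, se_out):
    """Return (estimate, standard error, Cochran's Q) for random-effect IVW."""
    ratios = [by / bx for bx, by in zip(beta_exp, beta_out)]      # Wald ratios
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exp, se_out)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    se_fixed = math.sqrt(1.0 / total)
    # Cochran's Q measures heterogeneity of the per-SNP ratio estimates.
    q = sum(w * (r - estimate) ** 2 for w, r in zip(weights, ratios))
    df = len(ratios) - 1
    # Overdispersion factor, never allowed to shrink the standard error.
    phi = max(1.0, q / df) if df > 0 else 1.0
    return estimate, se_fixed * math.sqrt(phi), q

# Invented summary statistics for three SNPs.
est, se, q_stat = ivw_random_effects(
    beta_exp=[0.1, 0.2, 0.3],
    beta_out=[0.05, 0.10, 0.15],
    se_out=[0.01, 0.01, 0.01],
)
```

When the per-SNP ratios agree, Q is near zero and the result coincides with fixed-effect IVW; heterogeneous ratios leave the point estimate unchanged but widen its standard error.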
2.3.2 Weighted Median
The weighted median method provides a robust causal estimate even when up to 50% of the weight in the analysis comes from invalid instruments. This robustness complements the efficiency of the IVW method, making the final conclusion less likely to be biased by a few invalid instruments. Applying both the random-effect IVW and weighted median methods gives a more complete view of the potential impact of pleiotropy and invalid instruments. If the two estimates are similar, we can conclude that the findings are robust. This approach strengthens the credibility of the causal inference and serves as a form of cross-validation[29].
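The weighted median can be sketched as an interpolated median of the per-SNP ratio estimates. This is an illustrative implementation that assumes at least two SNPs with positive weights (typically inverse-variance weights); the function name and values are invented.

```python
def weighted_median(ratios, weights):
    """Interpolated weighted median of per-SNP ratio estimates."""
    if len(ratios) == 1:
        return ratios[0]
    order = sorted(range(len(ratios)), key=lambda i: ratios[i])
    r = [ratios[i] for i in order]
    w = [weights[i] for i in order]
    total = sum(w)
    # Cumulative weight up to the midpoint of each SNP, scaled to [0, 1].
    cum, running = [], 0.0
    for wi in w:
        running += wi
        cum.append((running - 0.5 * wi) / total)
    # Last estimate whose cumulative weight is still below 50%.
    below = max(i for i, c in enumerate(cum) if c < 0.5)
    span = cum[below + 1] - cum[below]
    return r[below] + (r[below + 1] - r[below]) * (0.5 - cum[below]) / span

# Three consistent SNPs and one outlier (invented values).
wm = weighted_median([0.5, 0.5, 0.5, 10.0], [1.0, 1.0, 1.0, 1.0])
```

The outlying ratio of 10 does not move the estimate, which stays at the value shared by the other three SNPs.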
2.3.3 MR‒Egger
MR‒Egger regression detects and adjusts for directional pleiotropy. It regresses the SNP-outcome effects on the SNP-exposure effects with an intercept term: the slope gives the causal estimate, and the intercept estimates the average pleiotropic effect. The null hypothesis states that there is no directional pleiotropy (an intercept of zero). If the p value associated with the intercept is less than the chosen significance level (typically 0.05), the null hypothesis is rejected, which suggests that the genetic instruments have a direct effect on the outcome independent of the exposure[30].
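The regression described above can be sketched as a weighted least-squares fit. This is an illustration only: it returns the point estimates without standard errors or p values, weights each SNP by the inverse variance of its outcome effect, and uses invented input values.

```python
def mr_egger(beta_exp, beta_out, se_out):
    """Return (intercept, slope) of the weighted Egger regression."""
    # Orient every SNP so that its exposure effect is positive.
    bx = [abs(b) for b in beta_exp]
    by = [o if b >= 0 else -o for b, o in zip(beta_exp, beta_out)]
    w = [1.0 / se ** 2 for se in se_out]
    sw = sum(w)
    mean_x = sum(wi * x for wi, x in zip(w, bx)) / sw
    mean_y = sum(wi * y for wi, y in zip(w, by)) / sw
    sxx = sum(wi * (x - mean_x) ** 2 for wi, x in zip(w, bx))
    sxy = sum(wi * (x - mean_x) * (y - mean_y) for wi, x, y in zip(w, bx, by))
    slope = sxy / sxx                      # causal estimate
    intercept = mean_y - slope * mean_x    # average pleiotropic effect
    return intercept, slope

# Invented data built as outcome = 0.02 + 0.5 * exposure after orientation.
egger_intercept, egger_slope = mr_egger(
    beta_exp=[0.1, 0.2, -0.3, 0.4],
    beta_out=[0.07, 0.12, -0.17, 0.22],
    se_out=[0.01, 0.01, 0.01, 0.01],
)
```

A nonzero intercept, as in this toy input, is the pattern that the intercept test is designed to detect.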
2.3.4 Simple mode
The simple mode method identifies the most frequent causal estimate (mode) among the genetic variants. It involves calculating the causal estimate for each SNP and then determining the mode of these estimates. This method assumes that most of the genetic instruments are valid and that the mode represents the true causal effect. This approach is beneficial when the majority of the instruments cluster around the true effect, providing a simple yet effective estimate[31].
2.3.5 Weighted mode
The weighted mode method is similar to the simple mode method but incorporates weights to increase the influence of more precise estimates. It involves calculating the causal estimate for each SNP, assigning weights on the basis of the inverse of the variance, and determining the mode of the weighted estimates. This method assumes that the valid instruments are clustered around the true causal effect, and it provides a robust estimate even when some instruments are invalid. By weighting the estimates, this method improves the precision of the causal estimate[31].
2.3.6 Leave-One-Out Analysis
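Both mode-based estimators can be sketched with a kernel-smoothed density of the per-SNP ratio estimates. The sketch below is a simplified illustration: the fixed Gaussian bandwidth of 0.05 and the grid search are assumptions made for the example, whereas practical implementations choose the bandwidth from the data.

```python
import math

def mode_estimate(ratios, weights=None, bandwidth=0.05, grid_size=2001):
    """Location of the peak of a (weighted) Gaussian kernel density."""
    if weights is None:
        weights = [1.0] * len(ratios)   # simple mode: equal weights
    lo, hi = min(ratios), max(ratios)
    if lo == hi:
        return lo
    step = (hi - lo) / (grid_size - 1)
    best_x, best_density = lo, -1.0
    for i in range(grid_size):
        x = lo + i * step
        density = sum(
            w * math.exp(-0.5 * ((x - r) / bandwidth) ** 2)
            for r, w in zip(ratios, weights)
        )
        if density > best_density:
            best_x, best_density = x, density
    return best_x

# Invented ratio estimates: three clustered SNPs and one imprecise outlier.
ratios = [0.48, 0.50, 0.52, 2.0]
ratio_se = [0.1, 0.1, 0.1, 0.5]
simple_mode = mode_estimate(ratios)
weighted_mode = mode_estimate(ratios, [1.0 / se ** 2 for se in ratio_se])
```

Both estimates land on the cluster around 0.5; the weighted version additionally reduces the influence of the imprecise outlier.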
To assess the robustness of the MR findings, a leave-one-out (LOO) sensitivity analysis was conducted. LOO analysis systematically excludes one SNP at a time from the set of instrumental variables and recalculates the causal estimate to evaluate the influence of each individual SNP on the overall MR estimate. The TwoSampleMR package in R was used to perform the MR analysis and the LOO sensitivity analysis. Forest plots were generated to visualize the LOO results, showing the causal estimate and confidence interval obtained with each SNP excluded. A causal estimate that remains consistent across SNP exclusions indicates robustness, whereas a marked change upon exclusion of a specific SNP suggests potential issues such as pleiotropy or violations of MR assumptions[31][32].
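The LOO procedure can be sketched in a few lines. This Python illustration is not the TwoSampleMR implementation used in the study; it recomputes a plain IVW point estimate with each SNP removed, and the SNP names and effect sizes are invented.

```python
def ivw_estimate(beta_exp, beta_out, se_out):
    """Inverse variance weighted point estimate from per-SNP ratios."""
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exp, se_out)]
    ratios = [by / bx for bx, by in zip(beta_exp, beta_out)]
    return sum(w * r for w, r in zip(weights, ratios)) / sum(weights)

def leave_one_out(snps, beta_exp, beta_out, se_out):
    """Map each SNP to the IVW estimate obtained without it."""
    results = {}
    for i, name in enumerate(snps):
        keep = [j for j in range(len(snps)) if j != i]
        results[name] = ivw_estimate(
            [beta_exp[j] for j in keep],
            [beta_out[j] for j in keep],
            [se_out[j] for j in keep],
        )
    return results

# Invented data: rs4 has a ratio of 2.0, the other SNPs a ratio of 0.5.
loo = leave_one_out(
    snps=["rs1", "rs2", "rs3", "rs4"],
    beta_exp=[0.1, 0.2, 0.3, 0.2],
    beta_out=[0.05, 0.10, 0.15, 0.40],
    se_out=[0.01, 0.01, 0.01, 0.01],
)
```

Removing rs4 brings the estimate back to 0.5, while removing any other SNP leaves it elevated, which is the kind of deviation that flags an influential variant.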
2.4 Application of the False Discovery Rate (FDR) in Multiple Testing
To account for multiple testing, we applied the Benjamini‒Hochberg (BH) correction to the p values from the IVW method in our MR analysis of the causal relationships between the C34 gp41 HIV fragment and lipid traits, identifying significant and suggestive associations while controlling the false discovery rate (FDR). This method balances the risk of Type I errors (false positives) against Type II errors (false negatives), which makes it more powerful than more conservative corrections, particularly for large datasets with many comparisons. The steps are to rank the p values from the MR analysis, calculate the critical value for each (k/m × α, where k is the rank, m is the total number of tests, and α is 0.05), and compare each p value with its critical value. An FDR-adjusted p value < 0.05 indicates a significant association; an unadjusted P < 0.05 with an FDR-adjusted p value ≥ 0.05 indicates a suggestive association; and an unadjusted P ≥ 0.05 denotes no association.
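The BH steps can be sketched as follows. The code returns FDR-adjusted p values, so comparing an adjusted value with α is equivalent to comparing the raw p value with its critical value k/m × α; the function names and the example p values are invented for illustration.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def classify(p, fdr, alpha=0.05):
    """Label an association using the raw and FDR-adjusted p values."""
    if fdr < alpha:
        return "significant"
    if p < alpha:
        return "suggestive"
    return "no association"

# Invented IVW p values for three traits.
pvals = [0.001, 0.04, 0.5]
fdr = bh_adjust(pvals)
labels = [classify(p, q) for p, q in zip(pvals, fdr)]
```

In this example the second trait has a raw p value below 0.05 but an adjusted value above it, so it is labeled suggestive rather than significant.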