Research on model transparency remains relatively limited within the growing field of multimodal machine learning, particularly for text-tabular datasets. To address this gap, we present a novel multimodal masking framework that extends SHapley Additive exPlanations (SHAP) to text-tabular datasets. The framework, which we make publicly available, enables the generation of SHAP explanations for any text-tabular dataset with any combination method. By masking features according to their modality, it ensures that features are treated consistently across unimodal and multimodal settings. Moreover, by deferring the formation of the model input until after the masking call, the framework remains agnostic to how the input is formatted, avoiding the issues that arise when data are pre-formed into text and the existing text masker is applied. In an extensive study, we examine how combination strategies and language models affect SHAP explanations. Notably, the choice of combination method considerably influences which features are identified as most important. Our findings further reveal that methods converting all input to text tend to assign greater relative importance to text features than to tabular features.
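To make the core mechanism concrete, the following is a minimal, hypothetical sketch of modality-aware masking with deferred input formation. The function names (`mask_features`, `form_input`), the background-value replacement for tabular features, the `[MASK]` token, and the serialization format are all illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch: one binary mask spans both modalities, and the
# model input is only assembled AFTER masking, so the same masker works
# with any combination method (e.g. all-text serialization).

def mask_features(mask, tabular, text_tokens, background):
    """Apply a single binary mask across tabular and text features.

    mask        : list of 0/1 flags, length len(tabular) + len(text_tokens)
    tabular     : dict mapping column name -> value
    text_tokens : list of word tokens
    background  : dict mapping column name -> replacement for masked columns
    """
    cols = list(tabular)
    n_tab = len(cols)
    # Tabular features: masked-out entries fall back to background values,
    # mirroring how tabular SHAP maskers behave in the unimodal setting.
    masked_tab = {
        c: (tabular[c] if mask[i] else background[c])
        for i, c in enumerate(cols)
    }
    # Text features: masked-out tokens are replaced with a mask token,
    # mirroring the unimodal text masker.
    masked_text = [
        tok if mask[n_tab + j] else "[MASK]"
        for j, tok in enumerate(text_tokens)
    ]
    return masked_tab, masked_text

def form_input(masked_tab, masked_text):
    # Input formation is deferred until after masking; swapping this
    # function changes the combination method without touching the masker.
    serialized = ", ".join(f"{k}: {v}" for k, v in masked_tab.items())
    return serialized + " | " + " ".join(masked_text)

tab = {"age": 42, "income": 50000}
bg = {"age": 0, "income": 0}
toks = ["great", "service"]
# Keep age and "service"; mask income and "great".
mt, mx = mask_features([1, 0, 0, 1], tab, toks, bg)
print(form_input(mt, mx))  # → age: 42, income: 0 | [MASK] service
```

In practice such a masker would be wrapped in a SHAP explainer (e.g. `shap.Explainer` with a custom masker); the sketch only illustrates why treating each modality natively avoids the problems of pre-forming the whole example into text.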