Here we describe one of the most comprehensive prospective collections of consumer purchase data (CPD) to date. More than 400 participants have been registered so far, and recruitment is ongoing, enabling continuous prospective collection of CPD. Though this number may not seem large, the data span up to 8 years of CPD, currently from 34 retailers, including several large supermarket chains, supplemented by one generic food database and two specific product databases. The various data sources enabled broad exposure assessment of otherwise hard-to-obtain information on consumption of grocery products, including food, tobacco and alcohol, sanitary products, and chemicals.
We found that comprehensive enrichment is possible. It both quantifies known product groups contributing to disease, such as tobacco and alcohol, and allows for broad assessment of otherwise hard-to-access exposures, such as indoor pollutants (e.g. candles and sprays) and chemicals in cosmetics (9). Though we report on self-reported diseases here, the continuous CPD collection may be linked to health outcomes, enabling register-based longitudinal analysis of how lifestyle-related factors are associated with disease onset, propagation, or cessation.
Any of the 210,203 unique products sold to participants could, in principle, cause disease, as single products often do in foodborne outbreaks (6). Apart from the most commonly bought products and products with high brand recognition, it may be challenging to identify such products using questionnaires, as only some participants would remember them. If other diseases or changes in disease trajectory are triggered by the purchase of a single product, a raw CPD sample with high coverage could thus be superior to questionnaires, as was the case in a recent simulation study (7). We found high dispersion in the proportion of purchases from each product group when data from only one retailer were used compared to data from four large retailers. All product groups, including important groups such as tobacco, had a relative fractional difference of 100 at the 10th percentile when using data from only one retailer. This can result in misclassification of purchase behavior, such as classifying an individual as an apparent non-smoker when, in reality, the individual is buying cigarettes at another retail chain. Our results illustrate that having data from only one retailer increases the risk of measurement error, and that this error was reduced as more retailers were added, an important finding for future studies comparing consumption patterns to data from, e.g., food frequency questionnaires.
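The single-retailer comparison above can be illustrated with a minimal sketch. It assumes that the relative difference for a product group is the absolute difference between its share of spending at a subset of retailers and its share across all available retailers, divided by the latter and expressed as a percentage. This definition, the function names, and the toy data are illustrative assumptions, not the exact computation used in the study.

```python
from collections import defaultdict

def group_shares(purchases, retailers=None):
    """Share of total spending per product group.

    purchases: iterable of (retailer, product_group, amount) tuples.
    retailers: optional set of retailers to restrict to (None = all).
    """
    totals = defaultdict(float)
    for retailer, group, amount in purchases:
        if retailers is None or retailer in retailers:
            totals[group] += amount
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {group: value / grand_total for group, value in totals.items()}

def relative_difference_pct(purchases, group, subset):
    """Assumed definition: |share_subset - share_all| / share_all * 100."""
    share_all = group_shares(purchases).get(group, 0.0)
    share_subset = group_shares(purchases, subset).get(group, 0.0)
    if share_all == 0:
        return None  # group never bought; the difference is undefined
    return abs(share_subset - share_all) / share_all * 100

# Hypothetical participant who buys tobacco only at retailer B.
example_purchases = [
    ("A", "food", 90.0),
    ("B", "food", 60.0),
    ("B", "tobacco", 50.0),
]

# Seen through retailer A alone, the tobacco share is 0, so the
# relative difference from the all-retailer share is 100 percent.
tobacco_diff = relative_difference_pct(example_purchases, "tobacco", {"A"})
```

Under this assumed definition, a value of 100 corresponds to a product group that is bought by the participant but entirely absent from the single retailer's data, which is the apparent non-smoker scenario described above.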
Information was retrieved beyond the product item level for 87.7% of all products sold, enabling analysis beyond the product item name or item number and allowing for the assessment of ingredients, nutrient information, and many other exposures of relevance to health (11, 13, 16). As products were matched to Frida and a range of custom-made, non-food categories using regular expressions, some mismatches were expected. Moreover, product-specific discrepancies were expected, such as differences in caloric content within the same product group arising from differences between the average product values in the food database and the specific brand (e.g., different spread products). We quantified differences between the average product values of generic products in Frida and the GTIN-specific product information from the GS1 database. Here, we found the median relative difference in energy content (kJ) to be 0.26, which is higher than what was previously shown in a large British study where products were manually compared (29). These findings are only partially explained by differences between generic and branded products; some variation is due to misclassification of products (e.g. low-fat versus full-fat products) or residual errors in the matching algorithm. Manual assessment of 1000 randomly selected matched products revealed that 70% of products were good matches to generic products, and 80% matched at the product group level, meaning that, for example, the milk type (skimmed versus whole milk) may have been misclassified while the product was still correctly identified as milk. This highlights the need for continuous improvement of the matching algorithm; the current framework allows for and encourages such development, as well as the use of other approaches, including natural language processing. More promising is direct matching to structured specific product databases such as GS1 Tradesync, Kemiluppen, and Open Food Facts (22-24).
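As a minimal sketch of the comparison between generic and GTIN-specific energy values described above, the median relative difference could be computed as follows. The definition used here (absolute difference divided by the generic value) and the toy numbers are assumptions for illustration; the study's exact formula is not restated in this section.

```python
from statistics import median

def median_relative_difference(pairs):
    """Median of |specific - generic| / generic over matched products.

    pairs: iterable of (generic_kj, specific_kj) energy values per 100 g.
    Pairs with a non-positive generic value are skipped to avoid
    division by zero.
    """
    differences = [
        abs(specific - generic) / generic
        for generic, specific in pairs
        if generic > 0
    ]
    return median(differences) if differences else None

# Hypothetical matched products: (generic value, brand-specific value) in kJ.
example_pairs = [(1000, 1200), (2000, 1500), (500, 650)]

# Individual relative differences are 0.2, 0.25, and 0.3, so the median is 0.25.
example_median = median_relative_difference(example_pairs)
```

The median is used in this sketch because a few badly mismatched products (e.g. a full-fat product matched to a low-fat generic entry) would otherwise dominate a mean.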
The unique GTIN13 maintained by GS1 enables direct matches, but the coverage of these databases may vary by country, and for GS1 the coverage is producer-dependent, as some producers do not allow research-related access. For the foreseeable future, efforts will probably need to combine specific and generic product databases, with supplier-generated product information and information from specific product databases preferred over generic information.
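The preference for specific over generic product information can be sketched as a two-step lookup: a direct GTIN13 (GTIN-13) match first, with regular-expression name matching as a fallback. The check-digit rule is the standard GTIN-13 one; the rule list, category labels, dictionary layout, and function names are hypothetical and do not reproduce the study's pipeline.

```python
import re
from typing import Optional

# Hypothetical name-matching rules, ordered from most to least specific so
# that "skimmed milk" is not swallowed by the broader "milk" rule.
GENERIC_RULES = [
    (re.compile(r"\bskimmed\s+milk\b", re.IGNORECASE), "milk, skimmed"),
    (re.compile(r"\bmilk\b", re.IGNORECASE), "milk, unspecified"),
]

def gtin13_is_valid(code: str) -> bool:
    """Validate the GTIN-13 check digit (weights 1,3,1,3,... from the left)."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(ch) for ch in code]
    weighted = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - weighted % 10) % 10 == digits[12]

def match_generic(name: str) -> Optional[str]:
    """Return the first generic category whose pattern matches the name."""
    for pattern, category in GENERIC_RULES:
        if pattern.search(name):
            return category
    return None

def enrich(gtin: str, name: str, specific_db: dict) -> tuple:
    """Prefer a GTIN-specific record; fall back to generic name matching."""
    if gtin13_is_valid(gtin) and gtin in specific_db:
        return ("specific", specific_db[gtin])
    category = match_generic(name)
    if category is not None:
        return ("generic", category)
    return ("unmatched", None)

# Hypothetical specific-product database keyed by GTIN-13.
example_db = {"4006381333931": {"energy_kj": 150}}
```

Ordering the rules from specific to general is one simple way to reduce group-level misclassification of the kind noted above, although it does not remove the need for manual validation.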
Here we combined name matching and GTIN matching to create a near real-time enrichment pipeline for purchased products. This framework allows us to follow key health determinants, such as tobacco, alcohol, dental products, and diet, over time, and thereby to investigate time-varying exposures, possible associations with health outcomes, and resultant targets for interventions (17, 30, 31). Though the outcomes described here were self-reported, the consent given by participants allows researchers to enrich the cohort further with information from Danish registers. The Danish registers provide access to outcomes such as prescription data; microbial and laboratory results, including calprotectin, C-reactive protein, and cholesterol; outpatient and hospital visits; and surgical procedures, in addition to a number of key health and economic outcomes as well as spatial and social data. These registers have been extensively described elsewhere (32-35). In addition to register data, other research projects may add further data to the cohort, enabling the future assessment of onset, flare-ups, alleviation, and cessation of a wide range of diseases (17, 18).
Limitations
This study has a number of limitations. First, the age and sex distributions of the cohort differ from those of the Danish population as a whole, with the cohort being older and more female, highlighting the risk of selection bias. However, the potential for dealing with this bias analytically is promising due to the large amount of longitudinal data for both the cohort and the total population (14). Other challenges include purchases being made at other retailers, losses to, e.g., food waste, the level of detail in product grouping, and the knowledge gaps regarding what is being consumed and by whom (e.g. eating out, at friends’, and within a household) (14, 36). Though identifying individual consumption is a challenge, Danish registers allow for identification of household information, including single-person households, where this is less of an issue (14). Further, our database did not include all retailers. Loss of data due to delays in participants updating the digital receipt app, and when credit/debit cards expire and are renewed, is also a challenge and the most likely cause of the average of 7 months without purchases found in this cohort. This is also reflected in the lower-than-expected total expenditure of 1,868 Danish Kroner per participant per month, compared to the mean expenditure of 3,164 for each Dane published in 2019 by Statistics Denmark (37). In addition to the lack of complete consumerome coverage, finding available product data beyond what is reported via receipts is also a challenge. Here, 39% of unique products and 13% of all products by volume bought by participants were not matched. Further, though a larger amount of information is available for generic products, many other key variables, such as weight or volume, are only available to a limited extent from product names and require structured product-specific databases such as GS1 to be assessed (14, 36).
Strengths
Major strengths of the current database are the broad range of retailers providing the sample, the ability to ascertain the impact of having CPD from more than one retailer, the length of follow-up (up to 8 years of CPD), and the ability to enrich and classify the majority of products. Combined, these enable investigation of multiple hard-to-assess exposures over prolonged time frames with minimal time or effort required from participants and without social desirability bias or recall bias. Many of the technical limitations mentioned above could be addressed by further improving the product enrichment pipeline, including adding natural language processing approaches, by providing services that encourage individuals to provide CPD for the entire household, or by adjusting the data for household composition, which could work in tandem with targeted questionnaires to address identified knowledge gaps (14, 36).
Another strength is the ability to use the personal identification number provided by participants to collect key covariates from Danish registries, including household size, income, education, and social data, in addition to other external exposome data and health outcomes, as detailed above.
Implications
The My Purchases cohort combines consumer purchase information with health outcomes. To ensure large-scale collection of CPD, there is an evident need for services that provide insights to participants while addressing their need for information, choice, and appropriate safeguards (38). Options such as being able to select or deselect various categories of research and to select different transaction data/CPD data sources could improve participants' trust in the recipient organizations (39). In the future, CPD could enable post-marketing epidemiological assessment of products and help unravel the commercial determinants of health, including health effects of additives and foods introduced to consumers (40). Such information may then inform residents, politicians, and key institutions, such as the European Food Safety Authority (EFSA) and the European Chemicals Agency (ECHA), of such effects (41, 42). With time, and in combination with other sources, links between exposures and biological pathways at different life stages could be discovered and early signs of health damage caused by environmental factors identified, and real-time lifestyle advice ameliorating or preventing the impact of these factors could be communicated directly to consumers.