A Comprehensive Data Fusion to Evaluate the Impacts of COVID-19 on Passenger Travel Demands: Application of a Core-Satellite Data Collection Paradigm

doi:10.21203/rs.3.rs-1976226/v1


Research Article


This work is licensed under a CC BY 4.0 License

Version 1


The COVID-19 pandemic has altered travel patterns in cities across the world. Previous studies have found that travel choices during the pandemic are affected by attitudes and perceptions of risk in addition to transportation system level-of-service attributes. However, traditional travel demand models often rely on household travel survey data, which rarely include information on attitudinal factors. Conversely, specialized surveys are often lengthy, so they offer the ability to collect detailed attitudinal information but suffer from limited sample sizes. This study demonstrates the feasibility of fusing a “core” household travel survey with three specialized “satellite” surveys to evaluate the impacts of COVID-19 on passenger travel demand in the Greater Toronto Area (GTA). The study uses a non-parametric implicit data fusion method to generate multiple synthetic datasets that contain observed travel diaries and socioeconomic attributes of the trip-makers from the core survey, along with imputed attitudinal statements based on the satellite surveys. The results highlight the ability of the method to sufficiently reproduce the distribution of the attitudinal variables and the ability of the imputed variables to support the estimation of an advanced econometric model. The proposed method can reduce the risk of potential biases in the imputed data that can adversely impact subsequent data analysis. This method can be used to capitalize on the benefits of specialized surveys while still being able to utilize data from large-scale household travel surveys.

Data fusion

COVID-19

Attitudinal factors

Implicit matching

Hybrid choice model

The spread of the novel coronavirus disease (COVID-19) caused mass casualties worldwide (1). Many countries implemented policies, including temporarily closing businesses, mobility restrictions, and stay-at-home campaigns to curb the spread of the virus. These policies severely disturbed travel patterns in cities across the world (2,3). Prior studies have found that these policies tend to coincide with declines in trip frequencies, trip distances, and the amount of time spent outside of one’s home (4–6). However, it appears that the pandemic has also impacted travel behaviour through its influence on attitudes and perceptions of risk, affecting out-of-home activity participation and modal preferences. For example, de Haas et al. (6) found that survey respondents who feared contracting COVID-19 were more likely to report spending their entire day at home. Parady et al. (7) found that so-called COVID dread was associated with reduced trip frequencies. In terms of the shift in modal preference, the pandemic has coincided with an increased preference for individual modes (such as driving and active modes) and a reduced preference for shared modes (such as public transit and ride-sourcing services) (3,6,8). Capasso da Silva et al. (9) found that individuals who are “concerned about the pandemic response” travelled the most using private vehicles. In contrast, those equally concerned about the pandemic response and the health effects of COVID-19 used personal bicycles and transit the most.

These findings indicate that in addition to traditional determinants, psychological constructs such as personal attitudes and risk perceptions affect our travel decisions in this unprecedented time. Thus, there is an increasing need to incorporate these variables in travel behaviour analysis during the pandemic. A growing body of literature indicates that the inclusion of these factors in discrete choice models can help improve their behavioural realism and explanatory power, as attitudes and perceptions have been shown to influence the decision-making process (10–12). However, traditional household travel surveys do not collect attitudinal information owing to the high cost and respondent burden. Conversely, specialized surveys that collect attitudinal data usually do not include a travel diary component and often suffer from limited sample sizes. To overcome the issue, this study presents a data fusion method to enrich a “core” household travel survey (that collected one-day travel diaries from respondents) of the Greater Toronto Area (GTA) by linking it to three “satellite” surveys that collected rich attitudinal information from GTA residents.

We adopt a four-step approach to fuse the attitudinal data from the satellite surveys (donors) to the core survey (receptor). We utilize a k-Nearest Neighbour (k-NN) type extension of the non-parametric hot-deck imputation technique (13) to generate multiple fused datasets, each of which contains observed travel diary and socioeconomic attributes of the trip-makers from the core survey along with an instance of the inferred attitudinal statements based on the satellite surveys. Such multiple imputations help reduce potential biases affecting subsequent analysis using the fused data. Common personal and household attributes are carefully evaluated to select the optimal set of matching variables for the fusion exercise. The resulting synthetic datasets are analyzed to ensure that the imputed attitudinal variables meet our a priori expectations regarding the respondents' socioeconomic status and travel behaviour. While doing so, we also test the feasibility of using the fused data to estimate discrete choice models, which are considered the workhorses of travel demand modelling practice.

The remainder of the paper is organized as follows: Section 2 presents background information on data fusion methods. Section 3 describes the core and the satellite survey data. Section 4 presents details of the data fusion framework and the empirical validation. Section 5 discusses the results of the fusion process. Section 6 presents an application of the fused data to investigate the factors affecting commute mode choice decisions during the pandemic using a hybrid choice model. Finally, Section 7 discusses the usefulness of the proposed fusion method, as well as the associated challenges and possible improvements.

Data fusion refers to combining data from multiple sources to obtain “improved information”, which may refer to less expensive, higher quality, or more relevant information (14). However, the exact definition varies with the application framework. In this study, data fusion can be considered as a model-based approach for providing joint information on variables and indicators collected through multiple surveys drawn from the same population. Although data fusion has existed in the statistics and marketing communities since the 1980s (15), its application in the transportation field is relatively new (16, 17). Most of the available studies in this field have focused on fusing data from traffic sensors and other passive sources (18–22). In terms of survey data fusion, the focus has mainly been on abstract methods, with few practical implementations (23, 24). This could be attributed to the complexity of travel survey data and the lack of context or the exact purpose for data fusion. Moreover, quantifying the statistical properties of the fused dataset has always been challenging (25). Nonetheless, fusing data from multiple surveys has become necessary to better understand travel behaviour in the context of emerging technologies like connected and automated vehicles, ride-sourcing, car sharing, and bike sharing. There is an increasing need for a data fusion method that can be used to capitalize on the benefits of specialized surveys while still being able to utilize data from large-scale household travel surveys.

Fusion in the context of transportation survey data typically takes the form of statistical matching (26), which refers to a stochastic procedure for integrating nonoverlapping datasets with common variables (27). The fusion process typically follows a donor-receptor framework, where the donor data contains information that would be preferable to have in the receptor data. Variables unique to the datasets are called specific variables, while variables existing in all the datasets are called common variables. The data fusion problem involves inferring target variable(s) from the donor to the receptor surveys by using the observed relationships among the common and specific variables in the donor file. A record in one dataset can be matched to multiple records in a different dataset as long as the common variables are sufficiently similar.

There are two main approaches for data fusion in statistical matching: the explicit method and the implicit method. In the explicit method, a correlation model is first developed using the donor dataset to predict a target variable as a function of the common variables. Then this model is applied to impute the missing target variable in the receptor file. Conversely, the implicit method transfers values of the target variables from the donor record to the receptor records that have been matched based on the similarity of common variables (25, 28, 29). The main advantage of the explicit approach is its ability to measure the robustness of the matching process using various goodness-of-fit statistics. However, the assumptions made regarding the type of correlation model can induce undesirable statistical phenomena in the subsequent analysis of the synthesized complete dataset. We adopt the implicit fusion method in this study to avoid this issue. Being non-parametric, the implicit method relies, to a lesser extent, on making potentially erroneous or unverifiable distributional assumptions (30).

The implicit fusion method has been used in a handful of studies in the transportation survey literature (27, 30–33). For example, Lugo and Srinivasan (33) used the method to match health information from the American Time Use Survey (donor) to the National Household Travel Surveys of 2008–2009 (receptor). Specifically, they imputed the body-mass index and a self-assessed physical health score using matching variables like gender, census region, education, housing tenure, student status, age, race, employment status, income, and household size. The study imputed five instances of a target variable and used the average as the assigned value to the receptor dataset. Müller and Axhausen (27) used the hot-deck imputation technique (a non-parametric micro-matching method that assumes conditional independence) to create a synthetic population by fusing register survey and transportation micro census data in Switzerland. Pawlak et al. (30) present a method that uses the implicit fusion technique and the multiple imputation method of Rubin (34) to combine information on individuals’ digital lifestyles and travel behaviour.

In this study, we match records from the core survey (receptor) with candidate records from the satellite surveys (donor) using a variation of the hot-deck imputation technique. The basic hot-deck method imputes the non-observed variables in the receptor record using observed variables in a donor record whose common variables are closest to those of the receptor record (29). Here, ‘closest’ refers to the difference in values between the two records. Different types of distance measures can be used for this purpose (e.g., Euclidean, Manhattan, Mahalanobis, Gower, etc.). In this study, we adopt a k-NN type extension of the hot-deck imputation technique (similar to the one presented in (30)) that automatically generates multiple candidate donor records. This extension enables the application of the multiple imputations framework (34), thereby reducing the biases that can potentially affect subsequent analyses using the fused data.

The datasets used in this paper were collected as part of a comprehensive data collection program to assess the impacts of COVID-19 on urban passenger travel demand in the GTA. The program was designed based on the concept of the core-satellite approach to data collection (35). The core is a household travel survey that collects basic traveller information and trip patterns, and the satellites are four surveys that target specific travel behaviour. This approach helps reduce respondent burden, enables sufficient sampling of hard-to-reach sub-groups, and ensures that the datasets can be fused to obtain a more comprehensive representation of travel behaviour (36). Participants were randomly recruited from commercial survey panels comprised of GTA residents. A detailed description of the data collection process can be found in (2). This study focused on four components of the data collection program – the core survey and three satellite surveys (see Table 1). As shown in Fig. 1, the surveys were conducted when the number of new COVID-19 cases reported was relatively low, and public health measures were relatively relaxed.

Table 1

Datasets used in the study
Component	Survey name	Description	Study period	Sample size
Core	COVHITS	COVid-19 influenced Households’ Interrupted Travel Schedule	Oct - Nov 2021	4,678 households containing 8,911 individuals (≥ 18 years)
Satellites	SiSTM	Study into the use of Shared Travel Modes	July 2021	767 individuals
	SPETT	Stated Preference Experiment on Travel mode and especially Transit choice behavior	July 2021	849 individuals
	CASAS	Covid Activity Scheduling and Adaptation Survey	July 2021	860 individuals

COVHITS is a household travel survey that collects the socioeconomic characteristics of the household members and their (weekday) travel diaries. The satellite surveys have a common set of questions on the socioeconomic characteristics of the respondents and their general pandemic response behaviour, followed by detailed satellite-specific questions. These questions include attitudinal statements, detailed activity frequency questions, and a series of stated preference experiments. Figure 2 shows the data model of the four surveys in the context of this study.

The core and the satellite surveys have several common variables regarding the personal and household attributes of the respondents. The presence of these common variables facilitates the use of fusion to generate synthetic datasets containing travel and socioeconomic attributes of the trip-makers from the core survey, along with inferred attitudes and perceptions based on the satellite surveys. Table 2 presents the descriptive statistics of these common variables for the surveys and compares them with the 2016 Transportation Tomorrow Survey (TTS). The TTS is a household travel survey that surveys 5% of the population in the GTA every five years. The latest iteration was conducted in 2016, and the data were expanded to match the socioeconomic distribution in the 2016 Canadian Census (37). Therefore, the TTS can be regarded as a representative reference of the overall population in the study area.

Table 2

Descriptive statistics of the core and the satellite surveys
Variables	Satellites			Core	2016 TTS
	SiSTM	SPETT	CASAS	COVHITS
	n = 767	n = 849	n = 860	n = 8,911
Age	43.31 (16.47)	45.28 (15.84)	42.53 (15.84)	49.02 (17.55)	46.64 (17.43)
Gender
Male	40.4%	43.0%	39.2%	47.5%	48.5%
Female	58.5%	56.5%	60.0%	51.2%	51.5%
Other	1.0%	0.5%	0.8%	1.3%
Student status
Full-time student	12.6%	6.8%	12.3%	7.3%	7.4%
Part-time student	7.0%	9.3%	8.0%	2.5%	2.6%
Not a student	80.3%	83.9%	79.7%	88.3%	90.0%
Decline/don't know	0.0%	0.0%	0.0%	1.9%
Employment status
Full-time (≥ 30 hours/week)	56.2%	55.8%	60.7%	49.8%	42.6%
Part-time (< 30 hours/week)	15.6%	15.1%	13.1%	11.2%	9.6%
Not employed	28.2%	29.1%	26.2%	39.0%	47.9%
COVID-19 vaccination status
Fully vaccinated with two doses	67.4%	62.3%	57.3%	91.5%
Received first dose	15.8%	18.5%	17.1%	1.5%
Not taken any dose, but plan to get vaccinated	9.9%	7.8%	8.3%	1.0%
Not taken any dose and no plan to get vaccinated	5.6%	7.5%	11.4%	2.9%
Prefer not to answer	1.3%	3.9%	5.9%	3.1%
Household size	2.80 (1.60)	2.78 (1.46)	2.84 (1.40)	2.58 (1.21)	2.70 (1.48)
Household vehicles	1.38 (0.99)	1.40 (0.89)	1.42 (0.88)	1.52 (0.98)	1.39 (0.98)
Household income
below $14,999	3.5%	2.9%	3.3%	2.5%	4.8%
$15,000 - $39,999	16.0%	14.6%	16.7%	9.1%	14.1%
$40,000 - $59,999	15.9%	16.7%	15.8%	11.1%	13.8%
$60,000 - $99,999	28.9%	29.6%	30.1%	25.9%	21.4%
$100,000 - $124,999	11.1%	11.4%	11.6%	17.1%	10.1%
$125,000 and above	16.0%	16.3%	16.2%	26.4%	18.1%
Decline/don't know	8.5%	8.5%	6.3%	7.9%	17.8%
Household region
Toronto	47.5%	46.2%	46.9%	50.7%	53.2%
York	18.8%	19.1%	18.8%	19.1%	17.1%
Peel	24.1%	24.5%	24.0%	20.1%	20.5%
Halton	9.6%	10.2%	10.3%	10.1%	9.2%

As shown in Fig. 2, we fused 33 key attitudinal and risk perception attributes (target variables) whose descriptions are presented in Table A1 in the Appendix. Each question has a five-point Likert scale response, with options ranging from strongly disagree to strongly agree. The attitudinal statements inferred from the SiSTM survey highlight respondents’ increased risk perception of travelling and their adjustment of travel preferences during the pandemic. The attitudinal statements inferred from the SPETT survey provide an overview of the respondents’ concerns regarding the pandemic, their safety perception of different types of transit vehicles, and their attitude towards returning to transit. Finally, the inferred statements of the CASAS survey relate to the respondents’ general attitude towards telecommuting and hybrid work arrangements and their preferences for online vs. in-store shopping.

In this study, donor records from the three satellite surveys were matched with the receptor records of the core survey through a four-step process. First, the prerequisites of harmonization and coherence of the datasets were tested. Second, the degree of association of the common variables with the target variables from the satellite surveys was analyzed. Based on these steps, the most suitable common variables were selected as the matching variables for the imputation process. Third, the k-NN extension of the hot-deck imputation technique was applied to match the donor and the receptor records. Finally, the quality of the fused attitudinal data was assessed at two levels. Each step is described in detail below.

4.1 Harmonisation and reconciliation of sources

The main condition of data fusion is the existence of a set of common variables among the surveys that are coherent and have high explanatory power for the target variables to be imputed. As the surveys were designed based on the core-satellite data collection paradigm, the common variables are coherent regarding the wording of questions, definition of concepts, and guidelines. However, some variables differ slightly in terms of attribute levels. To ensure coherence, the response options of these variables were suitably aggregated and re-coded. Also, while the satellite surveys collected data from all five municipalities in the study area, the core survey only focused on Toronto, York, Peel, and Halton. As such, the records belonging to residents of the municipality of Durham were removed from the final satellite data sets to maintain consistency. Similarly, COVHITS collected data from all household members, whereas the satellite surveys collected data from individuals aged at least 18 years. To maintain consistency, this study used COVHITS data for respondents who were at least 18 years old, resulting in a sample of 8,911 individuals. We also categorized the continuous variables to ensure each category contained enough records in each survey.
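The re-coding and categorization steps can be pictured with a minimal sketch. The response labels, the mapping, the age bands, and the function names below are hypothetical illustrations, not the coding scheme actually used in the surveys.

```python
# Hypothetical harmonisation helpers; labels and cut points are illustrative only.

# Map survey-specific response options onto a shared, coarser coding.
EMPLOYMENT_RECODE = {
    "Full-time": "full_time",
    "Full-time (work from home)": "full_time",
    "Part-time": "part_time",
    "Unemployed": "not_employed",
    "Retired": "not_employed",
}

# Exclusive upper bounds of assumed age bands, paired with their labels.
AGE_BANDS = [(25, "18-24"), (45, "25-44"), (65, "45-64")]


def recode_employment(raw):
    """Return the harmonised employment category, or 'unknown' if unmapped."""
    return EMPLOYMENT_RECODE.get(raw, "unknown")


def categorize_age(age):
    """Bin a continuous age into a band; respondents under 18 are dropped (None)."""
    if age < 18:
        return None
    for upper, label in AGE_BANDS:
        if age < upper:
            return label
    return "65+"
```

Coarser categories make exact matches between donor and receptor records more likely, at the cost of some detail in the matching variables.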

The marginal distributions of the common variables were then compared across the surveys to identify those that are consistent and coherent. We also computed similarity and dissimilarity measures (e.g., total variation distance, Bhattacharyya coefficient, and Hellinger’s distance) between the marginal distributions. Common variables with the smallest differences between the core and the satellite datasets were identified as potential matching variables.
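For discrete marginal distributions, the three measures named above have simple closed forms. The sketch below applies their standard definitions to the household region shares reported in Table 2 for SiSTM and COVHITS. The function and variable names are illustrative, and the code is not the authors' implementation.

```python
import math


def normalize(shares):
    """Rescale non-negative shares so that they sum to 1."""
    total = sum(shares)
    return [s / total for s in shares]


def total_variation(p, q):
    """Total variation distance: half the sum of absolute differences."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def bhattacharyya_coefficient(p, q):
    """Overlap between two distributions; equals 1 when they are identical."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def hellinger(p, q):
    """Hellinger distance, computed from the Bhattacharyya coefficient."""
    bc = bhattacharyya_coefficient(p, q)
    return math.sqrt(max(0.0, 1.0 - bc))  # guard against rounding below zero


# Household region shares (Toronto, York, Peel, Halton) from Table 2.
sistm_region = normalize([47.5, 18.8, 24.1, 9.6])
covhits_region = normalize([50.7, 19.1, 20.1, 10.1])

tvd = total_variation(sistm_region, covhits_region)
hd = hellinger(sistm_region, covhits_region)
```

A small total variation distance and a Hellinger distance close to zero both indicate that a common variable is distributed similarly in the two surveys.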

4.2 Analysis of the explanatory power for common variables

To select the common variables to be used in the matching process, their association with the target variables from the satellite surveys was analyzed using the Cramer’s V coefficient, which is a traditional measure of association calculated as:

$$V=\sqrt{\frac{{\chi }^{2}}{n\times \min\left[I-1,J-1\right]}}$$

Here, ${\chi }^{2}$ is the chi-square statistic, $n$ is the sample size, and $I$ and $J$ are the numbers of rows and columns of the two-way $X\times Y$ contingency table built for each pair of a common variable $X$ and a target variable $Y$ ($I$ is the number of categories of $X$ and $J$ is the number of categories of $Y$). Cramer’s V ranges from 0 (no association) to 1 (perfect association).
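As an illustration of the formula, the sketch below computes Cramer's V for two categorical variables using only the standard library. The function name and the toy inputs are assumptions for demonstration. The code applies no continuity or bias correction, which the source does not mention either.

```python
import math
from collections import Counter


def cramers_v(x, y):
    """Cramer's V between two equally long sequences of category labels."""
    n = len(x)
    row_totals = Counter(x)          # counts per category of X
    col_totals = Counter(y)          # counts per category of Y
    cell_counts = Counter(zip(x, y)) # observed counts per (X, Y) cell

    # Pearson chi-square statistic over all cells of the contingency table.
    chi2 = 0.0
    for r, n_r in row_totals.items():
        for c, n_c in col_totals.items():
            expected = n_r * n_c / n
            observed = cell_counts.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals) - 1, len(col_totals) - 1)
    if k == 0:
        return 0.0  # a constant variable carries no association
    return math.sqrt(chi2 / (n * k))


# Toy data: age band against a binary agreement indicator.
age_band = ["young", "young", "old", "old"]
agrees = ["no", "no", "yes", "yes"]
v_perfect = cramers_v(age_band, agrees)
```

In the toy data, age band determines the response exactly, so the coefficient reaches its upper bound of 1.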

Common variables that were relatively coherent and displayed good explanatory power for the target variables were selected as the potential matching variables. We set different cut-off values of the Cramer’s V coefficient for the different target variables to include a minimum number of matching variables in their matching process. Moreover, we applied a priori expectations and intuition to make the final selection of matching variables.

4.3 Matching method

A k-NN type extension of the hot-deck imputation technique was applied to generate multiple instances of the fused dataset. This is because imputing only a single instance carries several significant risks (34). First, it assumes that the imputation framework can predict the desired value perfectly, thus neglecting any stochastic variation in the imputed variables. Second, this approach would underestimate the variability in the data, leading to underestimation of the standard errors in any subsequent parameter estimates. To overcome these problems, we introduce a degree of randomness in the imputation by identifying a set of $k$ most similar respondents (nearest neighbours). Then we use Monte Carlo draws from this set to provide values for the target variables in the receptor dataset with the probability of selection dependent on the similarity to the target respondent. In this way, we generated 100 synthetic datasets, allowing us to draw valid inferences in subsequent analysis. The multiple datasets also offer a better understanding of the full range of parameter values that might be obtained in any subsequent model estimation exercise.
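To show why multiple fused datasets protect against understated standard errors, the sketch below pools a parameter estimated separately on each dataset using the standard combining rules of Rubin's multiple imputation framework. The source cites that framework but does not spell out a pooling formula, so treating this as the pooling step is an assumption. The function name and the numbers are illustrative.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine one parameter estimated on m imputed datasets.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    # Within-imputation variance: average squared standard error.
    within = mean([se ** 2 for se in std_errors])
    # Between-imputation variance: sample variance of the estimates.
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)


# Illustrative values only: three estimates of the same coefficient.
est, se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The pooled standard error exceeds the per-dataset standard error whenever the estimates differ across datasets, which is the variability a single imputation would hide.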

We used Gower’s dissimilarity coefficient (38) as the measure of distance, which calculates the final dissimilarity between the ${i}^{th}$ and ${j}^{th}$ records as a weighted sum of dissimilarities for each matching variable.

$$d\left(i,j\right)=\frac{{\sum }_{v}{\delta }_{ijv}{d}_{ijv}{w}_{v}}{{\sum }_{v}{\delta }_{ijv}{w}_{v}}$$

Here, ${d}_{ijv}$ represents the distance between the ${i}^{th}$ and ${j}^{th}$ records computed with regards to the ${v}^{th}$ variable, ${w}_{v}$ is the weight assigned to variable $v$, and ${\delta }_{ijv}$ is an indicator equal to 1 when the two records can be compared on variable $v$ and 0 otherwise (e.g., when a value is missing). For our common variables, which are either categorical or logical, the distance ${d}_{ijv}$ is 0 if ${x}_{iv}={x}_{jv}$ and 1 otherwise, where $x$ represents the matching variables. We tested various sizes of the potential donor subset (i.e., the value of $k$) and selected 20 as the most suitable value. Moreover, we applied the concept of donation classes to constrain the matching process within each of the four major regions of the GTA. This approach ensured that donor records from Toronto are matched with receptor records from the same area and so on. This is important since the regions vary widely regarding land-use patterns and transportation infrastructure, ultimately affecting their residents' travel behaviour and attitudes.
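The matching step can be sketched as follows for categorical matching variables. The record layout, the field names, and the similarity weighting used for the random draw (one minus the Gower distance, plus a small constant) are assumptions made for illustration. The source states only that the selection probability depends on similarity.

```python
import random


def gower_distance(a, b, variables, weights):
    """Weighted Gower dissimilarity for categorical variables.

    A value of None is treated as missing and the variable is skipped.
    """
    numerator = denominator = 0.0
    for v in variables:
        if a.get(v) is None or b.get(v) is None:
            continue  # records cannot be compared on this variable
        w = weights.get(v, 1.0)
        numerator += w * (0.0 if a[v] == b[v] else 1.0)
        denominator += w
    return numerator / denominator if denominator > 0 else 1.0


def knn_hot_deck(receptor, donors, variables, weights, target,
                 k=20, class_var="region", rng=random):
    """Draw one imputed value of `target` for a receptor record."""
    # Donation classes: only donors from the same region are eligible.
    pool = [d for d in donors if d[class_var] == receptor[class_var]]
    if not pool:
        raise ValueError("no donor available in this donation class")
    scored = sorted(
        ((gower_distance(receptor, d, variables, weights), i)
         for i, d in enumerate(pool))
    )[:k]
    neighbours = [pool[i] for _, i in scored]
    # Assumed weighting: closer donors are more likely to be drawn.
    similarities = [1.0 - dist + 1e-6 for dist, _ in scored]
    return rng.choices(neighbours, weights=similarities, k=1)[0][target]


donors = [
    {"region": "Toronto", "gender": "F", "age_band": "25-44", "risk": "agree"},
    {"region": "Toronto", "gender": "M", "age_band": "45-64", "risk": "agree"},
    {"region": "Peel", "gender": "F", "age_band": "25-44", "risk": "disagree"},
]
receptor = {"region": "Toronto", "gender": "F", "age_band": "25-44"}
match_vars = ["gender", "age_band"]
imputed = knn_hot_deck(receptor, donors, match_vars, {}, "risk",
                       rng=random.Random(0))
```

Repeating the draw for every receptor record, and then repeating the whole pass many times, would yield multiple fused datasets of the kind described above.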

4.4 Quality evaluation

Although certain prerequisites for coherence and association are ensured in the data fusion process, the results obtained must still be validated in terms of their potential to provide reliable and accurate estimates. In this regard, Rässler (39) proposes four levels of validity:

1. The marginal and joint distributions of target variables in the donor sample are preserved in the fused file,

2. The correlation structure and higher moments of the variables are preserved after fusion,

3. The true joint distribution of all variables is reflected in the fused file,

4. The true but unknown values of the target variable of the recipient units are reproduced.

In this study, we adopt levels 1 and 4 to assess the quality of the fused attitudinal data. For level 1, we compare the marginal distributions of the target variables in the donor and the fused datasets. Figure 3 below shows that the observed shares of the target variables in the SiSTM satellite survey are replicated reasonably well in the fused datasets, with minor deviations. For a more detailed comparison, we computed the mean, standard deviation, and skewness of the distributions, which also indicate that the aggregate fusion results are quite accurate.
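A level 1 check of this kind can be sketched in a few lines. The code assumes Likert responses are coded as integers from 1 (strongly disagree) to 5 (strongly agree), which is an illustrative coding rather than one stated in the source. The function names are likewise hypothetical.

```python
import math
from collections import Counter


def shares(responses, levels=(1, 2, 3, 4, 5)):
    """Share of responses at each Likert level."""
    counts = Counter(responses)
    n = len(responses)
    return {level: counts.get(level, 0) / n for level in levels}


def moments(responses):
    """Mean, population standard deviation, and skewness of the responses."""
    n = len(responses)
    mu = sum(responses) / n
    m2 = sum((r - mu) ** 2 for r in responses) / n
    m3 = sum((r - mu) ** 3 for r in responses) / n
    sd = math.sqrt(m2)
    skew = m3 / sd ** 3 if sd > 0 else 0.0
    return mu, sd, skew


def max_share_gap(donor, fused):
    """Largest absolute difference in level shares between two samples."""
    p, q = shares(donor), shares(fused)
    return max(abs(p[level] - q[level]) for level in p)


donor_sample = [1, 2, 3, 4, 5]
fused_sample = [1, 2, 3, 4, 5, 3]
gap = max_share_gap(donor_sample, fused_sample)
```

Small gaps in shares and closely matching moments between the donor and fused samples correspond to the first validity level.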

Most data fusion studies validate their output only through measures of internal consistency, similar to the level 1 validation above. This is because the true value of the target variables of the recipient units is rarely known (hence the necessity of the fusion). Some studies, like (33), have tried to circumvent this issue by validating their data fusion method based on a subset of the donor data, which is treated as the pseudo-receptor sample for the validation process. However, such validation should be interpreted cautiously. It is important to acknowledge that while treated differently, the donor and the pseudo-receptor samples originate from the same dataset. The performance of the fusion method might not be the same when applied to the actual receptor sample, which will typically have a very different data generation process. In our study, however, we conduct a level 4 validation by comparing true (observed) values of a target variable of the actual recipient units in the core survey with their imputed values. This comparison is possible because the core COVHITS survey collected attitudinal responses for some of the target variables directly from the recipient units. These common attitudinal variables shared between the core and the satellite surveys were used as the basis of the validation shown in Fig. 4.

The figure shows that the data fusion method replicates the general trend of the attitudinal responses for most of the validation variables. For some variables, such as “risk of carpooling” and “less willing to travel”, the observed and imputed distributions are quite different. The differences indicate that people have become less concerned about the risk associated with carpooling and more willing to travel than they were earlier in the pandemic. Overall, this change in attitude suggests that people are gradually getting used to the COVID-19 pandemic, especially since the vaccination rate increased substantially between the two survey periods (July and September 2021).

Table 3 summarizes the results of the fusion process. To conserve space, it only shows the attitudinal statements imputed from the SiSTM survey. The other imputed variables are shown in Tables A2 and A3 in the Appendix. However, the following discussion applies to all the variables considered in the study. We clustered the records in the final synthetic files into two groups based on their imputed attitudinal attributes. The “do not agree” cluster comprises records with imputed attitudes of strongly disagree, disagree, or neutral for an imputed target variable. On the other hand, the “agree” cluster comprises the respondents who agree or strongly agree with the attitudinal statements based on their imputed values.
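The two-cluster summary can be illustrated with a short sketch. It assumes imputed Likert values are coded 1 to 5, with 4 and 5 meaning agree and strongly agree. The record fields (`age`, `trips`) and the toy records are hypothetical and do not reproduce the survey data.

```python
def cluster_label(likert_value):
    """Map a 1-5 Likert value to the two clusters used in the summary."""
    return "agree" if likert_value >= 4 else "do not agree"


def summarize_by_cluster(records, target):
    """Share of records, mean age, and mean trip rate for each cluster."""
    groups = {"do not agree": [], "agree": []}
    for rec in records:
        groups[cluster_label(rec[target])].append(rec)
    summary = {}
    for label, members in groups.items():
        if not members:
            continue  # skip empty clusters to avoid division by zero
        summary[label] = {
            "share": len(members) / len(records),
            "mean_age": sum(m["age"] for m in members) / len(members),
            "trip_rate": sum(m["trips"] for m in members) / len(members),
        }
    return summary


# Toy records: imputed response to one statement, age, and daily trips.
toy_records = [
    {"risk_outside": 5, "age": 60, "trips": 1},
    {"risk_outside": 4, "age": 50, "trips": 0},
    {"risk_outside": 3, "age": 40, "trips": 2},
    {"risk_outside": 1, "age": 30, "trips": 2},
]
toy_summary = summarize_by_cluster(toy_records, "risk_outside")
```

With multiple fused datasets, such a summary would be computed on each dataset and then averaged, although the source does not state how the table values were aggregated.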

Table 3

Summary results of the data fusion process (for variables imputed from the SiSTM survey)
Target variable		%	Age	Average trip rates (trips per person)
				Total		Work		School		Shopping		Other
Risk of being outside	Do not agree	35.2%	44.1	1.011		0.204		0.032		0.133		0.179
	Agree	64.8%	50.8	1.005		0.172		0.019		0.147		0.203
Risk of ridesourcing	Do not agree	30.6%	45.8	1.038		0.204		0.031		0.142		0.185
	Agree	69.4%	49.6	0.993		0.174		0.020		0.142		0.199
Risk of taxi	Do not agree	31.7%	45.3	1.022		0.218		0.024		0.132		0.186
	Agree	68.3%	49.9	1.000		0.167		0.024		0.146		0.199
Risk of carpooling	Do not agree	26.8%	43.5	1.079		0.250		0.033		0.123		0.178
	Agree	73.2%	50.2	0.981		0.158		0.020		0.149		0.201
Risk of car-sharing	Do not agree	32.8%	43.4	1.006		0.202		0.031		0.129		0.184
	Agree	67.2%	50.9	1.007		0.174		0.020		0.148		0.200
Less willing to travel	Do not agree	41.0%	46.0	1.049		0.214		0.024		0.131		0.194
	Agree	59.0%	50.1	0.978		0.161		0.024		0.149		0.195
Less willing to visit distant places	Do not agree	36.2%	46.6	1.027		0.202		0.028		0.139		0.184
	Agree	63.8%	49.5	0.996		0.172		0.021		0.144		0.201
Prefer social distancing	Do not agree	27.8%	44.5	1.090		0.232		0.036		0.136		0.184
	Agree	72.2%	50.0	0.975		0.164		0.019		0.144		0.199
Target variable					Modal share
					Driver		Passenger		Transit		Walk & cycle		Other
Risk of being outside	Do not agree			60.01%		8.74%		15.34%		14.76%		1.15%
	Agree			65.83%		8.51%		11.64%		12.79%		1.23%
Risk of ridesourcing	Do not agree			57.61%		7.80%		16.09%		17.03%		1.46%
	Agree			66.61%		8.95%		11.50%		11.85%		1.09%
Risk of taxi	Do not agree			61.58%		7.56%		15.33%		13.93%		1.60%
	Agree			64.81%		9.08%		11.82%		13.27%		1.02%
Risk of carpooling	Do not agree			59.27%		9.00%		16.32%		14.11%		1.30%
	Agree			65.59%		8.43%		11.59%		13.23%		1.17%
Risk of car-sharing	Do not agree			57.89%		8.24%		16.55%		16.02%		1.30%
	Agree			66.64%		8.76%		11.19%		12.25%		1.16%
Less willing to travel	Do not agree			63.35%		7.12%		14.68%		13.53%		1.31%
	Agree			64.09%		9.68%		11.65%		13.45%		1.13%
Less willing to visit distant places	Do not agree			63.42%		8.51%		14.17%		12.87%		1.04%
	Agree			63.98%		8.64%		12.23%		13.84%		1.30%
Prefer social distancing	Do not agree			62.01%		7.13%		15.79%		13.72%		1.35%
	Agree			64.53%		9.22%		11.72%		13.38%		1.14%

Overall, the imputed attitudinal variables meet our a priori expectations regarding the socioeconomic status of the respondents and their travel behaviour. Age is a significant factor affecting individuals’ perception of risks and adjustment to travel during the pandemic. The “agree” clusters are older than the “do not agree” clusters for each related attitudinal statement. Older respondents are also more concerned about the various aspects of the pandemic (e.g., daily new case count, the emergence of new variants, mortality rate). These findings fit the expectation, as COVID-19 is considered more deadly for the aged population (40). Older respondents also perceive public transit as less safe (especially bus/streetcar) and would not feel safe using it even after being fully vaccinated, as long as not enough people have been vaccinated to formally end the crisis. In terms of online vs. in-store grocery shopping, the older respondents demonstrated a greater preference for in-store shopping during the pandemic, as reflected by the three attitudinal statements related to in-store shopping. On the other hand, the younger population demonstrates a greater preference for online grocery shopping.

The imputed attitudinal variables also demonstrate that differences in perceptions and attitudes lead to alterations in travel behaviours. Overall, individuals with higher imputed levels of perceived risks made fewer trips on a given weekday. For most of the risk-related variables (e.g., risk of being outside, risk of ride-sourcing, risk of taxi, risk of carpooling), respondents who agree with the statements (indicating higher perceived risks) generated lower average trip rates. Similarly, individuals who adjusted their travel preferences during the pandemic (i.e., those who agree with the statements regarding being less willing to travel, being less willing to visit distant places, and preferring social distancing) made fewer daily trips. In addition, individuals who agree with the advantages of telecommuting completed fewer work trips per day. However, no direct correlation was observed for those who prefer hybrid work arrangements over the exclusive work-from-home arrangement. Individuals with a greater preference for online grocery shopping made fewer shopping trips. In contrast, those who have a greater preference for in-store shopping made more shopping trips on a given weekday. However, the preference for online grocery shopping did not directly contribute to reducing out-of-home activities during the COVID-19 pandemic. This is consistent with the findings of other studies in the literature (41).

Regarding modal share, individuals with higher perceived risk are more inclined to rely on driving and avoid public transit than groups with lower risk perceptions. A similar trend is observed for individuals who are more concerned about the pandemic and adjusted their travel patterns during this period. In fact, some of the lowest shares of transit trips and highest shares of driving trips observed in the final synthetic data are for individuals who have higher risk perceptions and who would not feel safe using transit during the different stages of the pandemic. This indicates that the perceived risk associated with a travel mode significantly affects travellers' mode choices during the pandemic. Among the different types of transit vehicles, bus/streetcar is especially critical, given that only 17% of the respondents agree with the statement that they perceive bus/streetcar to be mostly safe (which makes sense given the proximity of riders in these vehicles compared to the subway or regional transit). However, this specific group of individuals has the highest transit modal share among all the clusters identified in the table. This observation has important policy implications for the post-pandemic recovery of transit usage: restoring public trust is crucial, and transit agencies should invest in mitigating the public’s health and safety concerns about riding transit (especially the bus/streetcar network). These intuitive results indicate that the fusion method can be reliably used to produce a more detailed representation of travel behaviour during the pandemic by merging information from the core and the satellite surveys.

The synthetic fused data were further analyzed to investigate the factors affecting commute mode choice behaviour in the GTA during the pandemic. Specifically, we take a subset of the travel diary data representing commuting trips from the core survey, the socio-demographic information of the respondents, and their attitudinal statements imputed from the satellite surveys. We use this information to jointly model the determinants of mode choices with respondents’ attitudes and risk perceptions (latent attributes derived from the imputed attitudinal variables) using a hybrid choice model (HCM) framework (42, 43).

Five major commute modes are observed in the model estimation data: car drive, car passenger, transit, walk, and bicycle. Although some commuting trips used other modes, they are too few to yield reasonable market shares, so these observations are omitted. Transportation level-of-service (LOS) attributes of the five modes are generated using a Google application programming interface (API) framework, namely the Tool for Incorporating Level of Service attributes (TILOS) (44). TILOS generates mode-specific travel time and distance information using trip origin and destination coordinates and departure time. Auto LOS relies on a mix of historical and real-time travel data. Transit LOS uses General Transit Feed Specification (GTFS) data. To generate cost by auto mode, we employed a list of available cost matrices widely used for transportation planning by various government and public agencies in the study region (45). For transit fare, we used a calibrated Deterministic User Equilibrium traffic assignment model of the study area called the GTA model, which generates transit fare based on origin and destination traffic analysis zones and departure times.

The choice set for each individual was determined using feasibility constraints: one must have a driver’s license and a car to use the car drive mode; a total transit travel time over 120 min is considered infeasible for commuting; a distance over 3 km is considered infeasible for walking; and a distance over 10 km is considered infeasible for cycling. The feasibility constraints for transit, walk, and bicycle are based on the sample data. In the raw data, 95% of the commuting trips made by walking have a trip length of less than 3 km. Similarly, 10 km for bicycle trips and 120 min for transit trips correspond to the 95th percentile and 98th percentile values, respectively. After removing the observations with infeasible mode choices, missing personal and household socioeconomic information, and missing level-of-service attributes, a final dataset of 3,319 commuting trips and commuters is obtained. In this dataset, car drive is the dominant commuting mode, with a 65% share. Car passenger accounts for 6.5% of daily commuting trips. The transit share is 19.5%, whereas walk and bicycle account for 8% and 1% of daily commuting trips, respectively.
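The feasibility rules above can be sketched as a small availability function. This is a minimal illustration rather than the authors' code: the `Commuter` fields and mode labels are hypothetical names, "having a car" is assumed to mean at least one household vehicle, and car passenger is assumed to be always available because the text states no constraint for it.

```python
from dataclasses import dataclass

MODES = ("car_drive", "car_passenger", "transit", "walk", "bicycle")


@dataclass
class Commuter:
    # Hypothetical field names, chosen for illustration only.
    has_licence: bool
    household_vehicles: int
    transit_time_min: float  # total transit travel time
    distance_km: float       # trip length


def feasible_modes(c: Commuter) -> list[str]:
    """Return the modes that pass the feasibility constraints stated in the text."""
    available = {
        # Licence and car both required (car access assumed = household vehicles > 0).
        "car_drive": c.has_licence and c.household_vehicles > 0,
        # No constraint is stated for car passenger; assumed always available.
        "car_passenger": True,
        "transit": c.transit_time_min <= 120,  # over 120 min is infeasible
        "walk": c.distance_km <= 3,            # over 3 km is infeasible
        "bicycle": c.distance_km <= 10,        # over 10 km is infeasible
    }
    return [m for m in MODES if available[m]]


example = Commuter(has_licence=True, household_vehicles=1,
                   transit_time_min=45.0, distance_km=5.0)
print(feasible_modes(example))
```

An observation whose chosen mode is not in the returned list would be one of the "infeasible mode choice" records removed before estimation.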

6.1 Model specification

Factor analysis was conducted to identify latent factors based on the imputed attitudinal questions (referred to here as indicator variables). The most consistent findings were obtained using two factors, which loaded strongly onto eight statements (with loadings larger than 0.4) as shown in Table 4.

Table 4

Definition of latent attitudinal factors based on the fusion outputs
Latent Construct	Observed indicator	Factor Loading
Perception of increased risk during the pandemic	I believe there are more risks associated with leaving my home than before the pandemic	0.402
	I believe there is more risk associated with using ride-sourcing services than before the pandemic	0.400
	I believe there is more risk associated with using taxi services than before the pandemic	0.494
	I believe there is more risk associated with carpooling than before the pandemic	0.445
	I believe there is more risk associated with using car-sharing services (e.g., Zipcar, Communauto) than before the pandemic	0.436
Concerns regarding the pandemic	I am concerned about the number of daily new cases in Ontario, Canada	0.479
	I am concerned about the emergence of the new variant of COVID-19	0.483
	I am concerned about the mortality rate of the disease which is causing the pandemic	0.445

Figure 5 shows the two components of the HCM – the discrete choice model and the latent variable model consisting of structural and measurement equations. In the HCM, the utility $U$ of mode $m$ for individual $n$ is defined by the following function, which combines both observed and latent variables:

$${U}_{m,n}={V}_{m,n}+{\epsilon }_{m,n}={\beta }_{m}{X}_{m,n}+{\lambda }_{l}{\alpha }_{l,n}+{\epsilon }_{m,n}$$

Here, $V$ represents the systematic component of the utility function of the corresponding alternative,

$X$ is a vector of observed variables (socio-demographic attributes and modal LOS attributes),

$\beta$ is a vector of utility coefficients associated with the observed variables,

$\alpha$ is a vector of latent variables ($l=1,\dots ,L$ where $L=2$),

$\lambda$ is a vector of coefficients associated with the latent variables,

$\epsilon$ represents a random error to capture the unobserved component of the utility function of the corresponding alternative. The error is IID across alternatives and observations and follows a type I extreme value distribution.

The structural equation of the latent variable ${\alpha }_{l}$ for individual $n$ is given by:

$${\alpha }_{l,n}={\gamma }_{l}{Z}_{l,n}+{\eta }_{l,n}$$

Where $Z$ is a vector of socio-demographic characteristics of individual $n$,

$\gamma$ is a vector of estimated parameters,

$\eta$ follows a standard Normal distribution across individuals, capturing the random component of the latent variable.

Thus, the likelihood of the observed choice $m$ for individual $n$, conditional on $\beta$ and ${\alpha }_{n}$ is given by:

$${L}_{{C}_{n}}\left(\beta ,{\alpha }_{n}\right)=\frac{{e}^{{V}_{{m}^{*},n}}}{{\sum }_{m=1}^{5}{e}^{{V}_{m,n}}}$$

Where ${m}^{*}$ is the alternative chosen by individual $n$. The two latent variables are also used to explain the value of the associated attitudinal questions, where we adopt ordered logit specifications. The likelihood of the ordered logit models is given by:

$${L}_{{I}_{n}}\left(\tau ,\zeta ,{\alpha }_{n}\right)=\prod _{i=1}^{I}\left(\sum _{p=1}^{5}\left({I}_{n,i}=p\right)\left[\frac{{e}^{{\tau }_{i,p}-{\zeta }_{i}{\alpha }_{n}}}{1+{e}^{{\tau }_{i,p}-{\zeta }_{i}{\alpha }_{n}}}-\frac{{e}^{{\tau }_{i,p-1}-{\zeta }_{i}{\alpha }_{n}}}{1+{e}^{{\tau }_{i,p-1}-{\zeta }_{i}{\alpha }_{n}}}\right]\right)$$

Where, ${\zeta }_{i}$ is an estimated parameter that measures the impact of ${\alpha }_{n}$ on the attitudinal indicator ${I}_{i}$, and ${\tau }_{i}$ is a vector of threshold parameters for this indicator. The term $\left({I}_{n,i}=p\right)$ will be equal to 1 if and only if individual $n$ answers with level $p$ to indicator ${I}_{i}$, where $p=1,\dots ,5$.
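As a hedged illustration of the ordered logit measurement equation, the sketch below computes the five response probabilities for one indicator. It assumes the usual conventions $\tau_{i,0}=-\infty$ and $\tau_{i,5}=+\infty$, which the text does not state explicitly. The thresholds and $\zeta$ are the NN fusion estimates for "Risk of being outside" in Table 5; the two values of the latent variable are arbitrary.

```python
import math


def logistic(x: float) -> float:
    """Logistic CDF, e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))


def ordered_logit_probs(thresholds, zeta, alpha):
    """Return [P(I = 1), ..., P(I = K)] where K = len(thresholds) + 1."""
    # Cumulative probabilities at tau_0 = -inf, tau_1..tau_{K-1}, tau_K = +inf.
    cdf = [0.0] + [logistic(t - zeta * alpha) for t in thresholds] + [1.0]
    return [cdf[p] - cdf[p - 1] for p in range(1, len(cdf))]


tau = [-2.984, -1.722, -0.168, 1.891]  # Table 5, "Risk of being outside"
zeta = 0.263                           # Table 5, impact of latent variable

probs_low = ordered_logit_probs(tau, zeta, alpha=-1.0)   # arbitrary low value
probs_high = ordered_logit_probs(tau, zeta, alpha=1.0)   # arbitrary high value
print(probs_low)
print(probs_high)
```

With a positive $\zeta$, raising the latent variable shifts probability mass towards the higher response levels, which is the behaviour the measurement model is meant to capture.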

The combined log-likelihood of the HCM is then given by:

$$LL\left(\gamma ,\zeta ,\tau ,\beta \right)=\sum _{n=1}^{N}\log {\int }_{{\eta }_{n}}{L}_{{C}_{n}}\left(\beta ,{\alpha }_{n}\right){L}_{{I}_{n}}\left(\tau ,\zeta ,{\alpha }_{n}\right)\varphi \left({\eta }_{n}\right)\,d{\eta }_{n}$$
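The integral over $\eta_n$ has no closed form and is typically approximated by simulation. The sketch below shows one individual's contribution under simplifying assumptions that are ours, not the authors': a single latent variable (the paper uses two), mode-specific latent-variable coefficients as reported in Table 5, and plain pseudo-random Normal draws. All function and argument names are hypothetical, and this is Python for illustration only, not Apollo syntax.

```python
import numpy as np


def mnl_prob(V, chosen):
    """Logit probability of the chosen alternative for each draw; V is (R, M)."""
    V = V - V.max(axis=1, keepdims=True)  # numerical stability
    expV = np.exp(V)
    return expV[:, chosen] / expV.sum(axis=1)


def _cdf(cut, shift):
    """Logistic CDF at (cut - shift), handling the infinite end thresholds."""
    if np.isinf(cut):
        return np.full_like(shift, 1.0 if cut > 0 else 0.0)
    return 1.0 / (1.0 + np.exp(-(cut - shift)))


def ordered_logit_prob(tau, zeta, alpha, level):
    """P(I = level) for each draw, with level in 1..len(tau)+1."""
    cuts = np.concatenate(([-np.inf], tau, [np.inf]))
    shift = zeta * alpha
    return _cdf(cuts[level], shift) - _cdf(cuts[level - 1], shift)


def simulated_loglik_one(x_utils, lam, chosen, gamma_z, indicators, tau, zeta, draws):
    """
    Simulated log-likelihood contribution of one individual.

    x_utils    : (M,) observed utility part beta * X for each mode
    lam        : (M,) latent-variable coefficient in each mode's utility
    chosen     : index of the chosen mode
    gamma_z    : scalar gamma * Z for this individual
    indicators : answered level (1..5) for each indicator
    tau, zeta  : (I, 4) thresholds and (I,) latent-variable impacts
    draws      : (R,) standard Normal draws of eta
    """
    alpha = gamma_z + draws                          # structural equation
    V = x_utils[None, :] + alpha[:, None] * lam[None, :]
    lik = mnl_prob(V, chosen)                        # choice component
    for i, level in enumerate(indicators):           # measurement component
        lik = lik * ordered_logit_prob(tau[i], zeta[i], alpha, level)
    return np.log(lik.mean())                        # average over draws, then log
```

Summing this quantity over all individuals gives the simulated counterpart of the combined log-likelihood. Note that the log is taken after averaging over draws, mirroring the position of the logarithm outside the integral.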

Having imputed $s$ instances of each attitudinal statement in the fusion process, we could estimate $s$ hybrid choice models, which in turn allowed us to generate the distribution of each parameter in the model and to understand in detail the extent of the effect of each mode choice attribute. To do so, we first finalized the model specification using the nearest neighbour (NN) fusion result. We then re-estimated the model for each of the 100 sets of imputed attitudinal statements to generate distributions of the parameters. All models were coded and estimated in Apollo v.0.1.0 (46).
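The summary step can be sketched as follows: given one estimate of each parameter per imputed dataset, compute the percentiles reported in Table 5. This is an assumed workflow with hypothetical names; the estimates below are random placeholder numbers, not results from the paper.

```python
import numpy as np

PERCENTILES = [10, 25, 50, 75, 90]


def summarise_estimates(estimates):
    """
    estimates: dict mapping a parameter name to a sequence of estimates,
               one per imputed dataset.
    Returns a dict mapping each name to its 10th/25th/50th/75th/90th percentiles.
    """
    return {
        name: np.percentile(np.asarray(values, dtype=float), PERCENTILES)
        for name, values in estimates.items()
    }


# Placeholder values standing in for 100 re-estimations (NOT the paper's results).
rng = np.random.default_rng(0)
fake_estimates = {"b_cost": rng.normal(-0.5, 0.02, size=100)}
print(summarise_estimates(fake_estimates))
```

Reporting percentiles instead of a single point estimate shows how sensitive each coefficient is to the particular imputation draw.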

6.2 Model estimation results and discussion

The results of the HCM estimation are shown in Table 5. The final model specification, developed using the NN fusion output, retains variables with the expected signs and statistical significance. A t-statistic critical value of 1.96 (95% confidence level) is used as the threshold for retaining variables in the model. However, some parameters with absolute t-statistics below 1.96 are retained because the corresponding variables provide considerable insight into the behavioural process. In all the subsequent estimations (using the multiple imputation fusion outputs), the same attributes were found to be significant with similar signs (and somewhat similar magnitudes), highlighting the statistical robustness of the final specification reported in Table 5.

Table 5

Hybrid choice model of commuting modes estimated using fusion outputs
		NN fusion		Multiple imputation fusion
Variable	Mode	Parameter	t-stat	10th perc.	25th perc.	Median	75th perc.	90th perc.
Choice model component
Alternative specific constant	Car drive	-12.351	-3.428	-8.583	-8.235	-7.479	-6.582	-6.093
	Car passenger	0.000	-	0.000	0.000	0.000	0.000	0.000
	Transit	3.709	3.571	4.768	5.102	5.342	5.568	5.778
	Walk	3.595	5.016	4.010	4.088	4.263	4.493	4.604
	Bicycle	0.685	0.866	1.143	1.234	1.403	1.637	1.730
In-vehicle travel time (min)	Car drive	-0.144	-2.603	-0.185	-0.175	-0.165	-0.155	-0.150
	Car passenger	-0.376	-8.312	-0.421	-0.416	-0.413	-0.408	-0.400
	Transit	-0.102	-5.632	-0.113	-0.111	-0.110	-0.108	-0.107
(Access + egress) time (min)	Transit	-0.080	-2.195	-0.098	-0.096	-0.089	-0.084	-0.081
Number of transfer(s)	Transit	-0.768	-2.409	-0.978	-0.949	-0.918	-0.893	-0.846
Travel cost ($)	All motorized modes	-0.450	-3.620	-0.556	-0.548	-0.534	-0.521	-0.509
Parking cost ($)	Car drive	-0.049	-1.630	-0.052	-0.050	-0.048	-0.047	-0.047
Trip length (km)	Walk & Bicycle	-1.062	-8.430	-1.181	-1.168	-1.152	-1.134	-1.123
Number of vehicles in household	Car drive	6.015	5.557	4.518	4.584	4.836	4.898	5.088
Number of vehicles in household	Car passenger	0.525	2.927	0.540	0.562	0.575	0.595	0.608
Number of bicycles in household	Bicycle	0.473	3.258	0.445	0.452	0.457	0.461	0.470
Gender: Female	Bicycle	-1.384	-2.926	-1.487	-1.467	-1.449	-1.436	-1.422
Increased risk perception	Car drive	13.135	5.365	9.993	10.178	10.878	11.042	11.382
Increased risk perception	Car passenger	-2.996	-5.447	-3.534	-3.465	-3.313	-3.196	-3.107
Pandemic concern	Transit	-3.351	-8.927	-3.895	-3.837	-3.741	-3.684	-3.638
Structural model for latent variable "Increased risk perception"
Age		0.018	7.588	0.014	0.014	0.015	0.016	0.016
Usual workplace during COVID: workplace		0.163	2.472	0.175	0.181	0.189	0.197	0.207
Student		-0.439	-5.210	-0.547	-0.522	-0.502	-0.494	-0.473
Structural model for latent variable "Pandemic concern"
Age		0.009	2.508	0.010	0.012	0.013	0.015	0.016
Student		-0.190	-1.409	-0.186	-0.121	-0.091	-0.049	-0.019
Household income below $60,000		-0.373	-2.506	-0.478	-0.468	-0.409	-0.369	-0.348
Household has adult > 60 years old		0.629	5.506	0.550	0.568	0.625	0.663	0.697
Measurement models for latent variable "Increased risk perception" (ordered logit)
Risk of being outside
Threshold 1		-2.984	-22.695	-3.241	-3.163	-3.086	-3.025	-2.979
Threshold 2		-1.722	-19.818	-1.942	-1.898	-1.832	-1.770	-1.709
Threshold 3		-0.168	-2.266	-0.475	-0.433	-0.375	-0.340	-0.294
Threshold 4		1.891	20.144	1.598	1.647	1.687	1.762	1.820
Impact of latent variable		0.263	4.585	0.024	0.052	0.103	0.133	0.157
Risk of ridesourcing
Threshold 1		-3.794	-20.611	-4.635	-4.562	-4.441	-4.304	-4.086
Threshold 2		-2.262	-22.533	-2.735	-2.613	-2.564	-2.489	-2.424
Threshold 3		-0.521	-7.132	-0.782	-0.727	-0.689	-0.661	-0.596
Threshold 4		1.317	15.912	0.979	1.033	1.097	1.166	1.213
Impact of latent variable		0.223	3.809	0.033	0.058	0.113	0.171	0.216
Risk of taxi
Threshold 1		-3.464	-21.707	-4.245	-4.081	-3.980	-3.852	-3.766
Threshold 2		-2.029	-21.628	-2.611	-2.509	-2.418	-2.376	-2.321
Threshold 3		-0.354	-4.847	-0.675	-0.591	-0.569	-0.504	-0.471
Threshold 4		1.254	15.204	0.988	1.024	1.073	1.130	1.152
Impact of latent variable		0.232	4.038	0.014	0.037	0.078	0.115	0.150
Risk of carpooling
Threshold 1		-3.549	-20.977	-4.225	-4.092	-3.988	-3.888	-3.794
Threshold 2		-2.004	-21.016	-2.698	-2.631	-2.526	-2.469	-2.421
Threshold 3		-0.317	-4.192	-0.831	-0.813	-0.770	-0.725	-0.690
Threshold 4		1.448	16.330	0.976	1.006	1.036	1.092	1.124
Impact of latent variable		0.305	5.255	0.018	0.041	0.095	0.127	0.148
Risk of car-sharing
Threshold 1		-2.990	-22.239	-4.138	-4.006	-3.841	-3.767	-3.655
Threshold 2		-1.899	-20.273	-2.558	-2.506	-2.440	-2.381	-2.296
Threshold 3		-0.321	-4.145	-0.758	-0.722	-0.661	-0.604	-0.566
Threshold 4		1.623	17.452	1.071	1.112	1.186	1.220	1.244
Impact of latent variable		0.303	5.207	0.087	0.099	0.154	0.193	0.233
Measurement models for latent variable "Pandemic concern" (ordered logit)
Concerned about daily new cases
Threshold 1		-2.615	-20.524	-2.561	-2.462	-2.398	-2.308	-2.285
Threshold 2		-1.468	-14.573	-1.434	-1.396	-1.339	-1.276	-1.211
Threshold 3		-0.416	-4.441	-0.482	-0.422	-0.393	-0.349	-0.291
Threshold 4		1.469	13.591	1.214	1.280	1.325	1.356	1.471
Impact of latent variable		0.489	5.533	0.116	0.161	0.208	0.245	0.280
Concerned about new variants
Threshold 1		-3.643	-20.721	-3.304	-3.245	-3.080	-3.008	-2.881
Threshold 2		-1.749	-17.512	-1.761	-1.705	-1.653	-1.617	-1.533
Threshold 3		-0.752	-8.318	-0.831	-0.758	-0.719	-0.681	-0.574
Threshold 4		0.832	8.509	0.585	0.658	0.689	0.736	0.833
Impact of latent variable		0.429	5.343	0.087	0.151	0.200	0.253	0.333
Concerned about mortality rate
Threshold 1		-3.147	-20.616	-3.031	-2.940	-2.814	-2.731	-2.661
Threshold 2		-1.739	-15.899	-1.757	-1.697	-1.653	-1.601	-1.525
Threshold 3		-0.483	-4.828	-0.504	-0.453	-0.409	-0.374	-0.323
Threshold 4		1.260	11.187	1.118	1.159	1.215	1.250	1.293
Impact of latent variable		0.519	6.200	0.237	0.250	0.294	0.337	0.399

The specification of the choice model component shows that LOS attributes (travel cost, trip length, the different trip time components, and the number of transit transfers) have signs that match expectations. Among the different travel time components, in-vehicle travel time is more relevant to commute mode choice decisions than walking (access and egress) time. Such findings may be related to the high level of transit access coverage in the study area. In terms of personal and household attributes, the model shows that females are less likely to cycle than males. In addition, as expected, household vehicle ownership positively affects car use (car drive and car passenger), and household bicycle ownership positively affects bicycle use.
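To make the comparison of LOS coefficients concrete, the sketch below computes a few trade-off ratios from the NN fusion point estimates in Table 5. This is illustrative arithmetic only: it ignores the latent variables and the estimation uncertainty, the dictionary keys are hypothetical labels, and the resulting ratios are not results reported by the authors.

```python
# NN fusion point estimates copied from Table 5.
coef = {
    "ivtt_car_drive": -0.144,         # in-vehicle travel time (per min)
    "ivtt_car_passenger": -0.376,
    "ivtt_transit": -0.102,
    "access_egress_transit": -0.080,  # access + egress time (per min)
    "transfer_transit": -0.768,       # per transfer
    "travel_cost": -0.450,            # per dollar, all motorized modes
}


def ratio(numerator: str, denominator: str) -> float:
    """Marginal rate of substitution between two linear utility terms."""
    return coef[numerator] / coef[denominator]


# Disutility of one in-vehicle minute relative to one access/egress minute.
ivtt_vs_access = ratio("ivtt_transit", "access_egress_transit")

# Implied value of in-vehicle time in dollars per hour, by mode.
value_of_time_per_hour = {
    mode: 60 * ratio(f"ivtt_{mode}", "travel_cost")
    for mode in ("car_drive", "car_passenger", "transit")
}

# Dollar equivalent of one transit transfer.
transfer_in_dollars = ratio("transfer_transit", "travel_cost")

print(ivtt_vs_access, value_of_time_per_hour, transfer_in_dollars)
```

The first ratio is above one, which is the sense in which in-vehicle time weighs more heavily than access and egress time in this specification.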

Regarding the role of the latent attitudinal variables, the choice model shows a strong and positive effect of “increased risk perception” on the car drive mode. It seems that the decision to commute by car is mainly determined by the increased risk perception associated with travelling during the pandemic rather than by traditional personal and modal attributes (given the high parameter value and significance of the latent variable). Interestingly, the same latent variable shows a negative effect for the car passenger mode, which is understandable given that the car passenger mode in our model also includes the option of sharing rides with non-household members. Thus, individuals with higher levels of increased risk perception are less likely to share rides with others. As for “pandemic concern”, individuals who are more concerned about the various aspects of the pandemic are found to be less likely to choose transit as their commute mode. These findings are in line with previous studies (3, 5, 9).

The estimates of the parameters $\gamma$ in the structural models help explain the influence of individuals’ socio-demographic characteristics on the latent variables. For example, older respondents and respondents who had to be physically present in their workplace during the pandemic are found to have higher risk perceptions. Similarly, older respondents and respondents who live with senior household members (aged 60 years and above) have increased concern regarding the pandemic. These findings make intuitive sense, given that COVID-19 has been found to be more dangerous for the older population. In contrast, individuals whose household income is below $60,000 are less likely to be concerned about the pandemic than higher-income individuals. In terms of the measurement models, the $\zeta$ estimates have the expected sign (positive) and are statistically significant, confirming the results of the factor analysis. Thus, more positive values of the risk perception and pandemic concern latent variables increase the probability of stronger agreement with the associated attitudinal statements.

Overall, the findings of the HCM provide important behavioural insights about the commute mode choice decisions during the pandemic that meet a priori expectations and are consistent with other studies in the literature. This also highlights that the fused data can be reliably used for much more complex and stable investigations than would be possible individually with either the core or the satellite survey data.

This study demonstrates the feasibility of fusing a “core” household travel survey with specialized “satellite” surveys to evaluate the impacts of COVID-19 on passenger travel demand in the Greater Toronto Area (GTA). The study uses a non-parametric implicit data fusion method to generate multiple synthetic datasets that contain observed travel diaries and socioeconomic attributes of the trip-makers from the core survey, along with the imputed attitudinal statements based on the satellite surveys. Imputing multiple fused datasets helps reduce potential biases that can affect subsequent analyses using the data. Common personal and household attributes shared by the surveys are carefully analyzed to select the optimal set of matching variables for the fusion exercise. We use the new comprehensive data sets generated to conduct two types of analysis. First, we use descriptive analysis to investigate the correlation between the inferred attitudinal responses and travel behaviour indicators like trip rates and modal shares. Second, we also use the fused data sets to estimate hybrid choice models to identify the observed and latent factors affecting commute mode choice decisions in the GTA during the pandemic.

The descriptive analysis and the estimated choice models indicate that we can derive valid inferences using the fused data. For example, individuals with higher imputed levels of perceived risks are found to make fewer trips on a given weekday, whereas individuals who agree with the advantages of telecommuting complete fewer work trips per day. Individuals who have a greater preference for online grocery shopping made fewer shopping trips, whereas those who have a greater preference for in-store shopping made more shopping trips on a given weekday. However, the preference for online grocery shopping did not directly contribute to reducing out-of-home activities during the COVID-19 pandemic. The estimated choice models indicate that commuters prioritize pandemic-related attitudinal factors when choosing a mode. Specifically, the decision to commute by car is found to be mainly determined by the increased risk perception associated with travelling during the pandemic rather than traditional personal and modal attributes. Moreover, having imputed multiple fused data sets, we could estimate multiple instances of the choice model, which in turn allowed us to have a detailed understanding of the extent of the effect of each mode choice attribute.

The fusion results are validated by comparing (1) the marginal distributions of the target variables in the donor and the fused datasets and (2) the observed and imputed values of some of the target variables of the recipient units. While the market shares of the target variables were replicated reasonably well in the fused datasets, researchers must be careful when making interpretations directly from the inferred attitudinal statements. Even though the core and the satellite surveys were conducted only two months apart, the number of fully vaccinated people in the study area increased massively between them, which could have made respondents less concerned about the pandemic. This was reflected in the changed market shares of some of the attitudinal statements. However, the hybrid choice model results were less susceptible to such effects. This also highlights that the fused data can be reliably used for much more complex and stable investigations than would be possible individually with either the core or the satellite survey data.
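The first validation check, comparing marginal distributions between the donor and fused data, can be sketched as below. The total variation distance is our assumed summary statistic, since the text here does not name one, and the response vectors are toy values on a 1 to 5 Likert scale, not survey data.

```python
from collections import Counter


def marginal_shares(responses, levels):
    """Share of each response level in a list of responses."""
    counts = Counter(responses)
    n = len(responses)
    return {level: counts.get(level, 0) / n for level in levels}


def total_variation_distance(p, q):
    """Half the sum of absolute share differences (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)


LEVELS = [1, 2, 3, 4, 5]

# Toy responses for one attitudinal statement (illustrative only).
donor_responses = [1, 1, 2, 3, 3, 3, 4, 5, 5, 5]
fused_responses = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

donor_shares = marginal_shares(donor_responses, LEVELS)
fused_shares = marginal_shares(fused_responses, LEVELS)
print(total_variation_distance(donor_shares, fused_shares))
```

A value close to zero indicates that the fused data reproduce the donor's market shares for that statement. Repeating the calculation for every imputed dataset would show how stable the match is across imputations.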

Overall, this study presents a proof of concept of how the implicit data fusion method may be used to integrate data from multiple travel surveys to produce a more detailed representation of travel behaviour. The analysis also sheds light on the ideal satellite design for the core-satellite data collection paradigm. The validation shows that while it is important to have a comprehensive set of consistent and coherent common variables shared among the core and the satellites that ideally are well associated with the target variables, it is also important to conduct the surveys during the same period (to control for any external effects). This is especially important if the objective is to fuse attitudinal statements, because the indicators can evolve quickly due to uncontrollable external factors (like the vaccination rate in our exercise). The next phase of the work will focus on comparing the performance of the proposed method with explicit fusion methods, where parametric models will be developed using the donor data to estimate the most probable attitudinal responses, which will then be applied to the recipient data to infer the missing target variables. Future work should also test the effect of different cut-off values for the similarity score in the matching step of the implicit fusion. This will help identify an optimum similarity score that will generate better candidate records for matching.
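To illustrate what a similarity cut-off does in a matching step, the sketch below filters donor records by a Gower-type similarity (39) to a recipient. It is a simplified, assumed version of such a step rather than the procedure used in this study: the matching variables, the age range used for scaling, and the cut-off value are all hypothetical.

```python
def gower_similarity(a, b, numeric_ranges):
    """
    Unweighted Gower similarity between two records (dicts of matching variables).
    Variables listed in numeric_ranges are scaled by their range;
    all other variables are treated as categorical (match = 1, mismatch = 0).
    """
    scores = []
    for var in a:
        if var in numeric_ranges:
            scores.append(1.0 - abs(a[var] - b[var]) / numeric_ranges[var])
        else:
            scores.append(1.0 if a[var] == b[var] else 0.0)
    return sum(scores) / len(scores)


def candidate_donors(recipient, donors, numeric_ranges, cutoff):
    """Keep only donors whose similarity to the recipient reaches the cut-off."""
    return [d for d in donors
            if gower_similarity(recipient, d, numeric_ranges) >= cutoff]


# Hypothetical records and settings, for illustration only.
recipient = {"age": 40, "gender": "F", "student": False}
donors = [
    {"age": 46, "gender": "F", "student": False},
    {"age": 70, "gender": "M", "student": False},
]
ranges = {"age": 60}  # assumed observed age range in the pooled data

print(candidate_donors(recipient, donors, ranges, cutoff=0.8))
```

Raising the cut-off yields closer matches but fewer candidate donors per recipient, which is the trade-off that testing different cut-off values would explore.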

Funding: The research was funded by an NSERC Discovery Grant. However, the authors bear sole responsibility for all results, comments, and interpretations made in the paper.

Competing interests: On behalf of all authors, the corresponding author states that there is no conflict of interest.

Availability of data and material: Data used in this research were collected by the authors as part of a comprehensive data collection program to assess the impacts of COVID-19 on urban passenger travel demand in the Greater Toronto Area.

Code availability: The analyses presented in the paper were conducted using the “apollo” package in the statistical software “R” and various custom codes and applications.

Authors' contributions: Study conception and design: S. Hossain, K.N. Habib; Data preparation: S. Hossain, P. Loa, K. Wang, S.M. Mashrur, A. Dianat; Analysis and interpretation of results: S. Hossain, P. Loa, K. Wang, S.M. Mashrur, A. Dianat, K.N. Habib; Manuscript preparation: S. Hossain, P. Loa, K. Wang, S.M. Mashrur, K.N. Habib. All authors reviewed the results and approved the final version of the manuscript.

Lipsitch M, Swerdlow DL, Finelli L. Defining the Epidemiology of Covid-19 — Studies Needed. N Engl J Med [Internet]. 2020 Feb 19;382(13):1194–6. Available from: https://doi.org/10.1056/NEJMp2002125
Habib KN, Hawkins J, Shakib S, Loa P, Mashrur S, Dianat A, et al. Assessing the impacts of COVID-19 on urban passenger travel demand in the greater Toronto area: description of a multi-pronged and multi-staged study with initial results. Transp Lett [Internet]. 2021 May 28;13(5–6):353–66. Available from: https://doi.org/10.1080/19427867.2021.1899579
Shamshiripour A, Rahimi E, Shabanpour R, Mohammadian A (Kouros). How is COVID-19 reshaping activity-travel behavior? Evidence from a comprehensive survey in Chicago. Transp Res Interdiscip Perspect. 2020;7:100216.
Beck MJ, Hensher DA. Insights into the impact of COVID-19 on household travel and activities in Australia – The early days under restrictions. Transp Policy. 2020;96:76–93.
Bucsky P. Modal share changes due to COVID-19: The case of Budapest. Transp Res Interdiscip Perspect. 2020 Nov 1;8:100141.
de Haas M, Faber R, Hamersma M. How COVID-19 and the Dutch ‘intelligent lockdown’ change activities, work and travel behaviour: Evidence from longitudinal data in the Netherlands. Transp Res Interdiscip Perspect. 2020;6:100150.
Parady G, Taniguchi A, Takami K. Travel behavior changes during the COVID-19 pandemic in Japan: Analyzing the effects of risk perception and social influence on going-out self-restriction. Transp Res Interdiscip Perspect [Internet]. 2020;7:100181. Available from: https://www.sciencedirect.com/science/article/pii/S2590198220300920
Molloy J. MOBIS-COVID19/28: Results as of 16/11/2020 (Second Wave) [Internet]. 2020 [cited 2022 Jul 26]. Available from: https://ivtmobis.ethz.ch/mobis/covid19/reports/mobis_covid19_report_en_2020-11-16.html
Capasso da Silva D, Khoeini S, Salon D, Conway MW, Chauhan RS, Pendyala RM, et al. How are Attitudes Toward COVID-19 Associated with Traveler Behavior During the Pandemic? Findings [Internet]. 2021;(June). Available from: https://doi.org/10.32866/001c.24389
Bahamonde-Birke FJ, Kunert U, Link H, Ortúzar J de D. About attitudes and perceptions: finding the proper way to consider latent variables in discrete choice models. Transportation (Amst) [Internet]. 2017;44(3):475–93. Available from: https://doi.org/10.1007/s11116-015-9663-5
Chorus C. What about behaviour in travel demand modelling? An overview of recent progress. Transp Lett [Internet]. 2012 Apr 1;4(2):93–104. Available from: https://doi.org/10.3328/TL.2012.04.02.93-104
Popuri Y, Proussaloglou K, Ayvalik C, Koppelman F, Lee A. Importance of traveler attitudes in the choice of public transportation to work: findings from the Regional Transportation Authority Attitudinal Survey. Transportation (Amst) [Internet]. 2011;38(4):643–61. Available from: https://doi.org/10.1007/s11116-011-9336-y
Andridge RR, Little RJA. A Review of Hot Deck Imputation for Survey Non-response. Int Stat Rev [Internet]. 2010 Apr 1;78(1):40–64. Available from: https://doi.org/10.1111/j.1751-5823.2010.00103.x
Castanedo F. A Review of Data Fusion Techniques. Ursino D, Takama Y, editors. Sci World J [Internet]. 2013;2013:704504. Available from: https://doi.org/10.1155/2013/704504
Van Der Putten P, Kok JN, Gupta A. Data fusion through statistical matching. Available at SSRN 297501. 2002.
Miller E, Lee-Gosselin M, Habib K, Morency C, Roorda M, Shalaby A. Changing Practices in Data Collection on the Movement of People. 2014.
Amey A, Liu L, Pereira F, Zegras C, Veloso M, Bento C, et al. State of the practice overview of transportation data fusion: technical and institutional considerations. Transp Syst Pap Ser (MIT-Portugal Program). 2009.
Bachmann C, Abdulhai B, Roorda MJ, Moshiri B. A comparative assessment of multi-sensor data fusion techniques for freeway traffic speed estimation using microsimulation modeling. Transp Res Part C Emerg Technol. 2013 Jan 1;26:33–48.
Huang Z, Ling X, Wang P, Zhang F, Mao Y, Lin T, et al. Modeling real-time human mobility based on mobile phone and transportation data fusion. Transp Res Part C Emerg Technol. 2018 Nov 1;96:251–69.
Kusakabe T, Asakura Y. Behavioural data mining of transit smart card data: A data fusion approach. Transp Res Part C Emerg Technol. 2014 Sep 1;46:179–91.
Saffari E, Yildirimoglu M, Hickman M. Data fusion for estimating Macroscopic Fundamental Diagram in large-scale urban networks. Transp Res Part C Emerg Technol. 2022 Apr 1;137:103555.
Guo F, Polak JW, Krishnan R. Predictor fusion for short-term traffic forecasting. Transp Res Part C Emerg Technol. 2018 Jul 1;92:90–100.
Verreault H, Morency C. Integration of a phone-based household travel survey and a web-based student travel survey. Transportation (Amst) [Internet]. 2018;45:89–103. Available from: https://doi.org/10.1007/s11116-016-9726-2
Wang K, Hossain S, Habib KN. A hybrid data fusion methodology for household travel surveys to reduce proxy biases and under-representation of specific sub-group of population. Transportation (Amst) [Internet]. 2021; Available from: https://doi.org/10.1007/s11116-021-10228-x
Bayart C, Bonnel P, Morency C. Survey mode integration and data fusion: methods and challenges. In: Transport survey methods: Keeping up with a changing world. Emerald Group Publishing Limited; 2009. p. 587–611.
D’Orazio M, Di Zio M, Scanu M. Statistical matching: Theory and practice. Rome, Italy: John Wiley & Sons; 2006.
Müller K, Axhausen KW. Using Survey Calibration and Statistical Matching to Reweight and Distribute Activity Schedules. Transp Res Rec [Internet]. 2014 Jan 1;2429(1):157–67. Available from: https://doi.org/10.3141/2429-17
Saporta G. Data fusion and data grafting. Comput Stat Data Anal [Internet]. 2002;38(4):465–73. Available from: https://www.sciencedirect.com/science/article/pii/S016794730100072X
Aluja-Banet T, Daunis-i-Estadella J, Pellicer D. GRAFT, a complete system for data fusion. Comput Stat Data Anal. 2007;52(2):635–49.
Pawlak J, Sivakumar A, Polak JW. An Imputation Approach to the Fusion of Travel Diary and Lifestyle Data: Application to the Analysis of the Interaction of ICT and Physical Mobility. In: New Techniques and Technologies for Statistics Conference. Brussels, Belgium; 2013.
Kressner JD, Garrow LA. Using Third-Party Data for Travel Demand Modeling: Comparison of targeted Marketing, Census, and household travel survey Data. Transp Res Rec [Internet]. 2014 Jan 1;2442(1):8–19. Available from: https://doi.org/10.3141/2442-02
Sivakumar A, Polak JW. Exploration of data-pooling techniques: modeling activity participation and household technology holdings. In: 92nd Annual Meeting of the Transportation Research Board. Washington DC, USA; 2013.
Lugo M, Srinivasan S. Multimodal Transportation Choices and Health: Exploratory Analysis Using Data Fusion Techniques. Transp Res Rec [Internet]. 2016 Jan 1;2598(1):37–45. Available from: https://doi.org/10.3141/2598-05
Rubin DB. Multiple imputation for nonresponse in surveys. Vol. 81. John Wiley & Sons; 2004.
Goulias KG, Pendyala RM, Bhat CR. Keynote — Total Design Data Needs for the New Generation Large-Scale Activity Microsimulation Models. In: Zmud J, Lee-Gosselin M, Munizaga M, Carrasco JA, editors. Transport Survey Methods [Internet]. Emerald Group Publishing Limited; 2013. p. 21–46. Available from: https://doi.org/10.1108/9781781902882-002
Srikukenthiran S, Loa P, Hossain S, Chung B, Habib KN, Miller EJ. Transportation Tomorrow Survey (TTS) 2.0 Final Report. 2018. p. 121.
Data Management Group. 2016 TTS: Data Expansion and Validation [Internet]. 2018 [cited 2022 Jul 21]. Available from: http://dmg.utoronto.ca/pdf/tts/2016/2016TTS_DataExpansion.pdf
Gower JC. A General Coefficient of Similarity and Some of Its Properties. Biometrics [Internet]. 1971 Jul 26;27(4):857–71. Available from: http://www.jstor.org/stable/2528823
Rässler S. Statistical matching: A frequentist theory, practical applications, and alternative Bayesian approaches. Vol. 168. Springer Science & Business Media; 2012.
Mallapaty S. The coronavirus is most deadly if you are older and male — new data reveal the risks. Nature. 2020 Sep 3;585(7823):16–7.
Irawan MZ, Belgiawan PF, Joewono TB, Bastarianto FF, Rizki M, Ilahi A. Exploring activity-travel behavior changes during the beginning of COVID-19 pandemic in Indonesia. Transportation (Amst) [Internet]. 2022;49(2):529–53. Available from: https://doi.org/10.1007/s11116-021-10185-5
Ben-Akiva M, Walker J, Bernardino AT, Gopinath DA, Morikawa T, Polydoropoulou A. Integration of choice and latent variable models. In: In Perpetual Motion: Travel Behavior Research Opportunities and Application Challenges. 2002. p. 431–70.
Abou-Zeid M, Ben-Akiva M. Hybrid choice models. In: Hess S, Daly A, editors. Handbook of Choice Modelling [Internet]. Cheltenham, UK: Edward Elgar Publishing; 2014. p. 383–412. Available from: https://www.elgaronline.com/view/edcoll/9781781003145/9781781003145.00025.xml
Hasnine MS, Kamel I, Habib KN. Using Google Map to impute transportation level-of-service attributes: application in mode and departure time choice modelling. In: 11th International Conference on Transport Survey Methods, Esterel, Quebec, Canada. 2017. p. 24–9.
Natural Resources Canada. 2021 Fuel Consumption Guide [Internet]. 2021 [cited 2022 Jul 21]. Available from: https://www.nrcan.gc.ca/sites/nrcan/files/oee/pdf/transportation/tools/fuelratings/2021 Fuel Consumption Guide.pdf
Hess S, Palma D. Apollo: A flexible, powerful and customisable freeware package for choice model estimation and application. J Choice Model. 2019;32:100170.

No competing interests reported.

APPENDIX.docx
