Four datasets were linked in pairs to determine the overlap between them. Data linkage software automatically suggested potential matches by comparing link probabilities to a threshold, and these suggested matches were then confirmed manually. The following sections describe the datasets and the linkage approach.
Datasets
The following datasets with commercial fishing incident data from Oregon and Washington were studied: the Commercial Fishing Incident Database (CFID), the Vessel Casualty (VC) database, the Nonfatal Injuries (NFI) database, and the Oregon Trauma Registry (OTR).
The Commercial Fishing Incident Database (CFID) contains information regarding commercial fishing vessel disasters and fatalities due to traumatic injuries from the entire United States (CDC/NIOSH, 2019). The CFID definition of a vessel disaster is an event such as sinking that forces the crew to abandon the vessel because it is no longer safe to remain onboard. The types of data collected include the date, time, and location of the incident, vessel details, contributing factors, and personnel injury and fatality information. A single incident may involve injuries to and/or fatalities of multiple personnel. The original sources of data used to populate this database include United States Coast Guard (USCG) reports and news articles. CFID was developed and is actively maintained by the Centers for Disease Control and Prevention’s (CDC) National Institute for Occupational Safety and Health (NIOSH), Western States Division.
The Vessel Casualty (VC) database recorded information about commercial fishing vessel-related incidents in Alaska, Oregon, and Washington that (a) were not classified as vessel disasters and (b) did not involve any fatalities. These incidents tend to be less serious but still present problems with vessel systems that can put crewmembers at risk, such as loss of power, propulsion, or steering (Case & Lucas, in press). This dataset was maintained by NIOSH and merged with CFID in 2020. Cases were originally obtained from USCG reports. This dataset provides information about the incident date, time, location, and circumstances, in addition to the vessel information. Unlike CFID, this dataset does not include any personnel information, focusing instead on vessel damage. For this study, data were requested from Oregon and Washington only.
The Nonfatal Injuries (NFI) database was developed at NIOSH to complement the information recorded in CFID. To date, this dataset has covered Alaska, Washington, Oregon, and California. The NFI records injuries sustained during commercial fishing that are not recorded in CFID, such as those incurred while working on deck. Cases were originally obtained from USCG reports. Variables in NFI are similar to those in CFID; in addition to personnel demographics and injury characteristics, the NFI also contains vessel information. Data were requested from Oregon, Washington, and California only.
The Oregon Trauma Registry (OTR) (Oregon Health Authority Public Health Division, 2019) includes information concerning all Oregon patients who either entered the trauma system in Oregon or met specific clinical- or admission-based criteria for inclusion in the registry, based on either field entry by EMS responders or the activation of a trauma team or surgeon at a receiving hospital. Information recorded includes patient demographics (including occupation); date, time, and location of incident; emergency service response; injury circumstances and details; medical procedures performed; length of stay; insurance; and costs. Data were requested for patients with work-related injuries and occupations of farming/fishing/forestry. This dataset was further pruned using incident location and narratives to include only fishing-related cases. The OTR is maintained by the Injury and Violence Prevention Program of the Oregon Health Authority Public Health Division.
Each dataset used in this study contains data from different date ranges and regions (Table 1). The date range varied by data source due to lags in data abstraction, coding, and review, or to availability of data elements during specific time periods.
By definition, (a) the CFID dataset should not overlap with the VC or NFI datasets, (b) the VC and NFI datasets may overlap, and (c) the OTR dataset could overlap with any of them (Fig. 1).
Data linkage method
Data linkage is a statistical technique used to identify records from two datasets that likely describe the same event. The two datasets must have some parameters in common (i.e., the matching variables) that can be used to distinguish events and link the records. Every record in one dataset is compared with every record in a second dataset. The likelihood that two records match (their match probability) is determined by comparing the contents of the matching variables for that pair. Match probabilities range from 0 to 1. Any record pair with a match probability above a specified threshold is designated as a link. Those below the threshold are designated as non-links. In our project, this was followed by a manual review process where all links were examined further to identify true matches.
Matching variables must be selected carefully. Ideal matching variables are independent, reliable, and complete. A major aim of this project is to determine the feasibility of using data linkage methods with commercial fishing incident data when personally identifiable information (PII) is not available. Depending on the two datasets involved, the matching variables used for linking in this study were some combination of: Incident Date, Incident State, Vessel Official Number, and Latitude/Longitude. These independent variables were identified during preliminary data linkage analyses to be the strongest indicators of links.
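The compare-and-threshold procedure described above can be illustrated with a minimal sketch. It is not the toolkit's implementation: the records, field names, and threshold value are hypothetical, Latitude/Longitude is omitted for brevity, and the score used here (the share of matching variables that agree exactly) is only a simple stand-in for a classifier-derived match probability.

```python
from itertools import product

# Hypothetical records; field names and values are illustrative only.
dataset_a = [
    {"id": "A1", "incident_date": "2015-03-02", "state": "OR", "vessel_number": "123456"},
    {"id": "A2", "incident_date": "2016-07-19", "state": "WA", "vessel_number": "654321"},
]
dataset_b = [
    {"id": "B1", "incident_date": "2015-03-02", "state": "OR", "vessel_number": "123456"},
    {"id": "B2", "incident_date": "2016-07-19", "state": "OR", "vessel_number": "999999"},
]

MATCHING_VARIABLES = ("incident_date", "state", "vessel_number")
THRESHOLD = 0.6  # assumed value, chosen only for this illustration


def match_score(rec_a, rec_b):
    """Stand-in for a match probability: share of matching variables that agree."""
    agreements = sum(rec_a[v] == rec_b[v] for v in MATCHING_VARIABLES)
    return agreements / len(MATCHING_VARIABLES)


def link(records_a, records_b, threshold=THRESHOLD):
    """Compare every record in A with every record in B and split pairs by threshold."""
    links, non_links = [], []
    for rec_a, rec_b in product(records_a, records_b):
        score = match_score(rec_a, rec_b)
        pair = (rec_a["id"], rec_b["id"], score)
        if score > threshold:
            links.append(pair)      # candidate link, to be reviewed manually
        else:
            non_links.append(pair)
    return links, non_links


links, non_links = link(dataset_a, dataset_b)
```

In this toy example only the pair (A1, B1) agrees on all three variables and is designated a link; the remaining three pairs fall below the threshold. In practice the candidate links would then go to manual review.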
The linking results presented here were derived using components of the Python Record Linkage Toolkit software (De Bruin, 2019). This toolkit includes several data linkage classifiers, which use different methods to separate record pairs into links and non-links. How well each classifier performs depends on the datasets involved. Each of the classifiers described below was tested to determine the optimum classifier for our datasets.
Classifiers can be divided into two groups: supervised and unsupervised. Supervised classifiers require training using a "golden data set", a subset of the data where the true match status is known. Unsupervised classifiers, on the other hand, do not require training.
For the supervised classifiers, a golden data set was derived for each pair of datasets to be linked. First, a rudimentary approach was used to identify a small list of potential matches. True matches were then verified manually. Next, a set of non-matches was derived by creating fake records from scrambled real records. Finally, a golden data set for the dataset pair was created consisting of a combination of these verified true- and non-matches. For this study, the supervised classifiers used were Naïve-Bayes (NB), Logistic Regression (LR), and Support Vector Machine (SVM). Classifier definitions are provided in the Supplementary Materials.
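One way to assemble such a golden data set is sketched below. The text does not specify the exact scrambling procedure, so this sketch makes an assumption: fake records are built by shuffling each field's values independently across the real records. The function names, field names, and sample records are hypothetical.

```python
import random

FIELDS = ("incident_date", "state", "vessel_number")  # hypothetical field names


def scramble(records, rng):
    """Build fake records by shuffling each field's values independently across records."""
    columns = {}
    for field in FIELDS:
        values = [r[field] for r in records]
        rng.shuffle(values)
        columns[field] = values
    return [{f: columns[f][i] for f in FIELDS} for i in range(len(records))]


def build_golden_set(verified_pairs, seed=0):
    """Return (record_a, record_b, label) triples: 1 = true match, 0 = non-match."""
    rng = random.Random(seed)
    golden = [(a, b, 1) for a, b in verified_pairs]
    fakes = scramble([b for _, b in verified_pairs], rng)
    for (a, b), fake in zip(verified_pairs, fakes):
        # Drop any scramble that happens to reproduce the true partner exactly.
        if any(fake[f] != b[f] for f in FIELDS):
            golden.append((a, fake, 0))
    return golden


# Manually verified true matches (illustrative values only).
verified_pairs = [
    ({"incident_date": "2015-03-02", "state": "OR", "vessel_number": "111"},
     {"incident_date": "2015-03-02", "state": "OR", "vessel_number": "111"}),
    ({"incident_date": "2016-07-19", "state": "WA", "vessel_number": "222"},
     {"incident_date": "2016-07-19", "state": "WA", "vessel_number": "222"}),
    ({"incident_date": "2017-11-30", "state": "OR", "vessel_number": "333"},
     {"incident_date": "2017-11-30", "state": "OR", "vessel_number": "333"}),
]

golden = build_golden_set(verified_pairs)
```

The resulting labeled pairs could then be converted to comparison vectors and used to train a supervised classifier.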
The unsupervised classifier used in this study was the Expectation/Conditional Maximization Algorithm (ECM), a probabilistic classifier closely related to both the Naïve-Bayes Classifier and the probabilistic Fellegi and Sunter (1969) approach.
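To illustrate how a classifier of this family can learn without labeled data, the sketch below applies plain expectation-maximization to the Fellegi-Sunter model with binary agreement vectors, assuming the matching variables are conditionally independent. It is a simplified stand-in, not the toolkit's ECM implementation; the function names, starting values, and toy data are assumptions made for illustration.

```python
def posterior(vec, p, m, u):
    """Probability that a comparison vector comes from a true match."""
    like_m, like_u = p, 1.0 - p
    for k, agree in enumerate(vec):
        like_m *= m[k] if agree else 1.0 - m[k]
        like_u *= u[k] if agree else 1.0 - u[k]
    return like_m / (like_m + like_u)


def clamp(x, eps=1e-6):
    """Keep probabilities strictly inside (0, 1) to avoid division by zero."""
    return min(max(x, eps), 1.0 - eps)


def em_fellegi_sunter(vectors, n_iter=50, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate p (match share), m_k = P(agree | match), u_k = P(agree | non-match)."""
    n, n_vars = len(vectors), len(vectors[0])
    m, u = [m_init] * n_vars, [u_init] * n_vars
    for _ in range(n_iter):
        # E-step: posterior match probability of each record pair.
        w = [posterior(vec, p, m, u) for vec in vectors]
        total = sum(w)
        # M-step: re-estimate the parameters from the weighted pairs.
        for k in range(n_vars):
            agree_m = sum(wi for wi, vec in zip(w, vectors) if vec[k])
            agree_u = sum(1.0 - wi for wi, vec in zip(w, vectors) if vec[k])
            m[k] = clamp(agree_m / total)
            u[k] = clamp(agree_u / (n - total))
        p = clamp(total / n)
    return p, m, u


# Toy comparison vectors (1 = the matching variable agrees, 0 = it disagrees).
vectors = (
    [(1, 1, 1)] * 4 + [(1, 1, 0)] * 1 + [(1, 0, 0)] * 5
    + [(0, 1, 0)] * 10 + [(0, 0, 0)] * 20
)

p, m, u = em_fellegi_sunter(vectors)
prob_full_agreement = posterior((1, 1, 1), p, m, u)
prob_no_agreement = posterior((0, 0, 0), p, m, u)
```

On this toy data the pairs that agree on every variable receive a match probability close to 1 and the pairs that agree on none receive a probability close to 0, without any labeled training data.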
Quality metrics
Quality metrics provided by data linkage include TP (the number of True Positives), FP (False Positives), FN (False Negatives), and TN (True Negatives). For supervised classifiers, these describe how well the linkage technique performed on the golden data set. Additional metrics can be derived including precision, recall and f-score (Eqns. 1-3, respectively). Each of these three metrics (precision, recall, f-score) can range in value from 0 (worst) to 1 (best).
The f-score is a summary metric of the classifier's performance that balances precision and recall. Generally, a higher f-score indicates better performance.
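The sketch below computes the three metrics from the four counts. It assumes the standard definitions, with precision as TP / (TP + FP), recall as TP / (TP + FN), and the f-score as the harmonic mean of precision and recall; the function name and example counts are hypothetical.

```python
def linkage_metrics(tp, fp, fn):
    """Return (precision, recall, f_score) from confusion counts.

    Assumes the standard definitions; each metric is set to 0.0 when its
    denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts: 8 true links found, 2 false links, 2 true links missed.
precision, recall, f_score = linkage_metrics(tp=8, fp=2, fn=2)
```

TN does not enter any of the three formulas, which is useful in data linkage because non-matching pairs vastly outnumber matching pairs when every record is compared with every other.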