Broadening the Capture of Natural Products Mentioned in FAERS Using Fuzzy String-Matching and a Siamese Neural Network

Increased sales of natural products (NPs) in the US and growing safety concerns highlight the need for NP pharmacovigilance. A challenge for NP pharmacovigilance is ambiguity when referring to NPs in spontaneous reporting systems. We used a combination of fuzzy string-matching and a neural network to reduce this ambiguity, aiming to increase the capture of reports involving NPs in the US Food and Drug Administration Adverse Event Reporting System (FAERS). Gestalt pattern-matching (GPM) and a Siamese neural network (SM) were used to identify potential mentions of NPs of interest in 389,386 FAERS reports with unmapped drug names. We refined the identified candidates through manual review and annotation by health professionals. After adjudication, GPM identified 595 unique NP names and SM identified 504. There was little overlap between the candidates identified by the two approaches (non-overlapping: GPM 347, SM 248). In total, 686 novel NP names were identified in the unmapped FAERS reports. Including these names in the FAERS collection yielded 3,486 additional reports mentioning NPs.

consumption. Their constituents may interact across multiple biological systems in complex ways to contribute to their effects 6. A promising approach to assessing safety concerns for NPs is retrospective pharmacovigilance analysis of adverse event reports from spontaneous reporting systems such as the FDA Adverse Event Reporting System (FAERS) 7,8. A major challenge in pharmacovigilance for NPs is the lack of standardization for coding events involving NPs. This lack of standardization leads to challenges in parsing and identifying product names and ingredients due to their non-uniform representation in the reports 9,10. Therefore, researchers often encounter unfamiliar NP names or spelling variations when identifying reports for pharmacovigilance 2. For example, the FAERS database includes more than 44 names referring to "Licorice", including "Liquorice", "Glycyrrhiza glabra", and "Glycyrrhiza laevis". Previous work has used fuzzy string-matching to overcome this limitation 11. This approach helps mitigate the effects of similar name variations and misspellings, but it does not fully bridge the distance between the spectrum of names referring to the same product 9,11, such as matching the common name "Liquorice" to its equivalent Latin binomial name "Glycyrrhiza glabra."
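Python's standard `difflib` module implements the same Ratcliff/Obershelp (Gestalt) pattern-matching used here, so the limitation above can be illustrated directly. The name list below is a hypothetical subset for illustration only:

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical subset of known names; "LIQUORICE" and "GLYCYRRHIZA GLABRA"
# refer to the same product, while "CINNAMON" does not.
known_names = ["LIQUORICE", "GLYCYRRHIZA GLABRA", "CINNAMON"]

# A spelling variation is captured easily by fuzzy string-matching:
print(get_close_matches("LIKORICE", known_names, n=3, cutoff=0.6))
# -> ['LIQUORICE']

# ...but string similarity cannot bridge a common name and its Latin
# binomial: their Gestalt similarity falls well below the 0.6 cutoff.
print(SequenceMatcher(None, "LIQUORICE", "GLYCYRRHIZA GLABRA").ratio() < 0.6)
# -> True
```

This is the gap the deep-learning component described below is meant to close.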
To address these shortcomings, we propose combining fuzzy string-matching and deep learning to broaden the capture of candidate NP names. A combined approach can leverage both the reliability of fuzzy string-matching and the flexibility of deep learning to identify both spelling variations and alternative names for a given product name. For example, given a misspelled form of Licorice, such as "Likorice", the model will be able to map it to its Latin binomial name, "Glycyrrhiza glabra", and its other species by outputting a small distance between them. For this work, we utilized Gestalt pattern-matching 12 (GPM) as the fuzzy string-matching component to maximize the identification of candidate spelling variations (shown in Eq. 1 and Fig. 1). The proposed deep learning approach relies on the cosine distance (shown in Eq. 3 and Fig. 2) between learned embeddings to create a model that matches natural product names. The deep learning approach is based on the Siamese model (SM) architecture, which facilitates learning the embeddings by comparing the inputs through the contrastive loss function (shown in Eq. 4). Siamese neural networks were chosen for this task because they have been shown to successfully address the challenge of identifying similarities over a considerable range of problems 13 and because they have been successfully trained with relatively little data 14. Given an unknown term and a set of alternatives, the model learns to embed the inputs to minimize the cosine distance between terms that are spelled similarly or that are semantically similar.

Data Collection:
The first data source was the Center of Excellence for Natural Product-Drug Interaction Research (NaPDI) Database, from which we collected the known product names of several NPs, some previously identified spelling variations, and their corresponding Latin binomial names 15. A second data source was the FAERS database, from which we identified additional product names or spelling variations using fuzzy string-matching for 70 different NPs 7. FAERS data from Q1 2004 through Q2 2021 were loaded into a standardized database, and manual annotation was used to map 5,358 drug name strings from adverse event reports that matched NP names. The remaining 389,386 unmapped drug names from FAERS were used for the novelty experiment in this study.

Experiments:
The data was used to train and evaluate the Siamese model (SM) through several experiments studying the effectiveness of the SM at matching potentially relevant terms from the reports to the corresponding NP names. We initially explored the SM's performance as a distance metric for relating NP names. Then, we evaluated how the SM compared to fuzzy string-matching approaches on the same problem, validating that the SM can match novel names or spelling variations from FAERS to the correct equivalent group of NPs. Finally, we combined both approaches to produce a set of candidate NP names to be manually validated and utilized during FAERS report collection.

Data Pre-Processing & Inclusion Criteria:
The training data consisted of pairs of spelling variations of the product names from the manual annotation and a distance label, where "1" indicated distant terms and "0" indicated matching terms. An example row of a positive matching pair might be ("Likorice", "Liquorice", 0) and of a negative matching pair ("Cinnamon", "Liquorice", 1). This representation allows the Siamese model to learn the associations between query and target terms and to represent the associations as a distance between 0 and 1. For simplicity, we decided to reduce the variation across terms. To this end, the data was standardized such that any non-alphabetical characters, with the sole exception of the whitespace character, were removed from the terms. All characters in the terms were then capitalized. Due to limitations of the implementation of Keras' Embedding layer 16, a fixed-size cutoff for the maximum length of the terms is required, and inputs must be represented as positive integers. We chose a cutoff close to the average length of the terms in the data (30) plus one standard deviation (31).
Therefore, terms longer than 65 characters were discarded. The final step in this initial processing was to encode the terms into integer sequences, where each letter was mapped to its corresponding position in the English alphabet, so [A-Z] became [1-26], and the space character was mapped to the integer 27. Sequences shorter than the 65-character cutoff were 0-padded up to 65 elements.
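The pre-processing steps above can be sketched in plain Python. This is a minimal sketch, not the paper's implementation, and `encode_term` is a hypothetical helper name:

```python
MAX_LEN = 65  # fixed-size cutoff described above

def encode_term(term, max_len=MAX_LEN):
    """Standardize a term and encode it as a 0-padded integer sequence.

    Non-alphabetical characters (except spaces) are removed, letters are
    capitalized and mapped A-Z -> 1-26, space -> 27; terms longer than
    the cutoff are discarded (None).
    """
    cleaned = "".join(c for c in term.upper() if "A" <= c <= "Z" or c == " ")
    if len(cleaned) > max_len:
        return None
    encoded = [27 if c == " " else ord(c) - ord("A") + 1 for c in cleaned]
    return encoded + [0] * (max_len - len(encoded))

seq = encode_term("Glycyrrhiza glabra")
print(seq[:5])   # -> [7, 12, 25, 3, 25]  (G, L, Y, C, Y)
print(len(seq))  # -> 65
```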
After this initial processing, two rounds of balancing were performed. First, we balanced the representation of each target name to roughly the same amount. The additional pairs were generated by taking any names matching the target name and adding random modifications to create new unique pairs. The second round of balancing was similar: it generated matching and non-matching pairs as necessary to balance the total number of matching and non-matching pairs in the complete dataset. After the balancing procedures were completed, the data was split 70/30 into training and validation sets, and a separate test/holdout set was created. The number of samples in each set is provided in Table 1.

We utilized the SM architecture, as shown in Fig. 2. A SM comprises two identical neural network towers with the same architecture. In our implementation, each tower is made from 65 recurrent bidirectional Long Short-Term Memory (LSTM) cells 17. The corresponding input to each tower was first embedded into a 30-dimensional space by an embedding network comprised of 2 layers of 130 dense nodes each. The outputs of the towers were combined using the cosine distance between the vectors of the embedded terms, and the contrastive loss function was utilized during training to measure the model's accuracy. The hyperparameters for the number of dense nodes, embedding dimensions, and number of layers in the embedding network were chosen experimentally.
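The distance and loss described above can be written out in plain Python. This is a minimal sketch of the standard contrastive-loss form (not a reproduction of the paper's Eq. 3 and Eq. 4), assuming a margin of 1.0; the label convention follows the training data: 0 = matching pair, 1 = distant pair.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def contrastive_loss(d, label, margin=1.0):
    """Contrastive loss for a single pair at distance d.

    label 0 = matching pair (penalize large distance),
    label 1 = distant pair (penalize distance below the margin).
    """
    return (1 - label) * d ** 2 + label * max(margin - d, 0.0) ** 2

# Identical embeddings: distance 0, so a matching pair has zero loss.
print(contrastive_loss(cosine_distance([1.0, 0.0], [1.0, 0.0]), 0))  # -> 0.0
# Orthogonal embeddings: distance 1, so a distant pair (margin 1.0)
# also has zero loss.
print(contrastive_loss(cosine_distance([1.0, 0.0], [0.0, 1.0]), 1))  # -> 0.0
```

Minimizing this loss pushes matching pairs toward distance 0 and distant pairs out past the margin.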

Comparison with Fuzzy String-Matching:
To evaluate the model's usefulness in identifying the correct matching NP name, we compared the model's performance against fuzzy string-matching approaches. The algorithms utilized for fuzzy string-matching were the Levenshtein edit distance (LED), as implemented in TensorFlow's "edit_distance" 18, and Gestalt pattern-matching (GPM), as implemented in the "get_close_matches" function of Python's "difflib" library 19. The LED measures the similarity between two sequences by their edit distance; GPM measures similarity by recursively finding the longest matching substrings. The metric used for comparison was Mean Reciprocal Rank (MRR) 20, with which we measured the top 20 results predicted to be the most similar to the target value annotated in the dataset. Additionally, we also compared the top results to any of the product names equivalent to the target.
These top 20 results are used as candidate NP names to be validated further.
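MRR over ranked candidate lists can be computed as follows. This is a minimal sketch with a hypothetical `mean_reciprocal_rank` helper; the set of correct answers covers the "equivalent matching" case, where any equivalent product name counts as correct:

```python
def mean_reciprocal_rank(queries):
    """queries: list of (ranked_candidates, set_of_correct_answers)."""
    total = 0.0
    for ranked, correct in queries:
        rr = 0.0
        for rank, candidate in enumerate(ranked, start=1):
            if candidate in correct:
                rr = 1.0 / rank  # reciprocal rank of the first correct hit
                break
        total += rr
    return total / len(queries)

# Toy example: top-20 lists truncated to 3 entries for brevity.
queries = [
    (["LIQUORICE", "CINNAMON", "GINGER"], {"LIQUORICE"}),          # rank 1 -> 1.0
    (["CINNAMON", "ZINGIBER", "GINGER"], {"GINGER", "ZINGIBER"}),  # rank 2 -> 0.5
]
print(mean_reciprocal_rank(queries))  # -> 0.75
```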

Novelty Experiment:
Finally, to evaluate the applicability of our methods for pharmacovigilance research, we extracted 389,385 drug name strings from the FAERS database that were not mapped to any drugs or NP names and might contain NPs. After processing the unmapped names, 7,751 were removed because they were longer than 65 characters. Another 41,849 sequences were identified as duplicates and also removed; the remaining 339,785 were utilized for this novelty experiment to identify unique NP names from unmapped reports. For this experiment, we utilized a subset of 70 NPs of interest from the set of NP names used for training. This subset contains both the Latin binomial and a known common name, referred to as the preferred term (PT), for each of the 70 natural products. These 140 names were utilized as a query set to identify candidate mappings from terms found in the FAERS database. We then utilized GPM and SM to match each query term with the top 20 unmapped FAERS strings predicted as the most similar (least distant).
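The retrieval step of the novelty experiment can be sketched with `difflib` standing in for the GPM component. `candidate_mappings` is a hypothetical helper, the toy strings are invented, and the 0.6 cutoff is difflib's default, assumed here for illustration:

```python
from difflib import get_close_matches

MAX_LEN = 65

def candidate_mappings(query_terms, unmapped_strings, n=20, cutoff=0.6):
    """For each query NP name, return up to the n closest unmapped strings.

    Mirrors the filtering described above: strings longer than 65
    characters are discarded and duplicates removed before matching.
    """
    pool = sorted({s for s in unmapped_strings if len(s) <= MAX_LEN})
    return {q: get_close_matches(q, pool, n=n, cutoff=cutoff) for q in query_terms}

# Toy unmapped strings: a duplicate, an over-length string, and two variants.
unmapped = ["LIKORICE", "LIKORICE", "GINGEER", "X" * 70, "ACETAMINOPHEN"]
print(candidate_mappings(["LIQUORICE", "GINGER"], unmapped))
# -> {'LIQUORICE': ['LIKORICE'], 'GINGER': ['GINGEER']}
```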

Manual Validation:
The candidate mappings between the query NP names and unmapped FAERS strings yielded by the novelty experiment were manually annotated by two health professionals to assess whether the candidate mappings were correct. This process leverages their expertise with drug and NP names to validate the results from the model. We further corroborated the annotations through Cohen's kappa inter-rater agreement metric and an adjudication process to resolve the points of disagreement 22.
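Cohen's kappa corrects raw agreement for the agreement expected by chance. A minimal sketch with hypothetical annotations ("m" = match, "n" = non-match); this is an illustration, not the study's annotation data:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items where the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Toy annotations over 10 candidate mappings: 8/10 raw agreement.
a = ["m", "m", "m", "m", "m", "n", "n", "n", "n", "n"]
b = ["m", "m", "m", "m", "n", "n", "n", "n", "n", "m"]
print(round(cohens_kappa(a, b), 2))  # -> 0.6
```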

Model Training Results:
The SM was trained for up to 500 epochs, and training terminated early at 24 epochs. The best-performing epoch in this training run achieved a validation accuracy of 0.97 (validation loss: 0.03). The weights from that epoch were saved and utilized for the rest of the experiments.
A holdout set containing 2,500 pairs was utilized to compare the MRR performance. For the MRR evaluation, we were only interested in the subset of matching pairs (n = 1,000), given that we used the first element of each pair as the query and the second element as the indicator of the correct answer. Using the top 20 NP names reported as the least distant from the query term by each approach, we looked for exact matches to the target of the pair and for matches to terms equivalent to the target.
For exact matching, the LED approach performed best (MRR = 0.567). For equivalent matching, the LED approach also performed best (MRR = 0.903). In both cases, the GPM approach performed similarly to LED, with slightly lower MRR scores (exact = 0.563, equivalent = 0.894). In both tests, the SM achieved comparably lower MRR scores (exact = 0.438, equivalent = 0.672). See Fig. 3 and Table 2.

Novelty results:
The single-blind evaluation showed strong agreement (kappa = 0.86) between the annotators on the identified candidate mappings. The specificity of the identified terms was the primary cause of disagreements between the annotators. Where disagreements occurred, the rules in Table 3 were utilized for adjudication. After adjudication, evaluators reported that the SM identified 504 correct terms and GPM identified 595 (details shown in Table 4). For the 70 NPs of interest, we considered an NP covered by an approach when that approach correctly identified one or more of its names (details shown in Table 6). By this comparison, the GPM and SM approaches performed similarly, identifying an average of 6 and 5 reports, respectively, for the products they covered. From this novelty experiment, we identified a total of 158 novel NP names and spelling variations for the 70 NPs.

Rules
If the query term was a Latin binomial and the result was a general common name or only a genus name, this was identified as a non-match.
If the query term was a common name and the result could potentially collide with another medical term, this was identified as a non-match.

Manual Validation:
It is worth noting that many of the terms did not overlap between the approaches (see Table 5). The SM identified 248 unique names, while GPM identified 347. The unique terms obtained from this mapping were incorporated into our quarterly data collection from FAERS covering Q1 2004 through Q2 2022 23.
For mining reports containing mentions of NPs, we only looked at the reports involving the products for which novel product names were identified; these 57 NPs are a subset of the original 70 NPs of interest. Including the novel terms from the experiments above resulted in the capture of 3,486 additional reports that were not previously identified in the database (see Table 7).

This study combined fuzzy string-matching and Siamese neural network approaches to identify NP names in adverse event reports in the FAERS database and successfully broadened the capture of NP reports by approximately 7.5%. Prior work using string-matching methods to identify NP strings in spontaneous reporting systems has relied on multiple sources of NP names to create a thesaurus for identifying adverse event reports 9,24. This requires maintenance of the thesaurus and regular updates to capture relevant NPs and name variations. This study expands upon that work by using a manually annotated dataset from the FAERS database to train a model to identify NP variations. The approach can also be effectively utilized to broaden the capture of reports in other spontaneous reporting systems and to overcome challenges in NP pharmacovigilance, including lack of interoperability among NP data sources; lack of coverage of synonyms, scientific names, and common names; and ambiguity in NP names in adverse event reports 9,24. The manual annotation results showed that both approaches contribute sufficient unique candidate mappings to increase the number of reports identified in FAERS, which is essential considering that only approximately 0.4% of the reports in FAERS involve NPs.

Combined Approach:
We trained a SM to serve as a proxy distance metric for identifying potential spelling variations of NP names. The results from the training process are encouraging regarding the potential of the method for mining emerging variations in adverse event reports. In agreement with previous work suggesting that natural language processing approaches can outperform current methods 8, we expected the SM to outperform the fuzzy string-matching approaches. During our work, it became clear that this was not the case with our current implementation. Although the approach minimized the distance between similar terms, as seen during the training evaluation, it did not effectively maximize the distance between dissimilar ones, as suggested by the MRR comparison. This may be due to potential overlaps between spelling and semantic similarities of the query and target space.
Potential limitations in the training of the SM include the completeness of the data, shortcomings of the evaluation metrics, and the generalizability of the methods. Due to the nature of the problem, the data on spelling variations for NPs utilized for training was in no way complete or exhaustive. Our approach to data processing and augmentation increases the model's capacity to generalize to novel variations at the risk of saturating and confounding the embedding space. As implemented, the SM is learning two different tasks: "denoising" spelling variations to the preferred term and matching equivalent terms as similar. Separating these tasks and creating a model architecture for the specialized handling of each might prove advantageous. In the current work, the MRR metric only measures the top response and not the completeness of the results. Tweaking how we measured MRR might provide a more accurate assessment of the applicability of the approaches.
The SM architecture was chosen for this work because it can easily be used for distance-metric learning between pairs and because Siamese models have been shown to successfully learn a distance metric even with relatively little data. The SM approach was at most only comparable to approaches such as LED and GPM. Nonetheless, it proved helpful in mining adverse event reports for mentions of NPs, as seen in the novelty experiment. The novel NP names identified in the novelty test (listed in Appendix 1) will help refine the task of mining natural products from adverse event reports (AERs) in the future.

Limitations:
We encountered some limitations in our implementation, such as the need for a fixed input size. Since the average length of the NP names considered for the study was 30 characters with a standard deviation of 31, we chose a value close to the mean plus one standard deviation as our sequence-length cutoff. In turn, the current model cannot process sequences longer than 65 characters; approaches that generalize applicability past this threshold are desirable. A second limitation was identified in the MRR comparison experiment. For the current problem, the orthographic and semantic spaces are not mutually exclusive; overlaps between spelling similarity and semantic dissimilarity, and vice versa, can hurt the model's performance. Another limitation of our work is that candidate names were mined for only 70 NPs of interest. Additionally, as implemented, our model did not prioritize semantic similarity over spelling similarity, leading to more misidentified candidate NP names. Finally, the scalability of the manual validation process presents a hurdle as the number of candidate names increases.

Future Work:
Our future work will involve assessing how different elements, such as the amount of noise used in data processing and the size of the train/validation split, impact the model's training performance. We also plan to investigate alternative ways of handling data processing, including adding features to the data and creating model architectures that separately consider orthographic and semantic similarity. Moreover, we aim to expand our candidate identification process by mining candidates for a broader range of natural products. We will prioritize semantic similarity over spelling similarity to improve accuracy. Additionally, we will focus on enhancing the reliability of our methods to reduce the need for manual validation. We believe it is important to continue this work because, as our methods for identifying mentions of NPs in AERs improve, we expect to detect more natural product-drug interaction signals, enhancing patient safety through NP pharmacovigilance.

Conclusion:
A SM was trained to identify potential spelling variations of NP names. The SM model training terminated early at 24 epochs, achieving a validation accuracy of 0.97. In the MRR evaluation, the SM performance was, at most, comparable to that of the fuzzy string-matching approaches. In the novelty experiment, GPM and SM performed similarly in identifying correct terms. The unique terms obtained were incorporated into the quarterly data collection process, resulting in the capture of 3,486 additional reports.

Figure 1: Finding the similarity between the terms S1: "ZINGIBER" and S2: "GINGER" using the Gestalt pattern-matching approach. The longest matching substring K1: "ING" serves as an anchor to align the inputs.
Recursive matching then proceeds on the substrings to the left and right of K1; here K2: "ER" is a second matching substring of characters to the right of K1. The calculation of a similarity score based on GPM is shown at the bottom of the figure.
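Since `difflib.SequenceMatcher` implements the same Ratcliff/Obershelp algorithm, the worked example in the figure can be reproduced directly:

```python
from difflib import SequenceMatcher

sm = SequenceMatcher(None, "ZINGIBER", "GINGER")

# K1 = "ING" anchors the alignment; K2 = "ER" matches to its right.
blocks = ["ZINGIBER"[b.a:b.a + b.size] for b in sm.get_matching_blocks() if b.size]
print(blocks)  # -> ['ING', 'ER']

# Similarity = 2 * (total matched characters) / (|S1| + |S2|) = 2 * 5 / (8 + 6)
print(round(sm.ratio(), 3))  # -> 0.714
```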

Author Contributions: O. Dilán-Pantojas: main author of the manuscript; involved in the end-to-end implementation of the Siamese model, the fuzzy string-matching comparison, and the novelty experiments. Contributed programming expertise, in addition to methodological contributions to the comparison between the Siamese model and fuzzy string-matching, and performed the analysis. Tanupat Boonchalermvichien: contributed extensively to the methodology and initial implementation of the Siamese model experiment; carried out the data collection and data preparation.

Figures

Figure 1

Figure 2

Table 2
Results from Mean Reciprocal Rank comparison:

Table 7
By combining both the SM and GPM, a broader capture of NP names was achieved. Nonetheless, careful manual validation of the identified candidate names is still required. Through this process of novel NP name discovery and interaction detection, we can help further research on natural product-drug interactions.

All data supporting the findings of this study are available within the paper and its Supplementary Information. The full list of natural product names identified in FAERS for the 70 NPs of interest can be found in Appendix Table 1. The data utilized for the identification of NP candidates through the combined approach is available as open access data through Zenodo: https://doi.org/10.5281/zenodo.8155759.
The code and data utilized for this work are available from the following GitHub repository: https://github.com/dbmi-pitt/np_name_finder/blob/main/README.md. The repository includes the code, configuration files, and data required to reproduce the work.