Summary statistics of patents that include MGR
The GenBank patent division, the European Bioinformatics Institute database (EMBL-EBI) and the DNA DataBank of Japan (DDBJ) exchange their data daily and together form the International Nucleotide Sequence Database Collaboration (INSDC). Genetic sequences associated with patents were retrieved from the patent division of GenBank, maintained by the National Center for Biotechnology Information (hereafter, the GenBank database), on 10 November 2022 (ftp.ncbi.nih.gov/genbank/); this included 24,600,503 annotated sequences. All files (from gbpat1.seq.gz to gbpat254.seq.gz) were downloaded and processed following the methodology of Arnaud-Haond, Arrieta, and Duarte 2011 to create database entries with information on the nucleotide sequence of DNA, species name, patent number, patent date, and the party registering the patent. This was done by splitting each file into individual sequence records and extracting, for each record, the data in the ORIGIN field (nucleotide sequence), the ORGANISM field (species name), and the JOURNAL field (patent application number, year of application, patent system, and patent applicant name). Unlike previous studies (e.g., Arnaud-Haond et al. 2011, Blasiak et al. 2018) that restricted their analysis to sequences submitted in a given patent system, here we considered both patents submitted in national jurisdictions and those filed under the World Intellectual Property Organization’s Patent Cooperation Treaty (“international” patents).
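The record-splitting and field-extraction step can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy record, and the decision to keep only the first ORGANISM line are assumptions made for illustration. It relies only on the flat-file layout in which keywords occupy the first 12 columns and a line containing "//" ends a record.

```python
import re

def split_records(text):
    """Yield each flat-file record as a list of lines; '//' terminates a record."""
    record = []
    for line in text.splitlines():
        if line.strip() == "//":
            if record:
                yield record
            record = []
        else:
            record.append(line)

def parse_record(lines):
    """Extract the ORGANISM, JOURNAL and ORIGIN content from one record."""
    fields = {"organism": None, "journal": "", "sequence": ""}
    current = None  # keyword whose continuation lines are being read
    for line in lines:
        if line.startswith("ORIGIN"):
            current = "ORIGIN"
            continue
        if current == "ORIGIN":
            # Sequence lines carry a position counter and spaces; keep letters only.
            fields["sequence"] += re.sub(r"[^A-Za-z]", "", line)
            continue
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:
            current = keyword
            if keyword == "ORGANISM":
                fields["organism"] = value  # first line only; lineage lines follow
            elif keyword == "JOURNAL":
                fields["journal"] = value
        elif current == "JOURNAL":
            fields["journal"] += " " + value  # continuation, e.g. applicant name
    return fields

# Invented toy record (not a real GenBank entry), laid out in flat-file columns.
TOY = "\n".join([
    "LOCUS       TOY0001                   12 bp    DNA     linear   PAT 01-JAN-2000",
    "  ORGANISM  Examplea marina",
    "            Bacteria; Exampleales.",
    "REFERENCE   1  (bases 1 to 12)",
    "  JOURNAL   Patent: XX 0000000-A 1 01-JAN-2000;",
    "            EXAMPLE CO LTD",
    "ORIGIN",
    "        1 atggctgcta aa",
    "//",
])

records = [parse_record(r) for r in split_records(TOY)]
```

In this sketch the JOURNAL string would still need to be split into patent system, application number, date and applicant name; that step depends on formatting conventions not described here and is left out.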
As of November 2022, sequences from a total of 14,708 different species were included in the GenBank database. To determine the subset of “marine species” within the database, the taxon match tool of the World Register of Marine Species (WoRMS) was used for all database entries, resulting in a filtered list of 4,000 species. Web searches were conducted for each of these species to verify its marine origin and to collect further information about its nature. More than half of the matched species were subsequently excluded as not exclusive to marine environments, resulting in a final list of 1,474 marine species, which was used to select patent records associated with marine species. See Blasiak et al. 2018 for details of marine origin determination and criteria for filtering.
The taxonomy (Domain and Phylum) of 879 marine species was retrieved from the WoRMS database. In cases where these taxonomic levels were not available, we obtained species taxonomy from the NCBI Taxonomy database (https://www.ncbi.nlm.nih.gov/taxonomy) and Wikipedia (https://en.wikipedia.org/wiki/) (220 and 356 species, respectively). We did not succeed in assigning 19 of the marine species (predominantly marine bacterial strains) to taxonomic groups because of uncertainty in organism names. The complete list of marine species selected for this study is given in Table S1.
MABPAT construction
Marine biotechnology pipelines usually focus on the search for biological compounds that encode new functionality (Rotter et al., 2021). There are two types of nucleotide sequences encoded in DNA: protein-coding sequences and non-coding sequences. The former are DNA fragments that code for proteins involved in all cell functions, whereas the latter may or may not have a functional role in genome regulation. Except for short peptides such as cone snail peptide toxins (Terlau and Olivera, 2004), most natural products are derived from proteins, which are polypeptide chains of a certain length. Although the shortest polypeptide chain length that can form a protein is still debated, it is currently estimated at between 50 (Woolfson, Baker, and Bartlett, 2017) and 100 (Brunet, Leblanc, and Roucou, 2020) amino acids, corresponding to 150 and 300 DNA base pairs, respectively, because each amino acid is encoded by a three-nucleotide codon.
Another metric widely used in molecular biology and genomics to analyse variation in genome composition is nucleotide usage, which is normally calculated as GC content: the percentage of bases in a DNA strand that are guanine or cytosine, which form stronger base pairs than adenine and thymine. Modern genetic engineering techniques such as CRISPR (Zhang et al., 2014) have proven very useful for enhancing important functions of proteins by altering DNA makeup. This can involve changing individual nucleotides, or introducing short sequences that control gene regulation and protein synthesis. Hence, GC content for modified proteins with similar functionality remains largely the same. Short DNA sequences, below the shortest DNA length required for protein formation, have various functions, including the amplification of a specific gene sequence (as PCR primers), and usually have a wide range of GC content.
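As a worked illustration of the GC-content metric, the following Python sketch computes the percentage of guanine and cytosine bases in a sequence. The function name and the choice to ignore ambiguous symbols such as N are assumptions for this example, not details taken from the study.

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of G and C among the unambiguous A/C/G/T bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / total
```

For example, gc_content("ATGC") gives 50.0, since two of the four bases are G or C.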
To predict whether genetic sequences are protein-coding or not, we applied two filtering criteria: a sequence length threshold and the presence of an open reading frame (ORF), a gene region that has the potential to be transcribed into RNA and subsequently translated into protein. Sequences with an ORF longer than 150 base pairs were considered protein-coding sequences. As most natural products are derived from proteins, we reasoned that at least one protein-coding sequence has to be included in a patent application for it to be related to marine bioprospecting. On this basis, we selected 12,716 protein-coding sequences associated with marine species, together with 7,357 non-protein-coding sequences that were submitted as part of the same applications.
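The ORF criterion can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually used: it assumes an ORF runs from an ATG start codon to the first in-frame stop codon, searches all six reading frames, counts the stop codon in the ORF length, and ignores ORFs without a stop codon. The function names and the min_orf_bp parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def longest_orf_length(sequence):
    """Length in bp of the longest ATG-to-stop ORF over six frames (stop included)."""
    seq = sequence.upper()
    best = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None  # index of the first ATG of the ORF being read
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    best = max(best, i + 3 - start)
                    start = None
    return best

def is_protein_coding(sequence, min_orf_bp=150):
    """Apply the 'ORF longer than 150 bp' rule (threshold is a parameter)."""
    return longest_orf_length(sequence) > min_orf_bp
```

A sequence made of ATG, fifty GCT codons and a TAA stop has a 156 bp ORF and is classified as protein-coding, whereas a typical 20 bp PCR primer is not.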
For all companies that have registered patents associated with MGR, we counted the total number of nucleotide sequences and calculated the average sequence length (Figure 2). Based on the estimated shortest protein length, the numbers of protein-coding and non-coding sequences were determined for each company. In each category, for the 10 companies with the highest counts of genetic sequences attached to patent claims, we calculated the length and DNA composition (GC content) of each sequence, and colored each sequence by species of origin (Figure S2).
For each sequence that was included both in patent applications submitted in national jurisdictions and in “international” patents (sequences of special commercial interest), we collected the description of the invention and, where a nucleotide sequence search (BlastX) resulted in a significant match to a protein with annotated function, the protein function. Web searches were conducted for each of these proteins to collect further information about protein function and potential application. The resulting information about the sequences of special commercial interest is available in Table S3.
For patents owned by subsidiaries, applicant names were replaced with the name of the ultimate owner as stated in the Orbis company database, which contains information on around 400 million companies worldwide (Orbis; https://orbis.bvdinfo.com/). For jointly owned patents, ownership was assigned to the first company on the list. After filtering, removing duplicate entity names, and aggregating subsidiaries, a total of 588 applicants were identified, and web searches were used to collect information about each, including the country where it is headquartered and the type of entity that it represents. Our classification resulted in five major entity types: multinational companies (presence in more than two countries), national companies, universities and their commercialization centers, governmental agencies, and “other” (predominantly applications submitted by private individuals).
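The two ownership rules (first-listed applicant for joint patents, then replacement by the ultimate owner) can be sketched in Python. The lookup table stands in for information taken from Orbis; its structure, the name normalisation, and all company names below are invented for illustration.

```python
def normalise(name):
    """Upper-case a name and collapse whitespace so spelling variants match."""
    return " ".join(name.upper().split())

def assign_owner(applicants, ultimate_owner):
    """Return the owner of a patent.

    applicants: applicant names in the order listed on the patent.
    ultimate_owner: dict mapping a normalised subsidiary name to its
        ultimate owner (hypothetical stand-in for an Orbis lookup).
    """
    first = normalise(applicants[0])  # joint patents: first-listed company
    return ultimate_owner.get(first, first)

# Invented example names.
lookup = {"EXAMPLE MARINE GMBH": "EXAMPLE HOLDINGS"}
owner = assign_owner(["Example Marine  GmbH", "Other University"], lookup)
```

Here owner is "EXAMPLE HOLDINGS"; an applicant absent from the lookup is kept under its own normalised name.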
Each record in the resulting MABPAT database includes: (1) the genetic sequence data, (2) whether the sequence contains protein-coding information, (3) the marine species name, (4) the date of deposition in the INSDC, (5) whether the species can be classified as a “deep-sea” species, (6) the patent application record number, (7) the patent bureau, (8) the applicant name, (9) the type of entity, and (10) the country where it is headquartered.
Deep-sea presence of marine species
The presence of species in deep-sea habitats was established from multiple sources. For species in the Eukarya domain of life, we used the World Register of Deep-Sea species, a taxonomic database of deep-sea species (WoRDSS; Glover et al. 2022). As Bacteria and Archaea species are not present in WoRDSS, we searched the PubMed (https://pubmed.ncbi.nlm.nih.gov/) and Integrated Microbial Genomes and Microbiomes (https://img.jgi.doe.gov/) databases to establish their potential presence in deep-sea habitats. Species collected from deep-sea environments that have already been found to be associated with international patent applications (Blasiak et al. 2018) are also marked as “deep-sea” species.
BlastX sequence similarity model
Sequence similarity models are widely used to identify newly sequenced data or unknown species (Pearson, 2013). We queried 7,467,396 sequences with unknown taxonomic origin (species tagged ‘unknown’, ‘unidentified’, or ‘synthetic construct’; 62.7% of all GenBank records) in sequence similarity BlastX searches (translated nucleotide versus protein) against the database of annotated protein sequences (UniProtKB/Swiss-Prot; UniProt Consortium 2023). BlastX searches with the following parameters (E-value less than or equal to 10^-5, query coverage greater than or equal to 80%, hit identity greater than or equal to 99%) verified that 24% could be identified to genus level with at least 95% confidence (correct hits) (Figure S4A). We also tested whether correct hits and searches with confidence below 95% tend to be included in certain patent applications, patented by certain actors, or filed in certain patent systems, and found no major preference (Figure S4B-D). Finally, we compared summary statistics (number of sequences, number of patents, and median year of application) for the 10 largest patent applicants that referenced sequences with disclosed marine origin and the 10 largest applicants that referenced sequences with predicted marine origin (Figure S1 and Figure S5, respectively), and found them to be similar.
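The three search thresholds can be expressed as a simple filter over hit records. The Python sketch below is illustrative only: the Hit structure, its field names, and the rule of keeping the lowest-E-value hit per query are assumptions, and the example hits are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    query_id: str
    subject_id: str
    evalue: float
    query_coverage: float  # percent of the query covered by the alignment
    identity: float        # percent identical positions in the alignment

def passes(hit, max_evalue=1e-5, min_coverage=80.0, min_identity=99.0):
    """True if the hit meets all three thresholds stated in the text."""
    return (hit.evalue <= max_evalue
            and hit.query_coverage >= min_coverage
            and hit.identity >= min_identity)

def best_passing_hits(hits):
    """Keep, for each query, the passing hit with the smallest E-value."""
    best = {}
    for hit in hits:
        if not passes(hit):
            continue
        kept = best.get(hit.query_id)
        if kept is None or hit.evalue < kept.evalue:
            best[hit.query_id] = hit
    return best

# Invented hits for illustration.
hits = [
    Hit("q1", "P1", 1e-30, 95.0, 99.5),   # passes, best for q1
    Hit("q1", "P2", 1e-10, 90.0, 99.0),   # passes, but weaker E-value
    Hit("q2", "P3", 1e-3, 95.0, 100.0),   # fails the E-value threshold
    Hit("q3", "P4", 1e-50, 60.0, 100.0),  # fails the coverage threshold
]
selected = best_passing_hits(hits)
```

Only query q1 retains a hit (subject P1) in this example; q2 and q3 would remain of unknown origin.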
Patent share estimation
The total number of sequences associated with marine species was estimated based on the number of records that included a species name with disclosed marine origin, and newly recovered records for each company. Predictions were made based on a linear regression model:
where X_observed is the number of unique genetic sequences referenced in patent applications with a species name specified, X_recovered is the number of unique genetic sequences referenced in patent applications for recovered data, and FDL_TP>95% is the fraction of data loss given the chosen set of BlastX search parameters that yielded a 95% positive recovery rate (Figure S4). Following parameterization of our model, we found that FDL_TP>95% was 0.24.
To predict the total number of patent applications and species referenced, for each company we randomly selected X_recovered divided by FDL_TP>95% sequences, to account for searches that did not result in correct hits. We then estimated how many unique patent application numbers and unique species were referenced in the total number of predicted sequences (X_predicted). To calculate the mean value for each company, the selection was repeated 100 times. The resulting average estimates are shown in Figure S6.
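The repeated random selection can be sketched in Python under several assumptions that the text does not settle: that the number of sequences drawn is X_recovered divided by FDL_TP>95% (rounded), that draws are made without replacement from a per-company pool of candidate records, and that each record carries a patent application number and a species name. All function and parameter names are hypothetical.

```python
import random
from statistics import mean

def predicted_sample_size(x_recovered, fdl=0.24):
    """Scale the recovered count up by the recovery fraction (assumed form)."""
    return round(x_recovered / fdl)

def mean_unique_counts(pool, x_recovered, fdl=0.24, repeats=100, seed=0):
    """Average number of unique patents and species over repeated draws.

    pool: list of (patent_number, species) tuples for one company
        (what the pool contains is an assumption of this sketch).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = min(predicted_sample_size(x_recovered, fdl), len(pool))
    patents, species = [], []
    for _ in range(repeats):
        sample = rng.sample(pool, n)  # sampling without replacement
        patents.append(len({p for p, _ in sample}))
        species.append(len({s for _, s in sample}))
    return mean(patents), mean(species)
```

With X_recovered = 24 and the reported value of 0.24, the sketch draws 100 records per repetition.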
Hydrothermal vents presence and ABNJ-unique species counts
The geolocation of hydrothermal vents was collected from the InterRidge Vents Database (Beaulieu and Szafranski, 2022). The maritime boundaries map of World High Seas (World_High_Seas_v1_20200826/High_Seas_v1.shp) was downloaded from Marine Regions (https://marineregions.org/). Each set of hydrothermal vent coordinates was checked for presence within any of the High Seas polygons. Spatial vector data were analysed with the R package sf version 1.0-9 (Pebesma, 2018).
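The containment check was carried out with the R package sf; the idea can nevertheless be illustrated in Python with a standard ray-casting test. This sketch treats coordinates as planar and each polygon as a single ring without holes, which is a simplification of real High Seas geometries. The function names and the example polygon are invented.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices of one ring."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # the ring is closed implicitly
        if (y1 > lat) != (y2 > lat):   # edge crosses the horizontal ray
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def in_any_polygon(lon, lat, polygons):
    """True if the point falls inside at least one of the polygons."""
    return any(point_in_polygon(lon, lat, p) for p in polygons)

# Invented square "High Seas" polygon and two vent positions.
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
vent_inside = in_any_polygon(5.0, 5.0, [square])
vent_outside = in_any_polygon(15.0, 5.0, [square])
```

A vent at (5, 5) is flagged as lying within the polygon, while one at (15, 5) is not.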
To establish the list of species uniquely present in ABNJ, we used species geographical abundance data from OBIS. We first retrieved all 28,375 species with at least one occurrence record in ABNJ (https://obis.org/area/1). For each ABNJ-present species, we checked whether it was also observed in the territorial waters of any country; species with at least one such occurrence record were excluded. Data were obtained from the OBIS database (2022) using the R packages robis version 2.11.0 (https://zenodo.org/record/6969395) and parallel version 3.6.2 (https://rdocumentation.org/packages/parallel/versions/3.6.2).
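The exclusion rule amounts to keeping only species with no occurrence record in any country's territorial waters. A minimal Python sketch follows; the input structures and species names are hypothetical and do not reflect the robis output format.

```python
def abnj_unique_species(abnj_species, territorial_counts):
    """Return ABNJ species with zero occurrence records in territorial waters.

    abnj_species: iterable of species names observed in ABNJ.
    territorial_counts: dict mapping species name to the number of
        occurrence records in territorial waters (missing means zero).
    """
    return sorted(s for s in set(abnj_species)
                  if territorial_counts.get(s, 0) == 0)

# Invented example data.
unique = abnj_unique_species(
    ["Species a", "Species b", "Species c"],
    {"Species a": 3, "Species c": 0},
)
```

In this example "Species a" is excluded because it has three territorial records, leaving "Species b" and "Species c".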
References for Methods
1. Arnaud-Haond, S., Arrieta, J. M., and Duarte, C. M. Marine Biodiversity and Gene Patents. Science, 331, 1521–1522 (2011).
2. Blasiak, R., et al. Corporate control and global governance of marine genetic resources. Science Advances, 4, (2018).
3. Rotter, A., et al. The Essentials of Marine Biotechnology. Front Mar Sci, 8, (2021).
4. Terlau, H., and Olivera, B. M. Conus Venoms: A Rich Source of Novel Ion Channel-Targeted Peptides. Physiological Reviews, 84, 41–68 (2004).
5. Woolfson, D. N., et al. How do miniproteins fold? Science, 357, 133–134 (2017).
6. Brunet, M. A., Leblanc, S., and Roucou, X. Reconsidering proteomic diversity with functional investigation of small ORFs and alternative ORFs. Experimental Cell Research, 393, 112057 (2020).
7. Zhang, F., Wen, Y., and Guo, X. CRISPR/Cas9 for genome editing: progress, implications and challenges. Human Molecular Genetics, 23, 40–46 (2014).
8. The UniProt Consortium et al. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Research, 51, 523–531 (2023).
9. Glover, A.G., Higgs, N., and Horton, T. (2023). World Register of Deep-Sea species (WoRDSS). Accessed at https://www.marinespecies.org/deepsea on 2022-11-15. doi:10.14284/352
10. Jefferson, O. A., et al. Public disclosure of biological sequences in global patent practice. World Patent Information, 43, 12–24 (2015).
11. Beaulieu, S.E., and Szafranski, K. (2020) InterRidge Global Database of Active Submarine Hydrothermal Vent Fields, Version 3.4. World Wide Web electronic publication available from http://vents-data.interridge.org. Accessed 2022-11-15.
12. Pebesma, E. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal, 10, 439 (2018).
13. Pearson, W. R. An Introduction to Sequence Similarity (“Homology”) Searching. Curr Protoc Bioinformatics, Chapter 3:3.1.1-3.1.8 (2013).