Random Sampling
In the present communication, the number of subsets is fixed at N = 10 and the size of each subset is varied from 100 to 10,000 (D = 100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000 or 10,000). The random assembly of the subsets was performed as described in the Supplementary Material.
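The sampling design can be illustrated with a minimal sketch. It assumes that each subset is drawn without replacement and that the N subsets are drawn independently of each other; the actual procedure is the one given in the Supplementary Material. The names `draw_subsets` and `chain_ids` are hypothetical, and the identifiers are placeholders rather than real PDB entries.

```python
import random

def draw_subsets(chain_ids, d, n=10, seed=0):
    """Draw n random subsets of size d from chain_ids.

    Assumption: each subset is sampled without replacement, and the
    subsets are drawn independently, so the same chain may appear in
    more than one subset.
    """
    rng = random.Random(seed)
    return [rng.sample(chain_ids, d) for _ in range(n)]

# Hypothetical identifiers standing in for PDB protein chains.
chain_ids = [f"chain_{i}" for i in range(20000)]
subsets = draw_subsets(chain_ids, d=100)
```

Repeating the call for each value of D gives the series of subset collections examined in the following sections.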
F stability
A series of features F were examined, some related to sequence (the length of the protein chain defined by its number of residues, and the amino acid composition) and others related to structure (the crystallographic resolution and the secondary structure composition).
Figure 1 shows how the length of the protein chain depends on D. As expected, if D is small (for example D = 100), the average protein length is quite variable amongst the N = 10 subsets. It ranges from 226.5 to 267.9, with an average value of 245.1 and an estimated standard error of 4.0. In contrast, if D is larger (for example D = 10000), the average protein length is much less variable amongst the N = 10 subsets. The difference between the maximal (250.3) and minimal (244.9) values is much smaller, and the average value (247.8) is associated with a much smaller estimated standard error (0.5).
The difference between the maximal and minimal values of protein length decreases as D increases, up to D = 6000. It goes from 41.4 to 27.2, 22.1, 10.6, 7.9, 7.1, 7.7 and 6.4 when D goes from 100 to 500, 1000, 2000, 3000, 4000, 5000 and 6000. It then remains close to 5 and nearly constant for larger D values (4.9 for D = 7000, 5.0 for D = 8000, 4.9 for D = 9000 and 5.4 for D = 10000).
The estimated standard errors also decrease as D increases, going from 4.0 for D = 100 to 0.74 for D = 6000. For larger D values, they are nearly constant (0.57 for D = 7000, 0.61 for D = 8000, 0.52 for D = 9000 and 0.52 for D = 10000).
Note that both the average protein length and its standard error are computed on ten average values, each of which is computed on one of the ten subsets of size D. The decrease of the standard errors is thus not due to an increase in the number of data points used to compute them.
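The statistics described above can be sketched as follows. The sketch assumes that the estimated standard error is the sample standard deviation of the subset means divided by the square root of the number of subsets; the text does not state the estimator explicitly. The function name `summarize_subset_means` is hypothetical.

```python
import math
import statistics

def summarize_subset_means(subsets_values):
    """Summarize a feature F across subsets.

    subsets_values: one list of feature values (e.g. chain lengths)
    per subset. Returns (mean of subset means, standard error of the
    subset means, max minus min of the subset means).
    """
    means = [statistics.fmean(values) for values in subsets_values]
    grand_mean = statistics.fmean(means)
    # Assumed estimator: sample standard deviation / sqrt(number of subsets).
    std_err = statistics.stdev(means) / math.sqrt(len(means))
    spread = max(means) - min(means)
    return grand_mean, std_err, spread
```

Because the number of subset means stays at ten whatever the value of D, the divisor in the standard error does not change with D; only the scatter of the subset means does.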
This trend is not limited to the case where F is the length of the protein chain. Figures S1-S3 (Supplementary Material) show the dependence on D of other features: the crystallographic resolution, the secondary structure composition, and the amino acid composition.
By visual inspection it is possible to estimate that, on average, the divergence between the estimations of F in different subsets of the PDB decreases up to D = 6000 and becomes stable for larger subsets with D ≥ 7000.
It is thus reasonable to deduce that subsets of the PDB containing 7000 protein chains are large enough for sampling the corpus of protein structures deposited in the PDB.
Internal redundancy
It is now necessary to check the level of redundancy in randomly selected subsets of the PDB that contain 7000 protein chains.
Given that all-against-all sequence alignments would have been too expensive (nearly 25 million alignments for each of the ten PDB subsets), only 10000 randomly selected alignments were considered for each subset.
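The size of the all-against-all problem, and one way of drawing a random sample of pairs from it, can be sketched as follows. The sketch assumes that the sampled pairs are distinct and unordered; the text does not say whether pairs were drawn with or without replacement. The function name `sample_pairs` is hypothetical.

```python
import math
import random

def sample_pairs(n_items, n_pairs, seed=0):
    """Sample n_pairs distinct unordered index pairs (i, j) with i < j."""
    if n_pairs > math.comb(n_items, 2):
        raise ValueError("more pairs requested than exist")
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        i, j = rng.sample(range(n_items), 2)  # two different indices
        pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)

# Number of unordered pairs among 7000 chains: 7000 * 6999 / 2.
total_pairs = math.comb(7000, 2)
pairs = sample_pairs(7000, 10000)
```

With 7000 chains there are 24,496,500 unordered pairs, so 10000 sampled alignments cover about 0.04% of them.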
The average redundancy is very small. Amongst the ten PDB subsets, the average pairwise percentage of sequence identity ranges from 9.74% to 11.18%, with a mean value of 10.52% (estimated error 0.06%).
The percentage of sequence pairs with a percentage of sequence identity larger than 40% is also small (0.06%).
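The two redundancy measures above can be illustrated with a minimal sketch that works on already aligned sequences. It assumes that percentage identity is computed over the aligned columns in which neither sequence has a gap; the text does not specify the denominator, and other conventions give different values. The function names and the gap character are hypothetical choices.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Assumption: the denominator is the number of gap-free aligned columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)

def percent_above(identities, threshold=40.0):
    """Percentage of pairs whose identity exceeds the threshold."""
    return 100.0 * sum(x > threshold for x in identities) / len(identities)
```

Applied to the sampled pairs of one subset, the mean of `percent_identity` corresponds to the average pairwise identity, and `percent_above` to the share of pairs above the 40% threshold.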
This clearly shows that randomly built PDB subsets containing 7000 protein chains are sufficiently small to avoid excessive internal redundancy.
Furthermore, it is necessary to check how frequently the same protein sequence is randomly selected in two or more of the 10 PDB subsets. In other words, it is necessary to evaluate the degree of overlap between the subsets. This is a relevant problem, since there are 70000 entries in ten groups of 7000 entries and this represents a considerable fraction (about 40%) of the whole PDB.
The overlap between two groups, each containing D protein sequences, is evaluated by removing duplicate sequences from one of them. As a consequence, the two groups might contain R and R' protein sequences, with R ≤ D and R' ≤ D. The expression NOVP = 100 min(R,R')/D is equal to 100 if no overlap occurs and lower than 100 in case of overlap. Here NOVP ranges from 66 to 100 amongst all unique pairs of the ten PDB random subsets, with an average value of 81 (standard error 3). This indicates that it is uncommon that the same structure is selected twice in ten PDB subsets of 7000 protein chains.
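The NOVP expression can be sketched in code under one possible reading of the procedure: R is the number of sequences left in the first group after removing those also present in the second, and R' is the same count with the roles of the two groups swapped. This reading, and the function names `novp` and `novp_all_pairs`, are assumptions made for illustration.

```python
from itertools import combinations

def novp(group_a, group_b):
    """NOVP = 100 * min(R, R') / D for two groups of D sequences.

    Assumed reading: R (R') counts the sequences of group_a (group_b)
    that do not occur in the other group.
    """
    if len(group_a) != len(group_b):
        raise ValueError("both groups must contain D sequences")
    d = len(group_a)
    set_a, set_b = set(group_a), set(group_b)
    r = sum(seq not in set_b for seq in group_a)
    r_prime = sum(seq not in set_a for seq in group_b)
    return 100.0 * min(r, r_prime) / d

def novp_all_pairs(groups):
    """NOVP for every unique pair of groups (45 pairs for ten groups)."""
    return [novp(a, b) for a, b in combinations(groups, 2)]

# Toy example with hypothetical sequences: one shared sequence out of four.
a = ["AAA", "CCC", "DDD", "EEE"]
b = ["AAA", "FFF", "GGG", "HHH"]
value = novp(a, b)
```

In the toy example three of the four sequences in each group are not shared, so NOVP is 75; two groups with no sequence in common give 100.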