We will discuss use of local invariants, then proposed nonlocal invariants and improvements of refinement procedure which can resolve some ambiguities in Weininger’s method of corresponding primes [3]. After that procedure of obtaining canonical absolute SMILES (with chirality) will be described.
Local invariants. Degree, atomic number, and bond type are used as local invariants of the atom in the original Morgan algorithm [7]. Weininger proposed to use the number of connections, number of non-hydrogen bonds (degree), atomic number, the sign of charge, absolute charge and number of attached hydrogens as local invariants [3]. For the full representation of local invariants of an atom, we can classify them as properties of atom proper (atomic number – count of protons in atomic nuclei, atomic mass – the sum of counts of protons and neutrons in atomic nuclei, charge – count of protons in atomic nuclei minus count of electrons of atom) and properties of the nearest neighborhood of the atom (degree – count of explicit connections, count of attached hydrogens, connectivity – total count of connections, valence – the sum of bond orders of all connections). For using as an initial rank of the atom, we must combine all local invariants in the atomic vector in some order. Weininger has pointed out [3] that a terminal atom is preferred to start of traversal of the molecular graph (if it is possible). From this reason, degree of the atom must be at the start of a combined atomic vector (1 character). Next positions consecutively: atomic number (3 characters), count of attached hydrogens (1 character), the sign of charge (1 character: ‘0’ – if no charge, ‘1’ – if the charge is positive, ‘2’ – if the charge is negative), absolute charge (1 character), connectivity (1 character), valence (1 character), atomic mass (3 characters, ‘000’ if unspecified). The atomic vector of local invariants has 12 characters.
Chirality invariant. In SMILES, a designation of chirality depends on the enumeration order of atoms, and from this reason, simple parity symbols ‘@’ and ‘@@’ cannot be used as invariant because it’s meaning change with transposition of atoms. But we can determine the order of atoms around the chiral center relative to symmetry classes of its neighbors [14, 15]. To determine this order we should use extended connectivity algorithm to atomic invariants (local and ring nonlocal described further) without chirality designation and obtain symmetry classes for non-chiral structure. For all atoms with no chiral parity chirality invariant equal to 0 (by definition). If two or more neighbors of the atom with chiral parity have the same symmetry class then chirality invariant of this atom equals to 0 also. If the order of symmetry classes can be obtained by an even number of swaps from direct ascending order of these numbers then chirality invariant equal to count of the sign ‘@’ in parity, and otherwise chirality invariant equal to 3 - count of the sign ‘@’ in parity. Such a definition of chiral invariant gives us the truly invariant property of chiral atom which is not depending on the order of atoms in SMILES because symmetry classes of atoms are independent of its order. This invariant can be concatenated to the atomic vector of each atom (1 character). After concatenation of chiral invariant, we must recalculate all atom ranks in structure.
Thus complete atomic vector of each atom must contain 13 characters: 12 for local invariants and 1 character for chirality invariant. Since all characters of the atomic vector are digits then we can transform the atomic vector to the number.
Ring invariant of atom. If we use only traditional local invariants for each atom of the molecular graph from Fig. 1 we will obtain only two symmetry classes after refinement procedure of extended connectivity algorithm: one for atoms with degree = 3 and another one for atoms with degree = 2. But obviously, there are three symmetry classes in this structure. How do we know that? Because eight atoms with degree = 2 belong to 6-membered rings, but another four atoms with degree = 2 do not. This very simple observation gives us obvious nonlocal invariant for each ring atom – ring size of the smallest ring to which this atom belongs (for non-ring atom this invariant = 0 by definition). This value is invariant, i.e. it does not depend on the order of atoms in the molecular graph. Now, for molecular graph from Fig. 1 for eight atoms with degree = 2 this invariant = 6, and for another four atoms with degree = 2 this invariant = 12. Taking into account local invariants of each atom we obtain three symmetry classes for atoms of this molecular graph at once and refinement procedure do not change this.
But if we look at the structure in Fig. 2, we see that this invariant is insufficient because all atoms belong to 3-membered rings and therefore have the same ring size of the smallest ring to which this atom belongs.
To solve this problem, we can first calculate invariant for each bond – the size of the smallest ring to which it belongs. We can use a breadth-first search (BFS) from the first atom which is incident to this bond to second atom which is incident to this bond for obtaining the shortest non-trivial path (if such path does not exist then this is non-ring bond). After obtaining this ring invariant for each bond we can choose for each atom minimal and maximal values of ring sizes of the bonds which are incident to this atom. The minimal ring size of the bonds which are incident to the atom will be ring size of the smallest ring to which this atom belongs, but maximal will be maximum of minimal ring sizes of the bonds to which this atom is incident. Both of these values will be invariants of each atom. For the molecular graph in Fig. 2, four atoms have both these invariants equal to 3, but four another have minimal ring size equals to 3, but maximal ring size equals to 6. The minimal and maximal ring sizes of bonds which are incident to the atom in combination with local invariants are sufficient to distinguish symmetry classes of atoms in all known to us ‘pathological’ molecular graphs but to avoid any possible ambiguity we propose to use the product of corresponding primes for ring size of each bond of the atom as unambiguous ring invariant of the atom (contribution for a non-ring bond to the product will be equal to 1). As an example, for the molecular graph in Fig. 2 this ring invariant will be 5*5*5 = 125 for the atoms with the maximum ring size of bonds equals to 3 and 5*5*13 = 325 for the atoms with the maximum ring size of bonds equals to 6 (since 5 is third prime and 13 is the sixth prime number). Ring invariant will be unambiguous because reverse factorization of it gives us such invariants as the count of ring bonds of atom and the minimal sizes of rings to which all bonds of atom belong.
Now, we have a 13-digit atomic vector and numeric ring invariant for each atom. The product of the numeric value of the atomic vector and ring invariant will be the initial rank of the atom in further refinement step of extended connectivity algorithm.
Refinement procedure. After obtaining initial ranks for all atoms we can rank them for obtaining consecutive small integer ranks, and this manipulation does not change their relative order. Further manipulation, in general, corresponds to the CANON algorithm described by Weininger et al [3], but some changes in this algorithm are necessary. If you look on simple structure in Fig. 3 then you see that after obtaining initial atomic vectors, transforming them to numbers and ranking we obtain four symmetry classes. If we follow the original description of the CANON algorithm then after the first iteration we obtain products of corresponding primes for atoms: for CH3 – 7, CH – 2*3*5 = 30, CH2 – 5*7 = 35, O – 3*7 = 21. After ranking of these integers we obtain new ranks:
As you can see there is rotation of ranks in the ring. This is an infinite loop of ranks rotations that cannot give us stable numbers for symmetry classes for atoms in such type of structures. The first idea for solving this problem is to multiply the corresponding prime of the current rank of each atom with the product of corresponding primes for ranks of its neighbors. Try this approach, products for first iteration: CH3 – 2*7 = 14, CH – 7*2*3*5 = 210, CH2 – 3*5*7 = 105, O – 5*3*7 = 105. After ranking we obtain that now O and CH2 have the same rank, although previously they had different ranks! This issue arises because the products of primes are an unambiguous function only if we do not consider transposition of number in the product, but in 3-membered rings this fact led to the leveling of previously different ranks. A solution of this problem for 3-membered rings is to multiply the square of the corresponding prime of the current rank of each atom with the product of corresponding primes for ranks of its neighbors. Try this approach, products for first iteration: CH3 – (2*2)*7 = 28, CH – (7*7)*2*3*5 = 1470, CH2 – (3*3)*5*7 = 315, O – (5*5)*3*7 = 525. After ranking: CH3 – 1, CH – 4, CH2 – 2, O – 3. As you can see ranks are stable after such calculation for this structure. It can be easily proved that such a method of obtaining new ranks from previous has never led to leveling or swapping of previously different ranks in 3-membered rings. But for structure in Fig. 4 squaring of the corresponding prime of the current rank do not help us: after the first round of the refinement, we have previously different ranks leveled.
If we try to use the third power of the corresponding prime of current rank for this structure then there is no leveling. Generally speaking, this issue arises when we have two or more neighbors of the atom with the same rank. Since in molecular graphs we never have atom degree higher than 6 (structures like SF6), we propose to use the eighth power of the corresponding prime of current rank to guarantee that leveling or swapping of ranks in structure never occurs on refinement step (since 8 = 23, the eighth power can be calculated quickly).
Another issue with the original CANON algorithm is that it does not take into account bond orders between atoms. For example, it determines only four symmetry classes for structure in Fig. 5 while there are five.
This fact has led to arbitrary choosing between atoms and generation of different SMILES by traversal of molecular graph. We propose to solve this problem by multiplying the eighth power of the corresponding prime of current rank with the corresponding primes for ranks of its neighbors in power equal to the order of bond which connects current atom with the neighbor.
Pseudocode of Modified CANON algorithm:
Set atomic ranks to initial invariants;
Rank them to obtaining consecutive small integer ranks;
Do
Set for each atom the new value of rank as a product of the eighth power of the corresponding prime of current rank with the corresponding primes for ranks of its neighbors in the power equal to the order of bond which connects current atom with the neighbor;
Rank them to obtaining consecutive small integer ranks;
Loop until no invariant partitioning observed;
The Modified CANON algorithm terminates when the count number of different values of ranks stop to increase from the previous step of the algorithm. After its termination we must have the same ranks only on symmetrically equivalent atoms, i.e. ranks of atoms are their symmetry classes [3]. These symmetry classes can be used for calculation of chirality invariant. For acyclic and simple cyclic structures they can to be used also directly for further generation of canonicalized SMILES. If after use of the Modified CANON algorithm the count of various ranks in the molecular graph is less than the count of atoms, then to obtain unambiguous order of molecular graph traversal we need to make breaking ties procedure, as Weininger et al pointed out [3]. Our method for this is the same as described: doubling all ranks and reducing the value of the atom, which is tied, by one. The set is then treated as a new invariant set, and the previous algorithm for generating an invariant partitioning is repeated. But Weininger et al assumption that “the double-and-tie-break step does not introduce ambiguity into the ordering since only otherwise equivalent atoms will be tied at any point” can be wrong in the general case. Ivanciuc pointed out [1]: “Two vertices from different atom invariant class cannot be automorphic, while two vertices from the same atom invariant class are not necessarily automorphic. Despite numerous efforts, no vertex graph invariant is known which is sufficient to establish the automorphism partitioning, because for certain graphs non-automorphic vertices are partitioned in the same class.” For this reason we suggest that only a rigorous way to produce canonical ordering for any chemical graph is to generate permutations for all atoms with the same atom invariant class, obtain SMILES for each permutation by traversal and select lexicographically minimal (or maximal) [1]. Our approach to doing this is: after obtaining symmetry classes choose ring atoms with maximal tied ranks, make tiebreaking for each of them, refine each partition by the Modified CANON algorithm, and save it. If these new partitions have atoms with tied ranks then repeat tiebreaking for each of them as long as ring atoms with tied ranks exist in partitions. For all partitions with fully partitioned ranks for ring atoms generate SMILES by traversal and select lexicographically minimal as canonical SMILES for this molecular graph.
After establishing of canonical order of atoms by the Modified CANON algorithm with, possibly, the tiebreaking further procedure to obtain canonical SMILES corresponds described by Weininger et al [3] (procedure GENES): the structure is treated as a tree and a SMILES string is generated that corresponds to a depth-first search (DFS) of that tree. The lowest canonically numbered atom is chosen as the starting point. This atom becomes the root of a tree for a subsequent DFS. Branching decisions are making by directs branching toward the lowest labeled atom at the fork in the branch. For cyclic structures, Weininger et al use a two-pass method: on the first pass, the DFS algorithm detects which bonds will become ring bonds (chords) and on the second pass, the ring closure indicators (digits) could be appended to the chords node symbols.
After the second pass of the DFS algorithm, there is full information about the canonical order of atoms in the neighborhood of chiral centers. We use the third pass of the DFS algorithm to obtain canonical SMILES with correct parity (absolute SMILES): if canonical order of atoms on the second pass of the DFS algorithm can be obtained by even number of swaps from the original order of atoms in the molecular graph then parity symbol (‘@’ or ‘@@’) is preserved, if not – then parity symbol is inverted. We cannot use order relatively canonical ranks for correct designation of chirality because of the existence of structures that have only relative configurations and have no chiral centers (structure in Fig. 6 for example).
For such structure, all chirality invariants equal to 0 and that led to two possible traversals of this molecular graph. Previously described tiebreaking procedure allows us to avoid this ambiguity by the selection of lexicographically minimal SMILES as canonical.