Robust data storage in DNA by de Bruijn graph-based decoding


 Data storage in DNA, which store information in polymers, is a potential technology with high density and long-term features. However, the indels, strand rearrangements, and strand breaks that emerged during synthesis, amplification, sequencing, and storage of DNA molecules need to be handled. Here, we report a de Bruijn graph-based, greedy path search algorithm (DBG-GPS), which can efficiently handle all these issues by efficient reconstruction of the DNA strands. DBG-GPS achieves accurate data recovery with low-quality, deep error-prone PCR products, and accelerated aged DNA samples (solution, 70℃ for two weeks). The robustness of DBG-GPS was verified with 100 times of multiple retrievals using PCR products with massive unspecific amplifications. Moreover, DBG-GPS shows linear decoding complexity and more than 100 times faster than the multiple alignment-based methods, indicating a suitable solution for large-scale data storage.

This would signi cantly increase the cost of DNA data storage. A cost-e cient mechanism to handle the indels is highly desirable. The second challenge is the correction of strand rearrangements. Copying and retrieval of DNA data requires polymerase chain reaction (PCR) for ampli cation of DNA strands.
However, as a common sense in molecular biology, PCR experiments result in unspeci c ampli cation occasionally even with optimized conditions. Large data storage in DNA requires enormous DNA strands of various sequence contents. PCR ampli cation with various DNA strands can result in unspeci c ampli cations even more frequently. Unspeci c ampli cation will cause massive strand re-arrangements. This is a serious problem to the data integrity, which has not been solved yet. Although previous studies have proved that perfect decoding of DNA data can be achieved with high quality PCR products, limited number of data retrievals have been performed, which is far from the industrial standards of storage media. The robustness of DNA data storage under massive multiple retrievals is still not veri ed yet. To achieve robust data retrieval, the unspeci c ampli cation problem emerged occasionally needs to be handled. The third challenge is the strand break problem. As typical biological polymers, DNA molecules are highly degradable 12,13 . Previous studies developed methods for protection of DNA molecules by silicon beads or earth alkaline salts 8, 14 , which can reduce the speed of DNA degradation signi cantly 8 .
However, degradation of DNA molecules is inevitable fundamentally, especially for long-term storage.
Thus, a new mechanism that can reconstruct the original DNA sequences from small fragmented DNA strands is highly desirable for enhancing the robustness of DNA data storage.
DNA data storage achieves data operations by DNA synthesis, PCR, and sequencing processes. These processes always generate multiple error-rich copies of DNA strands fast and cheaply. This unique 'multiple copy' feature, also termed as "natural redundancy" in previous studies, or 'data repetition' by de nition of information theory 15 , makes DNA data storage a unique channel 15,16 . Data repetition is regarded as a low e cient way of redundancy coding in traditional media and channels since additional data copies are super time-and cost-expensive. However, data repetition in the forms of DNA strand copies is cheap, fast, and basically unavoidable. Indeed, it may increase the cost signi cantly to make a single copy DNA data since additional efforts are required. Efforts on reducing the "multiple copy" feature by serial dilutions observed massive strand dropouts when reducing the average copy number under ten, making reliable data retrieval impractical 17 . Additionally, DNA molecules, as highly degradable polymers, break easily. Thus, even DNA data with single copy of each strand is obtained, the data stability is still questionable because strand breaks, which cannot be handled by any reported codes, are fatal to DNA data integrity. Taken together, a small number of strand copies, e.g. 10 to 100, which do not hamper the data density, is required for practical DNA data storage.
In this study, we report the robust data storage in DNA by an e cient strand reconstruction algorithm based on de Bruijn Graph-based Greedy Path Searching (DBG-GPS). DBG-GPS can handle all the three challenges by taking advantage of the multi-copy feature for high-e cient DNA strand reconstruction. With the optimized procedure, DBG-GPS shows high resistance to indels as well as substitutions. While correction of indels with single copy data is still challenging 18, 19 , errors (indels and/or substitutions) in a high rate of 10% can be e ciently corrected by DBG-GPS. Importantly, DBG-GPS achieves robust data retrieval with low-quality PCR products with massive strand rearrangements caused by unspeci c ampli cations, and highly degraded DNA samples with massive DNA strand breaks. Furthermore, although python scripts are utilized, DBG-GPS shows more than 100 times faster than the clustering and multiple alignment-based methods. The linear decoding complexity revealed by simulations makes DBG-GPS a suitable solution for large-scale data storage.

Results
Principles and potentials of de Bruijn graph-based decoding Due to the di culties of long DNA synthesis, large data storage in DNA has to distribute the information in massive DNA fragments. Thus, decoding of DNA data is assembly of the original data from small pieces of information carried by massive DNA fragments. This process is similar to the process of genome assembly, in which the complete genome sequences are assembled from small pieces of DNA sequences obtained from shotgun genome sequencing. However, as shown by a large number of genome sequencing projects, it is a challenge to obtain the complete genome sequences by shotgun genome sequencing solely 20 . Imperfect genome assembly is acceptable since following up experiments can be performed to improve the assembly 20 . Obviously, imperfect assembly is unacceptable for data storage applications. To decode the DNA data accurately, the extra options of indexes and error correction codes bring opportunities. De Bruijn graph theory, which is an e cient way to represent a sequence in terms of its k-mer components, has been successfully applied to genome assembly with second-generation sequencing data 20,21 . However, the potentials of de Bruijn graph theory in DNA data storage haven't been explored thus far.
To investigate the potentials of de Bruijn graph theory in DNA data storage, we rst analyzed the representations of substitutions, indels and strand breaks in de Bruijn graph as shown in Fig. 1a. The simple de Bruijn graph on the right was constructed from the four sequence copies of "AACGGCGGCTGT" in the left with a k-mer size of four ( Fig. 1a-i). The DNA sequence of "AACGGCGGCTGT" is a path, i.e. "a→b→c→d→e→c→f→g→h", in the constructed graph. A substitution of "G" to "A" in sequence #2 will lead to a branch path, as shown in red color, and reduced weights (coverages) of the correct path ( Fig. 1aii). Indels are much more di cult to handle than substitutions by traditional EC codes. However, the representations of base indels in de Bruijn graph are very similar to substitutions: both creating interfering path(s) and reducing the weights of the correct path ( Fig. 1a-iii). Although strand breaks can cause serious problems for traditional EC codes, they only cause subtle in uence on the de Bruijn graph, by reducing weights of the correct path without changing the graph structure. A greedy path search algorithm and a path selection mechanism for recognition of the correct path, could recover the DNA strand theoretically. In this way, the errors are transformed into a path search and selection problem by de Bruijn graph theory with multiple sequence copies. With this mechanism, as long as the correct path, e.g. "a→b→c→d→e→c→f→g→h" in Fig. 1a, is preserved, the corresponding DNA strand can be decoded theoretically. Thus, the path robustness indicates the potentials of de Bruijn graph theory-based decoding.
Obviously, the sequence error rates and copy number together determine the robustness of a de Bruijn graph path. To further investigate how the strand copies and error rates affect path robustness, we simulated DNA strands with different rates of errors. Variant numbers of the simulated containing sequences were utilized for de Bruijn graph construction and path integrality analysis. The integrality of the correct path was tested by checking the existence of all k-mers in this path in the constructed graph. The rate of the strands with a complete path in the graph was de ned as 'maximal strand decoding rate (S m ), which stands for the theoretical maximal strand decoding rate which can be achieved by an actual de Bruijn graph-based decoder. Firstly, we run the simulations with different types of errors separately.
With twenty strand copies, the path robustness with different types of errors in the range of 1-10% with a step length of 1% were estimated. As shown in Fig. 1b, although substitutions, indels, and strand breaks are quite different errors based on traditional information theory, they show very close S m values with the same error rates. Importantly, with merely 20 sequence copies, the de Bruijn graph path shows high robustness with an error rate of 5%, which is a common error level of DNA data storage reported 9,10,22 . In realistic situations, the errors that emerged in DNA data channel are mixtures of all types of errors. Thus, we then perform thorough simulations with an error mixture ratio of 50% substitutions, 25% insertions and 25% deletions with different rates of errors and copies of strands. Total error rates in the range of 1-10% with a step length of 0.5%, and strand copies in the range of 5 to 100 with a step length of 5 were considered. As shown in Fig. 1c, S m > 0.99 can be obtained with 60 sequence copies (red dashed line) with high rate errors of 10%, revealing the great potentials of de Bruijn graph-based decoding.

Development of a general framework for de Bruijn graphbased decoding
Based on the analysis of previous section, two major steps are required for decoding of DNA strands from multiple error-rich sequence copies utilizing de Bruijn graph theory: 1) greedy path search for possible path candidates and 2) selection of the correct path. In this section, we describe the issues emerged during the implementation of a de Bruijn graph-based decoding algorithm and strategies to solve them. A four-step framework is then proposed for effective strand reconstruction practice with de Bruijn graph. Eventually, the DBG-GPS algorithm was implemented following this framework with Python.
Elimination of low coverage k-mers decreases the greedy path search complexity A serious problem emerged is the explosion of de Bruijn graph size caused by base errors. About thirty times enlargement of the de Bruijn graph size was observed when 4% random errors were introduced into 10,000 DNA strands with 100 copies each (Fig. S2a). Greedy path search with this enlarged de Bruijn graph resulted in overwhelming number of possible paths, leading to unbearably long path search time.
The enlarged de Bruijn graph contains massive noise k-mers, i.e. k-mers introduced by base errors. To quantify the noise k-mers, we introduce an indicator , which is the ratio of noise k-mers and correct kmers in the de Bruijn graph. Clearly, more errors and strand copies will introduce more noise k-mers to the de Bruijn graph, which will result in a larger . Monte carlo simulations show a quick increase of by increasing either the error rate or the sequence copies (Fig. S2a). The number of branch paths and K n K n K n decoding time increase exponentially with the increasement of error rates (Fig. S2b) or strand copies (Fig.  S2b). Thus, a pre-processing step that can remove the noise k-mers is desirable.
Although DNA data operations can introduce massive errors, these errors are low probability events individually. Thus, the occurrences, i.e. the coverages of noise k-mers, should be signi cantly lower than the correct ones. In other words, the low coverage k-mers are more likely to be noise k-mers. Preliminary testing with the elimination of k-mers with coverage of one signi cantly reduced the decoding burden of branch paths without signi cant affections to the path integrity (Fig. S2b). To further investigate the feasibility of coverage as an indicator for distinguishing noise k-mers, we used a Poisson distribution model to simulate the copy number distribution of DNA strands. The coverage distribution of noise kmers and correct k-mers are then investigated based on Monte Carlo simulations. The simulation is performed with 10,000 DNA strands, a λ value (i.e. the expectation value) of fty for strand copies and an error rate of 4% (a common level of error rates reported in previous studies). The vast majority of the noise k-mers show coverages lower than eight, while the correct ones show coverages higher than eight (Fig. S2c). Thus, a coverage cut-off of eight can effectively remove most of the noise k-mers suggesting an effective mechanism for the elimination of noise k-mers.
Cyclic redundancy check (CRC) codes effectively recognize the correct path Elimination of low coverage k-mers can effectively remove most of the noise k-mers based on Monte Carlo simulation, the remaining noise k-mers can still have great affections on the accuracy of strand decoding. Thus, a mechanism to recognize the correct path and verify its integrality is required. Kinds of traditional EC codes can achieve this goal theoretically, e.g. CRC codes, RS code, or hash functions. Although hash functions are suitable for such tasks in principle, we failed to nd a proper hash function that can generate short key words of few bytes. Since erasure is not presented in the path selection process, RS codes, which is capable of handling erasures, is not an optimal solution hence. Here, we choose CRC codes which is powerful in error detection without unnecessary error-correction mechanism. Intensive Monte Carlo experiments show that CRC codes can recognize the correct path in all one million times of independent simulations, proving its effectiveness in the selection of correct paths, i.e. correct strands.
A both-way search strategy increases decoding speed Although elimination of noise k-mers based on coverage can remarkably reduce the greedy path search burden, the remained noise k-mers still can increase the decoding time signi cantly, especially with high error rates. To further reduce the greedy path search time, we proposed a both-way search strategy which can search for possible paths simultaneously from both ends of DNA strands. To enable this both-way search strategy, we designed a DNA strand structure with an additional 'anchor' element as shown in Fig. 2c. The anchor is an integer generated from the index by a hash-like function. This function generates one speci c number with a given index. For proof-of-concept, the pseudorandom function of Python is used for the generation of the anchors by using the indexes as seeds. Due to the feature of pseudorandom function, it always gets the same 'random' value(s) with the same seed. Thus, the greedy path search engine can calculate the anchor based on the index without additional information, enabling the both-way search strategy. Simulation shows that this both-way search strategy can reduce the decoding time signi cantly, especially when sequence error rates are high. With an error rate of 10% and a copy number of 100, the both-way search strategy shows about 60% reduction in decoding time compared with one-way search (Fig. S3). No signi cant difference in decoding accuracy is observed with the two search strategies.
Four-step framework for effective strand reconstruction practice with DBG-GPS algorithm For effective strand reconstruction with DBG-GPS, we proposed a four-step framework as illustrated in Fig. 2a: Step 1, Construction of de Bruijn Graph. Here, the construction of de Bruijn Graph refers to the process of counting of k-mers. In the current python implementation, a simple hash dictionary data structure was applied. For further studies with a more advanced data structure of the de Bruijn Graph, we suggest creating the advanced data structure after the elimination of noise k-mers (Step 2).
Step 2, Elimination of noise k-mers. In the present study, we use coverage as an indicator for the elimination of noise k-mers. Additional codes for the identi cation of noise k-mers can also be introduced to make the elimination more e cient in future studies.
Step 3, Greedy path search. If the error rate is still high, a bothway search strategy can be applied to reduce the decoding time.
Step 4, Selection of the correct path. The paths found are translated into bit strings and the embed EC codes are then used for the selection of the correct path. Only in case of one path passes the EC code checking, this path will be selected and strand decoding succeed. Following this framework, we implemented the DBG-GPS algorithm with Python. Reliable strand reconstruction (S r > 96.7%) was achieved with various error rates range from 1-10% with sixty strand copies as proved by Monte-Carlo simulations (Tab. S2).
Integration of DBG-GPS with outer erasure codes DBG-GPS is designed as an inner decoding mechanism for the error-free reconstruction of DNA strands. It can be easily combined with erasure codes, e.g. fountain codes or RS codes, for large-scale data storage. Integration of DBG-GPS with erasure codes is quite simple. Simply add CRC codes for strand integrity veri cation during strand encoding, and use the DBG-GPS as an inner decoding process for the decoding of DNA strands. To use other kinds of codes for strand veri cation with DBG-GPS, just replace the CRC codes with speci c codes and switch to the corresponding checking method during strand selection. Since the encoding process is highly similar to the previous studies, we only illustrated the overall decoding process using DBG-GPS as the inner decoding method as shown in Fig. 2c. There are three major steps in the decoding process.
Step 1, High-throughput sequencing. The DNA strands are sequenced by high-throughput sequencing technologies.
Step 2, Inner code decoding. The aim of this step is to decode DNA strands as many as possible. DBG-GPS is a complete solution for Step 2.
Step 3, Decode the original information by outer erasure codes. A python implementation using fountain codes as outer codes and DBG-GPS as inner codes is available at https://switchcodes.coding.net/public/switch-codes/DNA-Fountain-De-Bruijn-Decoding/git/ les.

Decoding speed of DBG-GPS
To the best of our knowledge, the clustering and multiple alignment-based (CL-MA) methods are the only methods that can rebuild the DNA strands from multiple error-rich strand copies 10,23,24 . To compare the decoding speed of DBG-GPS with CL-MA methods, we produced 10,000 DNA droplets with DNA fountain codes using CRC codes as redundancy codes for strand selection. For each DNA droplet, we generated fty sequence copies with 4% random errors introduced (2% substitutions, 1% insertions and 1% deletions). The 500,000 error-rich reads generated were then utilized for strand decoding by DBG-GPS and CL-MA respectively. Since the clustering process is quite complicated in reported studies, for the CL-MAbased decoding, we feed the program with grouped reads. For decoding speed comparison, a clustering time was estimated based on the recent study by Cyrus et al. 23 . Since the multiple alignment process was performed by a standalone program (Muscle 25 ), there will be 10,000 times of IO (input and output) with the le system during strand decoding by CL-MA. To estimate the IO time, one sequence of the DNA droplet was repetitively submitted to Muscle for multiple alignment analysis 10,000 times. This process is iterated three times and the average time is subtracted in the CL-MA based decoding speed test. The kmer counting step of DBG-GPS was performed with JellyFish. As shown in Table 1, although a Python implementation of DBG-GPS was utilized in the decoding speed test, DBG-GPS still shows 103 times faster than the CL-MA based methods without signi cant differences in decoding accuracy. A C + + implementation of DBG-GPS is expected to further enhance its performance remarkably. Importantly, DBG-GPS shows linear decoding complexity while increasing the total strand scale from 10 to 10,000 (Fig. S5), making it a suitable solution for large-scale data storage in DNA.

Experimental veri cation of the robustness of DBG-GPS
A 314 KB le compressed from two text les and four jpeg pictures, was encoded into 12,000 DNA strands in the length of 200 bp using the DNA fountain codes implemented in this study. Each oligo encodes thirty bytes of data, three bytes of CRC codes, four bytes of index, and four bytes of anchor codes. The overall strand redundancy is 10.5% enabling reliable data retrieval if more than 92% of the strands could be decoded. The designed DNA strands were synthesized by Twist Bioscience. The single strand DNAs (ssDNA) obtained from Twist Bioscience were dissolved in ddH 2 O. 0.6µl of the dissolved ssDNA was used as template for PCR ampli cation and data retrieval by sequencing. To verify the robustness of DBG-GPS with different types of errors, i.e. the base errors (indels and substitutions), strand rearrangements, and strand breaks, we performed three harsh experimental tests, as shown in Fig. 3a, b, and c correspondingly.

Reliable data recovery with deep error-prone PCR products
To verify the robustness of DBG-GPS with high rates of base errors including indels and substitutions, we utilized error-prone PCR (ePCR) for ampli cation of the ssDNA library obtained from Twist Bioscience. Six series of ePCR were performed using an error-prone PCR Kit from TIANDZ. All ePCR experiments were performed for 30 thermal cycles with the highest error-rate settings instructed by the Kit manual. Accordingly, these settings could introduce eight errors per 1,000 bp per ampli cation cycle. The rst round ePCR utilized 0.6 µl samples of the ssDNA master pool as templates. The ve following ePCR experiments use 1 µl products of the previous round ePCR as templates. The six ePCR samples obtained, i.e. ePCR#1, ePCR#2, ePCR#3, ePCR#4, ePCR#5, and ePCR#6, were sequenced on Illumina HiSeq platform. Due to unspeci c ampli cation, PCR products of ePCR#2, ePCR#3, ePCR#4, ePCR#5 and ePCR#6 show severe smear band in the range of 200 bp to 3,000 bp which cannot be sequenced directly by Illumina platform. Thus, the PCR products were digested with a target length of ~ 200 bp using DNase I. After digestion, all digested fragments are collected for sequencing and DBG-GPS based decoding.
As shown in Fig. 4a, the estimated S m values of the six samples are all very high (> 99%). It was surprising to notice that the S m values of ePCR#1, ePCR#2, and ePCR#3 were estimated to be 100%, indicating perfect reservation of all strand paths even after 90 cycles of ePCR. We then apply DBG-GPS to recover the encoded DNA strands with the six ePCR samples. As expected, with high S m values, high strand recovery rates (S r ), between 93.2-99.8%, were obtained, especially for ePCR#1, ePCR#2, and ePCR#3. However, signi cant decerase trend of S r was observed with more error-prone PCR cyles were introduced. Considering the S m values of ePCR#5 and ePCR#6 are still very high, the decrease of S r is mainly caused by massive base errors introduced. Although the S r value of ePCR#6 (93.2%) is relatively lower than the other samples, with the 10.5% strand redundancy codes, we can recover the encoded data correctly by the outer fountain codes. Interestingly, the maximal decoding rate (S m ) of ePCR#6 was estimated to be up to 99.2%, which was just slightly decreased. In future studies, incorporation of advanced codes designed for de Bruijn graph-based decoding could further enhance the strand recovery rate up to 99.2%.
To further prove the robustness of DBG-GPS, we perform random sub-sampling experiments on the sequencing results of ePCR#1 and ePCR#2 with different coverages, i.e. reading copies, range from 1 to 256. The k-mer dropout rates, , noise indicator and S r in different coverages were calculated corresponding. For each coverage tested, 12 independent random sub-sampling experiments were performed and the average values of estimated , and S r are presented. As shown in Fig. 5a, as expected, the values of ePCR#2 are higher than ePCR#1, indicating high rates of errors were introduced to ePCR#2. The differences of the two samples become more and more signi cant when increasing the coverages. The value of ePCR#2 is higher than the value of ePCR#1. The values of the two samples dropped quickly while increasing the coverages from 1 to 256. As shown in Fig. 5b, the two samples can be reliably decoded at low coverages (coverage 16 for ePCR#1, coverage 24 for ePCR#2).
Massive parallel retrieval test with unspeci c ampli cation PCR products For practical data storage, massive multiple retrievals to test the data robustness is required. However, only a limited number of retrievals is tested for DNA data storage yet. To achieve reliable data storage with multiple retrievals, the occasionally emerged unspeci c ampli cation problem of PCR needs to be handled. As shown by the deep error-prone PCR experiments, DBG-GPS based decoding can work with PCR with products with unspeci c ampli cations. To further prove its robustness in handling unspeci c ampli cation, as well as the robustness of DNA data storage under multiple reading, we performed 100 times of parallel data retrievals with PCR products with serious intended unspeci c ampli cations. As shown in Fig. 3, we rst performed one round of PCR ampli cation with Taq DNA polymerase with a total volume of 100 µl. Then, 100 parallel PCR ampli cation was performed using 1 µl of the rst round PCR reaction mixtures without puri cation. The puri cation step is removed in purpose of enhancing unspeci c ampli cation. The 100 PCR samples were then sequenced by Illumina sequencing platform and the sequencing results are committed for data decoding with DBG-GPS. Although all PCR products show serious unspeci c ampli cations as expected (Fig. 6a) Reliable data recovery with seriously degraded DNA samples Different from the previously reported codec systems, which require complete DNA strands for decoding, the decoding process of DBG-GPS utilizes very short k-mer fragments. Thus, DBG-GPS can be utilized for data decoding with degraded DNA samples, of which massive truncated DNA strands caused by strand breaks. To verify the capability of DBG-GPS with degraded DNA samples, we performed accelerated aging experiments as shown in Fig. 3c. We rst ampli ed the master pool with Taq DNA polymerase. Two liquid samples of 50 µl of the ampli ed Taq PCR products were accelerated aged at 70℃ for one and two weeks respectively. The accelerated aged samples were sequenced with Illumina sequencing platform. As shown by the analysis results of Agilent 2100 Bioanalyzer (Fig. 7b and c), the two DNA samples show serious degradation. Such degree of degradation would totally destroy the encoded data with previous reported codec systems. Dramatically, with DBG-GPS based decoding, accurate data decoding was achieved on both samples with high quality (> 99% strand recovery, Fig. 7d). It worth mention that the DNA sample that accelerated aged at 70℃ for two weeks shows no obvious peak at the 200bp, indicating most of DNA strands are corrupted. The high-quality decoding with the two samples well proved the capacity of DBG-GPS in handling DNA degradation.

Exploration of DBG-GPS's capacity by large-scale simulations
To further test DBG-GPS's performance with large-scale data sets, large-scale simulations were also performed. An Ubuntu server 20.10 installation disk image (~ 1 GB) was used as the input le for the generation of DNA droplets. Each DNA droplet was set to carry 60 bytes of information with a total length of 316 bp, around the technical limits of current DNA synthesis technologies. We rst tried with ten million DNA droplets, representing a data scale of 600 megabytes (MB). With an expected copy number of 50, error rate of 4% (approximately 2% substitutions, 1% insertions, 1% deletions), a 155 GB sequence le was generated. However, although a computation workstation equipped with 640 GB memories is employed, the k-mer counting step by Jelly sh encountered memory limitations. We then reduced the DNA droplet number to ve million, representing a data size of 300 MB. Preliminary decoding simulations on this data scale reveal a problem of long decoding time. Detailed analysis shows that this is due to the entangled DNA strands caused by repeated presentation of k-mers in different strands. To solve this problem, we propose two strategies. The rst strategy is applying a ltering process during the production of DNA droplets to lter out the seriously entangled DNA strands. The second strategy is setting a path number boundary for greedy path search. If the greedy path search engine nds more paths than this value, it aborts search. This will reduce the strand decoding rate theoretically. We run the simulations with 5 million DNA droplets three times independently. On average 4,971,880 DNA droplets (recovery rate = 99.44%) were successfully recovered.

Conclusions
DNA data channel is unique with massive 'data repetition' in the forms of multiple error-rich strands. E cient mechanism that can utilize this data 'repetition' feature for robust data storage is still missing due to the challenge of handling the massive errors, especially indels. Furthermore, the PCR-based strand ampli cation and storage of DNA molecules can lead to critical issues of strand rearrangement and strand break. In this study, we developed an effective mechanism, named 'DBG-GPS', that can well handle these problems by e cient strand reconstruction from multiple, error-rich sequences based on de Bruijn graph. DBG-GPS is applicable for all kinds of molecular data storage using any kind of polymers, e.g. DNA data storage with extended alphabets of non-natural nucleotides. Unlike previous studies, DBG-GPS can directly construct the DNA strands without computation-intensive clustering and multi-alignment process which cannot scale up well. Strategies of low coverage k-mers elimination, CRC codes, and bothway search were introduced and solved the decoding challenges caused by base errors. While traditional EC show di culties in handling indels, DBG-GPS can e ciently correct high rates (10%) errors of indels, as well as substitutions. More errors can be corrected by incorporating more sequences for decoding. Importantly, by the strand reconstruction mechanism with short k-mers, DBG-GPS achieves reliable data recovery with unspeci c ampli cation PCR products and highly degraded DNA samples. These achievements greatly enhanced the robustness of DNA data storage. This new mechanism, in combination with the previous reported methods of protecting DNAs by silicon beads 8 , nanoparticles 26 , and alkaline salts 14 , could easily enable robust data storage in DNA for over ten thousands of years even in room temperature. Furthermore, despite Python scripts were utilized in this study, DBG-GPS shows more than 100 times faster than the CL-MA based methods. The linear decoding complexity revealed by simulations makes DBG-GPS a potential solution for large-scale DNA data storage. Although DBG-GPS based decoding shows great potentials in DNA data storage, the pure python implementation, is far from optimal. The simple hash dictionary-based data structure for k-mer retrieval is memory intensive. Although no theoretical limitation on the data scale is shown with DBG-GPS, we failed to simulate with 500MB data on a computational workstation with 640 GB memories installed due to memory limitation. C + + implementation with a more advanced k-mer data structure is expected to signi cantly reduce the memory cost, as well as decoding time 27,28 . Further studies on the design of error correction codes that can incorporate with the noise k-mers elimination can further boost the capacity of DBG-GPS.

Methods
Library design, PCR and Sequencing A DNA library of 12,000 oligonucleotides in length of 200 nt was synthesized by Twist Biosciences. Errorprone PCR was performed with Controlled Error-prone PCR Kit, TIANDZ, CAT#:160903-100. Thermo cycle parameters were 94℃ for 3 mins, 94℃ for 1min, 45℃ for 1min, 72℃ for 1min, 30 cycles. Six series error-prone PCR were performed to introduce high rates of errors. The rst round ePCR utilized 0.6 µl samples of the master pool as templates. The ve following ePCR use 1 µl products of the previous round ePCR as templates. All PCR experiments were performed with the following primers: P1-'CCTGCAGAGTAGCATGTC', P2-'CGGATGCATCAGTGTCAG'. The PCR products were sequenced by Tianjin Novogene Sequencing Center & Clinical Lab. Sequencing libraries were generated with puri ed PCR products using Illumina TruSeq DNA PCR-Free Library Preparation Kit (Illumina, USA) following manufacturer's recommendations and index codes were added. For the products of the second-round error-prone PCR with very large DNA strands, DNase digestion is applied to break the long strands into small fragments around 200~300 bp. After digestion, all fragments are collected for further library construction using the protocol as just described. The library quality was assessed on the Qubit@ 2.0 Fluorometer (Thermo Scienti c) and Agilent Bioanalyzer 2100 system. All the libraries were sequenced on Illumina HiSeq platform and 150 bp paired-end reads were generated.

k-mers counting and elimination of low coverage k-mers
The python implementation provided in this study used simple hash dictionary structure for counting of k-mers. De Bruijn graph is quite memory expensive. which is not memory e cient. Thus, although the scripts implemented with functions to read the common nucleotide sequence les, i.e. FastQ or FastA, and count the k-mers accordingly, for large data sets, we strongly suggest to use a professional k-mer counting system, e.g. Jelly sh 27 , which is much more memory e cient and multi-core supporting. To count k-mers by the python scripts, please call the following functions of DeBruijnGraph() object: for reading and counting of FastQ le, use DeBruijnGraph-openFastQ(' le name'); for reading and counting of Fasta le, use DeBruijnGraph-openFasta(' le name').
For elimination of low coverage k-mers, call function DeBruijnGraph-remove_low_cov_kmers(n), then all k-mers with coverages lower than n will be removed. For usage of Jelly sh for counting of k-mers and dumping of k-mers satisfying the criteria, please refer to the Jelly sh manual, which is available at (http://www.cbcb.umd.edu/software/jelly sh/). The dumped k-mers can be introduced to the DeBruijnGraph() object using DeBruijnGraph-openDump(' le name') function for further decoding test.

Choice of k-mer size
Obviously, with speci c k-mer size, the data scale can be decoded is limited. For example, it is clearly that only very small amount of data can be decoded with an extremely short k-mer size of three. However, the mathematical answer to this question is yet unsolved. In de Bruijn Graph theory, for each k-mer, the front k-1 bases were used for positioning (i.e. indexing), and the one base at the terminal was used for extension of path, i.e. for encoding of data. Thus, we estimate the decoding capacity ( ) of k-mers with a speci c size of as follows: see formula 1 in the supplementary les.
Where is the combination number of possible indexes with a length of , 2 bits stands for the encoding capacity of one base at the tail. Based on this formula, the decoding capacities of different k-mer sizes were calculated and listed in Tab. S3. As shown in the table, the decoding capacity is estimated to be around 57.6 PB with a k-mer size of 30, and 6×10 5 EB for k-mer size of 40, indicating a capable method for large-scale data decoding. Additionally, it should be noticed that, due to the repeated presentations of identical k-mers in different DNA strands, the actual data scale can be higher than this estimation. However, to avoid decoding troubles with entangled DNA strands by repeated k-mers, we strongly recommend to use a k-mer size with a decoding capacity at least ten times larger than the actually encoded data.

Details of greedy path search
The greedy search process was illustrated in Fig. S4. The details are described as follows: Step 1, Select an index, e.g. 562451, and calculate the anchor value by the Python pseudorandom function using the index value as seed in case of both-way search mode. An  Simpli ed error analysis in concept of de Bruijn graph theory Although distribution pro les of error rate and copy number among different strands shapes the DNA data quality, they cannot re ect the data quality straightforwardly. Additionally, the sequence error rate is complicated for accurate accounting. For better estimation of the qualities of DNA data and simpli cation of the error analysis process in concept of de Bruijn graph theory, here we introduce two indicators, i.e. (in range of 0 to 1) and (in range of one to 4 k ), which together can well de ne the data quality. The stands for dropout rate of all correct k-mers and is the ratio of noise k-mers and correct kmers. DNA data with a close to 1 and a close to 0 stands for a high-quality data with high data coverage rate and low noises. Other than error rate which is complicated for accounting, and can be easily calculated by the following formulas: see formulas 2 and 3 in the supplementary les.
In formula 1, stands for the number of k-mers in the original encoded sequences and stands for the number of k-mers that presented in the original encoded sequences but not presented in the sequencing results. In formula 2, stands for the number of k-mers in the sequencing results and stands for the number of correct k-mers, i.e. k-mers that presented in the original encoded sequences. DNA data with a close to 1 and a close to 0 stands for a high-quality data with high data integrity and low noises. The maximal strand decoding rate is determined by . With speci c , the actual strand decoding rate is affected by the value of . The greedy path search becomes extremely slow when decoding DNA data with higher values, due to too many possible paths for each strand. The DBG-GPS will stop decoding if the path number exceed speci c threshold. In this proof-of-concept study, we use 100 for fast decoding, 1,000 for high-quality decoding.

Details of the implementation of fountain codes
The outer fountain codes in Python is modi ed from https://github.com/dbieber/fountaincode. DNAFountain and DNADroplet objects were implemented for handling of DNA strings. The assignment mechanism of the random degrees and chunk indexes of the Droplet object were modi ed to support DBG-GPS based decoding. Luby transform is applied for generation of random degrees. Speci c index encoded in the DNA Droplet is used for selection of a degree from a pre-generated degree table. This index is also used as seed for generation of the random strand combination. The decoding process of the original Glass object is improved.

Details of simulation experiments
Simulations for investigation of the potentials of de Bruijn graph were performed with 10,000 DNA strands generated by the fountain codes with a strand length of 320 bp. For de Bruijn graph construction, a k-mer size of 18 was applied supporting a data scale of ~4 GB theoretically.
For the large-scale simulations, an Ubuntu Server 20.10 live install disk image (MD5: f51594d36008c456cf0360aeadd130d6, size: 1 GB) was used as input le for generation of random DNA droplets. Each DNA droplet was set to encode 60 bytes of data and two bytes of CRC codes. The length of index and anchor are both set to four bytes. Primers of P1-'CCTGCAGAGTAGCATGTC' and P2-'CGGATGCATCAGTGTCAG' were added to the ends. This will result with a DNA droplet length of 300 bp, around the theoretical limits of current DNA synthesis technologies. To avoid of entangled DNA strands, a lter process that always discard the DNA droplets contains a k-1 mer that already presents more than four times was applied to the droplet generating stage. For counting of k-mers by Jelly sh, a k-mer size of 24, and k-mer coverage cut-off of ve, i.e. all k-mers with a coverage less than ve is discarded, was applied.

Declarations Data Availability
All the algorithms proposed in this study were implemented with Python and the source code is available at https://switch-codes.coding.net/p/switch-codes/d/DNA-Fountain-De-Bruijn-Decoding/git/tree/ForPublication. Tab. 1 Decoding speed of DBG-GPS in comparison with clustering and multi-alignment (CL-MA) based method. The simulations were performed with 10,000 DNA strands in length of 300 bp. For each strand, fifty strand copies with an error rate of 4% were generated and utilized for decoding. The time of clustering stage in CL-MA based decoding was estimated based on a recent study by Cyrus Rashtchian et al. 23 The consensus calling stage used majority voting algorithm same as the recent study by Antkowiak, P.L. et al. 24   iii, De Bruijn graph with one insertion and one deletion on sequence #2 and #3 respectively; iv, De Bruijn graph with four strand breaks. b, Estimated maximal strand decoding rates (Sm) of different types and rates of errors with 20 sequence copies; Random errors were introduced with speci c type and rate of errors from 1% to 10% with a step size of 1%. The error-rich sequences generated were then used for construction of the de Bruijn graph. The integrity of the correct paths was checked. This process is iterated for ve times and the average Sm values are shown. The standard derivations, which are too small to be shown, are detailed in Tab. S1. c, Sm values estimated with different sequence copies and rates of mixture errors (50% substitutions, 25% insertions and 25% deletions). Sequence copies in range of 1 to 100 with a step size of 5, mixture error rates in range of 1% to 10% with a step length of 1% were considered. The maximal strand recovery rates (Sm) are the rates of DNA strands with complete paths in the graph during simulations. A Sm value > 99% was obtained with merely sixty sequence copies as shown by the dashed red line.

Figure 3
Experimental veri cation of the robustness of DBG-GPS. a, Procedure of the series deep error-prone PCR experiments. b, Multiple retrieval test with unspeci c ampli cation PCR products. c, Accelerated aging experiments. These three experiments were performed with focus of indels, strand rearrangements and strand breaks respectively. Although super harsh conditions were applied, we could decode the original data correctly in all cases presented.

Figure 4
Reliable data recovery with deep error-prone PCR products. a, Agarose electrophoresis analysis of the series error-prone PCR (ePCR) products. The rst-round ePCR (#1) using 0.6 µl sample of master pool as templates. The following series ePCR used 1 µl of previous round PCR products as templates. The six Page 23/25 ePCR were performed with highest error rate setting of 8 errors per 1,000 bp for each ampli cation cycle.
Note, the length of initially synthesized ssDNA is 200bp. b, Reliable strand reconstruction with low quality ePCR products. The dashed red line stands for the decoding lower boundary of the outer fountain codes in this case study. After 180 cycles (30 cycles × 6 = 180 cycles) of error-prone ampli cations, the data still can be correctly decoded using DBG-GPS.   Reliable data retrieval with accelerated aged DNA samples. a, b, and c are DNA integrity analysis results of DNA samples without accelerated aging, accelerated aged for one week, and accelerated aged for two weeks respectively. d, strand recovery results of the three samples. Although the integrity of the DNA strands was damaged seriously after accelerated aged at 70℃ for two weeks as shown in gure c, high strand recovery rate (Sr > 99%) was achieved by DBG-GPS based decoding. Indeed, no signi cant differences could be observed between the three samples, suggesting a robust decoding mechanism against DNA degradation.