On a greedy approach for genome scaffolding

Background Scaffolding is a bioinformatics problem aimed at completing the contig assembly process by determining the relative position and orientation of these contigs. It can be seen as a paths and cycles cover problem of a particular graph called the “scaffold graph”. Results We provide some NP-hardness and inapproximability results on this problem. We also adapt a greedy approximation algorithm on complete graphs so that it works on a special class aiming to be close to real instances. The described algorithm is the first polynomial-time approximation algorithm designed for this problem on non-complete graphs. Conclusion Tests on a set of simulated instances show that our algorithm provides better results than the version on complete graphs.


Motivation
In this paper, we focus on a bioinformatics problem occurring in the production of genomes. Genomes are usually obtained by sequencing. Sequencing produces a large number of small sequences of nucleotides called reads. Their lengths range from hundreds to tens of thousands of characters, depending on the sequencing technology. As a rule of thumb, shorter reads, produced for example by second generation sequencing (Illumina), have a higher quality (contain fewer read errors) than long reads produced by third generation sequencing technologies (PacBio or Oxford Nanopore) [1]. The assembly process exploits overlaps between reads to reconstruct the targeted sequence. However, this is complicated by repeated parts in real-world genomes. Assembly algorithms cannot uniquely infer the original genome if it contains such repeated regions (the longer the repeated region with respect to the read length, the harder it is to infer the original genome). To avoid misassembly, such algorithms reconstruct only parts of the genome, which are then returned as a set of "contiguous regions" (or contigs). A genome fragmented in this way is not ideal for further processing, and one would like to have as few contigs as possible while avoiding misassembly. One way to approach this is a hybrid strategy using both long and short reads [2]. However, many genomes in current databases were assembled before the development of third generation sequencing, preventing such hybrid strategies. One way to reduce the fragmentation of genomes in these databases while avoiding costly re-sequencing is the exploitation of "meta-information" about the available reads.

Genome scaffolding
In second generation sequencing, short reads come in pairs, indicating that a fragment of the DNA molecule exists whose ends correspond to the two reads of a pair. In particular, the total length of said fragment is known approximately. This pairing information can be used to infer the order (and orientation) of the given contigs on the chromosome, thus completing the genome (modulo possible gaps between the contigs). The mathematical problem modeling this inference, called scaffolding, is made complicated by possible inconsistencies in the pairing information. See [3] for a recent overview of models, variants, and methods in this context.

Open Access
Algorithms for Molecular Biology

The problem we study here is an optimization problem in a special graph called the scaffold graph. The present formulation uses both pairing information and some genomic structural constraints, such as a fixed number of linear and circular chromosomes. In [4], we presented preliminary results about the complexity of this problem and a first polynomial-time approximation on complete graphs. Those results were extended and completed by another polynomial-time algorithm [5] and by a randomized approach [6]. We also explored exact algorithms [7] and studied some sparse special cases of scaffold graphs [8]. The present paper continues published works [9,10], in which special classes of graphs were studied (from sparse to very dense). Since real instances are usually sparse but contain some dense regions, due to the abundance of repeats [11], we are interested in graphs built from cliques that are separated by bridges (i.e. edges whose removal disconnects the graph). The main contribution is the extension of the approximation algorithm on complete graphs of Chateau and Giroudeau [5] to a particular class called "connected cluster graphs". Ultimately, the objective is to adapt the algorithm to sparse classes of graphs. For the approximation algorithm to run in polynomial time, the decision version of scaffolding must be solvable in polynomial time on the considered class. We give a negative result (NP-completeness) for a particular sparse graph class.
Finally, since the presented approximation has a polynomial approximation ratio in some particular cases, we show that the scaffolding problem cannot be approximated with a ratio better than a polynomial function in such cases.

Organization of the paper
The next section is devoted to notation and the description of the scaffolding problem. In "Computational hardness" section, we show an NP-hardness result for sparse scaffold graphs. In "Non-approximability" section, we address inapproximability. "Feasibility function for connected cluster graphs" section is devoted to a greedy algorithm for a special class of graphs called connected cluster graphs. Finally, we provide experimental results for the greedy algorithm.

Graph definitions
For a graph $G$, we denote by $V(G)$ and $E(G)$ the set of vertices and edges of $G$, respectively. For a vertex $u$ of $G$, the degree $d(u)$ of $u$ is the number of edges incident with $u$. The girth $g(G)$ of $G$ is the length of a shortest cycle of $G$. A graph is bipartite if its vertices can be partitioned into two sets, each consisting of pairwise non-adjacent vertices. A graph is planar if it can be drawn in the two-dimensional plane without crossing edges.
A matching $M^* \subseteq E(G)$ of $G$ is a set of pairwise non-adjacent edges. $M^*$ is called perfect if it touches all vertices of $G$. For a vertex $u$, we let $M^*(u)$ denote the unique vertex $v$ (if it exists) such that $uv \in M^*$. In a scaffold graph, vertices represent extremities of contigs. Given a matching $M^*$, the matching edges represent contigs and edges outside the matching represent possible contiguity relationships between contigs. The confidence that two contigs (more precisely, contig extremities) occur consecutively in the genomic sequence is represented by a weight on edges outside the matching. An alternating path (resp. alternating cycle) is a path (resp. cycle) whose edges alternately belong to $M^*$ and to its complement. The extremal edges of an alternating path must be in $M^*$.
A clique of $G$ is a set of pairwise adjacent vertices. A bridge (resp. cut vertex) of $G$ is an edge (resp. vertex) whose deletion increases the number of connected components of $G$ by one. In "Feasibility function for connected cluster graphs" section, we study a particular class of graphs called connected cluster graphs, defined as follows.
Definition 1 A connected cluster graph $G$ is a graph that admits a decomposition of its edges $E(G) = E' \cup B$ such that the subgraph induced by $E'$ is a disjoint union of cliques and each edge $e \in B$ is a bridge of $G$. An example of a connected cluster graph is given in Fig. 1.
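Definition 1 can be checked directly on small graphs. The following is a minimal sketch in Python, not part of the original algorithm; the helper names `reachable` and `is_connected_cluster_graph` are hypothetical. It assumes that B can be taken to be the set of all bridges (a bridge never lies inside a clique on three or more vertices, so this choice should lose no generality) and then tests that every remaining component is a clique. Bridges are found by brute force, so the sketch is meant for illustration rather than efficiency.

```python
def reachable(adj, start, banned=None):
    """Vertices reachable from `start`, optionally ignoring one edge `banned` (a 2-element set)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if banned is not None and {u, v} == banned:
                continue
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_connected_cluster_graph(vertices, edges):
    """Check Definition 1 with B := all bridges (assumption stated in the lead-in)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    if not adj or len(reachable(adj, next(iter(adj)))) != len(adj):
        return False  # empty or disconnected
    # An edge uv is a bridge iff v is unreachable from u once uv is ignored.
    bridges = [(u, v) for u, v in edges if v not in reachable(adj, u, banned={u, v})]
    for u, v in bridges:
        adj[u].discard(v)
        adj[v].discard(u)
    # Every component left after removing the bridges must be a clique.
    seen = set()
    for v in adj:
        if v in seen:
            continue
        comp = reachable(adj, v)
        seen |= comp
        if any(len(adj[w]) != len(comp) - 1 for w in comp):
            return False
    return True


# Two triangles joined by the bridge (3, 4).
two_triangles = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
print(is_connected_cluster_graph(range(1, 7), two_triangles))
```

Under this reading, a cycle on four vertices is rejected (it has no bridge and is not a clique), whereas any tree is accepted with an empty set of clique edges.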

Scaffolding problem
A scaffold graph $(G^*, M^*, \omega)$ is a simple, loopless graph $G^*$ with a perfect matching $M^*$ and a weight function $\omega$ on the non-matching edges. The matching $M^*$ represents the set of contigs and, for an edge $uv$, $\omega(uv)$ indicates the confidence that the contig extremity $v$ follows the contig extremity $u$ in the genomic sequence. The alternating girth of a scaffold graph, denoted by $g^*(G^*)$, is the number of matching edges in a smallest alternating cycle of $(G^*, M^*, \omega)$. In this paper, we study a decision version and an optimization version of scaffolding, defined as follows.
Let $S$ be a collection of $p$ alternating paths and $c$ alternating cycles. We call the number $p + c$ the cardinality of $S$, and we let $\sigma_p(S) := p$ and $\sigma_c(S) := c$.

Greedy algorithm
The main contribution of this paper is an extension of a known polynomial-time 3-approximation [5] to connected cluster graphs. Whereas the original algorithm was developed to work on complete graphs, it can be adapted for the general case, as shown in Algorithm 1.
The two integers $\sigma_p$ and $\sigma_c$ are used to model restrictions on the sought genomic structure by representing the number of linear and circular chromosomes, respectively.
The idea of this greedy algorithm is to consider each non-matching edge in decreasing order of weight and add it to a partial solution, if possible. The key instruction is the feasibility function: given a partial solution $S$ and an edge $e$, this function indicates whether $S \cup \{e\}$ can still be extended to a feasible solution. The solution $S$ given as input to the feasibility function is called the initiating solution. In general, since scaffolding is NP-complete, feasibility cannot be decided in polynomial time, even if $S = \emptyset$ (unless P = NP). Thus, we focus on restricted classes of graphs. In [5], a constant-time feasibility function was developed for complete graphs, leading to the following result.
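The greedy scheme described above can be sketched as follows. This is only an illustration under stated assumptions, not Algorithm 1 itself: the function names are hypothetical, a solution is assumed to consist of exactly $\sigma_p$ alternating paths and $\sigma_c$ alternating cycles covering all matching edges, and the feasibility function is replaced by an exponential brute-force search that is only usable on tiny instances.

```python
from itertools import combinations


def count_paths_cycles(matching, chosen):
    """Return (paths, cycles) of M* plus the chosen non-matching edges,
    or None if two chosen edges share a vertex."""
    parent = {}
    for u, v in matching:
        parent[u] = u
        parent[v] = u  # both extremities of a contig start in one component

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    used, cycles = set(), 0
    for u, v in chosen:
        if u in used or v in used:
            return None  # a vertex would get two non-matching edges
        used.update((u, v))
        ru, rv = find(u), find(v)
        if ru == rv:
            cycles += 1  # the edge closes an alternating cycle
        else:
            parent[ru] = rv  # the edge joins two alternating paths
    return len(matching) - len(chosen), cycles


def brute_force_feasible(matching, candidates, partial, sigma_p, sigma_c):
    """Can `partial` be extended with candidate edges to sigma_p paths and sigma_c cycles?"""
    rest = [e for e in candidates if e not in partial]
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            if count_paths_cycles(matching, list(partial) + list(extra)) == (sigma_p, sigma_c):
                return True
    return False


def greedy_scaffolding(matching, weights, sigma_p, sigma_c, feasible=brute_force_feasible):
    """Scan non-matching edges by decreasing weight; keep an edge if feasibility is preserved."""
    solution = []
    for e in sorted(weights, key=weights.get, reverse=True):
        if feasible(matching, list(weights), solution + [e], sigma_p, sigma_c):
            solution.append(e)
    return solution


# A 2 x 3 grid: contigs x_i y_i, one linear chromosome requested.
matching = [("x1", "y1"), ("x2", "y2"), ("x3", "y3")]
weights = {("x1", "x2"): 3, ("y1", "y2"): 2, ("x2", "x3"): 2, ("y2", "y3"): 0}
print(greedy_scaffolding(matching, weights, sigma_p=1, sigma_c=0))
```

On this toy instance the sketch keeps the edges x1x2 and y2y3 (total weight 3), although the edges y1y2 and x2x3 form a solution of weight 4, which illustrates why only an approximation guarantee can be expected from the greedy strategy.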

Theorem 1 ([5])
In complete graphs, Algorithm 1 gives a solution with an approximation factor of 3.
In "Feasibility function for connected cluster graphs" section, we develop a feasibility function for connected cluster graphs and show that Algorithm 1 gives a 5-approximate solution in this case. Notice that, on graph classes containing the 2 × k grids, the worst-case approximation factor of the greedy algorithm cannot be better than polynomial, even if a polynomial-time feasibility function exists (see Fig. 2).
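The grid family mentioned above can be reproduced numerically. The following sketch builds the two solutions described in the caption of Fig. 2 for a given $k$ and compares their weights; the function names and the string labels of the vertices are hypothetical, and the sketch only evaluates the two fixed solutions rather than running the greedy algorithm.

```python
def grid_fig2(k):
    """Edge sets of the 2 x k grid described in the caption of Fig. 2 (indices start at 1)."""
    matching = [(f"x{i}", f"y{i}") for i in range(1, k + 1)]
    # S: the solution forced once x1 x2 is chosen (top row for odd l, bottom row for even l).
    forced = [(f"x{l}", f"x{l + 1}") if l % 2 == 1 else (f"y{l}", f"y{l + 1}")
              for l in range(1, k)]
    # S_opt: all remaining non-matching edges.
    optimal = [(f"y{l}", f"y{l + 1}") if l % 2 == 1 else (f"x{l}", f"x{l + 1}")
               for l in range(1, k)]
    weights = {e: 1 for e in optimal}
    weights.update({e: 0 for e in forced})
    weights[("x1", "x2")] = 1  # the edge picked first by the greedy algorithm
    return matching, forced, optimal, weights


def is_single_path(matching, chosen):
    """True if M* plus `chosen` is exactly one alternating path covering all contigs."""
    mate, link = {}, {}
    for u, v in matching:
        mate[u], mate[v] = v, u
    for u, v in chosen:
        if u in link or v in link:
            return False  # two non-matching edges at one vertex
        link[u], link[v] = v, u
    ends = [v for v in mate if v not in link]
    if len(ends) != 2:
        return False
    covered, v = 0, ends[0]
    while True:
        v = mate[v]  # traverse a contig
        covered += 2
        if v not in link:
            break
        v = link[v]  # traverse a non-matching edge
    return covered == len(mate)


for k in (3, 5, 10):
    matching, forced, optimal, weights = grid_fig2(k)
    assert is_single_path(matching, forced) and is_single_path(matching, optimal)
    w_forced = sum(weights[e] for e in forced)
    w_opt = sum(weights[e] for e in optimal)
    print(k, w_forced, w_opt)
```

For each $k$, the forced solution has weight one while the alternative has weight $k - 1$, so the ratio grows linearly with the number of contigs.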
We conclude this section with a note on real-world instances, which are too sparse to fall into our considered class. However, we can transform them by adding some non-matching edges with weight zero. This technique was used to run the feasibility function for complete graphs on simulated instances [5] and the computed solution was close to the optimal. One of the reasons we develop a feasibility function for connected cluster graphs is that we conjecture that using a feasibility function for a graph class that is closer to the original instance (edge-deletion distance from the class) provides better approximation in practice, even though the theoretical approximation factor of the algorithm becomes worse. We test this hypothesis in "Experimental results" section.

Computational hardness
As noted in the previous section, when using the greedy algorithm on a real instance, we must complete the original instance by adding non-matching edges of weight zero. To minimize the number of added edges, the greedy algorithm should be adapted to a sparse class of graphs. For this, scaffolding must be solvable in polynomial time on this class since, otherwise, the feasibility function cannot be run in polynomial time. In this section, we show that scaffolding is NP-hard on a particular class of graphs, namely those where $|M^*| = g^*(G^*) \cdot \sigma_c + \sigma_p$, so the greedy algorithm cannot be executed in polynomial time on this class (unless P = NP). First consider instances where $|M^*| = 2\sigma_c + \sigma_p$. In such an instance, any feasible solution $S$ contains only alternating paths of length one and alternating cycles of length four (i.e. the smallest possible elements). While scaffolding is polynomial in this case [5], a natural extension is to consider slightly longer alternating paths and alternating cycles. Unfortunately, it turns out that deciding whether $(G^*, M^*)$ contains a collection with alternating paths of length one and alternating cycles of length six is already NP-complete. In order to show this, we focus on the value of the alternating girth of the scaffold graph. Indeed, if $|M^*| = g^*(G^*) \cdot \sigma_c + \sigma_p$, then in any solution each alternating path consists of exactly one matching edge and each alternating cycle is a smallest alternating cycle, that is, it contains exactly $g^*(G^*)$ matching edges. We show that finding such a solution is NP-complete, even if $g^*(G^*) = 3$, by reducing independent set to it.

Fig. 2
Unbounded ratio of Algorithm 1 in the general case. Let $(G^*, M^*, \omega)$ be a $2 \times k$ grid where the perfect matching (bold edges) corresponds to the edges between the two rows. Let $(x_1, \dots, x_k)$ and $(y_1, \dots, y_k)$ be the vertices of the first and second row, respectively. We are looking for a solution of max scaffolding with $\sigma_c = 0$ and $\sigma_p = 1$. If the algorithm chooses first the edge $x_1 x_2$, then the only feasible solution is $S = \{x_\ell x_{\ell+1} \mid \ell \bmod 2 = 1\} \cup \{y_\ell y_{\ell+1} \mid \ell \bmod 2 = 0\}$ (dashed edges). Suppose that an optimal solution is $S_{opt} = E(G^*) \setminus (M^* \cup S)$ (solid edges). If all edges of $S_{opt}$ and $x_1 x_2$ have weight one and all edges of $S \setminus \{x_1 x_2\}$ have weight zero, then we have $(k - 1) \cdot \omega(S) = \omega(S_{opt})$, which leads to an unbounded ratio.

independent set (IS) is NP-complete in general graphs. In order to build our reduction, we need $G$ to be subcubic and triangle-free (i.e. $\Delta(G) \le 3$ and $g(G) > 3$). Note that Lozin and Milanič [12] showed that independent set remains NP-complete in $F$-free planar subcubic graphs if $F$ does not contain a tree with exactly three leaves. By choosing $F := \{C_3\}$ (where $C_3$ is the cycle on three vertices), we obtain the desired NP-completeness. Our reduction uses the following construction.
Construction 1 (see Fig. 3) Given a subcubic, triangle-free graph $G$, construct a scaffold graph $(G^*, M^*, \omega)$ as follows:
• for each edge $e_i \in E(G)$, construct a matching edge $u_i \bar{u}_i$, and
• for each vertex $v_t \in V(G)$, introduce the matching edges $\{u_t^j \bar{u}_t^j \mid j \le 3 - \deg(v_t)\} =: E_t$ and construct an alternating 6-cycle $C_t$ on the vertices of $E_t \cup \{u_i \bar{u}_i \mid v_t \in e_i\}$ such that no two $u$-vertices (or two $\bar{u}$-vertices) are adjacent.
The alternating cycles $C_t$ are called vertex-cycles. A bipartition is given by the $u$- and $\bar{u}$-vertices. Note that, if $G$ is planar, it is also possible to construct a planar graph (which may no longer be bipartite). To show hardness of scaffolding when $|M^*| = g^*(G^*) \cdot \sigma_c + \sigma_p$, we use the following properties of graphs resulting from Construction 1.
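To make the construction concrete, the following sketch builds the matching edges, the non-matching edges and the vertex-cycles from a small input graph. It is one possible reading of Construction 1 and not the authors' implementation: the function name, the tuple labels of the vertices (in particular the labels "pad" and "padbar" for the additional matching edges) and the order of the matching edges on each 6-cycle are assumptions made for illustration. Weights are omitted since they play no role in the decision problem.

```python
def construction_1(vertices, edges):
    """Build (matching, non_matching, vertex_cycles) from a subcubic, triangle-free graph."""
    matching = []
    incident = {t: [] for t in vertices}
    for i, (a, b) in enumerate(edges):
        contig = (("u", i), ("ubar", i))  # matching edge for the edge e_i
        matching.append(contig)
        incident[a].append(contig)
        incident[b].append(contig)

    non_matching, vertex_cycles = [], {}
    for t in vertices:
        assert len(incident[t]) <= 3, "input graph must be subcubic"
        # Additional matching edges so that every vertex-cycle has three contigs.
        pads = [(("pad", t, j), ("padbar", t, j))
                for j in range(1, 3 - len(incident[t]) + 1)]
        matching.extend(pads)
        cycle = incident[t] + pads
        vertex_cycles[t] = cycle
        # Link the barred end of each contig to the unbarred end of the next one,
        # so no two vertices of the same side are adjacent.
        for j in range(3):
            non_matching.append((cycle[j][1], cycle[(j + 1) % 3][0]))
    return matching, non_matching, vertex_cycles


# Path a - b - c: {a, c} is an independent set of size two.
matching, non_matching, cycles = construction_1(["a", "b", "c"], [("a", "b"), ("b", "c")])
print(len(matching), len(non_matching))
print(set(cycles["a"]) & set(cycles["c"]))  # vertex-cycles of non-adjacent vertices
print(set(cycles["a"]) & set(cycles["b"]))  # vertex-cycles of adjacent vertices
```

In this example the vertex-cycles of the non-adjacent vertices a and c share no matching edge, whereas those of the adjacent vertices a and b share the contig created for the edge ab, which is the property exploited in the reduction.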

Lemma 1
Let $G$ be a subcubic triangle-free graph and let $(G^*, M^*, \omega)$ be its scaffold graph produced by Construction 1. Let $S$ be a collection of $\sigma_c = k$ alternating cycles and $\sigma_p = |M^*| - 3k$ alternating paths. Then, (a) $g^*(G^*) = 3$, (b) every alternating cycle in $S$ is a vertex-cycle, and (c) if $C_t$ and $C_{t'}$ are distinct vertex-cycles in $S$, then the vertices $v_t$ and $v_{t'}$ are not adjacent in $G$.
Proof (a) By construction, each vertex-cycle contains exactly three matching edges and, thus, $g^*(G^*) \le 3$. Suppose there is an alternating cycle containing exactly two matching edges $e$ and $e'$. Let $C_t$ be a vertex-cycle containing $e$. Since $C_t$ has length six, there is another vertex-cycle $C_{t'} \ne C_t$ that contains $e'$. Indeed, $e$ and $e'$ are both in $C_t$ and $C_{t'}$ since, otherwise, their extremities cannot be adjacent. By construction, there are two edges $e_i$ and $e_j$ in $G$ that are incident to both $v_t$ and $v_{t'}$, contradicting $G$ being simple. Hence, there is no alternating cycle with two matching edges and $g^*(G^*) = 3$.
(b) Let $C$ be an alternating cycle in $S$. By Lemma 1(a), $|M^*| = g^*(G^*) \cdot \sigma_c + \sigma_p$, implying that $C$ has length six. Let $u_i \bar{u}_i$ be a matching edge of $C$. If there is a matching edge $v_t^1 \bar{v}_t^1 \in C$ then, by construction, the third matching edge of $C$ is either $v_t^2$ [...] is the vertex-cycle $C_t$. Suppose there is no matching edge $v_t^1 \bar{v}_t^1$ in $C$. For any pair of matching edges $(u_k \bar{u}_k, u_{k'} \bar{u}_{k'})$ of $C$, $e_k$ and $e_{k'}$ are incident to a same vertex in $G$. Let $u_i \bar{u}_i$, $u_j \bar{u}_j$ and $u_k \bar{u}_k$ be the three matching edges of $C$. Since $G$ is triangle-free, $e_i$, $e_j$ and $e_k$ share a common vertex in $G$; hence, $C$ is a vertex-cycle.
(c) Let $e_i = v_t v_{t'} \in E(G)$. The matching edge $u_i \bar{u}_i$ is in $C_t$ and $C_{t'}$ and, thus, $S$ cannot contain both $C_t$ and $C_{t'}$. □

Fig. 3
Example of a scaffold graph produced by Construction 1. Left: input graph with an independent set of size two given by the black vertices. Right: output graph with a collection of two alternating cycles and one alternating path in black. A bipartition is given by gray and white vertices. An example of a vertex-cycle is $C_{v_1} = \{v_1^1, \bar{v}_1^1, u_1, \bar{u}_1, u_2, \bar{u}_2\}$. It is possible to turn this graph into a planar graph by replacing the edges [...]

In the proof of correctness, we simulate vertices of the independent set with vertex-cycles. If a solution $S$ contains two vertex-cycles $C_i$ and $C_j$, then $v_i$ and $v_j$ are not adjacent in $G$. Hence, if a solution $S$ contains $k$ vertex-cycles, then there is an independent set of $k$ vertices in $G$.

Theorem 2
scaffolding is NP-complete, even in bipartite (or planar) subcubic scaffold graphs $(G^*, M^*, \omega)$ where $|M^*| = g^*(G^*) \cdot \sigma_c + \sigma_p$ and $g^*(G^*) = 3$.
Proof Since, clearly, scaffolding is in NP, it remains to show that Construction 1 is a reduction, that is, $G$ has an independent set of size $k$ if and only if there is a collection of $k$ alternating cycles and $|M^*| - 3k$ alternating paths in $(G^*, M^*)$.
"⇒": Let $I$ be an independent set of size $k$ in $G$. We build a solution of scaffolding as follows. For each vertex $v_t \in I$, we construct the vertex-cycle $C_t$ in $S$. For each remaining matching edge in $M^* \setminus \bigcup_{v_t \in I} C_t$, we construct an alternating path of length one. We obtain a solution $S$ as sought.
"⇐": Let $S$ be a solution in $(G^*, M^*)$ containing $k$ alternating cycles and $|M^*| - 3k$ alternating paths and let $I := \{v_t \mid C_t \in S\}$. By Lemma 1(b), any alternating cycle of $S$ is a vertex-cycle in $(G^*, M^*)$ and, thus, $|I| = k$. Moreover, by Lemma 1(c), $I$ is independent in $G$. □
Note that Theorem 2 can be generalized to $g^*(G^*) > 3$ by modifying Construction 1 as follows. First, we build our construction from a graph $G$ with $g(G) > \ell \ge 3$. IS remains NP-complete in such graphs by the result of Lozin and Milanič: it suffices to take $F = \{C_i \mid i \le \ell\}$, where $C_i$ is the cycle of order $i$. Then, we increase the length of every vertex-cycle by taking [...]. By making these modifications, we construct a scaffold graph with $g^*(G^*) = \ell$ and we preserve the properties of Lemma 1(b) and Lemma 1(c). This leads to the following result.

Corollary 1
scaffolding is NP-complete, even in bipartite (or planar) subcubic scaffold graphs $(G^*, M^*)$ where $|M^*| = g^*(G^*) \cdot \sigma_c + \sigma_p$, for all $g^*(G^*) \ge 3$.

Non-approximability
In this section, we discuss the hardness of approximating max scaffolding. Notice that, since scaffolding is NP-complete, there is no polynomial-time approximation algorithm for max scaffolding (unless P = NP). However, this argument does not hold for graph classes where scaffolding is in P (i.e. classes for which the feasibility function and, thus, the greedy algorithm run in polynomial time).
We show that, in this case, max scaffolding is still Poly-APX-hard, that is, it is not possible to approximate max scaffolding within a factor better than a polynomial function in $|V(G^*)| + |E(G^*)|$ (unless P = NP). Recall that Fig. 2 already shows that the greedy algorithm cannot approximate max scaffolding with a ratio better than a polynomial function. The inapproximability result presented in this section shows that this is the case for any polynomial-time algorithm. In the following, we construct an S-reduction (see [13]) from the optimization version of independent set.

Construction 2 (see Fig. 4) Let $G$ be a graph. Then, construct the following scaffold graph $(G^*, M^*, \omega)$:
[...], let $e_i$ and $e_j$ be the $k$th and $(k+1)$st edges of $A_t$, respectively, and add a non-matching edge between $u_i^t$ and $u_j^t$.
- Let $e_i$ and $e_j$ be the first and last edges of $A_t$, respectively, and add the non-matching edges $v_1^t u_i^t$ and $v_2^t u_j^t$.
• Each non-matching edge has weight zero, except the edges $v_2^t e_j$, which have weight one.
[...] and is denoted by $C(v_t)$. Note that a long vertex-cycle has weight one. Now consider the following properties.

Lemma 2
Let $G$ be a graph and let [...]
(a) Every non-zero-weight alternating cycle $C$ of $S$ is a long vertex-cycle.
[...]
Proof Note that it is always possible to build a collection of $|V(G)| + |E(G)|$ (weight-0) alternating cycles in $(G^*, M^*, \omega)$ by constructing the alternating cycle [...].
Claim 1 [...] Then, no alternating cycle of $S$ contains both $e_i \bar{e}_i$ and $v_1^t$ [...]
Proof Towards a contradiction, assume that there is such an alternating cycle $C$. By the pigeonhole principle, one of the $|V(G)| + |E(G)|$ alternating cycles in $S$, say $C'$, avoids both $e_i \bar{e}_i$ and $v_1^t \bar{v}_1^t$ for all $i, t \in \mathbb{N}$. Let $u_i^t \bar{u}_i^t$ be a matching edge of $C'$ for some $e_i = v_t v_q$. Then, $C'$ cannot contain $u_i^q \bar{u}_i^q$ as, otherwise, $e_i \bar{e}_i$ cannot be part of an alternating cycle in $S$, implying that $S$ is not a solution. Thus, each matching edge of $C'$ is on the long vertex-cycle $C(v_t)$. Since the graph induced by the vertices of [...]
(a): Let $C$ be a non-zero-weight alternating cycle of $S$ and assume towards a contradiction that $C$ is not a long vertex-cycle. Since $C$ contains a non-zero-weight edge $v_2^t u_i^1$, the matching edge $v_2^t \bar{v}_2^t$ is in $C$. As $C$ is not a long vertex-cycle, there is some $e_i = v_t v_q$ such that $C$ contains both $u_i^t \bar{u}_i^t$ and $u_i^q \bar{u}_i^q$. Thus, either the matching edge $e_i \bar{e}_i$ is in $C$, contradicting Claim 1, or $e_i \bar{e}_i$ forms a single-edge alternating path of $S$, contradicting our choice of $S$.
(b): Towards a contradiction, assume that $S$ contains $C(v_t)$ and $C(v_q)$ such that $e_i = v_t v_q \in E(G)$. Then, the matching edge $e_i \bar{e}_i$ is a single-edge alternating path of $S$, contradicting our choice of $S$.
We now show the Poly-APX-hardness of max scaffolding, even for graph classes for which Scaffolding ∈ P . Reusing the same idea of Theorem 2, we simulate the vertices of the independent set with long vertex-cycles. If a solution S of max scaffolding has weight k, then S contains k long vertex-cycles and, since their related vertices cannot be adjacent, we can construct an independent set with k vertices in G.

Theorem 3 max scaffolding is Poly-APX-hard, even for graph classes for which Scaffolding ∈ P.
Proof Let $G$ be an instance of independent set and let $(G^*, M^*, \omega)$ be the scaffold graph produced by Construction 2. Let $\mathcal{S}$ be the set of all collections of $\sigma_p = 0$ alternating paths and $\sigma_c = |V(G)| + |E(G)|$ alternating cycles in $(G^*, M^*, \omega)$.
Recall that independent set is Poly-APX-complete for general graphs [14]. We show that $G$ has a size-$k$ independent set if and only if $\mathcal{S}$ contains a solution $S$ of score $k$.
"⇒": Let $I$ be an independent set of size $k$ in $G$. We construct a solution $S \in \mathcal{S}$ as follows.
First, for each $v_t \in I$, construct the alternating cycle [...] Third, for each edge $e_i = v_t v_q$ not incident with a vertex in $I$, construct the alternating cycle [...] Since each long vertex-cycle has weight one, we obtain a solution $S$ with $\omega(S) = k$.
"⇐": Let $S \in \mathcal{S}$ with $\omega(S) = k$. We construct an independent set $I$ by taking all vertices whose long vertex-cycle is in $S$, that is, $I := \{v_t \mid C(v_t) \in S\}$. Since each long vertex-cycle has weight one, Lemma 2(a) implies that $S$ contains $k$ long vertex-cycles. Thus, $|I| = k$. Further, by Lemma 2(b), $I$ is independent.
Let $f$ be the function corresponding to Construction 2 and let $g$ be a function that computes an independent set in $G$ from a solution in $f(G)$, as described above. Suppose that there is a polynomial-time algorithm $A$ with approximation factor $\rho$ for max scaffolding. The approximation factor of $g \circ A \circ f$ is equal to $\rho$; thus, Construction 2 constitutes an S-reduction, and non-approximability results for independent set transfer to max scaffolding. □

Feasibility function for connected cluster graphs
In this section, we present a feasibility function for connected cluster graphs using dynamic programming. For simplicity, we consider in the following scaffold graphs $(G^*, M^*, \omega)$ such that $G^*$ is a connected cluster graph and no matching edge is a bridge. The case where a matching edge can be a bridge is covered by the feasibility function for block graphs (see "Experimental results" section).

Definitions
Notice that the structure of a connected cluster graph is close to a tree (that is, collapsing each clique of $G^*$ into a single vertex leads to a tree), so we use a similar vocabulary: a rooted connected cluster graph is a connected cluster graph where a clique $r$ is designated as the root. Then, the following notation applies: the parent of a clique $x \ne r$ is the clique adjacent to $x$ on the unique $x$-$r$-path. A child of a clique $c$ is a clique of which $c$ is the parent. Any clique without children is called a leaf. A vertex $v$ of a clique $c$ is a door of $c$ if $v$ is adjacent to a vertex $u$ in a child of $c$. In that case, for simplicity, we say that the clique containing $u$ is a child of $v$. The upper door of a clique $c \ne r$ is the unique vertex $v$ of $c$ that is adjacent to a vertex of the parent of $c$.
Let $c$ be a clique of $G^*$ and let $S$ be a partial solution in $G^*$. Let $S'$ be the intersection of $S$ and $c$. An alternating element of $c$ is either an alternating cycle of $S'$ or an alternating path of $S'$. Notice that an alternating path of $S$ can be decomposed into several alternating elements if it belongs to several cliques. Let $e$ be the alternating element containing the upper door of $c$. The subclique $c'$ of $c$ is the subgraph containing every vertex of $c$ that does not belong to $e$. [...]
We use the tree structure to develop a bottom-up algorithm, that is, we construct and assemble partial solutions from the leaves to the root. We define some operations to combine these partial solutions.
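The vocabulary above (parent, child, door, upper door) can be computed by a breadth-first traversal of the clique tree. The following sketch assumes that the cliques and the bridges of the connected cluster graph are already known; the function name `root_cluster_graph` and the representation of cliques by their index in a list are hypothetical choices made for illustration.

```python
from collections import deque


def root_cluster_graph(cliques, bridges, root):
    """Return (parent, upper_door, doors) for a connected cluster graph rooted at clique `root`.

    cliques: list of vertex sets; bridges: list of edges joining two different cliques.
    """
    owner = {v: i for i, clique in enumerate(cliques) for v in clique}
    links = {i: [] for i in range(len(cliques))}
    for u, v in bridges:
        links[owner[u]].append((u, v))  # stored as (vertex here, vertex in the other clique)
        links[owner[v]].append((v, u))

    parent = {root: None}
    upper_door = {root: None}  # the root has no upper door
    doors = {i: set() for i in range(len(cliques))}
    queue = deque([root])
    while queue:
        c = queue.popleft()
        for here, there in links[c]:
            child = owner[there]
            if child in parent:
                continue  # this bridge leads back to the parent of c
            parent[child] = c
            upper_door[child] = there  # vertex of the child adjacent to its parent
            doors[c].add(here)  # vertex of c adjacent to a child
            queue.append(child)
    return parent, upper_door, doors


# A triangle, an edge and a triangle, chained by the bridges (3, 4) and (5, 6).
cliques = [{1, 2, 3}, {4, 5}, {6, 7, 8}]
parent, upper_door, doors = root_cluster_graph(cliques, [(3, 4), (5, 6)], root=0)
print(parent, upper_door, doors)
```

Here the middle clique has upper door 4 and door 5, and the last clique is a leaf with upper door 6. A bottom-up algorithm would process the cliques in the reverse of the order in which this traversal discovers them.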

Operations
Let $G_1$ and $G_2$ be two edge-disjoint subgraphs. We can build a solution in the graph induced by [...] from a solution in $G_1$ and a solution in $G_2$, using four operations.
Definition 2 Let $G_1$ and $G_2$ be edge-disjoint subgraphs of $G^*$. Let $S_1$ and $S_2$ be solutions of $G_1$ and $G_2$, respectively. [...] $S$ is a composition of $S_1$ and $S_2$ if $S$ can be obtained from $S_1 \cup S_2$ by at most one of the following operations:
[...]
Closing: close an alternating path $(u_1, u_2, \dots, u_{2t})$ of $S_1$ and an alternating path $(v_1, v_2, \dots, v_{2q})$ of $S_2$ into an alternating cycle by adding the non-matching edges $u_{2t} v_1$ and $v_{2q} u_1$.
Absorption: replace a non-matching edge $v_{2i} v_{2i+1}$ of an alternating path in $S_2$ with an alternating path $(u_1, u_2, \dots, u_{2t})$ of $S_1$ by removing $v_{2i} v_{2i+1}$ and adding the non-matching [...]
Finally, if no operation is necessary to obtain $S$ from $S_1 \cup S_2$, we say that $S$ is obtained by juxtaposition.
Note that all presented operations add only edges of [...]. Note further that not all compositions of two solutions are guaranteed to exist for a pair $S_1$ and $S_2$. In the algorithm, we manipulate sets of solutions: we can create a new set of solutions from two sets of solutions if all pairs of solutions of the two input sets are used in the resulting set.

Definition 3
Let $G_1$ and $G_2$ be two edge-disjoint subgraphs of $G^*$ and let $\mathcal{S}_1$ and $\mathcal{S}_2$ be sets of solutions of subgraphs $G_1$ and $G_2$, respectively. Let op be one of the operations described in Definition 2. Then, we call the [...]
To ensure the possibility of building a complete composition from two sets of solutions, it is useful to characterize a solution according to the operations we can perform on it.

Definition 4
Let $G$ and $G'$ be two edge-disjoint subgraphs of $G^*$ and let $S$ be a feasible solution of scaffolding for $(G, M^*, \omega)$.
1. $S$ is closeable if $S$ contains an alternating path $(u_1, u_2, \dots, u_{2t})$ and $G'$ contains an alternating path [...]
[...] $v$ is an extremity of an alternating path and $v$ has a neighbor in [...]
[...] $(u_1, u_2, \dots, u_{2t})$ and $G'$ contains an alternating path with extremities $v$ and $w$ such that $v u_{2i}, w u_{2i+1} \in E(G^*) \setminus M^*$ for some $i < t$. Note that an absorbent solution can also be closeable, alternating or frozen.
Note that all closeable solutions are also extensible. If a solution S is closeable by a subgraph G ′ , then we can close an alternating path of S into an alternating cycle by adding some edges of G ′ . If a solution S is extensible by a subgraph G ′ , then we can add some edges of G ′ in an extremity of an alternating path of S without changing the cardinality of the solution. Finally, if a solution S is absorbent to a subgraph G ′ , then we can replace an absorbent edge of S by a path of length three without changing the cardinality of S. An example of the different operations of Definition 4 is given in Fig. 5.

Semantics
Since the number of possible solutions can be exponential, we only store the possible cardinalities in the table entries, which is sufficient to answer the question of feasibility. Recall that, if $X, Y \subseteq \mathbb{N}$ are two sets of integers, then the sum of $X$ and $Y$ is defined as $X + Y := \{x + y \mid x \in X,\ y \in Y\}$. In the following, we call an integer $j$ eligible with respect to a set $\mathcal{S}$ of solutions and an integer $i$ if there is a solution $S \in \mathcal{S}$ containing $i$ alternating cycles and $j$ alternating paths. Then, our dynamic programming table has the following semantics.

If $\mathcal{S}$ is the set of solutions composed with a juxtaposition operation, then [...]
Proof Let $S \in \mathcal{S}$ and let $S_1$ and $S_2$ denote the solutions of $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively, such that $S$ is composed by $S_1$ and $S_2$. Then, since $S_1$ and $S_2$ have a common alternating path in $S$, we have $\sigma_p(S) = \sigma_p(S_1) + \sigma_p(S_2) - 1$ and, since no cycle is formed, $\sigma_c(S) = \sigma_c(S_1) + \sigma_c(S_2)$. Thus, since $\mathcal{S}$ is a complete composition of $\mathcal{S}_1$ and $\mathcal{S}_2$, we have [...]
We use Lemma 3 to define the four functions juxtapose, merge$_t$, absorb, and close$_t$, which provide table entries for complete compositions "composed" with a juxtaposition, merge, absorption or closing operation, respectively. Although Lemma 3 is defined for two sets, we use a generalized version which can take more than two sets as parameters. The functions merge$_t$ and close$_t$ have a parameter $t$ that indicates the number of paths merged or closed during the operation. For example, if we have three sets $\mathcal{S}_1$, $\mathcal{S}_2$, and $\mathcal{S}_3$ and if it is possible to construct a single alternating path in the resulting composition by taking one alternating path in each set, then we use the function merge$_3(\{\mathcal{S}_1\}, \{\mathcal{S}_2\}, \{\mathcal{S}_3\})$. Note that the parameter $t$ can be different from the number of sets. In addition, it is sometimes possible to close a single alternating path into an alternating cycle and, in that case, the function close$_1$ is used. The four functions are defined in Algorithm 2, Algorithm 3 and Algorithm 4. However, we must ensure that the associated operation is feasible before using one of these functions. For each traversed subgraph, we use four different sets of solutions, distinguishing solutions according to their properties.
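The way cardinalities combine under these operations can be illustrated with a small sketch. It is not a transcription of Algorithms 2 to 4, which are not reproduced here: as a simplifying assumption, a table entry is represented as a set of (cycles, paths) pairs, only the two-set case is handled, and the offsets are inferred from the text (a merge of one path from each side removes one path, as in the proof above; closing one path from each side into a cycle removes two paths and adds one cycle, following Definition 2). The function names mirror those of the text but the signatures are hypothetical, and whether an operation is structurally possible (Definition 4) is deliberately ignored.

```python
def juxtapose(table_1, table_2):
    """No operation: cycle counts and path counts simply add up."""
    return {(c1 + c2, p1 + p2) for c1, p1 in table_1 for c2, p2 in table_2}


def merge(table_1, table_2):
    """One path from each side is joined into a single path: one path fewer."""
    return {(c1 + c2, p1 + p2 - 1)
            for c1, p1 in table_1 for c2, p2 in table_2
            if p1 >= 1 and p2 >= 1}


def close(table_1, table_2):
    """One path from each side, closed into one cycle: two paths fewer, one cycle more."""
    return {(c1 + c2 + 1, p1 + p2 - 2)
            for c1, p1 in table_1 for c2, p2 in table_2
            if p1 >= 1 and p2 >= 1}


# Cardinalities (cycles, paths) reachable in two edge-disjoint subgraphs.
left = {(0, 1)}
right = {(0, 1), (1, 2)}
print(sorted(juxtapose(left, right)))
print(sorted(merge(left, right)))
print(sorted(close(left, right)))
```

Storing only these pairs keeps each table entry polynomial in size, since the numbers of cycles and paths are both bounded by the number of matching edges.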

Definition 6
Let $S$ be a partial solution of $G^*$. Let $x$ be a vertex, a partial path, a subclique or a clique of $G^*$ and let $S'$ be a solution of the subgraph $G^*(x)$. Then,
• [...] and $S$ is extensible and $S \cap E(G^*(x)) \subseteq S'$.
• $S \in A(x) \Leftrightarrow S$ is frozen and absorbent and $S \cap E(G^*(x)) \subseteq S'$.
• $S \in F(x) \Leftrightarrow S \notin A(x)$ and $S$ is frozen and $S \cap E(G^*(x)) \subseteq S'$.
In the algorithm, we traverse four different types of subgraphs defined as follows.
• Let $v \in V(G^*)$ and let child($v$) be the set of children of $v$ in $G^*$ (possibly empty). Then, $G^*(v)$ denotes the subgraph of $G^*$ that is induced by $v$ and every branch linked [...]
• Let $e$ be an alternating element. Then, $G^*(e)$ denotes the subgraph of $G^*$ that is induced by $e$ and all children of its vertices. Formally, $G^*(e) = G^*\left[\bigcup_{v \in e} V(G^*(v))\right]$.
• Let $c$ be a clique of $G^*$ and let $c'$ be the subclique of $c$. For all $x \in \{c, c'\}$, the subgraph $G^*(x)$ is the union of $x$ and all children of [...].

The algorithm
We now present a method to provide the feasibility function needed by Algorithm 1. In the next paragraphs, we describe the algorithms that calculate the table entries for the four types of subgraphs described above.

Vertex
Let v ∈ V (G * ) . We show in this part how to compute the table entries for the sets F(v) and E(v) . Note that, since the edge between G * (v) and its parent is a bridge, any solution S ′ for G * (v) can have at most one edge incident to v. Thus, the sets C(v) and A(v) are empty. If v is not incident to an edge of S ∩ E(G * (v)) , then we construct the table entries by successively merging the table entries of the children adjacent to v. For that, we use at each step an intermediate graph G i . Let V i be the set of the first i children of v. G i is the subgraph of G * induced by v and all vertices in V i . If v is incident with an edge S ∩ E(G * (v)) , then any solution containing S is in E(v) . An example of solutions computed by Algorithm 5 is depicted in Fig. 6.
We distinguish two cases.
Case 1: there is an edge uv ∈ S ∩ E(G*(v)). Thus, S′ is extensible and is composed of the merge of an extensible solution in G*(c_u) with uv and the juxtaposition of any solution for each child c_u′ ≠ c_u. Hence, lines 9 and 11 are correct.

Case 2: there is no edge uv ∈ S ∩ E(G*(v)). Then, S′ is frozen if and only if it does not contain an edge incident to v. As there is no edge uv in any child c_t, S′ is composed by juxtaposition of any solution for each child c_t and the assignment in line 13 is correct. If S′ is extensible, then there is a unique child c_t of v such that an alternating path from S′ ∩ E(G*(c_t)) has been expanded to v and, therefore, the solution S′ ∩ E(G*(c_t)) is extensible. Thus, S′ is composed of a merge of an extensible solution of a unique child and the juxtaposition of any solution in the other children. Hence, line 14 is correct.

Note that the only possibility to obtain an absorbent solution of G*(e) is when e is a path that is closed into an alternating cycle. However, if an absorption operation is done in the function compute_subclique, then […]

[…] Let {e_1, . . . , e_k} be a list of alternating elements of c′, let t ≤ k, let E_t = {e_1, . . . , e_t}, and let V_t = ⋃_{e∈E_t} V(G*(e)). Let G_t be the subgraph of G* induced by V_t. At step t, a solution S′ is in A+ (resp. E+) if and only if (1) S′ is a solution of G_t, (2) S′ contains a set C ≠ ∅ of closeable paths and (3) S \ C is not extensible (resp. extensible).

Proof Assume that the table entries returned by compute_alternating_element are correct. We show by induction that the values calculated in each step t are correct for the graph G_t. First, G_0 is the empty graph and its unique solution is the one containing zero alternating cycles and paths; this solution is frozen. Thus, lines 1 to 3 are correct. Now, consider the alternating element e_t and suppose the previously computed values are correct (i.e. the values stored in F′, A′, E′, A′+ and E′+).
Let S_1 be a solution in G_{t−1}, let S_2 be a solution in G*(e_t) and let S′ be a composition of S_1 and S_2. We have the following properties:
• if S′ is obtained by a juxtaposition, then S_1 is in […]

Thus, there are 25 complete compositions to consider. If S_2 ∈ C(e_t) (resp. E(e_t)) and S′ is obtained by a closing (resp. merge), then S′ is closeable (resp. extensible) if S_1 contains more than one closeable (resp. extensible) alternating path, and absorbent otherwise. Hence, a complete composition obtained with a closing or a merge is not included in a unique set among those defined. This problem can be solved by ignoring certain solutions: S′ can be ignored if there is another solution in G_t with the same cardinality.
1. Suppose S′ is obtained with a closing (resp. merge) and S_1 contains more than one closeable (resp. extensible) alternating path. Let p_1 and p_2 be closeable (resp. extensible) alternating paths of S_1. There is a solution S′_1 similar to S_1 except that p_1 and p_2 have been closed into a cycle (resp. merged into a unique alternating path) during a previous step. We can obtain a solution in G_t with the same cardinality as S′ by juxtaposing S′_1 and S_2. Thus, S′ can be ignored, and we suppose that a solution obtained with a closing does not contain a closeable alternating path (i.e. is not in A+ or E+). Likewise, we can suppose that a solution obtained with a merge between a solution of E′ ∪ E′+ and a solution of E(e_t) does not contain an extensible alternating path (i.e. is not in E(c′) or E+).
2. Assume that one of the following conditions is true:
(1) S_1 ∈ A′+, S_2 ∈ E(e_t) and S′ is obtained by a merge,
(2) S_1 ∈ E′+, S_2 ∈ F(e_t) and S′ is obtained by a merge,
(3) S_1 ∈ A′+, S_2 ∈ F(e_t) and S′ is obtained by an absorption.
Let p be a closeable alternating path of S_1 that is absorbed or merged in S′. There is a solution S′_1 similar to S_1 except that all non-matching edges of p have been merged or absorbed during previous steps. We can obtain a solution in G_t with the same cardinality as S′ by juxtaposing S′_1 and S_2. Thus, S′ can be ignored. Fig. 8 shows an example of case (3).
The second item allows us to ignore three complete compositions: there are 22 still to be considered. Each of these complete compositions is in exactly one of the five sets of solutions F(c′), A(c′), E(c′), A′+ and E′+.
• Suppose S′ is frozen. The only feasible operation to obtain S′ is juxtaposition because an addition of an edge of E(c′) \ S creates an absorbent solution. S_1 and S_2 are frozen as, otherwise, their juxtaposition is not frozen. Thus, line 9 is correct.
• Suppose S′ is absorbent. Thus, S′ contains at least one edge in E(c′) \ S.
  - If S_2 is frozen, then the only feasible operation is juxtaposition and S_1 is absorbent.
  - If S_2 is extensible, then its extensible alternating path is merged with an extensible alternating path of S_1 that is not closeable. Thus, S_1 is in E′.
  - If S′ results from an absorption, then S_1 is absorbent and S_2 is closeable.
  - If S′ results from a closing, then S_1 and S_2 are closeable. Since the resulting solution is absorbent, S_1 is in A′+.
Hence, line 10 is correct.
• Suppose S′ ∈ A+. Then, S′ is closeable and does not contain any extensible alternating path.
  - If S′ results from a juxtaposition, then S_1 does not contain an extensible alternating path and S_2 is either frozen or closeable. In the first case, S_1 must be closeable and therefore S_1 ∈ A′+. In the second case, S_1 is in F′, A′ or A′+.
  - If S′ results from a merge, then S_1 is closeable and S_2 is either extensible or closeable. In the first case, the extensible alternating path of S_1 is merged with an extensible alternating path of S_2 so that the resulting solution is not extensible. Thus, S_1 is in E′+. In the second case, S_1 does not contain an extensible alternating path since otherwise S′ is extensible. Thus, S_1 is in A′+.
Hence, line 11 is correct.
• Suppose S′ is extensible. Then, either S_1 contains an extensible alternating path or S_2 is extensible.
  - If S_1 is extensible and S′ results from a juxtaposition, then S_2 is not closeable since otherwise the resulting solution is also closeable. Thus, S_2 is frozen or extensible.
  - If S_1 is extensible and S′ results from a merge, then, since we only consider solutions of E′ with a unique extensible alternating path, S_2 cannot be extensible since otherwise the resulting solution is absorbent. Thus, S_2 is closeable.
  - If S_1 is in E′+, then, since S′ is not closeable, the extensible alternating path of S_1 is either merged with an alternating path or closed into a cycle with a closeable alternating path. Thus, S_2 is extensible and S′ results from a merge, or S_2 is closeable and S′ results from a closing.
  - If S_2 is extensible and S_1 does not contain any extensible or closeable alternating path, then S′ results from a juxtaposition and S_1 is frozen or absorbent.
Hence, line 12 is correct.
• Suppose S′ is in E+. Then, S′ is closeable and contains one extensible alternating path. Recall that we ignore solutions resulting from a merge between a solution of E+ and a closeable solution. Thus, S′ results from a juxtaposition and either S_1 or S_2 contains an extensible alternating path.
  - If S_1 is in E′+, then S_2 can be any solution.
  - If S_1 is in A′+, then for S′ to contain an extensible alternating path, S_2 must be extensible.
  - If S_1 is extensible, then for S′ to contain a closeable alternating path, S_2 must be closeable.
Hence, line 13 is correct.
Since, after these assignments, each solution of G_t is in a unique set and is a composition of a solution of G_{t−1} and a solution of G*(e_t), the computed values for the table entries are correct for G_t.

Finally, after the execution of the loop, the computed values for the sets F(c′), A(c′) and E(c′) are correct for G_k = G*(c′). It remains to compute the value of the table entry for C(c′). Sets containing closeable alternating paths are exactly the sets A+ and E+, thus A+ ∪ E+ = C(c′). Hence, the assignment in line 15 is correct. □

Clique
Let c be a clique of G* and let d be the upper door of c. We show in this part how to compute the table entries for the sets F(c) and E(c). Note that, since the edge between G*(c) and its parent is a bridge, the sets C(c) and A(c) are empty. Let e be the alternating element of c containing the upper door d of c. The idea is to first compute the table entries for the graph G*(e) and then merge the obtained […] S′ ∈ E_d if and only if S′ ∈ E(e) and d is an extremity of an alternating path of S′. Likewise, S′ ∈ E_d′ if and only if S′ ∈ E(e) and d is not an extremity of an alternating path of S′. Note that E(e) = E_d ∪ E_d′. In order to compute these two sets, we reuse the value of I_e, computed in compute_alternating_element.

Table 2 Compute_alternating_element
Table 3 Compute_subclique
Table 4 Compute_clique

Feasibility function
We can now provide an answer to the feasibility of finding a solution for Scaffolding by using Algorithm 9. Let r be the root of G*. Notice that, since r does not have an upper door, the subclique of r corresponds to r. Thus, it is not possible to call compute_clique on r. That is why the first recursive call of the algorithm is made with the function compute_subclique. A running example is depicted in Fig. 9 and Example 1 (Tables 1, 2, 3, 4 and 5).

Example 1 Running example on the graph depicted in Fig. 9. Tables 1, 2, 3 and 4 depict the table entries resulting from Algorithms 5 to 8, respectively. Table 5 displays the values of the table entries after each iteration over an alternating element of the subclique c′_5. Let c be the value given by the column "#cycles" and x be the item considered in the first column. For each X in F, A, A+, E, E+ and C, the interval given by the column X corresponds to [X(x), c].

Approximation result
We now prove the following approximation result.

Theorem 4 Algorithm 1 provides a solution for (σ_p, σ_c)-scaffolding in connected cluster graphs with an approximation ratio of at most five and a time complexity of O(|V(G*)| · |E(G*)| · σ_c²). The approximation ratio is tight.
Proof We suppose that the input of the algorithm is a scaffold graph (G*, M*, ω) with non-negative weights and such that G* is a path connected cluster graph. We first show that the algorithm is correct. Note that, since each time we add an edge e to S we remove from E all non-matching edges incident to e, the set S induces only paths and cycles. If it is not possible to build a solution from the graph, then the feasibility condition is not verified and the algorithm returns an error. Otherwise, since we ensure that the feasibility condition is verified at each step, the algorithm has built σ_p paths and σ_c cycles when it terminates.

Now, we prove the approximation ratio. Since the edges of M* appear in every solution, we do not consider them in what follows. Notice that each path contains one chosen edge fewer than its number of matching edges, and each cycle contains as many chosen edges as matching edges. Hence, the number of non-matching edges in every solution is exactly n − σ_p.
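The counting argument above can be checked on a small instance. The sketch below is illustrative only: it assumes that n denotes the total number of matching edges, as the counting argument suggests, and it describes a solution by a hypothetical list of (kind, number of matching edges) pairs rather than by an actual graph.

```python
def non_matching_edges(components):
    """Count the chosen (non-matching) edges of a solution.

    components: list of (kind, k) pairs, where kind is "path" or "cycle"
    and k is the number of matching edges in that component
    (a hypothetical encoding used only for this illustration).
    """
    total = 0
    for kind, k in components:
        if kind == "path":
            total += k - 1  # a path alternates: one chosen edge fewer
        elif kind == "cycle":
            total += k      # a cycle alternates: as many chosen edges
        else:
            raise ValueError(f"unknown component kind: {kind}")
    return total


# Two paths (3 and 1 matching edges) and one cycle (2 matching edges).
solution = [("path", 3), ("path", 1), ("cycle", 2)]
n = sum(k for _, k in solution)
sigma_p = sum(1 for kind, _ in solution if kind == "path")
```

Here n = 6 and σ_p = 2, and the function returns 4 = n − σ_p, independently of the number of cycles.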
We denote by e_1, . . . , e_m the edges of the graph G*, sorted in non-increasing order of weight. We denote by e^A_1, . . . , e^A_{n−σ_p} the edges of the solution S_A given by Algorithm 1, sorted in non-increasing order of weight. In the same way, we denote by e^opt_1, . . . , e^opt_{n−σ_p} the edges of an optimal solution S_opt for the problem, also sorted in non-increasing order. Both sequences are clearly subsequences of e_1, . . . , e_m. Let ϕ : S_opt → S_A be a mapping such that

\[ \forall e \in S_{opt},\quad \omega(e) \le \omega(\phi(e)) \tag{1} \]
\[ \forall e \in S_A,\quad |\phi^{-1}(\{e\})| \le 5 \tag{2} \]

Inequality (1) indicates that for each edge e of an optimal solution, there is an edge ϕ(e) ∈ S_A whose weight is at least the weight of e, whereas (2) states that each e ∈ S_A is associated with at most five edges of the optimal solution. In the following, we prove that it is possible to define a mapping ϕ satisfying these inequalities.

The algorithm may decide not to choose an edge e^opt_i for four main reasons:
• e^opt_i is eliminated because it is in R when an edge e^A_j is chosen. In this case, we have ω(e^A_j) ≥ ω(e^opt_i) because only edges appearing after e^A_j in the ordered list can be in R. When an edge e^A_j is chosen, it can eliminate at most two edges of the optimal solution through the update of the list of edges (see Fig. 10). We assign ϕ(e^opt_i) = e^A_j in this case. (1) is satisfied by construction, and (2) holds when considering only the optimal edges that are eliminated in this way.
• e^opt_i is eliminated because its addition disconnects the graph and the number of alternating cycles and alternating paths required to cover the graph becomes too big. This happens in one of the following two cases.
  - e^opt_i closes a cycle. In that case, there is at least one edge e^A_j in this cycle and, since it has been chosen before the algorithm considers e^opt_i, we necessarily have ω(e^A_j) ≥ ω(e^opt_i). Thus, we assign ϕ(e^opt_i) = e^A_j. Then, (1) is satisfied by construction. The edge e^A_j, which has already been chosen, may have eliminated at most two optimal edges, so (2) is still satisfied.
  - e^opt_i closes a door d and one bridge dx incident to d is necessary to construct a solution with the remaining edges. There is a door y which has been closed by an edge e^A_j in a previous step and this forces dx to be in S_A. Since closing a door increases by at most one the minimum number of alternating paths required to cover the graph, the closing of y forces at most one bridge of G* to be in S_A. Thus, the closing of y prevents d and x from closing, that is, at most two edges of S_opt, incident to d and x respectively, can be associated to e^A_j. Then, (1) is satisfied by construction. The edge e^A_j may have eliminated at most two optimal edges in R and may prevent the closing of a cycle, but (2) is still satisfied.
• e^opt_i is eliminated because its inclusion would merge two paths p_1 and p_2. If e^opt_i is not a bridge and p_1 and p_2 are single-edge paths, then the required numbers of alternating cycles and paths are reached in S, that is σ_c = c, σ_p = p and S = S_A. Then, we can find an edge e^A_j such that |ϕ^{−1}(e^A_j)| = 0 and we assign ϕ(e^opt_i) = e^A_j. Then, (1) and (2) are satisfied by construction. Otherwise, the algorithm eliminates e^opt_i because one of the merged paths must be closed into a cycle to reach the correct number of alternating cycles. In that case, there is an edge e^A_j in S_A considered before e^opt_i in the algorithm such that |ϕ^{−1}(e^A_j)| ≤ 3 (since otherwise the path would be already closed into a cycle) and then we assign ϕ(e^opt_i) = e^A_j. Again, (1) and (2) are satisfied by construction.
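From the previous discussion and by (1) and (2), we clearly have:

\[ \sum_{e \in S_{opt}} \omega(e) \;\le\; \sum_{e \in S_{opt}} \omega(\phi(e)) \;\le\; 5 \sum_{e \in S_A} \omega(e) \]

The ratio is tight, as shown by the example depicted in Fig. 11.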
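As a sanity check of how conditions (1) and (2) yield the factor five, the following sketch verifies both conditions for an explicitly given mapping and compares the total weights. The instance is hypothetical and only loosely modelled on the weights reported for Fig. 11: the edge labels o1 to o5 and z1 to z4 are invented for illustration, and the assumption that each optimal edge has weight 1 is ours.

```python
from collections import Counter


def satisfies_mapping_conditions(phi, weight, bound=5):
    """Check conditions (1) and (2) for a mapping phi: S_opt -> S_A."""
    # (1): every optimal edge is mapped to an edge at least as heavy.
    dominated = all(weight[e] <= weight[f] for e, f in phi.items())
    # (2): no edge of S_A receives more than `bound` optimal edges.
    preimage_sizes = Counter(phi.values())
    bounded = all(size <= bound for size in preimage_sizes.values())
    return dominated and bounded


def total_weight(edges, weight):
    return sum(weight[e] for e in edges)


# Hypothetical instance: five optimal edges of weight 1 (assumed), and a
# greedy solution made of "ac" (weight 1) plus four zero-weight edges.
s_opt = ["o1", "o2", "o3", "o4", "o5"]
s_a = ["ac", "z1", "z2", "z3", "z4"]
weight = {e: 1 for e in s_opt}
weight.update({"ac": 1, "z1": 0, "z2": 0, "z3": 0, "z4": 0})
phi = {e: "ac" for e in s_opt}  # every optimal edge is charged to "ac"
```

In this instance both conditions hold, and the optimal weight equals exactly five times the greedy weight, which is the extreme case allowed by the bound.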
Fig. 11 The approximation ratio of five for the greedy algorithm is tight. Matching edges are bold, dashed edges are in the approximate solution and solid edges are in the optimal solution. G* is composed of the cliques C_1 = {a, b, c, d, e, f}, C_2 = {g, h}, C_3 = {i, j, k, l} and C_4 = {m, n, o, p}. All edges have weight zero except ac and the edges of S_opt. We suppose that σ_p = 3 and σ_c = 0, and the greedy algorithm chooses "the wrong edge" ac first. Consequently, the solution S_A given by the greedy algorithm is of weight 1, whereas an optimal solution would be of weight 5.

Concerning the complexity, the edges can be sorted in O(|E(G*)| log |E(G*)|) time. The feasibility function is called |E(G*)| times. Thus, the time complexity of the algorithm is O(|E(G*)| · |V(G*)| · σ_c²).
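The structure behind this bound can be sketched as a greedy loop in which the feasibility function is a pluggable callback. This is a simplified illustration based on the description given in the proof (edges sorted by non-increasing weight, non-matching edges incident to a chosen edge removed), not a transcription of Algorithm 1: the callback `is_feasible` is a hypothetical stand-in for the feasibility function, and matching edges are not modelled.

```python
def greedy_scaffold(edges, weight, is_feasible):
    """Greedy edge selection with a pluggable feasibility test.

    edges: iterable of non-matching edges given as (u, v) tuples.
    weight: dict mapping each edge to a non-negative weight.
    is_feasible(chosen, edge): hypothetical callback telling whether a
        solution still exists after adding `edge` to `chosen`.
    Returns the chosen edges and the number of feasibility calls.
    """
    # One sort of the edge list, heaviest edges first.
    candidates = sorted(edges, key=lambda e: weight[e], reverse=True)
    chosen = []
    used = set()  # vertices already incident to a chosen edge
    calls = 0
    for u, v in candidates:
        if u in used or v in used:
            continue  # edge was removed when an incident edge was chosen
        calls += 1  # at most one feasibility call per edge
        if is_feasible(chosen, (u, v)):
            chosen.append((u, v))
            used.update((u, v))
    return chosen, calls


edges = [("a", "b"), ("b", "c"), ("c", "d")]
weight = {("a", "b"): 3, ("b", "c"): 2, ("c", "d"): 1}
always = lambda chosen, edge: True
```

Since each edge triggers at most one call to the callback, the total cost is the sorting time plus the number of edges times the cost of one feasibility test.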

Experimental results
In this section, we compare the performance of Algorithm 1 with three different feasibility functions and an integer linear programming formulation [15] implemented with ILOG CPLEX [16].

Dataset
We reuse the dataset already used in [9], which was obtained with the following pipeline:
1. Choice of a reference genome, for instance from the nucleotide database of NCBI. Table 6 presents the genomes selected for our experiments. We chose a panel of genomes of various origins and sizes.
2. Simulation of paired-end reads, using wgsim [17]. The chosen parameters are an insert size of 500bp and a read length L of 100bp.
3. Assembly using minia [18], a de novo assembly tool based on an efficient De Bruijn graph representation, with k-mer size k = 30.
4. Mapping of reads on contigs, using bwa [19]. This mapping tool was chosen according to the results obtained by Hunt [20], a survey on scaffolding tools.
5. Generation of the scaffold graph from the mapping file.
Statistics on the numbers of vertices and edges in produced scaffold graphs can be viewed in Table 7.

Feasibility functions
There is no polynomial-time computable feasibility function in the general case. Thus, to use the greedy algorithm with a specific feasibility function on a real instance, we must transform the instance. For this, we construct a supergraph by adding edges of weight zero. We compare three feasibility functions, defined on complete graphs, connected cluster graphs and block graphs, respectively. Note that the construction of a complete supergraph requires the largest number of edge additions, whereas the construction of a block supergraph requires the fewest. We already showed in [9] that the computed ratio is close to one on real instances, that is, relatively far from the theoretical ratio of 3. The aim of these experiments is to answer the following two questions:
• Can the greedy algorithms on connected cluster graphs and block graphs be used on large scaffold graphs, and what are their associated computation times?
• Do we get a better practical ratio if the number of additional edges is smaller (e.g. the completion rate, see Table 7, is smaller)? In other words, do we obtain better results on block graphs and connected cluster graphs than on complete graphs?
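The supergraph construction described above can be sketched for the simplest of the three cases, the complete supergraph. The code is a minimal illustration under our own assumptions: the graph is given as a dictionary of edge weights, matching and non-matching edges are not distinguished, and the function name is hypothetical. The constructions for connected cluster graphs and block graphs are more involved and are not shown.

```python
from itertools import combinations


def complete_supergraph(vertices, weight):
    """Complete a weighted graph by adding missing edges with weight zero.

    vertices: iterable of vertex labels.
    weight: dict mapping (u, v) tuples to the weights of existing edges.
    Returns the completed weight dict (keyed by frozenset) and the
    number of edges that had to be added.
    """
    completed = {frozenset(e): w for e, w in weight.items()}
    added = 0
    for u, v in combinations(vertices, 2):
        key = frozenset((u, v))
        if key not in completed:
            completed[key] = 0  # added edges never contribute to the score
            added += 1
    return completed, added


vertices = ["a", "b", "c", "d"]
weight = {("a", "b"): 5, ("c", "d"): 2}
completed, added = complete_supergraph(vertices, weight)
```

On this four-vertex example, four of the six possible edges are added. Because added edges have weight zero, they do not change the score of any solution, but they enlarge the set of edges among which the greedy algorithm may choose.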

Results
Experiments were run on a personal computer with four i7 processors at 1.9GHz and 16GB RAM. Memory usage was very light, even on the biggest instance, anopheles. Table 8 shows scores and computation times for every instance. We can see that the greedy computation times are less than a few seconds except for anopheles, where the connected cluster graph version and the block graph version need a few minutes. As expected, the greedy algorithms are much faster than the ILP formulation in every case. These results let us answer our first question: the connected cluster graph and block graph versions of the greedy algorithm are capable of treating big instances; however, their computation time is significantly longer than that of the complete version.

Concerning the scores, we can see that the three greedy algorithms have the same score for most of the data. The connected cluster graph and block graph versions have a slightly better score in four instances: anopheles, anthrax, sacchr3 and sacchr12. Moreover, the connected cluster graph and block graph versions have the same score in all instances except anopheles, where the block graph version improves the score of the connected cluster graph version by three (which is not really significant compared to the absolute values). These results indicate that the answer to the second question is positive. However, the differences between scores are not significant enough to be completely affirmative. We can think that using the greedy algorithm with a feasibility function defined on a sparser class of graphs may lead to better results.

Conclusion and future work
We presented in this paper the first polynomial-time algorithm approximating the scaffolding problem on non-complete graphs. Using a dynamic programming approach, we exploited the tree-like nature of connected cluster graphs to extend the feasibility function and the analysis of the approximation ratio. We also showed that this new algorithm provides slightly better results on real data than the greedy algorithm on complete graphs, although its theoretical ratio is worse. This leads us to the hypothesis that using a feasibility function defined on a graph class close to the original instance produces better results. This is surprising since, intuitively, algorithms on superclasses can choose from a larger set of edges to build solutions (any solution on the more restricted class is also a solution in the more general class). A natural extension of this work is to consider sparser graphs: for example, one could replace cliques in connected cluster graphs by co-bipartite graphs as the feasibility function is polynomial-time computable in this case [8]. One may also explore the possibility of exploiting randomized algorithms to improve the ratio [6].