Solving Two-Stage Stochastic Programming Problems Using ADMM

In this paper, we have dealt with the solution of a two-stage stochastic programming problem using ADMM. We have formulated the problem as a deterministic three-block separable optimization problem, and then we have applied ADMM to solve it. We have established the theoretical convergence of ADMM to the optimal solution based on the concepts of lower semicontinuity and the Kurdyka-Lojasiewicz property. We have compared ADMM with Progressive Hedging in terms of performance criteria using five benchmark problems from the literature. The comparison shows that ADMM outperforms Progressive Hedging.


Introduction
In recent years, researchers have shown an increased interest in the Alternating Direction Method of Multipliers (ADMM) [13] and its applications. The importance of ADMM arises from its ability to decompose an optimization problem into separable components, which can be solved in an alternating manner. Applied problems in machine learning form a rich application area for ADMM. The emergence of machine learning and big data applications necessitates distributed computation and storage due to large problem sizes. Boyd et al. [5] have applied ADMM to various distributed large-scale convex optimization problems, including the least absolute shrinkage and selection operator (LASSO) [22], the support vector machine (SVM), and sparse logistic regression. Wang et al. [21] have also used ADMM to replace the stochastic gradient descent (SGD) optimizer in deep learning; their new ADMM-based algorithm converges to the optimal solution.
The convergence properties and applications of ADMM have been reported in the literature [5,8,9]. There is a close relationship between Douglas-Rachford splitting [23] and ADMM: a careful formulation shows that the two methods are equivalent in their dual form. Arpón et al. [2] successfully applied ADMM to the two-stage stochastic program (SP). They have not provided the theoretical convergence of ADMM but referred to the work of Sun et al. [19] for the convergence proof. We have taken a slightly different approach to establish the theoretical convergence of ADMM. Since the class of SPs we have considered is non-convex and non-smooth, our convergence proof is based on the concepts of semicontinuity and the Kurdyka-Lojasiewicz property of the objective function [20]. Sun et al. [19] have introduced an intermediate step in the Gauss-Seidel cycle for updating the block coordinate descent for ADMM, which requires more computational time. They have argued that the directly extended ADMM for 3 blocks, a version of the classical ADMM developed by adding a third variable to the objective function and the constraints, is non-convergent. We have used the usual Gauss-Seidel approach, where each variable is updated once; hence no extra computation step is needed. Also, we have proved that the directly extended ADMM is convergent under some suitable assumptions. We have then applied ADMM to two-stage SPs with a finite number of scenarios. The formulation is separable and high-dimensional, but the scenario-dependent variables and the related dual residuals of ADMM can be computed in parallel.
We have studied the convergence analysis of the three-block formulation of ADMM and implemented it for solving the two-stage SP without any external solver. We have tested our algorithm on a number of problems from the literature [1,12]. Moreover, we have numerically compared our algorithm with the state-of-the-art Progressive Hedging (PH) [17,4].
The rest of the paper is organized as follows. Section 2 introduces the three-block minimization problem, and in Section 3 we give a brief introduction to the two-stage SP. We present the theoretical convergence of the 3-block ADMM in Section 4. The implementation of ADMM for the two-stage SP is presented in Section 5. The numerical experiments are given in Section 6, followed by the concluding remarks in Section 7.

Three-block Minimization Problem
The classic 2-block ADMM, its numerical implementation, and its convergence properties have been well documented in the literature [5,8,9]. Sun et al. [19] have applied a 3-block semi-proximal ADMM to a class of convex conic programming problems with several types of constraints. In this paper, we study the 3-block problem for a class of non-convex programming problems using the directly extended ADMM and provide its convergence under some suitable assumptions (Section 4).
Consider the three-block minimization problem in the form

  min_{x_1, x_2, x_3} f(x_1) + g(x_2) + h(x_3)
  subject to A_1 x_1 + A_2 x_2 + A_3 x_3 = t,    (1)

where f : R^{n_1} → R and g : R^{n_2} → R are proper lower semi-continuous functions, h : R^{n_3} → R is a linear function, t ∈ R^m, and A_1, A_2, and A_3 are matrices of appropriate dimensions [20]. The augmented Lagrangian for Problem (1) is defined as

  L_ρ(x_1, x_2, x_3, μ) = f(x_1) + g(x_2) + h(x_3) + μ^T (A_1 x_1 + A_2 x_2 + A_3 x_3 − t) + (ρ/2) ‖A_1 x_1 + A_2 x_2 + A_3 x_3 − t‖^2,    (2)

where μ ∈ R^m is the vector of Lagrange multipliers and ρ > 0 is a penalty parameter. In the solution procedure of Problem (1) we include regularization terms. The regularization terms enhance convergence by reducing the oscillation between successive iterates [2]. We have used the following iterative procedure with the regularization terms:

  x_1^{k+1} = argmin_{x_1} L_ρ(x_1, x_2^k, x_3^k, μ^k) + (γ/2) ‖x_1 − x_1^k‖^2,
  x_2^{k+1} = argmin_{x_2} L_ρ(x_1^{k+1}, x_2, x_3^k, μ^k) + (γ/2) ‖x_2 − x_2^k‖^2,
  x_3^{k+1} = argmin_{x_3} L_ρ(x_1^{k+1}, x_2^{k+1}, x_3, μ^k) + (γ/2) ‖x_3 − x_3^k‖^2,
  μ^{k+1} = μ^k + ρ (A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} − t),    (3)

where γ > 0 is a penalty parameter linked with the regularization terms. The regularization term added to each subproblem in (3) eliminates the need for a convex objective function [20]. Boyd et al. [5] state the necessary and sufficient optimality conditions for Problem (1) as follows:

  A_1 x_1^* + A_2 x_2^* + A_3 x_3^* = t,    (4)
  0 ∈ ∂f(x_1^*) + A_1^T μ^*,    (5)
  0 ∈ ∂g(x_2^*) + A_2^T μ^*,    (6)
  0 = ∇h(x_3^*) + A_3^T μ^*,    (7)

where Equation (4) is the primal feasibility condition and Equations (5)-(7) are the dual feasibility conditions.
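As a concrete illustration (ours, not the paper's implementation; the paper's experiments use MATLAB), the regularized Gauss-Seidel iteration (3) can be sketched in Python for a toy instance with scalar blocks: f(x_1) = x_1^2/2, g(x_2) = x_2^2/2, a linear h(x_3) = c·x_3 with c = 0.1, A_1 = A_2 = A_3 = 1, and t = 1, so each subproblem has a closed form obtained by setting its derivative to zero.

```python
# Toy sketch (ours) of the directly extended 3-block ADMM with the
# regularization terms of (3). Blocks are scalars; f and g are quadratic,
# h is linear, and A1 = A2 = A3 = 1, t = 1, so each argmin in (3) is solved
# in closed form. Parameter values rho = gamma = 1 are illustrative.
rho, gamma = 1.0, 1.0          # penalty and regularization parameters
c, t = 0.1, 1.0                # h(x3) = c * x3; constraint x1 + x2 + x3 = t
x1 = x2 = x3 = mu = 0.0

for k in range(5000):
    # x1-step: minimize x1^2/2 + mu*(x1+x2+x3-t) + (rho/2)*(x1+x2+x3-t)^2
    #          + (gamma/2)*(x1 - x1_old)^2; x1 still holds the old value here.
    x1 = (gamma * x1 - mu - rho * (x2 + x3 - t)) / (1.0 + rho + gamma)
    # x2-step (Gauss-Seidel: uses the freshly updated x1)
    x2 = (gamma * x2 - mu - rho * (x1 + x3 - t)) / (1.0 + rho + gamma)
    # x3-step: h is linear, so only the penalty and regularization terms
    # make this subproblem strongly convex
    x3 = (gamma * x3 - c - mu - rho * (x1 + x2 - t)) / (rho + gamma)
    # dual ascent on the multiplier, as in the last line of (3)
    mu = mu + rho * (x1 + x2 + x3 - t)

# KKT solution of the toy problem: x1 = x2 = c, x3 = t - 2c, mu = -c
print(round(x1, 4), round(x3, 4), round(mu, 4))  # 0.1 0.8 -0.1
```

The iterates contract linearly toward the KKT point, matching the convergence behavior claimed for the regularized scheme.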
Here, ∂f denotes the subdifferential set of f. The optimality condition of the x_3-subproblem in (3) reads

  0 = ∇h(x_3^{k+1}) + A_3^T μ^{k+1} + γ (x_3^{k+1} − x_3^k),

which means that, as x_3^{k+1} − x_3^k → 0, x_3^{k+1} and μ^{k+1} satisfy the duality condition in Equation (7). In the case of x_2^{k+1}, we have

  s_2^{k+1} = ρ A_2^T A_3 (x_3^{k+1} − x_3^k) + γ (x_2^k − x_2^{k+1}) ∈ ∂g(x_2^{k+1}) + A_2^T μ^{k+1},

which is the residual for the dual feasibility condition (6). By repeating the same steps, we find the residual associated with x_1^{k+1} as

  s_1^{k+1} = ρ A_1^T (A_2 (x_2^{k+1} − x_2^k) + A_3 (x_3^{k+1} − x_3^k)) + γ (x_1^k − x_1^{k+1}) ∈ ∂f(x_1^{k+1}) + A_1^T μ^{k+1}.

We refer to s_1^{k+1} and s_2^{k+1} as dual residuals. We will see in Section 4 that all the residuals converge to zero as k → ∞.

Two-stage SP
The two-stage SP is defined as [12]

  min_x c^T x + E_ξ[Q(x, ξ)]
  subject to Ax = b, x ≥ 0,    (8)

where c ∈ R^{n_1}, A ∈ R^{m_1×n_1}, and b ∈ R^{m_1} are constants and ξ is a random variable with a known distribution. The term Q(x, ξ) denotes the second-stage optimal value and is defined as [4]

  Q(x, ξ) = min_y q(ξ)^T y
  subject to T(ξ)x + W(ξ)y = h(ξ), y ≥ 0,    (9)

where q(ξ) ∈ R^{n_2}, T(ξ) ∈ R^{m_2×n_1}, W(ξ) ∈ R^{m_2×n_2}, and h(ξ) ∈ R^{m_2} encode the random data.
The value of the first-stage variable, x, must be determined before the random variable ξ is revealed; y is known as the second-stage decision variable, corresponding to the decisions made after the realization of the random variable is observed. The variable y is also known as the recourse variable, since it compensates for any bad decisions made at the first stage.
When the random variable ξ has finitely many realizations, or scenarios, S = {ξ_1, ξ_2, …, ξ_s}, with corresponding probabilities {p_1, p_2, …, p_s}, Problem (8) can be reformulated as a large linear programming (LP) problem [10] in the form

  min_{x, y_i} c^T x + Σ_{i=1}^{s} p_i q_i^T y_i
  subject to Ax = b, T_i x + W_i y_i = h_i, x ≥ 0, y_i ≥ 0, i = 1, 2, …, s,    (10)

where (q_i, T_i, W_i, h_i) are the data of the i-th scenario ξ_i. Cutting-plane methods, e.g., the L-shaped method, are used to solve (10) when the recourse matrix W does not depend on the random variable ξ. When the recourse matrix depends on the random variable, a different method is needed, e.g., Progressive Hedging. We have taken a different approach that can deal with recourse matrices of either type. Problem (10) can be equivalently restated by introducing copy variables x̂ and ŷ_i together with the consensus constraint x − x̂ = 0, where the objective is a convex function since it is defined on a convex set [15]. The two indicator functions in (11) for the pairs (x̂, ŷ_i) account for the nonnegativity conditions on the variables x and y_i in the constraints (13) and (14), respectively. The new formulation in Equations (11)-(15) is a three-block ADMM model: the first-stage decision variable x forms the first block, the second-stage decision variables y_i form the second block, and the pairs (x̂, ŷ_i), i = 1, 2, …, s, form the third block.
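The scenario-wise block structure of the extensive form (10) can be made concrete with a small assembly routine (a sketch of ours, not the paper's solver; the paper's experiments use MATLAB, and the helper name `extensive_form` and the toy data are illustrative):

```python
# Sketch (ours): assembling the extensive-form LP (10) for a toy instance.
# Variable order: (x, y_1, ..., y_s); equality rows stack
# [A 0 ... 0; T_1 W_1 0 ...; ...; T_s 0 ... W_s].
import numpy as np

def extensive_form(c, A, b, scenarios):
    """scenarios: list of (p_i, q_i, T_i, W_i, h_i) tuples."""
    n1, m1, s = c.size, A.shape[0], len(scenarios)
    n2 = scenarios[0][1].size          # second-stage variable dimension
    m2 = scenarios[0][4].size          # second-stage constraint dimension
    # Objective: c for x, probability-weighted q_i for each y_i.
    cost = np.concatenate([c] + [p * q for (p, q, _, _, _) in scenarios])
    # Block-structured equality constraints.
    A_eq = np.zeros((m1 + s * m2, n1 + s * n2))
    b_eq = np.zeros(m1 + s * m2)
    A_eq[:m1, :n1] = A
    b_eq[:m1] = b
    for i, (_, _, T, W, h) in enumerate(scenarios):
        rows = slice(m1 + i * m2, m1 + (i + 1) * m2)
        A_eq[rows, :n1] = T                                  # first-stage coupling
        A_eq[rows, n1 + i * n2 : n1 + (i + 1) * n2] = W      # scenario recourse block
        b_eq[rows] = h
    return cost, A_eq, b_eq

# Tiny instance: n1 = 2, m1 = 1, two equiprobable scenarios, n2 = 2, m2 = 1.
c = np.array([1.0, 2.0]); A = np.array([[1.0, 1.0]]); b = np.array([1.0])
scen = [(0.5, np.array([1.0, 1.0]), np.array([[1.0, 0.0]]),
         np.array([[1.0, -1.0]]), np.array([2.0])) for _ in range(2)]
cost, A_eq, b_eq = extensive_form(c, A, b, scen)
print(A_eq.shape)  # (m1 + s*m2, n1 + s*n2) -> (3, 6)
```

The block structure shows why the formulation is separable across scenarios: each y_i appears only in its own row block, which is what ADMM exploits in the parallel updates of Section 5.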
The section below states the convergence analysis of ADMM applied to two-stage SP.

Convergence Analysis
We start by introducing some definitions and lemmas that will be needed here.
Definition 1 The function f : R^n → R has the Kurdyka-Lojasiewicz (K-L) property [20] at x̄ if there exist constants η > 0 and δ > 0 and a continuous concave function ϕ : [0, η) → [0, ∞), where the function ϕ has the following characteristics:
(i) ϕ(0) = 0;
(ii) ϕ is continuously differentiable on (0, η) with ϕ′ > 0;
(iii) for all x ∈ N(x̄, δ) with f(x̄) < f(x) < f(x̄) + η, the K-L inequality ϕ′(f(x) − f(x̄)) dist(0, ∂f(x)) ≥ 1 holds.
Since all the functions we consider in the two-stage SP are real analytic (linear and indicator functions), they satisfy the K-L inequality. All the lemmas and theorems presented here apply to real analytic functions.
In our context, we define the function

  V(x_1, x_2, x_3, μ, x̄_3) = L_ρ(x_1, x_2, x_3, μ) + τ ‖x_3 − x̄_3‖^2,    (16)

where τ = 4γ^2/(ρσ), γ and σ are constants, and x̄_3 is the immediately preceding iterate of x_3. Let V be a lower semi-continuous function with non-empty domain and V(x) > −∞ for every x ∈ dom V, and let a_1 and a_2 be fixed positive constants. We suppose that the iterates of ADMM in (3) admit a Lyapunov function that satisfies the following subgradient descent conditions of [3]:

(C1) (Sufficient decrease) V(x^{k+1}) + a_1 ‖x^{k+1} − x^k‖^2 ≤ V(x^k);
(C2) (Relative error) there exists v^{k+1} ∈ ∂V(x^{k+1}) such that ‖v^{k+1}‖ ≤ a_2 ‖x^{k+1} − x^k‖;
(C3) (Continuity) for any subsequence x^{k_j} → x̄, it holds that V(x^{k_j}) → V(x̄).

In the sequel, we assume the properties (A1)-(A4) hold, where the given matrices, functions, and parameters are subject to Equations (2)-(3):
(A3) the parameters are chosen such that ρ > 8γ/σ;
(A4) all the matrices A_1, A_2, and A_3 are bounded in the Frobenius norm.
We will demonstrate that the function V meets the conditions (C1)-(C3) once the sequence has been generated with the iterative procedure in Equation (3). The proofs of Lemmas 1 and 3 are based on our choice of V in (16).
The statements of Lemmas 1 and 4 are given in [3] and [20], respectively, but we have taken a slightly different approach in proving them for the two-stage SP.

Lemma 1 Let {x^k} be a sequence that satisfies the conditions (C1)-(C3). If V is a K-L function, then the sequence {x^k} converges to x^*, where x^* is the optimal point. Moreover, the augmented Lagrangian L_ρ and the Lyapunov function V have the same optimal value.

Proof Condition (C1) states that the function V is nonincreasing along the iterates, i.e., V(x^{k+1}) ≤ V(x^k). Now, if x^k ∈ N(x^*, δ) for k ≥ 1, we claim that inequality (18) holds, where the constants a_1 and a_2 are given in the conditions (C1)-(C2). If x^{k+1} = x^k, the inequality is trivial, so let us assume that x^{k+1} ≠ x^k. Any concave differentiable function ϕ satisfies the gradient inequality ϕ(v) ≤ ϕ(u) + ϕ′(u)(v − u); using this fact combined with (C1), we obtain (20). Combining the inequalities in (19) and (20), multiplying both sides by ‖x^k − x^{k−1}‖, and taking the square root yields (21). Then, applying Young's inequality 2αβ ≤ α^2 + β^2 to Equation (21) and multiplying both sides by 2, we obtain the bound from which inequality (18) follows. Let ε, δ > 0 be such that δ ∈ (0, ε), and assume that the initial point satisfies condition (22). The intuition behind assumption (22) is that if our initial solution is close to the optimal solution of V, the generated sequence will also remain bounded; the properties of the concave function ϕ also play a role in this.
We claim that x^j ∈ N(x^*, ε) holds for j = 1, 2, …; we prove this claim by induction. From (17), when k = 0, we obtain the base case (24). Combining inequality (24) with assumption (22) and using the triangle inequality gives (25). Taking the sum of (18) from k = 1 to j yields a telescoping bound, which can be written explicitly as (26). Then, by adding and subtracting each of x^0, x^1, …, x^j together with the term x^* − x^{j+1}, using triangle inequalities, and applying (25), we obtain the induction step, where we have added the positive term ‖x^j − x^{j+1}‖ to the right-hand side of the inequality; this is valid since the expression in brackets is positive. Therefore, it follows that {x^k} is a Cauchy sequence and converges to some x̄. Then, from condition (C2) and the proven claim, any other limit value would contradict the fact that v^k → 0 as k → ∞. Therefore, the inequalities (28) and (29) allow us to conclude that V(x̄) = V(x^*). Due to the convergence of x^k to x̄, the regularization term τ ‖x_3 − x̄_3‖^2 vanishes in the limit, and hence L_ρ and V attain the same optimal value. ⊓⊔

Lemma 2 Suppose the sequence {x^k} is generated by the iterative procedure (3) under Assumptions (A1)-(A4). Then the function V in (16) satisfies the sufficient decrease condition (C1) with a_1 ∈ (0, γ/2 − τ].

Proof Taking the partial derivative with respect to x_3 at iteration k + 1 and setting it to zero yields (30). Using the expressions for ∇_{x_3} h(x^{k+1}) and ∇_{x_3} h(x^k) from (30), applying the Cauchy-Schwarz and Young inequalities, and rearranging terms, we obtain (31). Also, we have the inequalities (32a)-(32c), which follow from (3), and the equality (32d), which follows from (2) and (3). Adding up (32a)-(32d), we obtain (33). Dividing both sides of inequality (31) by ρ and adding the result to inequality (33), it follows that (34) holds. Therefore, by rearranging the terms of Equation (34) and using τ = 4γ^2/(ρσ), we obtain the sufficient decrease bound with a_1 ∈ (0, γ/2 − τ], which is positive since γ/2 > τ whenever ρ > 8γ/σ, by Assumption (A3). ⊓⊔

Lemma 3 (Lemma 3, [20]) Let u^k = (x_1^k, x_2^k, x_3^k) and w^k = (u^k, μ^k) denote the primal and primal-dual iterates of (3), respectively. Then w^k is asymptotically regular; indeed, ‖w^k − w^{k+1}‖ → 0 as k → ∞. Moreover, any accumulation point of {w^k} is a stationary point of the augmented Lagrangian function L_ρ.
Proof From Assumption (A2) and Equation (30), w^k is a combination of u^k and μ^k. Thus, being the iterates of the lower semicontinuous function V, there exists a subsequence {ŵ^{k_j}} which converges to ŵ^*; it follows that lim inf_{j→∞} V(ŵ^{k_j}) ≥ V(ŵ^*). Lemma 1 showed that the function V decreases at each iteration, and thus {V(ŵ^{k_j})} is convergent. Taking the summation of Equation (35), it follows that the partial sums are bounded, which implies that Σ_{k=1}^∞ ‖ŵ^{k+1} − ŵ^k‖^2 < ∞; consequently, ‖ŵ^{k+1} − ŵ^k‖^2 → 0 as k → ∞. This implies that the primal and dual residuals converge to zero as k → ∞. ⊓⊔ The following lemma states that dist(0, ∂V(ŵ^{k+1})) decreases with the iteration k + 1.
Lemma 4 (Lemma 4, [20]) There exists a_2 > 0 such that the iterates of (3) satisfy dist(0, ∂V(ŵ^{k+1})) ≤ a_2 (‖ŵ^{k+1} − ŵ^k‖ + ‖ŵ^k − ŵ^{k−1}‖).

Proof By taking the partial derivative of V in (16) with respect to x_1, equating it to zero, and substituting for μ, adding and subtracting the term A_1^T (μ^{k+1} − μ^k) on the right-hand side yields a bound for the x_1-component of ∂V. Replicating this step for the remaining variables yields the corresponding bounds. Adding up all the previous equations associated with each variable and applying the Frobenius norm to both sides results in the desired inequality. Since the matrices A_1, A_2, and A_3 are all bounded in the Frobenius norm, there exists a_2 > 0 such that a_2 = max{b_1, b_2, b_3}. Then, adding the appropriate norms for x_1^{k+1} and x_3^{k−1}, the inequality holds.
⊓⊔

Lemma 5 Suppose the sequence {w^{k_j}} generated by the iterative procedure (3) converges to ŵ^* and V is a lower semicontinuous K-L function. Then the whole sequence {w^k} converges to ŵ^*.

Proof The proof of convergence follows directly from Lemma 1, in which the function V satisfies the conditions (C1)-(C3). We have shown in Lemmas 2-4 that V meets all of the conditions (C1)-(C3). ⊓⊔

ADMM for Two-Stage SP
We consider the deterministic version of the two-stage SP (11)-(15) and the notation used therein. We redefine the variables x_1, x_2, and x_3 of Equation (1) as x_1 := (y_1, y_2, …, y_s)^T, x_2 := (ŷ_1, ŷ_2, …, ŷ_s, x̂)^T, and x_3 := x; the functions f, g, and h are defined correspondingly, and the vector of multipliers is partitioned accordingly. In what follows, we derive the detailed steps to find the solution of Problem (11)-(15), and then we present a step-by-step algorithm for the solution. Consider the augmented Lagrangian function for this formulation, with multipliers β ∈ R^{n_1}, α ∈ R^{m_1}, γ_i ∈ R^{n_2}, and δ_i ∈ R^{m_2}, i = 1, 2, …, s. Solving ∇_x L_ρ = 0 yields the x-update. Similarly, ∇_{y_i} L_ρ = 0, i = 1, 2, …, s, yields the y_i-updates, and differentiating L_ρ with respect to x̂ and ŷ_i yields the updates for the copy variables. The updates for the dual variables are expressed in (39). Algorithm 1 presents the steps of ADMM for solving the two-stage SP in Equations (11)-(15). An important parameter of Algorithm 1 is the penalty parameter ρ. The convergence of the ADMM algorithm is very sensitive to this choice; a poor selection may lead to slow convergence or non-convergence in practical problems [18]. A variant of ADMM, residual balancing, in which the penalty parameter ρ^k changes at each iteration k, was proposed by He et al. [11]. The intuition behind the method is to make the primal and dual residual norms have similar magnitudes; by doing so, both residuals are small at the stage of convergence. This approach makes the performance less dependent on the initial choice of ρ. Superlinear convergence with ρ^k → ∞ was achieved by Rockafellar [16]. We have implemented the iteration-dependent adaptive ρ^k as suggested by He et al. [11]. In Algorithm 1, the preconditioning step initializes the penalty parameter ρ, the maximum number of iterations k_max, and the convergence tolerance ε. The initialization in Line 1 provides an initial first-stage solution for the primary iterations k ≥ 1.
The initial penalty parameter is then adaptively updated following He et al. [11] to maintain the gap between the primal and dual residuals, see Equation (40). Algorithm 1 runs on deterministic initializations, in which the phase 1 procedure of linear programming is applied to bring the starting point into the feasible region. Since Algorithm 1 decomposes the problem by scenarios, Lines 5, 8, 12, 13, 15, and 18 are implemented in parallel. Line 5 provides a scenario-wise solution for the second-stage variable, in which all the components of the second-stage variable are solved concurrently. The second-stage solution is then assembled into one variable to find the first-stage solution, which is given in Line 9. Lines 10-13 give the updates for the dual variables using (39). The primal and dual residuals of Algorithm 1 are computed in Lines 14-19. Algorithm 1 terminates in one of two ways: the convergence case (Line 20), where all the residuals are less than the threshold ε, or the non-convergence case (Line 21), where the algorithm hits the maximum number of iterations without meeting the convergence requirements.
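The residual-balancing update can be sketched as follows (a minimal sketch of ours in Python, in the spirit of He et al. [11]; the threshold `mu` and scaling factor `nu` are illustrative choices, not the paper's values):

```python
# Sketch (ours) of a residual-balancing penalty update: grow rho when the
# primal residual dominates, shrink it when the dual residual dominates.
# mu > 1 is the imbalance threshold, nu > 1 the scaling factor.
def update_rho(rho, primal_norm, dual_norm, mu=10.0, nu=2.0):
    if primal_norm > mu * dual_norm:
        return rho * nu      # primal residual too large: penalize infeasibility more
    if dual_norm > mu * primal_norm:
        return rho / nu      # dual residual too large: relax the penalty
    return rho               # residuals balanced: keep rho unchanged

print(update_rho(1.0, 5.0, 0.1))   # 2.0 (primal dominates)
print(update_rho(1.0, 0.1, 5.0))   # 0.5 (dual dominates)
print(update_rho(1.0, 1.0, 1.0))   # 1.0 (balanced)
```

Keeping the two residual norms within a constant factor of each other is what makes the performance less dependent on the initial choice of ρ.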

Numerical Results
This section discusses the numerical experiments with Algorithm 1 on five benchmark two-stage SPs. The convergence and CPU time of the algorithm are also discussed.
The experiments were run on a machine with an Intel Core™ i7-7700 CPU @ 3.60 GHz ×8 and 15.5 GB of RAM. All the experiments were carried out in MATLAB 2020. In the initialization step, we defined the penalty parameter ρ as suggested by He et al. [11], with constants ν > 1 and µ > 1. The values ε = 10^{-3} and k_max = 5 × 10^4 were used in Lines 20 and 21 of Algorithm 1, respectively. The primal and dual residuals that appear in all the relevant figures in this section are given in Line 19 of Algorithm 1, where dual1 and dual2 denote s_1 and s_2, respectively. In all the tested problems, we found that Algorithm 1 always converges to the optimal value. Each problem is solved from an initial starting point obtained using the phase 1 procedure of linear programming; therefore, Algorithm 1 has been executed on each problem from a feasible point. The CPU times are recorded for all problems. All the graphs of the residuals are provided on a log/log scale. Table 1 summarizes the features of the tested problems. All the tested problems are linear in their first and second stages. The problems LandS and Gbd were solved by Linderoth et al. [12] using Sample Average Approximation; the authors provided lower and upper bounds for the optimal solutions. The problems Assets and Phone are taken from [1], for which the data can be found on the website stated in the paper. All the input data of the problems in Table 1 are provided in SMPS (Stochastic Mathematical Programming System) format for stochastic linear programs. In Table 1, the data in the last two columns are given as pairs (m, n), where m is the number of rows and n is the number of columns; the first-stage size refers to the size of matrix A, and the second-stage size refers to the size of matrix W, see Equation (10). Table 2 reports the statistical bounds on the optimal value for the problems LandS and Gbd found by Linderoth et al. [12].
LandS Problem

Louveaux and Smeers [14] report the optimal solution to Problem (41). The result, as shown in Figure 1, presents the convergence of the residuals in approximately 100 iterations.

Large Size LandS
This problem is a modified and extended version of the problem taken from [14].
The minimum total capacity and budget constraints are given in the first-stage conditions, while the capacity restriction for each of the four technologies and the demands are in the second stage. The demand values are given by d_j = 0.04(k − 1), k = 1, 2, …, 100, j = 1, 2, 3, each with a probability of 0.01. The demands are assumed independent, and therefore there are 100^3 = 10^6 scenarios, each with probability 10^{-6}. Figure 2 indicates that Algorithm 1 takes approximately 3000 iterations to converge. The obtained optimal value for this problem is 225.85, which agrees with the result in Table 2.

Assets Problem

In this problem, the assets are represented by nodes, while arcs represent the transactions. There are five nodes, for which costs are fixed in every stage: checking, saving, certificate of deposit, cash, and loans [1]. This problem has a two-stage stochastic setting because purchases and sales have deterministic costs, while the return on investment is random. We denote the set of nodes by N and the set of arcs by A. The objective value is calculated over the terminal arcs, denoted by T ⊂ A.
The objective function under scenario s ∈ S is f_s(v, y(s)) ∈ C^1, a convex function, where v := {v_{ij} : (i, j) ∈ A_1} ∈ R^{n_1} and y(s) := {y_{ij}(s) : (i, j) ∈ A_2} ∈ R^{n_2}, with n_1 = |A_1|, n_2 = |A_2|, and n = n_1 + n_2. Additionally, the deterministic and supply nodes are encoded in the vectors b ∈ R^{m_1} and h ∈ R^{m_2×|S|}, where m_1 = |N_1| and m_2 = |N_2|. The two-stage SP is given in the form [1]

  min Σ_{s∈S} p_s f_s(v, y(s)) subject to the network constraints (42).

For the complete mathematical description, we refer to [1].
This problem comes in two sizes: the first has 100 scenarios, and the second has 37,500 scenarios. The numerical results depicted in Figure 3 correspond to the 37,500-scenario instance; the 100-scenario problem shows similar results. Figure 3 indicates that the algorithm took fewer than 100 iterations to converge. The algorithm converges to the optimal value of −696.74, which coincides with the value reported in [1], [7].

Telecommunication Network Planning (Phone Problem)
The manager of a telecommunication network has to plan for future growth. Decisions about where and how much to expand the capacity must be made optimally. In the formulation of this problem, the "how much" is decided before the future realization.
These networks are highly interconnected, so several routes can serve the demand of point-to-point requests. Moreover, each route has one or more direct links. The objective is to minimize the number of unserved requests by adding new links while staying within the budget constraint.
The problem is expressed mathematically as a two-stage SP in which Q(x, d) expresses the number of unserved requests; the full description of the problem can be found in [1]. The number of scenarios in this case is 1000. We found in our numerical experiments that all the sizes of 50, 100, 500, 1000, and 5000 scenarios attain the same optimal value. Figure 4 shows that the algorithm converges in 100 iterations. The algorithm converges to the optimal value of 36.89; the optimal value reported by the authors is 36.9 [1], [7].

Aircraft Allocation Problem

In this problem, four types of aircraft have to be assigned to five routes to maximize revenue under stochastic demand. The first-stage variable encodes how many aircraft are assigned to each route, while the first-stage constraints bound the possible number of aircraft of each type. The second-stage variables denote the number of transported passengers on each of the five routes, and the random demands form the second-stage constraints. This model has 646,425 possible scenarios.
For the detailed mathematical model, see [6].
We have run the algorithm on instances of different sizes. We present the results obtained for the moderate size of 1000 scenarios, as the other sizes show similar results. Figure 5 indicates that the algorithm converges in fewer than 260 iterations. The optimal value for this problem is 1649.9, which agrees with the result in Table 2.

Numerical Comparisons and Discussion
This section provides a comparison of Algorithm 1 with the Progressive Hedging (PH) algorithm with respect to the CPU time needed for convergence, the number of iterations, the percentage gap, and whether each method converged on each problem. The results of this comparison are presented in Table 3 and Figure 6.
To assess the performance of the ADMM and PH algorithms in terms of computational time, we ran both algorithms in the same programming language (MATLAB) on one device. A direct comparison of the running times of the two methods would be unfair because the stopping criteria used by the algorithms have different meanings. For this reason, we used the following methodology: each algorithm stops when the error (a combination of primal and dual residuals) is less than the convergence tolerance ε = 10^{-3} or when it reaches the maximum number of iterations.
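The common stopping rule described above can be sketched as follows (ours; the function name and residual list are illustrative):

```python
# Sketch (ours) of the shared stopping rule: halt when every residual
# falls below eps = 1e-3, or when the iteration counter reaches
# k_max = 5e4. Returns (stop, converged) so the two exit cases of
# Algorithm 1 (Lines 20 and 21) can be distinguished.
def should_stop(k, residuals, eps=1e-3, k_max=50_000):
    converged = max(residuals) < eps
    return converged or k >= k_max, converged

stop, conv = should_stop(k=120, residuals=[4e-4, 9e-4, 2e-4])
print(stop, conv)  # True True
```

Using the same rule for both algorithms is what makes the CPU-time comparison meaningful.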
In Table 3, columns 2-5 present the results of the two algorithms. Columns 2 and 3 contain the CPU times needed for convergence; columns 4 and 5 present the number of iterations required. The same maximum number of iterations was used in both algorithms. The symbols "N" and "C" in columns 6 and 7 denote the non-convergence and convergence cases, respectively. The percentage gaps, columns 8 and 9, are calculated using the formula (φ − z^*)/z^* × 100, where φ is the computed optimal value and z^* the known optimal value. Table 3 shows that ADMM converged on all the problems, while PH failed to converge on two of them; ADMM gives excellent approximations to the optimal values. The percentage gaps indicate that the convergence of ADMM is superior to that of PH on all the problems. The graphs in Figure 6 are provided on a logarithmic scale because of the variation in results across problems. Figure 6a presents the time the algorithms require to stop. We observe that ADMM converges faster than PH on four problems, while the two are equivalent in CPU time on the first one. Figure 6b compares the numbers of iterations, showing that ADMM outperforms PH on two problems, while PH performs better on two others; both methods are comparable in the number of iterations on the Assets problem.
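The percentage-gap formula can be made concrete with a small helper (ours; the function name is illustrative):

```python
# Sketch (ours) of the percentage-gap criterion of Table 3:
# gap = (phi - z_star) / z_star * 100, where phi is the computed
# optimal value and z_star the known optimal value.
def percentage_gap(phi, z_star):
    return (phi - z_star) / z_star * 100.0

# Phone problem values from Section 6: computed 36.89 vs. reported 36.9.
print(round(percentage_gap(36.89, 36.9), 4))  # -0.0271
```

A gap near zero (of either sign) means the computed value essentially matches the known optimum, which is how the last two columns of Table 3 are read.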

Conclusion
We have established the theoretical convergence of the iterative ADMM to the optimal solution of the two-stage stochastic programming problem. We have then implemented the method on the three-block formulation of the problem, whose decomposable structure has been fully exploited in solving large-size problems. Numerical investigations of the algorithm using five benchmark problems have confirmed the theoretical convergence of the proposed algorithm.
The penalty parameter plays an essential role in the speed of convergence of ADMM. We have shown that the adaptive penalty generates an accurate solution and always leads to convergence.
We have compared the performance of ADMM with the Progressive Hedging algorithm in terms of the probability of success in obtaining the optimal solution, the accuracy of the solution obtained, and the CPU times. The comparison showed that ADMM outperforms PH in the probability of success and CPU times.
Our future research will involve the application of ADMM to multi-stage stochastic programming problems.