The emerging threat of artificial intelligence to competition in liberalized electricity markets: A deep Q-network approach

Background: According to the Sustainable Development Goals (SDGs), societies should have access to affordable, reliable, and sustainable energy. Deregulated electricity markets have been established to provide affordable electricity for end-users by promoting competition. Although these liberalized markets are expected to serve this purpose, they are far from perfect and are prone to threats such as collusion. Tacit collusion is a condition in which power generating companies (GenCos) disrupt competition by exploiting their market power. Methods: In this manuscript, a novel deep Q-network (DQN) model is developed, which GenCos can use to determine bidding strategies that maximize average long-term payoffs using available information. In the presence of collusive equilibria, the results are compared with a conventional Q-learning model that relies solely on past outcomes. With that, this manuscript aims to investigate the impact of emerging DQN models on the establishment of collusive equilibria in markets with repetitive interactions among players. Results and Conclusions: The outcomes show that GenCos may be able to collude unintentionally while trying to improve long-term profits. Collusive strategies can lead to exorbitant electricity bills for end-users, which is one of the influential factors in energy poverty. Thus, policymakers and market designers should be vigilant regarding the combined effect of information disclosure and autonomous pricing, as new models exploit information more effectively.

The seventh United Nations Sustainable Development Goal (SDG7) invites societies to provide affordable, reliable, and sustainable energy for everyone. While access to clean energy is the major concern in many developing countries [40], energy affordability is being emphasized in the developed world [12]. Traditionally, electricity had to be consumed instantly after generation due to the unfavorable economics of electricity storage technologies. Owing to this physical constraint, the electricity industry expanded as vertically integrated monopolies around the globe. Unfortunately, these entities suffered from poor performance and high operating costs, which forced governments to reform the electric power sector. To boost the efficiency of these regulated entities, market designers and policymakers pursue liberalization (i.e., deregulation), which aims to maximize social welfare by promoting competition among self-interested participants. Although market designers expect to witness full competition, it has been demonstrated that some electricity markets act more like oligopolies for the following reasons [11]:
• A limited number of generators as a result of high capital investment.
• Network congestion that prevents generators from dispatching power to inaccessible consumers.
• Transmission losses that hinder producers in serving remote consumers.
Oligopolistic markets may incubate collusion that harms open competition among participants. While explicit collusion in electricity markets is prohibited, tacit collusion may still exist in the absence of formal contracts [18]. To achieve a perfectly competitive market, collusion (of any kind) should be eliminated, but detecting tacit collusion is not a straightforward task for regulators [1,8,39]. Heim and Götz [19] study the rising price of reserve power in the German market.
The authors conclude that the seemingly collusive behavior is due to the repetitive auctions with the pay-as-bid pricing mechanism. To make matters worse, antitrust agencies are worried that the autonomous pricing algorithms often used by suppliers may learn to collude unintentionally [6,7]. Algorithmic pricing is common in many markets; for instance, according to Chen

On the other hand, simulation models are considered alternatives to optimization (and equilibrium) models when the underlying problems are intractable for analytical methods [16]. Typically, researchers rely on agent-based simulation models in decentralized electricity markets, since such models provide sufficient flexibility to investigate the impact of learning on GenCos' strategic behavior. At the forefront of imitating human-like intelligence in agents are model-free reinforcement learning (RL) algorithms [44]: agents learn the optimal set of actions (i.e., the optimal policy) with respect to each state solely by interacting with the environment. In spite of their success in various fields, including operations research, decision, and control theories, RL methods (e.g., Q-learning) suffer from two major drawbacks: the lack of theoretical proof to assure solution optimality, and the curse of dimensionality [5]. As the state space expands, the memory required to store transitions grows exponentially with it. To circumvent the dimensionality curse, Roth-Erev learning [13] was developed as a streamlined version of RL for settings where agents play a limited number of pure strategies. However, Roth-Erev-equipped agents are unable to learn consistent behaviors in complex games, such as the sequential bargaining game [20]. To address the dimensionality challenge, a more recent trend is to estimate the optimal action-selection policy using deep neural networks (DNN).
In this study, we aim to create a state-of-the-art DQN model that assists generic GenCos in raising and sustaining their incomes without using confidential information related to the employed technologies (e.g., the unit generation cost), while also taking network constraints into account. The outcomes are then investigated to assess

In this paper, the strategic bidding problem on a day-ahead market is considered, taking network constraints into account. A typical electric grid is made of interconnected nodes. Power generated by GenCos is consumed by demand centers, and the excess power flows to the connected nodes through transmission lines. Due to physical limitations, transmission lines are unable to dispatch electricity above a certain threshold. A power network is called "congested" when a fully loaded transmission line reaches its maximum capacity and cannot accommodate further dispatch. Network congestion is managed by penalizing electricity consumption at congested nodes using the locational marginal prices, which are obtained together with GenCos' production levels (P_i^t) from a DC optimal power flow (DCOPF) problem. The DCOPF problem formulation is given as follows. Here, D_i^t is the demand at node i and hour t, and BR is the set of available transmission lines. The dual variable corresponding to Eq. (2) sets the unit electricity price at each node (i.e., λ_i^t).
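The formulation did not survive extraction in this excerpt; a minimal DCOPF sketch consistent with the surrounding notation (bid prices b_i^t, line flows f_l^t, line capacities F_l, and the node-line incidence matrix A are assumed labels, not necessarily the paper's own) would be:

```latex
\begin{align*}
\min_{P^t,\, f^t} \quad & \sum_{i \in I} b_i^t P_i^t \\
\text{s.t.} \quad & P_i^t - D_i^t = \sum_{l \in BR} A_{il}\, f_l^t \quad (\lambda_i^t), \quad \forall i \in I \tag{2}\\
& -F_l \le f_l^t \le F_l, \quad \forall l \in BR \\
& 0 \le P_i^t \le P_i^{\max}, \quad \forall i \in I
\end{align*}
```

The dual variable λ_i^t of the nodal balance constraint (2) is the locational marginal price used below.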

After the market is cleared by the ISO, GenCos can calculate their payoffs at each specific hour as r_i^t = (λ_i^t − c_i) P_i^t, where the electricity generation cost of GenCo-i is captured by c_i. It is quite realistic to assume that GenCos conceal their payoffs from rivals, as doing so may reveal confidential information regarding their business [17].
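As a concrete check of the payoff expression, a GenCo dispatched at 100 MWh with a nodal price of $42/MWh and a unit cost of $30/MWh earns $1,200 for that hour (the numbers are illustrative, not from the paper's case study):

```python
def hourly_payoff(lmp: float, unit_cost: float, dispatch: float) -> float:
    """Payoff r_i^t = (lambda_i^t - c_i) * P_i^t for one GenCo and one hour."""
    return (lmp - unit_cost) * dispatch

# Illustrative numbers:
print(hourly_payoff(lmp=42.0, unit_cost=30.0, dispatch=100.0))  # 1200.0
```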

• Exploration parameter (ε_i^t) balances exploration against exploitation.

As GenCo-i matures, it tends to rely more on collected information than on searching for undiscovered solutions. Furthermore, at each iteration t ∈ {1, . . . , max_t}, GenCo-i updates the Q-value (Q_ij^t) corresponding to each bid price (b_ij ∈ B_i) based on the modified α_i^t and the realized payoff (r_ij), as described in Eq. (6).
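Eq. (6) itself is not reproduced in this excerpt; a minimal sketch of a recency-weighted stateless Q-learning step of this kind, with ε-greedy selection over the discrete bid set B_i (function and variable names are assumptions, not the paper's code), is:

```python
import random

def q_update(q: float, reward: float, alpha: float) -> float:
    """Recency-weighted update: Q <- (1 - alpha) * Q + alpha * r."""
    return (1.0 - alpha) * q + alpha * reward

def choose_bid(q_values: dict, epsilon: float) -> str:
    """Epsilon-greedy action selection over the discrete bid set."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

# One learning step for a hypothetical GenCo with three candidate bids:
q = {"low": 0.0, "mid": 0.0, "high": 0.0}
bid = choose_bid(q, epsilon=0.9)
q[bid] = q_update(q[bid], reward=120.0, alpha=0.1)
```

Decaying ε_i^t over iterations reproduces the maturing behavior described above: early exploration gives way to exploitation of collected Q-values.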
Deep Q-Networks approach
In this section, the details of the proposed DQN model are described, by which GenCos enhance their understanding of the environment and optimize their actions accordingly. The critical elements of the proposed model are as follows:
• Environment: The platform whereby the ISO clears the market and determines agents' rewards.

• Agents: Myopic GenCos that desire to increase their long-term rewards through learning.

• State: Vector s_i^t encapsulates the state of the system for GenCo-i at time t. In our setting, s_i^t consists of the bid prices submitted by all GenCos at time t, in addition to private information related to GenCo-i, such as c_i and P_i^t.

• Action: The response of GenCo-i to improve its reward, based on the observed state (i.e., b_i^t ∈ B_i).

• Reward: The payoff r_i^t obtained by GenCo-i, based on the assigned power and the cleared price after submitting its bid.

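The elements above can be collected in a small container type; the following is a hypothetical sketch (field and function names are illustrative, not from the paper's implementation):

```python
from dataclasses import dataclass

@dataclass
class Experience:
    state: list[float]   # s_i^t: all submitted bids plus private info (c_i, P_i^t)
    action: float        # b_i^t: the bid chosen from the discrete set B_i
    reward: float        # r_i^t: payoff after the ISO clears the market

def build_state(all_bids: list[float], unit_cost: float, dispatch: float) -> list[float]:
    """Concatenate the public bids with GenCo-i's private information."""
    return all_bids + [unit_cost, dispatch]

exp = Experience(state=build_state([35.0, 40.0, 38.0], 30.0, 100.0),
                 action=35.0, reward=500.0)
```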
The overall workflow of the proposed DQN model is depicted in Figure 1. GenCo-i chooses a batch of experiences from memory using the last-in, first-out (LIFO) scheme [1]. The LIFO scheme is used to prioritize and capture recent interactions. In Eq. (7), the discount factor (γ ∈ (0, 1)) represents GenCos' perceived significance of future rewards compared to the immediate payoff. According to Eq. (7), the expected reward is determined by the collective actions of all GenCos (b_i^t, ∀i ∈ I), and not by any GenCo alone.
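The batch-selection step can be sketched as follows (a minimal illustration; buffer capacity, batch size, and names are assumptions, not the paper's implementation):

```python
from collections import deque

class LIFOReplayMemory:
    """Replay memory that samples the most recent experiences first."""
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # oldest entries are evicted

    def push(self, experience) -> None:
        self.buffer.append(experience)

    def sample(self, batch_size: int) -> list:
        """LIFO scheme: return the last `batch_size` experiences, newest first."""
        return list(self.buffer)[-batch_size:][::-1]

memory = LIFOReplayMemory(capacity=5)
for t in range(8):
    memory.push(t)
print(memory.sample(3))  # [7, 6, 5]
```

Sampling newest-first is what lets the agent track rivals' latest bidding behavior rather than long-stale outcomes.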

237
As is evident, the Markov property does not hold, considering the action space of each GenCo at the beginning of the simulation, i.e., p(s^{t+1} . . .

To improve stability, lines 3-8 in Algorithm 1 do not allow GenCo-i to exercise its right to choose a random bid if its k previous bids have remained unchanged for some reason.
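That guard might look like the following (a sketch only; Algorithm 1 is not reproduced here, and all names are assumptions):

```python
import random

def select_bid(best_bid: float, bid_history: list, k: int,
               epsilon: float, bid_set: list) -> float:
    """Epsilon-greedy selection with a stability guard: if the last k bids
    are identical, skip random exploration and keep the best-known bid."""
    if len(bid_history) >= k and len(set(bid_history[-k:])) == 1:
        return best_bid                    # lines 3-8: exploration suppressed
    if random.random() < epsilon:
        return random.choice(bid_set)      # explore
    return best_bid                        # exploit

# With three identical trailing bids and k=3, exploration is suppressed
# even at epsilon = 1.0:
print(select_bid(40.0, [38.0, 40.0, 40.0, 40.0], k=3, epsilon=1.0,
                 bid_set=[35.0, 38.0, 40.0]))  # 40.0
```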

When GenCo-i decides to submit the best-known bid, it feeds the current state to the prediction network.

The simulation is conducted ten times over 100,000 iterations on a computer with 16 GB of memory and an Intel Core i7-10510U processor. The program dedicates a thread with its own exclusive memory space to each GenCo; hence, four logical cores out of eight are fully utilized in this case study.
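The one-thread-per-GenCo layout can be sketched as below (illustrative only; the paper's code is not shown, and the worker body is a placeholder for the bid-clear-learn cycle):

```python
import threading

def gen_co_worker(name: str, iterations: int, results: dict) -> None:
    """Each GenCo runs its learning loop in its own thread with private state."""
    payoff = 0.0
    for t in range(iterations):
        payoff += 1.0          # placeholder for one bid -> clear -> learn cycle
    results[name] = payoff

results: dict = {}
threads = [threading.Thread(target=gen_co_worker, args=(f"GenCo-{i}", 1000, results))
           for i in range(4)]  # four GenCos, as in the case study
for th in threads:
    th.start()
for th in threads:
    th.join()
print(sorted(results))  # ['GenCo-0', 'GenCo-1', 'GenCo-2', 'GenCo-3']
```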

The initial recency (α_i^0) and initial exploration rates (ε_i^0) of all GenCos are set to 0.1 and 0.9, respectively. The prediction network is trained using the Adam algorithm, as mentioned earlier. Figure 3 shows the total loss (∑_i L_i(w_i)) of the action-value networks.

As reported in Table 2, the converged tuple of bids under the proposed DQN outperforms the Nash equilibria and Q-learning with decay in terms of payoffs.

The bold rows indicate convergence to an SCE, as defined in [1].

Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Availability of data and materials
All generated or analyzed data in this study are included in this manuscript.