To address the localisation and mapping problem in WEI environments, this section presents a SLAM method for swarm robots based on point-line matching. First, the architecture and communication protocol of the swarm robot system are presented. Then, the point-line matching algorithm for visual feature matching and loop closure detection is detailed. Finally, the process of global map optimisation and pose estimation on the server side is described.

## 2.1 Server-based centralized swarm robot system

The swarm robot SLAM system uses a central server for efficient communication and computation. The system architecture is shown in Fig. 1.

The robot platform is a TurtleBot 3 [12]. Each robot is equipped with a monocular camera and a processor for local feature extraction and position estimation. The robots communicate with a central server via a wireless network to exchange map data and position information. The server maintains a global map of the environment and performs map fusion and optimisation based on the data sent by the robots.

The main advantage of the centralised architecture is that it allows the robots to offload computationally intensive tasks (e.g. global map optimisation and loop closure detection) to the server [13]. This reduces the on-board processing load on the individual robots, allowing them to focus on local mapping and navigation. In addition, the centralised server provides a unified coordinate framework for map fusion and robot pose estimation, which helps to mitigate the inconsistency and drift problems of distributed SLAM approaches [14].

To achieve the desired algorithmic functionality, we adopt a communication architecture based on the Robot Operating System (ROS) [15]. Within the ROS framework, a node manager and a parameter manager are established to facilitate efficient data flow between the robots and the centralized server. After setting up this communication framework, the mode of information transmission must be selected. Although both wired and wireless options are available, the mobile nature of the robots makes a wireless setup the more pragmatic choice.

In this work, we utilize ROS in conjunction with Wi-Fi to establish a global communication network across the swarm robot system.
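The text does not specify the wire format used over this Wi-Fi link. As a minimal sketch, assuming a hypothetical JSON payload that carries the keyframe pose fields defined later in Eq. (5) (the field names `stamp`, `t`, and `q` are illustrative, not the system's actual message schema):

```python
import json

def encode_keyframe(timestamp, t, q):
    """Serialize one keyframe pose (translation t, unit quaternion q)
    into a JSON payload for transmission to the central server."""
    msg = {
        "stamp": timestamp,
        "t": list(t),   # (tx, ty, tz)
        "q": list(q),   # (q0, q1, q2, q3)
    }
    return json.dumps(msg).encode("utf-8")

def decode_keyframe(payload):
    """Inverse of encode_keyframe, used on the server side."""
    msg = json.loads(payload.decode("utf-8"))
    return msg["stamp"], tuple(msg["t"]), tuple(msg["q"])
```

In a ROS deployment this payload would typically be replaced by a typed message (e.g. a pose message published on a topic that the server node subscribes to); the JSON round trip above only illustrates the information each robot must transmit.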

## 2.3 Generating point-line pairs

The proposed point-line matching method generates point-line pairs to overcome perceptual overlap issues in weak reference information environments. By establishing correspondences between point features and line features, the method improves the accuracy and robustness of the relative pose estimation process.

The ORB algorithm is used to identify keypoints as landmarks within the environment, serving as discrete reference locations for robot navigation. It extracts keypoints \({p}_{i}\) from an image \(I\), where \(i\) ranges from 1 to \(N\), \(N\) being the total number of keypoints. Each keypoint \({p}_{i}\) is described by its coordinates \(\left({x}_{i},{y}_{i}\right)\). Mathematically, this can be expressed as:

$$p_i = f_{\mathrm{ORB}}\left(I\right), \quad i = 1, 2, \dots, N \tag{1}$$

The LSD algorithm extracts line segments, complementing the point observations with a richer representation of the environment. Line segments \({l}_{j}\) are identified within the same visual frame, where \(j\) ranges from 1 to \(M\), \(M\) being the total number of line segments. Each line segment \({l}_{j}\) is defined by its endpoints \(\left({x}_{j1},{y}_{j1}\right)\) and \(\left({x}_{j2},{y}_{j2}\right)\). Mathematically, this can be formalized as:

$$l_j = f_{\mathrm{LSD}}\left(I\right), \quad j = 1, 2, \dots, M \tag{2}$$

Subsequently, a Distributed Bag-of-Words (DBoW) [18] method is employed to establish correspondences between these point and line features across consecutive frames, thereby creating a richer dataset. The descriptors of keypoints \({p}_{i}\) and line segments \({l}_{j}\) are denoted \(d\left({p}_{i}\right)\) and \(d\left({l}_{j}\right)\), respectively. The similarity \(S\) between descriptors from two consecutive frames \({I}_{t}\) and \({I}_{t-1}\) is computed using the Euclidean distance:

$$\begin{aligned} S\left(d\left(p_i^t\right), d\left(p_i^{t-1}\right)\right) &= {\left\| d\left(p_i^t\right) - d\left(p_i^{t-1}\right) \right\|}_2 \\ S\left(d\left(l_j^t\right), d\left(l_j^{t-1}\right)\right) &= {\left\| d\left(l_j^t\right) - d\left(l_j^{t-1}\right) \right\|}_2 \end{aligned} \tag{3}$$
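The similarity measure in Eq. (3) reduces to the L2 norm of the descriptor difference. A minimal plain-Python sketch (the real-valued descriptor vectors here are hypothetical):

```python
import math

def descriptor_distance(d_a, d_b):
    """Euclidean (L2) distance between two feature descriptors,
    as used for the similarity S in Eq. (3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d_a, d_b)))
```

Note that in practice ORB descriptors are binary and are usually compared with the Hamming distance; the Euclidean form above follows the formulation in the text.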

Finally, to enhance the overall reliability of these correspondences, the Random Sample Consensus (RANSAC) algorithm is applied. The algorithm samples subsets of \(n\) feature pairs to estimate a model \(M\) and, based on this model, classifies the matches into inliers \(I\) and outliers \(O\). Mathematically, this is given by:

$$\begin{aligned} M &= \mathrm{RANSAC}(n) \\ I, O &= \mathrm{classify}(M, S) \end{aligned} \tag{4}$$
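Equation (4) leaves the model \(M\) unspecified. As an illustrative sketch only, the loop below runs RANSAC with a simple 2D translation model between matched keypoints and splits the matches into inliers and outliers; the translation model, iteration count, and threshold are assumptions for illustration, not the paper's exact choices:

```python
import random

def ransac_translation(matches, n_iters=100, threshold=1.0, seed=0):
    """Minimal RANSAC sketch for Eq. (4): matches is a list of
    ((x, y), (x', y')) point correspondences; the candidate model M
    is a 2D translation estimated from one sampled pair."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iters):
        (x, y), (xp, yp) = rng.choice(matches)
        model = (xp - x, yp - y)  # candidate M
        inliers = [m for m in matches
                   if abs(m[1][0] - m[0][0] - model[0]) < threshold
                   and abs(m[1][1] - m[0][1] - model[1]) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    outliers = [m for m in matches if m not in best_inliers]
    return best_model, best_inliers, outliers
```

A full point-line system would instead estimate an essential matrix or relative pose from the point and line correspondences, but the sample-score-reclassify structure is the same.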

By treating points as landmarks and lines as extrapolations from consecutive point observations, the environment is represented more comprehensively. This yields feature correspondences that are foundational for the subsequent stages of relative pose estimation.

In the swarm robot system, the algorithm is applied to construct dictionaries of point and line features. A point feature is defined by its location and scale, forming a circle with '\(p\)' as the centre and '\(r\)' as the radius; a line feature is defined by the positions of its endpoints. A distinctive word pair is created when the segment of a line feature intersects the circle of a point feature. As shown in Fig. 2, the point and line 1 can be grouped into a word pair, whereas the combination of the point and line 2 does not meet the criterion.
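The word-pair criterion above is a circle-segment intersection test. A minimal sketch in 2D image coordinates (plain Python; the function name is illustrative):

```python
def segment_intersects_circle(p, r, a, b):
    """Word-pair test: does the line segment a-b come within radius r
    of the point feature centred at p? Uses the closest point on the
    segment to p."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: both endpoints coincide
        t = 0.0
    else:
        # Parameter of the closest point, clamped to the segment [0, 1]
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    cx, cy = ax + t * dx, ay + t * dy  # closest point on segment
    return (px - cx) ** 2 + (py - cy) ** 2 <= r * r
```

In Fig. 2's terms, line 1 would return true against the point's circle and line 2 would return false.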

These dictionaries serve as a rich source of information for the robots, enabling them to better understand their surroundings and coordinate their activities more effectively.

The generation of word pairs from the intersection of point feature circles and line feature segments offers a unique way to interpret and extract information from sparse data. For instance, in a formation where a robot corresponds to a 'point' and its movement trajectory represents a 'line', the intersection could symbolize an event or interaction, such as obstacle avoidance or task collaboration. This study is based on a collaborative visual SLAM algorithm for swarm robots in which each robot computes its pose with its initial positioning point as the coordinate origin. Since each robot therefore operates in a different coordinate system, poses cannot be fused directly; instead, the pose transformation matrix for each robot's motion must be calculated from the pose information of the keyframes.

In this system, the keyframe pose information of a robot is as follows:

$$I = \left(timestamp,\; t_x, t_y, t_z,\; q_0, q_1, q_2, q_3\right) \tag{5}$$

where \(timestamp\) is the timestamp of the keyframe pose; \({t}_{x},{t}_{y},{t}_{z}\) are the coordinates describing the position of the keyframe in 3D space; and \({q}_{0},{q}_{1},{q}_{2},{q}_{3}\) is the unit quaternion describing the rotation of the keyframe in 3D space.

The rotation matrix \(\varvec{R}\) of the keyframe can be calculated from the quaternion, with the following transformation relation:

$$\varvec{R}=\left[\begin{array}{ccc}1-2{q}_{2}^{2}-2{q}_{3}^{2}& 2{q}_{1}{q}_{2}+2{q}_{0}{q}_{3}& 2{q}_{1}{q}_{3}-2{q}_{0}{q}_{2}\\ 2{q}_{1}{q}_{2}-2{q}_{0}{q}_{3}& 1-2{q}_{1}^{2}-2{q}_{3}^{2}& 2{q}_{2}{q}_{3}+2{q}_{0}{q}_{1}\\ 2{q}_{1}{q}_{3}+2{q}_{0}{q}_{2}& 2{q}_{2}{q}_{3}-2{q}_{0}{q}_{1}& 1-2{q}_{1}^{2}-2{q}_{2}^{2}\end{array}\right] \tag{6}$$

The keyframe translation vector is:

$$\varvec{t}={\left[{t}_{x},{t}_{y},{t}_{z}\right]}^{\text{T}} \tag{7}$$

As a result, the pose transformation matrix of keyframe \(i\) is:

$${\varvec{T}}_{i}=\left[\begin{array}{cc}\varvec{R}& \varvec{t}\\ {0}_{1\times 3}& 1\end{array}\right] \tag{8}$$
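Equations (6)-(8) can be composed directly. A minimal sketch that builds \({\varvec{T}}_{i}\) from a keyframe's translation and unit quaternion, using exactly the rotation matrix of Eq. (6) (plain Python, no external libraries):

```python
def pose_matrix(t, q):
    """Build the 4x4 keyframe transformation T_i of Eq. (8) from the
    translation (tx, ty, tz) of Eq. (7) and the unit quaternion
    (q0, q1, q2, q3), via the rotation matrix of Eq. (6)."""
    tx, ty, tz = t
    q0, q1, q2, q3 = q
    # Rotation matrix R, term by term as in Eq. (6)
    R = [
        [1 - 2*q2*q2 - 2*q3*q3, 2*q1*q2 + 2*q0*q3,     2*q1*q3 - 2*q0*q2],
        [2*q1*q2 - 2*q0*q3,     1 - 2*q1*q1 - 2*q3*q3, 2*q2*q3 + 2*q0*q1],
        [2*q1*q3 + 2*q0*q2,     2*q2*q3 - 2*q0*q1,     1 - 2*q1*q1 - 2*q2*q2],
    ]
    # Homogeneous 4x4 transform: [R t; 0 1]
    return [R[0] + [tx], R[1] + [ty], R[2] + [tz], [0.0, 0.0, 0.0, 1.0]]
```

For the identity quaternion \((1, 0, 0, 0)\), the rotation block reduces to the identity and \({\varvec{T}}_{i}\) is a pure translation, which is a quick sanity check for the formula.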

Robot \({R}_{A}\) has pose \({\varvec{T}}_{A,i}\) when passing through scene \(M\) at moment \({t}_{i}\), and robot \({R}_{B}\) has pose \({\varvec{T}}_{B,j}\) when passing through the same scene at another moment \({t}_{j}\); this revisit is recognized by the system. Robot \({R}_{A}\) then arrives at scene \(N\) with pose \({\varvec{T}}_{A,j}\).

The cumulative pose transformation of robot \({R}_{A}\) from moment \({t}_{i}\) to moment \({t}_{j}\) is:

$${\varvec{T}}_{A,i\to j}={\varvec{T}}_{A,i+1}{\varvec{T}}_{A,i+2}\cdots {\varvec{T}}_{A,j-1}{\varvec{T}}_{A,j} \tag{9}$$

Thus, the transformation of robot \({R}_{A}\) relative to robot \({R}_{B}\) in local coordinates at moment \({t}_{i}\) is:

$${\varvec{T}}_{A,i}={\left({\varvec{T}}_{A,i\to j}\right)}^{-1}{\varvec{T}}_{A,j} \tag{10}$$
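Equations (9) and (10) are ordinary operations on 4x4 homogeneous transforms. A minimal sketch of the chaining in Eq. (9) and the rigid-transform inverse needed in Eq. (10) (plain Python; function names are illustrative):

```python
def mat_mul(A, B):
    """Product of two 4x4 matrices."""
    return [[sum(A[r][k] * B[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def se3_inverse(T):
    """Inverse of a rigid-body transform [R t; 0 1]: [R^T  -R^T t; 0 1]."""
    R = [row[:3] for row in T[:3]]
    t = [row[3] for row in T[:3]]
    Rt = [[R[c][r] for c in range(3)] for r in range(3)]       # R^T
    mt = [-sum(Rt[r][c] * t[c] for c in range(3)) for r in range(3)]
    return [Rt[0] + [mt[0]], Rt[1] + [mt[1]], Rt[2] + [mt[2]],
            [0.0, 0.0, 0.0, 1.0]]

def accumulate(step_transforms):
    """Eq. (9): chain the per-keyframe transforms T_{A,i+1} ... T_{A,j}."""
    T = [[float(r == c) for c in range(4)] for r in range(4)]  # identity
    for step in step_transforms:
        T = mat_mul(T, step)
    return T
```

With these helpers, Eq. (10) is `mat_mul(se3_inverse(T_A_i_to_j), T_A_j)`. Exploiting the closed-form SE(3) inverse avoids a general matrix inversion and keeps the bottom row exact.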