Complex network architectures used for pedestrian detection and tracking suffer from large parameter counts, lengthy training times, and slow running speeds, problems that are especially pronounced when current algorithms are deployed on mobile devices. To address these problems, we propose an improved pedestrian detection and tracking model based on YOLOv5s detection and DeepSort tracking, and we use a video stitching (mosaicking) algorithm based on the scale-invariant feature transform (SIFT) to expand the range of pedestrian detection and tracking. We incorporated MnasNet_P, dilated (atrous) convolution, and SPP layers in our model. The Mish activation function replaced the LeakyReLU activation function to improve the generalizability of the model. A depthwise separable convolution replaced the standard convolution on the residual branch of the C3_1 structure, reducing the number of network parameters. ShuffleNetV2_x1.5 was introduced as the pedestrian appearance feature extraction network, further reducing the number of network parameters while maintaining high tracking accuracy. On public datasets, our model improved average accuracy by 2.71% and frame rate (FPS) by 30.7%. Experiments on the MOT16 dataset showed a substantially smaller model size with high tracking accuracy maintained, indicating that our algorithm is suitable for pedestrian tracking on mobile terminals or embedded devices. The real-time SIFT-based video stitching method expanded the range over which pedestrians can be tracked.
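Two of the components named above have simple closed forms that may help make the design concrete. The following is a minimal sketch, not the implementation used in this study: it evaluates the Mish activation, x · tanh(softplus(x)), next to LeakyReLU, and it compares the weight count of a standard k × k convolution with that of a depthwise separable one (a k × k depthwise convolution followed by a 1 × 1 pointwise convolution). The function names, the LeakyReLU slope of 0.1, and the 128-channel example are illustrative assumptions; bias terms are ignored.

```python
import math


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    # softplus(x) = ln(1 + e^x), written in a numerically stable form.
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(softplus)


def leaky_relu(x: float, negative_slope: float = 0.1) -> float:
    """LeakyReLU; the slope value here is an assumption for illustration."""
    return x if x >= 0 else negative_slope * x


def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depthwise conv plus a 1 x 1 pointwise conv (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer size, chosen only to illustrate the reduction.
    c_in, c_out, k = 128, 128, 3
    std = standard_conv_params(c_in, c_out, k)
    sep = depthwise_separable_params(c_in, c_out, k)
    print(f"standard: {std}, depthwise separable: {sep}, ratio: {sep / std:.3f}")
    for x in (-2.0, -0.5, 0.0, 1.0):
        print(f"x={x:+.1f}  mish={mish(x):+.4f}  leaky_relu={leaky_relu(x):+.4f}")
```

For this hypothetical 128-to-128-channel 3 × 3 layer, the standard convolution has 147,456 weights and the depthwise separable version has 17,536, a ratio of 1/c_out + 1/k², or about 0.12. Unlike LeakyReLU, Mish is smooth everywhere and non-monotonic for negative inputs.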