## Experimental set-up

The optical setup for image transmission through the MMFs is described in Supplementary Fig. 1. The continuous-wave laser beam at wavelength 561nm is expanded by a pair of lenses (\({f}_{1}\) = 10 mm, \({f}_{2}\) = 100mm) and projected onto the SLM (V-7001) for binary amplitude modulation. The spatial modulated optical field is coupled by an objective lens (Nikon, 20X, 0.25NA) into the input facet of the MMF. At the output of the MMF, the speckle field at the distal end is magnified by another objective (Nikon, 20X, 0.25NA) and imaged by the CMOS (MER2-230-167U3M).

## Stability analysis of MMF channel

We adopted three metrics to evaluate the transmission channel stability over a large variety of MMFs, as summarized in Supplementary Fig. 2. The first metric measures the SSIM of a sequence of output speckle patterns to the reference pattern captured at \(T=0\) while transmission a fixed image. The SSIM calculation can be expressed as

$$\begin{array}{c}SSIM=\frac{\left(2*{\mu }_{x}*{\mu }_{y}+{C}_{1}\right)*\left(2*{\sigma }_{xy}+{C}_{2}\right)}{\left({\mu }_{x}^{2}+{\mu }_{y}^{2}+{C}_{1}\right)*\left({\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{C}_{2}\right)}. \#\left(1.\right)\end{array}$$

Here, \(x\) and \(y\) are the two images being compared, \({\mu }_{x}\), \({\mu }_{y}\) and \({\sigma }_{x}^{2}\), \({\sigma }_{y}^{2}\) are the mean values and variations of images \(x\) and \(y\), and \({\sigma }_{xy}\) is the covariance between two images. The \({C}_{1}\) and \({C}_{2}\) are two constants used to prevent the denominator from becoming zero or the result becoming infinite.

The second metric measures the Helinger distance of the speckle energy distribution of two batches. The speckle energy distribution of each batch is obtained by summing up all speckle images in that batch, and one batch consists of 500 output speckle images captured within 5 seconds. For two discrete energy distributions \(P=({p}_{1},{p}_{2}\dots ,{p}_{n})\) and \(Q=({q}_{1},{q}_{2}\dots ,{q}_{n})\), \(n\) is the total number of pixels in a single speckle image, the Hellinger distance can be calculated as

$$\begin{array}{c}{H}^{2}\left(P,Q\right)=\frac{1}{2}\sum _{i=0}^{n}{\left(\sqrt{{q}_{i}}-\sqrt{{p}_{i}}\right)}^{2}.\#\left(2.\right)\end{array}$$

The third metric computes the pixel-wise accuracy of recovered image at each time point, by using an approximate reconstruction algorithm termed real-value inverse transmission matrix (RVITM)[3]. The test image is transmitted right after the approximal transmission matrix is calibrated. The accuracy for each recovered image is the percentage of mis-distinguished pixels to the full number of pixels.

## Data preparation

Our research employed arbitrary random binary images, as well as natural scene images including the Fashion-MNIST dataset, Latin letters from the E-MNIST dataset, and handwritten digitals from the MNIST dataset. The input images tested in the experiments are of 16×16-pixel and 32×32-pixel resolution. According to the maximum number of spatial modes that is allowed to transmit through the MMF, it is theoretically possible to input images of larger pixel resolution. We used a binary SLM as the amplitude modulator to encode the spatial information, and the output speckles are captured by the monochromatic CMOS. To reduce computational complexity, we down-sampled the speckle images to 100×100-pixel and 150×150-pixel for reconstructing input patterns of 16×16-pixel and 32×32-pixel, respectively.

## Neural network architecture

We designed an ensemble network consisting of three sub-networks with the same structure but updated differently over time. Each subnetwork is a convolutional neural network consisting of three convolutional layers and one fully-connected layer. Each convolution layer is followed by Dropout, Batch normalization, and ReLu activation. A sigmoid activation function is applied to the output layer. We employed cross-entropy as the loss function, and trained the networks with the AdaDelta optimizer at a learning rate of 0.1. The network training was executed on a workstation equipped with NVidia RTX3090 GPU.

## Confidence-based ensemble algorithm

Each subnetwork of the ensemble network calculates the average confidence level of one predicted instance as:

$$\begin{array}{c}c=\frac{1}{N}\sum _{i=1}^{N}\left(\left|{P}_{i} -0.5\right|+0.5\right) ,\#\left(3.\right)\end{array}$$

where \({P}_{i}\) represents the predicted value of pixel \(i\) from the subnetwork, which is binarized to be either 0 or 1, and \(N\) is the spatial resolution of the input patterns. The confidence-based weight of each subnetwork is defined as:

$$\begin{array}{c}{w}_{k}=\text{exp}\left(\frac{1-{c}_{k}}{\left(1-{c}_{1}\right)+\left(1-{c}_{2}\right)+\left(1-{c}_{3}\right)}\right).\#\left(4.\right)\end{array}$$

The final ensemble prediction is the weighted summation of predictions from three subnetworks:

$$\begin{array}{c}P=\frac{{w}_{1}*{p}_{1}+{w}_{2}*{p}_{2}+{w}_{3}*{p}_{3}}{{w}_{1}+{w}_{2}+{w}_{3}}. \#\left(5.\right)\end{array}$$