To help sixth-generation wireless systems manage a wide variety of services, ranging from mission-critical services to safety-critical tasks, key physical-layer technologies such as reconfigurable intelligent surfaces (RISs) have been proposed. Although RISs are already used in various scenarios to enable smart radio environments, they still face challenges with regard to real-time operation. Specifically, RISs typically require either costly, high-dimensional channel estimation with offline exhaustive search, which demands prohibitive hardware complexity, or online exhaustive beam training, which incurs high training overhead. Although still in its infancy, the application of deep learning (DL) tools shows promise in enabling feasible solutions. In this paper, we propose two energy-efficient adversarial bandit-based schemes with low training overhead that achieve substantial performance gains over DL-based reflection beamforming reference methods. The resulting deep learning models are discussed using state-of-the-art model quality prediction trends.