Automatic recognition of actions in videos captured in uncontrolled environments is one of the most challenging tasks in computer vision. The most successful recent methods are based on Convolutional Neural Networks (CNNs) and feed a variant of optical flow alongside the video frames to exploit the information carried by the variation between images. We propose a method that gives a neural network the means to learn to extract features from local frame-to-frame variations, in addition to the global variation across a set of frames. The proposed method does not require any variant of pre-computed, hand-crafted optical flow. We also propose a mechanism that lifts two limitations of CNNs when dealing with videos rather than single images: their inability to capture differences between video frames, and the fact that they are optimized to find patterns regardless of spatial position, i.e., they are insensitive to the order in which features occur. The proposed architecture makes it possible to reuse pre-trained image classification CNNs as feature extractors, taking advantage of them with minimal fine-tuning. The computational cost depends mostly on the choice of feature extractors; by adopting the most efficient state-of-the-art image classification CNNs, the proposed approach achieves high computational efficiency while outperforming all state-of-the-art real-time methods in terms of accuracy.
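To make the overall idea concrete, the following minimal PyTorch sketch reuses a pre-trained image classification CNN as a frozen-or-finetunable feature extractor, derives a motion descriptor from frame-to-frame differences, and combines it with a clip-level appearance descriptor. This is an illustration under stated assumptions, not the authors' exact architecture: the class name FrameDiffSketch, the ResNet-18 backbone, the feature-level placement of the difference operation, and the simple averaging aggregation are all hypothetical choices made for brevity.

import torch
import torch.nn as nn
import torchvision.models as models

class FrameDiffSketch(nn.Module):
    # Hypothetical sketch: per-frame features from a pre-trained image CNN,
    # plus a learned classifier over appearance and frame-to-frame variation.
    def __init__(self, num_classes=101):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Reuse every layer up to (but not including) the final classifier.
        self.extractor = nn.Sequential(*list(backbone.children())[:-1])
        feat_dim = backbone.fc.in_features  # 512 for ResNet-18
        # Classify the concatenated (appearance, motion) descriptors.
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, clip):
        # clip: (batch, time, channels, height, width)
        b, t, c, h, w = clip.shape
        feats = self.extractor(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        appearance = feats.mean(dim=1)  # average appearance over the clip
        # Local frame-to-frame variation, aggregated over time; no pre-computed
        # optical flow is involved at any point.
        motion = (feats[:, 1:] - feats[:, :-1]).abs().mean(dim=1)
        return self.classifier(torch.cat([appearance, motion], dim=1))

# Usage: logits = FrameDiffSketch()(torch.randn(2, 8, 3, 224, 224))  # shape (2, 101)

Because the backbone is an off-the-shelf image classification network, the per-clip cost is dominated by the feature extractor, which is consistent with the claim that efficiency follows from the choice of backbone.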