Understanding and predicting viewers’ emotional responses to videos has emerged as a pivotal challenge due to its multifaceted applications in video indexing, summarization, personalized content recommendation, and effective advertisement design. A major roadblock in this domain has been the lack of large-scale datasets of videos paired with viewer-reported emotional annotations. We bridge this gap by introducing a deep learning methodology trained on a unique proprietary dataset of over 30,000 real video advertisements, each annotated by an average of 75 viewers. This amounts to over 2.3 million emotional annotations across eight distinct categories: anger, contempt, disgust, fear, happiness, sadness, surprise, and neutral, together with the temporal onset of each emotion. By operating on 5-second video clips, our approach aims to capture pronounced emotional responses. Our convolutional neural network uses both video and audio to predict salient 5-second emotional clips, achieving an average balanced accuracy of 43.6%, with especially high performance for detecting happiness (55.8%) and sadness (60.2%). When applied to full ads, our model attains a strong average AUC of 75% in identifying emotional undertones. To spur progress, we publicly release our trained network. This work helps overcome previous data limitations and provides an accurate deep learning solution for video emotion understanding.
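To make the described setup concrete, the following is a minimal sketch (not the authors' released model) of an audio-visual network that scores a 5-second clip against the eight emotion categories; the layer sizes, input shapes, and PyTorch framework are illustrative assumptions, shown only to indicate how video frames and audio could be fused for this task.

```python
# Illustrative sketch only: a minimal two-stream network that fuses video frames
# and an audio spectrogram from a 5-second clip into eight emotion logits.
# All architectural details here are assumptions, not the paper's released model.
import torch
import torch.nn as nn

EMOTIONS = ["anger", "contempt", "disgust", "fear",
            "happiness", "sadness", "surprise", "neutral"]

class TwoStreamEmotionNet(nn.Module):
    def __init__(self, num_classes: int = len(EMOTIONS)):
        super().__init__()
        # Video stream: 3D convolutions over (channels, frames, height, width).
        self.video = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),          # -> (B, 32, 1, 1, 1)
        )
        # Audio stream: 2D convolutions over a log-mel spectrogram.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # -> (B, 32, 1, 1)
        )
        # Late fusion of the two 32-dim clip embeddings into per-emotion logits.
        self.head = nn.Linear(64, num_classes)

    def forward(self, frames: torch.Tensor, spectrogram: torch.Tensor) -> torch.Tensor:
        v = self.video(frames).flatten(1)          # (B, 32)
        a = self.audio(spectrogram).flatten(1)     # (B, 32)
        return self.head(torch.cat([v, a], dim=1))

if __name__ == "__main__":
    model = TwoStreamEmotionNet()
    frames = torch.randn(2, 3, 16, 112, 112)   # e.g. 16 frames sampled from a 5-second clip
    spec = torch.randn(2, 1, 64, 500)          # e.g. 64 mel bands x 500 time steps
    print(model(frames, spec).shape)           # torch.Size([2, 8])
```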