CompoNet: Toward Incorporating Human Perspective in Automatic Music Generation Using Deep Learning

The artistic nature of music makes it difficult, if not impossible, to extract solid rules from composed pieces and express them mathematically. This has led to a lack of utilization of music expert knowledge in the AI literature on the automation of music composition. In this study, we employ intervals, which are the building blocks of music, to represent musical data in a way closer to human composers' perspectives. Based on intervals, we developed and trained OrchNet, which translates musical data to and from numerical vector representation. Another model, called CompoNet, was developed and trained to generate music. Using intervals and a novel monitor-and-inject mechanism, we address two main limitations of the literature: the lack of orchestration and the lack of long-term memory. The music generated by CompoNet was evaluated by a Turing test: whether human judges can tell the difference between pieces composed by humans and pieces generated by our system. The Turing test results were compared using the Mann-Whitney U test, and there was no statistically significant difference between human-composed music and what our system generated.

Work | Model | Dataset | Architecture
— | folk-rnn^24 | — | LSTM
Jaques et al.^9 | Note RNN | Not given | RNN
Yu et al.^17 | SeqGAN | Nottingham Dataset^25 | Convolution
Yang et al.^15 | MidiNet | TheoryTab^26 | Convolution
Mogren^16 | C-RNN-GAN | Not given | RNN
Dong et al.^19,20 | MuseGAN | Lakh MIDI Dataset^22 | Convolution

Lack of Orchestration

Numbers in music theory should be interpreted differently than numbers in algebra, which creates extra challenges for musical data representation. In other words, since numerical patterns in music follow different rules than patterns in other domains, existing pattern recognition techniques fail to correctly and efficiently extract the patterns in a given piece of music.

Since common pattern-recognition practices must expend extra effort to compensate for these music-specific differences, the number of instruments and the size of ensembles must be reduced.

Existing methodologies can be categorized into generative and memory-based models. While memory-based models, unlike generative models, do have memory capabilities, they inherit the LSTM's most notorious limitation: the lack of long-term memory. It has been observed that in LSTM-based sequence generation, an object, name, or idea introduced at the beginning vanishes as the sequence grows longer.

In this paper, we aim to address the aforementioned challenges and limitations by employing an embedding mechanism based on music theory^27. During data preprocessing, a MIDI file was excluded if, among other criteria, it contained fewer than 10 bars.

The Music Embedding package^27 was used for data embedding. This package accepts inputs in the pianoroll format; therefore, the Pypianoroll package is used to convert MIDI files into pianorolls, which are then fed to the embedder of the Music Embedding package. The embedder converts each pianoroll into a sequence of intervals. In music theory, an interval is defined as the difference in pitch between two notes.

The proposed OrchNet has music theory built into it and translates music data to and from vector representation. OrchNet, which is a stacked autoencoder, receives data embedded with the Music Embedding package. Order, type, and direction are categorical, while octave and RLE are numerical; therefore, it is reasonable to use one-hot encoding for order, type, and direction and integer encoding for octave and RLE. However, since OrchNet is an autoencoder, its input and output must be in the same format. The resulting 90-bit encoding represents a single interval; yet, music is a sequence of intervals.

Since the data has temporal dependencies, OrchNet is built from LSTM blocks. As explained earlier, the input is a 64-by-90 tensor in which each sequence is padded with zeros to make its length consistent with the other sequences.

This model works based on a sliding-window mechanism, which is illustrated in Fig. 3.

The input to the discriminator comes either from the generator or from the OrchNet Encoder encoding 16 consecutive bars.
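Before detailing the discriminator, the following minimal sketch shows what an OrchNet-style stacked LSTM autoencoder could look like in Keras, assuming 64-step sequences of 90-bit interval encodings with zero padding as described above; the layer widths, latent size, and loss are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a stacked LSTM autoencoder for 64x90 interval tensors.
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN, FEATURES, LATENT = 64, 90, 32  # LATENT size is an assumption

inputs = keras.Input(shape=(SEQ_LEN, FEATURES))
# Zero-padded timesteps are masked so they do not influence the encoding.
x = layers.Masking(mask_value=0.0)(inputs)
x = layers.LSTM(128, return_sequences=True)(x)   # assumed width
encoded = layers.LSTM(LATENT)(x)                 # fixed-size sequence encoding

# Decoder: repeat the encoding across time and reconstruct the sequence.
x = layers.RepeatVector(SEQ_LEN)(encoded)
x = layers.LSTM(128, return_sequences=True)(x)
outputs = layers.TimeDistributed(layers.Dense(FEATURES, activation="sigmoid"))(x)

autoencoder = keras.Model(inputs, outputs, name="orchnet_sketch")
encoder = keras.Model(inputs, encoded, name="orchnet_encoder")
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Train the model to reproduce its input; the encoder half then serves as
# the sequence-to-vector translator described in the text:
# autoencoder.fit(X, X, epochs=..., batch_size=...)
```

Masking the zero-padded timesteps keeps the padding from influencing the learned encoding, and the standalone encoder half is what supplies fixed-size representations downstream.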

The discriminator performs a binary classification task to determine the source of a given input. The training routine starts with training the discriminator. At this stage, since the generator produces random output, the discriminator easily learns to differentiate between inputs from the OrchNet Encoder and inputs from the generator. The value of the loss function is then back-propagated through the discriminator to calculate the loss of the generator. This trains the generator to produce more realistic outputs.

The cycle repeats: the discriminator becomes better at identifying the source, and the generator becomes better at producing outputs that are hard for the discriminator to distinguish. Eventually, the generator is trained well enough to generate the required 16 samples, similar to the existing samples in the dataset, from a given random vector.
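As an illustration of this alternating routine, the sketch below follows the classic Keras GAN recipe. The generator and discriminator bodies, the dimensions (a 100-dimensional noise vector, 16 bars, 32-dimensional encodings), and the labels are assumptions for demonstration; they stand in for CompoNet's actual generator and for real samples encoded by the OrchNet Encoder.

```python
# Minimal sketch of the alternating GAN training loop described above.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NOISE, BARS, ENC = 100, 16, 32  # assumed dimensions

generator = keras.Sequential([
    layers.Input(shape=(NOISE,)),
    layers.Dense(BARS * 64, activation="relu"),
    layers.Reshape((BARS, 64)),
    layers.LSTM(64, return_sequences=True),
    layers.TimeDistributed(layers.Dense(ENC, activation="tanh")),
])

discriminator = keras.Sequential([
    layers.Input(shape=(BARS, ENC)),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),  # binary: real vs. generated
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model used to back-propagate the discriminator's loss into the
# generator; the discriminator is frozen here so only the generator updates.
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_batch):
    """One alternating update; real_batch: (B, BARS, ENC) real encodings."""
    b = len(real_batch)
    noise = np.random.normal(size=(b, NOISE))
    fake_batch = generator.predict(noise, verbose=0)
    # 1) The discriminator learns to tell real encodings from generated ones
    #    (it was compiled while trainable, so these calls update it).
    discriminator.train_on_batch(real_batch, np.ones((b, 1)))
    discriminator.train_on_batch(fake_batch, np.zeros((b, 1)))
    # 2) The generator is pushed toward outputs the discriminator calls real.
    gan.train_on_batch(noise, np.ones((b, 1)))
```

Freezing the discriminator inside the combined model is what lets the discriminator's loss be back-propagated to update only the generator.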

The Monitor-and-Inject Mechanism

The proposed architecture of CompoNet has two major issues:

• being based on LSTM, it suffers from a lack of long-term memory;
• it follows a fixed sequential approach, which makes it unsuitable for generating different styles and forms.

Python is the main language used to implement this work. The Music Embedding, Pypianoroll, and pretty_midi packages were used for data-handling tasks, while Keras and TensorFlow were used to define and train the deep learning models; a sketch of the data-handling step is given below.
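The snippet below sketches that data-handling step. It uses Pypianoroll to read a MIDI file and derives a naive melodic-interval sequence from the pianoroll; it is a simplified stand-in for the Music Embedding package, and the highest-pitch melody heuristic is an assumption for illustration only.

```python
# Sketch: MIDI -> pianoroll -> melodic intervals (semitone differences).
import numpy as np
import pypianoroll

def melodic_intervals(midi_path: str) -> np.ndarray:
    """Return successive melodic intervals (in semitones) for one track."""
    multitrack = pypianoroll.read(midi_path)   # MIDI -> Multitrack object
    roll = multitrack.tracks[0].pianoroll      # (time, 128) velocity matrix
    mask = roll > 0                            # which pitches sound per step
    has_note = mask.any(axis=1)
    # Heuristic: take the highest sounding pitch per step as the melody.
    highest = 127 - mask[:, ::-1].argmax(axis=1)
    melody = highest[has_note]
    if melody.size == 0:
        return np.array([], dtype=int)
    # Collapse held notes so each note contributes a single pitch.
    melody = melody[np.insert(np.diff(melody) != 0, 0, True)]
    # An interval is the pitch difference between two consecutive notes.
    return np.diff(melody)

# Example: intervals = melodic_intervals("piece.mid")  # e.g. [2, 2, -4, ...]
```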

A considerable number of models had to be trained and evaluated for this work; therefore, it was crucial to exploit hardware acceleration.

To perform the Turing test, 10 pieces were generated by CompoNet to be compared against 10 randomly selected pieces from the dataset. To keep the test fair and reasonable, the following steps were taken:

• To ensure respondents judge the pieces rather than answer from memory, famous pieces were removed from consideration.

• To ensure the order of the pieces does not impact the results, they were presented in a random sequence.

• To ensure audio quality does not impact the results, all 20 pieces were rendered to audio with the same engine.

• To increase the likelihood of respondents listening to each piece in full before making a decision, all pieces were kept under one minute.

For each piece, respondents were asked "How do you evaluate this piece of music?" and were asked to choose one of the following options:

• A. I am confident it is composed by a human.

• B. I think it might be composed by a human.

• C. I think it might be composed by AI software.

• D. I am confident it is composed by AI software.

Before beginning the test, respondents were asked to voluntarily self-identify their age and gender groups. Fig. 7 and Fig. 8 illustrate the age and gender distributions of the respondents, respectively. Additionally, Questionpro automatically provides the geographical distribution of respondents, which is given in Table 4 and Fig. 9. Based on this information, the respondents of this test cover a diverse range of ages, genders, and geographical locations, which reduces potential biases in the results. For scoring, each response was mapped to a numerical value, from 1 for "I am confident it is composed by a human" down to -0.5 for "I think it might be composed by AI software" and -1 for "I am confident it is composed by AI software".

A scoring mechanism was used to compare the pieces of music in the test. The score of each piece was calculated as

$s = \frac{1}{n} \sum_{i=1}^{n} v_i$,

where $s$ is the score of the piece, $i$ is the index of a respondent, $n$ is the total number of respondents, and $v_i$ is the numerical value of the $i$-th respondent's response for the piece.
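The short sketch below illustrates this scoring and the Mann-Whitney U comparison using SciPy. The response data is randomly generated purely for illustration (not the study's actual data), and the value set {1, 0.5, -0.5, -1} mirrors the response scale described above.

```python
# Sketch: per-piece scores and a Mann-Whitney U test between the two groups.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
values = np.array([1.0, 0.5, -0.5, -1.0])  # possible response values v_i

# Illustrative (randomly generated) responses: 10 pieces per group,
# 25 respondents each; real data would come from the questionnaire.
human_responses = [rng.choice(values, size=25) for _ in range(10)]
ai_responses = [rng.choice(values, size=25) for _ in range(10)]

def piece_score(v):
    """s = (1/n) * sum_i v_i: the mean response value for one piece."""
    return float(np.mean(v))

human_scores = [piece_score(v) for v in human_responses]
ai_scores = [piece_score(v) for v in ai_responses]

# Two-sided Mann-Whitney U test between the two groups of piece scores.
stat, p = mannwhitneyu(human_scores, ai_scores, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")  # large p -> no significant difference
```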