For my final project for Introduction to Neural Networks, my project partner
Xiaohan Ding and I put together an
ensemble of recurrent neural networks to learn and compose string quartets in the style of Joseph Haydn.
Our initial ambition was to build long-term structure into music, which at the time seemed to be an elusive problem: plenty of papers pointed it out, but none offered a solution or much insight into how to tackle it.
We presented our plan and thoughts to our professor, whose feedback steered us toward something "simpler". Looking
back though, I wish we had presented a more concrete problem - I can spot ternary or sonata form when I see one, but what does long-term structure mean? Does it mean that we find a way to inject a prior that makes the computer generate compositions in sonata form? Or does it mean that the computer will generate something completely new that contains some (possibly unclassified) underlying structure?
Being a complete newbie at this, I had not defined the premise of my goal very well. Not knowing how to get guidance, we decided to test something simpler instead - simply modeling music with neural networks and seeing what kind of results we got.
For our project, we wanted to model something a bit more complex than Bach chorales and see what kinds of results we would get. The most obvious choice to us at the time was string quartets: like the chorales, they are written in four parts, but with far more rhythmic and harmonic complexity.
To model such rhythmic and harmonic complexity, we employed an ensemble of 16 neural networks, each one modeling one aspect of the music for one of the four instruments: a network that modeled forward rhythm and pitch patterns, a network that modeled backward pitch patterns, a network that modeled harmony across the four parts, and a final judge model that took in the predictions from the three aforementioned models and came up with a final answer.
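For the curious, here is a minimal sketch of how such an ensemble could be wired up in PyTorch. This is written after the fact and is not our actual project code; the class names and layer sizes are all illustrative.

```python
import torch
import torch.nn as nn

NUM_PITCHES = 88  # piano range, matching the judge's output distribution

class MelodicNet(nn.Module):
    """LSTM over a one-hot piano-roll sequence; predicts a held-out note.

    Used three ways per instrument: on the forward context, on the
    backward (time-reversed) context, and on the cross-part harmony.
    """
    def __init__(self, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(NUM_PITCHES, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, NUM_PITCHES)

    def forward(self, x):                 # x: (batch, time, NUM_PITCHES)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # logits for the note being predicted

class JudgeNet(nn.Module):
    """Merges the three experts' predictions into one final distribution."""
    def __init__(self, num_experts=3):
        super().__init__()
        self.combine = nn.Linear(num_experts * NUM_PITCHES, NUM_PITCHES)

    def forward(self, expert_logits):     # list of (batch, NUM_PITCHES)
        return self.combine(torch.cat(expert_logits, dim=-1))

# One forward/backward/harmony trio plus a judge, per instrument:
nets = (MelodicNet(), MelodicNet(), MelodicNet())
judge = JudgeNet()
```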
To train the models, we created a four-measure frame around each note in the music and supplied everything in that frame except the note itself. The judge model produces a probability distribution over the 88 notes of the piano, and we back-propagate the cross-entropy loss between the predicted distribution and the actual one-hot distribution.
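A single training step under those assumptions might look roughly like this, reusing the hypothetical MelodicNet and JudgeNet from the sketch above:

```python
import torch
import torch.nn.functional as F

def training_step(frame, masked_idx, target_pitch, nets, judge, optimizer):
    """One update: predict the note masked out of its four-measure frame.

    frame: (batch, time, NUM_PITCHES) one-hot piano roll for one instrument,
           with the note at masked_idx zeroed out; masked_idx is assumed to
           lie strictly inside the frame so both contexts are non-empty.
    target_pitch: (batch,) index of the true pitch of the masked note
    """
    forward_net, backward_net, harmony_net = nets
    before = frame[:, :masked_idx]              # context leading up to the note
    after = frame[:, masked_idx + 1:].flip(1)   # later context, reversed in time
    expert_logits = [forward_net(before),
                     backward_net(after),
                     harmony_net(frame)]        # stand-in for the cross-part input
    logits = judge(expert_logits)
    # cross_entropy takes class indices, equivalent to the one-hot target
    loss = F.cross_entropy(logits, target_pitch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```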
To generate notes, we used a Markov chain Monte Carlo (MCMC) sampling approach: we start by generating a random set of notes and rhythms, then iterate by randomly selecting a note, sending all of the notes in its four-measure neighborhood through the ensemble of trained networks, replacing the selected note with the prediction, and rinsing and repeating many times.
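In code, the sampling loop is Gibbs-flavored: pick a random note, resample it from the judge's predicted distribution, and repeat. Again a sketch built on the same assumed pieces as above, not the actual implementation:

```python
import torch
import torch.nn.functional as F

def generate(score, nets, judge, num_iters=1000, frame_len=64):
    """Iteratively resample random notes of a randomly initialized score.

    score: (num_parts, time, NUM_PITCHES) one-hot piano roll
    frame_len: slices per four-measure window (e.g. 16th-note quantization)
    """
    num_parts, total_time, num_pitches = score.shape
    half = frame_len // 2
    for _ in range(num_iters):
        part = torch.randint(num_parts, (1,)).item()
        t = torch.randint(half, total_time - half, (1,)).item()
        frame = score[part, t - half : t + half].unsqueeze(0)
        expert_logits = [nets[0](frame[:, :half]),              # forward context
                         nets[1](frame[:, half + 1:].flip(1)),  # backward context
                         nets[2](frame)]                        # cross-part stand-in
        probs = torch.softmax(judge(expert_logits), dim=-1)
        new_pitch = torch.multinomial(probs, 1).item()          # sample, don't argmax
        score[part, t] = F.one_hot(torch.tensor(new_pitch), num_pitches).float()
    return score
```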
The image below compares the randomly generated music we started with (on the left) against what the networks produced after 1000 iterations of sampling. While the harmony looks a bit funky, the accidentals hint at a semblance of a key, and there are rhythmic patterns that look somewhat familiar and reasonable.
To complete the project, we created a survey consisting of three short samples generated by our algorithm and three samples from actual composers - somewhat of a musical Turing test. Take a listen below and see if you can tell the difference!
Note that we normalized the listening experience by rendering the real pieces with a computer as well (as opposed to using live performances).
To be honest, I was a bit disappointed by how atonal the music sounded. At any given 16th note, the smallest unit of time we quantized everything into, the harmony seems to make sense. But in the overall context, the music does not sound anywhere near what Haydn wrote - it almost sounds like Schoenberg's goose-bump-inducing twelve-tone pieces!
Because the harmony at each slice of time across the four voices sounds reasonable, one thing we thought might improve the results is another network that models the chord progression over the four-measure window. This would require a significant amount of work: we would need to learn the chord at each time slice, then use another recurrent network to capture the progression, similar to the ones we used to capture the melodic lines - something I may look into in the future.
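To sketch the idea (purely hypothetical - none of this was implemented): label each time slice with a chord symbol, then model those labels with a recurrent network analogous to the melodic ones:

```python
import torch.nn as nn

NUM_CHORDS = 48  # hypothetical vocabulary, e.g. 12 roots x 4 chord qualities

class ChordNet(nn.Module):
    """RNN over per-slice chord labels; predicts the next chord in the window."""
    def __init__(self, embed_size=32, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(NUM_CHORDS, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, NUM_CHORDS)

    def forward(self, chords):            # chords: (batch, time) int labels
        out, _ = self.lstm(self.embed(chords))
        return self.head(out[:, -1])
```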
On the bright side though, the system did produce some interesting melodic lines, such as the one around the 8th second of Sample #1.
If you are interested in the details, the code repository for this project is
here, and we also did a write-up, which can be found here.