
Generating String Quartets with Neural Networks


Introduction to Neural Networks final project, where we used multiple recurrent neural networks to learn and compose string quartets in the style of Joseph Haydn.


For my final project for Introduction to Neural Networks, my project partner Xiaohan Ding and I put together an ensemble of recurrent neural networks to learn and compose string quartets in the style of Joseph Haydn.

Our initial ambition was to build long-term structure into music, which at the time seemed to be an elusive problem: many papers pointed it out, but none provided solutions or insight into how to tackle it.

We presented our plan and thoughts to our professor, whose feedback steered us toward something "simpler". Looking back though, I wish we had presented a more concrete problem. I can spot a ternary form or sonata form when I see one, but what does long-term structure mean? Does it mean that we find ways to inject a prior that makes the computer generate compositions in sonata form? Or does it mean that the computer will generate something completely new that contains some (possibly un-classified) underlying structure?

Being complete newbies at this, we had not defined the premise of our goal very well. Not knowing how to get further guidance, we decided to test something simpler instead: simply modeling music with neural networks and seeing what kind of results we would get.

Deep Dive

There were many existing works in the domain of music generation with machine learning models, each employing ever more complex architectures, such as tied parallel networks by D. Johnson, convolutional generative adversarial networks by Yang et al., and HARMONET by Hild et al.

For our project, we wanted to model something a bit more complex than Bach chorales and see what kinds of results we would get. The most obvious choice to us at the time was string quartets: they are written in four parts like the chorales, but have a lot more rhythmic and harmonic complexity.

To model such rhythmic and harmonic complexity, we employed an ensemble of 16 neural networks, each one modeling one aspect of the music for one of the four instruments. For each instrument, we had a network that modeled forward rhythm and pitch patterns, a network that modeled backward pitch patterns, a network that modeled harmony across the four parts, and a final judge model that took in the predictions from the three aforementioned networks and came up with a final answer.
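To make the judge's role concrete, here is a minimal sketch of how the three specialist predictions could be combined into one distribution over the 88 piano pitches. The random linear layer standing in for the judge, and the random logits standing in for the specialist models, are illustrative assumptions, not our actual trained recurrent networks:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N_PITCHES = 88  # pitches on a piano keyboard

# Hypothetical stand-ins for the three specialist models' raw outputs
# for a single note (the real models were recurrent networks).
forward_logits = rng.normal(size=N_PITCHES)   # forward rhythm/pitch model
backward_logits = rng.normal(size=N_PITCHES)  # backward pitch model
harmony_logits = rng.normal(size=N_PITCHES)   # cross-part harmony model

# The judge consumes the three predictions and emits a final
# distribution; a single random linear layer stands in for it here.
features = np.concatenate([softmax(forward_logits),
                           softmax(backward_logits),
                           softmax(harmony_logits)])
W = rng.normal(size=(N_PITCHES, features.size)) * 0.1
final_dist = softmax(W @ features)  # judge's distribution over 88 pitches
```

In the real project the judge was itself a trained network, so its weights were learned jointly with the rest of the ensemble rather than fixed like this.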

Network architecture

To train the models, we created four-measure frames around each note in the music and supplied all of the notes except that particular one. The judge model produces a probability distribution over the 88 piano pitches, and we back-propagated the cross-entropy loss between the predicted distribution and the actual one-hot distribution.

To generate notes, we used a Markov-Chain Monte Carlo (MCMC) sampling approach: we started by generating a random set of notes and rhythms, then repeatedly selected a note at random and sent all of the notes in its four-measure neighborhood through the ensemble of networks we trained. Once we got a prediction, we replaced the current note with the generated one, and rinsed and repeated for many iterations.
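The sampling loop amounts to Gibbs-style resampling of one note at a time. Here is a minimal sketch; `predict` is a hypothetical stand-in for the trained ensemble, and the piece length and iteration count are toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
N_PITCHES = 88
LENGTH = 64        # notes in the piece (toy value)
ITERATIONS = 1000

def predict(context):
    """Hypothetical stand-in for the trained ensemble: returns a
    distribution over the 88 pitches given the surrounding notes.
    (Here it ignores the context and returns random probabilities.)"""
    logits = rng.normal(size=N_PITCHES)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Start from random notes, then repeatedly resample one at a time.
piece = rng.integers(0, N_PITCHES, size=LENGTH)
for _ in range(ITERATIONS):
    i = rng.integers(0, LENGTH)               # pick a note at random
    dist = predict(np.delete(piece, i))       # ensemble's prediction
    piece[i] = rng.choice(N_PITCHES, p=dist)  # replace it with a sample
```

With the real ensemble in place of `predict`, each sweep nudges the random starting material toward note choices the networks consider plausible.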

The image below compares the randomly generated music we started with, on the left, with what the networks produced after 1000 iterations of sampling. We can see that while the harmony looks a bit funky, there is a semblance of a key when looking at the accidentals, as well as rhythmic patterns that are somewhat familiar and reasonable-looking.

Randomly generated music vs. the result after 1000 iterations of sampling


To complete the project, we created a survey consisting of three short samples generated by our algorithm, as well as three samples from actual composers, a musical Turing test of sorts. Take a listen below and see if you can tell the difference!

Note that we normalized the playback by rendering the real pieces with a computer as well (as opposed to using live performances).

Sample #1. Answer: Our Model

Sample #2. Answer: Arnold Schoenberg

Sample #3. Answer: Our Model

Sample #4. Answer: Joseph Haydn

Sample #5. Answer: Joseph Haydn

Sample #6. Answer: Our Model


To be honest, I was a bit disappointed at how atonal the music sounded. At any given 16th-note moment, which is the smallest unit of time we quantized everything into, the harmony seems to make sense. But when put into the overall context, the music does not sound anywhere near what Haydn wrote; it almost sounds like Schoenberg's goose-bump-inducing twelve-tone pieces!

Because the harmony at each slice of time across the four voices sounds reasonable, one thing we thought might improve this was to have another network model the chord progression within the four-measure window as well. This would require a significant amount of work: we would need to learn the chords at each time slice, then use another recurrent network to capture them, similar to the ones we used to capture the melodic lines. That is something I may look into in the future.

On the bright side though, it did produce some interesting melodic lines, such as around the 8th second of Sample #1.

If you are interested in the details, the code repository for this project is here, and we also did a write-up, which can be found here.

What do you think? Leave a comment below!