So lately I’ve been spending a fair amount of time toying with GPT2, which made headlines for producing text so believable that it was considered dangerous (the GPT2 that was released is the toned-down version).
ML and Reddit
I started by getting hooked on this GPT2 generated subreddit:
I highly recommend that everyone read it daily as an exercise in critical thinking and in challenging the natural human bias to trust everything you see. I especially enjoy the tag trained on r/totallynotrobots, which is basically robots pretending to be humans pretending to be robots pretending to be humans.
It wasn’t long before I tried it for myself. I’ve long wanted to download all my social media posts and train some kind of ML on it, and GPT2 seemed like the state of the art.
Somehow I first started to mess around with Torch RNN, which was the previous state of the art, I guess. It was made accessible through this tutorial and gave us such gems as a PBS Idea Channel episode, a genius BuzzFeed skit, and the relatively famous short film Sunspring.
Both Torch RNN and GPT2 are pretty similar in the way they are used, even though they don’t share a framework under the hood (Torch for the former, TensorFlow for the latter). They both expect as input a txt file of example lines, and GPT2 additionally delivers you a pre-trained model that already kinda knows English.
But training took ages on my computer (like a whole night for a couple of iterations) because despite being fairly powerful its GPU isn’t supported for the ML training optimizations (sad). I had little hope that anything more sophisticated would be possible on my machine.
Fortunately, people are sometimes really great, and not only did Max Woolf make a wrapper to make GPT2 easy to use, he also made a Colaboratory notebook that makes it dead simple to use and, most importantly, computationally sustainable, since it runs on a Google Compute Engine VM with some sort of free quota. It has a very nice Google Drive integration that makes it easy to save trained models or upload new training data. With this, you can train a model in less than an hour, which makes it really easy to play with.
First of all, it was extremely easy to download all my data from social networks (here I’m talking about Google, Facebook, Tumblr, Discord and WordPress). Everything has a dump archive function now (courtesy of EU law, I believe?), so that definitely made my life easier. A bit of Python scripting to transform the JSON or XML into txt and we were good to go.
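To give an idea of the scripting involved, here is a minimal sketch of the JSON-to-txt step. The dump layout (a list of objects with a `text` field) and the names `posts_to_lines` and `dump_to_txt` are assumptions made up for illustration; every network’s archive has its own schema, so the field name would need adapting.

```python
import json
from pathlib import Path


def posts_to_lines(posts, text_key="text"):
    """Return one flattened line of text per post, skipping empty posts."""
    lines = []
    for post in posts:
        # Assumed schema: each post is a dict holding its body under text_key.
        body = str(post.get(text_key) or "")
        # Collapse newlines and runs of spaces so that one post = one line.
        flat = " ".join(body.split())
        if flat:
            lines.append(flat)
    return lines


def dump_to_txt(json_path, txt_path, text_key="text"):
    """Convert a JSON dump (a list of post objects) into a plain txt corpus."""
    posts = json.loads(Path(json_path).read_text(encoding="utf-8"))
    lines = posts_to_lines(posts, text_key)
    Path(txt_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

Calling `dump_to_txt("posts.json", "corpus.txt")` (hypothetical file names) would then write one post per line, which is the kind of txt file the training expects.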
I first started the training on the posts of this blog. The outcome was pretty convincing. It felt pretty weird and special to see these lines that I felt I could have written but actually didn’t. It really seemed like another version of me, which of course tickled my philosophy bone.
Obviously the result wasn’t perfect. It often spouts nonsensical stuff, but I very much enjoyed weeding out the absurd or malformed propositions to keep something that makes sense by conventional human standards (let’s say I got around 1 satisfying proposal out of every 5 results on average).
This way, I had the program write a short story for this blog. I gave it the prompt you see in bold, and I chose among the completions it proposed. I did not add any text myself. As you can see, it’s a bit weird. In particular, it doesn’t really lead anywhere; I think GPT2 isn’t very teleological. That definitely was a challenge for a short story ^^ But I like to think that the style is pretty convincing.
And the overall exercise is far from absurd. It reminded me of the écriture automatique productions of the surrealists. It’s still an easier read than Naked Lunch. Really gets you thinking about the self, art and authorship, doesn’t it? Who wrote this story in the end? What if I hadn’t done any editing? What does it mean for copyright?
Prompted by these questions, I trained several models on works of art that I thought would produce interesting outputs. I put all my favorite results on
In particular, I trained models on the Hitchhiker’s Guide to the Galaxy (which produced a lot of “bits of story” and dialogs that were not really usable as standalone excerpts), on Welcome to Night Vale scripts (which were pretty convincing, especially when you prompt the model with a phrase from the show like “And now, a look at the community calendar!”), and on all of Homestuck (which was pretty challenging to get anything good out of).
Once I had all these pretty ok results, I immediately proceeded to try merging my brain (at least this model copy) with the brains of these authors I admire (at least this model copy). The result was a mess until I had the great idea to feed the input corpora not in parallel all at once but in sequence (i.e. do 1000 rounds of training on the authors’ corpus, and then 1000 rounds on mine). The results were pretty nice.
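The “in sequence” idea can be sketched as a small loop. This is only an illustration under assumptions: `train_in_sequence` and the `finetune` callable are hypothetical names, and the callable stands in for whatever training function you use. With Max Woolf’s wrapper, I believe the `restore_from` argument of its fine-tuning function plays this “fresh” versus “latest” role, but treat that mapping as an assumption rather than a recipe.

```python
def train_in_sequence(finetune, corpora, steps_per_corpus=1000):
    """Fine-tune on each corpus in turn, resuming from the previous stage.

    `finetune` is a hypothetical callable (corpus_path, steps, restore_from)
    wrapping the real training function.
    """
    schedule = []
    for index, corpus_path in enumerate(corpora):
        # The first stage starts from the pre-trained model; every later
        # stage continues from the checkpoint left by the previous stage.
        restore_from = "fresh" if index == 0 else "latest"
        finetune(corpus_path, steps_per_corpus, restore_from)
        schedule.append((corpus_path, steps_per_corpus, restore_from))
    return schedule


if __name__ == "__main__":
    # Stand-in trainer that only prints what a real one would be asked to do.
    def fake_finetune(corpus_path, steps, restore_from):
        print(f"train {steps} steps on {corpus_path} (restore_from={restore_from})")

    train_in_sequence(fake_finetune, ["authors.txt", "my_blog.txt"])
```

The order matters here: the corpus trained last (mine) is the one the final model leans towards the most, on top of what it picked up from the authors.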
This taught me the single most important fact about playing with GPT2: it’s all about your training data. The parameters (number of training rounds, “temperature”) can’t really save you if your input data isn’t the best it can be. You want it as clean and uniform as possible. That is really the core point of the next section.
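As an illustration of what “clean and uniform” can mean in practice, here is a minimal sketch of a cleaning pass. The specific rules (stripping URLs, normalizing whitespace, dropping very short lines and exact duplicates) and the name `clean_corpus` are my assumptions about reasonable cleaning steps, not an established recipe.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def clean_corpus(lines, min_words=3):
    """Return normalized, de-duplicated lines that are long enough to keep."""
    cleaned = []
    seen = set()
    for line in lines:
        # Drop URLs, which the model would otherwise try to imitate.
        text = URL_PATTERN.sub("", line)
        # Collapse whitespace so spacing is uniform across sources.
        text = " ".join(text.split())
        # Skip fragments too short to carry any style, and exact repeats.
        if len(text.split()) < min_words or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

The `min_words` threshold is arbitrary; the point is only that every line reaching the training step looks like the kind of text you want back.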
Social media corpora
I trained GPT2 models on my conversations and emails, but they were all utter failures. The fact that I often use several languages certainly doesn’t help, but the trouble I had with the Homestuck corpus makes me believe that GPT2 is simply not very good with dialogs and conversations.
I even tried to sanitize my input further, prefixing my lines of dialog with “-” and those of whoever I was talking to with “>”, in the hope of starting a conversation with the GPT2 model, but I couldn’t get anything out of it. Maybe if I went over the corpus manually and kept only the meaningful messages, I’d get something different, but this sounds daunting.
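For reference, the prefixing step can be sketched like this. The input shape (a list of `(sender, text)` pairs), the `me` label and the name `format_conversation` are assumptions for illustration, as is the space after each prefix; real chat exports would first need parsing into that shape.

```python
def format_conversation(messages, me="me"):
    """Prefix my lines with "-" and everyone else's with ">"."""
    lines = []
    for sender, text in messages:
        # Flatten multi-line messages so each message stays on one line.
        flat = " ".join(text.split())
        if not flat:
            continue
        prefix = "-" if sender == me else ">"
        lines.append(f"{prefix} {flat}")
    return "\n".join(lines)
```

The idea was that prompting the trained model with a line starting with “>” would make it answer with a “-” line in my voice.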
Needless to say, merging this with my blog post corpus also went pretty badly, so in the end I stuck to my blog corpus.
By the way, I also tried to train a model on a list of J.K. Rowling’s retcon tweets to get crispy new intel about the Harry Potter canon variations, but I couldn’t get it to produce anything new.
- GPT2 on colab is extremely easy.
- Your training corpus is everything, really.
- GPT2 does great with literary types of text but sucks a bit at conversations/informal speech.
As intoxicating as it is to watch a ghost of myself produce believable texts, I’m not sure where it leads ^^ My ultimate goal would be to produce some sort of system I can interact with and teach dynamically to get better (i.e. conversational and dynamic retraining), but that seems pretty rare in the world of generative ML models. I might have to dig deeper into TensorFlow, but I can’t really do that with my current machine, so I’m kinda stuck.
I have a couple of pointers for conversational ML (still no dynamic/online/interactive/reinforcement learning though so that limits the interest), but I expect them to be less good than GPT2. Haven’t had time to try them yet (probably they require more power than I have). The dream would be to combine that with GPT2 I guess and figure out a way to dynamically retrain the model on itself.
In any case, it feels really nice to see some progress in my Caprica dream.