The story so far
It was more than a year ago that I went through my "playing with GPT2" phase, which resulted in a short story co-written with the AI and this little blog http://yo252yo-a.tumblr.com/ (which I kinda stopped updating after a while).
But I was bound to come back to it some day! It all started when I decided to open a twitter account for my podcast. I very naturally made a little script (driven from a Google Spreadsheet ^^) so that I could enqueue and schedule tweets. I also went back through the archive of my facebook/tumblr/whatever posts to see what could fit this new account, since I've posted so many enlightening things over the years xD
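(For the curious, the queue logic is conceptually as simple as the sketch below. This is a hedged Python reconstruction using gspread and tweepy, not my actual script; the sheet name, credentials, and API keys are all placeholders.)

```python
# Hypothetical sketch of the tweet queue: pop the top row of a
# Google Sheet and post it. All names and keys are placeholders.
import gspread
import tweepy

gc = gspread.service_account(filename="credentials.json")
queue = gc.open("tweet_queue").sheet1

auth = tweepy.OAuth1UserHandler(
    "API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

rows = queue.get_all_values()
if rows:
    api.update_status(rows[0][0])  # tweet the oldest queued entry
    queue.delete_rows(1)           # and remove it from the queue
```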
Once this was in place, it was like my twitter account was managed by a nice little bot (who was simply posting things from a queue, but still). As its parent, I obviously wanted to see it grow: how cool would it be if it could learn and evolve by itself? Could it ever be self-aware (lol)? After all, it already had access to twitter, and it had a collection of my tweets to learn from.
So I dusted off my colab repository of GPT2, since GPT3, despite all the hype, remains pretty inaccessible. Most notably, I had to make it work with an old version of tensorflow (the recent versions broke it), and I also made it read and write directly to the Google Spreadsheet /o/ In the end, all I had to do was run the code in the colab to fetch the data, train on it, and post the output directly to the queue to be tweeted. Pretty sweet setup.
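The whole pipeline boils down to something like this (a minimal sketch assuming a gpt-2-simple-style API; the file name is a placeholder and the spreadsheet I/O is elided):

```python
# Colab pipeline sketch. The old tensorflow is pinned first, e.g.:
#   !pip install tensorflow==1.15 gpt-2-simple
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")      # base model to fine-tune

sess = gpt2.start_tf_sess()
gpt2.finetune(sess, "tweets.txt",          # corpus fetched from the sheet
              model_name="124M", steps=1000)

samples = gpt2.generate(sess, temperature=0.8,
                        nsamples=10, return_as_list=True)
# ...each sample then goes back into the spreadsheet queue
```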
The problem is that GPT2 produces mostly crap. And I didn’t know what temperature or training set would be ideal for my purposes. It was time to experiment!
I ran several training sets at several temperatures. For each combination, I personally annotated 200 results. I don't think the results are super significant statistically, but it's better than nothing.
The success criterion was: is this tweetable? (i.e. relatively grammatically correct, at least a bit interesting/surprising, and of course different from the training set). The good samples will be posted on our twitter with the hashtag #shitmygpt2says.
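Annotation itself was nothing fancy; a hypothetical little loop like this is all it takes:

```python
# Hypothetical annotation helper: show each generated sample,
# record a yes/no "tweetable" label, then print the success rate.
def annotate(samples):
    labels = []
    for s in samples:
        print("\n" + s)
        labels.append(input("tweetable? [y/n] ").strip().lower() == "y")
    print(f"tweetable rate: {sum(labels) / len(labels):.0%}")
    return labels
```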
The basic training set was the queue of all our tweets for the podcast twitter account, including the archive of all my past tumblr/facebook posts that I sanitized for the occasion (a lot of work xD).
But as in my previous attempts, I thought it was a bit sad to limit myself to things produced by me when I had the perfect chance to merge my brain with those of the people I admire. Furthermore, I kinda wanted to make my twitter AI standalone and able to "learn" as time passes, even though GPT really isn't the best framework for that ^^
I ended up making a twitter list of people I admire, and used their recent tweets in my dataset. The idea was to make my model aware of “recent events”, recent words, etc…
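Harvesting the list looks roughly like this (a hedged sketch using tweepy against the v1.1 API; the list id and keys are placeholders, and retweets and links are stripped to keep the corpus clean):

```python
# Sketch: dump recent tweets from a twitter list into a corpus file.
import re
import tweepy

LIST_ID = 1234567890  # placeholder: the "people I admire" list

auth = tweepy.OAuth1UserHandler(
    "API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

with open("list_corpus.txt", "w") as out:
    for tweet in tweepy.Cursor(api.list_timeline, list_id=LIST_ID,
                               include_rts=False).items(2000):
        text = re.sub(r"https?://\S+", "", tweet.text).strip()
        if text:
            out.write(text + "\n")
```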
Yet, I wanted to keep the feeling that the writing style was distinctly mine. This is accounted for in the success criterion, and the core of this experiment was: "how should I mix the training sets to stay aware of the recent world while still controlling the style of the output?".
Sequential vs merging
In my previous attempts, I mostly used a "merging" approach, feeding everything at once to the learning phase. The alternative is "sequential": feeding the two corpora one after the other during the learning phase.
From what I observed, it seems that GPT2 absorbs the style of whatever it was fed last, even if it is for very few training epochs. For instance, when I fed it corpus A for 1.5k epochs and then corpus B for 100 epochs, it produced results that looked like corpus B, even though it exhibited some signs of having learned A every now and then (pretty rarely though, that’s why I kept so many epochs in the first phase of training).
I kinda think of it with a cooking metaphor: I first marinate the model in corpus A, then lightly sear it with corpus B.
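In code, the recipe looks roughly like this (again a sketch assuming gpt-2-simple, where training "steps" stand in for what I loosely call epochs):

```python
# Marinade/sear sketch: long fine-tune on corpus A, short one on B.
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.finetune(sess, "corpus_a.txt", model_name="124M",
              steps=1500)                 # marinate in corpus A

sess = gpt2.reset_session(sess)           # fresh graph, same checkpoint
gpt2.finetune(sess, "corpus_b.txt", model_name="124M",
              steps=100, restore_from="latest",
              overwrite=True)             # lightly sear with corpus B
```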
Here are the experimental results that loosely validate this:
We notice here btw that the merging strategy performs pretty poorly, because consistency of the training set matters a lot to GPT2. The first three lines did not exhibit a strong difference, which makes me believe that 1k epochs is enough for GPT2 to "forget" the initial corpus; that's how I ended up with the 1.5k/100 mix, which gave me the best outcomes.
Here is the overall result of my experiments: GPT2 produces around 93% crap, which makes sanitizing a pretty tough job ^^ It appears that this can drop to 80% or below by correctly applying the "marinade/sear" technique and keeping the training set uniform.
As is widely known, temperatures below 0.8 are pretty bad, but I often find myself pushing above 1, though that seemed to do pretty poorly with my best datasets. I'll keep using different temperatures, as they produce different types of results that I enjoy in their own way. But I'll probably stop using long-form text corpora as a base (past writing, Night Vale scripts, etc…) because they don't seem to bring anything to the table (and could even be detrimental; better to stick to tweets).
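If you want to compare temperatures yourself, a quick sweep is enough (same gpt-2-simple assumptions as above):

```python
# Generate a few samples at each temperature for side-by-side reading.
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)                 # load the fine-tuned checkpoint

for temp in (0.7, 0.8, 1.0, 1.2):
    print(f"--- temperature {temp} ---")
    for s in gpt2.generate(sess, temperature=temp,
                           nsamples=5, return_as_list=True):
        print(s)
```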
So we're pretty far from a self-aware AI that learns from its mistakes, but since I'll keep retraining it on recent tweets, and since my own tweets include the proposals it made that I kept, I hope it'll still learn to be a bit better as time passes (it has already started adding the #shitmygpt2says hashtag to its posts by itself).
In the future, I’ll run this every now and then in its best configuration, and keep posting on twitter with the hashtag #shitmygpt2says. Stay tuned if you’re interested!