Hi! I have a lot of other things to do, so as procrastination I did a little analysis. I'm learning Japanese by watching raw anime and writing down what I don't know, so I have a pretty good dataset of which anime I actually found hard versus easy. For the best experience, I'd like to watch them in increasing order of difficulty, but it's hard to gauge the difficulty before actually watching.
I had glanced at using subtitle files for this: the idea was that the bigger the subtitle file, the more text (and therefore more difficult text) there would be, and so the harder the anime would be to understand. But a quick look showed no correlation between subtitle size and difficulty.
So today I thought it wouldn't be too hard to write a little script that goes over the subtitles in a folder, removes all the junk boilerplate, and counts the kanji. The script is here. The upside is that I can strip the .srt or .ass boilerplate, but also look at the text itself to see whether it contains difficult kanji.
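For the curious, the core of such a script can be sketched in a few lines. This is a hypothetical reconstruction, not the actual script: the file layout, regexes, and function names are my own assumptions, and the .ass handling only covers the common `Dialogue:` line format.

```python
# Sketch: walk a folder of .srt/.ass subtitle files, strip timing and
# formatting boilerplate, and tally kanji frequencies across all files.
import re
from collections import Counter
from pathlib import Path

KANJI_RE = re.compile(r'[\u4e00-\u9fff]')  # CJK unified ideographs
SRT_JUNK = re.compile(r'^\d+$|^\d{2}:\d{2}:\d{2}[,.]\d{3} -->')  # cue index / timing lines
ASS_TAGS = re.compile(r'\{[^}]*\}')  # ASS override tags like {\pos(10,10)}

def extract_text(path: Path) -> str:
    """Return only the dialogue text of one subtitle file."""
    lines = []
    for line in path.read_text(encoding='utf-8', errors='ignore').splitlines():
        if path.suffix == '.ass':
            if line.startswith('Dialogue:'):
                # the 10th comma-separated field onward is the actual text
                lines.append(ASS_TAGS.sub('', line.split(',', 9)[-1]))
        elif not SRT_JUNK.match(line.strip()):
            lines.append(line)
    return ''.join(lines)

def count_kanji(folder: str) -> Counter:
    """Kanji -> occurrence count over every .srt/.ass file in folder."""
    counts = Counter()
    for path in Path(folder).glob('*'):
        if path.suffix in ('.srt', '.ass'):
            counts.update(KANJI_RE.findall(extract_text(path)))
    return counts
```

From the resulting Counter it's easy to bucket each kanji by JLPT level (given a level list) and average over files, which is what the chart below does.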
Without further ado, here is the average kanji distribution per subtitle file for 10 anime, annotated with my expert ground truth:
And on a log scale:
So… this doesn't look very helpful. The result of this analysis is a complete failure: JLPT kanji level doesn't correlate with anime difficulty. But maybe I can pick out a list of kanji that are correlated with difficult anime (like military vocabulary, which always gets me). Ideally I'd assign a weight to each kanji by machine learning, but that requires more effort than I'm willing to put in. If you know a dead-easy way to do that, though, I'm interested.
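One arguably dead-easy way, sketched here under assumptions the post doesn't state: reduce each episode to a bag-of-kanji count vector, pair it with a binary hard/easy label from the watch log, and fit a tiny logistic regression. The learned weight per kanji is exactly the per-kanji difficulty score. All data below is made up for illustration.

```python
# Sketch: learn one difficulty weight per kanji with plain-Python
# logistic regression (no ML library needed at this scale).
import math
from collections import Counter

def train_weights(episodes, labels, epochs=200, lr=0.1):
    """episodes: list of Counter(kanji -> count); labels: 1 = hard, 0 = easy."""
    vocab = sorted({k for ep in episodes for k in ep})
    w = {k: 0.0 for k in vocab}
    b = 0.0
    for _ in range(epochs):
        for ep, y in zip(episodes, labels):
            z = b + sum(w[k] * c for k, c in ep.items())
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(hard)
            err = p - y
            b -= lr * err
            for k, c in ep.items():
                w[k] -= lr * err * c  # gradient step on this kanji's weight
    return w  # higher weight = kanji more associated with hard anime
```

Sorting `w.items()` by weight should then surface the "military vocabulary" cluster at the top, though with only ~10 labeled anime the weights would be very noisy.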