Using Python to clean up corpus files for OpenNMT Training

So I’m working on a little epub project tentatively called epub-ocr-and-translate (EOAT) that started out as me sharing a bunch of little scripts I was using to OCR, translate, and single-source the creation of PDFs and epubs from old public domain works in other languages. It’s kind of ballooned into a much bigger project than I originally envisioned, somehow leading me down the path of (don’t laugh…okay, fine, you can laugh, but make it quick) DIY machine translation…

While working with different online translation engines (translate-shell and Google’s paid Google Cloud Translation API, mostly), I discovered a lot of really, really strange output. Almost as if the neural network itself was calling for help, repeating things like DANGER DANGER DANGER for ten straight lines when I’d fed it three words. And this was creepily repeatable – a real ghost in the machine. Interestingly, I’d see this via command line, but never if I plugged the triggering content into translate.google.com using a browser (although I did encounter similar behavior recently checking out a non-Google public web-based translator).

This made me really curious about machine translation, what kind of technology different companies are using and…I know it sounds crazy, but can we do it ourselves? So I stumbled onto OpenNMT and the answer is yes! That is, if you have the money or the time. Sure, you don’t have the resources of Google or Yandex, but maybe the things you want to translate are pretty specialized and closed-set…in that case, there’s even a chance you could do better.

But back to the money part: even if you don’t have a lot of money, you can set everything up on a cheap cloud system, learn the steps, figure out what you’re doing, and then, when you’re ready, take the plunge into machine learning by running your training sessions on a more expensive rented system. You definitely don’t wanna be learning the basics of setup and training on a four-buck-an-hour EC2 instance. Also, one thing I’ve done to save a little time and money is to save lots of checkpoints with --save_checkpoint_steps and use spot instances when using GPU-enabled servers: I’ve currently got a training session on a $0.75/hour instance that’s costing me $0.29/hour. If my spot instance goes away, I just restart it somewhere else with --train_from set to the last good model.
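If you end up restarting a lot, a tiny helper to find the most recent checkpoint can save some squinting at directory listings. This is just a rough sketch: the function name latest_checkpoint is made up, and it assumes checkpoints are saved with names ending in _step_<N>.pt, so adjust the pattern if yours look different.

```python
import re
from pathlib import Path

# Assumption: checkpoint files are named like "<prefix>_step_<N>.pt".
STEP_RE = re.compile(r"_step_(\d+)\.pt$")


def latest_checkpoint(directory):
    """Return the checkpoint path with the highest step number, or None."""
    best_step, best_path = -1, None
    for path in Path(directory).iterdir():
        match = STEP_RE.search(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

Whatever path this returns is what you’d hand to --train_from on the replacement instance.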

So here’s where this blog post comes in…I started this project wanting to translate old public domain books written in Russian. The problem is, I don’t know Russian. And a lot of the experiments I’ve been doing with Google’s APIs, even the paid one, are better than anything else…but they’re still a long way off and you really do lose days tracking down some of this strange translation output. I’m still really wondering where all of those strange ‘DANGER DANGER CORPORATE COMMUNICATIONS MISSILE MISSILE SHOWER SHOWER SHOWER SHOWER’ things are coming from, too (and absolutely yes, “SHOWER” might have been the most terrifying glitch!).

Doing a little digging through forums and poking around on OPUS, I noticed what other people are using for their corpora: Wikipedia files, news headlines, and even some old books (what I’m most interested in). A lot of the older translations are pretty great for this because some translators were pretty good about translating line-by-line. Still, because the training files need line-by-line agreement, a lot of fiddling usually needs to be done to the output.

Case in point, I popped open the Wikipedia files commonly used for OpenNMT training and noticed there’s a lot of mixed-in English and transliterated Russian in the Russian files and a lot of Cyrillic in the English files. I’m thinking maybe cleaning that stuff up a little bit might help the training process.

The tricky thing is that for each line you remove from one file, you’ve gotta remove the corresponding line from the other file, because training depends on each line of each file including the translation of the same line within the other file.

So Python to the rescue! By default, my script looks for any character from a to Z ([a-zA-Z]) in a line and, if it finds one, removes that line, because my initial intention was just to strip all of the lines with Latin characters out of the Russian training files. I added the ability to pass a regex on the command line to match whatever you want, so this tool could conceivably be used to do all sorts of parallel removals.
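The core idea fits in a few lines. Here’s a minimal sketch of parallel removal on in-memory lists. It’s an illustration rather than the actual code from eoat-corpusclean.py, and the name drop_matching_pairs and the sample sentences are mine.

```python
import re


def drop_matching_pairs(source_lines, target_lines, pattern=r"[a-zA-Z]"):
    """Keep only the line pairs whose source line does NOT match pattern."""
    regex = re.compile(pattern)
    kept_source, kept_target = [], []
    for src, tgt in zip(source_lines, target_lines):
        if regex.search(src):
            continue  # drop the whole pair so both sides stay aligned
        kept_source.append(src)
        kept_target.append(tgt)
    return kept_source, kept_target


ru = ["Привет, мир", "Это Wikipedia", "Доброе утро"]
en = ["Hello, world", "This is Wikipedia", "Good morning"]
clean_ru, clean_en = drop_matching_pairs(ru, en)
print(clean_ru)  # ['Привет, мир', 'Доброе утро']
print(clean_en)  # ['Hello, world', 'Good morning']
```

The match is only ever tested against the source side, but the keep-or-drop decision applies to both, which is what keeps line N of one file paired with line N of the other.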

The script I pulled together is called eoat-corpusclean.py and it takes three arguments:

-s	source file, or the file you want to search within and yank lines out of
-t	target file, or the file you want to remove corresponding lines from
-r	optional, this is the regex you're searching for, in single quotes. For example, -r 'А-ЯЁ'

It writes output to $source_filename_clean and $target_filename_clean — not elegant, sure, but this is research-grade stuff. ;) Before you play with it on your own language training files — which are likely pretty big and slow to process — you may want to try on a smaller subset to test your regexes before a full run. It can take a pretty long time on large files (running a pair of 500k+ line files as I write this, and it’s close to done at 48 minutes. Note that I’m also suicidally running this on the same wimpy EC2 instance that I’m running training on…nothing’s crashed yet, but nothing’s fast yet, either).
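For big files, you probably don’t want to load everything into memory. Below is a hedged sketch of a file-based version that reads both files in lockstep and writes the surviving pairs to files with a _clean suffix, like the naming above. The function name clean_files and the UTF-8 assumption are mine, and the real script may well work differently.

```python
import re


def clean_files(source_path, target_path, pattern=r"[a-zA-Z]", suffix="_clean"):
    """Stream both files together and write surviving pairs to *_clean files.

    Returns (pairs_read, pairs_kept). Assumes both files are UTF-8 text.
    """
    regex = re.compile(pattern)
    pairs_read = pairs_kept = 0
    with open(source_path, encoding="utf-8") as src_in, \
         open(target_path, encoding="utf-8") as tgt_in, \
         open(source_path + suffix, "w", encoding="utf-8") as src_out, \
         open(target_path + suffix, "w", encoding="utf-8") as tgt_out:
        for src, tgt in zip(src_in, tgt_in):
            pairs_read += 1
            if regex.search(src):
                continue  # skip the pair on both sides
            src_out.write(src)
            tgt_out.write(tgt)
            pairs_kept += 1
    return pairs_read, pairs_kept
```

Because it makes a single pass and only holds one pair of lines at a time, memory use stays flat no matter how large the corpus gets. The returned counts give a quick sense of how much a given regex throws away.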

To test it out on your training files quickly, you can do something like:

    head -n 1000 training_file.en-ru.ru > training_file.en-ru.ru.short
    head -n 1000 training_file.en-ru.en > training_file.en-ru.en.short

So, for example, if I wanted to run this across the Wikipedia English-Russian training corpus and remove basic Latin characters from the Russian file, I’d run this:

    python eoat-corpusclean.py -s Wikipedia.en-ru.ru -t Wikipedia.en-ru.en

and that generates two new files with a “_clean” extension and significantly fewer lines, which are now missing a lot of Rusglish.

So after that, I want to grab the Russian out of the English, so I run:

    python eoat-corpusclean.py -s Wikipedia.en-ru.en_clean -t Wikipedia.en-ru.ru_clean -r 'А-ЯЁ'

which removes lines with Cyrillic characters from the English corpus.
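One thing worth testing before a full run is exactly which characters your regex covers. The quick check below assumes the -r value ends up being used as a character class (I’m writing the brackets explicitly here). It compares an uppercase-only Cyrillic class with one that also includes lowercase letters; the sample lines are made up.

```python
import re

upper_only = re.compile("[А-ЯЁ]")        # uppercase Cyrillic plus Ё
any_cyrillic = re.compile("[А-Яа-яЁё]")  # both cases, plus Ё and ё

samples = ["Moscow (Москва)", "the word ёлка", "plain English"]
for line in samples:
    print(line, bool(upper_only.search(line)), bool(any_cyrillic.search(line)))
# Moscow (Москва) True True
# the word ёлка False True
# plain English False False
```

Ё and ё sit outside the contiguous А-Я and а-я ranges in Unicode, which is why they’re listed separately. If lowercase-only Cyrillic fragments show up in your English file, the broader class will catch them.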

So now, when I look at the files side by side, they’re both a lot smaller, but…they look cleaner and they’re still aligned! You can run the following on Linux to print the files line by line; add | more to the end of the command to space through from beginning to end.

  pr -m -t  wiki-ru-1000_clean_clean wiki-en-1000_clean_clean

And we get this, and can check to see if our lines match up:

   Через неделю, 1 сентября, Германия  Average temperatures on the coast a
   29 марта 2004 года Литва вступила в In March 2004, Lithuania became a f
   Россия также подразделяется на 9 фе ;Federal districtsFederal subjects
   Обладает правом издания указов, обя Unlike the federal subjects, the fe
   Число занятых в промышленности — 27 73% of the population lives in urba
   Традиционно в России популярны наст Association football is one of the
   * Сайт Комиссии при Президенте Росс St. Mary, St. Nicholas, St. Andrew,
   О статусе соционики существуют прот A. Augustinavichiute and sources of
   * Лингвистика языка изучает язык ка Linguistics is the scientific study
   Д.				       J.
   Дюркгейм поддерживает утверждение о Durkheim maintained that the social
   Общность является подлинной совмест Society is nothing more than the sh

If you’re sanity-checking line-alignment on a large corpus, you’re probably going to want to just sample top and bottom:

   pr -m -t file1 file2 |head -n 30
   pr -m -t file1 file2 |tail -n 30

where 30 is whatever number of lines you want to grab.

If you’ve got a sharp eye and look at my example, or at the Wikipedia corpus files directly, you’ll quickly understand why we want to create our own clean corpus to train on, though. It definitely doesn’t appear to be one-to-one, line-by-line matching of content to translation!
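Eyeballing only goes so far, so here’s a rough sketch of an automated sanity check. It compares line counts and flags pairs where one side is empty or wildly longer than the other. The length-ratio test is a crude heuristic I’m assuming for illustration (it can’t tell whether a pair is actually a translation), and the name check_alignment and the max_ratio default are made up.

```python
from itertools import zip_longest


def check_alignment(source_path, target_path, max_ratio=3.0):
    """Return (source_count, target_count, suspects).

    suspects holds 1-based line numbers where one side is missing or empty,
    or is more than max_ratio times longer than the other.
    """
    src_count = tgt_count = 0
    suspects = []
    with open(source_path, encoding="utf-8") as src, \
         open(target_path, encoding="utf-8") as tgt:
        for number, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is not None:
                src_count += 1
            if t is not None:
                tgt_count += 1
            s_len = len(s.strip()) if s is not None else 0
            t_len = len(t.strip()) if t is not None else 0
            shorter, longer = sorted((s_len, t_len))
            if shorter == 0 or longer / shorter > max_ratio:
                suspects.append(number)
    return src_count, tgt_count, suspects
```

If the two counts differ, the files have drifted out of alignment somewhere. The suspects list is just a starting point for where to look with pr.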

So that’s what I’m doing, learning the ins and outs of neural machine translation for kicks, while I work on this little epub project that keeps ballooning. And maybe, one day, after many eons of training, my little book builder might work reasonably well for what I need it to work for (at the very least, I’ll have one heck of a custom EC2 AMI to share with the world).

And hey, man, I’ve got 22.56% accuracy after 8.5 hours and 2150 steps running on a cheapo CPU with pared-down (100k lines), still verifiably messy training data (I started training before I started really looking at the training data itself and heck if I’m going to stop and restart it now…let her run for the next week while we outrun Hurricane Dorian and we’ll see what happens where we land!).

I figure this is a good way to learn the basics and get comfy…give me a year and some GPUs and a better corpus to train on (okay, don’t give me that corpus, I’m working on that), maybe my little toy monster will be useful.

You can find this tool at https://github.com/jenh/epub-ocr-and-translate/blob/master/onmt-helpers/eoat-corpusclean.py and it’s part of a suite of other publishing tools in the EOAT Project.

As I continue working on EOAT, I hope to add more OpenNMT-py scripts and components in the near future (and release an EC2 AMI), because no one else should ever have to build so many different grumpily interdependent things from source.