EOAT now has a Docker image! It’s huge, but for good reason: I’ve installed Tesseract and its dependencies, pandoc, TeX Live, the required components for third-party translation (gcloud, translate-shell, boto3), all the EOAT tools, and more. Because it can take hours to get everything configured and installed with dependencies in the correct order, this should speed everything up.
If you give this a whirl and have comments/bugs/issues/requests for an even more ridiculously giant docker image, please drop me a line @ jen@sevenminuteserver.com.
So here’s how you use it:
-
Install docker (sudo [yum|apt-get] install docker on Linux, and however y’all do it on Windows). If you get permissions problems running docker on Linux, make sure you add yourself to the docker group with sudo usermod -aG docker $(whoami), log out and back in again, then run sudo service docker start.
-
Pull my docker image:
    docker pull jenh/eoat:v2
-
List the images to make sure you’ve got it:
    you@host:~$ docker images
    REPOSITORY   TAG   IMAGE ID       CREATED             SIZE
    jenh/eoat    v2    af26663f6b4f   About an hour ago   6.04GB
-
Run an instance, being sure to use -d, as the instance will shut down immediately if nothing’s running on it, and nothing will be running on it at first:
    docker run -t -d --name EOAT af26663f6b4f
-
Copy the PDF you want to OCR and translate (note: you can also connect to the container and fetch it there with wget https://wherever.wherever/myfile.pdf):
    docker cp myfile.pdf EOAT:/root/
-
Log into the container:
    docker exec -it EOAT bash
-
Copy the Tesseract language files you need. If you are using English, Russian, French, and/or Spanish, you can skip this step. For a full listing of the files available, see https://github.com/tesseract-ocr/tesseract/wiki/Data-Files. For example, the following downloads Portuguese model files:
    cd /usr/local/share/tessdata
    wget https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata
-
Navigate to where you saved the PDF and get started, where eng is the three-letter language code:
    cd ~/ && eoat-ocr myfile.pdf eng
-
Wait a while and let Tesseract work. When finished, eoat-ocr will write the data to a text file. You may want to open it and clean it up: you’ll see things like page numbers that may need to be removed, images that produce gobbledygook, etc.
-
Translate. The following will use free Google Translate, via translate-shell, to translate English into Spanish with a wait time between lines of 4 seconds (the default is 2):
    eoat-trans -i myfile.txt -s en -t es -w 4
You can change this to a paid engine, which doesn’t typically cut you off: -e gcloud to use Google Cloud, -e amazon for AWS. A paid engine can still cut you off, though, so it’s worth setting -w to 1 and not 0. For Google Cloud, copy your credentials file (JSON format) to the docker container, then export GOOGLE_APPLICATION_CREDENTIALS=my-translate-creds.json and use -e gcloud. For AWS, edit /root/.aws/credentials and customize to your region and access information, then use -e amazon:
    [default]
    region = us-east-1
    aws_access_key_id = your_aws_key
    aws_secret_access_key = your_aws_secret_key
-
Split the output into separate files per chapter, if applicable. Use -d to specify the delimiter to use to break the document into sections:
    eoat-split -i myfile.txt-en-es.txt -d "Chapter"
-
Run eoat-make to create a makefile.
-
Run eoat-build en to build English deliverables, eoat-build es to build Spanish deliverables, etc. If you skipped the translation step, you can just run eoat-build.
Output will be saved in book_en.epub (for an English epub) and book_en.pdf (for an English PDF).
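
If you find yourself repeating the host-side steps (pull, run, copy, OCR), they could be wrapped in a small script along these lines. This is only a sketch, not part of EOAT: the function names eoat_host_steps and run and the DRY_RUN variable are made up for illustration, it starts the container from the jenh/eoat:v2 tag rather than the image ID, and it assumes eoat-ocr is on the PATH when invoked through docker exec (if it isn't, log in with bash as described above). By default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical wrapper for the host-side docker steps; not shipped with EOAT.
set -eu

IMAGE="jenh/eoat:v2"   # image tag from the pull step
NAME="EOAT"            # container name used in the steps above

run() {
  # Print the command; only execute it when DRY_RUN=0.
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

eoat_host_steps() {
  pdf="$1"    # path to the PDF on the host
  lang="$2"   # three-letter Tesseract language code, e.g. eng

  run docker pull "$IMAGE"
  run docker run -t -d --name "$NAME" "$IMAGE"
  run docker cp "$pdf" "$NAME:/root/"
  # Assumption: eoat-ocr is reachable without an interactive shell.
  run docker exec -w /root "$NAME" eoat-ocr "$(basename "$pdf")" "$lang"
}

eoat_host_steps "${1:-myfile.pdf}" "${2:-eng}"
```

Run it as-is to preview the commands, or with DRY_RUN=0 in the environment to actually execute them. The cleanup, translation, and build steps are left manual, since the OCR output usually needs a look first.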