EOAT now has a Docker image! It’s huge, but for good reason: I’ve installed Tesseract and its dependencies, pandoc, TeX Live, the required components for third-party translation (gcloud, translate-shell, boto3), all the EOAT tools, and more. Getting everything configured and installed with dependencies in the correct order can take hours, so this should speed everything up.
If you give this a whirl and have comments/bugs/issues/requests for an even more ridiculously giant docker image, please drop me a line at jen@sevenminuteserver.com.
So here’s how you use it:
-
Install docker (
sudo [yum|apt-get] install docker
on Linux, and however y’all do it on Windows). If you get permissions problems running docker on Linux, make sure you add yourself to the docker group:
sudo usermod -aG docker $(whoami)
Then log out and back in again, and start the docker service:
sudo service docker start
-
Pull my docker image:
docker pull jenh/eoat:v2
-
List the images to make sure you’ve got it:
you@host:~$ docker images
REPOSITORY   TAG   IMAGE ID       CREATED             SIZE
jenh/eoat    v2    af26663f6b4f   About an hour ago   6.04GB
-
Run an instance. Be sure to use -d, as the instance will shut down immediately if nothing’s running on it, and nothing will be running on it at first:
docker run -t -d --name EOAT af26663f6b4f
-
Copy the PDF you want to OCR and translate into the container:
docker cp myfile.pdf EOAT:/root/
Alternatively, you can connect to the container and fetch the file with wget:
wget https://wherever.wherever/myfile.pdf
-
Log into the container:
docker exec -it EOAT bash
-
Copy the Tesseract language files you need. If you are using English, Russian, French, and/or Spanish, you can skip this step. For a full listing of the files available see https://github.com/tesseract-ocr/tesseract/wiki/Data-Files. For example, the following downloads Portuguese model files:
cd /usr/local/share/tessdata
wget https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata
-
Navigate to where you saved the PDF and get started, where eng is the three-letter language code:
cd ~/ && eoat-ocr myfile.pdf eng
-
Wait a while and let Tesseract work. When finished, eoat-ocr will write the data to a text file. You may want to open it and clean it up: you’ll see things like page numbers that may need to be removed, images that produce gobbledygook, etc.
-
Translate. The following uses free Google Translate, via translate-shell, to translate English into Spanish with a wait time of 4 seconds between lines (the default is 2):
eoat-trans -i myfile.txt -s en -t es -w 4
You can also switch from the free engine to a paid one (which doesn’t typically cut you off!): use -e gcloud for Google Cloud, or -e amazon for AWS.
Even a paid engine can still cut you off, so it’s worth setting -w to 1 and not 0. For Google Cloud, copy your credentials file (JSON format) to the docker container, then
export GOOGLE_APPLICATION_CREDENTIALS=my-translate-creds.json
and use -e gcloud. For AWS, edit /root/.aws/credentials and customize it to your region and access information, then use -e amazon:
[default]
region = us-east-1
aws_access_key_id = your_aws_key
aws_secret_access_key = your_aws_secret_key
-
Split the output into separate files per chapter, if applicable. Use -d to specify the delimiter used to break the document into sections:
eoat-split -i myfile.txt-en-es.txt -d "Chapter"
-
Run eoat-make to create a makefile.
-
Run eoat-build en to build English deliverables, eoat-build es to build Spanish deliverables, etc. If you skipped the translation step, you can just run eoat-build.
Output will be saved in book_en.epub (for an English epub) and book_en.pdf (for an English PDF).
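If you’d rather not type all of that by hand, the steps above could be strung together in a small script run from the host. This is only a rough sketch, not something that ships with EOAT: the function names, the DRY_RUN switch, and the choice to run the tools from /root via docker exec -w are all assumptions, and the intermediate file names (myfile.txt, myfile.txt-en-es.txt) are inferred from the examples above. It also starts the container from the jenh/eoat:v2 tag rather than the image ID.

```shell
#!/bin/sh
# Sketch only: helper names, DRY_RUN, and the /root working directory are
# illustrative assumptions, not part of EOAT itself.
set -eu

IMAGE="jenh/eoat:v2"       # image tag pulled in the post
CONTAINER="EOAT"           # container name used in the post
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands, 0 = run them

# Print the command in dry-run mode (arguments are not re-quoted);
# otherwise execute it exactly as given.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Run a command inside the container, from /root.
in_container() {
  run docker exec -w /root "$CONTAINER" "$@"
}

# Host side: pull the image, start a detached container, copy the PDF in.
# $1 = PDF file name in the current directory.
eoat_setup() {
  run docker pull "$IMAGE"
  run docker run -t -d --name "$CONTAINER" "$IMAGE"
  run docker cp "$1" "$CONTAINER:/root/"
}

# Optional: fetch an extra Tesseract model, e.g. por for Portuguese.
# Assumes the URL pattern of the por.traineddata example holds for $1.
eoat_lang() {
  run docker exec -w /usr/local/share/tessdata "$CONTAINER" \
    wget "https://github.com/tesseract-ocr/tessdata_best/raw/master/$1.traineddata"
}

# Container side: OCR, translate, split, make, and build.
# $1 = PDF name, $2 = Tesseract code (eng), $3/$4 = source/target (en/es).
eoat_pipeline() {
  base="${1%.pdf}"
  in_container eoat-ocr "$1" "$2"
  in_container eoat-trans -i "${base}.txt" -s "$3" -t "$4" -w 4
  in_container eoat-split -i "${base}.txt-${3}-${4}.txt" -d "Chapter"
  in_container eoat-make
  in_container eoat-build "$3"
  in_container eoat-build "$4"
}
```

Save it as, say, eoat-helpers.sh (a made-up name), source it, and call eoat_setup myfile.pdf followed by eoat_pipeline myfile.pdf eng en es. By default it only prints the docker commands it would run; set DRY_RUN=0 to actually execute them. You’ll probably still want to stop after the OCR step and clean up the text file by hand before translating.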