Settle on Tesseract model type #545

apyrgio · 2023-09-11T12:52:04Z

Background

There are three different Tesseract model types that we can choose from:

- tessdata_fast: Fast integer versions of trained LSTM models. Best "value for money" in speed vs. accuracy. Integer models.
- tessdata_best: Best (most accurate) trained LSTM models. Best results on Google's eval data, but slower. Float models.
- tessdata: Trained models with a fast variant of the "best" LSTM models, plus legacy models. The LSTM models have been updated with an integer version of the tessdata_best LSTM models.

Their differences are outlined in the following sources:

https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files.md
- Most users will want tessdata_fast and that is what will be shipped as part of Linux distributions.
- tessdata_best is for people willing to trade a lot of speed for slightly better accuracy. It is also the only set of files which can be used for certain retraining scenarios for advanced users.
- The third set in tessdata is the only one that supports the legacy recognizer. [...]
https://towardsdatascience.com/googles-tesseract-ocr-how-good-is-it-on-documents-d71d4bf7640?gi=82d346e1e9e8
- tessdata and tessdata_best appear to exhibit comparable performance in terms of recognition accuracy. tessdata_fast, on the other hand, is marginally better than the former two models. And as expected, this model is also the fastest.
fast vs. best tesseract-ocr/tesseract#1404
- Best is what it says it is. For languages where we have eval data, it is the network configuration that yielded best results on the eval data.
- Fast is a speed/accuracy compromise, based on my own judgement, as to what offered the best "value for money" in speed vs accuracy. For some languages, this is still best, but for most not.
  [...] If you want best to run faster, it is easy to integerize "best" at the cost of a small loss in accuracy.

Size Comparison

| Model type | Compressed (MiB) | Uncompressed (MiB) |
| --- | --- | --- |
| `tessdata_fast` | 336 | 668 |
| `tessdata_best` | 638 | 1357 |
| `tessdata` | 638 | 1357 |

Using tessdata_fast instead of tessdata shaves roughly 300 MiB off the container image and roughly 690 MiB of disk space.
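
The savings can be recomputed directly from the table. The following is a minimal Python sketch of that arithmetic; the `SIZES_MIB` dictionary and the `savings_mib` helper are illustrative names, and the figures are simply copied from the table above.

```python
# Sizes in MiB, copied from the size comparison table above.
SIZES_MIB = {
    "tessdata_fast": {"compressed": 336, "uncompressed": 668},
    "tessdata_best": {"compressed": 638, "uncompressed": 1357},
    "tessdata": {"compressed": 638, "uncompressed": 1357},
}


def savings_mib(current: str, candidate: str) -> dict[str, int]:
    """Return how many MiB are saved by moving from `current` to `candidate`."""
    return {
        kind: SIZES_MIB[current][kind] - SIZES_MIB[candidate][kind]
        for kind in ("compressed", "uncompressed")
    }


if __name__ == "__main__":
    # Moving the container from tessdata to tessdata_fast.
    print(savings_mib("tessdata", "tessdata_fast"))
```

With the table's figures, this gives 302 MiB for the compressed download and 689 MiB on disk.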

Distro support

Alpine Linux offers the tessdata model type. See https://git.alpinelinux.org/aports/tree/community/tesseract-ocr/APKBUILD
Debian offers the tessdata-fast model type. See https://tracker.debian.org/pkg/tesseract-lang and https://github.com/AlexanderP/tesseract-lang-debian/blob/master/debian/upstream/metadata
Fedora offers the tessdata-fast model type. See https://src.fedoraproject.org/rpms/tesseract-tessdata/blob/rawhide/f/tesseract-tessdata.spec

Dangerzone originally installed Tesseract language models from Alpine Linux, but due to some issues (#417), we resorted to downloading the tessdata language models directly from GitHub (#422). See:

dangerzone/Dockerfile, line 31 in 214ce97:

    && wget https://github.com/tesseract-ocr/tessdata/archive/$TESSDATA_VERSION/tessdata-$TESSDATA_VERSION.tar.gz \

Switching to a different model type is as simple as switching the repo we download tarballs from.
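
As a rough illustration of that point, the sketch below builds the tarball URL for each model type, following the URL pattern of the Dockerfile line above. It is written in Python for brevity rather than as Dockerfile syntax. The helper name `tessdata_tarball_url` and the example version string are hypothetical, and the sketch assumes that the tessdata_fast and tessdata_best repositories can be fetched with the same archive URL layout as tessdata.

```python
# Upstream model repositories under the tesseract-ocr GitHub organization.
MODEL_REPOS = ("tessdata", "tessdata_fast", "tessdata_best")


def tessdata_tarball_url(model_type: str, version: str) -> str:
    """Build the GitHub archive URL for a model type (hypothetical helper)."""
    if model_type not in MODEL_REPOS:
        raise ValueError(f"unknown model type: {model_type!r}")
    return (
        f"https://github.com/tesseract-ocr/{model_type}"
        f"/archive/{version}/{model_type}-{version}.tar.gz"
    )


if __name__ == "__main__":
    # "4.1.0" is only a placeholder standing in for $TESSDATA_VERSION.
    for repo in MODEL_REPOS:
        print(tessdata_tarball_url(repo, "4.1.0"))
```

Only the repository name changes between model types, which is why the switch amounts to a one-line change in the Dockerfile.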

Problem

In Qubes we plan to install the language packs via their RPM counterparts (#431 (comment)). This means that a regular Dangerzone installation will use the tessdata model type, whereas Dangerzone on Qubes will use the tessdata-fast model type.

Suggestion

We have an opportunity to bring these platforms in sync by using tessdata-fast in the Dangerzone container.

Arguments for switching to tessdata-fast:

- Available in major Linux distros (although Alpine and Arch Linux deviate from this)
- Slightly faster
- Smaller size (about 300 MiB less in the package, about 690 MiB less on disk)
- Almost as accurate as the best models

Arguments for staying with tessdata:

- Backwards compatibility with previous versions of Dangerzone

Does anyone see a reason not to use the tessdata-fast models in the Dangerzone container?

Related Issues


eloquence · 2023-09-11T16:48:09Z

I think the quoted line is pretty persuasive:

Most users will want tessdata_fast and that is what will be shipped as part of Linux distributions.

It seems worth considering offering the "slightly more accurate" models once we switch to an architecture with flexible downloads, but I agree for now this seems like an easy win to shave off download size and get cross-platform consistency.

Switch to the tessdata-fast Tesseract model, instead of the tessdata one. The tessdata-fast Tesseract model is much smaller, and a bit faster than the other one. Also, it's the model that Debian/Fedora ship by default. Closes #545

apyrgio added container OCR labels Sep 11, 2023

apyrgio added this to the 0.5.0 milestone Sep 11, 2023

apyrgio mentioned this issue Sep 18, 2023: Switch to tessdata-fast Tesseract model #548 (Merged)

apyrgio closed this as completed in cbca911 Sep 25, 2023
