There are three different Tesseract model types that we can choose from:
tessdata_fast: Fast integer versions of trained LSTM models. Best "value for money" in speed vs accuracy, Integer models.
tessdata_best: Best (most accurate) trained LSTM models. Best results on Google's eval data, slower, Float models.
tessdata: Trained models with fast variant of the "best" LSTM models + legacy models. The LSTM models have been updated with Integer version of tessdata_best LSTM models.
Their differences are outlined in the following sources:
Most users will want tessdata_fast and that is what will be shipped as part of Linux distributions.
tessdata_best is for people willing to trade a lot of speed for slightly better accuracy. It is also the only set of files which can be used for certain retraining scenarios for advanced users.
The third set in tessdata is the only one that supports the legacy recognizer. [...]
tessdata and tessdata_best appears to exhibit comparable performance in terms of recognition accuracy. tessdata_fast, on the other hand, is marginally better than the former two models. And as expected, this model is also the fastest.
Best is what is says it is. For languages where we have eval data, it is the network configuration that yielded best results on the eval data.
Fast is a speed/accuracy compromise, based on my own judgement, as to what offered the best "value for money" in speed vs accuracy. For some languages, this is still best, but for most not.
[...] If you want best to run faster, it is easy to integerize "best" at the cost of a small loss in accuracy.
Size Comparison
| Model type | Compressed (MiB) | Uncompressed (MiB) |
| --- | --- | --- |
| tessdata_fast | 336 | 668 |
| tessdata_best | 638 | 1357 |
| tessdata | 638 | 1357 |
Using tessdata_fast shaves ~300MiB from the container image, and ~650MiB disk space.
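The savings quoted above can be recomputed from the table. The sketch below is only an illustration: the `SIZES_MIB` dictionary and the `savings` helper are hypothetical names, and the numbers are copied from the Size Comparison table. With those figures the differences come out to 302 MiB compressed and 689 MiB uncompressed, so the "~650MiB" disk figure reads as a rough estimate.

```python
# Sizes in MiB, copied from the Size Comparison table in this issue.
SIZES_MIB = {
    "tessdata_fast": {"compressed": 336, "uncompressed": 668},
    "tessdata_best": {"compressed": 638, "uncompressed": 1357},
    "tessdata": {"compressed": 638, "uncompressed": 1357},
}


def savings(current: str, candidate: str) -> dict:
    """Return the MiB saved per column when moving from `current` to `candidate`."""
    return {
        column: SIZES_MIB[current][column] - SIZES_MIB[candidate][column]
        for column in ("compressed", "uncompressed")
    }


if __name__ == "__main__":
    # Moving the container from tessdata to tessdata_fast.
    print(savings("tessdata", "tessdata_fast"))
```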
Dangerzone originally installed Tesseract language models from Alpine Linux, but due to some issues (#417), we resorted to downloading the tessdata language models directly from GitHub (#422).
Switching to a different model type is as simple as switching the repo we download tarballs from.
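As a rough sketch of what "switching the repo" could look like, the helper below builds a GitHub source-tarball URL for a given model type. This is not the actual Dockerfile logic, which is not reproduced in this issue: the `tarball_url` function is hypothetical, the three repository names are the upstream `tesseract-ocr` repos, and the `4.1.0` tag in the usage line is an assumption that should be checked against the tags each repo actually publishes.

```python
# Upstream model repositories under the tesseract-ocr GitHub organization.
MODEL_REPOS = ("tessdata", "tessdata_best", "tessdata_fast")


def tarball_url(model_type: str, tag: str) -> str:
    """Build the GitHub source-tarball URL for one model repo at a given tag."""
    if model_type not in MODEL_REPOS:
        raise ValueError(f"unknown model type: {model_type!r}")
    # Standard GitHub archive URL layout; only the repo name changes per model type.
    return (
        f"https://github.com/tesseract-ocr/{model_type}"
        f"/archive/refs/tags/{tag}.tar.gz"
    )


if __name__ == "__main__":
    # The tag is an assumed example value, not taken from the Dockerfile.
    print(tarball_url("tessdata", "4.1.0"))
    print(tarball_url("tessdata_fast", "4.1.0"))
```

Under these assumptions, the only difference between the two printed URLs is the repository segment, which matches the claim that the switch is a one-line change.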
Problem
In Qubes we plan to install the language packs via their RPM counterparts (#431 (comment)). This means that a regular Dangerzone installation will use the tessdata model type, whereas Dangerzone on Qubes will use the tessdata_fast model type.
Suggestion
We have an opportunity to bring these platforms in sync by using tessdata_fast in the Dangerzone container.
Arguments for switching to tessdata_fast:
Available in major Linux distros (although Alpine/Arch Linux deviate from this)
Slightly faster
Smaller size (-300MiB from package, -650MiB from disk)
Almost as accurate as the best models.
Arguments for staying with tessdata:
Backwards compatibility with previous versions of Dangerzone.
Do people see any reason not to use the tessdata_fast models in the Dangerzone container?
Most users will want tessdata_fast and that is what will be shipped as part of Linux distributions.
It seems worth considering offering the "slightly more accurate" models once we switch to an architecture with flexible downloads, but I agree for now this seems like an easy win to shave off download size and get cross-platform consistency.
Switch to the tessdata-fast Tesseract model, instead of the tessdata
one. The tessdata-fast Tesseract model is much smaller, and a bit faster
than the other one. Also, it's the model that Debian/Fedora ship by
default.
Closes #545
Distro support
Alpine Linux: tessdata model type. See https://git.alpinelinux.org/aports/tree/community/tesseract-ocr/APKBUILD
Debian: tessdata_fast model type. See https://tracker.debian.org/pkg/tesseract-lang and https://github.com/AlexanderP/tesseract-lang-debian/blob/master/debian/upstream/metadata
Fedora: tessdata_fast model type. See https://src.fedoraproject.org/rpms/tesseract-tessdata/blob/rawhide/f/tesseract-tessdata.spec
Dockerfile reference for the tarball download: dangerzone/Dockerfile, line 31 in 214ce97.