Support more OCR languages #422
bf43cb7 to 795b264
I've tried downloading the whole data and it's 638MB in […]. Without any training data, the tar.gz container is 380MB and the built RPM is 365MB. But it's not too bad compared to the 600MB download we had. In the future, we can probably look into downloading the models dynamically, although with Docker images that will introduce issues of its own.
This one also fixes #357
Yes, I've noticed the inflation in the size as well. That's the price we pay for extra language coverage. As for downloading extra languages dynamically, that would be very interesting. The Dangerzone client could:
This will break air-gapped systems, so I'm not proposing it for now. If at some point we make this distinction though, we could achieve pretty slim images.
Grab Tesseract's trained models from GitHub, instead of from the Alpine Linux repos. Over the past few months, the models in the Alpine Linux repos did not remain stable, leading to CI issues.

Since the models are already pre-trained and available through Tesseract's repo on GitHub, we can use the release tarball that they offer to install them in the container image, which is basically what the upstream packages are doing as well.

In order to make sure that we have no regressions, at the time of this commit we ensured that the hashes of the models offered through the Alpine Linux repos and the models offered from the GitHub release are the same.

Also, in order to detect future regressions or foul play, we check the downloaded models against a known checksum. Given that these models change every few years, updating the checksum should not be an issue.

Fix #357
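For illustration, the checksum check described above could look roughly like the following. This is only a sketch and not the code in this PR: the function names `sha256_of` and `verify_checksum` are hypothetical, and the use of SHA-256 is an assumption, since the commit message does not say which hash is used.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that a large tarball is never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_sha256):
    """Raise ValueError if the file does not match the pinned checksum.

    `expected_sha256` is a hypothetical value that would be pinned in the
    repo and updated whenever upstream publishes new models.
    """
    actual = sha256_of(path)
    if actual != expected_sha256.strip().lower():
        raise ValueError(
            f"Checksum mismatch for {path}: "
            f"expected {expected_sha256}, got {actual}"
        )
```

Failing hard on a mismatch means that a changed or tampered tarball stops the build instead of silently ending up in the container image.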
Restore the OCR languages to the state they were in 66d3c40, with some minor changes. We can now do so because we download all the trained models, not just the ones that Alpine Linux offers.
Remove the Kurdish (Arabic) language ("kur_ara") from the list of languages that we offer for OCR, since it's not included in the installed languages. Interestingly, it is not present in the Alpine Linux repos either, so this was probably an omission in the first place.
Test that the languages that we provide to users for OCR match the languages that are installed in the container image.

Fixes #417
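As a rough idea of what such a test compares, here is a sketch and not the actual test in this PR. It assumes that installed models are `*.traineddata` files in a single tessdata directory, and that non-language models such as `osd` should be ignored. The helper names and the `ignore` parameter are hypothetical.

```python
from pathlib import Path


def installed_languages(tessdata_dir, ignore=("osd", "equ")):
    """Return the language codes that have a .traineddata file installed."""
    codes = {p.stem for p in Path(tessdata_dir).glob("*.traineddata")}
    # Assumption: these entries are not selectable OCR languages.
    return codes - set(ignore)


def diff_languages(offered, installed):
    """Return (offered but not installed, installed but not offered)."""
    offered, installed = set(offered), set(installed)
    return offered - installed, installed - offered
```

A test could then assert that both returned sets are empty. A non-empty first set is exactly the "kur_ara" situation described above: a language offered to users without a matching installed model.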
Ditch the Alpine Linux packages for Tesseract OCR languages, in favor of directly downloading them from the source.
Fixes #417