Paperminer is a system for amending documents stored in Paperless-ngx with additional information ("facts") extracted from the documents themselves or other sources.
The hansmi/dossier package is used to parse PDF documents (support for other
formats could be implemented).
The Go programming language's plugin package comes with a number of caveats
which make it unsuitable. Compile-time plugins via the hansmi/staticplug
package are used instead. It's therefore necessary to set up your own build.
An example of a program with a plugin can be found in the example/myminer
directory.
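The general idea of compile-time plugins can be sketched with the standard library alone. The following is only an illustration of the pattern and does not use the actual hansmi/staticplug API; the `Facter` interface, `Register`, `RegisteredNames` and `myFacter` are hypothetical names chosen for this example. See example/myminer for the real setup.

```go
package main

import (
	"fmt"
	"sort"
)

// Facter is a hypothetical plugin interface used only for this sketch.
// It is not the interface defined by Paperminer or staticplug.
type Facter interface {
	Name() string
}

// registry holds all plugins linked into the binary at build time.
var registry = map[string]Facter{}

// Register adds a plugin to the registry and panics on duplicate names.
func Register(f Facter) {
	if _, dup := registry[f.Name()]; dup {
		panic("duplicate plugin: " + f.Name())
	}
	registry[f.Name()] = f
}

// RegisteredNames returns the names of all registered plugins, sorted.
func RegisteredNames() []string {
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// myFacter is a hypothetical plugin implementation.
type myFacter struct{}

func (myFacter) Name() string { return "myfacter" }

// In a real build this init function would typically live in the plugin's
// own package, which the main program links in with a blank import.
func init() {
	Register(myFacter{})
}

func main() {
	fmt.Println(RegisteredNames())
}
```

Because registration happens while the program is compiled and linked, adding or removing a plugin means rebuilding the binary, which is why a custom build is needed.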
Plugins may use dossier sketches to look for specific regular expressions at
absolute or relative positions on pages. The sketchfacts package is often
sufficient even though it ignores pages beyond the first. Custom logic can
produce document facts from the findings.
Plugins may also extract arbitrary document pages and implement their own data extraction. External APIs may also be involved.
Normalizing extracted text before parsing it further is generally recommended, and not just for dates and times: remove extraneous whitespace, separators and the like. Regular expressions should also be written to be flexible where possible, because OCR-derived text often differs slightly from the original.
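A minimal sketch of such normalization, using only the Go standard library. The function names and the invoice number pattern are hypothetical and not part of Paperminer; real documents will need patterns adapted to their layout.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// whitespaceRe matches runs of ASCII whitespace and Unicode space
// separators such as the non-breaking space (U+00A0).
var whitespaceRe = regexp.MustCompile(`[\s\p{Zs}]+`)

// normalizeText collapses whitespace runs into single spaces and trims
// leading and trailing whitespace.
func normalizeText(s string) string {
	return strings.TrimSpace(whitespaceRe.ReplaceAllString(s, " "))
}

// invoiceNumberRe is a hypothetical, deliberately tolerant pattern: it
// ignores case, accepts several label spellings and optional punctuation,
// and allows spaces or hyphens inside the number.
var invoiceNumberRe = regexp.MustCompile(
	`(?i)invoice\s*(?:no|number|nr)?\s*[.:#]*\s*([0-9][0-9 -]*)`)

// separatorStripper removes the separators tolerated inside the number.
var separatorStripper = strings.NewReplacer(" ", "", "-", "")

// extractInvoiceNumber returns the digits of the first invoice number
// found in the text and reports whether a match was found.
func extractInvoiceNumber(text string) (string, bool) {
	m := invoiceNumberRe.FindStringSubmatch(normalizeText(text))
	if m == nil {
		return "", false
	}
	return separatorStripper.Replace(m[1]), true
}

func main() {
	number, ok := extractInvoiceNumber("Invoice  No.:\u00a012-345 678\n")
	fmt.Println(number, ok)
}
```

Normalizing first keeps the pattern itself shorter, since it only has to deal with single spaces rather than every whitespace variant an OCR engine may emit.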
Useful packages for writing document facters:
- hansmi/zyt: Parse language/locale-specific date and time formats.
- hansmi/aurum: Golden tests. Used for generic document facter tests by the factertest package.