DCR
Document Content Recognition
Preprocessor
- Identifying scanned image pdf documents using PyMuPDF.
- Converting scanned image pdf documents to a series of jpeg or png files using pdf2image and Poppler.
- Converting bmp, gif, jp2, jpeg, png, pnm, tif, tiff or webp type documents to pdf format using Tesseract OCR.
- Converting csv, docx, epub, html, odt, rst or rtf type documents to pdf format using Pandoc and TeX Live.
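
The preprocessor steps above amount to routing each input file to a converter by its type. The following is a minimal sketch of that routing, not DCR's actual code: the function names `choose_step` and `is_scanned_pdf` and the step labels are hypothetical, and the scanned-pdf check (no extractable text on any page, read via PyMuPDF) is only one plausible heuristic, not necessarily the rule DCR applies.

```python
from pathlib import Path
from typing import Callable

# File types taken from the list above; the step labels are illustrative only.
TESSERACT_TYPES = {"bmp", "gif", "jp2", "jpeg", "png", "pnm", "tif", "tiff", "webp"}
PANDOC_TYPES = {"csv", "docx", "epub", "html", "odt", "rst", "rtf"}


def is_scanned_pdf(path: str) -> bool:
    """Assumed heuristic: a pdf without any extractable text is a scan."""
    import fitz  # PyMuPDF, imported lazily so the routing works without it

    with fitz.open(path) as doc:
        return all(not page.get_text().strip() for page in doc)


def choose_step(path: str, scanned_check: Callable[[str], bool] = is_scanned_pdf) -> str:
    """Return the (hypothetical) label of the step a file is routed to."""
    suffix = Path(path).suffix.lower().lstrip(".")
    if suffix == "pdf":
        # Scanned pdfs are rendered to images first; text pdfs go straight on.
        return "pdf2image" if scanned_check(path) else "parser"
    if suffix in TESSERACT_TYPES:
        return "tesseract"
    if suffix in PANDOC_TYPES:
        return "pandoc"
    return "unsupported"
```

Passing `scanned_check` as a parameter keeps the routing testable without PyMuPDF installed; for example, `choose_step("report.docx")` returns `"pandoc"`.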
Natural Language Processing (NLP)
- Extracting text and metadata from pdf documents using PDFlib TET.
- Categorising the lines of the document, e.g. as body, header or footer lines.
- Determining the token structure sentence by sentence using spaCy.
- Storing the analysis result, optionally in a PostgreSQL database or in a JSON flat file.
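
To illustrate the line categorisation and the JSON storage mentioned above, here is a small sketch using only the Python standard library. The rule it applies is an assumption made for illustration, not DCR's documented algorithm: the first (or last) line of a page counts as a header (or footer) when the same text, with digit runs masked, opens (or closes) more than half of the pages and at least two of them. The names `categorise_lines` and `to_json` are hypothetical.

```python
import json
import re
from collections import Counter
from typing import Dict, List


def _normalise(line: str) -> str:
    # Mask digit runs so that "Page 1" and "Page 2" compare equal.
    return re.sub(r"\d+", "#", line.strip())


def categorise_lines(pages: List[List[str]]) -> List[Dict]:
    """Label every line as 'header', 'footer' or 'body' (assumed rule)."""
    non_empty = [p for p in pages if p]
    first = Counter(_normalise(p[0]) for p in non_empty)
    last = Counter(_normalise(p[-1]) for p in non_empty)
    # A repeat must cover more than half the pages and at least two pages.
    threshold = max(1, len(non_empty) / 2)

    result = []
    for page_no, lines in enumerate(pages, start=1):
        for idx, text in enumerate(lines):
            key = _normalise(text)
            if idx == 0 and first[key] > threshold:
                kind = "header"
            elif idx == len(lines) - 1 and last[key] > threshold:
                kind = "footer"
            else:
                kind = "body"
            result.append(
                {"page": page_no, "line": idx + 1, "type": kind, "text": text}
            )
    return result


def to_json(records: List[Dict]) -> str:
    """Serialise the categorised lines, e.g. for a JSON flat file."""
    return json.dumps(records, indent=2)
```

Only lines labelled `body` would then normally be passed on to sentence and token analysis; the flat list of records keeps the output easy to write to a file or load into a database table.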