DCR

Document Content Recognition

Preprocessor

Identifying scanned-image PDF documents using PyMuPDF.
Converting scanned-image PDF documents to a series of JPEG or PNG files using pdf2image and Poppler.
Converting bmp, gif, jp2, jpeg, png, pnm, tif, tiff or webp documents to PDF format using Tesseract OCR.
Converting csv, docx, epub, html, odt, rst or rtf documents to PDF format using Pandoc and TeX Live.
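
The routing implied by the list above can be sketched as a lookup on the file extension. This is only an illustration: the extension groups are taken from the list, while the function name `choose_preprocessing_step` and the returned labels are hypothetical and not part of DCR itself.

```python
from pathlib import Path

# Extension groups copied from the list above; the step labels are illustrative.
TESSERACT_TYPES = {"bmp", "gif", "jp2", "jpeg", "png", "pnm", "tif", "tiff", "webp"}
PANDOC_TYPES = {"csv", "docx", "epub", "html", "odt", "rst", "rtf"}


def choose_preprocessing_step(file_name: str) -> str:
    """Return a label for the preprocessing step a file would need (hypothetical helper)."""
    # Normalise the extension: drop the leading dot and ignore case.
    extension = Path(file_name).suffix.lower().lstrip(".")
    if extension == "pdf":
        # PDF files still need the scanned-image check before text extraction.
        return "check_scanned_pdf"
    if extension in TESSERACT_TYPES:
        return "tesseract"
    if extension in PANDOC_TYPES:
        return "pandoc"
    return "unsupported"
```

For example, `choose_preprocessing_step("scan.TIFF")` yields the `"tesseract"` label, while a file with an extension outside both groups is reported as `"unsupported"`.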

Natural Language Processing (NLP)

Extracting text and metadata from PDF documents using PDFlib TET.
Categorising the lines of the document, e.g. as body, footer or header lines.
Determining the token structure sentence by sentence using spaCy.
Optionally storing the analysis results in a PostgreSQL database or a JSON flat file.
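
To make the line categorisation step more concrete, here is a minimal sketch of one possible heuristic: a line counts as a header (or footer) when it is the first (or last) line on more than half of the pages. This heuristic is an assumption for illustration and is not necessarily how DCR categorises lines; the function name `categorise_lines` and the input layout (a list of pages, each a list of line strings) are likewise hypothetical.

```python
from __future__ import annotations

from collections import Counter


def categorise_lines(pages: list[list[str]]) -> list[list[tuple[str, str]]]:
    """Tag every line as 'header', 'footer' or 'body' (illustrative heuristic only)."""
    # Count how often each text appears as the first or last line of a page.
    first_counts = Counter(page[0] for page in pages if page)
    last_counts = Counter(page[-1] for page in pages if page)
    threshold = len(pages) / 2

    result = []
    for page in pages:
        tagged = []
        for index, line in enumerate(page):
            if index == 0 and first_counts[line] > threshold:
                kind = "header"
            elif index == len(page) - 1 and last_counts[line] > threshold:
                kind = "footer"
            else:
                kind = "body"
            tagged.append((line, kind))
        result.append(tagged)
    return result
```

The nested lists of `(line, kind)` pairs serialise directly with `json.dumps`, which matches the JSON flat-file option mentioned above. A real implementation would also need to cope with footers that vary from page to page, such as page numbers.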

