DCR

Document Content Recognition

Preprocessor

  • Identifying scanned-image PDF documents using PyMuPDF.

  • Converting scanned-image PDF documents to a series of JPEG or PNG files using pdf2image and Poppler.

  • Converting bmp, gif, jp2, jpeg, png, pnm, tif, tiff or webp documents to PDF format using Tesseract OCR.

  • Converting csv, docx, epub, html, odt, rst or rtf documents to PDF format using Pandoc and TeX Live.
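The preprocessor steps above could be wired together roughly as in the following sketch. It is not DCR's actual code: the helper names (`plan_conversion`, `is_scanned_pdf`, `pdf_to_images`), the routing by file extension, the "no extractable text but at least one image on every page" heuristic, and the choice of `xelatex` as the TeX Live engine are all assumptions made for illustration. PyMuPDF and pdf2image are imported lazily, so the routing part runs without them installed.

```python
from __future__ import annotations

from pathlib import Path

# Extension groups taken from the preprocessor list above.
IMAGE_TYPES = {"bmp", "gif", "jp2", "jpeg", "png", "pnm", "tif", "tiff", "webp"}
PANDOC_TYPES = {"csv", "docx", "epub", "html", "odt", "rst", "rtf"}


def plan_conversion(path: str) -> tuple[str, list[str]]:
    """Return (tool, command line) for turning `path` into a PDF.

    Hypothetical helper: routing purely by extension is an assumption.
    """
    src = Path(path)
    ext = src.suffix.lower().lstrip(".")
    out = src.with_suffix(".pdf")
    if ext in IMAGE_TYPES:
        # Tesseract CLI: `tesseract <image> <output base> pdf` writes <output base>.pdf
        return "tesseract", ["tesseract", str(src), str(out.with_suffix("")), "pdf"]
    if ext in PANDOC_TYPES:
        # Pandoc needs a TeX engine for PDF output; xelatex is assumed here.
        return "pandoc", ["pandoc", str(src), "-o", str(out), "--pdf-engine=xelatex"]
    if ext == "pdf":
        # Already a PDF: nothing to convert.
        return "none", []
    raise ValueError(f"unsupported document type: {ext!r}")


def is_scanned_pdf(path: str) -> bool:
    """Heuristic (assumed): every page has no text layer but holds an image."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(path) as doc:
        return len(doc) > 0 and all(
            not page.get_text().strip() and bool(page.get_images())
            for page in doc
        )


def pdf_to_images(path: str, fmt: str = "png", dpi: int = 300) -> list:
    """Render each page to a PIL image via pdf2image (requires Poppler)."""
    from pdf2image import convert_from_path

    return convert_from_path(path, dpi=dpi, fmt=fmt)
```

A caller might pass the returned command to `subprocess.run(cmd, check=True)` and, for inputs that are already PDFs, call `is_scanned_pdf` to decide whether `pdf_to_images` is needed.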

Natural Language Processing (NLP)