jmsquare / optical-character-recognition
This repository provides two functions for reading content and metadata: read.ocr for image-based PDF files and read.docx for Word documents. The read.ocr function uses Tesseract to perform optical character recognition (OCR) on an image PDF file. The read.docx function unzips the .docx file to access the XML files inside it and extracts the data from them.
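The unzip-then-parse approach described for read.docx can be sketched in Python as follows. This is an illustration only, not the repository's code: the function name read_docx_sketch and the shape of the returned dictionary are assumptions made for this example. It relies on the fact that a .docx file is a ZIP archive whose body text is stored in word/document.xml and whose core metadata, when present, is stored in docProps/core.xml.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard OOXML namespaces used inside a .docx archive.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DC_NS = "http://purl.org/dc/elements/1.1/"


def read_docx_sketch(source):
    """Hypothetical helper: return paragraphs and basic metadata from a .docx.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        # The main document body lives in word/document.xml.
        body = ET.fromstring(archive.read("word/document.xml"))
        paragraphs = []
        for para in body.iter(f"{{{W_NS}}}p"):
            # A paragraph's text is split across one or more <w:t> elements.
            text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
            paragraphs.append(text)

        # Core metadata is optional, so check that the part exists first.
        metadata = {}
        if "docProps/core.xml" in archive.namelist():
            core = ET.fromstring(archive.read("docProps/core.xml"))
            for field in ("title", "creator"):
                node = core.find(f"{{{DC_NS}}}{field}")
                if node is not None and node.text:
                    metadata[field] = node.text

    return {"paragraphs": paragraphs, "metadata": metadata}
```

Calling read_docx_sketch("report.docx") on a hypothetical file would return a dictionary with a "paragraphs" list and a "metadata" dictionary. Only the title and creator fields are read here to keep the sketch short; other core properties could be collected the same way.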