Full Text Importer

MINIMUM IMPORTER VERSION: 3.20.6774.30656

EVENT: DocumentAfterImport

SYNOPSIS: Imports externally generated OCR data

DESCRIPTION: After each document is imported, Full Text Importer looks for a matching OCR file. The OCR file must have the same name as the document, including its extension, with .OCR or .OCRCSV appended depending on the format. For example, after a document named SCANNED.PDF is imported, this Add-In looks for a file named SCANNED.PDF.OCR (or SCANNED.PDF.OCRCSV). If that file exists, the Add-In uses its contents as the OCR data for that document.
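The naming rule above can be sketched in Python. This is an illustration only, not the Add-In's own code: the function names are hypothetical, and checking .OCR before .OCRCSV is an assumption, since the order of preference is not stated here.

```python
from pathlib import Path

# Suffixes appended to the full document name (extension included).
# The order in this tuple is an assumption, not documented behaviour.
OCR_SUFFIXES = (".OCR", ".OCRCSV")


def candidate_ocr_paths(document_path):
    """Return the OCR file paths that would match the given document."""
    doc = Path(document_path)
    # Append to the whole file name, so SCANNED.PDF becomes SCANNED.PDF.OCR.
    return [doc.with_name(doc.name + suffix) for suffix in OCR_SUFFIXES]


def find_ocr_file(document_path):
    """Return the first candidate OCR file that exists, or None."""
    for candidate in candidate_ocr_paths(document_path):
        if candidate.is_file():
            return candidate
    return None
```

A helper like this can be useful on the producing side to confirm that an OCR file has been written with the expected name before the document is handed to the importer.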

CONFIGURATION: There is no configuration needed for this Add-In.

FILE FORMATS: Two file formats are supported: OCR and OCRCSV.

OCR

This format is a JSON-encoded list of docMgt.dmREST.Word items. Reference the docMgt.RESTHelper DLL to get that object type. If you use this format, you must standardize your word coordinates to the docMgt standard yourself. The docMgt.RESTHelper.StandardizeFullTextCoordinates method translates your coordinates to the standard.

OCRCSV

This is a CSV file with no header row. The columns must be “Word, Page, Left, Top, Width, Height, Image Width, Image Height”, in that order. Here is a sample:

Word One,1,10,20,300,50,880,1100

Word Two,1,330,20,300,50,880,1100

Word Three,1,660,20,300,50,880,1100

That example contains 3 words in total, all on page 1 of the document. “Word One” comes from the zone at left 10, top 20, width 300, height 50 in an image that is 880 pixels wide and 1100 pixels tall, and so on. The last 2 columns (image size) are needed to standardize the coordinates to our normal coordinate system. Without them, the highlighting would be off.
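As a rough sketch of how an external OCR tool might produce an OCRCSV file, the Python below writes and reads rows in the column order described above. The OcrWord type and both function names are hypothetical. UTF-8 encoding, integer coordinates, and standard CSV quoting for words that contain commas are all assumptions, since this document does not specify them.

```python
import csv
from typing import NamedTuple


class OcrWord(NamedTuple):
    """One OCRCSV row; field order matches the documented column order."""
    text: str
    page: int
    left: int
    top: int
    width: int
    height: int
    image_width: int
    image_height: int


def write_ocrcsv(path, words):
    """Write words to an OCRCSV file with no header row."""
    # UTF-8 is an assumption; the expected encoding is not documented here.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for word in words:
            writer.writerow(word)


def read_ocrcsv(path):
    """Read an OCRCSV file back, checking that each row has 8 columns."""
    words = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row_number, row in enumerate(csv.reader(handle), start=1):
            if len(row) != 8:
                raise ValueError(
                    f"row {row_number}: expected 8 columns, got {len(row)}"
                )
            # Integer coordinates are assumed, as in the sample rows.
            words.append(OcrWord(row[0], *(int(value) for value in row[1:])))
    return words
```

For example, write_ocrcsv("SCANNED.PDF.OCRCSV", [OcrWord("Word One", 1, 10, 20, 300, 50, 880, 1100)]) produces the first row of the sample. Reading the file back with read_ocrcsv before import is a cheap way to catch rows with missing image-size columns.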