bitVector: PDFBox and OCR content

Wednesday, February 8, 2012

PDFBox and OCR content

A couple of months ago I needed to retrieve OCR text from PDF files.
After some tests I chose the Apache PDFBox™ library.
It's easy to use. Here is what you need to do to get the text from the first page of your PDF file:

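PDDocument pdDocument = null;
string text = null;
try
{
    // Load the PDF from its full path on disk.
    pdDocument = PDDocument.load(_currentPdfFileDto.FullName);
    var stripper = new PDFTextStripper();
    stripper.setSortByPosition(true); // order text by its position on the page
    stripper.setStartPage(1);         // page numbers are 1-based
    stripper.setEndPage(1);           // same start and end: first page only
    text = stripper.getText(pdDocument); // keep the extracted text
}
finally
{
    // Always release the document, even if extraction fails.
    if (pdDocument != null) pdDocument.close();
}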
