hOcr2Pdf.NET is a library that programmers can use to create highly compressed, searchable PDFs in their applications.
Requirements:
.NET 4.0 or higher
Tesseract 3.0 with the ability to produce hOCR files, or Cuneiform for Linux
JBig2.exe (included), in the same directory as the DLL
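A quick way to catch a missing JBig2.exe early is to check for it at startup. The helper below is a minimal sketch and not part of hOcr2Pdf.NET: the class name JBig2Check and its methods are hypothetical, and it assumes the executable is named exactly JBig2.exe.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of hOcr2Pdf.NET): checks that JBig2.exe
// sits in the directory that holds the library DLL.
static class JBig2Check
{
    // Builds the path where JBig2.exe is expected to be.
    public static string ExpectedPath(string libraryDirectory)
    {
        return Path.Combine(libraryDirectory, "JBig2.exe");
    }

    // Returns true only if the directory is given and the file exists there.
    public static bool IsPresent(string libraryDirectory)
    {
        if (string.IsNullOrEmpty(libraryDirectory))
            return false;
        return File.Exists(ExpectedPath(libraryDirectory));
    }
}
```

One way to find the directory to pass in is Path.GetDirectoryName(typeof(PDFDoc).Assembly.Location), which resolves the folder the library was loaded from.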
Major Classes: PDFDoc (obtain an instance with PDFDoc.Open() or PDFDoc.Create())
Example Usage
Compress PDF to JBIG2
PDFDoc doc = PDFDoc.Open(file);
doc.CompressJBig2();
Get page image (JBIG2 and JPEG 2000 pages require Ghostscript to be installed)
PDFDoc doc = PDFDoc.Open(file);
doc.GetPageImage(1);
OCR PDF
PDFDoc doc = PDFDoc.Open(file);
doc.Ocr(Utils.OcrMode.Tesseract, "eng", WriteTextMode.Word, null);
Create a new PDF
PDFDoc doc = PDFDoc.Create(file);
doc.AddPage(img, PageSize.Letter);
doc.Rotate(...);
doc.Save();         // write changes to disk before OCR (see Tips)
doc.Ocr(...);
doc.Compress(...);
doc.Save();
Get object graph of hOCR document
hDocument d = OcrController.CreateHOCR(OcrMode.Tesseract, "eng", img);
foreach (var p in d.Pages)
    foreach (var para in p.Paragraphs)
        foreach (var l in para.Lines)
            foreach (var w in l.Words)
                Console.WriteLine(w.Text);
Tips
Be sure to Save() the PDF when using an image format that requires Ghostscript for extraction. For example,
if you compress a PDF to JBIG2 and then try to OCR it before calling Save(), the result is unpredictable. Save() writes any changes to disk so that Ghostscript can access the changed pages for image extraction.
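The ordering in this tip can be sketched with the calls already shown under Example Usage. This is an illustration rather than a definitive recipe: it needs a reference to hOcr2Pdf.NET, file is assumed to be the path of an existing PDF, and the final Save() is an assumption that follows the pattern of the Create example.

```csharp
// Sketch only: requires a reference to hOcr2Pdf.NET.
PDFDoc doc = PDFDoc.Open(file);
doc.CompressJBig2();   // pages are now JBIG2, which needs Ghostscript to extract
doc.Save();            // write the change to disk so Ghostscript can read the new pages
doc.Ocr(Utils.OcrMode.Tesseract, "eng", WriteTextMode.Word, null);
doc.Save();            // assumed: persist the OCR result, as in the Create example
```

Without the first Save(), the OCR step would ask Ghostscript for page images that have not yet been written to disk.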