Project Description
hOcr2Pdf.NET is a .NET library, written in C#, that converts hOCR HTML output (.hocr files) produced by Tesseract or Cuneiform into highly compressed, searchable PDFs. It uses HtmlAgilityPack to parse the hOCR, JBIG2 to compress the page images, and iTextSharp to build the PDF.


Features
Special thanks to the developers of:
http://soft.rubypdf.com/software/windows-version-jbig2-encoder-jbig2-exe -- used to encode images to JBIG2
http://htmlagilitypack.codeplex.com/ -- used to parse the hOCR files
http://code.google.com/p/tesseract-ocr/ -- used for OCR and hOCR output
https://launchpad.net/cuneiform-linux/ -- used for OCR and hOCR output
http://itextpdf.com/ -- used to create and edit PDFs
http://www.ghostscript.com/download/gsdnld.html -- used for PDF page extraction