Extract Arabic Text From a PDF Document

Question

I am trying to write a program that extracts the text from PDF documents. The documents contain Arabic text set in different fonts. Extraction works for some files, but for others it returns unreadable text. I am using C# and iText 7. Please show me the approach for doing this, with some examples. Thank you.

My code:

    StringBuilder processed = new StringBuilder();
    var src = @"d:\text06.pdf";
    var pdfDocument = new PdfDocument(new PdfReader(src));
    var strategy = new LocationTextExtractionStrategy();
    for (int i = 1; i [...]
    [...] Converted_Lines = new List<string>();
    foreach (string s in lines)
    {
        string converted_string = Inverse(s);
        Converted_Lines.Add(converted_string);
    }
    textBox1.Text = String.Join(Environment.NewLine, Converted_Lines);
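The `Inverse` helper called above is not shown in the post. As a rough sketch only, assuming the extractor returns each line in visual (left-to-right glyph) order and with Arabic presentation-form code points, a helper along these lines could turn a line back into logical order. The names `ArabicTextFixup` and `ToLogicalOrder` are hypothetical; only standard .NET calls are used. The line is reversed first and normalized second, so that a ligature such as lam-alef expands in the right order.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical stand-in for the poster's Inverse helper (an assumption, not the original code).
public static class ArabicTextFixup
{
    public static string ToLogicalOrder(string visualLine)
    {
        // Split into text elements so combining marks and surrogate pairs
        // stay attached to their base character when the order is flipped.
        var elements = new List<string>();
        TextElementEnumerator it = StringInfo.GetTextElementEnumerator(visualLine);
        while (it.MoveNext())
        {
            elements.Add(it.GetTextElement());
        }

        // Reverse the element order (visual order -> logical order).
        var reversed = new StringBuilder(visualLine.Length);
        for (int i = elements.Count - 1; i >= 0; i--)
        {
            reversed.Append(elements[i]);
        }

        // NFKC folds Arabic presentation forms (U+FB50-U+FDFF, U+FE70-U+FEFF)
        // back to the base letters, expanding ligatures after the reversal.
        return reversed.ToString().Normalize(NormalizationForm.FormKC);
    }
}
```

This would only help for files where the extracted text is reversed or made of presentation forms. Lines that mix Arabic with digits or Latin words need a proper bidirectional reordering rather than a plain reversal, and a font with no usable Unicode mapping cannot be repaired this way at all.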

Marvin Reid · Answer

You can try the LEADTOOLS LEADDocument class, which parses the text from PDFs. It has an embedded OCR engine that can recognize Arabic text if the PDF is scanned. It supports various font types and should resolve the font issue here. You can use the following code to extract the text:

    string pdfFileName = @"Input.pdf";
    using (LEADDocument _document = DocumentFactory.LoadFromFile(pdfFileName, new LoadDocumentOptions()))
    {
        // Attach an Arabic OCR engine so scanned pages can be recognized too
        IOcrEngine _ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.OmniPageArabic);
        _ocrEngine.Startup(null, null, null, null);
        _document.Text.OcrEngine = _ocrEngine;

        // Print the text of every page
        foreach (DocumentPage _page in _document.Pages)
        {
            DocumentPageText _pageText = _page.GetText();
            _pageText.BuildText();
            string _text = _pageText.Text;
            Console.WriteLine($"Page Number: {_page.PageNumber}\n");
            Console.WriteLine($"{_text}\n");
        }
    }

https://www.leadtools.com/help/sdk/v21/tutorials/dotnet-console-parse-the-text-of-a-document.html

Uehara Ayaka · Answer

The following link might be helpful. https://www.e-iceblue.com/Tutorials/Spire.PDF/Spire.PDF-Program-Guide/Extract/Read/Extract-Text-from-PDF-Document-using-SimpleTextExtractionStrategy.html