I'm trying to work out how to use DuplicateOverlappingTextProcessor as part of the HOcrTextExporter process.
hocrTextExporter.Get() only accepts a Page as input; the word extractor and page segmenter are passed in through the constructor.
DuplicateOverlappingTextProcessor, on the other hand, only returns a list of letters, and there doesn't seem to be a defined way to get from one to the other.
I think the solution is to add an option to the word extractor options, used like so:
var ops = new NearestNeighbourWordExtractorOptions();
ops.DeduplicateOverlappingText = true; // proposed option, does not exist yet
var wordExtractor = new NearestNeighbourWordExtractor(ops);
HOcrTextExporter hocrTextExporter = new HOcrTextExporter(wordExtractor, DocstrumBoundingBoxes.Instance);
string hocrtext = hocrTextExporter.Get(page, useHocrjs: true);
Having had a look, I think this only needs the two changes below, both in one class. I'm not able to test the code at the moment, though.
UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor
NearestNeighbourWordExtractor.cs
....
Is there an existing alternative way of doing this? Happy to use that instead.
@BobLd do you know if this would be possible? As I understand it, the change would be to feed the deduplicated letters into the word detection algorithm. Does a pre-processing step or pipeline for this exist today?
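One possible alternative that avoids changing the library: wrap the existing word extractor in a small decorator that deduplicates the letters before delegating. This is an untested sketch. `DeduplicatingWordExtractor` is a hypothetical name, not part of PdfPig, and it assumes that `HOcrTextExporter` passes the page's letters to the supplied `IWordExtractor` and that `DuplicateOverlappingTextProcessor.Get` accepts the letter list directly.

```csharp
using System.Collections.Generic;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.Util;

// Hypothetical helper (not part of PdfPig): removes duplicate overlapping
// letters before handing them to the wrapped word extractor.
public sealed class DeduplicatingWordExtractor : IWordExtractor
{
    private readonly IWordExtractor inner;

    public DeduplicatingWordExtractor(IWordExtractor inner)
    {
        this.inner = inner;
    }

    public IEnumerable<Word> GetWords(IReadOnlyList<Letter> letters)
    {
        // Assumption: Get returns the letters with overlapping duplicates removed.
        var deduplicated = DuplicateOverlappingTextProcessor.Get(letters);
        return inner.GetWords(deduplicated);
    }
}

// Usage, mirroring the snippet in the issue:
//   var wordExtractor = new DeduplicatingWordExtractor(NearestNeighbourWordExtractor.Instance);
//   var hocrTextExporter = new HOcrTextExporter(wordExtractor, DocstrumBoundingBoxes.Instance);
//   string hocrText = hocrTextExporter.Get(page, useHocrjs: true);
```

Because the wrapper only depends on the `IWordExtractor` interface, it should work with any word extractor, and no new option is needed on `NearestNeighbourWordExtractor`.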