Using DuplicateOverlappingTextProcessor in HOcrTextExporter #867

KayTannee · 2024-07-09T13:36:10Z

I'm trying to see how I can use DuplicateOverlappingTextProcessor as part of the HOcrTextExporter process.

hocrTextExporter.Get() only accepts a Page as input, with the word extractor and page segmenter supplied in the constructor.

Whereas DuplicateOverlappingTextProcessor only returns a list of letters. There doesn't seem to be a defined way to get from one to the other.

I think the solution is to add an option to the word extractor options, and use it like so.

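        // Proposed usage: DeduplicateOverlappingText is the new option suggested below, it does not exist yet.
        // The options class is nested inside NearestNeighbourWordExtractor.
        var ops = new NearestNeighbourWordExtractor.NearestNeighbourWordExtractorOptions();
        ops.DeduplicateOverlappingText = true;
        var wordExtractor = new NearestNeighbourWordExtractor(ops);
        HOcrTextExporter hocrTextExporter = new HOcrTextExporter(wordExtractor, DocstrumBoundingBoxes.Instance);
        string hocrtext = hocrTextExporter.Get(page, useHocrjs: true);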

Having a look, I think only the two changes below are needed, both in the one class. I'm not able to test the code at the moment though.

UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor
NearestNeighbourWordExtractor.cs

    /// <summary>
    /// Get the words.
    /// </summary>
    /// <param name="letters">The page's letters to group into <see cref="Word"/>s.</param>
    /// <returns>The <see cref="Word"/>s generated by the nearest neighbour method.</returns>
    public IEnumerable<Word> GetWords(IReadOnlyList<Letter> letters)
    {
        if (letters == null || letters.Count == 0)
        {
            return Array.Empty<Word>();
        }

        // #### Change 1
        // Remove overlapping duplicates before grouping letters into words
        if (options.DeduplicateOverlappingText)
        {
            letters = DuplicateOverlappingTextProcessor.Get(letters);
        }

....

    /// <summary>
    /// Nearest neighbour word extractor options.
    /// </summary>
    public class NearestNeighbourWordExtractorOptions : IWordExtractorOptions
    {
        /// <summary>
        /// <inheritdoc/>
        /// Default value is -1.
        /// </summary>
        public int MaxDegreeOfParallelism { get; set; } = -1;

        // #### Change 2
        /// <summary>
        /// Uses DuplicateOverlappingTextProcessor to remove overlapping letters before GetWords.
        /// Default value is false.
        /// </summary>
        public bool DeduplicateOverlappingText { get; set; } = false;

Happy to use an alternative existing way of doing it, if there is one?
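
One possible workaround that needs no library change is a small wrapper that deduplicates the letters and then delegates to an existing word extractor. This is an untested sketch: DeduplicatingWordExtractor is a hypothetical name, not a PdfPig class, and it assumes HOcrTextExporter obtains its words by passing the page's letters to the supplied IWordExtractor.

```csharp
using System;
using System.Collections.Generic;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.Util;

/// <summary>
/// Hypothetical wrapper (not part of PdfPig): removes overlapping duplicate
/// letters, then hands the remaining letters to another word extractor.
/// </summary>
public sealed class DeduplicatingWordExtractor : IWordExtractor
{
    private readonly IWordExtractor inner;

    public DeduplicatingWordExtractor(IWordExtractor inner)
    {
        this.inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    public IEnumerable<Word> GetWords(IReadOnlyList<Letter> letters)
    {
        if (letters == null || letters.Count == 0)
        {
            return Array.Empty<Word>();
        }

        // Drop overlapping duplicates before the inner extractor groups letters into words.
        IReadOnlyList<Letter> deduplicated = DuplicateOverlappingTextProcessor.Get(letters);
        return inner.GetWords(deduplicated);
    }
}
```

It would then be passed in place of the plain extractor, for example new HOcrTextExporter(new DeduplicatingWordExtractor(NearestNeighbourWordExtractor.Instance), DocstrumBoundingBoxes.Instance), with the Get(page, useHocrjs: true) call unchanged.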

EliotJones · 2024-09-29T15:21:21Z

@BobLd do you know if this would be possible? AIUI the change would be to feed the deduplicated letters into the word detection algorithm. Does a pre-processing step/pipeline exist for such a thing today?

BobLd · 2024-09-29T17:56:46Z

I believe this is possible but I'd need to look into it, not sure how easy it is

EliotJones added the enhancement and document-reading (Related to reading documents) labels · Sep 29, 2024
