Using DuplicateOverlappingTextProcessor in HOcrTextExporter #867

Open
KayTannee opened this issue Jul 9, 2024 · 2 comments
Labels
document-reading Related to reading documents enhancement

Comments

@KayTannee

I'm trying to work out how to use DuplicateOverlappingTextProcessor as part of the HOcrTextExporter process.

hocrTextExporter.Get() only accepts a Page as input, along with the word extractor and page segmenter passed to the constructor.

Whereas DuplicateOverlappingTextProcessor only returns a list of letters. There doesn't seem to be a defined way to get from one to the other.

I think the solution is to add an option to the word extractor options and use it like so:

        var ops = new NearestNeighbourWordExtractorOptions();
        ops.DeduplicateOverlappingText = true;
        var wordExtractor = new NearestNeighbourWordExtractor(ops);
        HOcrTextExporter hocrTextExporter = new HOcrTextExporter(wordExtractor, DocstrumBoundingBoxes.Instance);
        string hocrtext = hocrTextExporter.Get(page, useHocrjs: true);

Having had a look, I think only the two changes below are needed, both in the one class. I'm not able to test the code at the moment, though.

UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor
NearestNeighbourWordExtractor.cs

    /// <summary>
    /// Get the words.
    /// </summary>
    /// <param name="letters">The page's letters to group into <see cref="Word"/>s.</param>
    /// <returns>The <see cref="Word"/>s generated by the nearest neighbour method.</returns>
    public IEnumerable<Word> GetWords(IReadOnlyList<Letter> letters)
    {
        if (letters == null || letters.Count == 0)
        {
            return Array.Empty<Word>();
        }

        // #### Change 1
        // Remove overlapping duplicates
        if (options.DeduplicateOverlappingText)
        {
            letters = DuplicateOverlappingTextProcessor.Get(letters);
        }

....

    /// <summary>
    /// Nearest neighbour word extractor options.
    /// </summary>
    public class NearestNeighbourWordExtractorOptions : IWordExtractorOptions
    {
        /// <summary>
        /// <inheritdoc/>
        /// Default value is -1.
        /// </summary>
        public int MaxDegreeOfParallelism { get; set; } = -1;

        // #### Change 2
        /// <summary>
        /// Uses DuplicateOverlappingTextProcessor to remove overlapping letters before GetWords.
        /// Default value is false.
        /// </summary>
        public bool DeduplicateOverlappingText { get; set; } = false;

Happy to hear if there's an existing alternative way of doing this.
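
One possible route that would not need a library change, as an untested sketch: wrap any existing word extractor in a small decorator that deduplicates the letters before delegating. DeduplicatingWordExtractor is a hypothetical name, not part of PdfPig. The sketch assumes IWordExtractor only requires GetWords(IReadOnlyList<Letter>), that DuplicateOverlappingTextProcessor.Get accepts the letters and returns a read-only list (as the snippet above implies), and that the namespaces in the using directives are right for the installed version.

```csharp
using System;
using System.Collections.Generic;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.Util;

// Hypothetical wrapper, not part of PdfPig: removes overlapping duplicate
// letters, then hands the remaining letters to the wrapped word extractor.
public sealed class DeduplicatingWordExtractor : IWordExtractor
{
    private readonly IWordExtractor inner;

    public DeduplicatingWordExtractor(IWordExtractor inner)
    {
        this.inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    public IEnumerable<Word> GetWords(IReadOnlyList<Letter> letters)
    {
        if (letters == null || letters.Count == 0)
        {
            return Array.Empty<Word>();
        }

        // Assumption: Get returns IReadOnlyList<Letter>, matching the
        // assignment in "Change 1" above.
        IReadOnlyList<Letter> deduplicated = DuplicateOverlappingTextProcessor.Get(letters);
        return inner.GetWords(deduplicated);
    }
}
```

If that holds, the exporter could be built with new HOcrTextExporter(new DeduplicatingWordExtractor(new NearestNeighbourWordExtractor()), DocstrumBoundingBoxes.Instance) and called with hocrTextExporter.Get(page, useHocrjs: true) as before. This only works if the exporter passes the page's letters through the supplied word extractor, which I haven't verified.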

@EliotJones
Member

@BobLd do you know if this would be possible? As I understand it, the change would be to feed the deduplicated letters into the word detection algorithm. Does a pre-processing step or pipeline exist for such a thing today?

@EliotJones EliotJones added enhancement document-reading Related to reading documents labels Sep 29, 2024
@BobLd
Collaborator

BobLd commented Sep 29, 2024

I believe this is possible, but I'd need to look into it. I'm not sure how easy it is.
