Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fallback-encoding per-repo option for non-utf8 text files #388

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

tgulacsi
Copy link
Contributor

This allows to specify an alternate charset (fallback-encoding) per repository.

@tgulacsi
Copy link
Contributor Author

May help #78, #243 and #368, and my ISO8859-2 case :)

Copy link
Contributor

@salemhilal salemhilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, thank you for contributing! Encoding stuff can be pretty scary — can you add some tests to your changes? We're putting some eyes on this PR now otherwise, but we'd want some test coverage before we merge.

index/index.go Outdated
)

const (
matchLimit = 5000
manifestFilename = "metadata.gob"
excludedFileJsonFilename = "excluded_files.json"
filePeekSize = 2048
filePeekSize = 1 << 20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello! Why this big jump in filePeekSize?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That pesky comment that contains á that breaks UTF-8 detection may be at the end of the file.
This is not a full solution, just a bet that source files should be less than a megabyte long.

A full solution could be retrying the whole indexing on decoding error.
But that felt more involved - though I could try it if you confirm that'd be better.

@salemhilal
Copy link
Contributor

Hey! We're meeting now and generally like this PR. We have two asks before we merge it:

  1. Can you add some tests?
  2. Increasing filePeekSize might impact startup time. Can you give us any details on how much of an impact you are seeing, and if possible, a rough estimate of the size of your codebase?

Thank you again!

The big peek buf was needed to ensure that a non-utf8 rune at the end of
the file does get the attention and be tried with the fallback encoding.

A more robust approach is to just try the reading as-is, validating the
encoding, and try the fallback encoding if this fails.

Also add a test case.
@tgulacsi
Copy link
Contributor Author

Sorry for my slow reply, but now I've added a test case, and implemented the more robust retrying logic.
PTAL.

@tgulacsi
Copy link
Contributor Author

May I assist reviewing this PR?

pru-mike pushed a commit to pru-mike/hound that referenced this pull request Nov 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants