Configurable confidence calculations with unit tests #234
base: dev
Conversation
…ation
Refactor confidence calculation to simplify logic, document process, and add logging
Outline unit tests for confidence calculations
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##              dev      #234      +/-   ##
==========================================
- Coverage   53.41%    49.85%    -3.57%
==========================================
  Files          37        37
  Lines        4362      4379       +17
==========================================
- Hits         2330      2183      -147
- Misses       2032      2196      +164

☔ View full report in Codecov by Sentry.
Actionable comments posted: 0
Outside diff range and nitpick comments (1)
ovos_workshop/skills/common_query_skill.py (1)
73-75: Address the TODO comment.

Please review the TODO comment and consider adjusting the default values for response_sentences_multiplier and response_words_multiplier. The current values may inflate the confidence score unnecessarily.

Do you want me to suggest alternative default values or open a GitHub issue to track this task?
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- ovos_workshop/skills/common_query_skill.py (5 hunks)
- test/unittests/skills/test_common_query_skill.py (1 hunks)
Additional comments not posted (6)
test/unittests/skills/test_common_query_skill.py (1)
47-63: Excellent test coverage for the confidence calculation logic!

The test_calc_confidence method effectively tests the calc_confidence function by evaluating the confidence levels for various query types. It covers different scenarios, including generic queries, specific queries, and low-confidence queries.

The assertions are well-defined and ensure that the confidence levels are calculated correctly based on the query type and match level. This test method enhances the overall test coverage and helps maintain the correctness and reliability of the confidence calculation mechanism.

Great job on implementing this comprehensive test method!
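As a rough illustration of the ordering assertions described above, a test could look like the sketch below. The argument order of calc_confidence, the query strings, and the expected relationships are assumptions rather than the PR's exact test code, and CQSMatchLevel is assumed to be imported in the test module.

def test_calc_confidence_ordering(self):
    # Sketch: assumes calc_confidence(match, phrase, level, answer)
    answer = "Stuff is a general term for things."
    exact = self.skill.calc_confidence(
        "stuff", "tell me about stuff", CQSMatchLevel.EXACT, answer)
    general = self.skill.calc_confidence(
        "stuff", "what is stuff", CQSMatchLevel.GENERAL, answer)
    # Scores should stay within [0, 1] and an EXACT match should outrank
    # a GENERAL one for a comparable answer
    self.assertLessEqual(exact, 1.0)
    self.assertLessEqual(general, 1.0)
    self.assertGreater(exact, general)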
ovos_workshop/skills/common_query_skill.py (5)
29-29: LGTM!

The comment correction for the GENERAL level provides better clarity on its purpose.
65-70: Configurable parameters enhance flexibility.

The addition of instance variables for configurable confidence calculation parameters is a great improvement. It allows for fine-tuning the confidence scores based on the specific requirements of each skill.
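For context, a skill could tune these parameters along the following lines. This is only a sketch: the multiplier names are the ones referenced in this review, while the skill, the matcher, and the values are purely illustrative.

from ovos_workshop.skills.common_query_skill import CommonQuerySkill

class MyFactSkill(CommonQuerySkill):  # hypothetical example skill
    def initialize(self):
        # Reassign the configurable weights to tune this skill's scoring;
        # the values below are illustrative, not recommended defaults
        self.input_consumed_multiplier = 0.2
        self.response_sentences_multiplier = 0.01
        self.response_words_multiplier = 0.02

    def CQS_match_query_phrase(self, phrase):
        # Minimal matcher so the example is complete
        return None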
153-156: Improved confidence calculation and error handling.

The following changes enhance the confidence calculation process:
- Renaming the method to calc_confidence allows for potential overrides by subclasses, promoting extensibility.
- The added check and capping of the confidence value at 1.0 ensure the confidence remains within a valid range (see the sketch after this list).
- Logging a warning when the calculated confidence exceeds 1.0 aids in identifying and debugging potential issues.
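The cap-and-warn step mentioned above boils down to something like the following sketch of the behaviour, not the exact code from the diff.

from ovos_utils.log import LOG

def cap_confidence(confidence: float) -> float:
    # Keep the final score inside the valid range and surface any
    # miscalibrated weights via a warning
    if confidence > 1.0:
        LOG.warning(f"Calculated confidence {confidence} exceeds 1.0, capping at 1.0")
        confidence = 1.0
    return confidence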
Line range hint 201-261: Enhanced confidence calculation with configurable parameters.

The refactoring of the calc_confidence method significantly improves the confidence calculation process:
- The method now considers multiple factors, such as the percentage of input words consumed, the number of sentences in the response, the relevance of the answer to the input query, and the word count of the response. This provides a more comprehensive assessment of the response's confidence.
- The use of configurable weights for each factor allows for fine-tuning the impact of each factor on the final confidence score, enabling customization based on the specific requirements of each skill (a simplified sketch of this weighted combination follows below).
- The modified relevance calculation, which focuses on the percentage of relevant words in the answer relative to the total word count, improves the accuracy of the confidence score by dynamically adjusting based on the actual content of the response.
- The added logging statements provide valuable insights into the confidence calculation process, aiding in debugging and understanding how the final confidence score is derived.
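A condensed sketch of that weighted combination is shown below; the base values, weights, and exact formulas here are assumptions for illustration, not the code from the diff.

from ovos_workshop.skills.common_query_skill import CQSMatchLevel

def sketch_calc_confidence(match: str, phrase: str, level: CQSMatchLevel,
                           answer: str) -> float:
    # Illustrative base score per match level
    base = {CQSMatchLevel.EXACT: 0.8,
            CQSMatchLevel.CATEGORY: 0.6,
            CQSMatchLevel.GENERAL: 0.5}[level]
    # 1) share of the user's utterance that was actually matched
    consumed_pct = len(match.split()) / max(len(phrase.split()), 1)
    # 2) rough sentence count of the spoken answer
    num_sentences = max(answer.count("."), 1)
    # 3) relevance: share of query words that appear in the answer
    answer_words = answer.lower().split()
    relevance = sum(w.lower() in answer_words for w in phrase.split()) \
        / max(len(answer_words), 1)
    # 4) answer length, capped so long answers are not over-rewarded
    wc_bonus = min(len(answer_words), 50) / 100
    bonus = (0.1 * consumed_pct + 0.05 * num_sentences
             + 0.2 * relevance + wc_bonus)
    return min(base + bonus, 1.0)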
258-258: LGTM!

The inclusion of the calculated bonuses in the final confidence score enhances the accuracy of the confidence assessment by considering multiple factors and their configured weights.
Actionable comments posted: 0
Outside diff range and nitpick comments (1)
test/unittests/skills/test_common_query_skill.py (1)
Line range hint 75-77: Reminder: Implement the missing test methods.

The TODO comments indicate that the test methods for handle_question_query and handle_query_action are yet to be implemented. Please ensure that these methods are tested to improve the overall test coverage of the CommonQuerySkill class.

Do you want me to generate the missing test methods or open a GitHub issue to track this task?
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- test/unittests/skills/test_common_query_skill.py (2 hunks)
Additional comments not posted (4)
test/unittests/skills/test_common_query_skill.py (4)
40-49: LGTM!

The test_get_cq method thoroughly tests the __get_cq method by mocking the CQS_match_query_phrase method and verifying its behavior. It also tests the exception case to ensure graceful handling.
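Since __get_cq is name-mangled, a test of this shape typically patches the skill's matcher and calls the mangled name directly. The sketch below is illustrative only; the return shapes and the behaviour on exception are assumptions based on this review.

def test_get_cq_sketch(self):
    # Happy path: the helper should pass through the skill's match tuple
    self.skill.CQS_match_query_phrase = Mock(
        return_value=("what is a test", CQSMatchLevel.EXACT, "a test"))
    result = self.skill._CommonQuerySkill__get_cq("what is a test")
    self.assertEqual(result[2], "a test")

    # Exception path: failures in the skill's matcher should be handled
    # gracefully rather than propagated
    self.skill.CQS_match_query_phrase = Mock(side_effect=Exception("boom"))
    self.assertIsNone(self.skill._CommonQuerySkill__get_cq("what is a test"))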
52-54: LGTM!

The test_remove_noise method verifies that the remove_noise method correctly normalizes a noisy input string, which is a common use case.
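For reference, such a check might look roughly like the sketch below; which noise words get stripped depends on the vocabulary loaded by the skill, so the strings are assumptions.

def test_remove_noise_sketch(self):
    noisy = "tell me about the pythagorean theorem"
    cleaned = self.skill.remove_noise(noisy)
    # The meaningful part of the query should survive while filler
    # phrasing such as "tell me" is removed
    self.assertIn("pythagorean theorem", cleaned)
    self.assertNotIn("tell me", cleaned)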
57-73: LGTM!

The test_calc_confidence method thoroughly tests the calc_confidence method by evaluating the confidence levels for different query types. It verifies that the confidence levels are calculated correctly and follow the expected order. The test method is well-structured and uses meaningful variable names and comments to improve readability.
2-2: LGTM!

The import statement for the Mock class is necessary for using it in the test methods and follows the Python convention of placing imports at the beginning of the file.
Support returned float confidence level in addition to enum
Actionable comments posted: 0
Outside diff range and nitpick comments (4)
ovos_workshop/skills/common_query_skill.py (4)
Line range hint 37-75: Configurable parameters for confidence calculation look good!

The introduction of configurable parameters allows for greater flexibility in tuning the confidence calculation. The default values seem reasonable.

However, please address the TODO comment regarding the default values for response_sentences_multiplier and response_words_multiplier. Consider adjusting these values based on empirical testing to ensure optimal confidence calculation.
153-162: Handling of directly reported confidence looks good!

Allowing skills to directly report confidence as a float value provides flexibility and allows skills to override the default confidence calculation if needed. Capping the confidence at 1.0 ensures that the confidence value remains within a valid range.

Please remove the extraneous f prefix from the log message at line 154 as suggested by the static analysis tool Ruff.

- LOG.debug(f"Confidence directly reported by skill")
+ LOG.debug("Confidence directly reported by skill")

Tools
Ruff
154-154: f-string without any placeholders. Remove extraneous f prefix (F541)
222-228: Sentence count approximation and noise word removal look good!

Approximating the number of sentences based on periods is a reasonable approach, and ensuring a minimum sentence count of 1 is a good fallback. Removing noise words from the matched portion helps focus on the meaningful part of the user's input.

In the future, consider exploring more robust sentence boundary detection techniques to handle edge cases like abbreviations and decimal numbers more accurately.
231-251: Relevance calculation looks good!

Calculating relevance based on the percentage of relevant words in the answer is a reasonable approach to determine how well the answer addresses the user's input. Capping the answer size at a maximum value helps prevent overly long answers from receiving an unfair confidence boost.

Please address the TODO comment regarding removing SSML from the answer before calculating relevance. This will ensure that the relevance calculation is based on the actual text content of the answer.
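In code, that relevance factor could be computed roughly as follows. This is a sketch: the cap value and tokenization are assumptions, and SSML stripping is omitted as the TODO notes.

def sketch_relevance(phrase_words, answer, max_answer_words=50):
    # Share of query words found in the answer, with the answer size
    # capped so very long answers do not get an outsized boost
    answer_words = answer.lower().split()
    answer_size = min(len(answer_words), max_answer_words)
    if not answer_size:
        return 0.0
    relevant = sum(1 for w in phrase_words if w.lower() in answer_words)
    return relevant / answer_size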
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- ovos_workshop/skills/common_query_skill.py (7 hunks)
- test/unittests/skills/test_common_query_skill.py (2 hunks)
Files skipped from review as they are similar to previous changes (1)
- test/unittests/skills/test_common_query_skill.py
Additional context used
Ruff
ovos_workshop/skills/common_query_skill.py
154-154: f-string without any placeholders. Remove extraneous f prefix (F541)
Additional comments not posted (5)
ovos_workshop/skills/common_query_skill.py (5)
16-16: LGTM!

The import statement is necessary to support the type hint changes in the __get_cq and CQS_match_query_phrase methods.
29-29: LGTM!

The updated comment provides a clearer description of the GENERAL level in the CQSMatchLevel enum.
176-177: LGTM!

The updated return type hint accurately reflects the changes made to allow skills to directly report confidence as a float value.
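In practice this means a skill's matcher may now report its own confidence. The sketch below shows both styles; the tuple shape follows the existing CommonQuerySkill API, while the skill class, the lookup helper, and the values are hypothetical.

from ovos_workshop.skills.common_query_skill import CommonQuerySkill, CQSMatchLevel

class WeatherFactSkill(CommonQuerySkill):  # hypothetical skill
    def CQS_match_query_phrase(self, phrase):
        answer = self.lookup_answer(phrase)  # hypothetical helper
        if not answer:
            return None
        # Option 1: let the framework derive confidence from the enum
        # return (phrase, CQSMatchLevel.CATEGORY, answer)
        # Option 2 (new with this PR): report a float confidence directly
        return (phrase, 0.75, answer)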
253-266: Bonus calculation and confidence capping look good!

Calculating bonuses based on multiple factors allows for a more nuanced confidence calculation that takes into account various aspects of the answer. Applying configurable weights to each bonus factor provides flexibility in tuning the confidence calculation.

Logging a warning and capping the confidence at 1.0 helps identify and handle cases where the calculated confidence exceeds the valid range.
306-306: LGTM!

The updated return type hint accurately reflects the changes made to allow skills to directly report confidence as a float value.
# Approximate the number of sentences in the answer. A trailing `.` will
# split, so reduce length by 1. If no `.` is present, ensure we count
# any response as at least 1 sentence
num_sentences = min(len(answer.split(".")) - 1, 1)
this is not very good, should use quebra_frases instead (already a dependency)

import quebra_frases

print(quebra_frases.sentence_tokenize(
    "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."))
# ['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.',
#  'Did he mind?',
#  "Adam Jones Jr. thinks he didn't.",
#  "In any case, this isn't true...",
#  "Well, with a probability of .9 it isn't."]
wc_mod = float(float(answer_size) / float(WORD_COUNT_DIVISOR)) * 2
# Calculate bonuses based on calculated values and configured weights
consumed_pct_bonus = consumed_pct * self.input_consumed_multiplier
num_sentences_bonus = num_sentences * self.response_sentences_multiplier
should this be part of the score at all? It's a voice assistant; do we prefer a skill reading a full Wikipedia page vs. giving a straight answer?

IMHO we should drop this method and just map the enum to fixed float values, and let ovos-core do the disambiguation.
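If the method were dropped as suggested, the replacement could be as small as a fixed mapping like the sketch below; the float values are placeholders, not an agreed-upon scheme.

from ovos_workshop.skills.common_query_skill import CQSMatchLevel

# Hypothetical fixed mapping from match level to reported confidence;
# ovos-core's reranker would handle disambiguation between skills
CONFIDENCE_BY_LEVEL = {
    CQSMatchLevel.EXACT: 0.9,
    CQSMatchLevel.CATEGORY: 0.7,
    CQSMatchLevel.GENERAL: 0.5,
}

def calc_confidence(level: CQSMatchLevel) -> float:
    return CONFIDENCE_BY_LEVEL[level]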
I see this was added in #61. I'm not sure what the implementation looked like prior to that, but the original Mycroft logic (still used in Neon) still did some adjustment. I think there needs to be some amount of logic to separate responses that return the same confidence enum for predictable behavior; otherwise responses will be random for questions where multiple skills return the same match level.
Not random; there's a reranker plugin that will do a much better job than this, and there is even a config option to allow ignoring skill confidence completely. Either the BM25 or flashrank plugins are available and work well. This logic came from the MK2, btw; it's also Mycroft, just more recent than what Neon forked.
Refactors __calc_confidence to calc_confidence for skills to optionally override calculation
Add scalar confidence values as instance variables for skill customization
Refactor confidence calculation to simplify logic, document process, and add logging
Add unit tests for confidence calculations
Summary by CodeRabbit
New Features
Bug Fixes
- Corrected the comment describing the GENERAL match level in the confidence calculation.
Tests