Releases: Future-House/paper-qa
v5.6.1
Full Changelog: v5.6.0...v5.6.1
v5.8.0
What's Changed
- Update all non-major dependencies by @renovate in #745
- Created `dev` extra for convenience by @jamesbraza in #750
- Update all non-major dependencies by @renovate in #754
- Populated `LICENSE` by @jamesbraza in #756
- Add partitioning func capabilities to allow doc-types-based embedding ranking by @mskarlin in #752
- Exposed seeding of LitQA2 read and shuffling by @jamesbraza in #758
Full Changelog: v5.7.0...v5.8.0
v5.7.0
What's Changed
- Moved `README` to use `session` over `answer` by @jamesbraza in #741
- Moved `Docs.aadd` to support `str | os.PathLike` by @jamesbraza in #742
- Cleared up 'Adding Documents Manually' docs by @jamesbraza in #740
- Support env states with custom status functions by @mskarlin in #743
- Update astral-sh/setup-uv action to v4 by @renovate in #746
- Moved JSON summary prompt to mention score is an integer by @jamesbraza in #748
Full Changelog: v5.6.0...v5.7.0
v5.6.0
Highlights
This release is mainly a bunch of bug fixes:
- Pulling in breaks in upstream dependencies (e.g. Pydantic 2.10, aviary 0.10.1)
- Makes `GradablePaperQAEnvironment`'s evaluations robust to an empty answer or multiple answers

Due to the introduction of `Complete.NO_ANSWER_PHRASE` in #726 it was requested we consider this a minor version bump, as it will impact system performance.
What's Changed
- Fixed settings `session` into `EnvironmentState`, and suppressing PyMuPDF derived `DeprecationWarning` by @jamesbraza in #713
- Adding assertion `gather_evidence` doesn't populate `session.answer` by @jamesbraza in #716
- Lock file maintenance by @renovate in #715
- Fixes `gather_with_concurrency` typing by @maykcaldas in #714
- Latest tooling dependencies by @jamesbraza in #719
- Lock file maintenance by @renovate in #718
- Fixed `EVAL_PROMPT_TEMPLATE` to handle empty string or multiple match answers by @jamesbraza in #724
- Address missing `GenerateAnswer` in trajectories, no answers after `Complete` tools, and better history by @mskarlin in #726
- Pulling in latest `aviary` for `concurrency` rename by @jamesbraza in #728
- Pulling in latest `aviary` for dependencies fix, and retrying flaky `test_propagate_options` more by @jamesbraza in #729
- Pulling in latest `ldp` for `Callback.before_rollout` by @jamesbraza in #734
- Documenting why we don't handle evaluation failures in `GradablePaperQAEnvironment.step` by @jamesbraza in #738
- Created `LitQAEvaluation.calculate_accuracy_precision` utility by @jamesbraza in #733
- Refreshed test cassettes, fixed flaky test `test_search`, and fixed test type ignores by @jamesbraza in #739
- Unpins pydantic >2.10.2 requirement, removes TYPE_CHECKING by @nadolskit in #725
- Lock file maintenance by @renovate in #737
- Alternative maybe is text by @loesinghaus in #717
New Contributors
- @maykcaldas made their first contribution in #714
- @loesinghaus made their first contribution in #717
Full Changelog: v5.5.0...v5.6.0
v5.5.1
Full Changelog: v5.5.0...v5.5.1
v5.5.0
Highlights
In all of v5 before this release, we defined the presence of 1+ answer generations not containing the substring `"cannot answer"` as the agent loop's end. However, this (suboptimally) leads to the agent loop terminating early on partial answers like "Based on the sources provided, it appears no one has done x." We realized this, and have resolved this issue by:
- No longer coupling our done condition with the substring `"cannot answer"` being not present in 1+ generated answers
- No longer implicitly depending on clients mentioning this `"cannot answer"` sentinel in the input `qa` prompt
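The early-termination problem described above can be illustrated with a minimal sketch. None of these names come from paper-qa: `legacy_is_done`, `LoopState`, and `decoupled_is_done` are hypothetical, and the explicit `completed` flag is only an assumed stand-in for a done condition that no longer inspects answer text.

```python
from dataclasses import dataclass, field

CANNOT_ANSWER = "cannot answer"  # sentinel substring named in the notes above


def legacy_is_done(answers: list[str]) -> bool:
    """Hypothetical pre-v5.5.0 rule: done once 1+ answers lack the sentinel."""
    return any(CANNOT_ANSWER not in answer for answer in answers)


@dataclass
class LoopState:
    """Hypothetical agent-loop state with an explicit completion flag."""

    answers: list[str] = field(default_factory=list)
    completed: bool = False  # assumed to be set only by an explicit completion step


def decoupled_is_done(state: LoopState) -> bool:
    """Done condition that ignores the answer text entirely."""
    return state.completed


partial = "Based on the sources provided, it appears no one has done x."
print(legacy_is_done([partial]))  # substring rule: is the loop over?
print(decoupled_is_done(LoopState(answers=[partial])))  # flag rule: is the loop over?
```

Under the substring rule, the partial answer counts as a final answer simply because it avoids the sentinel phrase; with an explicit flag, the answer's wording has no bearing on when the loop ends.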
We also fixed several (bad) bugs:
- We support parallel tool calling (2+ `ToolCall`s in one `action: ToolRequestMessage`). However, our tools (notably `gather_evidence`) are not actually concurrent-safe. Our tool schemae instructed not to call certain tools in parallel, nonetheless we observed agents specifying `gather_evidence` to be called in parallel. So now we force our tools to be non-concurrently executed to work around this race condition
- When using `LitQAEvaluation` and the same `GradablePaperQAEnvironment` 2+ times, we repeatedly added the "unsure" option to the target multiple choice question, degrading performance over time
- When using `PaperQAEnvironment` 2+ times, each `reset` was not properly wiping the `Docs` object
- The reward distribution of `LitQAEvaluation` was mixing up "unsure" reward of `0.1` with the "incorrect" reward of `-1.0`, not properly incentivizing learning
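As a rough illustration of the repeated "unsure" option bug listed above, the sketch below assumes the cause was an in-place append to a shared options list. The notes do not state the actual mechanism, so this is only one plausible shape. `UNSURE_OPTION`, `add_unsure_buggy`, and `add_unsure_idempotent` are hypothetical names, not paper-qa APIs.

```python
UNSURE_OPTION = "Unsure"  # placeholder text, not the library's actual wording


def add_unsure_buggy(options: list[str]) -> list[str]:
    """Appends in place, so each reuse of the same list grows it."""
    options.append(UNSURE_OPTION)
    return options


def add_unsure_idempotent(options: list[str]) -> list[str]:
    """Returns a new list and adds the option at most once."""
    if UNSURE_OPTION in options:
        return list(options)
    return [*options, UNSURE_OPTION]


shared = ["A", "B", "C"]
for _ in range(3):  # simulate reusing one environment three times
    add_unsure_buggy(shared)
print(shared.count(UNSURE_OPTION))  # how many "unsure" options accumulated

fresh = ["A", "B", "C"]
for _ in range(3):
    fresh = add_unsure_idempotent(fresh)
print(fresh.count(UNSURE_OPTION))  # how many with the idempotent variant
```

Each extra copy of the "unsure" choice changes the question the model sees, which is consistent with the gradual degradation the notes describe.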
There are a bunch of other minor features, cleanups, and bugfixes here too, see the full list below.
What's Changed
- Deprecation cycle for `AgentSettings.should_pre_search` by @jamesbraza in #679
- Moved agent prompts to `prompts.py` by @jamesbraza in #681
- Refactor to remove `skip_system` from `LLMModel.run_prompt` by @jamesbraza in #680
- Resolving `evidence_detailed_citations` and `Answer` deprecations by @jamesbraza in #682
- Fixed agent prompt names and contents after #681 mess up by @jamesbraza in #683
- Removed `tool_names` validation for `gen_answer` being present by @jamesbraza in #685
- Fixing `test_evaluation` logic bugs by @jamesbraza in #686
- Removed `GenerateAnswer.FAILED_TO_ANSWER` as its unnecessary by @jamesbraza in #691
- Allowing serialized `Settings` in `get_settings` by @jamesbraza in #688
- Fixed LDP runner's `TRUNCATED` not calling `gen_answer`, and documented `AgentStatus` by @jamesbraza in #690
- Removed `gen_answer`'s dead argument `question` by @jamesbraza in #689
- Making sure we copy distractors by @sidnarayanan in #694
- Created `complete` tool to allow unsure answers by @jamesbraza in #684
- Added missing `test_from_question` cassette by @jamesbraza in #696
- Moved `fake` agent to LLM propose `complete` tool by @jamesbraza in #695
- Default to ordered tool calls, w env variable control by @mskarlin in #697
- Lock file maintenance by @renovate in #699
- Refactored `TestGradablePaperQAEnvironment` for DRY code by @jamesbraza in #702
- Fixing `PaperQAEnvironment.reset` respecting `mmr_lambda` and `text_hashes` by @jamesbraza in #703
- Removed `"cannot answer"` literals and added `reset` tool by @jamesbraza in #698
- Update all non-major dependencies by @renovate in #705
- Fixing `LitQAEvaluation` bugs: incorrect reward indices, not using LLM's native knowledge by @jamesbraza in #708
- Adding filters to paper-qa Docs by @whitead in #707
- Fixed mutably defaulted `NumpyVectorStore.texts` by @jamesbraza in #711
Full Changelog: v5.4.0...v5.5.0
Hotfix to include `ordered=True` in tool exec calls
Prevents parallel tool calls from clobbering the env. state.
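To show why ordered execution protects shared state, here is a minimal sketch using plain `asyncio`. It is not the library's code: `gather_evidence` here is a hypothetical stand-in tool, `exec_tools` is a hypothetical executor, and its `ordered` parameter only mirrors the flag name mentioned above.

```python
import asyncio


async def gather_evidence(state: dict, n: int) -> None:
    """Hypothetical non-concurrent-safe tool: read, yield, then write."""
    seen = state["evidence_count"]  # read shared state
    await asyncio.sleep(0)  # yield to the event loop, as an I/O call would
    state["evidence_count"] = seen + n  # write based on a possibly stale read


async def exec_tools(state: dict, calls: list[int], ordered: bool) -> None:
    """Hypothetical executor: sequential when ordered, concurrent otherwise."""
    if ordered:
        for n in calls:
            await gather_evidence(state, n)  # one tool call at a time
    else:
        await asyncio.gather(*(gather_evidence(state, n) for n in calls))


def run(ordered: bool) -> int:
    state = {"evidence_count": 0}
    asyncio.run(exec_tools(state, [1, 2], ordered))
    return state["evidence_count"]


print(run(ordered=True))  # sequential: both updates are applied
print(run(ordered=False))  # concurrent: one update overwrites the other
```

In the concurrent case both calls read the counter before either writes, so one update is lost. Running the calls in order removes the interleaving, though it gives up any concurrency between tool calls.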
v5.3.3
Full Changelog: v5.3.2...v5.3.3
v5.4.0
What's Changed
- Renamed to PQASession type by @whitead in #653
- Lock file maintenance by @renovate in #657
- Ability to zero-shot `gen_answer` by @jamesbraza in #658
- Lock file maintenance by @renovate in #659
- Moving to `uv` dependency groups by @jamesbraza in #660
- Lock file maintenance by @renovate in #664
- Convert citation to formatted_citation usage where necessary by @mskarlin in #666
- Catch edge case where externalIds field is None by @mskarlin in #668
- Made o1 temperature issue a warning, instead of valueerror by @whitead in #669
- Added train and eval splits' questions and DOIs by @jamesbraza in #662
- `fake` agent allowing timeouts or exceptions, by @jamesbraza in #672
- Optional `AnswerSetting.max_answer_attempts` to allow a new unsure branch by @jamesbraza in #673
- Made it so you do not die on invalid tool by @whitead in #670
- Allowing latest `pydantic-settings` and regenerated cassettes by @jamesbraza in #674
- Empty tool calls leading to `done` condition by @jamesbraza in #671
- Changed it to be debug for source quality by @whitead in #675
Full Changelog: v5.3.2...v5.4.0
v5.3.2
What's Changed
- Printing the `text` in a failed `llm_parse_json` by @jamesbraza in #629
- Change S2 client logic to use arxiv doi if it's defined by @mskarlin in #632
- Increased retry count for `ClientConnectorDNSError` errors by @jamesbraza in #639
- Make string similarity case insensitive by default by @mskarlin in #640
- Pulling in latest `fhaviary`, `mypy`, `ruff` by @jamesbraza in #647
- Add an after model validator ensuring temp=1 for o1 models by @dakoner in #649
- Fixing crash due to `None` author by @jamesbraza in #650
- Fixing flaky test `test_minimal_fields_filtering` by @jamesbraza in #651
- Fixing flaky tests `test_code` and `test_minimal_fields_filtering` by @jamesbraza in #652
- Lock file maintenance by @renovate in #648
New Contributors
Full Changelog: v5.3.1...v5.3.2