Commit

update arc prize in evals
souzatharsis committed Dec 8, 2024
1 parent 390c810 commit 4aa0d97
Showing 14 changed files with 200 additions and 180 deletions.
Binary file modified tamingllms/_build/.doctrees/environment.pickle
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/evals.doctree
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/output_size_limit.doctree
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/structured_output.doctree
Binary file not shown.
2 changes: 2 additions & 0 deletions tamingllms/_build/html/_sources/notebooks/evals.ipynb
@@ -1242,6 +1242,8 @@
"\n",
"These features make the ARC benchmark a unique test of machine intelligence, focusing on the ability to adapt to novelty and solve problems without relying heavily on memorization. This is more aligned with the concept of general intelligence, which emphasizes the ability to learn efficiently and tackle new challenges.\n",
"\n",
"As of December 2024, the ARC-AGI benchmark had remained unbeaten for five years (winning requires a minimum score of 85%) {cite}`arcprizeresults2024`. Although deep learning has advanced significantly in recent years, pure deep learning approaches perform poorly on ARC-AGI: they rely on relating new situations to those seen during training and lack the ability to adapt or recombine knowledge for entirely new tasks. ARC Prize 2024 spurred the development of novel AGI reasoning techniques, raising the state-of-the-art score on the ARC-AGI private evaluation set from 33% in 2023 to 55.5% in 2024. A key takeaway is that algorithmic improvements, rather than massive computational resources, may be what ultimately pushes past the benchmark's target score.\n",
"\n",
"As language models continue to advance in capability and complexity, evaluation frameworks must evolve with them. Modern benchmarks increasingly incorporate tests for nuanced reasoning, ethical decision-making, and emergent capabilities that were not previously measurable. This ongoing evolution reflects a deeper understanding: the true value of language models lies not in high scores on standardized tests with narrow task-specific metrics, but in their capacity to meaningfully contribute to human understanding, help solve real-world problems, and adapt to new tasks."
]
},
221 changes: 113 additions & 108 deletions tamingllms/_build/html/notebooks/evals.html

Large diffs are not rendered by default.

60 changes: 30 additions & 30 deletions tamingllms/_build/html/notebooks/output_size_limit.html

Large diffs are not rendered by default.

80 changes: 40 additions & 40 deletions tamingllms/_build/html/notebooks/structured_output.html

Large diffs are not rendered by default.

Binary file modified tamingllms/_build/html/objects.inv
Binary file not shown.
2 changes: 1 addition & 1 deletion tamingllms/_build/html/searchindex.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion tamingllms/_build/jupyter_execute/markdown/intro.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
"id": "bea94820",
"id": "05f22589",
"metadata": {},
"source": [
"(intro)=\n",
2 changes: 2 additions & 0 deletions tamingllms/_build/jupyter_execute/notebooks/evals.ipynb
@@ -1242,6 +1242,8 @@
"\n",
"These features make the ARC benchmark a unique test of machine intelligence, focusing on the ability to adapt to novelty and solve problems without relying heavily on memorization. This is more aligned with the concept of general intelligence, which emphasizes the ability to learn efficiently and tackle new challenges.\n",
"\n",
"As of December 2024, the ARC-AGI benchmark had remained unbeaten for five years (winning requires a minimum score of 85%) {cite}`arcprizeresults2024`. Although deep learning has advanced significantly in recent years, pure deep learning approaches perform poorly on ARC-AGI: they rely on relating new situations to those seen during training and lack the ability to adapt or recombine knowledge for entirely new tasks. ARC Prize 2024 spurred the development of novel AGI reasoning techniques, raising the state-of-the-art score on the ARC-AGI private evaluation set from 33% in 2023 to 55.5% in 2024. A key takeaway is that algorithmic improvements, rather than massive computational resources, may be what ultimately pushes past the benchmark's target score.\n",
"\n",
"As language models continue to advance in capability and complexity, evaluation frameworks must evolve with them. Modern benchmarks increasingly incorporate tests for nuanced reasoning, ethical decision-making, and emergent capabilities that were not previously measurable. This ongoing evolution reflects a deeper understanding: the true value of language models lies not in high scores on standardized tests with narrow task-specific metrics, but in their capacity to meaningfully contribute to human understanding, help solve real-world problems, and adapt to new tasks."
]
},
2 changes: 2 additions & 0 deletions tamingllms/notebooks/evals.ipynb
@@ -1242,6 +1242,8 @@
"\n",
"These features make the ARC benchmark a unique test of machine intelligence, focusing on the ability to adapt to novelty and solve problems without relying heavily on memorization. This is more aligned with the concept of general intelligence, which emphasizes the ability to learn efficiently and tackle new challenges.\n",
"\n",
"As of December 2024, the ARC-AGI benchmark had remained unbeaten for five years (winning requires a minimum score of 85%) {cite}`arcprizeresults2024`. Although deep learning has advanced significantly in recent years, pure deep learning approaches perform poorly on ARC-AGI: they rely on relating new situations to those seen during training and lack the ability to adapt or recombine knowledge for entirely new tasks. ARC Prize 2024 spurred the development of novel AGI reasoning techniques, raising the state-of-the-art score on the ARC-AGI private evaluation set from 33% in 2023 to 55.5% in 2024. A key takeaway is that algorithmic improvements, rather than massive computational resources, may be what ultimately pushes past the benchmark's target score.\n",
"\n",
"As language models continue to advance in capability and complexity, evaluation frameworks must evolve with them. Modern benchmarks increasingly incorporate tests for nuanced reasoning, ethical decision-making, and emergent capabilities that were not previously measurable. This ongoing evolution reflects a deeper understanding: the true value of language models lies not in high scores on standardized tests with narrow task-specific metrics, but in their capacity to meaningfully contribute to human understanding, help solve real-world problems, and adapt to new tasks."
]
},
9 changes: 9 additions & 0 deletions tamingllms/references.bib
@@ -374,3 +374,12 @@ @misc{arcprize2024
howpublished={ARC Prize Website},
url={https://arcprize.org/},
}

@misc{arcprizeresults2024,
title={ARC Prize 2024 Results},
author={Francois Chollet},
year={2024},
howpublished={ARC Prize Website},
url={https://arcprize.org/2024-results},
}
