Commit

update arc prize in evals
souzatharsis committed Dec 8, 2024
1 parent 390c810 commit 4aa0d97
Showing 14 changed files with 200 additions and 180 deletions.
Binary file modified tamingllms/_build/.doctrees/environment.pickle
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/evals.doctree
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/output_size_limit.doctree
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/structured_output.doctree
Binary file not shown.
2 changes: 2 additions & 0 deletions tamingllms/_build/html/_sources/notebooks/evals.ipynb
@@ -1242,6 +1242,8 @@
"\n",
"These features make the ARC benchmark a unique test of machine intelligence, focusing on the ability to adapt to novelty and solve problems without relying heavily on memorization. This is more aligned with the concept of general intelligence, which emphasizes the ability to learn efficiently and tackle new challenges.\n",
"\n",
"As of December 2024, the ARC-AGI benchmark had remained unbeaten for five years (winning requires a minimum score of 85%) {cite}`arcprizeresults2024`. Although deep learning has advanced significantly in recent years, pure deep learning approaches perform poorly on ARC-AGI: they rely on relating new situations to those seen during training and lack the ability to adapt or recombine knowledge for entirely new tasks. ARC Prize 2024 spurred the development of novel AGI reasoning techniques, raising the state-of-the-art score on the ARC-AGI private evaluation set from 33% in 2023 to 55.5% in 2024. A key takeaway is that algorithmic improvements, rather than massive computational resources, may be what ultimately pushes past the benchmark's target score.\n",
"\n",
"As language models continue to advance in capability and complexity, evaluation frameworks must evolve with them. Modern benchmarks increasingly incorporate tests for nuanced reasoning, ethical decision-making, and emergent capabilities that were not previously measurable. This ongoing evolution reflects a deeper understanding: the true value of language models lies not in high scores on standardized tests with narrow task-specific metrics, but in their capacity to meaningfully contribute to human understanding, help solve real-world problems, and adapt to new tasks."
]
},
221 changes: 113 additions & 108 deletions tamingllms/_build/html/notebooks/evals.html

Large diffs are not rendered by default.

60 changes: 30 additions & 30 deletions tamingllms/_build/html/notebooks/output_size_limit.html

Large diffs are not rendered by default.

80 changes: 40 additions & 40 deletions tamingllms/_build/html/notebooks/structured_output.html

Large diffs are not rendered by default.

Binary file modified tamingllms/_build/html/objects.inv
Binary file not shown.
2 changes: 1 addition & 1 deletion tamingllms/_build/html/searchindex.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion tamingllms/_build/jupyter_execute/markdown/intro.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
"id": "bea94820",
"id": "05f22589",
"metadata": {},
"source": [
"(intro)=\n",
2 changes: 2 additions & 0 deletions tamingllms/_build/jupyter_execute/notebooks/evals.ipynb
@@ -1242,6 +1242,8 @@
"\n",
"These features make the ARC benchmark a unique test of machine intelligence, focusing on the ability to adapt to novelty and solve problems without relying heavily on memorization. This is more aligned with the concept of general intelligence, which emphasizes the ability to learn efficiently and tackle new challenges.\n",
"\n",
"As of December 2024, the ARC-AGI benchmark had remained unbeaten for five years (winning requires a minimum score of 85%) {cite}`arcprizeresults2024`. Although deep learning has advanced significantly in recent years, pure deep learning approaches perform poorly on ARC-AGI: they rely on relating new situations to those seen during training and lack the ability to adapt or recombine knowledge for entirely new tasks. ARC Prize 2024 spurred the development of novel AGI reasoning techniques, raising the state-of-the-art score on the ARC-AGI private evaluation set from 33% in 2023 to 55.5% in 2024. A key takeaway is that algorithmic improvements, rather than massive computational resources, may be what ultimately pushes past the benchmark's target score.\n",
"\n",
"As language models continue to advance in capability and complexity, evaluation frameworks must evolve with them. Modern benchmarks increasingly incorporate tests for nuanced reasoning, ethical decision-making, and emergent capabilities that were not previously measurable. This ongoing evolution reflects a deeper understanding: the true value of language models lies not in high scores on standardized tests with narrow task-specific metrics, but in their capacity to meaningfully contribute to human understanding, help solve real-world problems, and adapt to new tasks."
]
},
2 changes: 2 additions & 0 deletions tamingllms/notebooks/evals.ipynb
@@ -1242,6 +1242,8 @@
"\n",
"These features make the ARC benchmark a unique test of machine intelligence, focusing on the ability to adapt to novelty and solve problems without relying heavily on memorization. This is more aligned with the concept of general intelligence, which emphasizes the ability to learn efficiently and tackle new challenges.\n",
"\n",
"As of December 2024, the ARC-AGI benchmark had remained unbeaten for five years (winning requires a minimum score of 85%) {cite}`arcprizeresults2024`. Although deep learning has advanced significantly in recent years, pure deep learning approaches perform poorly on ARC-AGI: they rely on relating new situations to those seen during training and lack the ability to adapt or recombine knowledge for entirely new tasks. ARC Prize 2024 spurred the development of novel AGI reasoning techniques, raising the state-of-the-art score on the ARC-AGI private evaluation set from 33% in 2023 to 55.5% in 2024. A key takeaway is that algorithmic improvements, rather than massive computational resources, may be what ultimately pushes past the benchmark's target score.\n",
"\n",
"As language models continue to advance in capability and complexity, evaluation frameworks must evolve with them. Modern benchmarks increasingly incorporate tests for nuanced reasoning, ethical decision-making, and emergent capabilities that were not previously measurable. This ongoing evolution reflects a deeper understanding: the true value of language models lies not in high scores on standardized tests with narrow task-specific metrics, but in their capacity to meaningfully contribute to human understanding, help solve real-world problems, and adapt to new tasks."
]
},
9 changes: 9 additions & 0 deletions tamingllms/references.bib
@@ -374,3 +374,12 @@ @misc{arcprize2024
howpublished={ARC Prize Website},
url={https://arcprize.org/},
}

@misc{arcprizeresults2024,
title={ARC Prize 2024 Results},
author={Francois Chollet},
year={2024},
howpublished={ARC Prize Website},
url={https://arcprize.org/2024-results},
}
