Public eval set: errors and ambiguities detected during manual solve #123
Comments
FYI I think some of these have been fixed already, e.g. see b7fd42c
Thank you for doing this. Update: oh I think they may have been fixed actually!
I think you may have been using an editor that is out of sync with the latest ARC-AGI dataset. There have been lots of fixes to this ARC-AGI repo around May/June 2024, leading up to the launch of ARC-Prize. I checked some of the issues you reported, and they seemed fixed. This is a huge list of issues that is hard to work with. I recommend that @malcolmsharpe split this into smaller separate issues and close this huge GitHub issue. Examples of well-formatted GitHub issues regarding specific tasks.
We have an active Discord chat with a #tasks channel where we discuss individual mistakes, if you'd like to talk about these. For example, 50f325b5 is discussed at
I manually solved the entire public eval set in the ARCreate Playground. While doing so, I noticed errors and ambiguities in some tasks.
Here is the list of all the errors and ambiguities that I recorded. Please note that the versions of the tasks are the ones that were live in the ARCreate Playground during the dates I was solving them (2024-06-17 to 2024-07-03), so they may not reflect the latest commits to the repo. For each task, I give its number in the ARCreate Playground, as well as its filename.
(My original motivation was to determine human baseline performance, which I discuss in the last section.)
Errors (recoverable)
These tasks had errors, but I was able to produce an accepted test output anyway.
Ambiguities
These tasks have more than one test output that is consistent with the demonstrations, in my opinion. In some cases, I needed multiple tries to produce an accepted test output.
Errors (unrecoverable)
For these tasks, I was not able to produce an accepted test output, so I suspect errors exist. I have not checked the files.
Human baseline performance
I was able to find a working idea for every task, including the ones with ambiguities and unrecoverable errors, so, if these are fixed, I would consider human performance on the public eval set to be 100%. Without the fixes:
Thus, human performance is somewhere between 93.0% and 98.8%, depending on the impact of recoverable errors and ambiguities.
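For anyone who wants to reproduce the arithmetic behind these bounds, here is a minimal sketch. The function name and the counts are assumptions, not taken from the lists above: I assume the public eval set has 400 tasks, and the counts 5 and 23 are back-derived to be consistent with the reported percentages rather than copied from the per-category lists.

```python
def human_baseline_bounds(total_tasks, unrecoverable, recoverable_or_ambiguous):
    """Return (lower, upper) bounds on the human solve rate, as fractions.

    upper: only tasks with unrecoverable errors count as failures.
    lower: tasks with recoverable errors or ambiguities also count as failures.
    """
    affected = unrecoverable + recoverable_or_ambiguous
    if affected > total_tasks:
        raise ValueError("more affected tasks than tasks in the set")
    upper = (total_tasks - unrecoverable) / total_tasks
    lower = (total_tasks - affected) / total_tasks
    return lower, upper


# Hypothetical counts, chosen to match the reported range (assumption).
lower, upper = human_baseline_bounds(
    total_tasks=400, unrecoverable=5, recoverable_or_ambiguous=23
)
print(f"human performance between {lower:.1%} and {upper:.1%}")
```

If the real counts differ, only the three arguments need to change.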
However, practical human performance will be negatively impacted by the difficulty of producing the test output. Some tasks are unergonomic (particularly the numerous fill-in-the-pattern tasks): producing the test output is challenging and time-consuming even if you have the right idea. Ergonomics could be improved with a more powerful solver tool. In particular, copy-paste with mirroring and rotation would simplify fill-in-the-pattern tasks, among others.
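To make the copy-paste suggestion concrete, here is a rough sketch of the operations I have in mind. It assumes a grid is a list of rows of integer color codes. The helper names (`rotate_cw`, `mirror_lr`, `paste`) are hypothetical and do not correspond to any existing tool's API.

```python
def rotate_cw(patch):
    """Rotate a rectangular patch 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def mirror_lr(patch):
    """Mirror a patch left-to-right."""
    return [row[::-1] for row in patch]


def paste(grid, patch, top, left):
    """Return a copy of grid with patch written at (top, left)."""
    out = [row[:] for row in grid]
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            out[top + r][left + c] = value
    return out


# Fill the empty right half of a grid by mirroring its left half.
grid = [[1, 2, 0, 0],
        [3, 4, 0, 0]]
left_half = [row[:2] for row in grid]
filled = paste(grid, mirror_lr(left_half), top=0, left=2)
print(filled)
```

With something like this in the editor, a symmetric fill takes one select, one transform, and one paste instead of cell-by-cell painting.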