Cannot crawl the data from the OpenReview website #3
Comments
Hi, could you give more context for your error? I saw you are using a |
@fedebotu Thank you very much for your reply. I converted the .ipynb file to .py. I have tried a few times, and every run failed with the same problem.
@DongXingshuai I found out why. Apparently, a [...] Here is the updated `filter_data`:
```python
def filter_data(item,
                review_keys=['summary_of_the_paper', 'strength_and_weaknesses', 'clarity,_quality,_novelty_and_reproducibility', 'summary_of_the_review'],
                decision=True):
    """Filter only ratings, confidence, withdraw status and decisions"""
    # parse each note
    withdraw = 0
    try:
        # filter meta note
        meta_note = [d for d in item if 'Paper' not in d['invitation']]
        # check withdrawn
        withdraw = 1 if 'Withdrawn_Submission' in meta_note[0]['invitation'] else 0
    except:
        # note: simple pass for no meta notes
        pass
    # decision
    if decision:
        try:
            if withdraw == 0:
                decision_note = [d for d in item if 'Decision' in d['invitation']]
                decision = decision_note[0]['content']['decision']
            else:
                decision = ''
        except:
            decision = ''
    # filter reviewer comments
    comment_notes = [d for d in item \
                     if 'Official_Review' in d['invitation'] and 'recommendation' in d['content'].keys()]
    comment_notes = sorted(comment_notes, key=lambda d: d['number'])[::-1]
    ratings = [int(note['content']['recommendation'].split(':')[0]) for note in comment_notes]
    confidences = [int(note['content']['confidence'].split(':')[0]) for note in comment_notes]
    review_lengths = [sum(len(note['content'][key].split()) for key in review_keys) for note in comment_notes]  # review lengths
    data = {'ratings': ratings, 'confidences': confidences, 'withdraw': withdraw, 'review_lengths': review_lengths}
    if decision: data['decision'] = decision
    return data
```
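For anyone who would rather not rely on a bare `except`, here is a minimal sketch of the same guard written as an explicit emptiness check. It assumes each note is a dict with an `'invitation'` string, as in the function above. The helper name `withdraw_flag` and the sample invitation strings are hypothetical and only there for illustration, not taken from the repository.

```python
def withdraw_flag(item):
    """Return 1 if the first non-'Paper' note marks a withdrawn submission, else 0."""
    # Same filter as in filter_data: keep notes whose invitation has no 'Paper' in it
    meta_notes = [d for d in item if 'Paper' not in d['invitation']]
    if not meta_notes:
        # No meta note at all: this is the case that raised IndexError before
        return 0
    return 1 if 'Withdrawn_Submission' in meta_notes[0]['invitation'] else 0


# Hypothetical sample notes (invitation strings are made up for this example)
only_reviews = [{'invitation': 'ICLR.cc/2023/Conference/Paper1/-/Official_Review'}]
withdrawn = [{'invitation': 'ICLR.cc/2023/Conference/-/Withdrawn_Submission'}]

print(withdraw_flag(only_reviews), withdraw_flag(withdrawn), withdraw_flag([]))
# prints: 0 1 0
```

The explicit check only covers the empty-list case, so a note that lacks an `'invitation'` key would still raise a `KeyError` here, whereas the `try`/`except` version swallows it.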
@fedebotu Thank you very much.
Hi there, I tried to run parse_data.py to crawl data from OpenReview. Unfortunately, it did not work. The error messages are below. Can anybody give me a hand? Thank you!
ipython parse_data.py
Offset: 0 Data: 0
Offset: 1000 Data: 1000
Offset: 2000 Data: 2000
Offset: 3000 Data: 3000
Offset: 4000 Data: 3809
Number of submissions: 3809
Number of papers (including old): 4874
0%| | 0/4874 [00:00<?, ?it/s]
0%| | 0/4874 [00:00<?, ?it/s]
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/dongxingshuai/anaconda3/envs/nlp/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/dongxingshuai/research_associate/nlp/ICLR2023-OpenReviewData-main/notebooks/parse_data.py", line 166, in filter_data
withdraw = 1 if 'Withdrawn_Submission' in meta_note[0]['invitation'] else 0
IndexError: list index out of range
"""
The above exception was the direct cause of the following exception:
IndexError Traceback (most recent call last)
File ~/research_associate/nlp/ICLR2023-OpenReviewData-main/notebooks/parse_data.py:195
190 # In[59]:
191
192
193 # filter data in a pool of processes
194 with Pool(8) as p:
--> 195 filtered_notes = list(tqdm(p.imap(filter_data, notes), total=len(notes)))
198 # In[60]:
199
200
201 # create dataframe
202 ratings = pd.DataFrame(filtered_notes)
File ~/anaconda3/envs/nlp/lib/python3.8/site-packages/tqdm/notebook.py:249, in tqdm_notebook.__iter__(self)
247 try:
248 it = super(tqdm_notebook, self).__iter__()
--> 249 for obj in it:
250 # return super(tqdm...) will not catch exception
251 yield obj
252 # NB: except ... [ as ...] breaks IPython async KeyboardInterrupt
File ~/anaconda3/envs/nlp/lib/python3.8/site-packages/tqdm/std.py:1182, in tqdm.__iter__(self)
1179 time = self._time
1181 try:
-> 1182 for obj in iterable:
1183 yield obj
1184 # Update and possibly print the progressbar.
1185 # Note: does not call self.update(1) for speed optimisation.
File ~/anaconda3/envs/nlp/lib/python3.8/multiprocessing/pool.py:868, in IMapIterator.next(self, timeout)
866 if success:
867 return value
--> 868 raise value
IndexError: list index out of range
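Because the exception is re-raised from a worker process, the traceback does not say which submission triggered it. One way to narrow this down is to run the function serially and record the failing indices. The sketch below assumes nothing about the real data: `find_failing_items`, `first_meta_invitation`, and `sample_notes` are hypothetical names, and the stand-in function only mimics the unguarded `meta_note[0]` lookup from the traceback.

```python
def find_failing_items(func, items):
    """Call func on each item serially and collect (index, error) for failures."""
    failures = []
    for index, item in enumerate(items):
        try:
            func(item)
        except Exception as exc:  # broad on purpose: we only want to locate bad items
            failures.append((index, f"{type(exc).__name__}: {exc}"))
    return failures


def first_meta_invitation(item):
    """Hypothetical stand-in that indexes an unguarded list, like the failing line."""
    meta_note = [d for d in item if 'Paper' not in d['invitation']]
    return meta_note[0]['invitation']


# Hypothetical data: the second entry has no note without 'Paper' in its invitation
sample_notes = [
    [{'invitation': 'ICLR.cc/2023/Conference/-/Blind_Submission'}],
    [{'invitation': 'ICLR.cc/2023/Conference/Paper2/-/Official_Review'}],
]

print(find_failing_items(first_meta_invitation, sample_notes))
# prints: [(1, 'IndexError: list index out of range')]
```

With the real script, the same helper could presumably be called as `find_failing_items(filter_data, notes)` before handing the work to the pool, so that the offending entries can be inspected directly.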