Cannot crawl the data from the OpenReview website #3

DongXingshuai · 2023-09-07T03:54:06Z

Hi there, I tried to run parse_data.py to crawl data from OpenReview, but it did not work. The error messages are below. Can anybody give me a hand? Thank you!

ipython parse_data.py
Offset: 0 Data: 0
Offset: 1000 Data: 1000
Offset: 2000 Data: 2000
Offset: 3000 Data: 3000
Offset: 4000 Data: 3809
Number of submissions: 3809
Number of papers (including old): 4874
0%| | 0/4874 [00:00<?, ?it/s]
0%| | 0/4874 [00:00<?, ?it/s]

RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/dongxingshuai/anaconda3/envs/nlp/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/dongxingshuai/research_associate/nlp/ICLR2023-OpenReviewData-main/notebooks/parse_data.py", line 166, in filter_data
withdraw = 1 if 'Withdrawn_Submission' in meta_note[0]['invitation'] else 0
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

IndexError Traceback (most recent call last)
File ~/research_associate/nlp/ICLR2023-OpenReviewData-main/notebooks/parse_data.py:195
190 # In[59]:
191
192
193 # filter data in a pool of processes
194 with Pool(8) as p:
--> 195 filtered_notes = list(tqdm(p.imap(filter_data, notes), total=len(notes)))
198 # In[60]:
199
200
201 # create dataframe
202 ratings = pd.DataFrame(filtered_notes)

File ~/anaconda3/envs/nlp/lib/python3.8/site-packages/tqdm/notebook.py:249, in tqdm_notebook.iter(self)
247 try:
248 it = super(tqdm_notebook, self).iter()
--> 249 for obj in it:
250 # return super(tqdm...) will not catch exception
251 yield obj
252 # NB: except ... [ as ...] breaks IPython async KeyboardInterrupt

File ~/anaconda3/envs/nlp/lib/python3.8/site-packages/tqdm/std.py:1182, in tqdm.iter(self)
1179 time = self._time
1181 try:
-> 1182 for obj in iterable:
1183 yield obj
1184 # Update and possibly print the progressbar.
1185 # Note: does not call self.update(1) for speed optimisation.

File ~/anaconda3/envs/nlp/lib/python3.8/multiprocessing/pool.py:868, in IMapIterator.next(self, timeout)
866 if success:
867 return value
--> 868 raise value

IndexError: list index out of range

fedebotu · 2023-09-07T07:50:42Z

Hi, could you give more context for your error? I see you are using a .py file. Is it the same as the .ipynb notebook?
As a quick fix, you could also try re-running your program; as far as I remember, this could be a network error.

DongXingshuai · 2023-09-07T08:26:42Z

@fedebotu Thank you very much for your reply.

I converted the .ipynb file to .py. I have tried a few times, and every run failed with the same error.

fedebotu · 2023-09-07T12:29:25Z

@DongXingshuai I found out why. Apparently, one of the papers has no meta_note, so the meta_note list is empty and meta_note[0] raises the IndexError. Wrapping that lookup in a try... except fixed it!

Here is the updated filter_data function:

```python
def filter_data(item,
                review_keys=('summary_of_the_paper', 'strength_and_weaknesses',
                             'clarity,_quality,_novelty_and_reproducibility', 'summary_of_the_review'),
                decision=True):
    """Filter only ratings, confidence, withdraw status and decisions"""
    # parse each note; default to "not withdrawn" when no meta note exists
    withdraw = 0
    try:
        # filter meta note
        meta_note = [d for d in item if 'Paper' not in d['invitation']]
        # check withdrawn
        withdraw = 1 if 'Withdrawn_Submission' in meta_note[0]['invitation'] else 0
    except Exception:
        # note: simple pass for no meta notes (meta_note[0] raises IndexError)
        pass
    # decision
    # note: the `decision` flag is overwritten with the decision text below,
    # so an empty decision is omitted from the returned dict
    if decision:
        try:
            if withdraw == 0:
                decision_note = [d for d in item if 'Decision' in d['invitation']]
                decision = decision_note[0]['content']['decision']
            else:
                decision = ''
        except Exception:
            # no decision note available
            decision = ''
    # filter reviewer comments
    comment_notes = [d for d in item
                     if 'Official_Review' in d['invitation'] and 'recommendation' in d['content'].keys()]
    # sort by note number, highest first
    comment_notes = sorted(comment_notes, key=lambda d: d['number'])[::-1]
    # ratings and confidences are stored as "<score>: <description>"
    ratings = [int(note['content']['recommendation'].split(':')[0]) for note in comment_notes]
    confidences = [int(note['content']['confidence'].split(':')[0]) for note in comment_notes]
    # review length = total word count over the review text fields
    review_lengths = [sum(len(note['content'][key].split()) for key in review_keys) for note in comment_notes]

    data = {'ratings': ratings, 'confidences': confidences, 'withdraw': withdraw, 'review_lengths': review_lengths}
    if decision:
        data['decision'] = decision
    return data
```
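For anyone who wants to see which papers trigger the error, here is a minimal sketch of the same check written with an explicit emptiness test instead of try... except. The helper names (`first_meta_invitation`, `is_withdrawn`, `find_items_without_meta`) and the invitation strings in the sample data are made up for illustration; they are not part of the notebook or the OpenReview API.

```python
def first_meta_invitation(item):
    """Return the invitation of the first note without 'Paper' in it, or None."""
    # Same filter as in filter_data: meta notes are those without 'Paper'
    meta_note = [d for d in item if 'Paper' not in d['invitation']]
    # An empty list here is what made meta_note[0] raise IndexError
    return meta_note[0]['invitation'] if meta_note else None


def is_withdrawn(item):
    """Return 1 if the first meta note marks a withdrawn submission, else 0."""
    invitation = first_meta_invitation(item)
    return int(invitation is not None and 'Withdrawn_Submission' in invitation)


def find_items_without_meta(notes):
    """Return the indices of items that have no meta note at all."""
    return [i for i, item in enumerate(notes) if first_meta_invitation(item) is None]


# Hypothetical sample data: the invitation strings are invented for this example
item_without_meta = [{'invitation': 'Conference/Paper1/-/Official_Review'}]
item_withdrawn = [{'invitation': 'Conference/-/Withdrawn_Submission'},
                  {'invitation': 'Conference/Paper2/-/Official_Review'}]
sample_notes = [item_without_meta, item_withdrawn]

print(find_items_without_meta(sample_notes))
print([is_withdrawn(item) for item in sample_notes])
```

Running `find_items_without_meta(notes)` serially on the real `notes` list, before starting the `Pool`, should show which entries have no meta note, which is hard to see from the worker traceback alone.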

DongXingshuai · 2023-09-11T00:37:02Z

@fedebotu thank you very much.
