Cannot crawl the data from the OpenReview website #3

Open
DongXingshuai opened this issue Sep 7, 2023 · 4 comments

@DongXingshuai

Hi there, I tried to run parse_data.py to crawl data from OpenReview. Unfortunately, it did not work. The error messages are below. Can anybody give me a hand? Thank you!

ipython parse_data.py
Offset: 0 Data: 0
Offset: 1000 Data: 1000
Offset: 2000 Data: 2000
Offset: 3000 Data: 3000
Offset: 4000 Data: 3809
Number of submissions: 3809
Number of papers (including old): 4874
0%| | 0/4874 [00:00<?, ?it/s]
0%| | 0/4874 [00:00<?, ?it/s]

RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/dongxingshuai/anaconda3/envs/nlp/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/dongxingshuai/research_associate/nlp/ICLR2023-OpenReviewData-main/notebooks/parse_data.py", line 166, in filter_data
withdraw = 1 if 'Withdrawn_Submission' in meta_note[0]['invitation'] else 0
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

IndexError Traceback (most recent call last)
File ~/research_associate/nlp/ICLR2023-OpenReviewData-main/notebooks/parse_data.py:195
190 # In[59]:
191
192
193 # filter data in a pool of processes
194 with Pool(8) as p:
--> 195 filtered_notes = list(tqdm(p.imap(filter_data, notes), total=len(notes)))
198 # In[60]:
199
200
201 # create dataframe
202 ratings = pd.DataFrame(filtered_notes)

File ~/anaconda3/envs/nlp/lib/python3.8/site-packages/tqdm/notebook.py:249, in tqdm_notebook.__iter__(self)
247 try:
248 it = super(tqdm_notebook, self).__iter__()
--> 249 for obj in it:
250 # return super(tqdm...) will not catch exception
251 yield obj
252 # NB: except ... [ as ...] breaks IPython async KeyboardInterrupt

File ~/anaconda3/envs/nlp/lib/python3.8/site-packages/tqdm/std.py:1182, in tqdm.__iter__(self)
1179 time = self._time
1181 try:
-> 1182 for obj in iterable:
1183 yield obj
1184 # Update and possibly print the progressbar.
1185 # Note: does not call self.update(1) for speed optimisation.

File ~/anaconda3/envs/nlp/lib/python3.8/multiprocessing/pool.py:868, in IMapIterator.next(self, timeout)
866 if success:
867 return value
--> 868 raise value

IndexError: list index out of range

@fedebotu
Owner

fedebotu commented Sep 7, 2023

Hi, could you give more context for your error? I see you are running a .py file. Is it the same as the .ipynb notebook?
As a quick fix, you could also try re-running the program; as far as I remember, this could be caused by a network error.
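
One way to tell a transient network failure from a data problem is to run the failing check serially and record which papers have no meta note. The following is a minimal sketch, assuming `notes` is the list of per-paper note lists that the script passes to `filter_data` and that each note is a dict with an `invitation` string. `find_missing_meta` is a hypothetical helper, not part of the repository, and the toy invitation strings are made up for illustration.

```python
def find_missing_meta(notes):
    """Return the indices of papers whose note list has no meta note."""
    missing = []
    for idx, item in enumerate(notes):
        # Same filter as in filter_data: a meta note is any note whose
        # invitation does not contain 'Paper'.
        meta_note = [d for d in item if 'Paper' not in d.get('invitation', '')]
        if not meta_note:
            missing.append(idx)
    return missing


# Toy data standing in for the crawled notes (made up for illustration).
toy_notes = [
    [{'invitation': 'Conf/-/Blind_Submission'},
     {'invitation': 'Conf/Paper1/-/Official_Review'}],
    [{'invitation': 'Conf/Paper2/-/Official_Review'}],  # no meta note
    [],                                                 # nothing returned
]
print(find_missing_meta(toy_notes))
```

If the same indices come back on every run, the cause is more likely in the data than in the network.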

@DongXingshuai
Author

@fedebotu Thank you very much for your reply.

I converted the .ipynb file to .py. I have tried a few times, and every run failed with the same error.

@fedebotu
Owner

fedebotu commented Sep 7, 2023

@DongXingshuai I found out why. Apparently, a meta_note was not available for one of the papers (hence the error). Wrapping the lookup in a try... except fixed it!

Here is the updated filter_data function:

```python
def filter_data(item,
                review_keys=['summary_of_the_paper', 'strength_and_weaknesses', 'clarity,_quality,_novelty_and_reproducibility', 'summary_of_the_review'],
                decision=True):
    """Filter only ratings, confidence, withdraw status and decisions"""
    # default to "not withdrawn" when a paper has no meta note
    withdraw = 0
    try:
        # meta notes are the notes whose invitation does not mention 'Paper'
        meta_note = [d for d in item if 'Paper' not in d['invitation']]
        # check withdrawn
        withdraw = 1 if 'Withdrawn_Submission' in meta_note[0]['invitation'] else 0
    except Exception:
        # no meta note available for this paper: keep withdraw = 0
        pass
    # decision
    if decision:
        try:
            if withdraw == 0:
                decision_note = [d for d in item if 'Decision' in d['invitation']]
                decision = decision_note[0]['content']['decision']
            else:
                decision = ''
        except Exception:
            # no decision note (or a malformed one): leave the decision empty
            decision = ''
    # filter reviewer comments
    comment_notes = [d for d in item
                     if 'Official_Review' in d['invitation'] and 'recommendation' in d['content']]
    comment_notes = sorted(comment_notes, key=lambda d: d['number'])[::-1]
    ratings = [int(note['content']['recommendation'].split(':')[0]) for note in comment_notes]
    confidences = [int(note['content']['confidence'].split(':')[0]) for note in comment_notes]
    # review length = total word count over the selected review fields
    review_lengths = [sum(len(note['content'][key].split()) for key in review_keys) for note in comment_notes]

    data = {'ratings': ratings, 'confidences': confidences, 'withdraw': withdraw, 'review_lengths': review_lengths}
    # an empty decision string is falsy, so 'decision' is only added when one was found
    if decision:
        data['decision'] = decision
    return data
```
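
The withdraw check can also be written without a try/except by testing explicitly for a missing meta note. This is a minimal sketch of that alternative, not the repository's code. `is_withdrawn` is a hypothetical helper name, and unlike the version above it does not swallow a `KeyError` from a note that lacks an `invitation` field.

```python
def is_withdrawn(item):
    """Return 1 if the first meta note marks a withdrawn submission, else 0."""
    # First note whose invitation does not contain 'Paper', or None if absent.
    first_meta = next((d for d in item if 'Paper' not in d['invitation']), None)
    if first_meta is None:
        # No meta note: treat the paper as not withdrawn.
        return 0
    return 1 if 'Withdrawn_Submission' in first_meta['invitation'] else 0
```

Using `next` with a default avoids indexing an empty list, which is what raised the `IndexError` in the original traceback.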
    

@DongXingshuai
Author

@fedebotu thank you very much.
