KeyError:'data' when using BaiduImageCrawler #103
Comments
Hi,
@chinasilva This will raise:

Exception in thread parser-001:
Traceback (most recent call last):
  File "*/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "*/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "*/lib/python3.11/site-packages/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
  File "*/lib/python3.11/site-packages/icrawler/builtin/baidu.py", line 116, in parse
    content = json.loads(content, strict=False)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
           ^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

After some experimenting, I found that the following headers work:

headers = {
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'User-Agent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/88.0.4324.104 Safari/537.36'
    ),
}
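
For anyone wondering how headers like these get attached to requests, here is a minimal sketch. It assumes the crawler exposes a requests-style session object whose `headers` attribute is a mutable mapping; `apply_headers` and `fake_session` are hypothetical names used only for illustration, and the stand-in session lets the sketch run without icrawler or network access.

```python
from types import SimpleNamespace

# Browser-like headers from the comment above. Note the trailing spaces
# inside the User-Agent pieces: adjacent string literals are joined with
# no separator, so the spaces must be part of the strings themselves.
BROWSER_HEADERS = {
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/88.0.4324.104 Safari/537.36"
    ),
}


def apply_headers(session, headers=BROWSER_HEADERS):
    """Merge headers into a session-like object's header mapping (hypothetical helper)."""
    session.headers.update(headers)
    return session


# Stand-in for a real session (assumption: the real one behaves like this).
fake_session = SimpleNamespace(headers={"Accept-Encoding": "gzip, deflate"})
apply_headers(fake_session)
print(fake_session.headers["User-Agent"])
```

With a real crawler this would presumably be called on its session before starting the crawl. I have not verified the exact attribute name in icrawler, so treat that part as an assumption.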
Is an example of how to do this as follows?
I do not think the answer is adding Looks like this uses urllib3. urllib3 can import brotli if you have it installed. I assume brotli would add the "br". Otherwise
And this is what my Firefox 121 sends:
I would recommend testing to find the minimal set of headers needed to get past Baidu's anti-spider check.
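
One way to run that kind of test is to start from a header set that works and drop headers one at a time. The sketch below is only an illustration: `minimal_headers` and the `works` callback are hypothetical names, and the stand-in predicate is fake so the code runs offline. A real predicate would send a request to Baidu and check that the reply contains usable data.

```python
def minimal_headers(headers, works):
    """Greedily drop headers that are not needed for `works` to succeed.

    `works` is a hypothetical callable: it takes a header dict and returns
    True if a request sent with those headers gets a usable response.
    """
    current = dict(headers)
    for name in list(headers):
        # Try the current set without this one header.
        trial = {k: v for k, v in current.items() if k != name}
        if works(trial):
            current = trial  # this header was not required
    return current


# Stand-in predicate (assumption for the demo): pretend only
# Accept-Language matters to the server.
def fake_works(h):
    return "Accept-Language" in h


candidate = {
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.2",
    "User-Agent": "Mozilla/5.0",
    "Accept-Encoding": "gzip, deflate, br",
}
result = minimal_headers(candidate, fake_works)
print(result)
```

This is a single greedy pass, so it assumes that removing a header never makes a previously unnecessary header necessary. If the server checks combinations of headers, the result may not be the smallest possible set.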
The response text I get when this error occurs is this JSON. In this and another project:
The correct answer is probably what @liyufan posted: send something that Baidu would expect from a real person. This should be an option somewhere, but I think Chinese and English are what Baidu expects. @simonmcnair looks correct to me.
@liyufan posted a fix for "KeyError:'data' when using BaiduImageCrawler" when the response is:
`{"antiFlag":1,"message":"Forbid spider access","bfe_log_id":"xxxxxx random numbers xxxxxx"}`
Other errors should log the JSON received, but I did not include that:
`if content["data"] ... else self.logger.debug(content)`
To log the default headers, in downloader.py at line 116:
    while retry > 0 and not self.signal.get("reach_max_num"):
        try:
            response = self.session.get(file_url, timeout=timeout)
            # POF TEMP:
            self.logger.error(response.request.headers)
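
To make the logging idea concrete, here is a rough sketch of a defensive parse step. It is not the icrawler code: `extract_data` and the logger name are hypothetical, and it only covers the three cases seen in this thread (a non-JSON body, the `antiFlag` refusal, and a reply without a `data` key).

```python
import json
import logging

logger = logging.getLogger("baidu_parse_sketch")  # hypothetical logger name


def extract_data(raw_text):
    """Return the "data" list from a Baidu-style JSON reply, or [] on failure."""
    try:
        content = json.loads(raw_text, strict=False)
    except json.JSONDecodeError:
        # Covers the "Expecting value: line 1 column 1 (char 0)" case.
        logger.error("Response is not JSON: %r", raw_text[:200])
        return []
    if not isinstance(content, dict):
        logger.debug("Unexpected JSON payload: %r", content)
        return []
    if content.get("antiFlag"):
        # e.g. {"antiFlag": 1, "message": "Forbid spider access", ...}
        logger.error("Request refused: %s", content.get("message"))
        return []
    data = content.get("data")
    if data is None:
        # Log the whole reply so other failure modes can be diagnosed.
        logger.debug("No 'data' key in response: %r", content)
        return []
    return data


blocked = '{"antiFlag":1,"message":"Forbid spider access","bfe_log_id":"x"}'
print(extract_data(blocked))
print(extract_data('{"data": [{"thumbURL": "u"}]}'))
```

Using `content.get("data")` instead of `content["data"]` is what avoids the KeyError. The log lines are there so the refusal message shows up instead of a bare traceback.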
Hi there!
I ran into this error when using the Baidu crawler. Google and Bing work fine.
Is there anything that can fix this?