KeyError:'data' when using BaiduImageCrawler #103

Open
tpnam0901 opened this issue Nov 10, 2021 · 6 comments

Comments


tpnam0901 commented Nov 10, 2021

Traceback (most recent call last):
  File "/home/minami/anaconda3/envs/python_function/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "anaconda3/envs/python_function/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "anaconda3/envs/python_function/lib/python3.7/site-packages/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
  File "anaconda3/envs/python_function/lib/python3.7/site-packages/icrawler/builtin/baidu.py", line 120, in parse
    for item in content['data']:
KeyError: 'data'

Hi there!
I ran into this error when using Baidu; Google and Bing work fine.
Is there anything that can fix this?
@ZhiyuanChen added the "bug" and "needs reproduce" labels on Nov 10, 2021
@waduhekx

Hi,
I ran into this error too.
Did you solve this?
Looking forward to your reply.
Thanks.

@ZhiyuanChen added the "help wanted" label and removed the "needs reproduce" label on Apr 25, 2022
@chinasilva

> Hi, I ran into this error too. Did you solve this? Looking forward to your reply. Thanks.

just do it

[screenshot attachment]


liyufan commented Mar 20, 2023

@chinasilva This will yield JSONDecodeError:

Exception in thread parser-001:
Traceback (most recent call last):
  File "*/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "*/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "*/lib/python3.11/site-packages/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
  File "*/lib/python3.11/site-packages/icrawler/builtin/baidu.py", line 116, in parse
    content = json.loads(content, strict=False)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
           ^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

After some experimentation, I found that the following headers work:

headers = {
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/88.0.4324.104 Safari/537.36'),
}

@simonmcnair
Copy link

Would the following be an example of how to do this?

baidu_crawler = BaiduImageCrawler(storage={'root_dir': folder2})
baidu_crawler.session.headers = {
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/88.0.4324.104 Safari/537.36'),
}
baidu_crawler.crawl(keyword=lookfor, offset=0, max_num=1000,
                    min_size=(512, 512), max_size=None)
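
A small variation, in case it helps: assigning to session.headers replaces the session's default headers entirely, so a sketch that only overrides the two headers from @liyufan's comment and keeps the defaults (assuming, as in the snippet above, that the crawler exposes its requests session as .session) could look like this:

from icrawler.builtin import BaiduImageCrawler

baidu_crawler = BaiduImageCrawler(storage={'root_dir': 'baidu_images'})  # hypothetical output dir
# update() merges into the session's default headers instead of replacing them
baidu_crawler.session.headers.update({
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/88.0.4324.104 Safari/537.36'),
})
baidu_crawler.crawl(keyword='cat', offset=0, max_num=10, min_size=(512, 512), max_size=None)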

@Patty-OFurniture

> Hi, I ran into this error too. Did you solve this? Looking forward to your reply. Thanks.
>
> just do it

I do not think the answer is adding Accept-Encoding: gzip, deflate, br

Looks like this uses urllib3 (via requests). urllib3 can import brotli if you have it installed; I assume that is what would add the "br". Otherwise, Accept-Encoding: gzip, deflate, br tells the server the client can handle gzip, deflate, and Brotli responses. If you do not actually have brotli installed, you may get a garbage response.

urllib3 Response
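
For what it's worth, a quick way to check that side of things — a sketch, not part of icrawler, and the probe URL is only an example:

import requests

# Advertise "br" only if a Brotli decoder is actually importable
# (urllib3 picks up brotli / brotlicffi when installed).
try:
    import brotli  # noqa: F401
    accept_encoding = "gzip, deflate, br"
except ImportError:
    accept_encoding = "gzip, deflate"

resp = requests.get("https://image.baidu.com",  # example probe URL
                    headers={"Accept-Encoding": accept_encoding},
                    timeout=10)
# Shows which encoding the server actually chose for the response body
print(resp.headers.get("Content-Encoding"), resp.status_code)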

Accept-Language may work, since most users prefer a specific language. The default headers requests sends, other than User-Agent, are (a quick way to confirm these is sketched after the list):

'Accept-Encoding': 'gzip, deflate'
'Accept': '*/*'
'Connection': 'keep-alive'
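
A quick check of those defaults against whatever requests version is installed:

import requests

# Print the headers a fresh requests session sends by default;
# the crawler's session starts from these before any overrides.
print(dict(requests.Session().headers))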

And this is what my Firefox 121 sends:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br

I would recommend testing to find the minimal set of headers needed to get past the Baidu block.
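
Something along these lines could do it — a rough sketch, where the endpoint and query parameters are only an approximation of what icrawler's Baidu feeder builds:

import itertools
import json
import requests

URL = "https://image.baidu.com/search/acjson"  # approximate endpoint
PARAMS = {"tn": "resultjson_com", "ipn": "rj", "word": "cat", "pn": 0, "rn": 30}

CANDIDATES = {
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.3,en;q=0.2",
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/88.0.4324.104 Safari/537.36"),
}

# Try every subset of the candidate headers and report which ones return usable JSON
for r in range(len(CANDIDATES) + 1):
    for combo in itertools.combinations(CANDIDATES, r):
        headers = {name: CANDIDATES[name] for name in combo}
        resp = requests.get(URL, params=PARAMS, headers=headers, timeout=10)
        try:
            has_data = "data" in json.loads(resp.text, strict=False)
        except json.JSONDecodeError:
            has_data = False
        print(list(combo) or ["defaults only"], "->", "data" if has_data else "blocked")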

@Patty-OFurniture

The response text I get when I see this error is the following JSON, in this and in another project:

{"antiFlag":1,"message":"Forbid spider access","bfe_log_id":"xxxxxx random numbers xxxxxx"}

The correct answer is probably what @liyufan posted: send something that Baidu would expect from a real person. This should be an option somewhere, but I think Chinese and English are what Baidu expects. @simonmcnair's example looks correct to me.

Patty-OFurniture added a commit to Patty-OFurniture/icrawler that referenced this issue Jan 4, 2024
@liyufan posted a fix for "KeyError:'data' when using BaiduImageCrawler" when the response is:

`{"antiFlag":1,"message":"Forbid spider access","bfe_log_id":"xxxxxx random numbers xxxxxx"}`

Other errors should log the JSON received but I did not include that.

if content.get("data"):
    ...
else:
    self.logger.debug(content)

Log default headers in downloader.py at line 116:

while retry > 0 and not self.signal.get("reach_max_num"):
    try:
        response = self.session.get(file_url, timeout=timeout)
        # POF TEMP: log the request headers actually sent
        self.logger.error(response.request.headers)
Patty-OFurniture added a commit to Patty-OFurniture/icrawler that referenced this issue Jan 4, 2024
serser added a commit to serser/icrawler that referenced this issue Aug 23, 2024