KeyError:'data' when using BaiduImageCrawler #103

Open
tpnam0901 opened this issue Nov 10, 2021 · 6 comments

Comments


tpnam0901 commented Nov 10, 2021

Traceback (most recent call last):
  File "/home/minami/anaconda3/envs/python_function/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "anaconda3/envs/python_function/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "anaconda3/envs/python_function/lib/python3.7/site-packages/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
  File "anaconda3/envs/python_function/lib/python3.7/site-packages/icrawler/builtin/baidu.py", line 120, in parse
    for item in content['data']:
KeyError: 'data'

Hi there!
I ran into this error when using Baidu; Google and Bing work fine.
Is there anything that can fix this?
@ZhiyuanChen added the "bug" and "needs reproduce" labels on Nov 10, 2021
@waduhekx

Hi,
I ran into this error too.
Did you solve this?
Looking forward to your reply.
Thanks.

@ZhiyuanChen added the "help wanted" label and removed the "needs reproduce" label on Apr 25, 2022
@chinasilva

> Hi, I ran into this error too. Did you solve this? Looking forward to your reply. Thanks.

just do it

[screenshot attachment]


liyufan commented Mar 20, 2023

@chinasilva This will yield JSONDecodeError:

Exception in thread parser-001:
Traceback (most recent call last):
  File "*/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "*/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "*/lib/python3.11/site-packages/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
  File "*/lib/python3.11/site-packages/icrawler/builtin/baidu.py", line 116, in parse
    content = json.loads(content, strict=False)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
           ^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

After some experimentation, I found that the following headers work:

headers = {
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/88.0.4324.104 Safari/537.36'),
}

@simonmcnair
Copy link

Would the following be an example of how to do this?

baidu_crawler = BaiduImageCrawler(storage={'root_dir': folder2})
baidu_crawler.session.headers = {
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/88.0.4324.104 Safari/537.36'),
}
baidu_crawler.crawl(keyword=lookfor, offset=0, max_num=1000,
                    min_size=(512, 512), max_size=None)
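
A small variation, in case it helps: assigning to session.headers replaces the session's default headers entirely, so a sketch that only overrides the two headers from @liyufan's comment and keeps the defaults (assuming, as in the snippet above, that the crawler exposes its requests session as .session) could look like this:

from icrawler.builtin import BaiduImageCrawler

baidu_crawler = BaiduImageCrawler(storage={'root_dir': 'baidu_images'})  # hypothetical output dir
# update() merges into the session's default headers instead of replacing them
baidu_crawler.session.headers.update({
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/88.0.4324.104 Safari/537.36'),
})
baidu_crawler.crawl(keyword='cat', offset=0, max_num=10, min_size=(512, 512), max_size=None)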

@Patty-OFurniture

> Hi, I ran into this error too. Did you solve this? Looking forward to your reply. Thanks.
>
> just do it

I do not think the answer is adding Accept-Encoding: gzip, deflate, br

Looks like this uses urllib3 (via requests). urllib3 can import brotli if you have it installed; I assume that is what would add the "br". Otherwise, Accept-Encoding: gzip, deflate, br tells the server the client can handle gzip, deflate, and Brotli responses. If you do not actually have brotli installed, you may get a garbage response.

urllib3 Response
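
For what it's worth, a quick way to check that side of things — a sketch, not part of icrawler, and the probe URL is only an example:

import requests

# Advertise "br" only if a Brotli decoder is actually importable
# (urllib3 picks up brotli / brotlicffi when installed).
try:
    import brotli  # noqa: F401
    accept_encoding = "gzip, deflate, br"
except ImportError:
    accept_encoding = "gzip, deflate"

resp = requests.get("https://image.baidu.com",  # example probe URL
                    headers={"Accept-Encoding": accept_encoding},
                    timeout=10)
# Shows which encoding the server actually chose for the response body
print(resp.headers.get("Content-Encoding"), resp.status_code)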

Accept-Language may work, since most users prefer a specific language. The default headers requests sends, other than User-Agent, are (a quick way to confirm these is sketched after the list):

'Accept-Encoding': 'gzip, deflate'
'Accept': '*/*'
'Connection': 'keep-alive'
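
A quick check of those defaults against whatever requests version is installed:

import requests

# Print the headers a fresh requests session sends by default;
# the crawler's session starts from these before any overrides.
print(dict(requests.Session().headers))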

And this is what my Firefox 121 sends:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br

I would recommend testing to find the minimal set of headers needed to get past the Baidu block.
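
Something along these lines could do it — a rough sketch, where the endpoint and query parameters are only an approximation of what icrawler's Baidu feeder builds:

import itertools
import json
import requests

URL = "https://image.baidu.com/search/acjson"  # approximate endpoint
PARAMS = {"tn": "resultjson_com", "ipn": "rj", "word": "cat", "pn": 0, "rn": 30}

CANDIDATES = {
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.3,en;q=0.2",
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/88.0.4324.104 Safari/537.36"),
}

# Try every subset of the candidate headers and report which ones return usable JSON
for r in range(len(CANDIDATES) + 1):
    for combo in itertools.combinations(CANDIDATES, r):
        headers = {name: CANDIDATES[name] for name in combo}
        resp = requests.get(URL, params=PARAMS, headers=headers, timeout=10)
        try:
            has_data = "data" in json.loads(resp.text, strict=False)
        except json.JSONDecodeError:
            has_data = False
        print(list(combo) or ["defaults only"], "->", "data" if has_data else "blocked")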

@Patty-OFurniture

The response text I get when I see this error is the following JSON, in this and in another project:

{"antiFlag":1,"message":"Forbid spider access","bfe_log_id":"xxxxxx random numbers xxxxxx"}

The correct answer is probably what @liyufan posted: send something that Baidu would expect from a real person. This should be an option somewhere, but I think Chinese and English are what Baidu expects. @simonmcnair's example looks correct to me.

Patty-OFurniture added a commit to Patty-OFurniture/icrawler that referenced this issue Jan 4, 2024
@liyufan posted a fix for "KeyError:'data' when using BaiduImageCrawler" when the response is:

`{"antiFlag":1,"message":"Forbid spider access","bfe_log_id":"xxxxxx random numbers xxxxxx"}`

Other errors should log the JSON received but I did not include that.

if content.get("data"):
    ...
else:
    self.logger.debug(content)

Log default headers in downloader.py at line 116:

while retry > 0 and not self.signal.get("reach_max_num"):
    try:
        response = self.session.get(file_url, timeout=timeout)
        # POF TEMP: log the request headers actually sent
        self.logger.error(response.request.headers)
Patty-OFurniture added a commit to Patty-OFurniture/icrawler that referenced this issue Jan 4, 2024
serser added a commit to serser/icrawler that referenced this issue Aug 23, 2024