Add browser rendering documentation
Boris-code committed Apr 19, 2021
1 parent b198c2a commit dded96d
Showing 6 changed files with 163 additions and 6 deletions.
1 change: 1 addition & 0 deletions docs/_sidebar.md
@@ -17,6 +17,7 @@
* Advanced usage
* [Request](source_code/Request.md)
* [Response](source_code/Response.md)
* [Browser rendering](source_code/浏览器渲染.md)
* [Parser - BaseParser](source_code/BaseParser.md)
* [Batch parser - BatchParser](source_code/BatchParser.md)
* [Advanced Spider](source_code/Spider进阶.md)
4 changes: 2 additions & 2 deletions docs/source_code/BaseParser.md
@@ -20,11 +20,11 @@ class BaseParser(object):

    def download_midware(self, request):
        """
        @summary: Download middleware; can modify some parameters of the request
        @summary: Download middleware; can modify some parameters of the request, or perform a custom download and then return request, response
        ---------
        @param request:
        ---------
        @result: return request / None (the original request is not modified)
        @result: return request / request, response
        """

        pass
1 change: 1 addition & 0 deletions docs/source_code/Request.md
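@@ -37,6 +37,7 @@ Besides supporting all parameters of requests, what deserves more attention is, within the framework
@param download_midware: Download middleware. Defaults to the parser's download_midware
@param is_abandoned: Whether to give up retrying when an exception occurs, True/False. Defaults to False
@param render: Whether to render with a browser
@param render_time: Render time, i.e. how long to wait after opening the page before fetching the source
--
The parameters below are used the same way as in requests
@param method: Request method, such as POST or GET; by default it is inferred from whether data is empty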
136 changes: 136 additions & 0 deletions docs/source_code/浏览器渲染.md
@@ -0,0 +1,136 @@
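# Browser rendering

When crawling dynamic pages (pages rendered via Ajax), there are two common approaches. One is to find the underlying API and assemble its parameters; this is more complex but efficient, and it requires some crawling experience. The other is to render the page in a browser and read the resulting source directly, which is simple and convenient.

The framework has a built-in browser rendering pool with a default size of 1. Browser instances are reused across requests; only when a request fails because the proxy has become invalid is the instance destroyed and a new one created.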
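
To make the reuse behaviour concrete, here is a minimal sketch of a pool of this kind. It is not feapder's actual implementation: the class name `BrowserPool`, its methods, and the zero-argument `create_browser` factory are all hypothetical and exist only to illustrate "reuse until a failure, then replace".

```python
import queue
import threading


class BrowserPool:
    """Illustrative pool of reusable browser-like objects (hypothetical API)."""

    def __init__(self, create_browser, pool_size=1):
        self._create_browser = create_browser  # zero-argument factory (assumption)
        self._pool_size = pool_size
        self._created = 0  # number of live instances handed out or idle
        self._idle = queue.Queue()
        self._lock = threading.Lock()

    def get(self):
        # Prefer an idle instance so browsers are reused across requests.
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            pass
        # Create a new instance only while under the size limit.
        with self._lock:
            can_create = self._created < self._pool_size
            if can_create:
                self._created += 1
        if can_create:
            return self._create_browser()
        # Pool is full: block until another caller returns an instance.
        return self._idle.get()

    def put(self, browser):
        # Return a healthy instance to the pool for reuse.
        self._idle.put(browser)

    def remove(self, browser):
        # Drop a broken instance (e.g. its proxy failed) so that the
        # next get() is allowed to create a fresh one.
        with self._lock:
            self._created -= 1
        quit_browser = getattr(browser, "quit", None)
        if callable(quit_browser):
            quit_browser()
```

With `pool_size=1`, calling `get()`, `put()`, and `get()` again yields the same object; after `remove()`, the next `get()` builds a new one.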

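## Usage:

```python
def start_requests(self):
    yield feapder.Request("https://news.qq.com/", render=True)
```
Simply pass `render=True` in the Request you yield.

The framework supports two browsers for rendering, `CHROME` and `PHANTOMJS`, which can be configured through the [configuration file](source_code/配置文件). The relevant settings are: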

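```python
# Browser rendering
WEBDRIVER = dict(
    pool_size=1,  # number of browsers
    load_images=True,  # whether to load images
    user_agent=None,  # a string, or a zero-argument function whose return value is the user_agent
    proxy=None,  # xxx.xxx.xxx.xxx:xxxx, or a zero-argument function whose return value is the proxy address
    headless=False,  # whether to run a headless browser
    driver_type="CHROME",  # CHROME or PHANTOMJS
    timeout=30,  # request timeout
    window_size=(1024, 800),  # window size
    executable_path=None,  # browser path; defaults to the default path
    render_time=0,  # render time, i.e. how long to wait after opening the page before fetching the source
)
```

`feapder.Request` also accepts a `render_time` parameter, which takes precedence over `render_time` in the configuration file.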
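
The precedence rule can be sketched as a small helper. This is only an illustration of "request value wins over the setting"; the function name `effective_render_time` is hypothetical, and treating `None` as "not set on the request" is an assumption rather than feapder's documented behaviour.

```python
def effective_render_time(request_render_time=None, setting_render_time=0):
    """Return the wait time to use: the per-request value if given, else the setting.

    Assumption: None means the request did not specify render_time.
    """
    if request_render_time is not None:
        return request_render_time
    return setting_render_time


# On the request side this corresponds to something like:
#   feapder.Request(url, render=True, render_time=3)
```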

## Setting the User-Agent

> Takes effect each time a new browser instance is created

### Method 1:

Set it through the `user_agent` parameter in the configuration file.
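
Since `user_agent` may be a zero-argument function, one possible use is to pick a different value for each new browser instance. The sketch below assumes a hypothetical `USER_AGENTS` list and a hypothetical helper `random_user_agent`; only the `user_agent` key of `WEBDRIVER` comes from the settings shown above.

```python
import random

# Hypothetical pool of User-Agent strings; replace with the values you need.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36",
]


def random_user_agent():
    """Zero-argument function whose return value is used as the user_agent."""
    return random.choice(USER_AGENTS)


WEBDRIVER = dict(
    user_agent=random_user_agent,  # pass the function itself, not its result
)
```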

### Method 2:

Carry it on `feapder.Request`; this takes precedence over the configuration file. For example:

```python
def download_midware(self, request):
    request.headers = {
        "User-Agent": "xxxxxxxx"
    }
    return request
```

## Setting a proxy

> Takes effect each time a new browser instance is created

### Method 1:

Set it through the `proxy` parameter in the configuration file.
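
As with `user_agent`, `proxy` may be a zero-argument function. A minimal sketch that rotates through a fixed list is shown here; the `PROXIES` list (using reserved documentation addresses) and the helper `next_proxy` are hypothetical.

```python
import itertools

# Hypothetical proxy list in host:port form; 192.0.2.0/24 is reserved for documentation.
PROXIES = ["192.0.2.10:8080", "192.0.2.11:8080"]
_proxy_cycle = itertools.cycle(PROXIES)


def next_proxy():
    """Zero-argument function whose return value is used as the proxy address."""
    return next(_proxy_cycle)


WEBDRIVER = dict(
    proxy=next_proxy,  # called when a proxy address is needed
)
```

Because the proxy takes effect only when a new browser instance is created, a rotation like this would move to the next address each time the pool replaces a browser.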

### Method 2:

Carry it on `feapder.Request`; this takes precedence over the configuration file. For example:

```python
def download_midware(self, request):
    request.proxies = {
        "http": "http://xxx.xxx.xxx.xxx:xxxx"
    }
    return request
```

Or

```python
def download_midware(self, request):
    request.proxies = {
        "https": "https://xxx.xxx.xxx.xxx:xxxx"
    }
    return request
```

## Setting cookies

Carry them on `feapder.Request`, for example:

```python
def download_midware(self, request):
    request.headers = {
        "Cookie": "key=value; key2=value2"
    }
    return request
```

Or

```python
def download_midware(self, request):
    request.cookies = {
        "key": "value",
        "key2": "value2",
    }
    return request
```
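
The two forms carry the same information. If you have a cookie string copied from a browser and want the dict form (or the reverse), a standard-library conversion might look like the sketch below; the helper names are hypothetical and not part of feapder.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(cookie_header):
    """Parse a "k=v; k2=v2" Cookie header string into a plain dict."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    return {name: morsel.value for name, morsel in cookie.items()}


def cookie_dict_to_header(cookies):
    """Join a dict of cookies back into a Cookie header string."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())
```

For instance, `cookie_header_to_dict("key=value; key2=value2")` gives the dict assigned to `request.cookies` in the second example.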

## Working with the browser object

Get the browser object through `response.browser`.

Code example: request Baidu and search for feapder.

```python
import time

import feapder
from feapder.utils.webdriver import WebDriver


class TestRender(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("http://www.baidu.com", render=True)

    def parse(self, request, response):
        browser: WebDriver = response.browser
        # Note: find_element_by_id was removed in Selenium 4.3; newer versions use find_element(By.ID, ...)
        browser.find_element_by_id("kw").send_keys("feapder")
        browser.find_element_by_id("su").click()
        time.sleep(5)  # wait for the search results to load
        print(browser.page_source)


if __name__ == "__main__":
    TestRender().start()
```
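
The fixed `time.sleep(5)` in the example always waits the full five seconds. One alternative, sketched here with the standard library only, is to poll a condition and stop as soon as it holds. The helper `wait_until` is hypothetical and not a feapder or Selenium API.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.2):
    """Poll predicate() until it returns a truthy value or timeout seconds pass.

    Returns True if the condition was met, False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In `parse`, the sleep could then become something like `wait_until(lambda: marker in browser.page_source, timeout=5)`, where `marker` is a placeholder for text you expect to appear once the results have loaded.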
7 changes: 3 additions & 4 deletions docs/usage/AirSpider.md
@@ -239,21 +239,20 @@ def start_requests(self):

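```python
# Browser rendering
# Browser rendering
WEBDRIVER = dict(
    pool_size=2,  # number of browsers
    load_images=False,  # whether to load images
    pool_size=1,  # number of browsers
    load_images=True,  # whether to load images
    user_agent=None,  # a string, or a zero-argument function whose return value is the user_agent
    proxy=None,  # xxx.xxx.xxx.xxx:xxxx, or a zero-argument function whose return value is the proxy address
    headless=False,  # whether to run a headless browser
    driver_type="CHROME",  # CHROME or PHANTOMJS
    timeout=30,  # request timeout
    window_size=(1024, 800),  # window size
    executable_path=None,  # browser path; defaults to the default path
    render_time=0,  # render time, i.e. how long to wait after opening the page before fetching the source
)
```

Images are not loaded by default, to speed up rendering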

## 14. Complete code example

20 changes: 20 additions & 0 deletions tests/test_rander3.py
@@ -0,0 +1,20 @@
import time

import feapder
from feapder.utils.webdriver import WebDriver


class TestRender(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("http://www.baidu.com", render=True)

    def parse(self, request, response):
        browser: WebDriver = response.browser
        browser.find_element_by_id("kw").send_keys("feapder")
        browser.find_element_by_id("su").click()
        time.sleep(5)
        print(browser.page_source)


if __name__ == "__main__":
    TestRender().start()
