-
Notifications
You must be signed in to change notification settings - Fork 494
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
b198c2a
commit dded96d
Showing
6 changed files
with
163 additions
and
6 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,136 @@ | ||
# 浏览器渲染 | ||
|
||
采集动态页面时(Ajax渲染的页面),常用的有两种方案。一种是找接口拼参数,这种方式比较复杂但效率高,需要一定的爬虫功底;另外一种是采用浏览器渲染的方式,直接获取源码,简单方便 | ||
|
||
框架内置一个浏览器渲染池,默认的池子大小为1,请求时重复利用浏览器实例,只有当代理失效请求异常时,才会销毁、创建一个新的浏览器实例 | ||
|
||
## 使用方式: | ||
|
||
```python | ||
def start_requests(self): | ||
yield feapder.Request("https://news.qq.com/", render=True) | ||
``` | ||
在返回的Request中传递`render=True`即可 | ||
|
||
框架支持`CHROME`和`PHANTOMJS`两种浏览器渲染,可通过[配置文件](source_code/配置文件)进行配置。相关配置如下: | ||
|
||
```python | ||
# 浏览器渲染 | ||
WEBDRIVER = dict( | ||
pool_size=1, # 浏览器的数量 | ||
load_images=True, # 是否加载图片 | ||
user_agent=None, # 字符串 或 无参函数,返回值为user_agent | ||
proxy=None, # xxx.xxx.xxx.xxx:xxxx 或 无参函数,返回值为代理地址 | ||
headless=False, # 是否为无头浏览器 | ||
driver_type="CHROME", # CHROME 或 PHANTOMJS, | ||
timeout=30, # 请求超时时间 | ||
window_size=(1024, 800), # 窗口大小 | ||
executable_path=None, # 浏览器路径,默认为默认路径 | ||
render_time=0, # 渲染时长,即打开网页等待指定时间后再获取源码 | ||
) | ||
``` | ||
|
||
`feapder.Request` 也支持`render_time`参数, 优先级大于配置文件中的`render_time` | ||
|
||
## 设置User-Agent | ||
|
||
> 每次生成一个新的浏览器实例时生效 | ||
### 方式1: | ||
|
||
通过配置文件的 `user_agent` 参数设置 | ||
|
||
### 方式2: | ||
|
||
通过 `feapder.Request`携带,优先级大于配置文件, 如: | ||
|
||
```python | ||
def download_midware(self, request): | ||
request.headers = { | ||
"User-Agent": "xxxxxxxx" | ||
} | ||
return request | ||
``` | ||
|
||
## 设置代理 | ||
|
||
> 每次生成一个新的浏览器实例时生效 | ||
### 方式1: | ||
|
||
通过配置文件的 `proxy` 参数设置 | ||
|
||
### 方式2: | ||
|
||
通过 `feapder.Request`携带,优先级大于配置文件, 如: | ||
|
||
```python | ||
def download_midware(self, request): | ||
request.proxies = { | ||
"http": "http://xxx.xxx.xxx.xxx:xxxx" | ||
} | ||
return request | ||
``` | ||
|
||
或者 | ||
|
||
```python | ||
def download_midware(self, request): | ||
request.proxies = { | ||
"https": "https://xxx.xxx.xxx.xxx:xxxx" | ||
} | ||
return request | ||
``` | ||
|
||
## 设置Cookie | ||
|
||
通过 `feapder.Request`携带,如: | ||
|
||
```python | ||
def download_midware(self, request): | ||
request.headers = { | ||
"Cookie": "key=value; key2=value2" | ||
} | ||
return request | ||
``` | ||
|
||
或着 | ||
|
||
```python | ||
def download_midware(self, request): | ||
request.cookies = { | ||
"key": "value", | ||
"key2": "value2", | ||
} | ||
return request | ||
``` | ||
|
||
## 操作浏览器对象 | ||
|
||
通过 `response.browser` 获取浏览器对象 | ||
|
||
代码示例:请求百度,搜索feapder | ||
|
||
```python | ||
import time | ||
|
||
import feapder | ||
from feapder.utils.webdriver import WebDriver | ||
|
||
|
||
class TestRender(feapder.AirSpider): | ||
def start_requests(self): | ||
yield feapder.Request("http://www.baidu.com", render=True) | ||
|
||
def parse(self, request, response): | ||
browser: WebDriver = response.browser | ||
browser.find_element_by_id("kw").send_keys("feapder") | ||
browser.find_element_by_id("su").click() | ||
time.sleep(5) | ||
print(browser.page_source) | ||
|
||
|
||
if __name__ == "__main__": | ||
TestRender().start() | ||
|
||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
import time | ||
|
||
import feapder | ||
from feapder.utils.webdriver import WebDriver | ||
|
||
|
||
class TestRender(feapder.AirSpider): | ||
def start_requests(self): | ||
yield feapder.Request("http://www.baidu.com", render=True) | ||
|
||
def parse(self, request, response): | ||
browser: WebDriver = response.browser | ||
browser.find_element_by_id("kw").send_keys("feapder") | ||
browser.find_element_by_id("su").click() | ||
time.sleep(5) | ||
print(browser.page_source) | ||
|
||
|
||
if __name__ == "__main__": | ||
TestRender().start() |