diff --git a/README.md b/README.md
index 80dffe49..88caf34b 100644
--- a/README.md
+++ b/README.md
@@ -20,22 +20,24 @@ ### 1.拥有强大的监控,保障数据质量
 
-![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2021/09/14/16316112326191.jpg)
+![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2022/10/12/16655595870715.jpg)
 
 监控面板:[点击查看详情](http://feapder.com/#/feapder_platform/feaplat)
 
-### 2. 内置多维度的报警(支持 钉钉、企业微信、邮箱)
+### 2. 内置多维度的报警(支持 钉钉、企业微信、飞书、邮箱)
 
 ![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/20/16084718974597.jpg)
 
 ![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/29/16092335882158.jpg)
 
 ![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/20/16084718683378.jpg)
 
-### 3. 简单易用,内置三种爬虫,可应对各种需求场景
+### 3. 简单易用,内置四种爬虫,可应对各种需求场景
 
 - `AirSpider` 轻量爬虫:学习成本低,可快速上手
 
-- `Spider` 分布式爬虫:支持断点续爬、爬虫报警、数据自动入库等功能
+- `Spider` 分布式爬虫:支持断点续爬、爬虫报警等功能,可加快爬虫采集速度
+
+- `TaskSpider` 任务爬虫:从任务表里取任务做,内置支持对接redis、mysql任务表,亦可扩展其他任务来源
 
 - `BatchSpider` 批次爬虫:可周期性的采集数据,自动将数据按照指定的采集周期划分。(如每7天全量更新一次商品销量的需求)
 
@@ -44,7 +46,6 @@ ## 文档地址
 
 - 官方文档:http://feapder.com
-- 国内文档:https://boris-code.gitee.io/feapder
 - 境外文档:https://boris.org.cn/feapder
 - github:https://github.com/Boris-code/feapder
 - 更新日志:https://github.com/Boris-code/feapder/releases
diff --git a/docs/README.md b/docs/README.md
index 1e16f601..d5b08028 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -16,21 +16,23 @@ ### 1.拥有强大的监控,保障数据质量
 
-![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2021/09/14/16316112326191.jpg)
+![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2022/10/12/16655595870715.jpg)
 
 监控面板:[点击查看详情](http://feapder.com/#/feapder_platform/feaplat)
 
-### 2. 内置多维度的报警(支持 钉钉、企业微信、邮箱)
+### 2. 内置多维度的报警(支持 钉钉、企业微信、飞书、邮箱)
 
 ![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/20/16084718974597.jpg)
 
 ![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/29/16092335882158.jpg)
 
 ![](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/20/16084718683378.jpg)
 
-### 3. 简单易用,内置三种爬虫,可应对各种需求场景
+### 3. 简单易用,内置四种爬虫,可应对各种需求场景
 
 - `AirSpider` 轻量爬虫:学习成本低,可快速上手
 
-- `Spider` 分布式爬虫:支持断点续爬、爬虫报警、数据自动入库等功能
+- `Spider` 分布式爬虫:支持断点续爬、爬虫报警等功能,可加快爬虫采集速度
+
+- `TaskSpider` 任务爬虫:从任务表里取任务做,内置支持对接redis、mysql任务表,亦可扩展其他任务来源
 
 - `BatchSpider` 批次爬虫:可周期性的采集数据,自动将数据按照指定的采集周期划分。(如每7天全量更新一次商品销量的需求)
 
@@ -39,7 +41,6 @@ ## 文档地址
 
 - 官方文档:http://feapder.com
-- 国内文档:https://boris-code.gitee.io/feapder
 - 境外文档:https://boris.org.cn/feapder
 - github:https://github.com/Boris-code/feapder
 - 更新日志:https://github.com/Boris-code/feapder/releases
diff --git a/docs/_sidebar.md b/docs/_sidebar.md
index 684d9e64..26e1fc15 100644
--- a/docs/_sidebar.md
+++ b/docs/_sidebar.md
@@ -20,7 +20,8 @@
   * [响应-Response](source_code/Response.md)
   * [代理使用说明](source_code/proxy.md)
   * [用户池说明](source_code/UserPool.md)
-  * [浏览器渲染](source_code/浏览器渲染.md)
+  * [浏览器渲染-Selenium](source_code/浏览器渲染-Selenium.md)
+  * [浏览器渲染-Playwright](source_code/浏览器渲染-Playwright.md)
   * [解析器-BaseParser](source_code/BaseParser.md)
   * [批次解析器-BatchParser](source_code/BatchParser.md)
   * [Spider进阶](source_code/Spider进阶.md)
diff --git "a/docs/source_code/\346\265\217\350\247\210\345\231\250\346\270\262\346\237\223-Playwright.md" "b/docs/source_code/\346\265\217\350\247\210\345\231\250\346\270\262\346\237\223-Playwright.md"
new file mode 100644
index 00000000..8483b126
--- /dev/null
+++ "b/docs/source_code/\346\265\217\350\247\210\345\231\250\346\270\262\346\237\223-Playwright.md"
@@ -0,0 +1,258 @@
+# 浏览器渲染-Playwright
+
+采集动态页面时(Ajax渲染的页面),常用的有两种方案。一种是找接口拼参数,这种方式比较复杂但效率高,需要一定的爬虫功底;另外一种是采用浏览器渲染的方式,直接获取源码,简单方便
+
+框架支持playwright渲染下载,每个线程持有一个playwright实例
+
+
+## 使用方式:
+
+1. 修改配置文件的渲染下载器:
+
+    ```
+    RENDER_DOWNLOADER="feapder.network.downloader.PlaywrightDownloader"
+    ```
+2. 使用
+
+    ```python
+    def start_requests(self):
+        yield feapder.Request("https://news.qq.com/", render=True)
+    ```
+
+在返回的Request中传递`render=True`即可
+
+框架支持`chromium`、`firefox`、`webkit` 三种浏览器渲染,可通过[配置文件](source_code/配置文件)进行配置。相关配置如下:
+
+```python
+PLAYWRIGHT = dict(
+    user_agent=None,  # 字符串 或 无参函数,返回值为user_agent
+    proxy=None,  # xxx.xxx.xxx.xxx:xxxx 或 无参函数,返回值为代理地址
+    headless=False,  # 是否为无头浏览器
+    driver_type="chromium",  # chromium、firefox、webkit
+    timeout=30,  # 请求超时时间
+    window_size=(1024, 800),  # 窗口大小
+    executable_path=None,  # 浏览器路径,默认为默认路径
+    download_path=None,  # 下载文件的路径
+    render_time=0,  # 渲染时长,即打开网页等待指定时间后再获取源码
+    wait_until="networkidle",  # 等待页面加载完成的事件,可选值:"commit", "domcontentloaded", "load", "networkidle"
+    use_stealth_js=False,  # 使用stealth.min.js隐藏浏览器特征
+    page_on_event_callback=None,  # page.on() 事件的回调 如 page_on_event_callback={"dialog": lambda dialog: dialog.accept()}
+    storage_state_path=None,  # 保存浏览器状态的路径
+    url_regexes=None,  # 拦截接口,支持正则,数组类型
+    save_all=False,  # 是否保存所有拦截的接口, 配合url_regexes使用,为False时只保存最后一次拦截的接口
+)
+```
+
+ - `feapder.Request` 也支持`render_time`参数, 优先级大于配置文件中的`render_time`
+
+ - 代理使用优先级:`feapder.Request`指定的代理 > 配置文件中的`PROXY_EXTRACT_API` > 配置文件`PLAYWRIGHT`中的`proxy`
+
+ - user_agent使用优先级:`feapder.Request`指定的header里的`User-Agent` > 框架随机的`User-Agent` > 配置文件`PLAYWRIGHT`中的`user_agent`
+
+## 设置User-Agent
+
+> 每次生成一个新的浏览器实例时生效
+
+### 方式1:
+
+通过配置文件的 `user_agent` 参数设置
+
+### 方式2:
+
+通过 `feapder.Request`携带,优先级大于配置文件, 如:
+
+```python
+def download_midware(self, request):
+    request.headers = {
+        "User-Agent": "xxxxxxxx"
+    }
+    return request
+```
+
+## 设置代理
+
+> 每次生成一个新的浏览器实例时生效
+
+### 方式1:
+
+通过配置文件的 `proxy` 参数设置
+
+### 方式2:
+
+通过 `feapder.Request`携带,优先级大于配置文件, 如:
+
+```python
+def download_midware(self, request):
+    request.proxies = {
+        "https": "https://xxx.xxx.xxx.xxx:xxxx"
+    }
+    return request
+```
+
+## 设置Cookie
+
+通过 `feapder.Request`携带,如:
+
+```python
+def download_midware(self, request):
+    request.headers = {
+        "Cookie": "key=value; key2=value2"
+    }
+    return request
+```
+
+或者
+
+```python
+def download_midware(self, request):
+    request.cookies = {
+        "key": "value",
+        "key2": "value2",
+    }
+    return request
+```
+
+或者
+
+```python
+def download_midware(self, request):
+    request.cookies = [
+        {
+            "domain": "xxx",
+            "name": "xxx",
+            "value": "xxx",
+            "expirationDate": "xxx"
+        },
+    ]
+    return request
+```
+
+## 拦截数据示例
+
+> 注意:主函数使用run方法运行,不能使用start
+
+```python
+from playwright.sync_api import Response
+from feapder.utils.webdriver import (
+    PlaywrightDriver,
+    InterceptResponse,
+    InterceptRequest,
+)
+
+import feapder
+
+
+def on_response(response: Response):
+    print(response.url)
+
+
+class TestPlaywright(feapder.AirSpider):
+    __custom_setting__ = dict(
+        RENDER_DOWNLOADER="feapder.network.downloader.PlaywrightDownloader",
+        PLAYWRIGHT=dict(
+            user_agent=None,  # 字符串 或 无参函数,返回值为user_agent
+            proxy=None,  # xxx.xxx.xxx.xxx:xxxx 或 无参函数,返回值为代理地址
+            headless=False,  # 是否为无头浏览器
+            driver_type="chromium",  # chromium、firefox、webkit
+            timeout=30,  # 请求超时时间
+            window_size=(1024, 800),  # 窗口大小
+            executable_path=None,  # 浏览器路径,默认为默认路径
+            download_path=None,  # 下载文件的路径
+            render_time=0,  # 渲染时长,即打开网页等待指定时间后再获取源码
+            wait_until="networkidle",  # 等待页面加载完成的事件,可选值:"commit", "domcontentloaded", "load", "networkidle"
+            use_stealth_js=False,  # 使用stealth.min.js隐藏浏览器特征
+            # page_on_event_callback=dict(response=on_response),  # 监听response事件
+            # page.on() 事件的回调 如 page_on_event_callback={"dialog": lambda dialog: dialog.accept()}
+            storage_state_path=None,  # 保存浏览器状态的路径
+            url_regexes=["wallpaper/list"],  # 拦截接口,支持正则,数组类型
+            save_all=True,  # 是否保存所有拦截的接口
+        ),
+    )
+
+    def start_requests(self):
+        yield feapder.Request(
+            "http://www.soutushenqi.com/image/search/?searchWord=%E6%A0%91%E5%8F%B6",
+            render=True,
+        )
+
+    def parse(self, request, response):
+        driver: PlaywrightDriver = response.driver
+
+        intercept_response: InterceptResponse = driver.get_response("wallpaper/list")
+        intercept_request: InterceptRequest = intercept_response.request
+
+        req_url = intercept_request.url
+        req_header = intercept_request.headers
+        req_data = intercept_request.data
+        print("请求url", req_url)
+        print("请求header", req_header)
+        print("请求data", req_data)
+
+        data = driver.get_json("wallpaper/list")
+        print("接口返回的数据", data)
+
+        print("------ 测试save_all=True ------- ")
+
+        # 测试save_all=True
+        all_intercept_response: list = driver.get_all_response("wallpaper/list")
+        for intercept_response in all_intercept_response:
+            intercept_request: InterceptRequest = intercept_response.request
+            req_url = intercept_request.url
+            req_header = intercept_request.headers
+            req_data = intercept_request.data
+            print("请求url", req_url)
+            print("请求header", req_header)
+            print("请求data", req_data)
+
+        all_intercept_json = driver.get_all_json("wallpaper/list")
+        for intercept_json in all_intercept_json:
+            print("接口返回的数据", intercept_json)
+
+        # 千万别忘了清除拦截缓存
+        driver.clear_cache()
+
+
+if __name__ == "__main__":
+    TestPlaywright(thread_count=1).run()
+```
+可通过配置的`page_on_event_callback`参数自定义事件的回调,如设置`on_response`的事件回调,亦可直接使用`url_regexes`设置拦截的接口
+
+## 操作浏览器对象示例
+
+> 注意:主函数使用run方法运行,不能使用start
+
+```python
+import time
+
+from playwright.sync_api import Page
+
+import feapder
+from feapder.utils.webdriver import PlaywrightDriver
+
+
+class TestPlaywright(feapder.AirSpider):
+    __custom_setting__ = dict(
+        RENDER_DOWNLOADER="feapder.network.downloader.PlaywrightDownloader",
+    )
+
+    def start_requests(self):
+        yield feapder.Request("https://www.baidu.com", render=True)
+
+    def parse(self, request, response):
+        driver: PlaywrightDriver = response.driver
+        page: Page = driver.page
+
+        page.type("#kw", "feapder")
+        page.click("#su")
+        page.wait_for_load_state("networkidle")
+        time.sleep(1)
+
+        html = page.content()
+        response.text = html  # 使response加载最新的页面
+        for data_container in response.xpath("//div[@class='c-container']"):
+            print(data_container.xpath("string(.//h3)").extract_first())
+
+
+if __name__ == "__main__":
+    TestPlaywright(thread_count=1).run()
+```
\ No newline at end of file
diff --git "a/docs/source_code/\346\265\217\350\247\210\345\231\250\346\270\262\346\237\223.md" "b/docs/source_code/\346\265\217\350\247\210\345\231\250\346\270\262\346\237\223-Selenium.md"
similarity index 97%
rename from "docs/source_code/\346\265\217\350\247\210\345\231\250\346\270\262\346\237\223.md"
rename to "docs/source_code/\346\265\217\350\247\210\345\231\250\346\270\262\346\237\223-Selenium.md"
index 7414cfb9..665f5aed 100644
--- "a/docs/source_code/\346\265\217\350\247\210\345\231\250\346\270\262\346\237\223.md"
+++ "b/docs/source_code/\346\265\217\350\247\210\345\231\250\346\270\262\346\237\223-Selenium.md"
@@ -1,4 +1,4 @@
-# 浏览器渲染
+# 浏览器渲染-Selenium
 
 采集动态页面时(Ajax渲染的页面),常用的有两种方案。一种是找接口拼参数,这种方式比较复杂但效率高,需要一定的爬虫功底;另外一种是采用浏览器渲染的方式,直接获取源码,简单方便
 
@@ -73,16 +73,6 @@ def download_midware(self, request):
 
 通过 `feapder.Request`携带,优先级大于配置文件, 如:
 
-```python
-def download_midware(self, request):
-    request.proxies = {
-        "http": "http://xxx.xxx.xxx.xxx:xxxx"
-    }
-    return request
-```
-
-或者
-
 ```python
 def download_midware(self, request):
     request.proxies = {
@@ -114,6 +104,21 @@ def download_midware(self, request):
     return request
 ```
 
+或者
+
+```python
+def download_midware(self, request):
+    request.cookies = [
+        {
+            "domain": "xxx",
+            "name": "xxx",
+            "value": "xxx",
+            "expirationDate": "xxx"
+        },
+    ]
+    return request
+```
+
 ## 操作浏览器对象
 
 通过 `response.browser` 获取浏览器对象
diff --git "a/docs/source_code/\351\205\215\347\275\256\346\226\207\344\273\266.md" "b/docs/source_code/\351\205\215\347\275\256\346\226\207\344\273\266.md"
index 6ca1d936..547a6d16 100644
--- "a/docs/source_code/\351\205\215\347\275\256\346\226\207\344\273\266.md"
+++ "b/docs/source_code/\351\205\215\347\275\256\346\226\207\344\273\266.md"
@@ -8,103 +8,188 @@
 
 ![-w378](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/30/16093189206589.jpg)
 
 ```python
-import os
+# -*- coding: utf-8 -*-
+"""爬虫配置文件"""
+# import os
+# import sys
+#
+# # MYSQL
+# MYSQL_IP = "localhost"
+# MYSQL_PORT = 3306
+# MYSQL_DB = ""
+# MYSQL_USER_NAME = ""
+# MYSQL_USER_PASS = ""
+#
+# # MONGODB
+# MONGO_IP = "localhost"
+# MONGO_PORT = 27017
+# MONGO_DB = ""
+# MONGO_USER_NAME = ""
+# MONGO_USER_PASS = ""
+#
+# # REDIS
+# # ip:port 多个可写为列表或者逗号隔开 如 ip1:port1,ip2:port2 或 ["ip1:port1", "ip2:port2"]
+# REDISDB_IP_PORTS = "localhost:6379"
+# REDISDB_USER_PASS = ""
+# REDISDB_DB = 0
+# # 适用于redis哨兵模式
+# REDISDB_SERVICE_NAME = ""
+#
+# # 数据入库的pipeline,可自定义,默认MysqlPipeline
+# ITEM_PIPELINES = [
+#     "feapder.pipelines.mysql_pipeline.MysqlPipeline",
+#     # "feapder.pipelines.mongo_pipeline.MongoPipeline",
+#     # "feapder.pipelines.console_pipeline.ConsolePipeline",
+# ]
+# EXPORT_DATA_MAX_FAILED_TIMES = 10  # 导出数据时最大的失败次数,包括保存和更新,超过这个次数报警
+# EXPORT_DATA_MAX_RETRY_TIMES = 10  # 导出数据时最大的重试次数,包括保存和更新,超过这个次数则放弃重试
+#
+# # 爬虫相关
+# # COLLECTOR
+# COLLECTOR_TASK_COUNT = 32  # 每次获取任务数量,追求速度推荐32
+#
+# # SPIDER
+# SPIDER_THREAD_COUNT = 1  # 爬虫并发数,追求速度推荐32
+# # 下载时间间隔 单位秒。 支持随机 如 SPIDER_SLEEP_TIME = [2, 5] 则间隔为 2~5秒之间的随机数,包含2和5
+# SPIDER_SLEEP_TIME = 0
+# SPIDER_MAX_RETRY_TIMES = 10  # 每个请求最大重试次数
+# KEEP_ALIVE = False  # 爬虫是否常驻
+
+# 下载
+# DOWNLOADER = "feapder.network.downloader.RequestsDownloader"
+# SESSION_DOWNLOADER = "feapder.network.downloader.RequestsSessionDownloader"
+# RENDER_DOWNLOADER = "feapder.network.downloader.SeleniumDownloader"
+# # RENDER_DOWNLOADER="feapder.network.downloader.PlaywrightDownloader",
+# MAKE_ABSOLUTE_LINKS = True  # 自动转成绝对链接
+
+# # 浏览器渲染
+# WEBDRIVER = dict(
+#     pool_size=1,  # 浏览器的数量
+#     load_images=True,  # 是否加载图片
+#     user_agent=None,  # 字符串 或 无参函数,返回值为user_agent
+#     proxy=None,  # xxx.xxx.xxx.xxx:xxxx 或 无参函数,返回值为代理地址
+#     headless=False,  # 是否为无头浏览器
+#     driver_type="CHROME",  # CHROME、PHANTOMJS、FIREFOX
+#     timeout=30,  # 请求超时时间
+#     window_size=(1024, 800),  # 窗口大小
+#     executable_path=None,  # 浏览器路径,默认为默认路径
+#     render_time=0,  # 渲染时长,即打开网页等待指定时间后再获取源码
+#     custom_argument=[
+#         "--ignore-certificate-errors",
+#         "--disable-blink-features=AutomationControlled",
+#     ],  # 自定义浏览器渲染参数
+#     xhr_url_regexes=None,  # 拦截xhr接口,支持正则,数组类型
+#     auto_install_driver=True,  # 自动下载浏览器驱动 支持chrome 和 firefox
+#     download_path=None,  # 下载文件的路径
+#     use_stealth_js=False,  # 使用stealth.min.js隐藏浏览器特征
+# )
+#
+# PLAYWRIGHT = dict(
+#     user_agent=None,  # 字符串 或 无参函数,返回值为user_agent
+#     proxy=None,  # xxx.xxx.xxx.xxx:xxxx 或 无参函数,返回值为代理地址
+#     headless=False,  # 是否为无头浏览器
+#     driver_type="chromium",  # chromium、firefox、webkit
+#     timeout=30,  # 请求超时时间
+#     window_size=(1024, 800),  # 窗口大小
+#     executable_path=None,  # 浏览器路径,默认为默认路径
+#     download_path=None,  # 下载文件的路径
+#     render_time=0,  # 渲染时长,即打开网页等待指定时间后再获取源码
+#     wait_until="networkidle",  # 等待页面加载完成的事件,可选值:"commit", "domcontentloaded", "load", "networkidle"
+#     use_stealth_js=False,  # 使用stealth.min.js隐藏浏览器特征
+#     page_on_event_callback=None,  # page.on() 事件的回调 如 page_on_event_callback={"dialog": lambda dialog: dialog.accept()}
+#     storage_state_path=None,  # 保存浏览器状态的路径
+#     url_regexes=None,  # 拦截接口,支持正则,数组类型
+#     save_all=False,  # 是否保存所有拦截的接口, 配合url_regexes使用,为False时只保存最后一次拦截的接口
+# )
+#
+# # 爬虫启动时,重新抓取失败的requests
+# RETRY_FAILED_REQUESTS = False
+# # 保存失败的request
+# SAVE_FAILED_REQUEST = True
+# # request防丢机制。(指定的REQUEST_LOST_TIMEOUT时间内request还没做完,会重新下发 重做)
+# REQUEST_LOST_TIMEOUT = 600  # 10分钟
+# # request网络请求超时时间
+# REQUEST_TIMEOUT = 22  # 等待服务器响应的超时时间,浮点数,或(connect timeout, read timeout)元组
+# # item在内存队列中最大缓存数量
+# ITEM_MAX_CACHED_COUNT = 5000
+# # item每批入库的最大数量
+# ITEM_UPLOAD_BATCH_MAX_SIZE = 1000
+# # item入库时间间隔
+# ITEM_UPLOAD_INTERVAL = 1
+# # 内存任务队列最大缓存的任务数,默认不限制;仅对AirSpider有效。
+# TASK_MAX_CACHED_SIZE = 0
+#
+# # 下载缓存 利用redis缓存,但由于内存大小限制,所以建议仅供开发调试代码时使用,防止每次debug都需要网络请求
+# RESPONSE_CACHED_ENABLE = False  # 是否启用下载缓存 成本高的数据或容易变需求的数据,建议设置为True
+# RESPONSE_CACHED_EXPIRE_TIME = 3600  # 缓存时间 秒
+# RESPONSE_CACHED_USED = False  # 是否使用缓存 补采数据时可设置为True
+#
+# # 设置代理
+# PROXY_EXTRACT_API = None  # 代理提取API ,返回的代理分割符为\r\n
+# PROXY_ENABLE = True
+#
+# # 随机headers
+# RANDOM_HEADERS = True
+# # UserAgent类型 支持 'chrome', 'opera', 'firefox', 'internetexplorer', 'safari','mobile' 若不指定则随机类型
+# USER_AGENT_TYPE = "chrome"
+# # 默认使用的浏览器头
+# DEFAULT_USERAGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"
+# # requests 使用session
+# USE_SESSION = False
+#
+# # 去重
+# ITEM_FILTER_ENABLE = False  # item 去重
+# REQUEST_FILTER_ENABLE = False  # request 去重
+# ITEM_FILTER_SETTING = dict(
+#     filter_type=1  # 永久去重(BloomFilter) = 1 、内存去重(MemoryFilter) = 2、 临时去重(ExpireFilter)= 3、轻量去重(LiteFilter)= 4
+# )
+# REQUEST_FILTER_SETTING = dict(
+#     filter_type=3,  # 永久去重(BloomFilter) = 1 、内存去重(MemoryFilter) = 2、 临时去重(ExpireFilter)= 3、 轻量去重(LiteFilter)= 4
+#     expire_time=2592000,  # 过期时间1个月
+# )
+#
+# # 报警 支持钉钉、飞书、企业微信、邮件
+# # 钉钉报警
+# DINGDING_WARNING_URL = ""  # 钉钉机器人api
+# DINGDING_WARNING_PHONE = ""  # 报警人 支持列表,可指定多个
+# DINGDING_WARNING_ALL = False  # 是否提示所有人, 默认为False
+# # 飞书报警
+# # https://open.feishu.cn/document/ukTMukTMukTM/ucTM5YjL3ETO24yNxkjN#e1cdee9f
+# FEISHU_WARNING_URL = ""  # 飞书机器人api
+# FEISHU_WARNING_USER = None  # 报警人 {"open_id":"ou_xxxxx", "name":"xxxx"} 或 [{"open_id":"ou_xxxxx", "name":"xxxx"}]
+# FEISHU_WARNING_ALL = False  # 是否提示所有人, 默认为False
+# # 邮件报警
+# EMAIL_SENDER = ""  # 发件人
+# EMAIL_PASSWORD = ""  # 授权码
+# EMAIL_RECEIVER = ""  # 收件人 支持列表,可指定多个
+# EMAIL_SMTPSERVER = "smtp.163.com"  # 邮件服务器 默认为163邮箱
+# # 企业微信报警
+# WECHAT_WARNING_URL = ""  # 企业微信机器人api
+# WECHAT_WARNING_PHONE = ""  # 报警人 将会在群内@此人, 支持列表,可指定多人
+# WECHAT_WARNING_ALL = False  # 是否提示所有人, 默认为False
+# # 时间间隔
+# WARNING_INTERVAL = 3600  # 相同报警的报警时间间隔,防止刷屏; 0表示不去重
+# WARNING_LEVEL = "DEBUG"  # 报警级别, DEBUG / INFO / ERROR
+# WARNING_FAILED_COUNT = 1000  # 任务失败数 超过WARNING_FAILED_COUNT则报警
+#
+# LOG_NAME = os.path.basename(os.getcwd())
+# LOG_PATH = "log/%s.log" % LOG_NAME  # log存储路径
+# LOG_LEVEL = "DEBUG"
+# LOG_COLOR = True  # 是否带有颜色
+# LOG_IS_WRITE_TO_CONSOLE = True  # 是否打印到控制台
+# LOG_IS_WRITE_TO_FILE = False  # 是否写文件
+# LOG_MODE = "w"  # 写文件的模式
+# LOG_MAX_BYTES = 10 * 1024 * 1024  # 每个日志文件的最大字节数
+# LOG_BACKUP_COUNT = 20  # 日志文件保留数量
+# LOG_ENCODING = "utf8"  # 日志文件编码
+# OTHERS_LOG_LEVAL = "ERROR"  # 第三方库的log等级
+#
+# # 切换工作路径为当前项目路径
+# project_path = os.path.abspath(os.path.dirname(__file__))
+# os.chdir(project_path)  # 切换工作路径
+# sys.path.insert(0, project_path)
+# print("当前工作路径为 " + os.getcwd())
-
-
-# MYSQL
-MYSQL_IP = ""
-MYSQL_PORT = 3306
-MYSQL_DB = ""
-MYSQL_USER_NAME = ""
-MYSQL_USER_PASS = ""
-
-# REDIS
-# IP:PORT
-REDISDB_IP_PORTS = "xxx:6379"
-REDISDB_USER_PASS = ""
-# 默认 0 到 15 共16个数据库
-REDISDB_DB = 0
-
-# 数据入库的pipeline,可自定义,默认MysqlPipeline
-ITEM_PIPELINES = ["feapder.pipelines.mysql_pipeline.MysqlPipeline"]
-
-# 爬虫相关
-# COLLECTOR
-COLLECTOR_SLEEP_TIME = 1  # 从任务队列中获取任务到内存队列的间隔
-COLLECTOR_TASK_COUNT = 100  # 每次获取任务数量
-
-# SPIDER
-SPIDER_THREAD_COUNT = 10  # 爬虫并发数
-SPIDER_SLEEP_TIME = 0  # 下载时间间隔 单位秒。 支持随机 如 SPIDER_SLEEP_TIME = [2, 5] 则间隔为 2~5秒之间的随机数,包含2和5
-SPIDER_MAX_RETRY_TIMES = 100  # 每个请求最大重试次数
-
-# 浏览器渲染下载
-WEBDRIVER = dict(
-    pool_size=2,  # 浏览器的数量
-    load_images=False,  # 是否加载图片
-    user_agent=None,  # 字符串 或 无参函数,返回值为user_agent
-    proxy=None,  # xxx.xxx.xxx.xxx:xxxx 或 无参函数,返回值为代理地址
-    headless=False,  # 是否为无头浏览器
-    driver_type="CHROME",  # CHROME 或 PHANTOMJS,
-    timeout=30,  # 请求超时时间
-    window_size=(1024, 800),  # 窗口大小
-    executable_path=None,  # 浏览器路径,默认为默认路径
-    render_time=0,  # 渲染时长,即打开网页等待指定时间后再获取源码
-)
-
-# 重新尝试失败的requests 当requests重试次数超过允许的最大重试次数算失败
-RETRY_FAILED_REQUESTS = False
-# request 超时时间,超过这个时间重新做(不是网络请求的超时时间)单位秒
-REQUEST_LOST_TIMEOUT = 600  # 10分钟
-# 保存失败的request
-SAVE_FAILED_REQUEST = True
-
-# 下载缓存 利用redis缓存,由于内存小,所以仅供测试时使用
-RESPONSE_CACHED_ENABLE = False  # 是否启用下载缓存 成本高的数据或容易变需求的数据,建议设置为True
-RESPONSE_CACHED_EXPIRE_TIME = 3600  # 缓存时间 秒
-RESPONSE_CACHED_USED = False  # 是否使用缓存 补采数据时可设置为True
-
-WARNING_FAILED_COUNT = 1000  # 任务失败数 超过WARNING_FAILED_COUNT则报警
-
-# 爬虫是否常驻
-KEEP_ALIVE = False
-
-# 设置代理
-PROXY_EXTRACT_API = None  # 代理提取API ,返回的代理分割符为\r\n
-PROXY_ENABLE = True
-
-# 随机headers
-RANDOM_HEADERS = True
-# requests 使用session
-USE_SESSION = False
-
-# 去重
-ITEM_FILTER_ENABLE = False  # item 去重
-REQUEST_FILTER_ENABLE = False  # request 去重
-
-# 报警 支持钉钉及邮件,二选一即可
-# 钉钉报警
-DINGDING_WARNING_URL = ""  # 钉钉机器人api
-DINGDING_WARNING_PHONE = ""  # 报警人 支持列表,可指定多个
-# 邮件报警
-EMAIL_SENDER = ""  # 发件人
-EMAIL_PASSWORD = ""  # 授权码
-EMAIL_RECEIVER = ""  # 收件人 支持列表,可指定多个
-# 时间间隔
-WARNING_INTERVAL = 3600  # 相同报警的报警时间间隔,防止刷屏; 0表示不去重
-WARNING_LEVEL = "DEBUG"  # 报警级别, DEBUG / ERROR
-
-LOG_NAME = os.path.basename(os.getcwd())
-LOG_PATH = "log/%s.log" % LOG_NAME  # log存储路径
-LOG_LEVEL = "DEBUG"
-LOG_COLOR = True  # 是否带有颜色
-LOG_IS_WRITE_TO_CONSOLE = True  # 是否打印到控制台
-LOG_IS_WRITE_TO_FILE = False  # 是否写文件
-LOG_MODE = "w"  # 写文件的模式
-LOG_MAX_BYTES = 10 * 1024 * 1024  # 每个日志文件的最大字节数
-LOG_BACKUP_COUNT = 20  # 日志文件保留数量
-LOG_ENCODING = "utf8"  # 日志文件编码
-OTHERS_LOG_LEVAL = "ERROR"  # 第三方库的log等级
 ```
 
 - 数据库连接信息默认读取的环境变量,因此若不想将自己的账号暴露给其他同事,建议写在环境变量里,环境变量的`key`与配置文件的`key`相同
diff --git a/docs/usage/AirSpider.md b/docs/usage/AirSpider.md
index f645fe67..08c14185 100644
--- a/docs/usage/AirSpider.md
+++ b/docs/usage/AirSpider.md
@@ -8,7 +8,15 @@ AirSpider是一款轻量爬虫,学习成本低。面对一些数据量较少
 
 示例
 
-    feapder create -s air_spider_test
+```shell
+feapder create -s air_spider_test
+
+请选择爬虫模板
+> AirSpider
+  Spider
+  TaskSpider
+  BatchSpider
+```
 
 生成如下
diff --git a/docs/usage/BatchSpider.md b/docs/usage/BatchSpider.md
index 0dbdcd78..d85bbce9 100644
--- a/docs/usage/BatchSpider.md
+++ b/docs/usage/BatchSpider.md
@@ -12,7 +12,15 @@ BatchSpider是一款分布式批次爬虫,对于需要周期性采集的数据
 
 示例:
 
-    feapder create -s batch_spider_test 3
+```shell
+feapder create -s batch_spider_test
+
+请选择爬虫模板
+  AirSpider
+  Spider
+  TaskSpider
+> BatchSpider
+```
 
 生成如下
diff --git a/docs/usage/Spider.md b/docs/usage/Spider.md
index cb56f950..47736c21 100644
--- a/docs/usage/Spider.md
+++ b/docs/usage/Spider.md
@@ -25,7 +25,15 @@ Spider是一款基于redis的分布式爬虫,适用于海量数据采集,支
 
 示例:
 
-    feapder create -s spider_test 2
+```shell
+feapder create -s spider_test
+
+请选择爬虫模板
+  AirSpider
+> Spider
+  TaskSpider
+  BatchSpider
+```
 
 生成如下
diff --git a/docs/usage/TaskSpider.md b/docs/usage/TaskSpider.md
index 326149ad..719f6481 100644
--- a/docs/usage/TaskSpider.md
+++ b/docs/usage/TaskSpider.md
@@ -8,7 +8,19 @@ TaskSpider是一款分布式爬虫,内部封装了取种子任务的逻辑,
 
 ## 2. 创建爬虫
 
-命令行 TODO
+命令参考:[命令行工具](command/cmdline.md?id=_2-创建爬虫)
+
+示例:
+
+```shell
+feapder create -s task_spider_test
+
+请选择爬虫模板
+  AirSpider
+  Spider
+> TaskSpider
+  BatchSpider
+```
 
 示例代码:
@@ -17,7 +29,7 @@
 import feapder
 from feapder import ArgumentParser
 
 
-class TestTaskSpider(feapder.TaskSpider):
+class TaskSpiderTest(feapder.TaskSpider):
     # 自定义数据库,若项目中有setting.py文件,此自定义可删除
     __custom_setting__ = dict(
         REDISDB_IP_PORTS="localhost:6379",
@@ -52,7 +64,7 @@ def start(args):
     """
     用mysql做种子表
     """
-    spider = TestTaskSpider(
+    spider = TaskSpiderTest(
         task_table="spider_task",  # 任务表名
         task_keys=["id", "url"],  # 表里查询的字段
         redis_key="test:task_spider",  # redis里做任务队列的key
@@ -69,7 +81,7 @@ def start2(args):
     """
     用redis做种子表
     """
-    spider = TestTaskSpider(
+    spider = TaskSpiderTest(
         task_table="spider_task2",  # 任务表名
         task_table_type="redis",  # 任务表类型为redis
         redis_key="test:task_spider",  # redis里做任务队列的key
@@ -90,8 +102,8 @@ if __name__ == "__main__":
 
     parser.start()
 
-    # 下发任务 python3 test_task_spider.py --start 1
-    # 采集 python3 test_task_spider.py --start 2
+    # 下发任务 python3 task_spider_test.py --start 1
+    # 采集 python3 task_spider_test.py --start 2
 ```
 
 ## 3. 代码讲解
diff --git a/feapder/setting.py b/feapder/setting.py
index 90ef1ab4..30bc33e7 100644
--- a/feapder/setting.py
+++ b/feapder/setting.py
@@ -40,6 +40,7 @@
 ITEM_PIPELINES = [
     "feapder.pipelines.mysql_pipeline.MysqlPipeline",
     # "feapder.pipelines.mongo_pipeline.MongoPipeline",
+    # "feapder.pipelines.console_pipeline.ConsolePipeline",
 ]
 EXPORT_DATA_MAX_FAILED_TIMES = 10  # 导出数据时最大的失败次数,包括保存和更新,超过这个次数报警
 EXPORT_DATA_MAX_RETRY_TIMES = 10  # 导出数据时最大的重试次数,包括保存和更新,超过这个次数则放弃重试
diff --git a/feapder/templates/project_template/setting.py b/feapder/templates/project_template/setting.py
index 3956fa39..45e7a706 100644
--- a/feapder/templates/project_template/setting.py
+++ b/feapder/templates/project_template/setting.py
@@ -29,6 +29,7 @@
 # ITEM_PIPELINES = [
 #     "feapder.pipelines.mysql_pipeline.MysqlPipeline",
 #     # "feapder.pipelines.mongo_pipeline.MongoPipeline",
+#     # "feapder.pipelines.console_pipeline.ConsolePipeline",
 # ]
 # EXPORT_DATA_MAX_FAILED_TIMES = 10  # 导出数据时最大的失败次数,包括保存和更新,超过这个次数报警
 # EXPORT_DATA_MAX_RETRY_TIMES = 10  # 导出数据时最大的重试次数,包括保存和更新,超过这个次数则放弃重试
diff --git a/tests/test_playwright.py b/tests/test_playwright.py
index 376f0b3d..91668c9e 100644
--- a/tests/test_playwright.py
+++ b/tests/test_playwright.py
@@ -8,239 +8,35 @@
 @email: boris_liu@foxmail.com
 """
 
-from playwright.sync_api import Response
-
-import feapder
+import time
 
+from playwright.sync_api import Page
 
-def on_response(response: Response):
-    print(response.url)
+import feapder
+from feapder.utils.webdriver import PlaywrightDriver
 
 
 class TestPlaywright(feapder.AirSpider):
     __custom_setting__ = dict(
         RENDER_DOWNLOADER="feapder.network.downloader.PlaywrightDownloader",
-        PLAYWRIGHT=dict(
-            page_on_event_callback=dict(response=on_response),  # 监听response事件
-            # storage_state_path="playwright_state.json",  # 保存登录状态
-        ),
     )
 
     def start_requests(self):
         yield feapder.Request("https://www.baidu.com", render=True)
 
-    def download_midware(self, request):
-        request.cookies = {"hhhhh": "66666"}
-        # request.cookies = [
-        #     {
-        #         "domain": ".baidu.com",
-        #         "expirationDate": 1663923578.800305,
-        #         "hostOnly": False,
-        #         "httpOnly": True,
-        #         "name": "ab_sr",
-        #         "path": "/",
-        #         "secure": True,
-        #         "session": False,
-        #         "storeId": "0",
-        #         "value": "1.0.1_MTIyODdmYzQzYTg2NzY0MGYwYWUwOTA5ODJkNTFlZDUxOTg1MzkyNzViYTc3NmFiZTk3MmU2ZTI0MDdkZTM4YzdlODQ5N2Q2ZDQzMGI0N2Y1NGE2Y2E3NjBlZWU4ZTA2MzQ3MGU5M2ZlM2M5MTBmNDVlMzU2NDBiMzZlOWNjN2IwZWZkZGRmOGIwOTUxMGYzMjQ4NDQyZGJjYTViOWI3Mg==",
-        #         "id": 1,
-        #     },
-        #     {
-        #         "domain": ".baidu.com",
-        #         "expirationDate": 1664009672,
-        #         "hostOnly": False,
-        #         "httpOnly": False,
-        #         "name": "BA_HECTOR",
-        #         "path": "/",
-        #         "secure": False,
-        #         "session": False,
-        #         "storeId": "0",
-        #         "value": "ak2g8k0h8g8l8h25ah0kljp71hiqt2819",
-        #         "id": 2,
-        #     },
-        #     {
-        #         "domain": ".baidu.com",
-        #         "expirationDate": 1682511471.350234,
-        #         "hostOnly": False,
-        #         "httpOnly": False,
-        #         "name": "BAIDUID",
-        #         "path": "/",
-        #         "secure": False,
-        #         "session": False,
-        #         "storeId": "0",
-        #         "value": "1922A166433AFD91AACA9A2591DDA842:FG=1",
-        #         "id": 3,
-        #     },
-        #     {
-        #         "domain": ".baidu.com",
-        #         "expirationDate": 1695459279.623494,
-        #         "hostOnly": False,
-        #         "httpOnly": False,
-        #         "name": "BAIDUID_BFESS",
-        #         "path": "/",
-        #         "secure": True,
-        #         "session": False,
-        #         "storeId": "0",
-        #         "value": "1922A166433AFD91AACA9A2591DDA842:FG=1",
-        #         "id": 4,
-        #     },
-        #     {
-        #         "domain": ".baidu.com",
-        #         "expirationDate": 2661324632,
-        #         "hostOnly": False,
-        #         "httpOnly": False,
-        #         "name": "BIDUPSID",
-        #         "path": "/",
-        #         "secure": False,
-        #         "session": False,
-        #         "storeId": "0",
-        #         "value": "451C45AEDA6E3B41F0F5F906A4D61A12",
-        #         "id": 5,
-        #     },
-        #     {
-        #         "domain": ".baidu.com",
-        #         "hostOnly": False,
-        #         "httpOnly": False,
-        #         "name": "delPer",
-        #         "path": "/",
-        #         "secure": False,
-        #         "session": True,
-        #         "storeId": "0",
-        #         "value": "0",
-        #         "id": 6,
-        #     },
-        #     {
-        #         "domain": ".baidu.com",
-        #         "hostOnly": False,
-        #         "httpOnly": False,
-        #         "name": "H_PS_PSSID",
-        #         "path": "/",
-        #         "secure": False,
-        #         "session": True,
-        #         "storeId": "0",
-        #         "value": "36543_36460_37357_36885_37273_36569_36786_37259_26350_37384_37351",
-        #         "id": 7,
-        #     },
-        #     {
-        #         "domain": ".baidu.com",
-        #         "expirationDate": 1689768463.32528,
-        #         "hostOnly": False,
-        #         "httpOnly": False,
-        #         "name": "H_WISE_SIDS",
-        #         "path": "/",
-        #         "secure": False,
-        #         "session": False,
-        #         "storeId": "0",
-        #         "value": "107320_110085_179346_180636_194519_196428_197471_197711_199569_204901_206125_208721_209204_209568_210304_210323_210969_212296_212739_213042_213355_214115_214130_214137_214143_214793_215730_216207_216448_216518_216616_216741_216848_216883_217090_217168_217185_217439_217915_218327_218359_218445_218454_218481_218538_218548_218598_218637_218800_218833_219254_219363_219414_219448_219449_219509_219548_219625_219666_219712_219732_219733_219738_219742_219815_219819_219839_219854_219864_219943_219946_219947_220071_220190_220301_220662_220775_220800_220853_220998_221007_221086_221107_221116_221119_221121_221278_221371_221381_221457_221502",
-        #         "id": 8,
-        #     },
-        #     {
-        #         "domain": ".baidu.com",
-        #         "expirationDate": 1695353323.712556,
-        #         "hostOnly": False,
-        #         "httpOnly": False,
-        #         "name": "MCITY",
-        #         "path": "/",
-        #         "secure": False,
-        #         "session": False,
-        #         "storeId": "0",
-        #         "value": "-%3A",
-        #         "id": 9,
-        #     },
-        #     {
-        #         "domain": ".baidu.com",
-        #         "hostOnly": False,
-        #         "httpOnly": False,
-        #         "name": "PSINO",
-        #         "path": "/",
-        #         "secure": False,
-        #         "session": True,
-        #         "storeId": "0",
-        #         "value": "5",
-        #         "id": 10,
-        #     },
-        #     {
-        #         "domain": ".baidu.com",
-        #         "expirationDate": 3799549293.733737,
-        #         "hostOnly": False,
-        #         "httpOnly": False,
-        #         "name": "PSTM",
-        #         "path": "/",
-        #         "secure": False,
-        #         "session": False,
-        #         "storeId": "0",
-        #         "value": "1652065648",
-        #         "id": 11,
-        #     },
-        #     {
-        #         "domain": ".baidu.com",
-        #         "expirationDate": 1695367975.75261,
-        #         "hostOnly": False,
-        #         "httpOnly": False,
-        #         "name": "ZFY",
-        #         "path": "/",
-        #         "secure": True,
-        #         "session": False,
-        #         "storeId": "0",
-        #         "value": "X58MLRUa4SBUYQuGvOlCmzOuPsS0tcc0HBo6K5QWhBs:C",
-        #         "id": 12,
-        #     },
-        #     {
-        #         "domain": ".www.baidu.com",
-        #         "expirationDate": 1695367986,
-        #         "hostOnly": False,
-        #         "httpOnly": False,
-        #         "name": "baikeVisitId",
-        #         "path": "/",
-        #         "secure": False,
-        #         "session": False,
-        #         "storeId": "0",
-        #         "value": "dbd65753-d077-4a08-9464-ab1bedaf4793",
-        #         "id": 13,
-        #     },
-        #     {
-        #         "domain": "www.baidu.com",
-        #         "hostOnly": True,
-        #         "httpOnly": False,
-        #         "name": "BD_CK_SAM",
-        #         "path": "/",
-        #         "secure": False,
-        #         "session": True,
-        #         "storeId": "0",
-        #         "value": "1",
-        #         "id": 14,
-        #     },
-        #     {
-        #         "domain": "www.baidu.com",
-        #         "hostOnly": True,
-        #         "httpOnly": False,
-        #         "name": "BD_HOME",
-        #         "path": "/",
-        #         "secure": False,
-        #         "session": True,
-        #         "storeId": "0",
-        #         "value": "1",
-        #         "id": 15,
-        #     },
-        #     {
-        #         "domain": "www.baidu.com",
-        #         "expirationDate": 1664787279,
-        #         "hostOnly": True,
-        #         "httpOnly": False,
-        #         "name": "BD_UPN",
-        #         "path": "/",
-        #         "secure": False,
-        #         "session": False,
-        #         "storeId": "0",
-        #         "value": "123253",
-        #         "id": 16,
-        #     },
-        # ]
-        return request
-
     def parse(self, reqeust, response):
-        print(response.text)
-        response.browser.save_storage_stage()
+        driver: PlaywrightDriver = response.driver
+        page: Page = driver.page
+
+        page.type("#kw", "feapder")
+        page.click("#su")
+        page.wait_for_load_state("networkidle")
+        time.sleep(1)
+
+        html = page.content()
+        response.text = html  # 使response加载最新的页面
+        for data_container in response.xpath("//div[@class='c-container']"):
+            print(data_container.xpath("string(.//h3)").extract_first())
 
 
 if __name__ == "__main__":