mirror of
https://github.com/NanmiCoder/MediaCrawler.git
synced 2025-11-25 03:15:17 +08:00
feat: support connecting Playwright to a local Chrome browser over the CDP protocol
docs: add documentation for managing Python dependencies with uv
This commit is contained in:
67
README.md
@@ -21,7 +21,7 @@
|
|||||||
目前能抓取小红书、抖音、快手、B站、微博、贴吧、知乎等平台的公开信息。
|
目前能抓取小红书、抖音、快手、B站、微博、贴吧、知乎等平台的公开信息。
|
||||||
|
|
||||||
原理:利用[playwright](https://playwright.dev/)搭桥,保留登录成功后的上下文浏览器环境,通过执行JS表达式获取一些加密参数
|
原理:利用[playwright](https://playwright.dev/)搭桥,保留登录成功后的上下文浏览器环境,通过执行JS表达式获取一些加密参数
|
||||||
通过使用此方式,免去了复现核心加密JS代码,逆向难度大大降低
|
通过使用此方式,免去了复现核心加密JS代码,逆向难度大大降低。
|
||||||
|
|
||||||
# 功能列表
|
# 功能列表
|
||||||
| 平台 | 关键词搜索 | 指定帖子ID爬取 | 二级评论 | 指定创作者主页 | 登录态缓存 | IP代理池 | 生成评论词云图 |
|
| 平台 | 关键词搜索 | 指定帖子ID爬取 | 二级评论 | 指定创作者主页 | 登录态缓存 | IP代理池 | 生成评论词云图 |
|
||||||
@@ -52,36 +52,38 @@
|
|||||||
# 安装部署方法
|
# 安装部署方法
|
||||||
> 开源不易,希望大家可以Star一下MediaCrawler仓库!!!!十分感谢!!! <br>
|
> 开源不易,希望大家可以Star一下MediaCrawler仓库!!!!十分感谢!!! <br>
|
||||||
|
|
||||||
## 创建并激活 python 虚拟环境
|
## 前置依赖
|
||||||
> 如果是爬取抖音和知乎,需要提前安装nodejs环境,版本大于等于:`16`即可 <br>
|
|
||||||
> 新增 [uv](https://github.com/astral-sh/uv) 来管理项目依赖,使用uv来替代python版本管理、pip进行依赖安装,更加方便快捷
|
|
||||||
```shell
|
|
||||||
# 进入项目根目录
|
|
||||||
cd MediaCrawler
|
|
||||||
|
|
||||||
# 创建虚拟环境
|
|
||||||
# 我的python版本是:3.9.6,requirements.txt中的库是基于这个版本的,如果是其他python版本,可能requirements.txt中的库不兼容,自行解决一下。
|
|
||||||
python -m venv venv
|
|
||||||
|
|
||||||
# macos & linux 激活虚拟环境
|
|
||||||
source venv/bin/activate
|
|
||||||
|
|
||||||
# windows 激活虚拟环境
|
### uv 安装
|
||||||
venv\Scripts\activate
|
> Before continuing, make sure uv is installed on your machine: [uv installation guide](https://docs.astral.sh/uv/getting-started/installation)
|
||||||
|
>
|
||||||
|
> To verify that uv is installed, run `uv --version` in a terminal; if it prints a version number, the installation succeeded
|
||||||
|
>
|
||||||
|
> uv is highly recommended: it is arguably the strongest Python package management tool
|
||||||
|
>
|
||||||
|
|
||||||
```
|
### nodejs安装
|
||||||
|
The project depends on Node.js; download it from https://nodejs.org/en/download/
|
||||||
|
> To manage the environment with Python's native venv instead, see the [native environment management guide](docs/原生环境管理文档.md)
|
||||||
|
|
||||||
## 安装依赖库
|
### python包安装
|
||||||
|
|
||||||
```shell
|
```shell
|
||||||
pip install -r requirements.txt
|
# 进入项目目录
|
||||||
```
|
cd MediaCrawler
|
||||||
|
|
||||||
## 安装 playwright浏览器驱动
|
# Use uv sync to keep the Python version and dependency packages consistent
|
||||||
|
uv sync
|
||||||
|
```
|
||||||
|
|
||||||
```shell
|
### 浏览器驱动安装
|
||||||
playwright install
|
```shell
|
||||||
```
|
# 安装浏览器驱动
|
||||||
|
playwright install
|
||||||
|
```
|
||||||
|
> MediaCrawler can now use Playwright to connect to your local Chrome browser, which resolves a number of problems caused by WebDriver detection.
|
||||||
|
>
|
||||||
|
> For now, only the xhs and dy crawlers support connecting to a local browser via CDP; see the options in config/base_config.py if you need it.
|
||||||
|
|
||||||
## 运行爬虫程序
|
## 运行爬虫程序
|
||||||
|
|
||||||
@@ -90,16 +92,16 @@
|
|||||||
### 一些其他支持项,也可以在config/base_config.py查看功能,写的有中文注释
|
### 一些其他支持项,也可以在config/base_config.py查看功能,写的有中文注释
|
||||||
|
|
||||||
# 从配置文件中读取关键词搜索相关的帖子并爬取帖子信息与评论
|
# 从配置文件中读取关键词搜索相关的帖子并爬取帖子信息与评论
|
||||||
python main.py --platform xhs --lt qrcode --type search
|
uv run main.py --platform xhs --lt qrcode --type search
|
||||||
|
|
||||||
# 从配置文件中读取指定的帖子ID列表获取指定帖子的信息与评论信息
|
# 从配置文件中读取指定的帖子ID列表获取指定帖子的信息与评论信息
|
||||||
python main.py --platform xhs --lt qrcode --type detail
|
uv run main.py --platform xhs --lt qrcode --type detail
|
||||||
|
|
||||||
# 打开对应APP扫二维码登录
|
# 打开对应APP扫二维码登录
|
||||||
|
|
||||||
# 其他平台爬虫使用示例,执行下面的命令查看
|
# 其他平台爬虫使用示例,执行下面的命令查看
|
||||||
python main.py --help
|
uv run main.py --help
|
||||||
```
|
```
|
||||||
|
|
||||||
## 数据保存
|
## 数据保存
|
||||||
- 支持关系型数据库Mysql中保存(需要提前创建数据库)
|
- 支持关系型数据库Mysql中保存(需要提前创建数据库)
|
||||||
@@ -107,7 +109,9 @@
|
|||||||
- 支持保存到csv中(data/目录下)
|
- 支持保存到csv中(data/目录下)
|
||||||
- 支持保存到json中(data/目录下)
|
- 支持保存到json中(data/目录下)
|
||||||
|
|
||||||
|
# 项目微信交流群
|
||||||
|
[加入微信交流群](https://nanmicoder.github.io/MediaCrawler/%E5%BE%AE%E4%BF%A1%E4%BA%A4%E6%B5%81%E7%BE%A4.html)
|
||||||
|
|
||||||
|
|
||||||
# 其他常见问题可以查看在线文档
|
# 其他常见问题可以查看在线文档
|
||||||
>
|
>
|
||||||
@@ -120,10 +124,7 @@
|
|||||||
|
|
||||||
[作者的知识付费栏目介绍](https://nanmicoder.github.io/MediaCrawler/%E7%9F%A5%E8%AF%86%E4%BB%98%E8%B4%B9%E4%BB%8B%E7%BB%8D.html)
|
[作者的知识付费栏目介绍](https://nanmicoder.github.io/MediaCrawler/%E7%9F%A5%E8%AF%86%E4%BB%98%E8%B4%B9%E4%BB%8B%E7%BB%8D.html)
|
||||||
|
|
||||||
# 项目微信交流群
|
|
||||||
|
|
||||||
[加入微信交流群](https://nanmicoder.github.io/MediaCrawler/%E5%BE%AE%E4%BF%A1%E4%BA%A4%E6%B5%81%E7%BE%A4.html)
|
|
||||||
|
|
||||||
# 感谢下列Sponsors对本仓库赞助支持
|
# 感谢下列Sponsors对本仓库赞助支持
|
||||||
<a href="https://www.swiftproxy.net/?ref=nanmi">
|
<a href="https://www.swiftproxy.net/?ref=nanmi">
|
||||||
<img src="docs/static/images/img_5.png">
|
<img src="docs/static/images/img_5.png">
|
||||||
|
|||||||
@@ -12,7 +12,7 @@
|
|||||||
from abc import ABC, abstractmethod
|
from abc import ABC, abstractmethod
|
||||||
from typing import Dict, Optional
|
from typing import Dict, Optional
|
||||||
|
|
||||||
from playwright.async_api import BrowserContext, BrowserType
|
from playwright.async_api import BrowserContext, BrowserType, Playwright
|
||||||
|
|
||||||
|
|
||||||
class AbstractCrawler(ABC):
|
class AbstractCrawler(ABC):
|
||||||
@@ -43,6 +43,19 @@ class AbstractCrawler(ABC):
|
|||||||
"""
|
"""
|
||||||
pass
|
pass
|
||||||
|
|
||||||
|
    async def launch_browser_with_cdp(self, playwright: Playwright, playwright_proxy: Optional[Dict],
                                      user_agent: Optional[str], headless: bool = True) -> BrowserContext:
        """
        Launch the browser in CDP mode (optional to implement)
        :param playwright: Playwright instance
        :param playwright_proxy: Playwright proxy configuration
        :param user_agent: user agent string
        :param headless: whether to run headless
        :return: browser context
        """
        # Default implementation: fall back to standard mode
        return await self.launch_browser(playwright.chromium, playwright_proxy, user_agent, headless)
|
||||||
|
|
||||||
|
|
||||||
class AbstractLogin(ABC):
|
class AbstractLogin(ABC):
|
||||||
@abstractmethod
|
@abstractmethod
|
||||||
|
|||||||
@@ -45,6 +45,33 @@ HEADLESS = False
|
|||||||
# 是否保存登录状态
|
# 是否保存登录状态
|
||||||
SAVE_LOGIN_STATE = True
|
SAVE_LOGIN_STATE = True
|
||||||
|
|
||||||
|
# ==================== CDP (Chrome DevTools Protocol) configuration ====================
# Whether to enable CDP mode: crawl with the user's existing Chrome/Edge browser for better anti-detection
# When enabled, the user's Chrome/Edge browser is detected and launched automatically, then controlled over CDP
# This uses a real browser environment, including the user's extensions, cookies, and settings, which greatly lowers the risk of detection
ENABLE_CDP_MODE = False

# CDP debugging port used to communicate with the browser
# If the port is in use, the next available port is tried automatically
CDP_DEBUG_PORT = 9222

# Custom browser path (optional)
# If empty, the Chrome/Edge install path is detected automatically
# Windows example: "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe"
# macOS example: "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
CUSTOM_BROWSER_PATH = ""

# Whether to run headless in CDP mode
# Note: even when set to True, some anti-detection features may work poorly in headless mode
CDP_HEADLESS = False

# Browser launch timeout (seconds)
BROWSER_LAUNCH_TIMEOUT = 30

# Whether to close the browser automatically when the program exits
# Set to False to keep the browser running for debugging
AUTO_CLOSE_BROWSER = True
|
||||||
|
|
||||||
# 数据保存类型选项配置,支持三种类型:csv、db、json, 最好保存到DB,有排重的功能。
|
# 数据保存类型选项配置,支持三种类型:csv、db、json, 最好保存到DB,有排重的功能。
|
||||||
SAVE_DATA_OPTION = "json" # csv or db or json
|
SAVE_DATA_OPTION = "json" # csv or db or json
|
||||||
|
|
||||||
|
|||||||
246
docs/CDP模式使用指南.md
Normal file
@@ -0,0 +1,246 @@
|
|||||||
|
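# CDP Mode Usage Guide

## Overview

CDP (Chrome DevTools Protocol) mode is an advanced anti-detection crawling technique that drives the user's existing Chrome/Edge browser to fetch pages. Compared with standard Playwright automation, CDP mode offers the following advantages:

### 🎯 Key Advantages

1. **Real browser environment**: uses the browser actually installed on the user's machine, including all extensions, plugins, and personal settings
2. **Better anti-detection**: the browser fingerprint is more realistic, so sites find it harder to identify the session as automated
3. **Preserved user state**: automatically inherits the user's login state, cookies, and browsing history
4. **Extension support**: can make use of ad blockers, proxy extensions, and other tools the user has installed
5. **More natural behavior**: the browser's behavior pattern is closer to that of a real user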
|
||||||
|
|
||||||
|
## Quick Start

### 1. Enable CDP mode

Set the following in `config/base_config.py`:

```python
# Enable CDP mode
ENABLE_CDP_MODE = True

# CDP debugging port (optional, defaults to 9222)
CDP_DEBUG_PORT = 9222

# Whether to run headless (False is recommended for the best anti-detection results)
CDP_HEADLESS = False

# Whether to close the browser automatically when the program exits
AUTO_CLOSE_BROWSER = True
```

### 2. Run a test

```bash
# Run the CDP feature test
python examples/cdp_example.py

# Run the Xiaohongshu crawler (CDP mode)
python main.py
```
|
||||||
|
|
||||||
|
## Configuration Options

### Basic options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `ENABLE_CDP_MODE` | bool | False | Whether to enable CDP mode |
| `CDP_DEBUG_PORT` | int | 9222 | CDP debugging port |
| `CDP_HEADLESS` | bool | False | Whether to run headless in CDP mode |
| `AUTO_CLOSE_BROWSER` | bool | True | Whether to close the browser when the program exits |

### Advanced options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `CUSTOM_BROWSER_PATH` | str | "" | Custom browser path |
| `BROWSER_LAUNCH_TIMEOUT` | int | 30 | Browser launch timeout (seconds) |
|
||||||
|
|
||||||
|
### Custom browser path

If automatic detection fails, you can specify the browser path manually:

```python
# Windows example
CUSTOM_BROWSER_PATH = r"C:\Program Files\Google\Chrome\Application\chrome.exe"

# macOS example
CUSTOM_BROWSER_PATH = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"

# Linux example
CUSTOM_BROWSER_PATH = "/usr/bin/google-chrome"
```
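
A minimal sketch of how a configured path could be validated before falling back to auto-detected candidates. `resolve_browser_path` and `is_executable_file` are hypothetical names used for illustration and are not part of the project; the check itself mirrors the `os.path.isfile` / `os.access(path, os.X_OK)` test that the launcher's detection code in this commit applies.

```python
import os
import sys
from typing import Iterable, Optional


def is_executable_file(path: str) -> bool:
    """Return True if path is an existing file the current user may execute."""
    return bool(path) and os.path.isfile(path) and os.access(path, os.X_OK)


def resolve_browser_path(custom_path: str, candidates: Iterable[str]) -> Optional[str]:
    """Prefer the configured path; otherwise return the first usable candidate."""
    if is_executable_file(custom_path):
        return custom_path
    for candidate in candidates:
        if is_executable_file(candidate):
            return candidate
    return None


# Stand-in demo: the running Python interpreter plays the role of a browser binary.
print(resolve_browser_path("", [sys.executable]))
```

With an empty `CUSTOM_BROWSER_PATH` the function simply returns the first detected candidate, and it returns `None` when nothing usable is found, which would correspond to the "no usable browser" error case.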
|
||||||
|
|
||||||
|
## Supported Browsers

### Windows
- Google Chrome (Stable, Beta, Dev, Canary)
- Microsoft Edge (Stable, Beta, Dev, Canary)

### macOS
- Google Chrome (Stable, Beta, Dev, Canary)
- Microsoft Edge (Stable, Beta, Dev, Canary)

### Linux
- Google Chrome / Chromium
- Microsoft Edge
|
||||||
|
|
||||||
|
## Usage Examples

### Basic usage

```python
import asyncio
from playwright.async_api import async_playwright
from tools.cdp_browser import CDPBrowserManager

async def main():
    cdp_manager = CDPBrowserManager()

    async with async_playwright() as playwright:
        # Launch the CDP browser
        browser_context = await cdp_manager.launch_and_connect(
            playwright=playwright,
            user_agent="custom User-Agent",
            headless=False
        )

        # Create a page and visit a site
        page = await browser_context.new_page()
        await page.goto("https://example.com")

        # Perform crawling operations...

        # Clean up resources
        await cdp_manager.cleanup()

asyncio.run(main())
```
|
||||||
|
|
||||||
|
### Using it in the crawlers

CDP mode is integrated into the crawlers (in this commit, the xhs and dy crawlers implement it); simply enable the option:

```python
# In config/base_config.py
ENABLE_CDP_MODE = True

# Then run the crawler as usual from a shell:
#   python main.py
```
|
||||||
|
|
||||||
|
## Troubleshooting

### Common issues

#### 1. Browser detection fails
**Error**: `未找到可用的浏览器` (no usable browser found)

**Solutions**:
- Make sure Chrome or Edge is installed
- Check that the browser is installed in a standard location
- Use `CUSTOM_BROWSER_PATH` to specify the browser path
|
||||||
|
|
||||||
|
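#### 2. Port already in use
**Error**: `无法找到可用的端口` (no available port could be found)

**Solutions**:
- Close other programs that are using the debugging port
- Change `CDP_DEBUG_PORT` to a different port
- The launcher automatically tries the next available port (up to 100 ports, starting from the configured one)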
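
The port fallback described above can be sketched as a simple bind probe. This is an illustrative stand-alone version, not the project's code: `first_free_port` and its parameters are assumed names, although the approach (try to bind, move on when the bind fails) follows the launcher's `find_available_port` shown in this commit.

```python
import socket


def first_free_port(start: int = 9222, attempts: int = 100, host: str = "127.0.0.1") -> int:
    """Return the first port in [start, start + attempts) that can be bound on host."""
    last = min(start + attempts, 65536)  # never probe beyond the valid port range
    for port in range(start, last):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is taken, try the next one
            return port
    raise RuntimeError(f"no free port found in range {start}..{last - 1}")
```

A probe like this only shows that the port was free at the moment of the check; another process can still claim it before the browser starts, so the launch step should tolerate a failure as well.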
|
||||||
|
|
||||||
|
#### 3. Browser launch timeout
**Error**: `浏览器在30秒内未能启动` (the browser failed to start within 30 seconds)

**Solutions**:
- Increase the `BROWSER_LAUNCH_TIMEOUT` value
- Check that the system has enough free resources
- Try closing other resource-intensive programs
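
For context, a launch timeout of this kind is typically enforced by polling the debugging port until it accepts connections. The sketch below shows that idea under stated assumptions: `wait_for_port` is a hypothetical helper, and the project's actual readiness check may differ.

```python
import socket
import time


def wait_for_port(port: int, host: str = "127.0.0.1",
                  timeout: float = 30.0, interval: float = 0.2) -> bool:
    """Poll host:port until a TCP connection succeeds or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True  # something is listening on the port
        except OSError:
            time.sleep(interval)  # not ready yet, wait and retry
    return False
```

Calling `wait_for_port(9222, timeout=30)` after starting the browser would return `False` in the timeout situation described above, at which point raising `BROWSER_LAUNCH_TIMEOUT` or freeing system resources are the suggested remedies.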
|
||||||
|
|
||||||
|
#### 4. CDP connection fails
**Error**: `CDP连接失败` (CDP connection failed)

**Solutions**:
- Check the firewall settings
- Make sure localhost is reachable
- Try restarting the browser
|
||||||
|
|
||||||
|
### Debugging tips

#### 1. Enable verbose logging
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

#### 2. Test the CDP connection manually
```bash
# Start Chrome manually
chrome --remote-debugging-port=9222

# Query the debugging endpoint
curl http://localhost:9222/json
```
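
The same manual check can be done from Python with only the standard library. This is a sketch rather than project code: `cdp_version` and `cdp_version_url` are hypothetical helpers, and they query `/json/version`, one of the HTTP endpoints Chrome exposes on the remote debugging port.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def cdp_version_url(port: int = 9222, host: str = "localhost") -> str:
    """Build the URL of the browser's version endpoint on the debugging port."""
    return f"http://{host}:{port}/json/version"


def cdp_version(port: int = 9222, host: str = "localhost",
                timeout: float = 3.0) -> Optional[dict]:
    """Return the parsed version payload, or None if the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(cdp_version_url(port, host), timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        return None
```

If Chrome was started with `--remote-debugging-port=9222`, `cdp_version()` should return a dictionary with keys such as `Browser` and `webSocketDebuggerUrl`; a `None` result points to the connection problems listed under common issues.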
|
||||||
|
|
||||||
|
#### 3. Check the browser process
```bash
# Windows
tasklist | findstr chrome

# macOS/Linux
ps aux | grep chrome
```
|
||||||
|
|
||||||
|
## Best Practices

### 1. Anti-detection
- Keep `CDP_HEADLESS = False` for the best anti-detection results
- Use a realistic User-Agent string
- Avoid sending requests too frequently
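
One simple way to avoid overly frequent requests is a randomized pause between them. The XiaoHongShu crawler in this diff already draws its interval with `random.uniform(1, config.CRAWLER_MAX_SLEEP_SEC)`; the sketch below wraps the same idea in two hypothetical helpers (`next_delay`, `polite_pause`) that are not part of the project.

```python
import asyncio
import random


def next_delay(max_sec: float, min_sec: float = 1.0) -> float:
    """Pick a random pause in [min_sec, max_sec]; use min_sec if max_sec is smaller."""
    return random.uniform(min_sec, max(min_sec, max_sec))


async def polite_pause(max_sec: float) -> float:
    """Sleep for a randomized interval and return the number of seconds waited."""
    delay = next_delay(max_sec)
    await asyncio.sleep(delay)
    return delay
```

Inside a crawl loop, `await polite_pause(config.CRAWLER_MAX_SLEEP_SEC)` between requests would spread them out irregularly instead of firing at a fixed rate.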
|
||||||
|
|
||||||
|
### 2. Performance
- Set `AUTO_CLOSE_BROWSER` appropriately
- Reuse the browser instance instead of restarting it frequently
- Monitor memory usage

### 3. Security
- Do not store sensitive cookies in production environments
- Clear browser data regularly
- Respect user privacy

### 4. Compatibility
- Test compatibility across browser versions
- Keep a fallback ready (standard Playwright mode)
- Watch for changes in the target site's anti-crawling strategy
|
||||||
|
|
||||||
|
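## How It Works

CDP mode works as follows:

1. **Browser detection**: automatically scans the system for Chrome/Edge install paths
2. **Process launch**: starts the browser with the `--remote-debugging-port` argument
3. **CDP connection**: connects to the browser's debugging interface over WebSocket
4. **Playwright integration**: takes over browser control with Playwright's `connect_over_cdp` method (`connectOverCDP` in the JavaScript API)
5. **Context management**: creates or reuses a browser context for subsequent operations

This approach bypasses conventional WebDriver detection and provides less conspicuous automation.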
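
Steps 3 to 5 can be sketched directly against Playwright's Python API. This is a minimal illustration that assumes a browser is already listening on the debugging port; `cdp_endpoint` and `open_page_over_cdp` are hypothetical names, and the project's `CDPBrowserManager` may do considerably more (proxy, user agent, cleanup).

```python
import asyncio


def cdp_endpoint(port: int = 9222, host: str = "localhost") -> str:
    """HTTP endpoint of a browser started with --remote-debugging-port."""
    return f"http://{host}:{port}"


async def open_page_over_cdp(url: str, port: int = 9222) -> str:
    """Attach to the running browser, open url in a context, and return the page title."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as playwright:
        browser = await playwright.chromium.connect_over_cdp(cdp_endpoint(port))
        # Reuse the browser's existing context if there is one, otherwise create a new one.
        context = browser.contexts[0] if browser.contexts else await browser.new_context()
        page = await context.new_page()
        await page.goto(url)
        title = await page.title()
        await page.close()
        return title
```

With Chrome running on port 9222, `asyncio.run(open_page_over_cdp("https://example.com"))` would attach to it and return the page title. Reusing `browser.contexts[0]` is what lets the session inherit the cookies and login state of the launched profile.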
|
||||||
|
|
||||||
|
## Changelog

### v1.0.0
- Initial release
- Chrome/Edge detection on Windows and macOS
- Integrated into the platform crawlers (xhs and dy in this commit)
- Full set of configuration options and error handling

## Contributing

Issues and pull requests that improve CDP mode are welcome.

## License

This feature is covered by the project's overall license terms and is for learning and research use only.
|
||||||
52
docs/原生环境管理文档.md
Normal file
@@ -0,0 +1,52 @@
|
|||||||
|
## Managing dependencies with Python's native venv (no longer recommended)

## Create and activate a Python virtual environment
> If you crawl Douyin or Zhihu, install Node.js beforehand; version `16` or later is sufficient <br>
> [uv](https://github.com/astral-sh/uv) has been added to manage project dependencies; it replaces both Python version management and pip-based installs and is faster and more convenient
```shell
# Enter the project root directory
cd MediaCrawler

# Create the virtual environment
# The author's Python version is 3.9.6 and the libraries in requirements.txt target that version; with other Python versions some of them may be incompatible, and you will need to resolve that yourself.
python -m venv venv

# Activate the virtual environment on macOS & Linux
source venv/bin/activate

# Activate the virtual environment on Windows
venv\Scripts\activate

```

## Install dependencies

```shell
pip install -r requirements.txt
```

## Review the configuration file

## Install the Playwright browser driver (optional)

```shell
playwright install
```

## Run the crawler

```shell
### Comment crawling is disabled by default; to enable it, change the ENABLE_GET_COMMENTS variable in config/base_config.py
### Other supported options are also in config/base_config.py, with comments in Chinese

# Read keywords from the config file, search for related posts, and crawl the posts and their comments
python main.py --platform xhs --lt qrcode --type search

# Read the list of post IDs from the config file and crawl those posts and their comments
python main.py --platform xhs --lt qrcode --type detail

# Open the corresponding app and scan the QR code to log in

# For usage examples on other platforms, run:
python main.py --help
```
|
||||||
3
main.py
@@ -11,6 +11,7 @@
|
|||||||
|
|
||||||
import asyncio
|
import asyncio
|
||||||
import sys
|
import sys
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
import cmd_arg
|
import cmd_arg
|
||||||
import config
|
import config
|
||||||
@@ -43,8 +44,8 @@ class CrawlerFactory:
|
|||||||
raise ValueError("Invalid Media Platform Currently only supported xhs or dy or ks or bili ...")
|
raise ValueError("Invalid Media Platform Currently only supported xhs or dy or ks or bili ...")
|
||||||
return crawler_class()
|
return crawler_class()
|
||||||
|
|
||||||
|
|
||||||
async def main():
|
async def main():
|
||||||
|
|
||||||
# parse cmd
|
# parse cmd
|
||||||
await cmd_arg.parse_cmd()
|
await cmd_arg.parse_cmd()
|
||||||
|
|
||||||
|
|||||||
@@ -15,7 +15,7 @@ import random
|
|||||||
from asyncio import Task
|
from asyncio import Task
|
||||||
from typing import Any, Dict, List, Optional, Tuple
|
from typing import Any, Dict, List, Optional, Tuple
|
||||||
|
|
||||||
from playwright.async_api import (BrowserContext, BrowserType, Page,
|
from playwright.async_api import (BrowserContext, BrowserType, Page, Playwright,
|
||||||
async_playwright)
|
async_playwright)
|
||||||
|
|
||||||
import config
|
import config
|
||||||
@@ -23,6 +23,7 @@ from base.base_crawler import AbstractCrawler
|
|||||||
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
|
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
|
||||||
from store import douyin as douyin_store
|
from store import douyin as douyin_store
|
||||||
from tools import utils
|
from tools import utils
|
||||||
|
from tools.cdp_browser import CDPBrowserManager
|
||||||
from var import crawler_type_var, source_keyword_var
|
from var import crawler_type_var, source_keyword_var
|
||||||
|
|
||||||
from .client import DOUYINClient
|
from .client import DOUYINClient
|
||||||
@@ -35,9 +36,11 @@ class DouYinCrawler(AbstractCrawler):
|
|||||||
context_page: Page
|
context_page: Page
|
||||||
dy_client: DOUYINClient
|
dy_client: DOUYINClient
|
||||||
browser_context: BrowserContext
|
browser_context: BrowserContext
|
||||||
|
cdp_manager: Optional[CDPBrowserManager]
|
||||||
|
|
||||||
def __init__(self) -> None:
|
def __init__(self) -> None:
|
||||||
self.index_url = "https://www.douyin.com"
|
self.index_url = "https://www.douyin.com"
|
||||||
|
self.cdp_manager = None
|
||||||
|
|
||||||
async def start(self) -> None:
|
async def start(self) -> None:
|
||||||
playwright_proxy_format, httpx_proxy_format = None, None
|
playwright_proxy_format, httpx_proxy_format = None, None
|
||||||
@@ -47,14 +50,23 @@ class DouYinCrawler(AbstractCrawler):
|
|||||||
playwright_proxy_format, httpx_proxy_format = self.format_proxy_info(ip_proxy_info)
|
playwright_proxy_format, httpx_proxy_format = self.format_proxy_info(ip_proxy_info)
|
||||||
|
|
||||||
async with async_playwright() as playwright:
|
async with async_playwright() as playwright:
|
||||||
# Launch a browser context.
|
# Choose the launch mode based on the configuration
|
||||||
chromium = playwright.chromium
|
if config.ENABLE_CDP_MODE:
|
||||||
self.browser_context = await self.launch_browser(
|
utils.logger.info("[DouYinCrawler] 使用CDP模式启动浏览器")
|
||||||
chromium,
|
self.browser_context = await self.launch_browser_with_cdp(
|
||||||
None,
|
playwright, playwright_proxy_format, None,
|
||||||
user_agent=None,
|
headless=config.CDP_HEADLESS
|
||||||
headless=config.HEADLESS
|
)
|
||||||
)
|
else:
|
||||||
|
utils.logger.info("[DouYinCrawler] 使用标准模式启动浏览器")
|
||||||
|
# Launch a browser context.
|
||||||
|
chromium = playwright.chromium
|
||||||
|
self.browser_context = await self.launch_browser(
|
||||||
|
chromium,
|
||||||
|
playwright_proxy_format,
|
||||||
|
user_agent=None,
|
||||||
|
headless=config.HEADLESS
|
||||||
|
)
|
||||||
# stealth.min.js is a js script to prevent the website from detecting the crawler.
|
# stealth.min.js is a js script to prevent the website from detecting the crawler.
|
||||||
await self.browser_context.add_init_script(path="libs/stealth.min.js")
|
await self.browser_context.add_init_script(path="libs/stealth.min.js")
|
||||||
self.context_page = await self.browser_context.new_page()
|
self.context_page = await self.browser_context.new_page()
|
||||||
@@ -282,7 +294,41 @@ class DouYinCrawler(AbstractCrawler):
|
|||||||
)
|
)
|
||||||
return browser_context
|
return browser_context
|
||||||
|
|
||||||
|
    async def launch_browser_with_cdp(self, playwright: Playwright, playwright_proxy: Optional[Dict],
                                      user_agent: Optional[str], headless: bool = True) -> BrowserContext:
        """
        Launch the browser in CDP mode
        """
        try:
            self.cdp_manager = CDPBrowserManager()
            browser_context = await self.cdp_manager.launch_and_connect(
                playwright=playwright,
                playwright_proxy=playwright_proxy,
                user_agent=user_agent,
                headless=headless
            )

            # Add the anti-detection script
            await self.cdp_manager.add_stealth_script()

            # Log browser information
            browser_info = await self.cdp_manager.get_browser_info()
            utils.logger.info(f"[DouYinCrawler] CDP浏览器信息: {browser_info}")

            return browser_context

        except Exception as e:
            utils.logger.error(f"[DouYinCrawler] CDP模式启动失败,回退到标准模式: {e}")
            # Fall back to standard mode
            chromium = playwright.chromium
            return await self.launch_browser(chromium, playwright_proxy, user_agent, headless)
|
||||||
|
|
||||||
async def close(self) -> None:
|
async def close(self) -> None:
|
||||||
"""Close browser context"""
|
"""Close browser context"""
|
||||||
await self.browser_context.close()
|
# CDP mode needs special handling on close
|
||||||
|
if self.cdp_manager:
|
||||||
|
await self.cdp_manager.cleanup()
|
||||||
|
self.cdp_manager = None
|
||||||
|
else:
|
||||||
|
await self.browser_context.close()
|
||||||
utils.logger.info("[DouYinCrawler.close] Browser context closed ...")
|
utils.logger.info("[DouYinCrawler.close] Browser context closed ...")
|
||||||
|
|||||||
@@ -16,7 +16,7 @@ import time
|
|||||||
from asyncio import Task
|
from asyncio import Task
|
||||||
from typing import Dict, List, Optional, Tuple
|
from typing import Dict, List, Optional, Tuple
|
||||||
|
|
||||||
from playwright.async_api import BrowserContext, BrowserType, Page, async_playwright
|
from playwright.async_api import BrowserContext, BrowserType, Page, Playwright, async_playwright
|
||||||
from tenacity import RetryError
|
from tenacity import RetryError
|
||||||
|
|
||||||
import config
|
import config
|
||||||
@@ -26,6 +26,7 @@ from model.m_xiaohongshu import NoteUrlInfo
|
|||||||
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
|
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
|
||||||
from store import xhs as xhs_store
|
from store import xhs as xhs_store
|
||||||
from tools import utils
|
from tools import utils
|
||||||
|
from tools.cdp_browser import CDPBrowserManager
|
||||||
from var import crawler_type_var, source_keyword_var
|
from var import crawler_type_var, source_keyword_var
|
||||||
|
|
||||||
from .client import XiaoHongShuClient
|
from .client import XiaoHongShuClient
|
||||||
@@ -39,11 +40,13 @@ class XiaoHongShuCrawler(AbstractCrawler):
|
|||||||
context_page: Page
|
context_page: Page
|
||||||
xhs_client: XiaoHongShuClient
|
xhs_client: XiaoHongShuClient
|
||||||
browser_context: BrowserContext
|
browser_context: BrowserContext
|
||||||
|
cdp_manager: Optional[CDPBrowserManager]
|
||||||
|
|
||||||
def __init__(self) -> None:
|
def __init__(self) -> None:
|
||||||
self.index_url = "https://www.xiaohongshu.com"
|
self.index_url = "https://www.xiaohongshu.com"
|
||||||
# self.user_agent = utils.get_user_agent()
|
# self.user_agent = utils.get_user_agent()
|
||||||
self.user_agent = config.UA if config.UA else "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
|
self.user_agent = config.UA if config.UA else "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
|
||||||
|
self.cdp_manager = None
|
||||||
|
|
||||||
async def start(self) -> None:
|
async def start(self) -> None:
|
||||||
playwright_proxy_format, httpx_proxy_format = None, None
|
playwright_proxy_format, httpx_proxy_format = None, None
|
||||||
@@ -57,11 +60,20 @@ class XiaoHongShuCrawler(AbstractCrawler):
|
|||||||
)
|
)
|
||||||
|
|
||||||
async with async_playwright() as playwright:
|
async with async_playwright() as playwright:
|
||||||
# Launch a browser context.
|
# Choose the launch mode based on the configuration
|
||||||
chromium = playwright.chromium
|
if config.ENABLE_CDP_MODE:
|
||||||
self.browser_context = await self.launch_browser(
|
utils.logger.info("[XiaoHongShuCrawler] 使用CDP模式启动浏览器")
|
||||||
chromium, None, self.user_agent, headless=config.HEADLESS
|
self.browser_context = await self.launch_browser_with_cdp(
|
||||||
)
|
playwright, playwright_proxy_format, self.user_agent,
|
||||||
|
headless=config.CDP_HEADLESS
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
utils.logger.info("[XiaoHongShuCrawler] 使用标准模式启动浏览器")
|
||||||
|
# Launch a browser context.
|
||||||
|
chromium = playwright.chromium
|
||||||
|
self.browser_context = await self.launch_browser(
|
||||||
|
chromium, playwright_proxy_format, self.user_agent, headless=config.HEADLESS
|
||||||
|
)
|
||||||
# stealth.min.js is a js script to prevent the website from detecting the crawler.
|
# stealth.min.js is a js script to prevent the website from detecting the crawler.
|
||||||
await self.browser_context.add_init_script(path="libs/stealth.min.js")
|
await self.browser_context.add_init_script(path="libs/stealth.min.js")
|
||||||
# add a cookie attribute webId to avoid the appearance of a sliding captcha on the webpage
|
# add a cookie attribute webId to avoid the appearance of a sliding captcha on the webpage
|
||||||
@@ -292,6 +304,7 @@ class XiaoHongShuCrawler(AbstractCrawler):
|
|||||||
else:
|
else:
|
||||||
crawl_interval = random.uniform(1, config.CRAWLER_MAX_SLEEP_SEC)
|
crawl_interval = random.uniform(1, config.CRAWLER_MAX_SLEEP_SEC)
|
||||||
try:
|
try:
|
||||||
|
utils.logger.info(f"[get_note_detail_async_task] Begin get note detail, note_id: {note_id}")
|
||||||
# 尝试直接获取网页版笔记详情,携带cookie
|
# 尝试直接获取网页版笔记详情,携带cookie
|
||||||
note_detail_from_html: Optional[Dict] = (
|
note_detail_from_html: Optional[Dict] = (
|
||||||
await self.xhs_client.get_note_by_id_from_html(
|
await self.xhs_client.get_note_by_id_from_html(
|
||||||
@@ -449,9 +462,40 @@ class XiaoHongShuCrawler(AbstractCrawler):
|
|||||||
)
|
)
|
||||||
return browser_context
|
return browser_context
|
||||||
|
|
||||||
|
    async def launch_browser_with_cdp(self, playwright: Playwright, playwright_proxy: Optional[Dict],
                                      user_agent: Optional[str], headless: bool = True) -> BrowserContext:
        """
        Launch the browser in CDP mode
        """
        try:
            self.cdp_manager = CDPBrowserManager()
            browser_context = await self.cdp_manager.launch_and_connect(
                playwright=playwright,
                playwright_proxy=playwright_proxy,
                user_agent=user_agent,
                headless=headless
            )

            # Log browser information
            browser_info = await self.cdp_manager.get_browser_info()
            utils.logger.info(f"[XiaoHongShuCrawler] CDP浏览器信息: {browser_info}")

            return browser_context

        except Exception as e:
            utils.logger.error(f"[XiaoHongShuCrawler] CDP模式启动失败,回退到标准模式: {e}")
            # Fall back to standard mode
            chromium = playwright.chromium
            return await self.launch_browser(chromium, playwright_proxy, user_agent, headless)
|
||||||
|
|
||||||
async def close(self):
|
async def close(self):
|
||||||
"""Close browser context"""
|
"""Close browser context"""
|
||||||
await self.browser_context.close()
|
# CDP mode needs special handling on close
|
||||||
|
if self.cdp_manager:
|
||||||
|
await self.cdp_manager.cleanup()
|
||||||
|
self.cdp_manager = None
|
||||||
|
else:
|
||||||
|
await self.browser_context.close()
|
||||||
utils.logger.info("[XiaoHongShuCrawler.close] Browser context closed ...")
|
utils.logger.info("[XiaoHongShuCrawler.close] Browser context closed ...")
|
||||||
|
|
||||||
async def get_notice_media(self, note_detail: Dict):
|
async def get_notice_media(self, note_detail: Dict):
|
||||||
|
|||||||
243
tools/browser_launcher.py
Normal file
@@ -0,0 +1,243 @@
|
|||||||
|
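# Disclaimer: this code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of use and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operation.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to abide by the principles above and all terms in the LICENSE.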
|
||||||
|
|
||||||
|
|
||||||
|
import os
|
||||||
|
import platform
|
||||||
|
import subprocess
|
||||||
|
import time
|
||||||
|
import socket
|
||||||
|
from typing import Optional, List, Tuple
|
||||||
|
import asyncio
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from tools import utils
|
||||||
|
|
||||||
|
|
||||||
|
class BrowserLauncher:
    """
    Browser launcher that detects and starts the user's Chrome/Edge browser
    Supports Windows and macOS (common Linux paths are also probed)
    """

    def __init__(self):
        self.system = platform.system()
        self.browser_process = None
        self.debug_port = None
|
||||||
|
|
||||||
|
    def detect_browser_paths(self) -> List[str]:
        """
        Detect the browser paths available on this system
        Returns a list of browser paths sorted by priority
        """
        paths = []

        if self.system == "Windows":
            # Common Chrome/Edge install paths on Windows
            possible_paths = [
                # Chrome paths
                os.path.expandvars(r"%PROGRAMFILES%\Google\Chrome\Application\chrome.exe"),
                os.path.expandvars(r"%PROGRAMFILES(X86)%\Google\Chrome\Application\chrome.exe"),
                os.path.expandvars(r"%LOCALAPPDATA%\Google\Chrome\Application\chrome.exe"),
                # Edge paths
                os.path.expandvars(r"%PROGRAMFILES%\Microsoft\Edge\Application\msedge.exe"),
                os.path.expandvars(r"%PROGRAMFILES(X86)%\Microsoft\Edge\Application\msedge.exe"),
                # Chrome Beta/Dev/Canary
                os.path.expandvars(r"%LOCALAPPDATA%\Google\Chrome Beta\Application\chrome.exe"),
                os.path.expandvars(r"%LOCALAPPDATA%\Google\Chrome Dev\Application\chrome.exe"),
                os.path.expandvars(r"%LOCALAPPDATA%\Google\Chrome SxS\Application\chrome.exe"),
            ]
        elif self.system == "Darwin":  # macOS
            # Common Chrome/Edge install paths on macOS
            possible_paths = [
                # Chrome paths
                "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
                "/Applications/Google Chrome Beta.app/Contents/MacOS/Google Chrome Beta",
                "/Applications/Google Chrome Dev.app/Contents/MacOS/Google Chrome Dev",
                "/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary",
                # Edge paths
                "/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge",
                "/Applications/Microsoft Edge Beta.app/Contents/MacOS/Microsoft Edge Beta",
                "/Applications/Microsoft Edge Dev.app/Contents/MacOS/Microsoft Edge Dev",
                "/Applications/Microsoft Edge Canary.app/Contents/MacOS/Microsoft Edge Canary",
            ]
        else:
            # Linux and other systems
            possible_paths = [
                "/usr/bin/google-chrome",
                "/usr/bin/google-chrome-stable",
                "/usr/bin/google-chrome-beta",
                "/usr/bin/google-chrome-unstable",
                "/usr/bin/chromium-browser",
                "/usr/bin/chromium",
                "/snap/bin/chromium",
                "/usr/bin/microsoft-edge",
                "/usr/bin/microsoft-edge-stable",
                "/usr/bin/microsoft-edge-beta",
                "/usr/bin/microsoft-edge-dev",
            ]

        # Keep only paths that exist and are executable
        for path in possible_paths:
            if os.path.isfile(path) and os.access(path, os.X_OK):
                paths.append(path)

        return paths
|
||||||
|
|
||||||
|
    def find_available_port(self, start_port: int = 9222) -> int:
        """
        Find an available port
        """
        port = start_port
        while port < start_port + 100:  # try at most 100 ports
            try:
                with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                    s.bind(('localhost', port))
                    return port
            except OSError:
                port += 1

        raise RuntimeError(f"无法找到可用的端口,已尝试 {start_port} 到 {port-1}")
|
||||||
|
|
||||||
|
    def launch_browser(self, browser_path: str, debug_port: int, headless: bool = False,
                       user_data_dir: Optional[str] = None) -> subprocess.Popen:
        """
        Launch the browser process.
        """
        # Basic launch arguments
        args = [
            browser_path,
            f"--remote-debugging-port={debug_port}",
            "--no-first-run",
            "--no-default-browser-check",
            "--disable-background-timer-throttling",
            "--disable-backgrounding-occluded-windows",
            "--disable-renderer-backgrounding",
            # Merged into a single switch: when --disable-features is repeated,
            # Chromium keeps only the last occurrence
            "--disable-features=TranslateUI,VizDisplayCompositor",
            "--disable-ipc-flooding-protection",
            "--disable-hang-monitor",
            "--disable-prompt-on-repost",
            "--disable-sync",
            "--disable-web-security",  # may help with access to some sites
            # Intended to keep user extensions; both switches normally take
            # extension paths and are passed here without a value
            "--disable-extensions-except",
            "--load-extension",
        ]

        # Headless mode
        if headless:
            args.extend([
                "--headless",
                "--disable-gpu",
                "--no-sandbox",
            ])

        # User data directory
        if user_data_dir:
            args.append(f"--user-data-dir={user_data_dir}")

        utils.logger.info(f"[BrowserLauncher] Launching browser: {browser_path}")
        utils.logger.info(f"[BrowserLauncher] Debug port: {debug_port}")
        utils.logger.info(f"[BrowserLauncher] Headless mode: {headless}")

        try:
            # On Windows, use CREATE_NEW_PROCESS_GROUP so Ctrl+C does not reach the child process
            if self.system == "Windows":
                process = subprocess.Popen(
                    args,
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                    creationflags=subprocess.CREATE_NEW_PROCESS_GROUP
                )
            else:
                process = subprocess.Popen(
                    args,
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                    preexec_fn=os.setsid  # start a new process group
                )

            return process

        except Exception as e:
            utils.logger.error(f"[BrowserLauncher] Failed to launch browser: {e}")
            raise

    def wait_for_browser_ready(self, debug_port: int, timeout: int = 30) -> bool:
        """
        Wait until the browser accepts connections on the debug port.
        """
        utils.logger.info(f"[BrowserLauncher] Waiting for browser to become ready on port {debug_port}...")

        start_time = time.time()
        while time.time() - start_time < timeout:
            try:
                with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                    s.settimeout(1)
                    result = s.connect_ex(('localhost', debug_port))
                    if result == 0:
                        utils.logger.info(f"[BrowserLauncher] Browser is ready on port {debug_port}")
                        return True
            except Exception:
                pass

            time.sleep(0.5)

        utils.logger.error(f"[BrowserLauncher] Browser was not ready within {timeout} seconds")
        return False

    def get_browser_info(self, browser_path: str) -> Tuple[str, str]:
        """
        Get browser information (name and version).
        """
        try:
            if "chrome" in browser_path.lower():
                name = "Google Chrome"
            elif "edge" in browser_path.lower() or "msedge" in browser_path.lower():
                name = "Microsoft Edge"
            elif "chromium" in browser_path.lower():
                name = "Chromium"
            else:
                name = "Unknown Browser"

            # Try to read the version string
            try:
                result = subprocess.run([browser_path, "--version"],
                                        capture_output=True, text=True, timeout=5)
                version = result.stdout.strip() if result.stdout else "Unknown Version"
            except Exception:  # narrowed from a bare except so KeyboardInterrupt still propagates
                version = "Unknown Version"

            return name, version

        except Exception:
            return "Unknown Browser", "Unknown Version"

    def cleanup(self):
        """
        Clean up resources and close the browser process.
        """
        if self.browser_process:
            try:
                utils.logger.info("[BrowserLauncher] Closing browser process...")

                if self.system == "Windows":
                    # On Windows, use taskkill to force-terminate the process tree
                    subprocess.run(["taskkill", "/F", "/T", "/PID", str(self.browser_process.pid)],
                                   capture_output=True)
                else:
                    # On Unix, kill the whole process group (9 = SIGKILL)
                    os.killpg(os.getpgid(self.browser_process.pid), 9)

                self.browser_process = None
                utils.logger.info("[BrowserLauncher] Browser process closed")

            except Exception as e:
                utils.logger.warning(f"[BrowserLauncher] Error while closing browser process: {e}")
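The port handling in `find_available_port` and `wait_for_browser_ready` can be sketched on its own, outside the class. This is a minimal illustration using only the standard library, not the project's implementation: the function names `find_free_port` and `wait_for_port` and the `max_tries` and `interval` parameters are hypothetical, and `127.0.0.1` is used in place of `localhost`.

```python
import socket
import time


def find_free_port(start_port: int = 9222, max_tries: int = 100) -> int:
    """Return the first port in [start_port, start_port + max_tries) that can be bound."""
    for port in range(start_port, start_port + max_tries):
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(("127.0.0.1", port))
                return port  # the probe socket is closed on leaving the with-block
        except OSError:
            continue  # port is in use, try the next one
    raise RuntimeError(f"no free port in {start_port}..{start_port + max_tries - 1}")


def wait_for_port(port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until something accepts TCP connections on the port, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(1)
            # connect_ex returns 0 on success instead of raising
            if s.connect_ex(("127.0.0.1", port)) == 0:
                return True
        time.sleep(interval)
    return False
```

The probe socket is released before the browser starts, so another process could in principle take the port in between. An open port also only shows that something is listening, not that the CDP endpoint is fully initialized.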
266
tools/cdp_browser.py
Normal file
@@ -0,0 +1,266 @@
# Disclaimer: this code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of use and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to abide by the above principles and all terms in the LICENSE.


import os
import asyncio
from typing import Optional, Dict, Any
from playwright.async_api import Browser, BrowserContext, Playwright

import config
from tools.browser_launcher import BrowserLauncher
from tools import utils


class CDPBrowserManager:
    """
    CDP browser manager: launches and manages a browser connected over CDP.
    """

    def __init__(self):
        self.launcher = BrowserLauncher()
        self.browser: Optional[Browser] = None
        self.browser_context: Optional[BrowserContext] = None
        self.debug_port: Optional[int] = None

    async def launch_and_connect(self, playwright: Playwright,
                                 playwright_proxy: Optional[Dict] = None,
                                 user_agent: Optional[str] = None,
                                 headless: bool = False) -> BrowserContext:
        """
        Launch the browser and connect to it over CDP.
        """
        try:
            # 1. Detect the browser path
            browser_path = await self._get_browser_path()

            # 2. Find an available port
            self.debug_port = self.launcher.find_available_port(config.CDP_DEBUG_PORT)

            # 3. Launch the browser
            await self._launch_browser(browser_path, headless)

            # 4. Connect over CDP
            await self._connect_via_cdp(playwright)

            # 5. Create the browser context
            browser_context = await self._create_browser_context(
                playwright_proxy, user_agent
            )

            self.browser_context = browser_context
            return browser_context

        except Exception as e:
            utils.logger.error(f"[CDPBrowserManager] Failed to launch CDP browser: {e}")
            await self.cleanup()
            raise

    async def _get_browser_path(self) -> str:
        """
        Get the browser path.
        """
        # Prefer the user-configured path
        if config.CUSTOM_BROWSER_PATH and os.path.isfile(config.CUSTOM_BROWSER_PATH):
            utils.logger.info(f"[CDPBrowserManager] Using custom browser path: {config.CUSTOM_BROWSER_PATH}")
            return config.CUSTOM_BROWSER_PATH

        # Otherwise auto-detect the browser path
        browser_paths = self.launcher.detect_browser_paths()

        if not browser_paths:
            raise RuntimeError(
                "No usable browser found. Make sure Chrome or Edge is installed, "
                "or set CUSTOM_BROWSER_PATH in the config file to point to a browser."
            )

        browser_path = browser_paths[0]  # use the first browser found
        browser_name, browser_version = self.launcher.get_browser_info(browser_path)

        utils.logger.info(f"[CDPBrowserManager] Detected browser: {browser_name} ({browser_version})")
        utils.logger.info(f"[CDPBrowserManager] Browser path: {browser_path}")

        return browser_path

    async def _launch_browser(self, browser_path: str, headless: bool):
        """
        Launch the browser process.
        """
        # Set the user data directory (only when saving login state is enabled)
        user_data_dir = None
        if config.SAVE_LOGIN_STATE:
            user_data_dir = os.path.join(
                os.getcwd(), "browser_data",
                f"cdp_{config.USER_DATA_DIR % config.PLATFORM}"
            )
            os.makedirs(user_data_dir, exist_ok=True)
            utils.logger.info(f"[CDPBrowserManager] User data directory: {user_data_dir}")

        # Launch the browser
        self.launcher.browser_process = self.launcher.launch_browser(
            browser_path=browser_path,
            debug_port=self.debug_port,
            headless=headless,
            user_data_dir=user_data_dir
        )

        # Wait until the browser is ready
        if not self.launcher.wait_for_browser_ready(
            self.debug_port, config.BROWSER_LAUNCH_TIMEOUT
        ):
            raise RuntimeError(f"Browser failed to start within {config.BROWSER_LAUNCH_TIMEOUT} seconds")

    async def _connect_via_cdp(self, playwright: Playwright):
        """
        Connect to the browser over CDP.
        """
        cdp_url = f"http://localhost:{self.debug_port}"
        utils.logger.info(f"[CDPBrowserManager] Connecting to browser over CDP: {cdp_url}")

        try:
            # Connect using Playwright's connect_over_cdp method
            self.browser = await playwright.chromium.connect_over_cdp(cdp_url)

            if self.browser.is_connected():
                utils.logger.info("[CDPBrowserManager] Connected to browser")
                utils.logger.info(f"[CDPBrowserManager] Number of browser contexts: {len(self.browser.contexts)}")
            else:
                raise RuntimeError("CDP connection failed")

        except Exception as e:
            utils.logger.error(f"[CDPBrowserManager] CDP connection failed: {e}")
            raise

    async def _create_browser_context(self, playwright_proxy: Optional[Dict] = None,
                                      user_agent: Optional[str] = None) -> BrowserContext:
        """
        Create a browser context, or reuse an existing one.
        """
        if not self.browser:
            raise RuntimeError("Browser is not connected")

        # Reuse an existing context or create a new one
        contexts = self.browser.contexts

        if contexts:
            # Use the first existing context
            browser_context = contexts[0]
            utils.logger.info("[CDPBrowserManager] Using existing browser context")
        else:
            # Create a new context
            context_options = {
                "viewport": {"width": 1920, "height": 1080},
                "accept_downloads": True,
            }

            # Set the user agent
            if user_agent:
                context_options["user_agent"] = user_agent
                utils.logger.info(f"[CDPBrowserManager] Setting user agent: {user_agent}")

            # Note: proxy settings may not take effect in CDP mode because the browser is already running
            if playwright_proxy:
                utils.logger.warning(
                    "[CDPBrowserManager] Warning: proxy settings may not take effect in CDP mode; "
                    "configure a system proxy or a browser proxy extension before launching the browser"
                )

            browser_context = await self.browser.new_context(**context_options)
            utils.logger.info("[CDPBrowserManager] Created new browser context")

        return browser_context

    async def add_stealth_script(self, script_path: str = "libs/stealth.min.js"):
        """
        Add the anti-detection script.
        """
        if self.browser_context and os.path.exists(script_path):
            try:
                await self.browser_context.add_init_script(path=script_path)
                utils.logger.info(f"[CDPBrowserManager] Added anti-detection script: {script_path}")
            except Exception as e:
                utils.logger.warning(f"[CDPBrowserManager] Failed to add anti-detection script: {e}")

    async def add_cookies(self, cookies: list):
        """
        Add cookies.
        """
        if self.browser_context:
            try:
                await self.browser_context.add_cookies(cookies)
                utils.logger.info(f"[CDPBrowserManager] Added {len(cookies)} cookies")
            except Exception as e:
                utils.logger.warning(f"[CDPBrowserManager] Failed to add cookies: {e}")

    async def get_cookies(self) -> list:
        """
        Get the current cookies.
        """
        if self.browser_context:
            try:
                cookies = await self.browser_context.cookies()
                return cookies
            except Exception as e:
                utils.logger.warning(f"[CDPBrowserManager] Failed to get cookies: {e}")
                return []
        return []

    async def cleanup(self):
        """
        Clean up resources.
        """
        try:
            # Close the browser context
            if self.browser_context:
                await self.browser_context.close()
                self.browser_context = None
                utils.logger.info("[CDPBrowserManager] Browser context closed")

            # Disconnect from the browser
            if self.browser:
                await self.browser.close()
                self.browser = None
                utils.logger.info("[CDPBrowserManager] Browser connection closed")

            # Close the browser process (if configured to close automatically)
            if config.AUTO_CLOSE_BROWSER:
                self.launcher.cleanup()
            else:
                utils.logger.info("[CDPBrowserManager] Browser process left running (AUTO_CLOSE_BROWSER=False)")

        except Exception as e:
            utils.logger.error(f"[CDPBrowserManager] Error while cleaning up resources: {e}")

    def is_connected(self) -> bool:
        """
        Check whether the browser is connected.
        """
        return self.browser is not None and self.browser.is_connected()

    async def get_browser_info(self) -> Dict[str, Any]:
        """
        Get browser information.
        """
        if not self.browser:
            return {}

        try:
            version = self.browser.version
            contexts_count = len(self.browser.contexts)

            return {
                "version": version,
                "contexts_count": contexts_count,
                "debug_port": self.debug_port,
                "is_connected": self.is_connected()
            }
        except Exception as e:
            utils.logger.warning(f"[CDPBrowserManager] Failed to get browser info: {e}")
            return {}
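The reuse-or-create decision in `_create_browser_context` can be tried without a real browser. The sketch below is an illustration under stated assumptions, not the project's code: `FakeBrowser` is a hypothetical stand-in that exposes only `contexts` and `new_context`, and `get_or_create_context` is a hypothetical free function that mirrors the branch logic of the method above.

```python
import asyncio
from typing import Any, Dict, List, Optional


class FakeBrowser:
    """Hypothetical stand-in exposing only `contexts` and `new_context`."""

    def __init__(self, contexts: Optional[List[Any]] = None) -> None:
        self.contexts: List[Any] = list(contexts or [])

    async def new_context(self, **options: Any) -> Dict[str, Any]:
        # Record the options so the caller can inspect what was requested
        context = {"options": options}
        self.contexts.append(context)
        return context


async def get_or_create_context(browser: Any, user_agent: Optional[str] = None) -> Any:
    """Reuse the first existing context, otherwise create one with default options."""
    if browser.contexts:
        return browser.contexts[0]
    options: Dict[str, Any] = {
        "viewport": {"width": 1920, "height": 1080},
        "accept_downloads": True,
    }
    if user_agent:
        options["user_agent"] = user_agent
    return await browser.new_context(**options)


if __name__ == "__main__":
    browser = FakeBrowser()
    created = asyncio.run(get_or_create_context(browser, user_agent="demo-agent"))
    reused = asyncio.run(get_or_create_context(browser))
    print(created["options"]["user_agent"], reused is created)
```

As in the method above, the `user_agent` argument is applied only when a new context is created; when an existing context is reused, it keeps whatever settings it already had.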