Merge pull request #716 from persist-1/refactor

Refactor: fully rebuild the database layer with SQLAlchemy ORM
程序员阿江-Relakkes
2025-09-10 14:38:39 +08:00
committed by GitHub
54 changed files with 3885 additions and 6195 deletions

7
.gitignore vendored

@@ -173,4 +173,9 @@ docs/.vitepress/cache
# other gitignore
.venv
.refer
.refer
agent_zone
debug_tools
database/*.db


@@ -196,21 +196,29 @@ python main.py --help
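## 💾 Data Storage
Supports multiple data storage methods:
- **SQLite database**: lightweight database, no server required, suitable for personal use (recommended)
- Parameter: `--save_data_option sqlite`
- The database file is created automatically
- **MySQL database**: supports saving to the relational database MySQL (the database must be created in advance)
- Run `python db.py` to initialize the database table structure (first run only)
- **CSV files**: supports saving to CSV (under the `data/` directory)
- **JSON files**: supports saving to JSON (under the `data/` directory)
- **Database storage**
- Use the `--init_db` parameter to initialize the database (no other optional arguments are needed with `--init_db`)
- **SQLite database**: lightweight database, no server required, suitable for personal use (recommended)
1. Initialization: `--init_db sqlite`
2. Data storage: `--save_data_option sqlite`
- **MySQL database**: supports saving to the relational database MySQL (the database must be created in advance)
1. Initialization: `--init_db mysql`
2. Data storage: `--save_data_option db` (the `db` value is retained for compatibility with earlier versions)
### Usage examples:
```shell
# Use SQLite (recommended for personal users)
# Initialize the SQLite database (no other optional arguments are needed with '--init_db')
uv run main.py --init_db sqlite
# Store data in SQLite (recommended for personal users)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
# Use MySQL
```
```shell
# Initialize the MySQL database
uv run main.py --init_db mysql
# Store data in MySQL (the db value is retained for compatibility with earlier versions)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
```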


@@ -194,21 +194,29 @@ python main.py --help
## 💾 Data Storage
Supports multiple data storage methods:
- **SQLite Database**: Lightweight database without server, ideal for personal use (recommended)
- Parameter: `--save_data_option sqlite`
- Database file created automatically
- **MySQL Database**: Supports saving to relational database MySQL (need to create database in advance)
- Execute `python db.py` to initialize database table structure (only execute on first run)
- **CSV Files**: Supports saving to CSV (under `data/` directory)
- **JSON Files**: Supports saving to JSON (under `data/` directory)
- **Database Storage**
- Use the `--init_db` parameter for database initialization (when using `--init_db`, no other optional arguments are needed)
- **SQLite Database**: Lightweight database, no server required, suitable for personal use (recommended)
1. Initialization: `--init_db sqlite`
2. Data Storage: `--save_data_option sqlite`
- **MySQL Database**: Supports saving to relational database MySQL (database needs to be created in advance)
1. Initialization: `--init_db mysql`
2. Data Storage: `--save_data_option db` (the db parameter is retained for compatibility with historical updates)
### Usage Examples:
```shell
# Use SQLite (recommended for personal users)
# Initialize SQLite database (when using '--init_db', no other optional arguments are needed)
uv run main.py --init_db sqlite
# Use SQLite to store data (recommended for personal users)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
# Use MySQL
```
```shell
# Initialize MySQL database
uv run main.py --init_db mysql
# Use MySQL to store data (the db parameter is retained for compatibility with historical updates)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
```


@@ -194,21 +194,29 @@ python main.py --help
## 💾 Data Storage
Supports multiple data storage methods:
- **SQLite Database**: Lightweight serverless database, ideal for personal use (recommended)
- Parameter: `--save_data_option sqlite`
- The database file is created automatically
- **MySQL Database**: Supports saving to the relational database MySQL (the database must be created in advance)
- Run `python db.py` to initialize the database table structure (run only the first time)
- **CSV Files**: Supports saving to CSV (under the `data/` directory)
- **JSON Files**: Supports saving to JSON (under the `data/` directory)
- **Database Storage**
- Use the `--init_db` parameter to initialize the database (when using `--init_db`, no other optional arguments are needed)
- **SQLite Database**: Lightweight database, no server required, suitable for personal use (recommended)
1. Initialization: `--init_db sqlite`
2. Data Storage: `--save_data_option sqlite`
- **MySQL Database**: Supports saving to the relational database MySQL (the database must be created in advance)
1. Initialization: `--init_db mysql`
2. Data Storage: `--save_data_option db` (the db parameter is kept for compatibility with historical updates)
### Usage Examples:
```shell
# Use SQLite (recommended for personal users)
# Initialize the SQLite database (when using '--init_db', no other optional arguments are needed)
uv run main.py --init_db sqlite
# Use SQLite to store data (recommended for personal users)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
# Use MySQL
```
```shell
# Initialize the MySQL database
uv run main.py --init_db mysql
# Use MySQL to store data (the db parameter is kept for compatibility with historical updates)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
```


@@ -1,107 +0,0 @@
# Disclaimer: This code is for learning and research purposes only. Users must observe the following principles:
# 1. It must not be used for any commercial purpose.
# 2. Use must comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to abide by the principles above and all terms in the LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 14:21
# @Desc : Async aiomysql CRUD wrapper
from typing import Any, Dict, List, Union

import aiomysql


class AsyncMysqlDB:

    def __init__(self, pool: aiomysql.Pool) -> None:
        self.__pool = pool

    async def query(self, sql: str, *args: Union[str, int]) -> List[Dict[str, Any]]:
        """
        Query records with the given SQL and return them as a list
        :param sql: the SQL query
        :param args: dynamic parameters passed to the SQL
        :return:
        """
        async with self.__pool.acquire() as conn:
            async with conn.cursor(aiomysql.DictCursor) as cur:
                await cur.execute(sql, args)
                data = await cur.fetchall()
                return data or []

    async def get_first(self, sql: str, *args: Union[str, int]) -> Union[Dict[str, Any], None]:
        """
        Query records with the given SQL and return the first matching result
        :param sql: the SQL query
        :param args: dynamic parameters passed to the SQL
        :return:
        """
        async with self.__pool.acquire() as conn:
            async with conn.cursor(aiomysql.DictCursor) as cur:
                await cur.execute(sql, args)
                data = await cur.fetchone()
                return data

    async def item_to_table(self, table_name: str, item: Dict[str, Any]) -> int:
        """
        Insert a row into a table
        :param table_name: table name
        :param item: dictionary holding one record
        :return:
        """
        fields = list(item.keys())
        values = list(item.values())
        fields = [f'`{field}`' for field in fields]
        fieldstr = ','.join(fields)
        valstr = ','.join(['%s'] * len(item))
        sql = "INSERT INTO %s (%s) VALUES(%s)" % (table_name, fieldstr, valstr)
        async with self.__pool.acquire() as conn:
            async with conn.cursor(aiomysql.DictCursor) as cur:
                await cur.execute(sql, values)
                lastrowid = cur.lastrowid
                return lastrowid

    async def update_table(self, table_name: str, updates: Dict[str, Any], field_where: str,
                           value_where: Union[str, int, float]) -> int:
        """
        Update records in the given table
        :param table_name: table name
        :param updates: key-value mapping of the fields to update and their new values
        :param field_where: field name used in the WHERE clause of the UPDATE statement
        :param value_where: field value used in the WHERE clause of the UPDATE statement
        :return:
        """
        upsets = []
        values = []
        for k, v in updates.items():
            s = '`%s`=%%s' % k
            upsets.append(s)
            values.append(v)
        upsets = ','.join(upsets)
        sql = 'UPDATE %s SET %s WHERE %s="%s"' % (
            table_name,
            upsets,
            field_where, value_where,
        )
        async with self.__pool.acquire() as conn:
            async with conn.cursor() as cur:
                rows = await cur.execute(sql, values)
                return rows

    async def execute(self, sql: str, *args: Union[str, int]) -> int:
        """
        Execute a statement that updates or writes data
        :param sql:
        :param args:
        :return:
        """
        async with self.__pool.acquire() as conn:
            async with conn.cursor() as cur:
                rows = await cur.execute(sql, args)
                return rows
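
The removed MySQL `update_table` helper formats `value_where` directly into the SQL text, while the removed SQLite variant binds it as a parameter. The sketch below illustrates the bound-parameter form using the standard-library `sqlite3` module rather than aiomysql. The `build_update` function, the `demo` table, and the sample values are hypothetical and exist only for illustration.

```python
import sqlite3
from typing import Any, Dict, List, Tuple


def build_update(table: str, updates: Dict[str, Any], field_where: str,
                 value_where: Any) -> Tuple[str, List[Any]]:
    """Return an UPDATE statement plus its bound values (hypothetical helper).

    Table and column names are still interpolated into the SQL text, so they
    must come from trusted code, never from user input.
    """
    assignments = ",".join(f"{column}=?" for column in updates)
    sql = f"UPDATE {table} SET {assignments} WHERE {field_where}=?"
    return sql, [*updates.values(), value_where]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (nickname TEXT, liked_count INTEGER)")
conn.execute("INSERT INTO demo (nickname, liked_count) VALUES (?, ?)",
             ('a "quoted" name', 1))

# A WHERE value containing double quotes would corrupt a string-formatted
# clause such as WHERE nickname="%s"; as a bound parameter it is harmless.
sql, values = build_update("demo", {"liked_count": 2}, "nickname", 'a "quoted" name')
updated_rows = conn.execute(sql, values).rowcount
liked_count = conn.execute("SELECT liked_count FROM demo").fetchone()[0]
```

With an ORM layer, this binding is normally handled by the library, which may be one reason the hand-written helpers could be dropped.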


@@ -1,111 +0,0 @@
# Disclaimer: This code is for learning and research purposes only. Users must observe the following principles:
# 1. It must not be used for any commercial purpose.
# 2. Use must comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to abide by the principles above and all terms in the LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 14:21
# @Desc : Async SQLite CRUD wrapper
from typing import Any, Dict, List, Union

import aiosqlite


class AsyncSqliteDB:

    def __init__(self, db_path: str) -> None:
        self.__db_path = db_path

    async def query(self, sql: str, *args: Union[str, int]) -> List[Dict[str, Any]]:
        """
        Query records with the given SQL and return them as a list
        :param sql: the SQL query
        :param args: dynamic parameters passed to the SQL
        :return:
        """
        async with aiosqlite.connect(self.__db_path) as conn:
            conn.row_factory = aiosqlite.Row
            async with conn.execute(sql, args) as cursor:
                rows = await cursor.fetchall()
                return [dict(row) for row in rows] if rows else []

    async def get_first(self, sql: str, *args: Union[str, int]) -> Union[Dict[str, Any], None]:
        """
        Query records with the given SQL and return the first matching result
        :param sql: the SQL query
        :param args: dynamic parameters passed to the SQL
        :return:
        """
        async with aiosqlite.connect(self.__db_path) as conn:
            conn.row_factory = aiosqlite.Row
            async with conn.execute(sql, args) as cursor:
                row = await cursor.fetchone()
                return dict(row) if row else None

    async def item_to_table(self, table_name: str, item: Dict[str, Any]) -> int:
        """
        Insert a row into a table
        :param table_name: table name
        :param item: dictionary holding one record
        :return:
        """
        fields = list(item.keys())
        values = list(item.values())
        fieldstr = ','.join(fields)
        valstr = ','.join(['?'] * len(item))
        sql = f"INSERT INTO {table_name} ({fieldstr}) VALUES({valstr})"
        async with aiosqlite.connect(self.__db_path) as conn:
            async with conn.execute(sql, values) as cursor:
                await conn.commit()
                return cursor.lastrowid

    async def update_table(self, table_name: str, updates: Dict[str, Any], field_where: str,
                           value_where: Union[str, int, float]) -> int:
        """
        Update records in the given table
        :param table_name: table name
        :param updates: key-value mapping of the fields to update and their new values
        :param field_where: field name used in the WHERE clause of the UPDATE statement
        :param value_where: field value used in the WHERE clause of the UPDATE statement
        :return:
        """
        upsets = []
        values = []
        for k, v in updates.items():
            upsets.append(f'{k}=?')
            values.append(v)
        upsets_str = ','.join(upsets)
        values.append(value_where)
        sql = f'UPDATE {table_name} SET {upsets_str} WHERE {field_where}=?'
        async with aiosqlite.connect(self.__db_path) as conn:
            async with conn.execute(sql, values) as cursor:
                await conn.commit()
                return cursor.rowcount

    async def execute(self, sql: str, *args: Union[str, int]) -> int:
        """
        Execute a statement that updates or writes data
        :param sql:
        :param args:
        :return:
        """
        async with aiosqlite.connect(self.__db_path) as conn:
            async with conn.execute(sql, args) as cursor:
                await conn.commit()
                return cursor.rowcount

    async def executescript(self, sql_script: str) -> None:
        """
        Execute an SQL script; used to initialize the database table structure
        :param sql_script: SQL script content
        :return:
        """
        async with aiosqlite.connect(self.__db_path) as conn:
            await conn.executescript(sql_script)
            await conn.commit()


@@ -38,6 +38,9 @@ async def parse_cmd():
    parser.add_argument('--save_data_option', type=str,
                        help='Where to save the data / 数据保存方式 (csv=CSV文件 | db=MySQL数据库 | json=JSON文件 | sqlite=SQLite数据库)',
                        choices=['csv', 'db', 'json', 'sqlite'], default=config.SAVE_DATA_OPTION)
    parser.add_argument('--init_db', type=str,
                        help='Initialize database schema / 初始化数据库表结构 (sqlite | mysql)',
                        choices=['sqlite', 'mysql'], default=None)
    parser.add_argument('--cookies', type=str,
                        help='Cookies used for cookie login type / Cookie登录方式使用的Cookie值', default=config.COOKIES)
@@ -53,3 +56,5 @@ async def parse_cmd():
    config.ENABLE_GET_SUB_COMMENTS = args.get_sub_comment
    config.SAVE_DATA_OPTION = args.save_data_option
    config.COOKIES = args.cookies

    return args
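
The hunk above adds an `--init_db` flag and makes `parse_cmd` return the parsed arguments. The following is a minimal sketch of how a caller might branch on that flag. `build_parser` and `plan_action` are hypothetical names, the `json` default is an assumption (the project reads its default from `config.SAVE_DATA_OPTION`), and the real `main.py` may be organized differently.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with only the two storage-related flags (illustrative subset)."""
    parser = argparse.ArgumentParser(description="storage flags sketch")
    # Default is an assumption; the project takes it from config.SAVE_DATA_OPTION.
    parser.add_argument("--save_data_option", type=str,
                        choices=["csv", "db", "json", "sqlite"], default="json")
    parser.add_argument("--init_db", type=str,
                        choices=["sqlite", "mysql"], default=None)
    return parser


def plan_action(argv: Optional[List[str]] = None) -> str:
    """Describe what a caller would do for the given command line."""
    args = build_parser().parse_args(argv)
    if args.init_db is not None:
        # Initialization is a one-off step: create the schema, then stop.
        return f"init schema for {args.init_db}"
    return f"crawl and save to {args.save_data_option}"


init_plan = plan_action(["--init_db", "sqlite"])
crawl_plan = plan_action(["--save_data_option", "db"])
```

Checking `--init_db` first matches the README note that no other optional arguments are needed when the flag is given.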


@@ -71,7 +71,7 @@ USER_DATA_DIR = "%s_user_data_dir" # %s will be replaced by platform name
START_PAGE = 1
# Limit on the number of videos/posts to crawl
CRAWLER_MAX_NOTES_COUNT = 200
CRAWLER_MAX_NOTES_COUNT = 15
# Limit on the number of concurrent crawlers
MAX_CONCURRENCY_NUM = 1


@@ -18,6 +18,14 @@ MYSQL_DB_HOST = os.getenv("MYSQL_DB_HOST", "localhost")
MYSQL_DB_PORT = os.getenv("MYSQL_DB_PORT", 3306)
MYSQL_DB_NAME = os.getenv("MYSQL_DB_NAME", "media_crawler")
mysql_db_config = {
    "user": MYSQL_DB_USER,
    "password": MYSQL_DB_PWD,
    "host": MYSQL_DB_HOST,
    "port": MYSQL_DB_PORT,
    "db_name": MYSQL_DB_NAME,
}
# redis config
REDIS_DB_HOST = "127.0.0.1" # your redis host
@@ -30,4 +38,8 @@ CACHE_TYPE_REDIS = "redis"
CACHE_TYPE_MEMORY = "memory"
# sqlite config
SQLITE_DB_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "schema", "sqlite_tables.db")
SQLITE_DB_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "database", "sqlite_tables.db")
sqlite_db_config = {
    "db_path": SQLITE_DB_PATH
}
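
A settings dictionary such as `mysql_db_config` is typically combined into a connection URL. The sketch below shows one way to do that while percent-encoding the credentials, so that a password containing `@` or `/` does not corrupt the URL. `build_mysql_url` is a hypothetical helper, the sample values are invented, and the `mysql+asyncmy` scheme is taken from the session module in this commit.

```python
from typing import Any, Dict
from urllib.parse import quote_plus


def build_mysql_url(cfg: Dict[str, Any], with_database: bool = True) -> str:
    """Assemble a SQLAlchemy-style URL from a config dict (hypothetical helper)."""
    user = quote_plus(str(cfg["user"]))
    password = quote_plus(str(cfg["password"]))
    # os.getenv returns strings, so the port may arrive as "3306" or 3306.
    port = int(cfg["port"])
    url = f"mysql+asyncmy://{user}:{password}@{cfg['host']}:{port}"
    if with_database:
        url += f"/{cfg['db_name']}"
    return url


# Invented sample values, used only to exercise the helper.
example_cfg = {
    "user": "root",
    "password": "p@ss/word",
    "host": "localhost",
    "port": "3306",
    "db_name": "media_crawler",
}
full_url = build_mysql_url(example_cfg)
server_url = build_mysql_url(example_cfg, with_database=False)
```

The `with_database=False` form corresponds to connecting to the server before the database exists.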

0
database/__init__.py Normal file

35
database/db.py Normal file
View File

@@ -0,0 +1,35 @@
# persist-1<persist1@126.com>
# Reason: turn db.py into a module, remove the direct-execution entry point, and fix relative-import issues.
# Side effects: none
# Rollback strategy: revert this file.
import asyncio
import sys
from pathlib import Path

# Add project root to sys.path
project_root = Path(__file__).resolve().parents[1]
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

from tools import utils
from database.db_session import create_tables


async def init_table_schema(db_type: str):
    """
    Initializes the database table schema.
    This will create tables based on the ORM models.

    Args:
        db_type: The type of database, 'sqlite' or 'mysql'.
    """
    utils.logger.info(f"[init_table_schema] begin init {db_type} table schema ...")
    await create_tables(db_type)
    utils.logger.info(f"[init_table_schema] {db_type} table schema init successful")


async def init_db(db_type: str = None):
    await init_table_schema(db_type)


async def close():
    """
    Placeholder for closing database connections if needed in the future.
    """
    pass

70
database/db_session.py Normal file
View File

@@ -0,0 +1,70 @@
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker
from contextlib import asynccontextmanager
from .models import Base
import config
from config.db_config import mysql_db_config, sqlite_db_config

# Keep a cache of engines
_engines = {}


async def create_database_if_not_exists(db_type: str):
    if db_type == "mysql" or db_type == "db":
        # Connect to the server without a database
        server_url = f"mysql+asyncmy://{mysql_db_config['user']}:{mysql_db_config['password']}@{mysql_db_config['host']}:{mysql_db_config['port']}"
        engine = create_async_engine(server_url, echo=False)
        async with engine.connect() as conn:
            await conn.execute(text(f"CREATE DATABASE IF NOT EXISTS {mysql_db_config['db_name']}"))
        await engine.dispose()


def get_async_engine(db_type: str = None):
    if db_type is None:
        db_type = config.SAVE_DATA_OPTION

    if db_type in _engines:
        return _engines[db_type]

    if db_type in ["json", "csv"]:
        return None

    if db_type == "sqlite":
        db_url = f"sqlite+aiosqlite:///{sqlite_db_config['db_path']}"
    elif db_type == "mysql" or db_type == "db":
        db_url = f"mysql+asyncmy://{mysql_db_config['user']}:{mysql_db_config['password']}@{mysql_db_config['host']}:{mysql_db_config['port']}/{mysql_db_config['db_name']}"
    else:
        raise ValueError(f"Unsupported database type: {db_type}")

    engine = create_async_engine(db_url, echo=False)
    _engines[db_type] = engine
    return engine


async def create_tables(db_type: str = None):
    if db_type is None:
        db_type = config.SAVE_DATA_OPTION
    await create_database_if_not_exists(db_type)
    engine = get_async_engine(db_type)
    if engine:
        async with engine.begin() as conn:
            await conn.run_sync(Base.metadata.create_all)


@asynccontextmanager
async def get_session() -> AsyncSession:
    engine = get_async_engine(config.SAVE_DATA_OPTION)
    if not engine:
        yield None
        return
    AsyncSessionFactory = sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)
    session = AsyncSessionFactory()
    try:
        yield session
        await session.commit()
    except Exception as e:
        await session.rollback()
        raise e
    finally:
        await session.close()
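
The `get_session` context manager above commits when the `async with` body finishes normally, rolls back when it raises, and closes the session in both cases. The sketch below reproduces only that control flow using the standard library, so it runs without a database. `FakeSession`, `session_scope`, and `run_demo` are hypothetical stand-ins, not part of the commit or of SQLAlchemy.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, List, Tuple


class FakeSession:
    """Stand-in for an ORM session that only records which methods ran."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    async def commit(self) -> None:
        self.calls.append("commit")

    async def rollback(self) -> None:
        self.calls.append("rollback")

    async def close(self) -> None:
        self.calls.append("close")


@asynccontextmanager
async def session_scope(session: FakeSession) -> AsyncIterator[FakeSession]:
    """Commit on success, roll back on error, always close."""
    try:
        yield session
        await session.commit()
    except Exception:
        await session.rollback()
        raise
    finally:
        await session.close()


async def run_demo() -> Tuple[List[str], List[str]]:
    ok, failed = FakeSession(), FakeSession()
    async with session_scope(ok):
        pass  # a successful unit of work
    try:
        async with session_scope(failed):
            raise ValueError("simulated write failure")
    except ValueError:
        pass  # the error propagates to the caller after the rollback
    return ok.calls, failed.calls


ok_calls, failed_calls = asyncio.run(run_demo())
```

Callers of such a scope do not need their own commit or rollback calls, but they do need to handle the re-raised exception.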

434
database/models.py Normal file
View File

@@ -0,0 +1,434 @@
from sqlalchemy import create_engine, Column, Integer, Text, String, BigInteger
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class BilibiliVideo(Base):
    __tablename__ = 'bilibili_video'
    id = Column(Integer, primary_key=True)
    video_id = Column(BigInteger, nullable=False, index=True, unique=True)
    video_url = Column(Text, nullable=False)
    user_id = Column(BigInteger, index=True)
    nickname = Column(Text)
    avatar = Column(Text)
    liked_count = Column(Integer)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    video_type = Column(Text)
    title = Column(Text)
    desc = Column(Text)
    create_time = Column(BigInteger, index=True)
    disliked_count = Column(Text)
    video_play_count = Column(Text)
    video_favorite_count = Column(Text)
    video_share_count = Column(Text)
    video_coin_count = Column(Text)
    video_danmaku = Column(Text)
    video_comment = Column(Text)
    video_cover_url = Column(Text)
    source_keyword = Column(Text, default='')


class BilibiliVideoComment(Base):
    __tablename__ = 'bilibili_video_comment'
    id = Column(Integer, primary_key=True)
    user_id = Column(String(255))
    nickname = Column(Text)
    sex = Column(Text)
    sign = Column(Text)
    avatar = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    comment_id = Column(BigInteger, index=True)
    video_id = Column(BigInteger, index=True)
    content = Column(Text)
    create_time = Column(BigInteger)
    sub_comment_count = Column(Text)
    parent_comment_id = Column(String(255))
    like_count = Column(Text, default='0')


class BilibiliUpInfo(Base):
    __tablename__ = 'bilibili_up_info'
    id = Column(Integer, primary_key=True)
    user_id = Column(BigInteger, index=True)
    nickname = Column(Text)
    sex = Column(Text)
    sign = Column(Text)
    avatar = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    total_fans = Column(Integer)
    total_liked = Column(Integer)
    user_rank = Column(Integer)
    is_official = Column(Integer)


class BilibiliContactInfo(Base):
    __tablename__ = 'bilibili_contact_info'
    id = Column(Integer, primary_key=True)
    up_id = Column(BigInteger, index=True)
    fan_id = Column(BigInteger, index=True)
    up_name = Column(Text)
    fan_name = Column(Text)
    up_sign = Column(Text)
    fan_sign = Column(Text)
    up_avatar = Column(Text)
    fan_avatar = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)


class BilibiliUpDynamic(Base):
    __tablename__ = 'bilibili_up_dynamic'
    id = Column(Integer, primary_key=True)
    dynamic_id = Column(BigInteger, index=True)
    user_id = Column(String(255))
    user_name = Column(Text)
    text = Column(Text)
    type = Column(Text)
    pub_ts = Column(BigInteger)
    total_comments = Column(Integer)
    total_forwards = Column(Integer)
    total_liked = Column(Integer)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)


class DouyinAweme(Base):
    __tablename__ = 'douyin_aweme'
    id = Column(Integer, primary_key=True)
    user_id = Column(String(255))
    sec_uid = Column(String(255))
    short_user_id = Column(String(255))
    user_unique_id = Column(String(255))
    nickname = Column(Text)
    avatar = Column(Text)
    user_signature = Column(Text)
    ip_location = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    aweme_id = Column(BigInteger, index=True)
    aweme_type = Column(Text)
    title = Column(Text)
    desc = Column(Text)
    create_time = Column(BigInteger, index=True)
    liked_count = Column(Text)
    comment_count = Column(Text)
    share_count = Column(Text)
    collected_count = Column(Text)
    aweme_url = Column(Text)
    cover_url = Column(Text)
    video_download_url = Column(Text)
    music_download_url = Column(Text)
    note_download_url = Column(Text)
    source_keyword = Column(Text, default='')


class DouyinAwemeComment(Base):
    __tablename__ = 'douyin_aweme_comment'
    id = Column(Integer, primary_key=True)
    user_id = Column(String(255))
    sec_uid = Column(String(255))
    short_user_id = Column(String(255))
    user_unique_id = Column(String(255))
    nickname = Column(Text)
    avatar = Column(Text)
    user_signature = Column(Text)
    ip_location = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    comment_id = Column(BigInteger, index=True)
    aweme_id = Column(BigInteger, index=True)
    content = Column(Text)
    create_time = Column(BigInteger)
    sub_comment_count = Column(Text)
    parent_comment_id = Column(String(255))
    like_count = Column(Text, default='0')
    pictures = Column(Text, default='')


class DyCreator(Base):
    __tablename__ = 'dy_creator'
    id = Column(Integer, primary_key=True)
    user_id = Column(String(255))
    nickname = Column(Text)
    avatar = Column(Text)
    ip_location = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    desc = Column(Text)
    gender = Column(Text)
    follows = Column(Text)
    fans = Column(Text)
    interaction = Column(Text)
    videos_count = Column(String(255))


class KuaishouVideo(Base):
    __tablename__ = 'kuaishou_video'
    id = Column(Integer, primary_key=True)
    user_id = Column(String(64))
    nickname = Column(Text)
    avatar = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    video_id = Column(String(255), index=True)
    video_type = Column(Text)
    title = Column(Text)
    desc = Column(Text)
    create_time = Column(BigInteger, index=True)
    liked_count = Column(Text)
    viewd_count = Column(Text)
    video_url = Column(Text)
    video_cover_url = Column(Text)
    video_play_url = Column(Text)
    source_keyword = Column(Text, default='')


class KuaishouVideoComment(Base):
    __tablename__ = 'kuaishou_video_comment'
    id = Column(Integer, primary_key=True)
    user_id = Column(Text)
    nickname = Column(Text)
    avatar = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    comment_id = Column(BigInteger, index=True)
    video_id = Column(String(255), index=True)
    content = Column(Text)
    create_time = Column(BigInteger)
    sub_comment_count = Column(Text)


class WeiboNote(Base):
    __tablename__ = 'weibo_note'
    id = Column(Integer, primary_key=True)
    user_id = Column(String(255))
    nickname = Column(Text)
    avatar = Column(Text)
    gender = Column(Text)
    profile_url = Column(Text)
    ip_location = Column(Text, default='')
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    note_id = Column(BigInteger, index=True)
    content = Column(Text)
    create_time = Column(BigInteger, index=True)
    create_date_time = Column(String(255), index=True)
    liked_count = Column(Text)
    comments_count = Column(Text)
    shared_count = Column(Text)
    note_url = Column(Text)
    source_keyword = Column(Text, default='')


class WeiboNoteComment(Base):
    __tablename__ = 'weibo_note_comment'
    id = Column(Integer, primary_key=True)
    user_id = Column(String(255))
    nickname = Column(Text)
    avatar = Column(Text)
    gender = Column(Text)
    profile_url = Column(Text)
    ip_location = Column(Text, default='')
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    comment_id = Column(BigInteger, index=True)
    note_id = Column(BigInteger, index=True)
    content = Column(Text)
    create_time = Column(BigInteger)
    create_date_time = Column(String(255), index=True)
    comment_like_count = Column(Text)
    sub_comment_count = Column(Text)
    parent_comment_id = Column(String(255))


class WeiboCreator(Base):
    __tablename__ = 'weibo_creator'
    id = Column(Integer, primary_key=True)
    user_id = Column(String(255))
    nickname = Column(Text)
    avatar = Column(Text)
    ip_location = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    desc = Column(Text)
    gender = Column(Text)
    follows = Column(Text)
    fans = Column(Text)
    tag_list = Column(Text)


class XhsCreator(Base):
    __tablename__ = 'xhs_creator'
    id = Column(Integer, primary_key=True)
    user_id = Column(String(255))
    nickname = Column(Text)
    avatar = Column(Text)
    ip_location = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    desc = Column(Text)
    gender = Column(Text)
    follows = Column(Text)
    fans = Column(Text)
    interaction = Column(Text)
    tag_list = Column(Text)


class XhsNote(Base):
    __tablename__ = 'xhs_note'
    id = Column(Integer, primary_key=True)
    user_id = Column(String(255))
    nickname = Column(Text)
    avatar = Column(Text)
    ip_location = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    note_id = Column(String(255), index=True)
    type = Column(Text)
    title = Column(Text)
    desc = Column(Text)
    video_url = Column(Text)
    time = Column(BigInteger, index=True)
    last_update_time = Column(BigInteger)
    liked_count = Column(Text)
    collected_count = Column(Text)
    comment_count = Column(Text)
    share_count = Column(Text)
    image_list = Column(Text)
    tag_list = Column(Text)
    note_url = Column(Text)
    source_keyword = Column(Text, default='')
    xsec_token = Column(Text)


class XhsNoteComment(Base):
    __tablename__ = 'xhs_note_comment'
    id = Column(Integer, primary_key=True)
    user_id = Column(String(255))
    nickname = Column(Text)
    avatar = Column(Text)
    ip_location = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    comment_id = Column(String(255), index=True)
    create_time = Column(BigInteger, index=True)
    note_id = Column(String(255))
    content = Column(Text)
    sub_comment_count = Column(Integer)
    pictures = Column(Text)
    parent_comment_id = Column(String(255))
    like_count = Column(Text)


class TiebaNote(Base):
    __tablename__ = 'tieba_note'
    id = Column(Integer, primary_key=True)
    note_id = Column(String(644), index=True)
    title = Column(Text)
    desc = Column(Text)
    note_url = Column(Text)
    publish_time = Column(String(255), index=True)
    user_link = Column(Text, default='')
    user_nickname = Column(Text, default='')
    user_avatar = Column(Text, default='')
    tieba_id = Column(String(255), default='')
    tieba_name = Column(Text)
    tieba_link = Column(Text)
    total_replay_num = Column(Integer, default=0)
    total_replay_page = Column(Integer, default=0)
    ip_location = Column(Text, default='')
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    source_keyword = Column(Text, default='')


class TiebaComment(Base):
    __tablename__ = 'tieba_comment'
    id = Column(Integer, primary_key=True)
    comment_id = Column(String(255), index=True)
    parent_comment_id = Column(String(255), default='')
    content = Column(Text)
    user_link = Column(Text, default='')
    user_nickname = Column(Text, default='')
    user_avatar = Column(Text, default='')
    tieba_id = Column(String(255), default='')
    tieba_name = Column(Text)
    tieba_link = Column(Text)
    publish_time = Column(String(255), index=True)
    ip_location = Column(Text, default='')
    sub_comment_count = Column(Integer, default=0)
    note_id = Column(String(255), index=True)
    note_url = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)


class TiebaCreator(Base):
    __tablename__ = 'tieba_creator'
    id = Column(Integer, primary_key=True)
    user_id = Column(String(64))
    user_name = Column(Text)
    nickname = Column(Text)
    avatar = Column(Text)
    ip_location = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)
    gender = Column(Text)
    follows = Column(Text)
    fans = Column(Text)
    registration_duration = Column(Text)


class ZhihuContent(Base):
    __tablename__ = 'zhihu_content'
    id = Column(Integer, primary_key=True)
    content_id = Column(String(64), index=True)
    content_type = Column(Text)
    content_text = Column(Text)
    content_url = Column(Text)
    question_id = Column(String(255))
    title = Column(Text)
    desc = Column(Text)
    created_time = Column(String(32), index=True)
    updated_time = Column(Text)
    voteup_count = Column(Integer, default=0)
    comment_count = Column(Integer, default=0)
    source_keyword = Column(Text)
    user_id = Column(String(255))
    user_link = Column(Text)
    user_nickname = Column(Text)
    user_avatar = Column(Text)
    user_url_token = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)

# persist-1<persist1@126.com>
# Reason: fix an ORM model definition error so the model matches the database table structure.
# Side effects: none
# Rollback strategy: revert this line
class ZhihuComment(Base):
    __tablename__ = 'zhihu_comment'
    id = Column(Integer, primary_key=True)
    comment_id = Column(String(64), index=True)
    parent_comment_id = Column(String(64))
    content = Column(Text)
    publish_time = Column(String(32), index=True)
    ip_location = Column(Text)
    sub_comment_count = Column(Integer, default=0)
    like_count = Column(Integer, default=0)
    dislike_count = Column(Integer, default=0)
    content_id = Column(String(64), index=True)
    content_type = Column(Text)
    user_id = Column(String(64))
    user_link = Column(Text)
    user_nickname = Column(Text)
    user_avatar = Column(Text)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)


class ZhihuCreator(Base):
    __tablename__ = 'zhihu_creator'
    id = Column(Integer, primary_key=True)
    user_id = Column(String(64), unique=True, index=True)
    user_link = Column(Text)
    user_nickname = Column(Text)
    user_avatar = Column(Text)
    url_token = Column(Text)
    gender = Column(Text)
    ip_location = Column(Text)
    follows = Column(Integer, default=0)
    fans = Column(Integer, default=0)
    anwser_count = Column(Integer, default=0)
    video_count = Column(Integer, default=0)
    question_count = Column(Integer, default=0)
    article_count = Column(Integer, default=0)
    column_count = Column(Integer, default=0)
    get_voteup_count = Column(Integer, default=0)
    add_ts = Column(BigInteger)
    last_modify_ts = Column(BigInteger)

209
db.py
View File

@@ -1,209 +0,0 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 14:54
# @Desc : mediacrawler db 管理
import asyncio
from typing import Dict
from urllib.parse import urlparse
import aiofiles
import aiomysql
import config
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from tools import utils
from var import db_conn_pool_var, media_crawler_db_var
async def init_mediacrawler_db():
"""
初始化数据库链接池对象并将该对象塞给media_crawler_db_var上下文变量
Returns:
"""
pool = await aiomysql.create_pool(
host=config.MYSQL_DB_HOST,
port=config.MYSQL_DB_PORT,
user=config.MYSQL_DB_USER,
password=config.MYSQL_DB_PWD,
db=config.MYSQL_DB_NAME,
autocommit=True,
)
async_db_obj = AsyncMysqlDB(pool)
# 将连接池对象和封装的CRUD sql接口对象放到上下文变量中
db_conn_pool_var.set(pool)
media_crawler_db_var.set(async_db_obj)
async def init_sqlite_db():
"""
初始化SQLite数据库对象并将该对象塞给media_crawler_db_var上下文变量
Returns:
"""
async_db_obj = AsyncSqliteDB(config.SQLITE_DB_PATH)
# 将SQLite数据库对象放到上下文变量中
media_crawler_db_var.set(async_db_obj)
async def init_db():
"""
初始化db连接池
Returns:
"""
utils.logger.info("[init_db] start init mediacrawler db connect object")
if config.SAVE_DATA_OPTION == "sqlite":
await init_sqlite_db()
utils.logger.info("[init_db] end init sqlite db connect object")
else:
await init_mediacrawler_db()
utils.logger.info("[init_db] end init mysql db connect object")
async def close():
"""
关闭数据库连接
Returns:
"""
utils.logger.info("[close] close mediacrawler db connection")
if config.SAVE_DATA_OPTION == "sqlite":
# SQLite数据库连接会在AsyncSqliteDB对象销毁时自动关闭
utils.logger.info("[close] sqlite db connection will be closed automatically")
else:
# MySQL连接池关闭
db_pool: aiomysql.Pool = db_conn_pool_var.get()
if db_pool is not None:
db_pool.close()
utils.logger.info("[close] mysql db pool closed")
async def init_table_schema(db_type: str = None):
"""
用来初始化数据库表结构,请在第一次需要创建表结构的时候使用,多次执行该函数会将已有的表以及数据全部删除
Args:
db_type: 数据库类型,可选值为 'sqlite''mysql',如果不指定则使用配置文件中的设置
Returns:
"""
# 如果没有指定数据库类型,则使用配置文件中的设置
if db_type is None:
db_type = config.SAVE_DATA_OPTION
if db_type == "sqlite":
utils.logger.info("[init_table_schema] begin init sqlite table schema ...")
# 检查并删除可能存在的损坏数据库文件
import os
if os.path.exists(config.SQLITE_DB_PATH):
try:
# 尝试删除现有的数据库文件
os.remove(config.SQLITE_DB_PATH)
utils.logger.info(f"[init_table_schema] removed existing sqlite db file: {config.SQLITE_DB_PATH}")
except Exception as e:
utils.logger.warning(f"[init_table_schema] failed to remove existing sqlite db file: {e}")
# 如果删除失败,尝试重命名文件
try:
backup_path = f"{config.SQLITE_DB_PATH}.backup_{utils.get_current_timestamp()}"
os.rename(config.SQLITE_DB_PATH, backup_path)
utils.logger.info(f"[init_table_schema] renamed existing sqlite db file to: {backup_path}")
except Exception as rename_e:
utils.logger.error(f"[init_table_schema] failed to rename existing sqlite db file: {rename_e}")
raise rename_e
await init_sqlite_db()
async_db_obj: AsyncSqliteDB = media_crawler_db_var.get()
async with aiofiles.open("schema/sqlite_tables.sql", mode="r", encoding="utf-8") as f:
schema_sql = await f.read()
await async_db_obj.executescript(schema_sql)
utils.logger.info("[init_table_schema] sqlite table schema init successful")
elif db_type == "mysql":
utils.logger.info("[init_table_schema] begin init mysql table schema ...")
await init_mediacrawler_db()
async_db_obj: AsyncMysqlDB = media_crawler_db_var.get()
async with aiofiles.open("schema/tables.sql", mode="r", encoding="utf-8") as f:
schema_sql = await f.read()
await async_db_obj.execute(schema_sql)
utils.logger.info("[init_table_schema] mysql table schema init successful")
await close()
else:
utils.logger.error(f"[init_table_schema] 不支持的数据库类型: {db_type}")
raise ValueError(f"不支持的数据库类型: {db_type},支持的类型: sqlite, mysql")
def show_database_options():
"""
Show the supported database options.
"""
print("\n=== MediaCrawler 数据库初始化工具 ===")
print("支持的数据库类型:")
print("1. sqlite - SQLite 数据库 (轻量级,无需额外配置)")
print("2. mysql - MySQL 数据库 (需要配置数据库连接信息)")
print("3. config - 使用配置文件中的设置")
print("4. exit - 退出程序")
print("="*50)
def get_user_choice():
"""
Get the database type chosen by the user.
Returns:
str: The database type chosen by the user
"""
while True:
choice = input("请输入数据库类型 (sqlite/mysql/config/exit): ").strip().lower()
if choice in ['sqlite', 'mysql', 'config', 'exit']:
return choice
else:
print("❌ 无效的选择,请输入: sqlite, mysql, config 或 exit")
async def main():
"""
Main function: handles user interaction and database initialization.
"""
try:
show_database_options()
while True:
choice = get_user_choice()
if choice == 'exit':
print("👋 程序已退出")
break
elif choice == 'config':
print(f"📋 使用配置文件中的设置: {config.SAVE_DATA_OPTION}")
await init_table_schema()
print("✅ 数据库表结构初始化完成!")
break
else:
print(f"🚀 开始初始化 {choice.upper()} 数据库...")
await init_table_schema(choice)
print("✅ 数据库表结构初始化完成!")
break
except KeyboardInterrupt:
print("\n\n⚠️ 用户中断操作")
except Exception as e:
print(f"\n❌ 初始化失败: {str(e)}")
utils.logger.error(f"[main] 数据库初始化失败: {str(e)}")
if __name__ == '__main__':
asyncio.get_event_loop().run_until_complete(main())
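
The docstring of `init_table_schema` warns that running it a second time deletes existing tables and their data. The following is a minimal sketch of why that happens, assuming only the standard-library `sqlite3` module and a trimmed, hypothetical three-column version of the `dy_creator` table. The removed code used an async wrapper and the full schema file, so this is an illustration rather than the project's implementation.

```python
import sqlite3

# Trimmed, hypothetical schema in the style of schema/sqlite_tables.sql.
SCHEMA_SQL = """
DROP TABLE IF EXISTS dy_creator;
CREATE TABLE dy_creator (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id TEXT NOT NULL,
    nickname TEXT DEFAULT NULL
);
"""


def init_schema(conn: sqlite3.Connection) -> None:
    """Run the schema script; any existing dy_creator rows are discarded."""
    conn.executescript(SCHEMA_SQL)


def count_rows(conn: sqlite3.Connection) -> int:
    """Return the number of rows currently stored in dy_creator."""
    return conn.execute("SELECT COUNT(*) FROM dy_creator").fetchone()[0]


def demo() -> tuple:
    """Insert one row, re-run the schema, and report row counts before and after."""
    conn = sqlite3.connect(":memory:")
    init_schema(conn)
    conn.execute(
        "INSERT INTO dy_creator (user_id, nickname) VALUES (?, ?)",
        ("u1", "alice"),
    )
    conn.commit()
    before = count_rows(conn)
    init_schema(conn)  # second run drops and recreates the table
    after = count_rows(conn)
    conn.close()
    return before, after


if __name__ == "__main__":
    print(demo())
```

Because every table in the schema script is preceded by `DROP TABLE IF EXISTS`, re-initialization is destructive by design; backing up the database file first is the safer habit.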

@@ -54,15 +54,19 @@
python main.py --help
```
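## Data Storage
- Supports saving to the relational database MySQL (the database must be created in advance)
- Run `python db.py` to initialize the database table structure (first run only)
- Supports saving to the lightweight SQLite database (no separate database server required)
- Local file database, suitable for personal use and small-scale data storage
- Enable SQLite storage with the `--save_data_option sqlite` parameter
- The database file is created automatically in the project directory (schema/sqlite_tables.db)
- Supports saving to CSV (under the data/ directory)
- Supports saving to JSON (under the data/ directory)
## 💾 Data Storage
Multiple data storage methods are supported:
- **CSV files**: save to CSV (under the `data/` directory)
- **JSON files**: save to JSON (under the `data/` directory)
- **Database storage**
- Use the `--init_db` parameter to initialize the database (no other optional arguments are needed with `--init_db`)
- **SQLite database**: lightweight, serverless, suitable for personal use (recommended)
1. Initialize: `--init_db sqlite`
2. Store data: `--save_data_option sqlite`
- **MySQL database**: save to the relational database MySQL (the database must be created in advance)
1. Initialize: `--init_db mysql`
2. Store data: `--save_data_option db` (the `db` value is kept for backward compatibility)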
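
The two options listed above can be sketched with `argparse`. This is a hypothetical parser written for illustration: the function names, the default value `json`, and the exact `choices` lists are assumptions, and the project's real `cmd_arg` module may define these options differently.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser; the real cmd_arg module may differ."""
    parser = argparse.ArgumentParser(description="MediaCrawler options (sketch)")
    # --init_db names the backend to initialize and is meant to be used alone.
    parser.add_argument("--init_db", choices=["sqlite", "mysql"], default=None)
    # "db" selects MySQL storage; the value is kept for backward compatibility.
    # The default "json" is an assumption made for this sketch.
    parser.add_argument(
        "--save_data_option",
        choices=["csv", "json", "sqlite", "db"],
        default="json",
    )
    return parser


def parse(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the given argument list (or sys.argv when argv is None)."""
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    print(parse(["--init_db", "sqlite"]).init_db)
    print(parse(["--save_data_option", "db"]).save_data_option)
```

Restricting both options with `choices` makes a typo such as `--init_db mysq` fail at parse time instead of reaching the database layer.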
## Disclaimer
> **Disclaimer:**

@@ -2,36 +2,70 @@
```
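MediaCrawler
├── base
│ └── base_crawler.py # Abstract classes for the project
├── browser_data # Replace with the user's browser data directory
├── config
│ ├── account_config.py # Account proxy pool configuration
├── base
│ └── base_crawler.py # Project abstractions
├── cache
│ ├── abs_cache.py # Abstract cache base class
│ ├── cache_factory.py # Cache factory
│ ├── local_cache.py # Local cache implementation
│ └── redis_cache.py # Redis cache implementation
├── cmd_arg
│ └── arg.py # Command-line argument definitions
├── config
│ ├── base_config.py # Base configuration
│ └── db_config.py # Database configuration
├── data # Data storage directory
├── libs
│ ├── db_config.py # Database configuration
│ └── ... # Per-platform configuration files
├── constant
│ └── ... # Per-platform constant definitions
├── database
│ ├── db.py # Database ORM wrapper for CRUD operations
│ ├── db_session.py # Database session management
│ └── models.py # Database model definitions
├── docs
│ └── ... # Project documentation
├── libs
│ ├── douyin.js # Douyin sign function
│ └── stealth.min.js # JS that removes browser automation fingerprints
│ ├── stealth.min.js # JS that removes browser automation fingerprints
│ └── zhihu.js # Zhihu sign function
├── media_platform
│ ├── douyin # Douyin crawler implementation
│ ├── xhs # Xiaohongshu crawler implementation
│ ├── bilibili # Bilibili crawler implementation
│ └── kuaishou # Kuaishou crawler implementation
├── modles
│ ├── douyin.py # Douyin data model
│ ├── xiaohongshu.py # Xiaohongshu data model
│ ├── kuaishou.py # Kuaishou data model
│ └── bilibili.py # Bilibili data model
│ ├── bilibili # Bilibili data collection implementation
│ ├── douyin # Douyin data collection implementation
│ ├── kuaishou # Kuaishou data collection implementation
│ ├── tieba # Baidu Tieba data collection implementation
│ ├── weibo # Weibo data collection implementation
│ ├── xhs # Xiaohongshu data collection implementation
│ └── zhihu # Zhihu data collection implementation
├── model
│ ├── m_baidu_tieba.py # Baidu Tieba data model
│ ├── m_douyin.py # Douyin data model
│ ├── m_kuaishou.py # Kuaishou data model
│ ├── m_weibo.py # Weibo data model
│ ├── m_xiaohongshu.py # Xiaohongshu data model
│ └── m_zhihu.py # Zhihu data model
├── proxy
│ ├── base_proxy.py # Proxy base class
│ ├── providers # Proxy provider implementations
│ ├── proxy_ip_pool.py # Proxy IP pool
│ └── types.py # Proxy type definitions
├── store
│ ├── bilibili # Bilibili data storage implementation
│ ├── douyin # Douyin data storage implementation
│ ├── kuaishou # Kuaishou data storage implementation
│ ├── tieba # Tieba data storage implementation
│ ├── weibo # Weibo data storage implementation
│ ├── xhs # Xiaohongshu data storage implementation
│ └── zhihu # Zhihu data storage implementation
├── test
│ ├── test_db_sync.py # Database sync tests
│ ├── test_proxy_ip_pool.py # Proxy IP pool tests
│ └── ... # Other test cases
├── tools
│ ├── utils.py # Utility functions exposed externally
│ ├── crawler_util.py # Crawler-related utility functions
│ ├── slider_util.py # Slider-related utility functions
│ ├── time_util.py # Time-related utility functions
│ ├── easing.py # Functions for simulating slide trajectories
│ └── words.py # Functions for generating word clouds
├── db.py # DB ORM
├── main.py # Program entry point
├── var.py # Context variable definitions
└── recv_sms_notification.py # HTTP server endpoint for the SMS forwarder
│ ├── browser_launcher.py # Browser launcher
│ ├── cdp_browser.py # CDP browser control
│ ├── crawler_util.py # Crawler utility functions
│ ├── utils.py # General utility functions
│ └── ...
├── main.py # Program entry point; supports the --init_db argument to initialize the database
├── recv_sms.py # HTTP server endpoint for SMS forwarding
└── var.py # Global context variable definitions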
```

main.py
@@ -15,7 +15,7 @@ from typing import Optional
import cmd_arg
import config
import db
from database import db
from base.base_crawler import AbstractCrawler
from media_platform.bilibili import BilibiliCrawler
from media_platform.douyin import DouYinCrawler
@@ -50,16 +50,24 @@ class CrawlerFactory:
crawler: Optional[AbstractCrawler] = None
# persist-1<persist1@126.com>
# Reason: add the --init_db option for database initialization.
# Side effects: none
# Rollback strategy: revert this file.
async def main():
# Init crawler
global crawler
# parse cmd
await cmd_arg.parse_cmd()
args = await cmd_arg.parse_cmd()
# init db
if config.SAVE_DATA_OPTION in ["db", "sqlite"]:
await db.init_db()
if args.init_db:
await db.init_db(args.init_db)
print(f"Database {args.init_db} initialized successfully.")
return # Exit the main function cleanly
crawler = CrawlerFactory.create_crawler(platform=config.PLATFORM)
await crawler.start()
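
The control flow introduced in this hunk can be sketched in isolation: when `--init_db` is given, the program initializes the database and returns before any crawler is created. In the sketch below the database and crawler calls are replaced by log entries, and the names `run` and `demo` are hypothetical helpers written for illustration only.

```python
import asyncio
from typing import List, Optional


async def run(init_db: Optional[str], log: List[str]) -> None:
    """Mirror the branch in main(): initialize and exit, or start crawling."""
    if init_db:
        log.append(f"init_db:{init_db}")  # stands in for db.init_db(...)
        return  # no crawler is created in this branch
    log.append("crawler.start")  # stands in for crawler.start()


def demo(init_db: Optional[str]) -> List[str]:
    """Run the coroutine to completion and return what it recorded."""
    log: List[str] = []
    asyncio.run(run(init_db, log))
    return log


if __name__ == "__main__":
    print(demo("sqlite"))
    print(demo(None))
```

The early `return` is what makes `--init_db` a standalone action, which matches the README note that no other optional arguments are needed with it.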

@@ -9,6 +9,9 @@ dependencies = [
"aiofiles~=23.2.1",
"aiomysql==0.2.0",
"aiosqlite>=0.21.0",
"alembic>=1.16.5",
"asyncmy>=0.2.10",
"cryptography>=45.0.7",
"fastapi==0.110.2",
"httpx==0.28.1",
"jieba==0.42.1",
@@ -24,6 +27,7 @@ dependencies = [
"python-dotenv==1.0.1",
"redis~=4.6.0",
"requests==2.32.3",
"sqlalchemy>=2.0.43",
"tenacity==8.2.2",
"uvicorn==0.29.0",
"wordcloud==1.9.3",

@@ -18,4 +18,8 @@ parsel==1.9.1
pyexecjs==1.5.1
pandas==2.2.3
aiosqlite==0.21.0
pyhumps==3.8.0
pyhumps==3.8.0
cryptography>=45.0.7
alembic>=1.16.5
asyncmy>=0.2.10
sqlalchemy>=2.0.43

Binary file not shown.

@@ -1,569 +0,0 @@
-- SQLite version of the MediaCrawler database table structure
-- Converted from the MySQL tables.sql and adapted to SQLite syntax
-- ----------------------------
-- Table structure for bilibili_video
-- ----------------------------
DROP TABLE IF EXISTS bilibili_video;
CREATE TABLE bilibili_video (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
video_id TEXT NOT NULL,
video_type TEXT NOT NULL,
title TEXT DEFAULT NULL,
desc TEXT,
create_time INTEGER NOT NULL,
liked_count TEXT DEFAULT NULL,
disliked_count TEXT DEFAULT NULL,
video_play_count TEXT DEFAULT NULL,
video_favorite_count TEXT DEFAULT NULL,
video_share_count TEXT DEFAULT NULL,
video_coin_count TEXT DEFAULT NULL,
video_danmaku TEXT DEFAULT NULL,
video_comment TEXT DEFAULT NULL,
video_url TEXT DEFAULT NULL,
video_cover_url TEXT DEFAULT NULL,
source_keyword TEXT DEFAULT ''
);
CREATE INDEX idx_bilibili_vi_video_i_31c36e ON bilibili_video(video_id);
CREATE INDEX idx_bilibili_vi_create__73e0ec ON bilibili_video(create_time);
-- ----------------------------
-- Table structure for bilibili_video_comment
-- ----------------------------
DROP TABLE IF EXISTS bilibili_video_comment;
CREATE TABLE bilibili_video_comment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
sex TEXT DEFAULT NULL,
sign TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
comment_id TEXT NOT NULL,
video_id TEXT NOT NULL,
content TEXT,
create_time INTEGER NOT NULL,
sub_comment_count TEXT NOT NULL,
parent_comment_id TEXT DEFAULT NULL,
like_count TEXT NOT NULL DEFAULT '0'
);
CREATE INDEX idx_bilibili_vi_comment_41c34e ON bilibili_video_comment(comment_id);
CREATE INDEX idx_bilibili_vi_video_i_f22873 ON bilibili_video_comment(video_id);
-- ----------------------------
-- Table structure for bilibili_up_info
-- ----------------------------
DROP TABLE IF EXISTS bilibili_up_info;
CREATE TABLE bilibili_up_info (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
sex TEXT DEFAULT NULL,
sign TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
total_fans INTEGER DEFAULT NULL,
total_liked INTEGER DEFAULT NULL,
user_rank INTEGER DEFAULT NULL,
is_official INTEGER DEFAULT NULL
);
CREATE INDEX idx_bilibili_vi_user_123456 ON bilibili_up_info(user_id);
-- ----------------------------
-- Table structure for bilibili_contact_info
-- ----------------------------
DROP TABLE IF EXISTS bilibili_contact_info;
CREATE TABLE bilibili_contact_info (
id INTEGER PRIMARY KEY AUTOINCREMENT,
up_id TEXT DEFAULT NULL,
fan_id TEXT DEFAULT NULL,
up_name TEXT DEFAULT NULL,
fan_name TEXT DEFAULT NULL,
up_sign TEXT DEFAULT NULL,
fan_sign TEXT DEFAULT NULL,
up_avatar TEXT DEFAULT NULL,
fan_avatar TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL
);
CREATE INDEX idx_bilibili_contact_info_up_id ON bilibili_contact_info(up_id);
CREATE INDEX idx_bilibili_contact_info_fan_id ON bilibili_contact_info(fan_id);
-- ----------------------------
-- Table structure for bilibili_up_dynamic
-- ----------------------------
DROP TABLE IF EXISTS bilibili_up_dynamic;
CREATE TABLE bilibili_up_dynamic (
id INTEGER PRIMARY KEY AUTOINCREMENT,
dynamic_id TEXT DEFAULT NULL,
user_id TEXT DEFAULT NULL,
user_name TEXT DEFAULT NULL,
text TEXT DEFAULT NULL,
type TEXT DEFAULT NULL,
pub_ts INTEGER DEFAULT NULL,
total_comments INTEGER DEFAULT NULL,
total_forwards INTEGER DEFAULT NULL,
total_liked INTEGER DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL
);
CREATE INDEX idx_bilibili_up_dynamic_dynamic_id ON bilibili_up_dynamic(dynamic_id);
-- ----------------------------
-- Table structure for douyin_aweme
-- ----------------------------
DROP TABLE IF EXISTS douyin_aweme;
CREATE TABLE douyin_aweme (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
sec_uid TEXT DEFAULT NULL,
short_user_id TEXT DEFAULT NULL,
user_unique_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
user_signature TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
aweme_id TEXT NOT NULL,
aweme_type TEXT NOT NULL,
title TEXT DEFAULT NULL,
desc TEXT,
create_time INTEGER NOT NULL,
liked_count TEXT DEFAULT NULL,
comment_count TEXT DEFAULT NULL,
share_count TEXT DEFAULT NULL,
collected_count TEXT DEFAULT NULL,
aweme_url TEXT DEFAULT NULL,
cover_url TEXT DEFAULT NULL,
video_download_url TEXT DEFAULT NULL,
music_download_url TEXT DEFAULT NULL,
note_download_url TEXT DEFAULT NULL,
source_keyword TEXT DEFAULT ''
);
CREATE INDEX idx_douyin_awem_aweme_i_6f7bc6 ON douyin_aweme(aweme_id);
CREATE INDEX idx_douyin_awem_create__299dfe ON douyin_aweme(create_time);
-- ----------------------------
-- Table structure for douyin_aweme_comment
-- ----------------------------
DROP TABLE IF EXISTS douyin_aweme_comment;
CREATE TABLE douyin_aweme_comment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
sec_uid TEXT DEFAULT NULL,
short_user_id TEXT DEFAULT NULL,
user_unique_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
user_signature TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
comment_id TEXT NOT NULL,
aweme_id TEXT NOT NULL,
content TEXT,
create_time INTEGER NOT NULL,
sub_comment_count TEXT NOT NULL,
parent_comment_id TEXT DEFAULT NULL,
like_count TEXT NOT NULL DEFAULT '0',
pictures TEXT NOT NULL DEFAULT ''
);
CREATE INDEX idx_douyin_awem_comment_fcd7e4 ON douyin_aweme_comment(comment_id);
CREATE INDEX idx_douyin_awem_aweme_i_c50049 ON douyin_aweme_comment(aweme_id);
-- ----------------------------
-- Table structure for dy_creator
-- ----------------------------
DROP TABLE IF EXISTS dy_creator;
CREATE TABLE dy_creator (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
desc TEXT,
gender TEXT DEFAULT NULL,
follows TEXT DEFAULT NULL,
fans TEXT DEFAULT NULL,
interaction TEXT DEFAULT NULL,
videos_count TEXT DEFAULT NULL
);
-- ----------------------------
-- Table structure for kuaishou_video
-- ----------------------------
DROP TABLE IF EXISTS kuaishou_video;
CREATE TABLE kuaishou_video (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
video_id TEXT NOT NULL,
video_type TEXT NOT NULL,
title TEXT DEFAULT NULL,
desc TEXT,
create_time INTEGER NOT NULL,
liked_count TEXT DEFAULT NULL,
viewd_count TEXT DEFAULT NULL,
video_url TEXT DEFAULT NULL,
video_cover_url TEXT DEFAULT NULL,
video_play_url TEXT DEFAULT NULL,
source_keyword TEXT DEFAULT ''
);
CREATE INDEX idx_kuaishou_vi_video_i_c5c6a6 ON kuaishou_video(video_id);
CREATE INDEX idx_kuaishou_vi_create__a10dee ON kuaishou_video(create_time);
-- ----------------------------
-- Table structure for kuaishou_video_comment
-- ----------------------------
DROP TABLE IF EXISTS kuaishou_video_comment;
CREATE TABLE kuaishou_video_comment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
comment_id TEXT NOT NULL,
video_id TEXT NOT NULL,
content TEXT,
create_time INTEGER NOT NULL,
sub_comment_count TEXT NOT NULL
);
CREATE INDEX idx_kuaishou_vi_comment_ed48fa ON kuaishou_video_comment(comment_id);
CREATE INDEX idx_kuaishou_vi_video_i_e50914 ON kuaishou_video_comment(video_id);
-- ----------------------------
-- Table structure for weibo_note
-- ----------------------------
DROP TABLE IF EXISTS weibo_note;
CREATE TABLE weibo_note (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
gender TEXT DEFAULT NULL,
profile_url TEXT DEFAULT NULL,
ip_location TEXT DEFAULT '',
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
note_id TEXT NOT NULL,
content TEXT,
create_time INTEGER NOT NULL,
create_date_time TEXT NOT NULL,
liked_count TEXT DEFAULT NULL,
comments_count TEXT DEFAULT NULL,
shared_count TEXT DEFAULT NULL,
note_url TEXT DEFAULT NULL,
source_keyword TEXT DEFAULT ''
);
CREATE INDEX idx_weibo_note_note_id_f95b1a ON weibo_note(note_id);
CREATE INDEX idx_weibo_note_create__692709 ON weibo_note(create_time);
CREATE INDEX idx_weibo_note_create__d05ed2 ON weibo_note(create_date_time);
-- ----------------------------
-- Table structure for weibo_note_comment
-- ----------------------------
DROP TABLE IF EXISTS weibo_note_comment;
CREATE TABLE weibo_note_comment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
gender TEXT DEFAULT NULL,
profile_url TEXT DEFAULT NULL,
ip_location TEXT DEFAULT '',
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
comment_id TEXT NOT NULL,
note_id TEXT NOT NULL,
content TEXT,
create_time INTEGER NOT NULL,
create_date_time TEXT NOT NULL,
comment_like_count TEXT NOT NULL,
sub_comment_count TEXT NOT NULL,
parent_comment_id TEXT DEFAULT NULL
);
CREATE INDEX idx_weibo_note__comment_c7611c ON weibo_note_comment(comment_id);
CREATE INDEX idx_weibo_note__note_id_24f108 ON weibo_note_comment(note_id);
CREATE INDEX idx_weibo_note__create__667fe3 ON weibo_note_comment(create_date_time);
-- ----------------------------
-- Table structure for weibo_creator
-- ----------------------------
DROP TABLE IF EXISTS weibo_creator;
CREATE TABLE weibo_creator (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
desc TEXT,
gender TEXT DEFAULT NULL,
follows TEXT DEFAULT NULL,
fans TEXT DEFAULT NULL,
tag_list TEXT
);
-- ----------------------------
-- Table structure for xhs_creator
-- ----------------------------
DROP TABLE IF EXISTS xhs_creator;
CREATE TABLE xhs_creator (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
desc TEXT,
gender TEXT DEFAULT NULL,
follows TEXT DEFAULT NULL,
fans TEXT DEFAULT NULL,
interaction TEXT DEFAULT NULL,
tag_list TEXT
);
-- ----------------------------
-- Table structure for xhs_note
-- ----------------------------
DROP TABLE IF EXISTS xhs_note;
CREATE TABLE xhs_note (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
note_id TEXT NOT NULL,
type TEXT DEFAULT NULL,
title TEXT DEFAULT NULL,
desc TEXT,
video_url TEXT,
time INTEGER NOT NULL,
last_update_time INTEGER NOT NULL,
liked_count TEXT DEFAULT NULL,
collected_count TEXT DEFAULT NULL,
comment_count TEXT DEFAULT NULL,
share_count TEXT DEFAULT NULL,
image_list TEXT,
tag_list TEXT,
note_url TEXT DEFAULT NULL,
source_keyword TEXT DEFAULT '',
xsec_token TEXT DEFAULT NULL
);
CREATE INDEX idx_xhs_note_note_id_209457 ON xhs_note(note_id);
CREATE INDEX idx_xhs_note_time_eaa910 ON xhs_note(time);
-- ----------------------------
-- Table structure for xhs_note_comment
-- ----------------------------
DROP TABLE IF EXISTS xhs_note_comment;
CREATE TABLE xhs_note_comment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
comment_id TEXT NOT NULL,
create_time INTEGER NOT NULL,
note_id TEXT NOT NULL,
content TEXT NOT NULL,
sub_comment_count INTEGER NOT NULL,
pictures TEXT DEFAULT NULL,
parent_comment_id TEXT DEFAULT NULL,
like_count TEXT DEFAULT NULL
);
CREATE INDEX idx_xhs_note_co_comment_8e8349 ON xhs_note_comment(comment_id);
CREATE INDEX idx_xhs_note_co_create__204f8d ON xhs_note_comment(create_time);
-- ----------------------------
-- Table structure for tieba_note
-- ----------------------------
DROP TABLE IF EXISTS tieba_note;
CREATE TABLE tieba_note (
id INTEGER PRIMARY KEY AUTOINCREMENT,
note_id TEXT NOT NULL,
title TEXT NOT NULL,
desc TEXT,
note_url TEXT NOT NULL,
publish_time TEXT NOT NULL,
user_link TEXT DEFAULT '',
user_nickname TEXT DEFAULT '',
user_avatar TEXT DEFAULT '',
tieba_id TEXT DEFAULT '',
tieba_name TEXT NOT NULL,
tieba_link TEXT NOT NULL,
total_replay_num INTEGER DEFAULT 0,
total_replay_page INTEGER DEFAULT 0,
ip_location TEXT DEFAULT '',
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
source_keyword TEXT DEFAULT ''
);
CREATE INDEX idx_tieba_note_note_id ON tieba_note(note_id);
CREATE INDEX idx_tieba_note_publish_time ON tieba_note(publish_time);
-- ----------------------------
-- Table structure for tieba_comment
-- ----------------------------
DROP TABLE IF EXISTS tieba_comment;
CREATE TABLE tieba_comment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
comment_id TEXT NOT NULL,
parent_comment_id TEXT DEFAULT '',
content TEXT NOT NULL,
user_link TEXT DEFAULT '',
user_nickname TEXT DEFAULT '',
user_avatar TEXT DEFAULT '',
tieba_id TEXT DEFAULT '',
tieba_name TEXT NOT NULL,
tieba_link TEXT NOT NULL,
publish_time TEXT DEFAULT '',
ip_location TEXT DEFAULT '',
sub_comment_count INTEGER DEFAULT 0,
note_id TEXT NOT NULL,
note_url TEXT NOT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL
);
CREATE INDEX idx_tieba_comment_comment_id ON tieba_comment(comment_id);
CREATE INDEX idx_tieba_comment_note_id ON tieba_comment(note_id);
CREATE INDEX idx_tieba_comment_publish_time ON tieba_comment(publish_time);
-- ----------------------------
-- Table structure for tieba_creator
-- ----------------------------
DROP TABLE IF EXISTS tieba_creator;
CREATE TABLE tieba_creator (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL,
user_name TEXT NOT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
gender TEXT DEFAULT NULL,
follows TEXT DEFAULT NULL,
fans TEXT DEFAULT NULL,
registration_duration TEXT DEFAULT NULL
);
-- ----------------------------
-- Table structure for zhihu_content
-- ----------------------------
DROP TABLE IF EXISTS zhihu_content;
CREATE TABLE zhihu_content (
id INTEGER PRIMARY KEY AUTOINCREMENT,
content_id TEXT NOT NULL,
content_type TEXT NOT NULL,
content_text TEXT,
content_url TEXT NOT NULL,
question_id TEXT DEFAULT NULL,
title TEXT NOT NULL,
desc TEXT,
created_time TEXT NOT NULL,
updated_time TEXT NOT NULL,
voteup_count INTEGER NOT NULL DEFAULT 0,
comment_count INTEGER NOT NULL DEFAULT 0,
source_keyword TEXT DEFAULT NULL,
user_id TEXT NOT NULL,
user_link TEXT NOT NULL,
user_nickname TEXT NOT NULL,
user_avatar TEXT NOT NULL,
user_url_token TEXT NOT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL
);
CREATE INDEX idx_zhihu_content_content_id ON zhihu_content(content_id);
CREATE INDEX idx_zhihu_content_created_time ON zhihu_content(created_time);
-- ----------------------------
-- Table structure for zhihu_comment
-- ----------------------------
DROP TABLE IF EXISTS zhihu_comment;
CREATE TABLE zhihu_comment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
comment_id TEXT NOT NULL,
parent_comment_id TEXT DEFAULT NULL,
content TEXT NOT NULL,
publish_time TEXT NOT NULL,
ip_location TEXT DEFAULT NULL,
sub_comment_count INTEGER NOT NULL DEFAULT 0,
like_count INTEGER NOT NULL DEFAULT 0,
dislike_count INTEGER NOT NULL DEFAULT 0,
content_id TEXT NOT NULL,
content_type TEXT NOT NULL,
user_id TEXT NOT NULL,
user_link TEXT NOT NULL,
user_nickname TEXT NOT NULL,
user_avatar TEXT NOT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL
);
CREATE INDEX idx_zhihu_comment_comment_id ON zhihu_comment(comment_id);
CREATE INDEX idx_zhihu_comment_content_id ON zhihu_comment(content_id);
CREATE INDEX idx_zhihu_comment_publish_time ON zhihu_comment(publish_time);
-- ----------------------------
-- Table structure for zhihu_creator
-- ----------------------------
DROP TABLE IF EXISTS zhihu_creator;
CREATE TABLE zhihu_creator (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL UNIQUE,
user_link TEXT NOT NULL,
user_nickname TEXT NOT NULL,
user_avatar TEXT NOT NULL,
url_token TEXT NOT NULL,
gender TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
follows INTEGER NOT NULL DEFAULT 0,
fans INTEGER NOT NULL DEFAULT 0,
anwser_count INTEGER NOT NULL DEFAULT 0,
video_count INTEGER NOT NULL DEFAULT 0,
question_count INTEGER NOT NULL DEFAULT 0,
article_count INTEGER NOT NULL DEFAULT 0,
column_count INTEGER NOT NULL DEFAULT 0,
get_voteup_count INTEGER NOT NULL DEFAULT 0,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL
);
CREATE UNIQUE INDEX idx_zhihu_creator_user_id ON zhihu_creator(user_id);
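
The `zhihu_creator` table above declares `user_id` as `UNIQUE` in the column definition and then adds a separate unique index on the same column. The sketch below, which assumes a trimmed three-column version of the table and the standard-library `sqlite3` module, illustrates the practical effect: duplicate `user_id` values are rejected, and SQLite ends up with two indexes on that column.

```python
import sqlite3

# Trimmed version of the zhihu_creator definition, for illustration only.
DDL = """
CREATE TABLE zhihu_creator (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id TEXT NOT NULL UNIQUE,
    user_nickname TEXT NOT NULL
);
CREATE UNIQUE INDEX idx_zhihu_creator_user_id ON zhihu_creator(user_id);
"""


def duplicate_rejected() -> bool:
    """Return True if inserting the same user_id twice raises IntegrityError."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    insert = "INSERT INTO zhihu_creator (user_id, user_nickname) VALUES (?, ?)"
    conn.execute(insert, ("u1", "first"))
    try:
        conn.execute(insert, ("u1", "second"))
    except sqlite3.IntegrityError:
        return True
    finally:
        conn.close()
    return False


def index_names() -> list:
    """List the names of all indexes SQLite keeps for zhihu_creator."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    rows = conn.execute("PRAGMA index_list('zhihu_creator')").fetchall()
    conn.close()
    return sorted(row[1] for row in rows)  # column 1 holds the index name


if __name__ == "__main__":
    print(duplicate_rejected())
    print(index_names())
```

Either the column constraint or the explicit index would be enough to enforce uniqueness; keeping both is harmless but stores the same index twice.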

@@ -1,597 +0,0 @@
-- ----------------------------
-- Table structure for bilibili_video
-- ----------------------------
DROP TABLE IF EXISTS `bilibili_video`;
CREATE TABLE `bilibili_video`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`video_id` varchar(64) NOT NULL COMMENT '视频ID',
`video_type` varchar(16) NOT NULL COMMENT '视频类型',
`title` varchar(500) DEFAULT NULL COMMENT '视频标题',
`desc` longtext COMMENT '视频描述',
`create_time` bigint NOT NULL COMMENT '视频发布时间戳',
`liked_count` varchar(16) DEFAULT NULL COMMENT '视频点赞数',
`disliked_count` varchar(16) DEFAULT NULL COMMENT '视频点踩数',
`video_play_count` varchar(16) DEFAULT NULL COMMENT '视频播放数量',
`video_favorite_count` varchar(16) DEFAULT NULL COMMENT '视频收藏数量',
`video_share_count` varchar(16) DEFAULT NULL COMMENT '视频分享数量',
`video_coin_count` varchar(16) DEFAULT NULL COMMENT '视频投币数量',
`video_danmaku` varchar(16) DEFAULT NULL COMMENT '视频弹幕数量',
`video_comment` varchar(16) DEFAULT NULL COMMENT '视频评论数量',
`video_url` varchar(512) DEFAULT NULL COMMENT '视频详情URL',
`video_cover_url` varchar(512) DEFAULT NULL COMMENT '视频封面图 URL',
PRIMARY KEY (`id`),
KEY `idx_bilibili_vi_video_i_31c36e` (`video_id`),
KEY `idx_bilibili_vi_create__73e0ec` (`create_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='B站视频';
-- ----------------------------
-- Table structure for bilibili_video_comment
-- ----------------------------
DROP TABLE IF EXISTS `bilibili_video_comment`;
CREATE TABLE `bilibili_video_comment`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`sex` varchar(64) DEFAULT NULL COMMENT '用户性别',
`sign` text DEFAULT NULL COMMENT '用户签名',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`video_id` varchar(64) NOT NULL COMMENT '视频ID',
`content` longtext COMMENT '评论内容',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
PRIMARY KEY (`id`),
KEY `idx_bilibili_vi_comment_41c34e` (`comment_id`),
KEY `idx_bilibili_vi_video_i_f22873` (`video_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='B 站视频评论';
-- ----------------------------
-- Table structure for bilibili_up_info
-- ----------------------------
DROP TABLE IF EXISTS `bilibili_up_info`;
CREATE TABLE `bilibili_up_info`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`sex` varchar(64) DEFAULT NULL COMMENT '用户性别',
`sign` text DEFAULT NULL COMMENT '用户签名',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`total_fans` bigint DEFAULT NULL COMMENT '粉丝数',
`total_liked` bigint DEFAULT NULL COMMENT '总获赞数',
`user_rank` int DEFAULT NULL COMMENT '用户等级',
`is_official` int DEFAULT NULL COMMENT '是否官号',
PRIMARY KEY (`id`),
KEY `idx_bilibili_vi_user_123456` (`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='B 站UP主信息';
-- ----------------------------
-- Table structure for bilibili_contact_info
-- ----------------------------
DROP TABLE IF EXISTS `bilibili_contact_info`;
CREATE TABLE `bilibili_contact_info`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`up_id` varchar(64) DEFAULT NULL COMMENT 'up主ID',
`fan_id` varchar(64) DEFAULT NULL COMMENT '粉丝ID',
`up_name` varchar(64) DEFAULT NULL COMMENT 'up主昵称',
`fan_name` varchar(64) DEFAULT NULL COMMENT '粉丝昵称',
`up_sign` longtext DEFAULT NULL COMMENT 'up主签名',
`fan_sign` longtext DEFAULT NULL COMMENT '粉丝签名',
`up_avatar` varchar(255) DEFAULT NULL COMMENT 'up主头像地址',
`fan_avatar` varchar(255) DEFAULT NULL COMMENT '粉丝头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
PRIMARY KEY (`id`),
KEY `idx_bilibili_contact_info_up_id` (`up_id`),
KEY `idx_bilibili_contact_info_fan_id` (`fan_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='B 站联系人信息';
-- ----------------------------
-- Table structure for bilibili_up_dynamic
-- ----------------------------
DROP TABLE IF EXISTS `bilibili_up_dynamic`;
CREATE TABLE `bilibili_up_dynamic`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`dynamic_id` varchar(64) DEFAULT NULL COMMENT '动态ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`user_name` varchar(64) DEFAULT NULL COMMENT '用户名',
`text` longtext DEFAULT NULL COMMENT '动态文本',
`type` varchar(64) DEFAULT NULL COMMENT '动态类型',
`pub_ts` bigint DEFAULT NULL COMMENT '动态发布时间',
`total_comments` bigint DEFAULT NULL COMMENT '评论数',
`total_forwards` bigint DEFAULT NULL COMMENT '转发数',
`total_liked` bigint DEFAULT NULL COMMENT '点赞数',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
PRIMARY KEY (`id`),
KEY `idx_bilibili_up_dynamic_dynamic_id` (`dynamic_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='B 站up主动态信息';
-- ----------------------------
-- Table structure for douyin_aweme
-- ----------------------------
DROP TABLE IF EXISTS `douyin_aweme`;
CREATE TABLE `douyin_aweme`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`sec_uid` varchar(128) DEFAULT NULL COMMENT '用户sec_uid',
`short_user_id` varchar(64) DEFAULT NULL COMMENT '用户短ID',
`user_unique_id` varchar(64) DEFAULT NULL COMMENT '用户唯一ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`user_signature` varchar(500) DEFAULT NULL COMMENT '用户签名',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`aweme_id` varchar(64) NOT NULL COMMENT '视频ID',
`aweme_type` varchar(16) NOT NULL COMMENT '视频类型',
`title` varchar(1024) DEFAULT NULL COMMENT '视频标题',
`desc` longtext COMMENT '视频描述',
`create_time` bigint NOT NULL COMMENT '视频发布时间戳',
`liked_count` varchar(16) DEFAULT NULL COMMENT '视频点赞数',
`comment_count` varchar(16) DEFAULT NULL COMMENT '视频评论数',
`share_count` varchar(16) DEFAULT NULL COMMENT '视频分享数',
`collected_count` varchar(16) DEFAULT NULL COMMENT '视频收藏数',
`aweme_url` varchar(255) DEFAULT NULL COMMENT '视频详情页URL',
`cover_url` varchar(500) DEFAULT NULL COMMENT '视频封面图URL',
`video_download_url` longtext COMMENT '视频下载地址',
`music_download_url` longtext COMMENT '音乐下载地址',
`note_download_url` longtext COMMENT '笔记下载地址',
PRIMARY KEY (`id`),
KEY `idx_douyin_awem_aweme_i_6f7bc6` (`aweme_id`),
KEY `idx_douyin_awem_create__299dfe` (`create_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='抖音视频';
-- ----------------------------
-- Table structure for douyin_aweme_comment
-- ----------------------------
DROP TABLE IF EXISTS `douyin_aweme_comment`;
CREATE TABLE `douyin_aweme_comment`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`sec_uid` varchar(128) DEFAULT NULL COMMENT '用户sec_uid',
`short_user_id` varchar(64) DEFAULT NULL COMMENT '用户短ID',
`user_unique_id` varchar(64) DEFAULT NULL COMMENT '用户唯一ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`user_signature` varchar(500) DEFAULT NULL COMMENT '用户签名',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`aweme_id` varchar(64) NOT NULL COMMENT '视频ID',
`content` longtext COMMENT '评论内容',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
PRIMARY KEY (`id`),
KEY `idx_douyin_awem_comment_fcd7e4` (`comment_id`),
KEY `idx_douyin_awem_aweme_i_c50049` (`aweme_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='抖音视频评论';
-- ----------------------------
-- Table structure for dy_creator
-- ----------------------------
DROP TABLE IF EXISTS `dy_creator`;
CREATE TABLE `dy_creator`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(128) NOT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`desc` longtext COMMENT '用户描述',
`gender` varchar(1) DEFAULT NULL COMMENT '性别',
`follows` varchar(16) DEFAULT NULL COMMENT '关注数',
`fans` varchar(16) DEFAULT NULL COMMENT '粉丝数',
`interaction` varchar(16) DEFAULT NULL COMMENT '获赞数',
`videos_count` varchar(16) DEFAULT NULL COMMENT '作品数',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='抖音博主信息';
-- ----------------------------
-- Table structure for kuaishou_video
-- ----------------------------
DROP TABLE IF EXISTS `kuaishou_video`;
CREATE TABLE `kuaishou_video`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`video_id` varchar(64) NOT NULL COMMENT '视频ID',
`video_type` varchar(16) NOT NULL COMMENT '视频类型',
`title` varchar(500) DEFAULT NULL COMMENT '视频标题',
`desc` longtext COMMENT '视频描述',
`create_time` bigint NOT NULL COMMENT '视频发布时间戳',
`liked_count` varchar(16) DEFAULT NULL COMMENT '视频点赞数',
`viewd_count` varchar(16) DEFAULT NULL COMMENT '视频浏览数量',
`video_url` varchar(512) DEFAULT NULL COMMENT '视频详情URL',
`video_cover_url` varchar(512) DEFAULT NULL COMMENT '视频封面图 URL',
`video_play_url` varchar(512) DEFAULT NULL COMMENT '视频播放 URL',
PRIMARY KEY (`id`),
KEY `idx_kuaishou_vi_video_i_c5c6a6` (`video_id`),
KEY `idx_kuaishou_vi_create__a10dee` (`create_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='快手视频';
-- ----------------------------
-- Table structure for kuaishou_video_comment
-- ----------------------------
DROP TABLE IF EXISTS `kuaishou_video_comment`;
CREATE TABLE `kuaishou_video_comment`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`video_id` varchar(64) NOT NULL COMMENT '视频ID',
`content` longtext COMMENT '评论内容',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
PRIMARY KEY (`id`),
KEY `idx_kuaishou_vi_comment_ed48fa` (`comment_id`),
KEY `idx_kuaishou_vi_video_i_e50914` (`video_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='快手视频评论';
-- ----------------------------
-- Table structure for weibo_note
-- ----------------------------
DROP TABLE IF EXISTS `weibo_note`;
CREATE TABLE `weibo_note`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`gender` varchar(12) DEFAULT NULL COMMENT '用户性别',
`profile_url` varchar(255) DEFAULT NULL COMMENT '用户主页地址',
`ip_location` varchar(32) DEFAULT NULL COMMENT '发布微博的地理信息',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`note_id` varchar(64) NOT NULL COMMENT '帖子ID',
`content` longtext COMMENT '帖子正文内容',
`create_time` bigint NOT NULL COMMENT '帖子发布时间戳',
`create_date_time` varchar(32) NOT NULL COMMENT '帖子发布日期时间',
`liked_count` varchar(16) DEFAULT NULL COMMENT '帖子点赞数',
`comments_count` varchar(16) DEFAULT NULL COMMENT '帖子评论数量',
`shared_count` varchar(16) DEFAULT NULL COMMENT '帖子转发数量',
`note_url` varchar(512) DEFAULT NULL COMMENT '帖子详情URL',
PRIMARY KEY (`id`),
KEY `idx_weibo_note_note_id_f95b1a` (`note_id`),
KEY `idx_weibo_note_create__692709` (`create_time`),
KEY `idx_weibo_note_create__d05ed2` (`create_date_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='微博帖子';
-- ----------------------------
-- Table structure for weibo_note_comment
-- ----------------------------
DROP TABLE IF EXISTS `weibo_note_comment`;
CREATE TABLE `weibo_note_comment`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`gender` varchar(12) DEFAULT NULL COMMENT '用户性别',
`profile_url` varchar(255) DEFAULT NULL COMMENT '用户主页地址',
`ip_location` varchar(32) DEFAULT NULL COMMENT '发布微博的地理信息',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`note_id` varchar(64) NOT NULL COMMENT '帖子ID',
`content` longtext COMMENT '评论内容',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`create_date_time` varchar(32) NOT NULL COMMENT '评论日期时间',
`comment_like_count` varchar(16) NOT NULL COMMENT '评论点赞数量',
`sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
PRIMARY KEY (`id`),
KEY `idx_weibo_note__comment_c7611c` (`comment_id`),
KEY `idx_weibo_note__note_id_24f108` (`note_id`),
KEY `idx_weibo_note__create__667fe3` (`create_date_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='微博帖子评论';
-- ----------------------------
-- Table structure for xhs_creator
-- ----------------------------
DROP TABLE IF EXISTS `xhs_creator`;
CREATE TABLE `xhs_creator`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`desc` longtext COMMENT '用户描述',
`gender` varchar(1) DEFAULT NULL COMMENT '性别',
`follows` varchar(16) DEFAULT NULL COMMENT '关注数',
`fans` varchar(16) DEFAULT NULL COMMENT '粉丝数',
`interaction` varchar(16) DEFAULT NULL COMMENT '获赞和收藏数',
`tag_list` longtext COMMENT '标签列表',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='小红书博主';
-- ----------------------------
-- Table structure for xhs_note
-- ----------------------------
DROP TABLE IF EXISTS `xhs_note`;
CREATE TABLE `xhs_note`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`note_id` varchar(64) NOT NULL COMMENT '笔记ID',
`type` varchar(16) DEFAULT NULL COMMENT '笔记类型(normal | video)',
`title` varchar(255) DEFAULT NULL COMMENT '笔记标题',
`desc` longtext COMMENT '笔记描述',
`video_url` longtext COMMENT '视频地址',
`time` bigint NOT NULL COMMENT '笔记发布时间戳',
`last_update_time` bigint NOT NULL COMMENT '笔记最后更新时间戳',
`liked_count` varchar(16) DEFAULT NULL COMMENT '笔记点赞数',
`collected_count` varchar(16) DEFAULT NULL COMMENT '笔记收藏数',
`comment_count` varchar(16) DEFAULT NULL COMMENT '笔记评论数',
`share_count` varchar(16) DEFAULT NULL COMMENT '笔记分享数',
`image_list` longtext COMMENT '笔记封面图片列表',
`tag_list` longtext COMMENT '标签列表',
`note_url` varchar(255) DEFAULT NULL COMMENT '笔记详情页的URL',
PRIMARY KEY (`id`),
KEY `idx_xhs_note_note_id_209457` (`note_id`),
KEY `idx_xhs_note_time_eaa910` (`time`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='小红书笔记';
-- ----------------------------
-- Table structure for xhs_note_comment
-- ----------------------------
DROP TABLE IF EXISTS `xhs_note_comment`;
CREATE TABLE `xhs_note_comment`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`note_id` varchar(64) NOT NULL COMMENT '笔记ID',
`content` longtext NOT NULL COMMENT '评论内容',
`sub_comment_count` int NOT NULL COMMENT '子评论数量',
`pictures` varchar(512) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_xhs_note_co_comment_8e8349` (`comment_id`),
KEY `idx_xhs_note_co_create__204f8d` (`create_time`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='小红书笔记评论';
-- ----------------------------
-- alter the comment tables to support parent_comment_id
-- ----------------------------
ALTER TABLE `xhs_note_comment`
ADD COLUMN `parent_comment_id` VARCHAR(64) DEFAULT NULL COMMENT '父评论ID';
ALTER TABLE `douyin_aweme_comment`
ADD COLUMN `parent_comment_id` VARCHAR(64) DEFAULT NULL COMMENT '父评论ID';
ALTER TABLE `bilibili_video_comment`
ADD COLUMN `parent_comment_id` VARCHAR(64) DEFAULT NULL COMMENT '父评论ID';
ALTER TABLE `weibo_note_comment`
ADD COLUMN `parent_comment_id` VARCHAR(64) DEFAULT NULL COMMENT '父评论ID';
DROP TABLE IF EXISTS `tieba_note`;
CREATE TABLE tieba_note
(
id BIGINT AUTO_INCREMENT PRIMARY KEY,
note_id VARCHAR(644) NOT NULL COMMENT '帖子ID',
title VARCHAR(255) NOT NULL COMMENT '帖子标题',
`desc` TEXT COMMENT '帖子描述',
note_url VARCHAR(255) NOT NULL COMMENT '帖子链接',
publish_time VARCHAR(255) NOT NULL COMMENT '发布时间',
user_link VARCHAR(255) DEFAULT '' COMMENT '用户主页链接',
user_nickname VARCHAR(255) DEFAULT '' COMMENT '用户昵称',
user_avatar VARCHAR(255) DEFAULT '' COMMENT '用户头像地址',
tieba_id VARCHAR(255) DEFAULT '' COMMENT '贴吧ID',
tieba_name VARCHAR(255) NOT NULL COMMENT '贴吧名称',
tieba_link VARCHAR(255) NOT NULL COMMENT '贴吧链接',
total_replay_num INT DEFAULT 0 COMMENT '帖子回复总数',
total_replay_page INT DEFAULT 0 COMMENT '帖子回复总页数',
ip_location VARCHAR(255) DEFAULT '' COMMENT 'IP地理位置',
add_ts BIGINT NOT NULL COMMENT '添加时间戳',
last_modify_ts BIGINT NOT NULL COMMENT '最后修改时间戳',
KEY `idx_tieba_note_note_id` (`note_id`),
KEY `idx_tieba_note_publish_time` (`publish_time`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='贴吧帖子表';
DROP TABLE IF EXISTS `tieba_comment`;
CREATE TABLE tieba_comment
(
id BIGINT AUTO_INCREMENT PRIMARY KEY,
comment_id VARCHAR(255) NOT NULL COMMENT '评论ID',
parent_comment_id VARCHAR(255) DEFAULT '' COMMENT '父评论ID',
content TEXT NOT NULL COMMENT '评论内容',
user_link VARCHAR(255) DEFAULT '' COMMENT '用户主页链接',
user_nickname VARCHAR(255) DEFAULT '' COMMENT '用户昵称',
user_avatar VARCHAR(255) DEFAULT '' COMMENT '用户头像地址',
tieba_id VARCHAR(255) DEFAULT '' COMMENT '贴吧ID',
tieba_name VARCHAR(255) NOT NULL COMMENT '贴吧名称',
tieba_link VARCHAR(255) NOT NULL COMMENT '贴吧链接',
publish_time VARCHAR(255) DEFAULT '' COMMENT '发布时间',
ip_location VARCHAR(255) DEFAULT '' COMMENT 'IP地理位置',
sub_comment_count INT DEFAULT 0 COMMENT '子评论数',
note_id VARCHAR(255) NOT NULL COMMENT '帖子ID',
note_url VARCHAR(255) NOT NULL COMMENT '帖子链接',
add_ts BIGINT NOT NULL COMMENT '添加时间戳',
last_modify_ts BIGINT NOT NULL COMMENT '最后修改时间戳',
KEY `idx_tieba_comment_comment_id` (`comment_id`),
KEY `idx_tieba_comment_note_id` (`note_id`),
KEY `idx_tieba_comment_publish_time` (`publish_time`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='贴吧评论表';
-- add a source_keyword column that records the search keyword a row came from
alter table bilibili_video
add column `source_keyword` varchar(255) default '' comment '搜索来源关键字';
alter table douyin_aweme
add column `source_keyword` varchar(255) default '' comment '搜索来源关键字';
alter table kuaishou_video
add column `source_keyword` varchar(255) default '' comment '搜索来源关键字';
alter table weibo_note
add column `source_keyword` varchar(255) default '' comment '搜索来源关键字';
alter table xhs_note
add column `source_keyword` varchar(255) default '' comment '搜索来源关键字';
alter table tieba_note
add column `source_keyword` varchar(255) default '' comment '搜索来源关键字';
DROP TABLE IF EXISTS `weibo_creator`;
CREATE TABLE `weibo_creator`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`desc` longtext COMMENT '用户描述',
`gender` varchar(2) DEFAULT NULL COMMENT '性别',
`follows` varchar(16) DEFAULT NULL COMMENT '关注数',
`fans` varchar(16) DEFAULT NULL COMMENT '粉丝数',
`tag_list` longtext COMMENT '标签列表',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='微博博主';
ALTER TABLE `xhs_note_comment`
ADD COLUMN `like_count` VARCHAR(64) DEFAULT NULL COMMENT '评论点赞数量';
DROP TABLE IF EXISTS `tieba_creator`;
CREATE TABLE `tieba_creator`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`user_name` varchar(64) NOT NULL COMMENT '用户名',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`gender` varchar(2) DEFAULT NULL COMMENT '性别',
`follows` varchar(16) DEFAULT NULL COMMENT '关注数',
`fans` varchar(16) DEFAULT NULL COMMENT '粉丝数',
`registration_duration` varchar(16) DEFAULT NULL COMMENT '吧龄',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='贴吧创作者';
DROP TABLE IF EXISTS `zhihu_content`;
CREATE TABLE `zhihu_content` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`content_id` varchar(64) NOT NULL COMMENT '内容ID',
`content_type` varchar(16) NOT NULL COMMENT '内容类型(article | answer | zvideo)',
`content_text` longtext COMMENT '内容文本, 如果是视频类型这里为空',
`content_url` varchar(255) NOT NULL COMMENT '内容落地链接',
`question_id` varchar(64) DEFAULT NULL COMMENT '问题ID, type为answer时有值',
`title` varchar(255) NOT NULL COMMENT '内容标题',
`desc` longtext COMMENT '内容描述',
`created_time` varchar(32) NOT NULL COMMENT '创建时间',
`updated_time` varchar(32) NOT NULL COMMENT '更新时间',
`voteup_count` int NOT NULL DEFAULT '0' COMMENT '赞同人数',
`comment_count` int NOT NULL DEFAULT '0' COMMENT '评论数量',
`source_keyword` varchar(64) DEFAULT NULL COMMENT '来源关键词',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`user_link` varchar(255) NOT NULL COMMENT '用户主页链接',
`user_nickname` varchar(64) NOT NULL COMMENT '用户昵称',
`user_avatar` varchar(255) NOT NULL COMMENT '用户头像地址',
`user_url_token` varchar(255) NOT NULL COMMENT '用户url_token',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
PRIMARY KEY (`id`),
KEY `idx_zhihu_content_content_id` (`content_id`),
KEY `idx_zhihu_content_created_time` (`created_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='知乎内容(回答、文章、视频)';
DROP TABLE IF EXISTS `zhihu_comment`;
CREATE TABLE `zhihu_comment` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`parent_comment_id` varchar(64) DEFAULT NULL COMMENT '父评论ID',
`content` text NOT NULL COMMENT '评论内容',
`publish_time` varchar(32) NOT NULL COMMENT '发布时间',
`ip_location` varchar(64) DEFAULT NULL COMMENT 'IP地理位置',
`sub_comment_count` int NOT NULL DEFAULT '0' COMMENT '子评论数',
`like_count` int NOT NULL DEFAULT '0' COMMENT '点赞数',
`dislike_count` int NOT NULL DEFAULT '0' COMMENT '踩数',
`content_id` varchar(64) NOT NULL COMMENT '内容ID',
`content_type` varchar(16) NOT NULL COMMENT '内容类型(article | answer | zvideo)',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`user_link` varchar(255) NOT NULL COMMENT '用户主页链接',
`user_nickname` varchar(64) NOT NULL COMMENT '用户昵称',
`user_avatar` varchar(255) NOT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
PRIMARY KEY (`id`),
KEY `idx_zhihu_comment_comment_id` (`comment_id`),
KEY `idx_zhihu_comment_content_id` (`content_id`),
KEY `idx_zhihu_comment_publish_time` (`publish_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='知乎评论';
DROP TABLE IF EXISTS `zhihu_creator`;
CREATE TABLE `zhihu_creator` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`user_link` varchar(255) NOT NULL COMMENT '用户主页链接',
`user_nickname` varchar(64) NOT NULL COMMENT '用户昵称',
`user_avatar` varchar(255) NOT NULL COMMENT '用户头像地址',
`url_token` varchar(64) NOT NULL COMMENT '用户URL Token',
`gender` varchar(16) DEFAULT NULL COMMENT '用户性别',
`ip_location` varchar(64) DEFAULT NULL COMMENT 'IP地理位置',
`follows` int NOT NULL DEFAULT 0 COMMENT '关注数',
`fans` int NOT NULL DEFAULT 0 COMMENT '粉丝数',
`anwser_count` int NOT NULL DEFAULT 0 COMMENT '回答数',
`video_count` int NOT NULL DEFAULT 0 COMMENT '视频数',
`question_count` int NOT NULL DEFAULT 0 COMMENT '问题数',
`article_count` int NOT NULL DEFAULT 0 COMMENT '文章数',
`column_count` int NOT NULL DEFAULT 0 COMMENT '专栏数',
`get_voteup_count` int NOT NULL DEFAULT 0 COMMENT '获得的赞同数',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
PRIMARY KEY (`id`),
UNIQUE KEY `idx_zhihu_creator_user_id` (`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='知乎创作者';
-- add `like_count`, `pictures`, and `xsec_token` columns to existing tables
alter table douyin_aweme_comment add column `like_count` varchar(255) NOT NULL DEFAULT '0' COMMENT '点赞数';
alter table xhs_note add column xsec_token varchar(50) default null comment '签名算法';
alter table douyin_aweme_comment add column `pictures` varchar(500) NOT NULL DEFAULT '' COMMENT '评论图片列表';
alter table bilibili_video_comment add column `like_count` varchar(255) NOT NULL DEFAULT '0' COMMENT '点赞数';
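
The schema above adds secondary indexes on the ID columns that the store layer filters on (for example `comment_id` on the comment tables). As a rough illustration of why the indexed column has to match the lookup column, the sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL. The trimmed `tieba_comment` table, the `idx_demo` index name, and the `plan_for_comment_lookup` helper are assumptions made for this example and are not part of the project.

```python
import sqlite3


def plan_for_comment_lookup(indexed_column: str) -> str:
    """Return SQLite's query plan for a lookup by comment_id.

    `indexed_column` is the column the single secondary index is built on.
    It is interpolated into SQL, so it must come from trusted code.
    """
    conn = sqlite3.connect(":memory:")
    # Trimmed-down stand-in for tieba_comment; only the columns needed here.
    conn.execute(
        "CREATE TABLE tieba_comment ("
        "id INTEGER PRIMARY KEY, comment_id TEXT, note_id TEXT, content TEXT)"
    )
    conn.execute(f"CREATE INDEX idx_demo ON tieba_comment ({indexed_column})")
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM tieba_comment WHERE comment_id = ?",
        ("c1",),
    ).fetchall()
    conn.close()
    # The human-readable plan text is the last column of each row.
    return " | ".join(row[-1] for row in rows)


if __name__ == "__main__":
    print(plan_for_comment_lookup("note_id"))     # index on a different column
    print(plan_for_comment_lookup("comment_id"))  # index on the lookup column
```

With an index on `note_id`, the planner has to scan the table for a `comment_id` lookup. With an index on `comment_id`, it can search the index instead. The exact plan wording varies between SQLite versions, and MySQL's `EXPLAIN` output looks different, but the principle carries over.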

@@ -18,7 +18,7 @@ from typing import List
import config
from var import source_keyword_var
from .bilibili_store_impl import *
from ._store_impl import *
from .bilibilli_store_media import *

@@ -0,0 +1,299 @@
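# Disclaimer: this code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the principles above and to all terms in LICENSE.
# -*- coding: utf-8 -*-
# @Author : persist1@126.com
# @Time : 2025/9/5 19:34
# @Desc : Bilibili storage implementation classes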
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
from sqlalchemy import select
from sqlalchemy.orm import sessionmaker
import config
from base.base_crawler import AbstractStore
from database.db_session import get_session
from database.models import BilibiliVideoComment, BilibiliVideo, BilibiliUpInfo, BilibiliUpDynamic, BilibiliContactInfo
from tools.async_file_writer import AsyncFileWriter
from tools import utils, words
from var import crawler_type_var
class BiliCsvStoreImplement(AbstractStore):
    def __init__(self):
        self.file_writer = AsyncFileWriter(
            crawler_type=crawler_type_var.get(),
            platform="bilibili"
        )

    async def store_content(self, content_item: Dict):
        """
        content CSV storage implementation
        Args:
            content_item: content item dict
        Returns:
        """
        await self.file_writer.write_to_csv(
            item=content_item,
            item_type="videos"
        )

    async def store_comment(self, comment_item: Dict):
        """
        comment CSV storage implementation
        Args:
            comment_item: comment item dict
        Returns:
        """
        await self.file_writer.write_to_csv(
            item=comment_item,
            item_type="comments"
        )

    async def store_creator(self, creator: Dict):
        """
        creator CSV storage implementation
        Args:
            creator: creator item dict
        Returns:
        """
        await self.file_writer.write_to_csv(
            item=creator,
            item_type="creators"
        )

    async def store_contact(self, contact_item: Dict):
        """
        creator contact CSV storage implementation
        Args:
            contact_item: creator's contact item dict
        Returns:
        """
        await self.file_writer.write_to_csv(
            item=contact_item,
            item_type="contacts"
        )

    async def store_dynamic(self, dynamic_item: Dict):
        """
        creator dynamic CSV storage implementation
        Args:
            dynamic_item: creator's dynamic item dict
        Returns:
        """
        await self.file_writer.write_to_csv(
            item=dynamic_item,
            item_type="dynamics"
        )
class BiliDbStoreImplement(AbstractStore):
    async def store_content(self, content_item: Dict):
        """
        Bilibili content DB storage implementation
        Args:
            content_item: content item dict
        """
        video_id = content_item.get("video_id")
        async with get_session() as session:
            result = await session.execute(select(BilibiliVideo).where(BilibiliVideo.video_id == video_id))
            video_detail = result.scalar_one_or_none()
            if not video_detail:
                content_item["add_ts"] = utils.get_current_timestamp()
                new_content = BilibiliVideo(**content_item)
                session.add(new_content)
            else:
                for key, value in content_item.items():
                    setattr(video_detail, key, value)
            await session.commit()

    async def store_comment(self, comment_item: Dict):
        """
        Bilibili comment DB storage implementation
        Args:
            comment_item: comment item dict
        """
        comment_id = comment_item.get("comment_id")
        async with get_session() as session:
            result = await session.execute(select(BilibiliVideoComment).where(BilibiliVideoComment.comment_id == comment_id))
            comment_detail = result.scalar_one_or_none()
            if not comment_detail:
                comment_item["add_ts"] = utils.get_current_timestamp()
                new_comment = BilibiliVideoComment(**comment_item)
                session.add(new_comment)
            else:
                for key, value in comment_item.items():
                    setattr(comment_detail, key, value)
            await session.commit()

    async def store_creator(self, creator: Dict):
        """
        Bilibili creator DB storage implementation
        Args:
            creator: creator item dict
        """
        creator_id = creator.get("user_id")
        async with get_session() as session:
            result = await session.execute(select(BilibiliUpInfo).where(BilibiliUpInfo.user_id == creator_id))
            creator_detail = result.scalar_one_or_none()
            if not creator_detail:
                creator["add_ts"] = utils.get_current_timestamp()
                new_creator = BilibiliUpInfo(**creator)
                session.add(new_creator)
            else:
                for key, value in creator.items():
                    setattr(creator_detail, key, value)
            await session.commit()

    async def store_contact(self, contact_item: Dict):
        """
        Bilibili contact DB storage implementation
        Args:
            contact_item: contact item dict
        """
        up_id = contact_item.get("up_id")
        fan_id = contact_item.get("fan_id")
        async with get_session() as session:
            result = await session.execute(
                select(BilibiliContactInfo).where(BilibiliContactInfo.up_id == up_id, BilibiliContactInfo.fan_id == fan_id)
            )
            contact_detail = result.scalar_one_or_none()
            if not contact_detail:
                contact_item["add_ts"] = utils.get_current_timestamp()
                new_contact = BilibiliContactInfo(**contact_item)
                session.add(new_contact)
            else:
                for key, value in contact_item.items():
                    setattr(contact_detail, key, value)
            await session.commit()

    async def store_dynamic(self, dynamic_item):
        """
        Bilibili dynamic DB storage implementation
        Args:
            dynamic_item: dynamic item dict
        """
        dynamic_id = dynamic_item.get("dynamic_id")
        async with get_session() as session:
            result = await session.execute(select(BilibiliUpDynamic).where(BilibiliUpDynamic.dynamic_id == dynamic_id))
            dynamic_detail = result.scalar_one_or_none()
            if not dynamic_detail:
                dynamic_item["add_ts"] = utils.get_current_timestamp()
                new_dynamic = BilibiliUpDynamic(**dynamic_item)
                session.add(new_dynamic)
            else:
                for key, value in dynamic_item.items():
                    setattr(dynamic_detail, key, value)
            await session.commit()
class BiliJsonStoreImplement(AbstractStore):
    def __init__(self):
        self.file_writer = AsyncFileWriter(
            crawler_type=crawler_type_var.get(),
            platform="bilibili"
        )

    async def store_content(self, content_item: Dict):
        """
        content JSON storage implementation
        Args:
            content_item: content item dict
        Returns:
        """
        await self.file_writer.write_single_item_to_json(
            item=content_item,
            item_type="contents"
        )

    async def store_comment(self, comment_item: Dict):
        """
        comment JSON storage implementation
        Args:
            comment_item: comment item dict
        Returns:
        """
        await self.file_writer.write_single_item_to_json(
            item=comment_item,
            item_type="comments"
        )

    async def store_creator(self, creator: Dict):
        """
        creator JSON storage implementation
        Args:
            creator: creator item dict
        Returns:
        """
        await self.file_writer.write_single_item_to_json(
            item=creator,
            item_type="creators"
        )

    async def store_contact(self, contact_item: Dict):
        """
        creator contact JSON storage implementation
        Args:
            contact_item: creator's contact item dict
        Returns:
        """
        await self.file_writer.write_single_item_to_json(
            item=contact_item,
            item_type="contacts"
        )

    async def store_dynamic(self, dynamic_item: Dict):
        """
        creator dynamic JSON storage implementation
        Args:
            dynamic_item: creator's dynamic item dict
        Returns:
        """
        await self.file_writer.write_single_item_to_json(
            item=dynamic_item,
            item_type="dynamics"
        )
class BiliSqliteStoreImplement(BiliDbStoreImplement):
    pass
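
Every method of `BiliDbStoreImplement` in this file follows the same steps: look a row up by its business key, insert it with a fresh `add_ts` if it is missing, otherwise overwrite its fields, then commit. The sketch below shows that pattern in isolation. To stay self-contained it swaps the project's async SQLAlchemy session for the standard-library `sqlite3` module. The `upsert_by_key` and `demo` helpers, the trimmed `bilibili_video` table, and the millisecond timestamp are assumptions for illustration, not the project's API.

```python
import sqlite3
import time
from typing import Any, Dict, List


def upsert_by_key(conn: sqlite3.Connection, table: str, key_column: str,
                  item: Dict[str, Any]) -> str:
    """Insert `item` if no row shares its key, otherwise update that row.

    `table`, `key_column`, and the dict keys are interpolated into SQL,
    so they must come from trusted code, never from crawled data.
    Returns "inserted" or "updated".
    """
    key_value = item[key_column]
    row = conn.execute(
        f"SELECT 1 FROM {table} WHERE {key_column} = ?", (key_value,)
    ).fetchone()
    if row is None:
        # As in the diff, add_ts is stamped only on the first insert.
        # Millisecond precision is an assumption of this sketch.
        record = {**item, "add_ts": int(time.time() * 1000)}
        columns = ", ".join(record)
        placeholders = ", ".join("?" for _ in record)
        conn.execute(
            f"INSERT INTO {table} ({columns}) VALUES ({placeholders})",
            tuple(record.values()),
        )
        outcome = "inserted"
    else:
        assignments = ", ".join(f"{column} = ?" for column in item)
        conn.execute(
            f"UPDATE {table} SET {assignments} WHERE {key_column} = ?",
            (*item.values(), key_value),
        )
        outcome = "updated"
    conn.commit()
    return outcome


def demo() -> List[Any]:
    """Store the same video twice and report what happened each time."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bilibili_video ("
        "id INTEGER PRIMARY KEY, video_id TEXT, title TEXT, add_ts INTEGER)"
    )
    first = upsert_by_key(conn, "bilibili_video", "video_id",
                          {"video_id": "v1", "title": "old"})
    second = upsert_by_key(conn, "bilibili_video", "video_id",
                           {"video_id": "v1", "title": "new"})
    rows = conn.execute("SELECT video_id, title FROM bilibili_video").fetchall()
    conn.close()
    return [first, second, rows]
```

Because the lookup and the write are separate statements, two concurrent writers could both miss the row and insert it twice. A unique constraint on the key column would be one way to guard against that.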

@@ -1,465 +0,0 @@
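# Disclaimer: this code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the principles above and to all terms in LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/1/14 19:34
# @Desc : Bilibili storage implementation classes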
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
def calculate_number_of_files(file_store_path: str) -> int:
    """Compute the leading sequence number for saved data files, so that each run writes to a new file
    Args:
        file_store_path: directory that holds the saved data files
    Returns:
        file nums
    """
    if not os.path.exists(file_store_path):
        return 1
    try:
        return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
    except ValueError:
        return 1
class BiliCsvStoreImplement(AbstractStore):
    csv_store_path: str = "data/bilibili"
    file_count: int = calculate_number_of_files(csv_store_path)

    def make_save_file_name(self, store_type: str) -> str:
        """
        make save file name by store type
        Args:
            store_type: contents or comments
        Returns: eg: data/bilibili/search_comments_20240114.csv ...
        """
        return f"{self.csv_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"

    async def save_data_to_csv(self, save_item: Dict, store_type: str):
        """
        Below is a simple way to save it in CSV format.
        Args:
            save_item: save content dict info
            store_type: save type, contents | comments
        Returns: no returns
        """
        pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
        save_file_name = self.make_save_file_name(store_type=store_type)
        async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
            writer = csv.writer(f)
            # Write the header row only when the file is still empty
            if await f.tell() == 0:
                await writer.writerow(save_item.keys())
            await writer.writerow(save_item.values())

    async def store_content(self, content_item: Dict):
        """
        Bilibili content CSV storage implementation
        Args:
            content_item: note item dict
        Returns:
        """
        await self.save_data_to_csv(save_item=content_item, store_type="contents")

    async def store_comment(self, comment_item: Dict):
        """
        Bilibili comment CSV storage implementation
        Args:
            comment_item: comment item dict
        Returns:
        """
        await self.save_data_to_csv(save_item=comment_item, store_type="comments")

    async def store_creator(self, creator: Dict):
        """
        Bilibili creator CSV storage implementation
        Args:
            creator: creator item dict
        Returns:
        """
        await self.save_data_to_csv(save_item=creator, store_type="creators")

    async def store_contact(self, contact_item: Dict):
        """
        Bilibili contact CSV storage implementation
        Args:
            contact_item: creator's contact item dict
        Returns:
        """
        await self.save_data_to_csv(save_item=contact_item, store_type="contacts")

    async def store_dynamic(self, dynamic_item: Dict):
        """
        Bilibili dynamic CSV storage implementation
        Args:
            dynamic_item: creator's dynamic item dict
        Returns:
        """
        await self.save_data_to_csv(save_item=dynamic_item, store_type="dynamics")
class BiliDbStoreImplement(AbstractStore):
    async def store_content(self, content_item: Dict):
        """
        Bilibili content DB storage implementation
        Args:
            content_item: content item dict
        Returns:
        """
        from .bilibili_store_sql import (add_new_content,
                                         query_content_by_content_id,
                                         update_content_by_content_id)
        video_id = content_item.get("video_id")
        video_detail: Dict = await query_content_by_content_id(content_id=video_id)
        if not video_detail:
            content_item["add_ts"] = utils.get_current_timestamp()
            await add_new_content(content_item)
        else:
            await update_content_by_content_id(video_id, content_item=content_item)

    async def store_comment(self, comment_item: Dict):
        """
        Bilibili comment DB storage implementation
        Args:
            comment_item: comment item dict
        Returns:
        """
        from .bilibili_store_sql import (add_new_comment,
                                         query_comment_by_comment_id,
                                         update_comment_by_comment_id)
        comment_id = comment_item.get("comment_id")
        comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
        if not comment_detail:
            comment_item["add_ts"] = utils.get_current_timestamp()
            await add_new_comment(comment_item)
        else:
            await update_comment_by_comment_id(comment_id, comment_item=comment_item)

    async def store_creator(self, creator: Dict):
        """
        Bilibili creator DB storage implementation
        Args:
            creator: creator item dict
        Returns:
        """
        from .bilibili_store_sql import (add_new_creator,
                                         query_creator_by_creator_id,
                                         update_creator_by_creator_id)
        creator_id = creator.get("user_id")
        creator_detail: Dict = await query_creator_by_creator_id(creator_id=creator_id)
        if not creator_detail:
            creator["add_ts"] = utils.get_current_timestamp()
            await add_new_creator(creator)
        else:
            await update_creator_by_creator_id(creator_id, creator_item=creator)

    async def store_contact(self, contact_item: Dict):
        """
        Bilibili contact DB storage implementation
        Args:
            contact_item: contact item dict
        Returns:
        """
        from .bilibili_store_sql import (add_new_contact,
                                         query_contact_by_up_and_fan,
                                         update_contact_by_id, )
        up_id = contact_item.get("up_id")
        fan_id = contact_item.get("fan_id")
        contact_detail: Dict = await query_contact_by_up_and_fan(up_id=up_id, fan_id=fan_id)
        if not contact_detail:
            contact_item["add_ts"] = utils.get_current_timestamp()
            await add_new_contact(contact_item)
        else:
            key_id = contact_detail.get("id")
            await update_contact_by_id(id=key_id, contact_item=contact_item)

    async def store_dynamic(self, dynamic_item):
        """
        Bilibili dynamic DB storage implementation
        Args:
            dynamic_item: dynamic item dict
        Returns:
        """
        from .bilibili_store_sql import (add_new_dynamic,
                                         query_dynamic_by_dynamic_id,
                                         update_dynamic_by_dynamic_id)
        dynamic_id = dynamic_item.get("dynamic_id")
        dynamic_detail = await query_dynamic_by_dynamic_id(dynamic_id=dynamic_id)
        if not dynamic_detail:
            dynamic_item["add_ts"] = utils.get_current_timestamp()
            await add_new_dynamic(dynamic_item)
        else:
            await update_dynamic_by_dynamic_id(dynamic_id, dynamic_item=dynamic_item)
class BiliJsonStoreImplement(AbstractStore):
json_store_path: str = "data/bilibili/json"
words_store_path: str = "data/bilibili/words"
lock = asyncio.Lock()
file_count:int=calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> (str,str):
"""
make save file name by store type
Args:
store_type: Save type contains content and commentscontents | comments
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: Save type contains content and commentscontents | comments
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name,words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except Exception:
# Word cloud generation is best-effort; a failure must not abort the save
pass
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")
async def store_creator(self, creator: Dict):
"""
creator JSON storage implementation
Args:
creator:
Returns:
"""
await self.save_data_to_json(creator, "creators")
async def store_contact(self, contact_item: Dict):
"""
creator contact JSON storage implementation
Args:
contact_item: creator's contact item dict
Returns:
"""
await self.save_data_to_json(save_item=contact_item, store_type="contacts")
async def store_dynamic(self, dynamic_item: Dict):
"""
creator dynamic JSON storage implementation
Args:
dynamic_item: creator's dynamic item dict
Returns:
"""
await self.save_data_to_json(save_item=dynamic_item, store_type="dynamics")
class BiliSqliteStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Bilibili content SQLite storage implementation
Args:
content_item: content item dict
Returns:
"""
from .bilibili_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
video_id = content_item.get("video_id")
video_detail: Dict = await query_content_by_content_id(content_id=video_id)
if not video_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(video_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Bilibili comment SQLite storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .bilibili_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Bilibili creator SQLite storage implementation
Args:
creator: creator item dict
Returns:
"""
from .bilibili_store_sql import (add_new_creator,
query_creator_by_creator_id,
update_creator_by_creator_id)
creator_id = creator.get("user_id")
creator_detail: Dict = await query_creator_by_creator_id(creator_id=creator_id)
if not creator_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_creator_id(creator_id, creator_item=creator)
async def store_contact(self, contact_item: Dict):
"""
Bilibili contact SQLite storage implementation
Args:
contact_item: contact item dict
Returns:
"""
from .bilibili_store_sql import (add_new_contact,
query_contact_by_up_and_fan,
update_contact_by_id, )
up_id = contact_item.get("up_id")
fan_id = contact_item.get("fan_id")
contact_detail: Dict = await query_contact_by_up_and_fan(up_id=up_id, fan_id=fan_id)
if not contact_detail:
contact_item["add_ts"] = utils.get_current_timestamp()
await add_new_contact(contact_item)
else:
key_id = contact_detail.get("id")
await update_contact_by_id(id=key_id, contact_item=contact_item)
async def store_dynamic(self, dynamic_item):
"""
Bilibili dynamic SQLite storage implementation
Args:
dynamic_item: dynamic item dict
Returns:
"""
from .bilibili_store_sql import (add_new_dynamic,
query_dynamic_by_dynamic_id,
update_dynamic_by_dynamic_id)
dynamic_id = dynamic_item.get("dynamic_id")
dynamic_detail = await query_dynamic_by_dynamic_id(dynamic_id=dynamic_id)
if not dynamic_detail:
dynamic_item["add_ts"] = utils.get_current_timestamp()
await add_new_dynamic(dynamic_item)
else:
await update_dynamic_by_dynamic_id(dynamic_id, dynamic_item=dynamic_item)
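The JSON store above appends each item by reading the whole file, appending to the list, and rewriting the file while holding an `asyncio.Lock`. A minimal sketch of that read-modify-write pattern follows. It assumes plain blocking file I/O in place of `aiofiles`, and the `append_item` helper, the `demo` driver, and the temporary path are hypothetical names that do not come from the repository.

```python
import asyncio
import json
import os
import tempfile


async def append_item(path: str, item: dict, lock: asyncio.Lock) -> None:
    # Hypothetical helper: the lock serializes the whole read-modify-write cycle
    async with lock:
        data = []
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                data = json.load(f)
        # Yield to the event loop, standing in for the awaits that async file
        # I/O would introduce between the read and the write
        await asyncio.sleep(0)
        data.append(item)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False)


async def demo() -> list:
    lock = asyncio.Lock()
    path = os.path.join(tempfile.mkdtemp(), "search_contents.json")
    # Ten concurrent writers share one lock, so no append is lost
    await asyncio.gather(*(append_item(path, {"video_id": i}, lock) for i in range(10)))
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


items = asyncio.run(demo())
print(len(items))
```

Because every append rewrites the entire file, the cost of a single write grows with the number of items already stored.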


@@ -1,253 +0,0 @@
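# Disclaimer: This code is for learning and research purposes only. Users must abide by the following principles:
# 1. It must not be used for any commercial purpose.
# 2. Use must comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the detailed license terms.
# By using this code, you agree to abide by the principles above and all terms in the LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 15:30
# @Desc : collection of SQL interfaces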
from typing import Dict, List, Union
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from var import media_crawler_db_var
async def query_content_by_content_id(content_id: str) -> Dict:
"""
Query a single content record (xhs post | Douyin video | Weibo post | Kuaishou video ...)
Args:
content_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from bilibili_video where video_id = '{content_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_content(content_item: Dict) -> int:
"""
Insert a new content record (xhs post | Douyin video | Weibo post | Kuaishou video ...)
Args:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("bilibili_video", content_item)
return last_row_id
async def update_content_by_content_id(content_id: str, content_item: Dict) -> int:
"""
Update a content record (xhs post | Douyin video | Weibo post | Kuaishou video ...)
Args:
content_id:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("bilibili_video", content_item, "video_id", content_id)
return effect_row
async def query_comment_by_comment_id(comment_id: str) -> Dict:
"""
Query a single comment record
Args:
comment_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from bilibili_video_comment where comment_id = '{comment_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_comment(comment_item: Dict) -> int:
"""
Insert a new comment record
Args:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("bilibili_video_comment", comment_item)
return last_row_id
async def update_comment_by_comment_id(comment_id: str, comment_item: Dict) -> int:
"""
Update a comment record
Args:
comment_id:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("bilibili_video_comment", comment_item, "comment_id", comment_id)
return effect_row
async def query_creator_by_creator_id(creator_id: str) -> Dict:
"""
Query uploader (UP) info
Args:
creator_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from bilibili_up_info where user_id = '{creator_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_creator(creator_item: Dict) -> int:
"""
Insert new uploader (UP) info
Args:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("bilibili_up_info", creator_item)
return last_row_id
async def update_creator_by_creator_id(creator_id: str, creator_item: Dict) -> int:
"""
Update uploader (UP) info
Args:
creator_id:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("bilibili_up_info", creator_item, "user_id", creator_id)
return effect_row
async def query_contact_by_up_and_fan(up_id: str, fan_id: str) -> Dict:
"""
Query a single uploader-fan relationship record
Args:
up_id:
fan_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from bilibili_contact_info where up_id = '{up_id}' and fan_id = '{fan_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_contact(contact_item: Dict) -> int:
"""
Insert a new uploader-fan relationship record
Args:
contact_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("bilibili_contact_info", contact_item)
return last_row_id
async def update_contact_by_id(id: str, contact_item: Dict) -> int:
"""
Update an uploader-fan relationship record
Args:
id:
contact_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("bilibili_contact_info", contact_item, "id", id)
return effect_row
async def query_dynamic_by_dynamic_id(dynamic_id: str) -> Dict:
"""
Query a single dynamic record
Args:
dynamic_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from bilibili_up_dynamic where dynamic_id = '{dynamic_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_dynamic(dynamic_item: Dict) -> int:
"""
Insert a new dynamic record
Args:
dynamic_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("bilibili_up_dynamic", dynamic_item)
return last_row_id
async def update_dynamic_by_dynamic_id(dynamic_id: str, dynamic_item: Dict) -> int:
"""
Update a dynamic record
Args:
dynamic_id:
dynamic_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("bilibili_up_dynamic", dynamic_item, "dynamic_id", dynamic_id)
return effect_row
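The query helpers in this file build SQL by interpolating IDs into f-strings, which breaks on an ID that contains a quote and is open to SQL injection. A minimal sketch of the bound-parameter alternative follows. It uses the standard-library `sqlite3` module rather than the project's `AsyncMysqlDB` / `AsyncSqliteDB` wrappers (their `query` signature is not shown here), and the two-column table layout and the `query_video` helper are illustrative assumptions.

```python
import sqlite3

# In-memory database with an assumed, simplified table layout
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE bilibili_video (video_id TEXT PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO bilibili_video VALUES (?, ?)", ("BV1", "demo"))


def query_video(video_id: str) -> dict:
    # The driver binds video_id as data, so quotes in it cannot alter the SQL
    row = conn.execute(
        "SELECT * FROM bilibili_video WHERE video_id = ?", (video_id,)
    ).fetchone()
    return dict(row) if row else {}


found = query_video("BV1")
missing = query_video("x' OR '1'='1")
print(found, missing)
```

With string interpolation the second call would match every row; with a bound parameter it is treated as an ordinary, non-matching ID.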


@@ -17,7 +17,7 @@ from typing import List
import config
from var import source_keyword_var
from .douyin_store_impl import *
from ._store_impl import *
from .douyin_store_media import *

198
store/douyin/_store_impl.py Normal file

@@ -0,0 +1,198 @@
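# Disclaimer: This code is for learning and research purposes only. Users must abide by the following principles:
# 1. It must not be used for any commercial purpose.
# 2. Use must comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the detailed license terms.
# By using this code, you agree to abide by the principles above and all terms in the LICENSE.
# -*- coding: utf-8 -*-
# @Author : persist1@126.com
# @Time : 2025/9/5 19:34
# @Desc : Douyin storage implementation classes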
import asyncio
import json
import os
import pathlib
from typing import Dict
from sqlalchemy import select
import config
from base.base_crawler import AbstractStore
from database.db_session import get_session
from database.models import DouyinAweme, DouyinAwemeComment, DyCreator
from tools import utils, words
from tools.async_file_writer import AsyncFileWriter
from var import crawler_type_var
class DouyinCsvStoreImplement(AbstractStore):
def __init__(self):
self.file_writer = AsyncFileWriter(
crawler_type=crawler_type_var.get(),
platform="douyin"
)
async def store_content(self, content_item: Dict):
"""
Douyin content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.file_writer.write_to_csv(
item=content_item,
item_type="contents"
)
async def store_comment(self, comment_item: Dict):
"""
Douyin comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.file_writer.write_to_csv(
item=comment_item,
item_type="comments"
)
async def store_creator(self, creator: Dict):
"""
Douyin creator CSV storage implementation
Args:
creator: creator item dict
Returns:
"""
await self.file_writer.write_to_csv(
item=creator,
item_type="creators"
)
class DouyinDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Douyin content DB storage implementation
Args:
content_item: content item dict
"""
aweme_id = content_item.get("aweme_id")
async with get_session() as session:
result = await session.execute(select(DouyinAweme).where(DouyinAweme.aweme_id == aweme_id))
aweme_detail = result.scalar_one_or_none()
if not aweme_detail:
content_item["add_ts"] = utils.get_current_timestamp()
if content_item.get("title"):
new_content = DouyinAweme(**content_item)
session.add(new_content)
else:
for key, value in content_item.items():
setattr(aweme_detail, key, value)
await session.commit()
async def store_comment(self, comment_item: Dict):
"""
Douyin comment DB storage implementation
Args:
comment_item: comment item dict
"""
comment_id = comment_item.get("comment_id")
async with get_session() as session:
result = await session.execute(select(DouyinAwemeComment).where(DouyinAwemeComment.comment_id == comment_id))
comment_detail = result.scalar_one_or_none()
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
new_comment = DouyinAwemeComment(**comment_item)
session.add(new_comment)
else:
for key, value in comment_item.items():
setattr(comment_detail, key, value)
await session.commit()
async def store_creator(self, creator: Dict):
"""
Douyin creator DB storage implementation
Args:
creator: creator dict
"""
user_id = creator.get("user_id")
async with get_session() as session:
result = await session.execute(select(DyCreator).where(DyCreator.user_id == user_id))
user_detail = result.scalar_one_or_none()
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
new_creator = DyCreator(**creator)
session.add(new_creator)
else:
for key, value in creator.items():
setattr(user_detail, key, value)
await session.commit()
class DouyinJsonStoreImplement(AbstractStore):
def __init__(self):
self.file_writer = AsyncFileWriter(
crawler_type=crawler_type_var.get(),
platform="douyin"
)
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.file_writer.write_single_item_to_json(
item=content_item,
item_type="contents"
)
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.file_writer.write_single_item_to_json(
item=comment_item,
item_type="comments"
)
async def store_creator(self, creator: Dict):
"""
creator JSON storage implementation
Args:
creator:
Returns:
"""
await self.file_writer.write_single_item_to_json(
item=creator,
item_type="creators"
)
class DouyinSqliteStoreImplement(DouyinDbStoreImplement):
pass
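Each `store_*` method in `DouyinDbStoreImplement` follows the same flow: select by ID, insert when nothing is found, otherwise copy the incoming fields onto the existing row. A minimal in-memory sketch of that flow follows. It assumes a plain dataclass and a dict in place of the SQLAlchemy model and session; the `Aweme` class, the `table` dict, and the `upsert` helper are all hypothetical stand-ins, and the millisecond timestamp only approximates whatever `utils.get_current_timestamp()` returns.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Aweme:
    # Hypothetical stand-in for the ORM model, reduced to a few fields
    aweme_id: str
    title: str = ""
    liked_count: int = 0
    add_ts: Optional[int] = None


table: Dict[str, Aweme] = {}


def upsert(item: dict) -> Aweme:
    existing = table.get(item["aweme_id"])
    if existing is None:
        # Insert branch: stamp the creation time once, then build the row
        item["add_ts"] = int(time.time() * 1000)
        existing = table[item["aweme_id"]] = Aweme(**item)
    else:
        # Update branch: copy every incoming field onto the stored row
        for key, value in item.items():
            setattr(existing, key, value)
    return existing


first = upsert({"aweme_id": "1", "title": "hello", "liked_count": 3})
second = upsert({"aweme_id": "1", "title": "hello", "liked_count": 8})
print(second)
```

The second call takes the update branch, so `add_ts` keeps the value set on insert while `liked_count` is overwritten.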


@@ -1,324 +0,0 @@
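# Disclaimer: This code is for learning and research purposes only. Users must abide by the following principles:
# 1. It must not be used for any commercial purpose.
# 2. Use must comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the detailed license terms.
# By using this code, you agree to abide by the principles above and all terms in the LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/1/14 18:46
# @Desc : Douyin storage implementation classes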
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
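def calculate_number_of_files(file_store_path: str) -> int:
"""Compute the leading sequence number for saved data files, so that each run writes to a new file instead of reusing an existing one
Args:
file_store_path: directory that holds the saved files
Returns:
the next file sequence number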
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class DouyinCsvStoreImplement(AbstractStore):
csv_store_path: str = "data/douyin"
file_count: int = calculate_number_of_files(csv_store_path)
def make_save_file_name(self, store_type: str) -> str:
"""
make save file name by store type
Args:
store_type: contents or comments
Returns: eg: data/douyin/1_search_comments_20240114.csv ...
"""
return f"{self.csv_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
async def save_data_to_csv(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in CSV format.
Args:
save_item: save content dict info
store_type: save type, e.g. contents | comments
Returns: no returns
"""
pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
save_file_name = self.make_save_file_name(store_type=store_type)
async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
writer = csv.writer(f)
if await f.tell() == 0:
await writer.writerow(save_item.keys())
await writer.writerow(save_item.values())
async def store_content(self, content_item: Dict):
"""
Douyin content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.save_data_to_csv(save_item=content_item, store_type="contents")
async def store_comment(self, comment_item: Dict):
"""
Douyin comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.save_data_to_csv(save_item=comment_item, store_type="comments")
async def store_creator(self, creator: Dict):
"""
Douyin creator CSV storage implementation
Args:
creator: creator item dict
Returns:
"""
await self.save_data_to_csv(save_item=creator, store_type="creator")
class DouyinDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Douyin content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
from .douyin_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
aweme_id = content_item.get("aweme_id")
aweme_detail: Dict = await query_content_by_content_id(content_id=aweme_id)
if not aweme_detail:
content_item["add_ts"] = utils.get_current_timestamp()
if content_item.get("title"):
await add_new_content(content_item)
else:
await update_content_by_content_id(aweme_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Douyin comment DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .douyin_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Douyin creator DB storage implementation
Args:
creator: creator dict
Returns:
"""
from .douyin_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)
class DouyinJsonStoreImplement(AbstractStore):
json_store_path: str = "data/douyin/json"
words_store_path: str = "data/douyin/words"
lock = asyncio.Lock()
file_count: int = calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> (str,str):
"""
make save file name by store type
Args:
store_type: save type, e.g. contents | comments
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: save type, e.g. contents | comments
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name,words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False, indent=4))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except Exception:
# Word cloud generation is best-effort; a failure must not abort the save
pass
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")
async def store_creator(self, creator: Dict):
"""
Douyin creator JSON storage implementation
Args:
creator: creator item dict
Returns:
"""
await self.save_data_to_json(save_item=creator, store_type="creator")
class DouyinSqliteStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Douyin content SQLite storage implementation
Args:
content_item: content item dict
Returns:
"""
from .douyin_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
aweme_id = content_item.get("aweme_id")
aweme_detail: Dict = await query_content_by_content_id(content_id=aweme_id)
if not aweme_detail:
content_item["add_ts"] = utils.get_current_timestamp()
if content_item.get("title"):
await add_new_content(content_item)
else:
await update_content_by_content_id(aweme_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Douyin comment SQLite storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .douyin_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Douyin creator SQLite storage implementation
Args:
creator: creator dict
Returns:
"""
from .douyin_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)


@@ -1,160 +0,0 @@
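# Disclaimer: This code is for learning and research purposes only. Users must abide by the following principles:
# 1. It must not be used for any commercial purpose.
# 2. Use must comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the detailed license terms.
# By using this code, you agree to abide by the principles above and all terms in the LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 15:30
# @Desc : collection of SQL interfaces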
from typing import Dict, List, Union
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from var import media_crawler_db_var
async def query_content_by_content_id(content_id: str) -> Dict:
"""
Query a single content record (xhs post | Douyin video | Weibo post | Kuaishou video ...)
Args:
content_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from douyin_aweme where aweme_id = '{content_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_content(content_item: Dict) -> int:
"""
Insert a new content record (xhs post | Douyin video | Weibo post | Kuaishou video ...)
Args:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("douyin_aweme", content_item)
return last_row_id
async def update_content_by_content_id(content_id: str, content_item: Dict) -> int:
"""
Update a content record (xhs post | Douyin video | Weibo post | Kuaishou video ...)
Args:
content_id:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("douyin_aweme", content_item, "aweme_id", content_id)
return effect_row
async def query_comment_by_comment_id(comment_id: str) -> Dict:
"""
Query a single comment record
Args:
comment_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from douyin_aweme_comment where comment_id = '{comment_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_comment(comment_item: Dict) -> int:
"""
Insert a new comment record
Args:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("douyin_aweme_comment", comment_item)
return last_row_id
async def update_comment_by_comment_id(comment_id: str, comment_item: Dict) -> int:
"""
Update a comment record
Args:
comment_id:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("douyin_aweme_comment", comment_item, "comment_id", comment_id)
return effect_row
async def query_creator_by_user_id(user_id: str) -> Dict:
"""
Query a single creator record
Args:
user_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from dy_creator where user_id = '{user_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_creator(creator_item: Dict) -> int:
"""
Insert a new creator record
Args:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("dy_creator", creator_item)
return last_row_id
async def update_creator_by_user_id(user_id: str, creator_item: Dict) -> int:
"""
Update a creator record
Args:
user_id:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("dy_creator", creator_item, "user_id", user_id)
return effect_row


@@ -18,7 +18,7 @@ from typing import List
import config
from var import source_keyword_var
from .kuaishou_store_impl import *
from ._store_impl import *
class KuaishouStoreFactory:


@@ -0,0 +1,160 @@
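# Disclaimer: This code is for learning and research purposes only. Users must abide by the following principles:
# 1. It must not be used for any commercial purpose.
# 2. Use must comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the detailed license terms.
# By using this code, you agree to abide by the principles above and all terms in the LICENSE.
# -*- coding: utf-8 -*-
# @Author : persist1@126.com
# @Time : 2025/9/5 19:34
# @Desc : Kuaishou storage implementation classes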
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
from tools.async_file_writer import AsyncFileWriter
import aiofiles
from sqlalchemy import select
import config
from base.base_crawler import AbstractStore
from database.db_session import get_session
from database.models import KuaishouVideo, KuaishouVideoComment
from tools import utils, words
from var import crawler_type_var
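def calculate_number_of_files(file_store_path: str) -> int:
"""Compute the leading sequence number for saved data files, so that each run writes to a new file instead of reusing an existing one
Args:
file_store_path: directory that holds the saved files
Returns:
the next file sequence number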
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class KuaishouCsvStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="kuaishou", crawler_type=kwargs.get("crawler_type"))
async def store_content(self, content_item: Dict):
"""
Kuaishou content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.writer.write_to_csv(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Kuaishou comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.writer.write_to_csv(item_type="comments", item=comment_item)
async def store_creator(self, creator: Dict):
pass
class KuaishouDbStoreImplement(AbstractStore):
async def store_creator(self, creator: Dict):
pass
async def store_content(self, content_item: Dict):
"""
Kuaishou content DB storage implementation
Args:
content_item: content item dict
"""
video_id = content_item.get("video_id")
async with get_session() as session:
result = await session.execute(select(KuaishouVideo).where(KuaishouVideo.video_id == video_id))
video_detail = result.scalar_one_or_none()
if not video_detail:
content_item["add_ts"] = utils.get_current_timestamp()
new_content = KuaishouVideo(**content_item)
session.add(new_content)
else:
for key, value in content_item.items():
setattr(video_detail, key, value)
await session.commit()
async def store_comment(self, comment_item: Dict):
"""
Kuaishou comment DB storage implementation
Args:
comment_item: comment item dict
"""
comment_id = comment_item.get("comment_id")
async with get_session() as session:
result = await session.execute(
select(KuaishouVideoComment).where(KuaishouVideoComment.comment_id == comment_id))
comment_detail = result.scalar_one_or_none()
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
new_comment = KuaishouVideoComment(**comment_item)
session.add(new_comment)
else:
for key, value in comment_item.items():
setattr(comment_detail, key, value)
await session.commit()
class KuaishouJsonStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="kuaishou", crawler_type=kwargs.get("crawler_type"))
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.writer.write_single_item_to_json(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.writer.write_single_item_to_json(item_type="comments", item=comment_item)
async def store_creator(self, creator: Dict):
pass
class KuaishouSqliteStoreImplement(KuaishouDbStoreImplement):
async def store_creator(self, creator: Dict):
pass


@@ -1,290 +0,0 @@
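# Disclaimer: This code is for learning and research purposes only. Users must abide by the following principles:
# 1. It must not be used for any commercial purpose.
# 2. Use must comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the detailed license terms.
# By using this code, you agree to abide by the principles above and all terms in the LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/1/14 20:03
# @Desc : Kuaishou storage implementation classes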
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
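def calculate_number_of_files(file_store_path: str) -> int:
"""Compute the leading sequence number for saved data files, so that each run writes to a new file instead of reusing an existing one
Args:
file_store_path: directory that holds the saved files
Returns:
the next file sequence number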
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0])for file_name in os.listdir(file_store_path)])+1
except ValueError:
return 1
class KuaishouCsvStoreImplement(AbstractStore):
async def store_creator(self, creator: Dict):
pass
csv_store_path: str = "data/kuaishou"
file_count:int=calculate_number_of_files(csv_store_path)
def make_save_file_name(self, store_type: str) -> str:
"""
make save file name by store type
Args:
store_type: contents or comments
Returns: eg: data/kuaishou/1_search_comments_20240114.csv ...
"""
return f"{self.csv_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
async def save_data_to_csv(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in CSV format.
Args:
save_item: save content dict info
store_type: save type, e.g. contents | comments
Returns: no returns
"""
pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
save_file_name = self.make_save_file_name(store_type=store_type)
async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
writer = csv.writer(f)
if await f.tell() == 0:
await writer.writerow(save_item.keys())
await writer.writerow(save_item.values())
async def store_content(self, content_item: Dict):
"""
Kuaishou content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.save_data_to_csv(save_item=content_item, store_type="contents")
async def store_comment(self, comment_item: Dict):
"""
Kuaishou comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.save_data_to_csv(save_item=comment_item, store_type="comments")
class KuaishouDbStoreImplement(AbstractStore):
async def store_creator(self, creator: Dict):
pass
async def store_content(self, content_item: Dict):
"""
Kuaishou content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
from .kuaishou_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
video_id = content_item.get("video_id")
video_detail: Dict = await query_content_by_content_id(content_id=video_id)
if not video_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(video_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Kuaishou comment DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .kuaishou_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
class KuaishouJsonStoreImplement(AbstractStore):
json_store_path: str = "data/kuaishou/json"
words_store_path: str = "data/kuaishou/words"
lock = asyncio.Lock()
file_count:int=calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> (str,str):
"""
make save file name by store type
Args:
store_type: save type, either contents or comments (contents | comments)
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: save type, either contents or comments (contents | comments)
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name,words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except:
pass
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")
async def store_creator(self, creator: Dict):
"""
Kuaishou creator JSON storage implementation
Args:
creator: creator dict
Returns:
"""
await self.save_data_to_json(creator, "creator")
class KuaishouSqliteStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Kuaishou content SQLite storage implementation
Args:
content_item: content item dict
Returns:
"""
from .kuaishou_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
video_id = content_item.get("video_id")
video_detail: Dict = await query_content_by_content_id(content_id=video_id)
if not video_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(video_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Kuaishou comment SQLite storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .kuaishou_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Kuaishou creator SQLite storage implementation
Args:
creator: creator dict
Returns:
"""
pass
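
Review note on `calculate_number_of_files` in the removed file above: it parses the prefix of every file name with `int(...)`, so a single stray file without a numeric prefix raises `ValueError` and the function falls back to 1. The following is a minimal sketch of a more tolerant variant, not the project's code; the name `next_file_index` and the sample file names are assumptions made for illustration.

```python
import os
import tempfile


def next_file_index(file_store_path: str) -> int:
    """Return 1 + the largest numeric prefix among files named '<n>_...'."""
    if not os.path.exists(file_store_path):
        return 1
    indexes = []
    for file_name in os.listdir(file_store_path):
        try:
            indexes.append(int(file_name.split("_")[0]))
        except ValueError:
            continue  # skip stray files such as ".DS_Store"
    return max(indexes, default=0) + 1


with tempfile.TemporaryDirectory() as tmp:
    for name in ("1_search_contents.csv", "2_search_comments.csv", ".DS_Store"):
        open(os.path.join(tmp, name), "w").close()
    print(next_file_index(tmp))  # 3
```

With the same three files, the original function would return 1, because the `.DS` prefix fails to parse and the whole list comprehension is abandoned.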

@@ -1,114 +0,0 @@
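# Disclaimer: this code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable so as not to place unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the principles above and to all terms in LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 15:30
# @Desc : Collection of SQL helper functions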
from typing import Dict, List, Union
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from var import media_crawler_db_var
async def query_content_by_content_id(content_id: str) -> Dict:
"""
Query one content record (xhs note | Douyin video | Weibo post | Kuaishou video ...)
Args:
content_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from kuaishou_video where video_id = '{content_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_content(content_item: Dict) -> int:
"""
Insert one content record (xhs note | Douyin video | Weibo post | Kuaishou video ...)
Args:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("kuaishou_video", content_item)
return last_row_id
async def update_content_by_content_id(content_id: str, content_item: Dict) -> int:
"""
Update one content record (xhs note | Douyin video | Weibo post | Kuaishou video ...)
Args:
content_id:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("kuaishou_video", content_item, "video_id", content_id)
return effect_row
async def query_comment_by_comment_id(comment_id: str) -> Dict:
"""
Query one comment record
Args:
comment_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from kuaishou_video_comment where comment_id = '{comment_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_comment(comment_item: Dict) -> int:
"""
Insert one comment record
Args:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("kuaishou_video_comment", comment_item)
return last_row_id
async def update_comment_by_comment_id(comment_id: str, comment_item: Dict) -> int:
"""
Update one comment record
Args:
comment_id:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("kuaishou_video_comment", comment_item, "comment_id", comment_id)
return effect_row
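
Review note on the removed query helpers above: they build SQL by interpolating the id into an f-string, which breaks on ids containing a quote and is open to SQL injection. The sketch below shows the same lookup with a bound parameter, using the standard-library `sqlite3` module rather than the project's `AsyncMysqlDB` / `AsyncSqliteDB` wrappers; the two-column table is a simplified assumption, not the real `kuaishou_video` schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be converted to dicts
conn.execute("CREATE TABLE kuaishou_video (video_id TEXT PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO kuaishou_video VALUES (?, ?)", ("abc'123", "demo"))


def query_content_by_content_id(content_id: str) -> dict:
    """Look up one row, passing the id as a bound parameter instead of formatting it into the SQL."""
    row = conn.execute(
        "SELECT * FROM kuaishou_video WHERE video_id = ?", (content_id,)
    ).fetchone()
    return dict(row) if row else {}


print(query_content_by_content_id("abc'123"))
```

The ORM-based replacement in this change builds its lookups with `select(...).where(...)`, which also passes values as bound parameters.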

@@ -15,8 +15,7 @@ from typing import List
 from model.m_baidu_tieba import TiebaComment, TiebaCreator, TiebaNote
 from var import source_keyword_var
-from . import tieba_store_impl
-from .tieba_store_impl import *
+from ._store_impl import *
 class TieBaStoreFactory:

store/tieba/_store_impl.py (new file, 192 lines)

@@ -0,0 +1,192 @@
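# Disclaimer: this code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable so as not to place unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the principles above and to all terms in LICENSE.
# -*- coding: utf-8 -*-
# @Author : persist1@126.com
# @Time : 2025/9/5 19:34
# @Desc : Tieba storage implementation classes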
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
import config
from base.base_crawler import AbstractStore
from database.models import TiebaNote, TiebaComment, TiebaCreator
from tools import utils, words
from database.db_session import get_session
from var import crawler_type_var
from tools.async_file_writer import AsyncFileWriter


def calculate_number_of_files(file_store_path: str) -> int:
    """Compute the numeric sort prefix for saved data files, so that each run writes to a new file.

    Args:
        file_store_path: directory where data files are stored
    Returns:
        the next file number
    """
    if not os.path.exists(file_store_path):
        return 1
    try:
        return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
    except ValueError:
        return 1


class TieBaCsvStoreImplement(AbstractStore):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.writer = AsyncFileWriter(platform="tieba", crawler_type=kwargs.get("crawler_type"))

    async def store_content(self, content_item: Dict):
        """
        tieba content CSV storage implementation
        Args:
            content_item: note item dict
        Returns:
        """
        await self.writer.write_to_csv(item_type="contents", item=content_item)

    async def store_comment(self, comment_item: Dict):
        """
        tieba comment CSV storage implementation
        Args:
            comment_item: comment item dict
        Returns:
        """
        await self.writer.write_to_csv(item_type="comments", item=comment_item)

    async def store_creator(self, creator: Dict):
        """
        tieba creator CSV storage implementation
        Args:
            creator: creator dict
        Returns:
        """
        await self.writer.write_to_csv(item_type="creators", item=creator)


class TieBaDbStoreImplement(AbstractStore):
    async def store_content(self, content_item: Dict):
        """
        tieba content DB storage implementation
        Args:
            content_item: content item dict
        """
        note_id = content_item.get("note_id")
        async with get_session() as session:
            # Look up the existing row by note_id, then update it or insert a new one.
            stmt = select(TiebaNote).where(TiebaNote.note_id == note_id)
            res = await session.execute(stmt)
            db_note = res.scalar_one_or_none()
            if db_note:
                for key, value in content_item.items():
                    setattr(db_note, key, value)
            else:
                db_note = TiebaNote(**content_item)
                session.add(db_note)
            await session.commit()

    async def store_comment(self, comment_item: Dict):
        """
        tieba comment DB storage implementation
        Args:
            comment_item: comment item dict
        """
        comment_id = comment_item.get("comment_id")
        async with get_session() as session:
            # Look up the existing row by comment_id, then update it or insert a new one.
            stmt = select(TiebaComment).where(TiebaComment.comment_id == comment_id)
            res = await session.execute(stmt)
            db_comment = res.scalar_one_or_none()
            if db_comment:
                for key, value in comment_item.items():
                    setattr(db_comment, key, value)
            else:
                db_comment = TiebaComment(**comment_item)
                session.add(db_comment)
            await session.commit()

    async def store_creator(self, creator: Dict):
        """
        tieba creator DB storage implementation
        Args:
            creator: creator dict
        """
        user_id = creator.get("user_id")
        async with get_session() as session:
            # Look up the existing row by user_id, then update it or insert a new one.
            stmt = select(TiebaCreator).where(TiebaCreator.user_id == user_id)
            res = await session.execute(stmt)
            db_creator = res.scalar_one_or_none()
            if db_creator:
                for key, value in creator.items():
                    setattr(db_creator, key, value)
            else:
                db_creator = TiebaCreator(**creator)
                session.add(db_creator)
            await session.commit()


class TieBaJsonStoreImplement(AbstractStore):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.writer = AsyncFileWriter(platform="tieba", crawler_type=kwargs.get("crawler_type"))

    async def store_content(self, content_item: Dict):
        """
        tieba content JSON storage implementation
        Args:
            content_item: note item dict
        Returns:
        """
        await self.writer.write_single_item_to_json(item_type="contents", item=content_item)

    async def store_comment(self, comment_item: Dict):
        """
        tieba comment JSON storage implementation
        Args:
            comment_item: comment item dict
        Returns:
        """
        await self.writer.write_single_item_to_json(item_type="comments", item=comment_item)

    async def store_creator(self, creator: Dict):
        """
        tieba creator JSON storage implementation
        Args:
            creator: creator dict
        Returns:
        """
        await self.writer.write_single_item_to_json(item_type="creators", item=creator)


class TieBaSqliteStoreImplement(TieBaDbStoreImplement):
    """
    Tieba sqlite store implement
    """
    pass
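
Review note on `TieBaDbStoreImplement` above: its update path copies every key of the item onto the row with `setattr`, while the Weibo implementation in this same change first checks `hasattr`. The sketch below illustrates that guard on a plain Python class so it runs without a database. `NoteRow` and `apply_known_fields` are hypothetical names, and the class attributes only stand in for ORM columns; how an unknown key behaves on a real SQLAlchemy model should be checked against the SQLAlchemy documentation.

```python
from typing import Any, Dict


class NoteRow:
    """Hypothetical stand-in for an ORM model; class attributes play the role of columns."""
    note_id = None
    title = None
    last_modify_ts = None


def apply_known_fields(row: Any, item: Dict[str, Any]) -> Dict[str, Any]:
    """Copy only keys that already exist on the row and return the keys that were skipped."""
    skipped = {}
    for key, value in item.items():
        if hasattr(row, key):
            setattr(row, key, value)
        else:
            skipped[key] = value
    return skipped


row = NoteRow()
skipped = apply_known_fields(row, {"note_id": "1", "title": "hi", "unexpected": 42})
print(row.title, skipped)
```

Returning the skipped keys, instead of dropping them silently, makes it easier to notice when the crawler starts producing a field the table does not have.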

@@ -1,318 +0,0 @@
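# Disclaimer: this code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable so as not to place unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the principles above and to all terms in LICENSE.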
# -*- coding: utf-8 -*-
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
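def calculate_number_of_files(file_store_path: str) -> int:
"""Compute the numeric sort prefix for saved data files, so that each run writes to a new file.
Args:
file_store_path: directory where data files are stored
Returns:
the next file number
"""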
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class TieBaCsvStoreImplement(AbstractStore):
csv_store_path: str = "data/tieba"
file_count: int = calculate_number_of_files(csv_store_path)
def make_save_file_name(self, store_type: str) -> str:
"""
make save file name by store type
Args:
store_type: contents or comments
Returns: e.g. data/tieba/1_search_comments_20240114.csv ...
"""
return f"{self.csv_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
async def save_data_to_csv(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in CSV format.
Args:
save_item: save content dict info
store_type: save type, either contents or comments (contents | comments)
Returns: no returns
"""
pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
save_file_name = self.make_save_file_name(store_type=store_type)
async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
f.fileno()
writer = csv.writer(f)
if await f.tell() == 0:
await writer.writerow(save_item.keys())
await writer.writerow(save_item.values())
async def store_content(self, content_item: Dict):
"""
tieba content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.save_data_to_csv(save_item=content_item, store_type="contents")
async def store_comment(self, comment_item: Dict):
"""
tieba comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.save_data_to_csv(save_item=comment_item, store_type="comments")
async def store_creator(self, creator: Dict):
"""
tieba creator CSV storage implementation
Args:
creator: creator dict
Returns:
"""
await self.save_data_to_csv(save_item=creator, store_type="creator")
class TieBaDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
tieba content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
from .tieba_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
note_id = content_item.get("note_id")
note_detail: Dict = await query_content_by_content_id(content_id=note_id)
if not note_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(note_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
tieba comment DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .tieba_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
tieba creator DB storage implementation
Args:
creator: creator dict
Returns:
"""
from .tieba_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)
class TieBaJsonStoreImplement(AbstractStore):
json_store_path: str = "data/tieba/json"
words_store_path: str = "data/tieba/words"
lock = asyncio.Lock()
file_count: int = calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> (str, str):
"""
make save file name by store type
Args:
store_type: save type, either contents or comments (contents | comments)
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: save type, either contents or comments (contents | comments)
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name, words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except:
pass
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")
async def store_creator(self, creator: Dict):
"""
tieba creator JSON storage implementation
Args:
creator: creator dict
Returns:
"""
await self.save_data_to_json(creator, "creator")
class TieBaSqliteStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
tieba content SQLite storage implementation
Args:
content_item: content item dict
Returns:
"""
from .tieba_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
note_id = content_item.get("note_id")
note_detail: Dict = await query_content_by_content_id(content_id=note_id)
if not note_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(note_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
tieba comment SQLite storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .tieba_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
tieba creator SQLite storage implementation
Args:
creator: creator dict
Returns:
"""
from .tieba_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)
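
Review note on the removed `save_data_to_json` above: every call reads the whole JSON array, appends one item and rewrites the file, and an `asyncio.Lock` keeps concurrent tasks from overwriting each other's appends. The sketch below reproduces that read-modify-write cycle with the standard library only. It uses blocking `open` plus `asyncio.sleep(0)` in place of `aiofiles`, and the function names and file name are assumptions for illustration.

```python
import asyncio
import json
import os
import tempfile


async def append_item(path: str, item: dict, lock: asyncio.Lock) -> None:
    """Read the whole JSON array, append one item, and write it back."""
    async with lock:  # serialize the read-modify-write cycle
        data = []
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                data = json.load(f)
        await asyncio.sleep(0)  # yield to other tasks, as an awaited file read would
        data.append(item)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False)


async def main() -> int:
    lock = asyncio.Lock()  # created inside the running event loop
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "search_comments.json")
        await asyncio.gather(*(append_item(path, {"n": i}, lock) for i in range(20)))
        with open(path, encoding="utf-8") as f:
            return len(json.load(f))


print(asyncio.run(main()))  # 20
```

Because the full file is rewritten on each append, the cost grows with the number of stored items, which is one plausible reason to move this logic into a dedicated writer.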

@@ -1,156 +0,0 @@
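# Disclaimer: this code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable so as not to place unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the principles above and to all terms in LICENSE.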
# -*- coding: utf-8 -*-
from typing import Dict, List, Union
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from var import media_crawler_db_var
async def query_content_by_content_id(content_id: str) -> Dict:
"""
Query one content record (xhs note | Douyin video | Weibo post | Kuaishou video ...)
Args:
content_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from tieba_note where note_id = '{content_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_content(content_item: Dict) -> int:
"""
Insert one content record (xhs note | Douyin video | Weibo post | Kuaishou video ...)
Args:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("tieba_note", content_item)
return last_row_id
async def update_content_by_content_id(content_id: str, content_item: Dict) -> int:
"""
Update one content record (xhs note | Douyin video | Weibo post | Kuaishou video ...)
Args:
content_id:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("tieba_note", content_item, "note_id", content_id)
return effect_row
async def query_comment_by_comment_id(comment_id: str) -> Dict:
"""
Query one comment record
Args:
comment_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from tieba_comment where comment_id = '{comment_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_comment(comment_item: Dict) -> int:
"""
Insert one comment record
Args:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("tieba_comment", comment_item)
return last_row_id
async def update_comment_by_comment_id(comment_id: str, comment_item: Dict) -> int:
"""
Update one comment record
Args:
comment_id:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("tieba_comment", comment_item, "comment_id", comment_id)
return effect_row
async def query_creator_by_user_id(user_id: str) -> Dict:
"""
Query one creator record
Args:
user_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from tieba_creator where user_id = '{user_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_creator(creator_item: Dict) -> int:
"""
Insert one creator record
Args:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("tieba_creator", creator_item)
return last_row_id
async def update_creator_by_user_id(user_id: str, creator_item: Dict) -> int:
"""
Update one creator record
Args:
user_id:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("tieba_creator", creator_item, "user_id", user_id)
return effect_row

@@ -19,7 +19,7 @@ from typing import List
 from var import source_keyword_var
 from .weibo_store_media import *
-from .weibo_store_impl import *
+from ._store_impl import *
 class WeibostoreFactory:

store/weibo/_store_impl.py (new file, 214 lines)

@@ -0,0 +1,214 @@
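# Disclaimer: this code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable so as not to place unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the principles above and to all terms in LICENSE.
# -*- coding: utf-8 -*-
# @Author : persist1@126.com
# @Time : 2025/9/5 19:34
# @Desc : Weibo storage implementation classes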
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
import config
from base.base_crawler import AbstractStore
from database.models import WeiboCreator, WeiboNote, WeiboNoteComment
from tools import utils, words
from tools.async_file_writer import AsyncFileWriter
from database.db_session import get_session
from var import crawler_type_var


def calculate_number_of_files(file_store_path: str) -> int:
    """Compute the numeric sort prefix for saved data files, so that each run writes to a new file.

    Args:
        file_store_path: directory where data files are stored
    Returns:
        the next file number
    """
    if not os.path.exists(file_store_path):
        return 1
    try:
        return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
    except ValueError:
        return 1


class WeiboCsvStoreImplement(AbstractStore):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.writer = AsyncFileWriter(platform="weibo", crawler_type=kwargs.get("crawler_type"))

    async def store_content(self, content_item: Dict):
        """
        Weibo content CSV storage implementation
        Args:
            content_item: note item dict
        Returns:
        """
        await self.writer.write_to_csv(item_type="contents", item=content_item)

    async def store_comment(self, comment_item: Dict):
        """
        Weibo comment CSV storage implementation
        Args:
            comment_item: comment item dict
        Returns:
        """
        await self.writer.write_to_csv(item_type="comments", item=comment_item)

    async def store_creator(self, creator: Dict):
        """
        Weibo creator CSV storage implementation
        Args:
            creator: creator dict
        Returns:
        """
        await self.writer.write_to_csv(item_type="creators", item=creator)


class WeiboDbStoreImplement(AbstractStore):
    async def store_content(self, content_item: Dict):
        """
        Weibo content DB storage implementation
        Args:
            content_item: content item dict
        Returns:
        """
        note_id = content_item.get("note_id")
        async with get_session() as session:
            stmt = select(WeiboNote).where(WeiboNote.note_id == note_id)
            res = await session.execute(stmt)
            db_note = res.scalar_one_or_none()
            if db_note:
                # Existing row: refresh the modification time and copy known fields only.
                db_note.last_modify_ts = utils.get_current_timestamp()
                for key, value in content_item.items():
                    if hasattr(db_note, key):
                        setattr(db_note, key, value)
            else:
                # New row: stamp both the creation and the modification time.
                content_item["add_ts"] = utils.get_current_timestamp()
                content_item["last_modify_ts"] = utils.get_current_timestamp()
                db_note = WeiboNote(**content_item)
                session.add(db_note)
            await session.commit()

    async def store_comment(self, comment_item: Dict):
        """
        Weibo comment DB storage implementation
        Args:
            comment_item: comment item dict
        Returns:
        """
        comment_id = comment_item.get("comment_id")
        async with get_session() as session:
            stmt = select(WeiboNoteComment).where(WeiboNoteComment.comment_id == comment_id)
            res = await session.execute(stmt)
            db_comment = res.scalar_one_or_none()
            if db_comment:
                # Existing row: refresh the modification time and copy known fields only.
                db_comment.last_modify_ts = utils.get_current_timestamp()
                for key, value in comment_item.items():
                    if hasattr(db_comment, key):
                        setattr(db_comment, key, value)
            else:
                # New row: stamp both the creation and the modification time.
                comment_item["add_ts"] = utils.get_current_timestamp()
                comment_item["last_modify_ts"] = utils.get_current_timestamp()
                db_comment = WeiboNoteComment(**comment_item)
                session.add(db_comment)
            await session.commit()

    async def store_creator(self, creator: Dict):
        """
        Weibo creator DB storage implementation
        Args:
            creator: creator dict
        Returns:
        """
        user_id = creator.get("user_id")
        async with get_session() as session:
            stmt = select(WeiboCreator).where(WeiboCreator.user_id == user_id)
            res = await session.execute(stmt)
            db_creator = res.scalar_one_or_none()
            if db_creator:
                # Existing row: refresh the modification time and copy known fields only.
                db_creator.last_modify_ts = utils.get_current_timestamp()
                for key, value in creator.items():
                    if hasattr(db_creator, key):
                        setattr(db_creator, key, value)
            else:
                # New row: stamp both the creation and the modification time.
                creator["add_ts"] = utils.get_current_timestamp()
                creator["last_modify_ts"] = utils.get_current_timestamp()
                db_creator = WeiboCreator(**creator)
                session.add(db_creator)
            await session.commit()


class WeiboJsonStoreImplement(AbstractStore):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.writer = AsyncFileWriter(platform="weibo", crawler_type=kwargs.get("crawler_type"))

    async def store_content(self, content_item: Dict):
        """
        content JSON storage implementation
        Args:
            content_item: content item dict
        Returns:
        """
        await self.writer.write_single_item_to_json(item_type="contents", item=content_item)

    async def store_comment(self, comment_item: Dict):
        """
        comment JSON storage implementation
        Args:
            comment_item: comment item dict
        Returns:
        """
        await self.writer.write_single_item_to_json(item_type="comments", item=comment_item)

    async def store_creator(self, creator: Dict):
        """
        creator JSON storage implementation
        Args:
            creator: creator dict
        Returns:
        """
        await self.writer.write_single_item_to_json(item_type="creators", item=creator)


class WeiboSqliteStoreImplement(WeiboDbStoreImplement):
    """
    Weibo content SQLite storage implementation
    """
    pass
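
Review note on `WeiboDbStoreImplement` above: each store method selects by id, then either updates the row and refreshes `last_modify_ts`, or inserts it with both `add_ts` and `last_modify_ts` set. The sketch below walks through that flow with the standard-library `sqlite3` module instead of SQLAlchemy. The four-column table and the explicit `now_ms` argument are assumptions chosen to keep the example deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weibo_note ("
    "note_id TEXT PRIMARY KEY, content TEXT, add_ts INTEGER, last_modify_ts INTEGER)"
)


def store_content(item: dict, now_ms: int) -> None:
    """Insert a new row or update an existing one, keeping add_ts from the first write."""
    found = conn.execute(
        "SELECT 1 FROM weibo_note WHERE note_id = ?", (item["note_id"],)
    ).fetchone()
    if found:
        conn.execute(
            "UPDATE weibo_note SET content = ?, last_modify_ts = ? WHERE note_id = ?",
            (item["content"], now_ms, item["note_id"]),
        )
    else:
        conn.execute(
            "INSERT INTO weibo_note VALUES (?, ?, ?, ?)",
            (item["note_id"], item["content"], now_ms, now_ms),
        )
    conn.commit()


store_content({"note_id": "n1", "content": "first"}, now_ms=1000)
store_content({"note_id": "n1", "content": "edited"}, now_ms=2000)
print(conn.execute("SELECT * FROM weibo_note").fetchall())
# [('n1', 'edited', 1000, 2000)]
```

The second call leaves `add_ts` at 1000 and moves only `last_modify_ts`, which is the behavior the update branch in the diff is written to produce.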

@@ -1,326 +0,0 @@
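# Disclaimer: this code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable so as not to place unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the principles above and to all terms in LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/1/14 21:35
# @Desc : Weibo storage implementation classes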
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
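def calculate_number_of_files(file_store_path: str) -> int:
"""Compute the numeric sort prefix for saved data files, so that each run writes to a new file.
Args:
file_store_path: directory where data files are stored
Returns:
the next file number
"""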
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class WeiboCsvStoreImplement(AbstractStore):
csv_store_path: str = "data/weibo"
file_count: int = calculate_number_of_files(csv_store_path)
def make_save_file_name(self, store_type: str) -> str:
"""
make save file name by store type
Args:
store_type: contents or comments
Returns: e.g. data/weibo/search_comments_20240114.csv ...
"""
return f"{self.csv_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
async def save_data_to_csv(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in CSV format.
Args:
save_item: save content dict info
store_type: save type, either contents or comments (contents | comments)
Returns: no returns
"""
pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
save_file_name = self.make_save_file_name(store_type=store_type)
async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
writer = csv.writer(f)
if await f.tell() == 0:
await writer.writerow(save_item.keys())
await writer.writerow(save_item.values())
async def store_content(self, content_item: Dict):
"""
Weibo content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.save_data_to_csv(save_item=content_item, store_type="contents")
async def store_comment(self, comment_item: Dict):
"""
Weibo comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.save_data_to_csv(save_item=comment_item, store_type="comments")
async def store_creator(self, creator: Dict):
"""
Weibo creator CSV storage implementation
Args:
creator:
Returns:
"""
await self.save_data_to_csv(save_item=creator, store_type="creators")
class WeiboDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Weibo content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
from .weibo_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
note_id = content_item.get("note_id")
note_detail: Dict = await query_content_by_content_id(content_id=note_id)
if not note_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(note_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Weibo comment DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .weibo_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Weibo creator DB storage implementation
Args:
creator:
Returns:
"""
from .weibo_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)
class WeiboJsonStoreImplement(AbstractStore):
json_store_path: str = "data/weibo/json"
words_store_path: str = "data/weibo/words"
lock = asyncio.Lock()
file_count: int = calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> (str, str):
"""
make save file name by store type
Args:
store_type: type of record being saved (contents | comments | creators)
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: type of record being saved (contents | comments | creators)
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name, words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except Exception:  # word-cloud generation is best-effort; do not swallow task cancellation
pass
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")
async def store_creator(self, creator: Dict):
"""
creator JSON storage implementation
Args:
creator:
Returns:
"""
await self.save_data_to_json(creator, "creators")
class WeiboSqliteStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Weibo content SQLite storage implementation
Args:
content_item: content item dict
Returns:
"""
from .weibo_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
note_id = content_item.get("note_id")
note_detail: Dict = await query_content_by_content_id(content_id=note_id)
if not note_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(note_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Weibo comment SQLite storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .weibo_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Weibo creator SQLite storage implementation
Args:
creator:
Returns:
"""
from .weibo_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)
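
The `calculate_number_of_files` helper in this file falls back to 1 as soon as any entry in the directory lacks a numeric prefix (a `.DS_Store` file or a `json` subdirectory, for example), because a single `ValueError` aborts the whole `max(...)` expression. A minimal sketch of a more tolerant variant follows, assuming the same `<n>_...` naming scheme; `next_file_number` is a hypothetical name and is not part of the project.

```python
import os
import tempfile


def next_file_number(file_store_path: str) -> int:
    """Return 1 + the largest numeric prefix found, ignoring unrelated entries."""
    if not os.path.isdir(file_store_path):
        return 1
    prefixes = []
    for file_name in os.listdir(file_store_path):
        head = file_name.split("_")[0]
        if head.isdecimal():  # skip entries such as ".DS_Store" or "json"
            prefixes.append(int(head))
    # default=0 covers an empty directory or one with no numbered files
    return max(prefixes, default=0) + 1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("1_search_contents.csv", "2_search_comments.csv", ".DS_Store"):
            open(os.path.join(tmp, name), "w").close()
        print(next_file_number(tmp))
```

With this variant a stray file no longer resets the counter, so an earlier run's numbered files are less likely to be reused by accident.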


@@ -1,160 +0,0 @@
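# Disclaimer: This code is for learning and research purposes only. Users must observe the following principles:
# 1. It must not be used for any commercial purpose.
# 2. Use must comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# For detailed license terms, see the LICENSE file in the project root.
# By using this code you agree to abide by the principles above and all terms in the LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 15:30
# @Desc : Collection of SQL interfaces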
from typing import Dict, List, Union
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from var import media_crawler_db_var
async def query_content_by_content_id(content_id: str) -> Dict:
"""
Query a single content record (an xhs note, a Douyin video, a Weibo post, a Kuaishou video, ...)
Args:
content_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from weibo_note where note_id = '{content_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_content(content_item: Dict) -> int:
"""
Insert a new content record (an xhs note, a Douyin video, a Weibo post, a Kuaishou video, ...)
Args:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("weibo_note", content_item)
return last_row_id
async def update_content_by_content_id(content_id: str, content_item: Dict) -> int:
"""
Update a content record (an xhs note, a Douyin video, a Weibo post, a Kuaishou video, ...)
Args:
content_id:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("weibo_note", content_item, "note_id", content_id)
return effect_row
async def query_comment_by_comment_id(comment_id: str) -> Dict:
"""
Query a single comment record
Args:
comment_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from weibo_note_comment where comment_id = '{comment_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_comment(comment_item: Dict) -> int:
"""
Insert a new comment record
Args:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("weibo_note_comment", comment_item)
return last_row_id
async def update_comment_by_comment_id(comment_id: str, comment_item: Dict) -> int:
"""
Update a comment record
Args:
comment_id:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("weibo_note_comment", comment_item, "comment_id", comment_id)
return effect_row
async def query_creator_by_user_id(user_id: str) -> Dict:
"""
Query a single creator record
Args:
user_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from weibo_creator where user_id = '{user_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_creator(creator_item: Dict) -> int:
"""
Insert a new creator record
Args:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("weibo_creator", creator_item)
return last_row_id
async def update_creator_by_user_id(user_id: str, creator_item: Dict) -> int:
"""
Update a creator record
Args:
user_id:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("weibo_creator", creator_item, "user_id", user_id)
return effect_row
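
The query helpers in this removed file interpolate the id straight into the SQL string, so an id containing a single quote breaks the statement. A minimal sketch of the bound-parameter alternative is given here, using the standard-library `sqlite3` module rather than the project's `AsyncMysqlDB` / `AsyncSqliteDB` wrappers (whose parameter handling is not shown in this diff). The two-column table and the `query_note` helper are illustrative assumptions.

```python
import sqlite3
from typing import Dict


def query_note(conn: sqlite3.Connection, note_id: str) -> Dict:
    """Look up one row by note_id, letting the driver bind the value."""
    cur = conn.execute("SELECT * FROM weibo_note WHERE note_id = ?", (note_id,))
    row = cur.fetchone()
    return dict(row) if row is not None else {}


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # rows become mapping-like, so dict(row) works
    conn.execute("CREATE TABLE weibo_note (note_id TEXT PRIMARY KEY, content TEXT)")
    conn.execute("INSERT INTO weibo_note VALUES (?, ?)", ("it's-1", "hello"))
    print(query_note(conn, "it's-1"))   # the quote in the id is handled by binding
    print(query_note(conn, "missing"))  # empty dict when nothing matches
```

The ORM-based replacement in this PR builds its lookups with `select(...).where(...)`, which binds values in the same spirit.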


@@ -17,9 +17,8 @@ from typing import List
import config
from var import source_keyword_var
from . import xhs_store_impl
from .xhs_store_media import *
from .xhs_store_impl import *
from ._store_impl import *
class XhsStoreFactory:

260
store/xhs/_store_impl.py Normal file

@@ -0,0 +1,260 @@
# -*- coding: utf-8 -*-
# @Author : persist1@126.com
# @Time : 2025/9/5 19:34
# @Desc : Xiaohongshu storage implementation classes
import json
import os
from datetime import datetime
from typing import List, Dict, Any
from sqlalchemy import select, update, delete
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import Session
from base.base_crawler import AbstractStore
from database.db_session import get_session
from database.models import XhsNote, XhsNoteComment, XhsCreator
from tools.async_file_writer import AsyncFileWriter
from tools.time_util import get_current_timestamp
class XhsCsvStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="xhs", crawler_type=kwargs.get("crawler_type"))
async def store_content(self, content_item: Dict):
"""
store content data to csv file
:param content_item:
:return:
"""
await self.writer.write_to_csv(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
store comment data to csv file
:param comment_item:
:return:
"""
await self.writer.write_to_csv(item_type="comments", item=comment_item)
async def store_creator(self, creator_item: Dict):
pass
def flush(self):
pass
class XhsJsonStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="xhs", crawler_type=kwargs.get("crawler_type"))
async def store_content(self, content_item: Dict):
"""
store content data to json file
:param content_item:
:return:
"""
await self.writer.write_single_item_to_json(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
store comment data to json file
:param comment_item:
:return:
"""
await self.writer.write_single_item_to_json(item_type="comments", item=comment_item)
async def store_creator(self, creator_item: Dict):
pass
def flush(self):
"""
flush data to json file
:return:
"""
pass
class XhsDbStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
async def store_content(self, content_item: Dict):
note_id = content_item.get("note_id")
if not note_id:
return
async with get_session() as session:
if await self.content_is_exist(session, note_id):
await self.update_content(session, content_item)
else:
await self.add_content(session, content_item)
async def add_content(self, session: AsyncSession, content_item: Dict):
add_ts = int(get_current_timestamp())
last_modify_ts = int(get_current_timestamp())
note = XhsNote(
user_id=content_item.get("user_id"),
nickname=content_item.get("nickname"),
avatar=content_item.get("avatar"),
ip_location=content_item.get("ip_location"),
add_ts=add_ts,
last_modify_ts=last_modify_ts,
note_id=content_item.get("note_id"),
type=content_item.get("type"),
title=content_item.get("title"),
desc=content_item.get("desc"),
video_url=content_item.get("video_url"),
time=content_item.get("time"),
last_update_time=content_item.get("last_update_time"),
liked_count=str(content_item.get("liked_count")),
collected_count=str(content_item.get("collected_count")),
comment_count=str(content_item.get("comment_count")),
share_count=str(content_item.get("share_count")),
image_list=json.dumps(content_item.get("image_list")),
tag_list=json.dumps(content_item.get("tag_list")),
note_url=content_item.get("note_url"),
source_keyword=content_item.get("source_keyword", ""),
xsec_token=content_item.get("xsec_token", "")
)
session.add(note)
async def update_content(self, session: AsyncSession, content_item: Dict):
note_id = content_item.get("note_id")
last_modify_ts = int(get_current_timestamp())
update_data = {
"last_modify_ts": last_modify_ts,
"liked_count": str(content_item.get("liked_count")),
"collected_count": str(content_item.get("collected_count")),
"comment_count": str(content_item.get("comment_count")),
"share_count": str(content_item.get("share_count")),
"last_update_time": content_item.get("last_update_time"),
}
stmt = update(XhsNote).where(XhsNote.note_id == note_id).values(**update_data)
await session.execute(stmt)
async def content_is_exist(self, session: AsyncSession, note_id: str) -> bool:
stmt = select(XhsNote).where(XhsNote.note_id == note_id)
result = await session.execute(stmt)
return result.first() is not None
async def store_comment(self, comment_item: Dict):
if not comment_item:
return
async with get_session() as session:
comment_id = comment_item.get("comment_id")
if not comment_id:
return
if await self.comment_is_exist(session, comment_id):
await self.update_comment(session, comment_item)
else:
await self.add_comment(session, comment_item)
async def add_comment(self, session: AsyncSession, comment_item: Dict):
add_ts = int(get_current_timestamp())
last_modify_ts = int(get_current_timestamp())
comment = XhsNoteComment(
user_id=comment_item.get("user_id"),
nickname=comment_item.get("nickname"),
avatar=comment_item.get("avatar"),
ip_location=comment_item.get("ip_location"),
add_ts=add_ts,
last_modify_ts=last_modify_ts,
comment_id=comment_item.get("comment_id"),
create_time=comment_item.get("create_time"),
note_id=comment_item.get("note_id"),
content=comment_item.get("content"),
sub_comment_count=comment_item.get("sub_comment_count"),
pictures=json.dumps(comment_item.get("pictures")),
parent_comment_id=comment_item.get("parent_comment_id"),
like_count=str(comment_item.get("like_count"))
)
session.add(comment)
async def update_comment(self, session: AsyncSession, comment_item: Dict):
comment_id = comment_item.get("comment_id")
last_modify_ts = int(get_current_timestamp())
update_data = {
"last_modify_ts": last_modify_ts,
"like_count": str(comment_item.get("like_count")),
"sub_comment_count": comment_item.get("sub_comment_count"),
}
stmt = update(XhsNoteComment).where(XhsNoteComment.comment_id == comment_id).values(**update_data)
await session.execute(stmt)
async def comment_is_exist(self, session: AsyncSession, comment_id: str) -> bool:
stmt = select(XhsNoteComment).where(XhsNoteComment.comment_id == comment_id)
result = await session.execute(stmt)
return result.first() is not None
async def store_creator(self, creator_item: Dict):
user_id = creator_item.get("user_id")
if not user_id:
return
async with get_session() as session:
if await self.creator_is_exist(session, user_id):
await self.update_creator(session, creator_item)
else:
await self.add_creator(session, creator_item)
async def add_creator(self, session: AsyncSession, creator_item: Dict):
add_ts = int(get_current_timestamp())
last_modify_ts = int(get_current_timestamp())
creator = XhsCreator(
user_id=creator_item.get("user_id"),
nickname=creator_item.get("nickname"),
avatar=creator_item.get("avatar"),
ip_location=creator_item.get("ip_location"),
add_ts=add_ts,
last_modify_ts=last_modify_ts,
desc=creator_item.get("desc"),
gender=creator_item.get("gender"),
follows=str(creator_item.get("follows")),
fans=str(creator_item.get("fans")),
interaction=str(creator_item.get("interaction")),
tag_list=json.dumps(creator_item.get("tag_list"))
)
session.add(creator)
async def update_creator(self, session: AsyncSession, creator_item: Dict):
user_id = creator_item.get("user_id")
last_modify_ts = int(get_current_timestamp())
update_data = {
"last_modify_ts": last_modify_ts,
"nickname": creator_item.get("nickname"),
"avatar": creator_item.get("avatar"),
"desc": creator_item.get("desc"),
"follows": str(creator_item.get("follows")),
"fans": str(creator_item.get("fans")),
"interaction": str(creator_item.get("interaction")),
"tag_list": json.dumps(creator_item.get("tag_list"))
}
stmt = update(XhsCreator).where(XhsCreator.user_id == user_id).values(**update_data)
await session.execute(stmt)
async def creator_is_exist(self, session: AsyncSession, user_id: str) -> bool:
stmt = select(XhsCreator).where(XhsCreator.user_id == user_id)
result = await session.execute(stmt)
return result.first() is not None
async def get_all_content(self) -> List[Dict]:
async with get_session() as session:
stmt = select(XhsNote)
result = await session.execute(stmt)
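# Drop SQLAlchemy's internal _sa_instance_state entry so only loaded column data is returned
return [{k: v for k, v in item.__dict__.items() if k != "_sa_instance_state"} for item in result.scalars().all()]
async def get_all_comments(self) -> List[Dict]:
async with get_session() as session:
stmt = select(XhsNoteComment)
result = await session.execute(stmt)
# Drop SQLAlchemy's internal _sa_instance_state entry so only loaded column data is returned
return [{k: v for k, v in item.__dict__.items() if k != "_sa_instance_state"} for item in result.scalars().all()]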
class XhsSqliteStoreImplement(XhsDbStoreImplement):
def __init__(self, **kwargs):
super().__init__(**kwargs)
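
`XhsDbStoreImplement` decides between insert and update with a separate existence query, which costs two statements per item. A hedged sketch of a single-statement alternative follows, using SQLite's `INSERT ... ON CONFLICT ... DO UPDATE` through the standard-library `sqlite3` module. This is not what the PR does: it assumes a unique constraint on `note_id` and SQLite 3.24 or newer, and MySQL would need `ON DUPLICATE KEY UPDATE` instead. The reduced table and the `upsert_note` helper are illustrative only.

```python
import sqlite3
import time


def upsert_note(conn: sqlite3.Connection, note_id: str, liked_count: str) -> None:
    """Insert a note, or refresh its counters if note_id already exists."""
    now = int(time.time())  # assumption: any monotonically meaningful timestamp works here
    conn.execute(
        """
        INSERT INTO xhs_note (note_id, liked_count, add_ts, last_modify_ts)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(note_id) DO UPDATE SET
            liked_count = excluded.liked_count,
            last_modify_ts = excluded.last_modify_ts
        """,
        (note_id, liked_count, now, now),
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE xhs_note (note_id TEXT PRIMARY KEY, liked_count TEXT,"
        " add_ts INTEGER, last_modify_ts INTEGER)"
    )
    upsert_note(conn, "n1", "10")
    upsert_note(conn, "n1", "25")  # same key: updates the row instead of inserting
    print(conn.execute("SELECT note_id, liked_count FROM xhs_note").fetchall())
```

Because `add_ts` is absent from the `DO UPDATE SET` clause, the original insertion time survives later updates, mirroring how `update_content` leaves `add_ts` untouched.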


@@ -1,318 +0,0 @@
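# Disclaimer: This code is for learning and research purposes only. Users must observe the following principles:
# 1. It must not be used for any commercial purpose.
# 2. Use must comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# For detailed license terms, see the LICENSE file in the project root.
# By using this code you agree to abide by the principles above and all terms in the LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/1/14 16:58
# @Desc : Xiaohongshu storage implementation classes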
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
def calculate_number_of_files(file_store_path: str) -> int:
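"""Compute the numeric prefix for the next data file, so that each run writes to a new file.
Args:
file_store_path: directory in which the data files are stored
Returns:
the next file number (1 if the directory is missing, empty, or contains a file without a numeric prefix)
"""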
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0])for file_name in os.listdir(file_store_path)])+1
except ValueError:
return 1
class XhsCsvStoreImplement(AbstractStore):
csv_store_path: str = "data/xhs"
file_count:int=calculate_number_of_files(csv_store_path)
def make_save_file_name(self, store_type: str) -> str:
"""
make save file name by store type
Args:
store_type: contents or comments
Returns: e.g. data/xhs/1_search_comments_20240114.csv ...
"""
return f"{self.csv_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
async def save_data_to_csv(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in CSV format.
Args:
save_item: save content dict info
store_type: type of record being saved (contents | comments | creator)
Returns: no returns
"""
pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
save_file_name = self.make_save_file_name(store_type=store_type)
async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
f.fileno()
writer = csv.writer(f)
if await f.tell() == 0:
await writer.writerow(save_item.keys())
await writer.writerow(save_item.values())
async def store_content(self, content_item: Dict):
"""
Xiaohongshu content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.save_data_to_csv(save_item=content_item, store_type="contents")
async def store_comment(self, comment_item: Dict):
"""
Xiaohongshu comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.save_data_to_csv(save_item=comment_item, store_type="comments")
async def store_creator(self, creator: Dict):
"""
Xiaohongshu creator CSV storage implementation
Args:
creator: creator dict
Returns:
"""
await self.save_data_to_csv(save_item=creator, store_type="creator")
class XhsDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Xiaohongshu content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
from .xhs_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
note_id = content_item.get("note_id")
note_detail: Dict = await query_content_by_content_id(content_id=note_id)
if not note_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(note_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Xiaohongshu comment DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .xhs_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Xiaohongshu creator DB storage implementation
Args:
creator: creator dict
Returns:
"""
from .xhs_store_sql import (add_new_creator, query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)
class XhsJsonStoreImplement(AbstractStore):
json_store_path: str = "data/xhs/json"
words_store_path: str = "data/xhs/words"
lock = asyncio.Lock()
file_count:int=calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> (str,str):
"""
make save file name by store type
Args:
store_type: type of record being saved (contents | comments | creator)
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: type of record being saved (contents | comments | creator)
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name,words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False, indent=4))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except Exception:  # word-cloud generation is best-effort; do not swallow task cancellation
pass
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")
async def store_creator(self, creator: Dict):
"""
Xiaohongshu creator JSON storage implementation
Args:
creator: creator dict
Returns:
"""
await self.save_data_to_json(creator, "creator")
class XhsSqliteStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Xiaohongshu content SQLite storage implementation
Args:
content_item: content item dict
Returns:
"""
from .xhs_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
note_id = content_item.get("note_id")
note_detail: Dict = await query_content_by_content_id(content_id=note_id)
if not note_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(note_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Xiaohongshu comment SQLite storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .xhs_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Xiaohongshu creator SQLite storage implementation
Args:
creator: creator dict
Returns:
"""
from .xhs_store_sql import (add_new_creator, query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)
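
The removed `XhsJsonStoreImplement.save_data_to_json` appends by reading the whole JSON array, adding one item, and rewriting the file, all under an `asyncio.Lock` so concurrent tasks cannot interleave. A minimal sketch of that read-modify-write pattern follows. It swaps `aiofiles` for `asyncio.to_thread` to stay dependency-free, and the `JsonAppender` class is a hypothetical stand-in rather than project code.

```python
import asyncio
import json
import os
import tempfile
from typing import Dict


class JsonAppender:
    """Append items to a JSON array file, one read-modify-write cycle at a time."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.lock = asyncio.Lock()

    def _append_sync(self, item: Dict) -> None:
        data = []
        if os.path.exists(self.path):
            with open(self.path, "r", encoding="utf-8") as f:
                data = json.load(f)
        data.append(item)
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False)

    async def append(self, item: Dict) -> None:
        async with self.lock:  # serialize the whole cycle, not just the write
            await asyncio.to_thread(self._append_sync, item)


async def main() -> int:
    with tempfile.TemporaryDirectory() as tmp:
        appender = JsonAppender(os.path.join(tmp, "contents.json"))
        await asyncio.gather(*(appender.append({"id": i}) for i in range(20)))
        with open(appender.path, encoding="utf-8") as f:
            return len(json.load(f))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Each append rewrites the entire file, so the cost of one call grows with the number of items already stored. That is acceptable for small crawls, and it is one reason a dedicated writer such as the `AsyncFileWriter` introduced in this PR is easier to optimize in one place.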


@@ -1,160 +0,0 @@
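# Disclaimer: This code is for learning and research purposes only. Users must observe the following principles:
# 1. It must not be used for any commercial purpose.
# 2. Use must comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# For detailed license terms, see the LICENSE file in the project root.
# By using this code you agree to abide by the principles above and all terms in the LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 15:30
# @Desc : Collection of SQL interfaces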
from typing import Dict, List, Union
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from var import media_crawler_db_var
async def query_content_by_content_id(content_id: str) -> Dict:
"""
Query a single content record (an xhs note, a Douyin video, a Weibo post, a Kuaishou video, ...)
Args:
content_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from xhs_note where note_id = '{content_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_content(content_item: Dict) -> int:
"""
Insert a new content record (an xhs note, a Douyin video, a Weibo post, a Kuaishou video, ...)
Args:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("xhs_note", content_item)
return last_row_id
async def update_content_by_content_id(content_id: str, content_item: Dict) -> int:
"""
Update a content record (an xhs note, a Douyin video, a Weibo post, a Kuaishou video, ...)
Args:
content_id:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("xhs_note", content_item, "note_id", content_id)
return effect_row
async def query_comment_by_comment_id(comment_id: str) -> Dict:
"""
Query a single comment record
Args:
comment_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from xhs_note_comment where comment_id = '{comment_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_comment(comment_item: Dict) -> int:
"""
Insert a new comment record
Args:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("xhs_note_comment", comment_item)
return last_row_id
async def update_comment_by_comment_id(comment_id: str, comment_item: Dict) -> int:
"""
Update a comment record
Args:
comment_id:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("xhs_note_comment", comment_item, "comment_id", comment_id)
return effect_row
async def query_creator_by_user_id(user_id: str) -> Dict:
"""
Query a single creator record
Args:
user_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from xhs_creator where user_id = '{user_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_creator(creator_item: Dict) -> int:
"""
Insert a new creator record
Args:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("xhs_creator", creator_item)
return last_row_id
async def update_creator_by_user_id(user_id: str, creator_item: Dict) -> int:
"""
Update a creator record
Args:
user_id:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("xhs_creator", creator_item, "user_id", user_id)
return effect_row


@@ -15,7 +15,7 @@ from typing import List
import config
from base.base_crawler import AbstractStore
from model.m_zhihu import ZhihuComment, ZhihuContent, ZhihuCreator
from store.zhihu.zhihu_store_impl import (ZhihuCsvStoreImplement,
from ._store_impl import (ZhihuCsvStoreImplement,
ZhihuDbStoreImplement,
ZhihuJsonStoreImplement,
ZhihuSqliteStoreImplement)

191
store/zhihu/_store_impl.py Normal file

@@ -0,0 +1,191 @@
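# Disclaimer: This code is for learning and research purposes only. Users must observe the following principles:
# 1. It must not be used for any commercial purpose.
# 2. Use must comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# For detailed license terms, see the LICENSE file in the project root.
# By using this code you agree to abide by the principles above and all terms in the LICENSE.
# -*- coding: utf-8 -*-
# @Author : persist1@126.com
# @Time : 2025/9/5 19:34
# @Desc : Zhihu storage implementation classes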
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
import config
from base.base_crawler import AbstractStore
from database.db_session import get_session
from database.models import ZhihuContent, ZhihuComment, ZhihuCreator
from tools import utils, words
from var import crawler_type_var
from tools.async_file_writer import AsyncFileWriter
def calculate_number_of_files(file_store_path: str) -> int:
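"""Compute the numeric prefix for the next data file, so that each run writes to a new file.
Args:
file_store_path: directory in which the data files are stored
Returns:
the next file number (1 if the directory is missing, empty, or contains a file without a numeric prefix)
"""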
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class ZhihuCsvStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="zhihu", crawler_type=kwargs.get("crawler_type"))
async def store_content(self, content_item: Dict):
"""
Zhihu content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.writer.write_to_csv(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Zhihu comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.writer.write_to_csv(item_type="comments", item=comment_item)
async def store_creator(self, creator: Dict):
"""
Zhihu creator CSV storage implementation
Args:
creator: creator dict
Returns:
"""
await self.writer.write_to_csv(item_type="creators", item=creator)
class ZhihuDbStoreImplement(AbstractStore):
    async def store_content(self, content_item: Dict):
        """
        Zhihu content DB storage implementation
        Args:
            content_item: content item dict
        """
        content_id = content_item.get("content_id")
        async with get_session() as session:
            # Look the row up by its business key, then update it in place or insert a new one
            stmt = select(ZhihuContent).where(ZhihuContent.content_id == content_id)
            result = await session.execute(stmt)
            existing_content = result.scalars().first()
            if existing_content:
                for key, value in content_item.items():
                    setattr(existing_content, key, value)
            else:
                new_content = ZhihuContent(**content_item)
                session.add(new_content)
            await session.commit()

    async def store_comment(self, comment_item: Dict):
        """
        Zhihu comment DB storage implementation
        Args:
            comment_item: comment item dict
        """
        comment_id = comment_item.get("comment_id")
        async with get_session() as session:
            stmt = select(ZhihuComment).where(ZhihuComment.comment_id == comment_id)
            result = await session.execute(stmt)
            existing_comment = result.scalars().first()
            if existing_comment:
                for key, value in comment_item.items():
                    setattr(existing_comment, key, value)
            else:
                new_comment = ZhihuComment(**comment_item)
                session.add(new_comment)
            await session.commit()

    async def store_creator(self, creator: Dict):
        """
        Zhihu creator DB storage implementation
        Args:
            creator: creator dict
        """
        user_id = creator.get("user_id")
        async with get_session() as session:
            stmt = select(ZhihuCreator).where(ZhihuCreator.user_id == user_id)
            result = await session.execute(stmt)
            existing_creator = result.scalars().first()
            if existing_creator:
                for key, value in creator.items():
                    setattr(existing_creator, key, value)
            else:
                new_creator = ZhihuCreator(**creator)
                session.add(new_creator)
            await session.commit()
class ZhihuJsonStoreImplement(AbstractStore):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.writer = AsyncFileWriter(platform="zhihu", crawler_type=kwargs.get("crawler_type"))

    async def store_content(self, content_item: Dict):
        """
        Zhihu content JSON storage implementation
        Args:
            content_item: content item dict
        """
        await self.writer.write_single_item_to_json(item_type="contents", item=content_item)

    async def store_comment(self, comment_item: Dict):
        """
        Zhihu comment JSON storage implementation
        Args:
            comment_item: comment item dict
        """
        await self.writer.write_single_item_to_json(item_type="comments", item=comment_item)

    async def store_creator(self, creator: Dict):
        """
        Zhihu creator JSON storage implementation
        Args:
            creator: creator dict
        """
        await self.writer.write_single_item_to_json(item_type="creators", item=creator)
class ZhihuSqliteStoreImplement(ZhihuDbStoreImplement):
    """
    Zhihu content SQLite storage implementation
    """
    pass
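
The three `ZhihuDbStoreImplement` methods share one pattern: look the row up by its business key, overwrite its fields if it exists, otherwise insert it. Below is a minimal sketch of that pattern. It uses the standard-library `sqlite3` module in place of the project's async SQLAlchemy session so that it runs without extra dependencies; the `demo_content` table, its columns, and the `upsert_content` helper are hypothetical names for illustration only.

```python
import sqlite3


def upsert_content(conn: sqlite3.Connection, item: dict) -> str:
    """Insert `item`, or update the existing row with the same content_id.

    Returns "inserted" or "updated". Table and column names are illustrative.
    """
    row = conn.execute(
        "SELECT 1 FROM demo_content WHERE content_id = ?", (item["content_id"],)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO demo_content (content_id, title) VALUES (?, ?)",
            (item["content_id"], item["title"]),
        )
        action = "inserted"
    else:
        conn.execute(
            "UPDATE demo_content SET title = ? WHERE content_id = ?",
            (item["title"], item["content_id"]),
        )
        action = "updated"
    conn.commit()
    return action


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_content (content_id TEXT PRIMARY KEY, title TEXT)")
first = upsert_content(conn, {"content_id": "42", "title": "draft"})
second = upsert_content(conn, {"content_id": "42", "title": "final"})
rows = conn.execute("SELECT content_id, title FROM demo_content").fetchall()
```

Because the lookup and the write are separate statements, two concurrent writers could both take the insert branch. A unique constraint on the key column, as in this sketch, turns that race into an error rather than a duplicate row.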


@@ -1,318 +0,0 @@
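# Disclaimer: This code is for learning and research purposes only. Users must follow these principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to follow the principles above and all terms in LICENSE.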
# -*- coding: utf-8 -*-
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
def calculate_number_of_files(file_store_path: str) -> int:
"""Compute the numeric prefix for the next data file so that each run writes to a new file
Args:
file_store_path;
Returns:
file nums
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class ZhihuCsvStoreImplement(AbstractStore):
csv_store_path: str = "data/zhihu"
file_count: int = calculate_number_of_files(csv_store_path)
def make_save_file_name(self, store_type: str) -> str:
"""
make save file name by store type
Args:
store_type: contents or comments
Returns: eg: data/zhihu/search_comments_20240114.csv ...
"""
return f"{self.csv_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
async def save_data_to_csv(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in CSV format.
Args:
save_item: save content dict info
store_type: save type, either contents or comments (contents | comments)
Returns: no returns
"""
pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
save_file_name = self.make_save_file_name(store_type=store_type)
async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
f.fileno()
writer = csv.writer(f)
if await f.tell() == 0:
await writer.writerow(save_item.keys())
await writer.writerow(save_item.values())
async def store_content(self, content_item: Dict):
"""
Zhihu content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.save_data_to_csv(save_item=content_item, store_type="contents")
async def store_comment(self, comment_item: Dict):
"""
Zhihu comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.save_data_to_csv(save_item=comment_item, store_type="comments")
async def store_creator(self, creator: Dict):
"""
Zhihu content CSV storage implementation
Args:
creator: creator dict
Returns:
"""
await self.save_data_to_csv(save_item=creator, store_type="creator")
class ZhihuDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Zhihu content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
from .zhihu_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
note_id = content_item.get("note_id")
note_detail: Dict = await query_content_by_content_id(content_id=note_id)
if not note_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(note_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Zhihu content DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .zhihu_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Zhihu content DB storage implementation
Args:
creator: creator dict
Returns:
"""
from .zhihu_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)
class ZhihuJsonStoreImplement(AbstractStore):
json_store_path: str = "data/zhihu/json"
words_store_path: str = "data/zhihu/words"
lock = asyncio.Lock()
file_count: int = calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> (str, str):
"""
make save file name by store type
Args:
store_type: save type, either contents or comments (contents | comments)
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: save type, either contents or comments (contents | comments)
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name, words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False, indent=4))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except:
pass
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")
async def store_creator(self, creator: Dict):
"""
Zhihu content JSON storage implementation
Args:
creator: creator dict
Returns:
"""
await self.save_data_to_json(creator, "creator")
class ZhihuSqliteStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Zhihu content SQLite storage implementation
Args:
content_item: content item dict
Returns:
"""
from .zhihu_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
note_id = content_item.get("note_id")
note_detail: Dict = await query_content_by_content_id(content_id=note_id)
if not note_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(note_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Zhihu comment SQLite storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .zhihu_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Zhihu creator SQLite storage implementation
Args:
creator: creator dict
Returns:
"""
from .zhihu_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)


@@ -1,156 +0,0 @@
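# Disclaimer: This code is for learning and research purposes only. Users must follow these principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep the request rate reasonable to avoid placing unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to follow the principles above and all terms in LICENSE.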
# -*- coding: utf-8 -*-
from typing import Dict, List, Union
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from var import media_crawler_db_var
async def query_content_by_content_id(content_id: str) -> Dict:
"""
Query one content record (a Zhihu post, Douyin video, Weibo post, Kuaishou video, ...)
Args:
content_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from zhihu_content where content_id = '{content_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_content(content_item: Dict) -> int:
"""
Insert one content record (a Zhihu post, Douyin video, Weibo post, Kuaishou video, ...)
Args:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("zhihu_content", content_item)
return last_row_id
async def update_content_by_content_id(content_id: str, content_item: Dict) -> int:
"""
Update one content record (a Zhihu post, Douyin video, Weibo post, Kuaishou video, ...)
Args:
content_id:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("zhihu_content", content_item, "content_id", content_id)
return effect_row
async def query_comment_by_comment_id(comment_id: str) -> Dict:
"""
Query one comment record
Args:
comment_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from zhihu_comment where comment_id = '{comment_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_comment(comment_item: Dict) -> int:
"""
Insert one comment record
Args:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("zhihu_comment", comment_item)
return last_row_id
async def update_comment_by_comment_id(comment_id: str, comment_item: Dict) -> int:
"""
Update one comment record
Args:
comment_id:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("zhihu_comment", comment_item, "comment_id", comment_id)
return effect_row
async def query_creator_by_user_id(user_id: str) -> Dict:
"""
Query one creator record
Args:
user_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from zhihu_creator where user_id = '{user_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_creator(creator_item: Dict) -> int:
"""
Insert one creator record
Args:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("zhihu_creator", creator_item)
return last_row_id
async def update_creator_by_user_id(user_id: str, creator_item: Dict) -> int:
"""
Update one creator record
Args:
user_id:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("zhihu_creator", creator_item, "user_id", user_id)
return effect_row
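
The removed helpers built their lookups by interpolating the id into the SQL text (`f"... where content_id = '{content_id}'"`), which breaks, or changes the statement, when the value contains a quote. The ORM `select(...).where(...)` form sends the value as a bound parameter instead. A minimal sketch of the difference follows, using the standard-library `sqlite3` module and a hypothetical `demo_comment` table rather than the project's database layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_comment (comment_id TEXT, body TEXT)")
conn.execute("INSERT INTO demo_comment VALUES (?, ?)", ("it's-1", "hello"))

tricky_id = "it's-1"  # contains a single quote

# String interpolation: the quote ends the SQL literal early and the statement no longer parses.
try:
    conn.execute(f"SELECT body FROM demo_comment WHERE comment_id = '{tricky_id}'")
    interpolated_ok = True
except sqlite3.OperationalError:
    interpolated_ok = False

# Bound parameter: the driver passes the value separately from the SQL text.
bound_rows = conn.execute(
    "SELECT body FROM demo_comment WHERE comment_id = ?", (tricky_id,)
).fetchall()
```

Here the interpolated query fails outright, while the bound query returns the stored row.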

233
test/test_db_sync.py Normal file

@@ -0,0 +1,233 @@
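# -*- coding: utf-8 -*-
# @Author : persist-1<persist1@126.com>
# @Time : 2025/9/8 00:02
# @Desc : Compares the ORM models in database/models.py with the actual schema of both databases and applies updates (connect to database -> compare schemas -> report differences -> interactive sync)
# @Tips : This script requires the dependency 'pymysql==1.1.0'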
import os
import sys
from sqlalchemy import create_engine, inspect as sqlalchemy_inspect
from sqlalchemy.schema import MetaData
# Add the project root to sys.path
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from config.db_config import mysql_db_config, sqlite_db_config
from database.models import Base
def get_mysql_engine():
    """Create and return a MySQL database engine."""
    conn_str = f"mysql+pymysql://{mysql_db_config['user']}:{mysql_db_config['password']}@{mysql_db_config['host']}:{mysql_db_config['port']}/{mysql_db_config['db_name']}"
    return create_engine(conn_str)


def get_sqlite_engine():
    """Create and return a SQLite database engine."""
    conn_str = f"sqlite:///{sqlite_db_config['db_path']}"
    return create_engine(conn_str)
def get_db_schema(engine):
    """Return the database's current schema as {table: {column: type string}}."""
    inspector = sqlalchemy_inspect(engine)
    schema = {}
    for table_name in inspector.get_table_names():
        columns = {}
        for column in inspector.get_columns(table_name):
            columns[column['name']] = str(column['type'])
        schema[table_name] = columns
    return schema


def get_orm_schema():
    """Return the ORM models' schema as {table: {column: type string}}."""
    schema = {}
    for table_name, table in Base.metadata.tables.items():
        columns = {}
        for column in table.columns:
            columns[column.name] = str(column.type)
        schema[table_name] = columns
    return schema
def compare_schemas(db_schema, orm_schema):
    """Compare the database schema with the ORM model schema and return the differences."""
    db_tables = set(db_schema.keys())
    orm_tables = set(orm_schema.keys())
    added_tables = orm_tables - db_tables
    deleted_tables = db_tables - orm_tables
    common_tables = db_tables.intersection(orm_tables)
    changed_tables = {}
    for table in common_tables:
        db_cols = set(db_schema[table].keys())
        orm_cols = set(orm_schema[table].keys())
        added_cols = orm_cols - db_cols
        deleted_cols = db_cols - orm_cols
        modified_cols = {}
        for col in db_cols.intersection(orm_cols):
            # Types are compared by their string form, e.g. "BIGINT" vs "VARCHAR(64)"
            if db_schema[table][col] != orm_schema[table][col]:
                modified_cols[col] = (db_schema[table][col], orm_schema[table][col])
        if added_cols or deleted_cols or modified_cols:
            changed_tables[table] = {
                "added": list(added_cols),
                "deleted": list(deleted_cols),
                "modified": modified_cols
            }
    return {
        "added_tables": list(added_tables),
        "deleted_tables": list(deleted_tables),
        "changed_tables": changed_tables
    }
def print_diff(db_name, diff):
    """Print the diff report."""
    print(f"--- {db_name} database schema diff report ---")
    if not any(diff.values()):
        print("The database schema matches the ORM models; no sync needed.")
        return
    if diff.get("added_tables"):
        print("\n[+] Added tables:")
        for table in diff["added_tables"]:
            print(f"  - {table}")
    if diff.get("deleted_tables"):
        print("\n[-] Deleted tables:")
        for table in diff["deleted_tables"]:
            print(f"  - {table}")
    if diff.get("changed_tables"):
        print("\n[*] Changed tables:")
        for table, changes in diff["changed_tables"].items():
            print(f"  - {table}:")
            if changes.get("added"):
                print("      [+] Added columns:", ", ".join(changes["added"]))
            if changes.get("deleted"):
                print("      [-] Deleted columns:", ", ".join(changes["deleted"]))
            if changes.get("modified"):
                print("      [*] Modified columns:")
                for col, types in changes["modified"].items():
                    print(f"        - {col}: {types[0]} -> {types[1]}")
    print("--- End of report ---")


def sync_database(engine, diff):
    """Sync the ORM models to the database."""
    metadata = Base.metadata
    # Alembic context setup
    from alembic.migration import MigrationContext
    from alembic.operations import Operations

    # engine.begin() commits on success and always closes the connection
    with engine.begin() as conn:
        ctx = MigrationContext.configure(conn)
        op = Operations(ctx)
        # Drop tables that are no longer in the ORM models
        for table_name in diff['deleted_tables']:
            op.drop_table(table_name)
            print(f"Dropped table: {table_name}")
        # Create tables that are new in the ORM models
        for table_name in diff['added_tables']:
            table = metadata.tables.get(table_name)
            if table is not None:
                table.create(conn)
                print(f"Created table: {table_name}")
        # Apply column changes
        for table_name, changes in diff['changed_tables'].items():
            # Drop columns
            for col_name in changes['deleted']:
                op.drop_column(table_name, col_name)
                print(f"Dropped column in table {table_name}: {col_name}")
            # Add columns
            for col_name in changes['added']:
                table = metadata.tables.get(table_name)
                column = table.columns.get(col_name)
                if column is not None:
                    op.add_column(table_name, column)
                    print(f"Added column in table {table_name}: {col_name}")
            # Change column types
            for col_name, types in changes['modified'].items():
                table = metadata.tables.get(table_name)
                if table is not None:
                    column = table.columns.get(col_name)
                    if column is not None:
                        op.alter_column(table_name, col_name, type_=column.type)
                        print(f"Modified column in table {table_name}: {col_name} (type changed to {column.type})")


def main():
    """Entry point."""
    orm_schema = get_orm_schema()
    # MySQL
    try:
        mysql_engine = get_mysql_engine()
        mysql_schema = get_db_schema(mysql_engine)
        mysql_diff = compare_schemas(mysql_schema, orm_schema)
        print_diff("MySQL", mysql_diff)
        if any(mysql_diff.values()):
            choice = input(">>> Please confirm: sync the ORM models to the MySQL database? (y/N): ")
            if choice.lower() == 'y':
                sync_database(mysql_engine, mysql_diff)
                print("MySQL database sync complete.")
    except Exception as e:
        print(f"Error while processing MySQL: {e}")
    # SQLite
    try:
        sqlite_engine = get_sqlite_engine()
        sqlite_schema = get_db_schema(sqlite_engine)
        sqlite_diff = compare_schemas(sqlite_schema, orm_schema)
        print_diff("SQLite", sqlite_diff)
        if any(sqlite_diff.values()):
            choice = input(">>> Please confirm: sync the ORM models to the SQLite database? (y/N): ")
            if choice.lower() == 'y':
                # SQLite has no ALTER COLUMN for changing a column's type, so drop those changes before syncing
                print("Warning: SQLite has limited support for altering columns; this script will not change column types.")
                for changes in sqlite_diff["changed_tables"].values():
                    changes["modified"] = {}
                sync_database(sqlite_engine, sqlite_diff)
                print("SQLite database sync complete.")
    except Exception as e:
        print(f"Error while processing SQLite: {e}")


if __name__ == "__main__":
    main()

######################### Feedback example #########################
# [*] Changed tables:
#   - kuaishou_video:
#       [*] Modified columns:
#         - user_id: TEXT -> VARCHAR(64)
#   - xhs_note_comment:
#       [*] Modified columns:
#         - comment_id: BIGINT -> VARCHAR(255)
#   - zhihu_content:
#       [*] Modified columns:
#         - created_time: BIGINT -> VARCHAR(32)
#         - content_id: BIGINT -> VARCHAR(64)
#   - zhihu_creator:
#       [*] Modified columns:
#         - user_id: INTEGER -> VARCHAR(64)
#   - tieba_note:
#       [*] Modified columns:
#         - publish_time: BIGINT -> VARCHAR(255)
#         - tieba_id: INTEGER -> VARCHAR(255)
#         - note_id: BIGINT -> VARCHAR(644)
# --- End of report ---
# >>> Please confirm: sync the ORM models to the MySQL database? (y/N): y
# Modified column in table kuaishou_video: user_id (type changed to VARCHAR(64))
# Modified column in table xhs_note_comment: comment_id (type changed to VARCHAR(255))
# Modified column in table zhihu_content: created_time (type changed to VARCHAR(32))
# Modified column in table zhihu_content: content_id (type changed to VARCHAR(64))
# Modified column in table zhihu_creator: user_id (type changed to VARCHAR(64))
# Modified column in table tieba_note: publish_time (type changed to VARCHAR(255))
# Modified column in table tieba_note: tieba_id (type changed to VARCHAR(255))
# Modified column in table tieba_note: note_id (type changed to VARCHAR(644))
# MySQL database sync complete.
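
The comparison step works on plain dictionaries of the form `{table: {column: type_string}}`, so it can be tried without a database. Below is a condensed, self-contained sketch of the same set-difference logic; `diff_schemas` and the sample schemas are hypothetical and not part of the script.

```python
from typing import Dict

Schema = Dict[str, Dict[str, str]]


def diff_schemas(db_schema: Schema, orm_schema: Schema) -> dict:
    """Report tables and columns that differ between a database and the ORM models."""
    changed = {}
    for table in sorted(db_schema.keys() & orm_schema.keys()):
        db_cols, orm_cols = db_schema[table], orm_schema[table]
        entry = {
            "added": sorted(orm_cols.keys() - db_cols.keys()),
            "deleted": sorted(db_cols.keys() - orm_cols.keys()),
            "modified": {
                col: (db_cols[col], orm_cols[col])
                for col in sorted(db_cols.keys() & orm_cols.keys())
                if db_cols[col] != orm_cols[col]
            },
        }
        # Keep only tables with at least one column-level difference
        if any(entry.values()):
            changed[table] = entry
    return {
        "added_tables": sorted(orm_schema.keys() - db_schema.keys()),
        "deleted_tables": sorted(db_schema.keys() - orm_schema.keys()),
        "changed_tables": changed,
    }


db = {
    "zhihu_creator": {"user_id": "INTEGER", "legacy_flag": "TEXT"},
    "old_table": {"id": "INTEGER"},
}
orm = {
    "zhihu_creator": {"user_id": "VARCHAR(64)", "nickname": "TEXT"},
    "zhihu_comment": {"comment_id": "VARCHAR(64)"},
}
result = diff_schemas(db, orm)
```

Since types are compared as strings, two types that a dialect renders differently may be reported as modified even when they are equivalent, so the report is worth reading before confirming a sync.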


@@ -0,0 +1,50 @@
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict, List

import aiofiles

from tools.utils import utils


class AsyncFileWriter:
    def __init__(self, platform: str, crawler_type: str):
        # A single lock serializes all writes made through this writer instance
        self.lock = asyncio.Lock()
        self.platform = platform
        self.crawler_type = crawler_type

    def _get_file_path(self, file_type: str, item_type: str) -> str:
        base_path = f"data/{self.platform}/{file_type}"
        pathlib.Path(base_path).mkdir(parents=True, exist_ok=True)
        file_name = f"{self.crawler_type}_{item_type}_{utils.get_current_time()}.{file_type}"
        return os.path.join(base_path, file_name)

    async def write_to_csv(self, item: Dict, item_type: str):
        file_path = self._get_file_path('csv', item_type)
        async with self.lock:
            file_exists = os.path.exists(file_path)
            async with aiofiles.open(file_path, 'a', newline='', encoding='utf-8-sig') as f:
                # csv writer methods return the result of f.write(), which is awaitable for an aiofiles handle
                writer = csv.DictWriter(f, fieldnames=item.keys())
                if not file_exists or await f.tell() == 0:
                    await writer.writeheader()
                await writer.writerow(item)

    async def write_single_item_to_json(self, item: Dict, item_type: str):
        file_path = self._get_file_path('json', item_type)
        async with self.lock:
            existing_data = []
            # Load what is already in the file, tolerating an empty or corrupt file
            if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
                async with aiofiles.open(file_path, 'r', encoding='utf-8') as f:
                    try:
                        content = await f.read()
                        if content:
                            existing_data = json.loads(content)
                        if not isinstance(existing_data, list):
                            existing_data = [existing_data]
                    except json.JSONDecodeError:
                        existing_data = []
            existing_data.append(item)
            # Rewrite the whole file with the new item appended
            async with aiofiles.open(file_path, 'w', encoding='utf-8') as f:
                await f.write(json.dumps(existing_data, ensure_ascii=False, indent=4))
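
`write_to_csv` awaits the result of `csv.DictWriter` calls. That works because `csv` writer methods return whatever the underlying file object's `write()` returns (for `writeheader()` this holds from Python 3.8), and for an async file handle that return value is an awaitable. A minimal sketch of the mechanism follows, with a hypothetical in-memory `FakeAsyncFile` standing in for an `aiofiles` handle.

```python
import asyncio
import csv


class FakeAsyncFile:
    """Hypothetical stand-in for an async file handle: write() is a coroutine."""

    def __init__(self):
        self.chunks = []

    async def write(self, text: str) -> int:
        self.chunks.append(text)
        return len(text)


async def demo() -> str:
    f = FakeAsyncFile()
    writer = csv.DictWriter(f, fieldnames=["comment_id", "content"])
    # Each call returns the coroutine produced by f.write(); nothing is written until it is awaited.
    await writer.writeheader()
    await writer.writerow({"comment_id": "1", "content": "hello"})
    return "".join(f.chunks)


output = asyncio.run(demo())
```

Dropping either `await` would leave the coroutine unexecuted, and the corresponding line would never reach the file.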

2370
uv.lock generated

File diff suppressed because it is too large Load Diff

3
var.py

@@ -15,11 +15,8 @@ from typing import List
import aiomysql
from async_db import AsyncMysqlDB
request_keyword_var: ContextVar[str] = ContextVar("request_keyword", default="")
crawler_type_var: ContextVar[str] = ContextVar("crawler_type", default="")
comment_tasks_var: ContextVar[List[Task]] = ContextVar("comment_tasks", default=[])
media_crawler_db_var: ContextVar[AsyncMysqlDB] = ContextVar("media_crawler_db_var")
db_conn_pool_var: ContextVar[aiomysql.Pool] = ContextVar("db_conn_pool_var")
source_keyword_var: ContextVar[str] = ContextVar("source_keyword", default="")