Mirror of https://github.com/Usagi-org/ai-goofish-monitor.git (synced 2025-11-25 03:15:07 +08:00)
docs(XIANYU_MONITOR_DOCUMENTATION): add external project documentation

Add a CLAUDE.md file to guide work on the code.
CLAUDE.md
Use Chinese when replying to issue or PR comments.
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is an intelligent monitoring robot for Xianyu (Goofish), a Chinese second-hand marketplace. It uses Playwright for web scraping and AI (multimodal LLMs) for intelligent filtering and analysis of listings. The project features a comprehensive web-based management interface built with FastAPI.
Key features:

- Web UI for task management, AI prompt editing, real-time logs and result browsing
- AI-driven task creation using natural language descriptions
- Concurrent multi-task monitoring with independent configurations
- Real-time stream processing of new listings
- Deep AI analysis combining product images/text and seller profiling
- Instant notifications via ntfy.sh, WeChat Work bots, and Bark
- Cron-based task scheduling
- Docker deployment support
- Robust anti-scraping strategies with randomized delays
## Repository Structure

- `web_server.py`: Main FastAPI application with web UI and task management
- `spider_v2.py`: Core spider script that executes monitoring tasks
- `src/`: Core modules
  - `scraper.py`: Main scraping logic and AI integration
  - `ai_handler.py`: AI analysis and notification functions
  - `parsers.py`: Data parsing for search results and user profiles
  - `config.py`: Configuration management and AI client initialization
  - `utils.py`: Utility functions (retry, random sleep, data formatting)
  - `prompt_utils.py`: AI prompt generation for new tasks
  - `task.py`: Task data models and file operations
- `prompts/`: AI prompt templates and criteria files
- `static/` and `templates/`: Web UI frontend files
- `config.json`: Task configurations
- `xianyu_state.json`: Login session state for authenticated scraping
- `.env`: Environment variables for API keys, notification settings, etc.
## Common Development Commands

### Setting up the environment

```bash
# Install dependencies
pip install -r requirements.txt

# Copy and configure environment variables
cp .env.example .env
# Edit .env with your settings
```

### Running the application

```bash
# Start the web management interface
python web_server.py

# Run spider tasks directly (usually managed through the web UI)
python spider_v2.py
```

### Docker deployment

```bash
# Build and start services
docker-compose up --build -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down
```

### Development utilities

```bash
# Generate new AI criteria files from natural language descriptions
python prompt_generator.py
```
## Core Architecture

### Web Management Interface (web_server.py)

- FastAPI application providing the REST API and serving the web UI
- Authentication using HTTP Basic Auth
- Task lifecycle management (create, update, start, stop, delete)
- Real-time log streaming and result browsing
- AI prompt file editing
- System status monitoring
- Scheduler integration for cron-based tasks

### Spider Engine (spider_v2.py)

- Asynchronous task execution using asyncio
- Playwright integration for browser automation
- Multi-task concurrent processing
- Configurable search filters (keywords, price ranges, personal items only)
- Detailed product and seller information extraction

### Scraping & Analysis (src/scraper.py)

- Playwright-based web scraping with anti-detection measures
- User profile scraping (seller/buyer ratings, transaction history)
- Image downloading for AI analysis
- Real-time AI analysis pipeline with retry logic
- Notification sending for recommended items

### AI Processing (src/ai_handler.py)

- Multimodal AI analysis of product images and structured data
- Integration with OpenAI-compatible APIs
- Base64 image encoding for model input
- Response validation and error handling
- Notification services (ntfy, WeChat Work, Bark, Webhook)
### Data Flow

1. Tasks defined in `config.json` are loaded by `spider_v2.py`
2. Playwright performs authenticated searches on Goofish
3. New listings are detected and their detailed information is scraped
4. Product images are downloaded
5. Complete product/seller data is sent to the AI for analysis
6. The AI response determines whether an item meets the criteria
7. Recommended items trigger notifications
8. All results are saved to JSONL files
9. The web UI provides a management interface for all components
## Key Configuration Files

- `.env`: API keys, notification settings, web auth credentials
- `config.json`: Monitoring task definitions with filters and AI prompts
- `xianyu_state.json`: Browser session state for authenticated scraping
- Prompt files in the `prompts/` directory define AI analysis criteria
Development documentation:

@[XIANYU_MONITOR_DOCUMENTATION.md](XIANYU_MONITOR_DOCUMENTATION.md)
XIANYU_MONITOR_DOCUMENTATION.md (new file, 379 lines)
# Xianyu (Goofish) Intelligent Monitoring Robot - External Documentation

## 1. Project Overview

### 1.1 Main Purpose and Features

The Xianyu (Goofish) Intelligent Monitoring Robot is an advanced tool for monitoring and analyzing second-hand listings on Xianyu (Goofish), a popular Chinese second-hand marketplace. It combines web scraping with AI-powered filtering and analysis to help users find items that match their specific criteria.

Key features include:

- **Web UI for task management**: A comprehensive web-based management interface built with FastAPI
- **AI-driven task creation**: Natural language descriptions to automatically generate monitoring tasks
- **Concurrent multi-task monitoring**: Independent configurations for multiple monitoring tasks
- **Real-time stream processing**: Immediate processing of new listings as they appear
- **Deep AI analysis**: Multimodal AI analysis combining product images/text and seller profiling
- **Instant notifications**: Via ntfy.sh, WeChat Work bots, and Bark
- **Cron-based task scheduling**: Automated execution of monitoring tasks
- **Docker deployment support**: Containerized deployment for easy setup
- **Robust anti-scraping strategies**: Randomized delays and behavior simulation
### 1.2 Key Technologies Used

- **FastAPI**: Web framework for the management interface
- **Playwright**: Web scraping and browser automation
- **AI Integration**: Multimodal LLMs for intelligent filtering and analysis
- **AsyncIO**: Asynchronous programming for concurrent task execution
- **Docker**: Containerization for deployment
- **APScheduler**: Task scheduling

### 1.3 Architecture Overview

The system consists of several core components:

1. **Web Management Interface** (`web_server.py`): FastAPI application providing the REST API and serving the web UI
2. **Spider Engine** (`spider_v2.py`): Core spider script that executes monitoring tasks
3. **Scraping & Analysis** (`src/scraper.py`): Main scraping logic and AI integration
4. **AI Processing** (`src/ai_handler.py`): AI analysis and notification functions
5. **Configuration Management** (`src/config.py`): Configuration management and AI client initialization
6. **Task Management** (`src/task.py`): Task data models and file operations
7. **Utility Functions** (`src/utils.py`): Utility functions for retry logic, random sleep, and data formatting
8. **Prompt Utilities** (`src/prompt_utils.py`): AI prompt generation for new tasks
## 2. Core Components

### 2.1 Web Management Interface (web_server.py)

The web management interface is built with FastAPI and provides:

- Authentication using HTTP Basic Auth
- Task lifecycle management (create, update, start, stop, delete)
- Real-time log streaming and result browsing
- AI prompt file editing
- System status monitoring
- Scheduler integration for cron-based tasks

Key features:

- RESTful API endpoints for all management operations
- Real-time log streaming with incremental updates
- Result browsing with pagination, filtering, and sorting
- Prompt file management for AI analysis customization
- Notification settings configuration
- System status monitoring and diagnostics
### 2.2 Spider Engine (spider_v2.py)

The spider engine is responsible for executing monitoring tasks:

- Asynchronous task execution using asyncio
- Playwright integration for browser automation
- Multi-task concurrent processing
- Configurable search filters (keywords, price ranges, personal items only)
- Detailed product and seller information extraction

The spider processes tasks defined in `config.json` and performs the following steps (a concurrency sketch follows this list):

1. Load task configurations
2. Navigate to search results on Goofish
3. Apply filters (newest first, personal items only, price range)
4. Process search results across multiple pages
5. Extract detailed product information
6. Collect seller profile information
7. Pass data to AI for analysis
8. Send notifications for recommended items
9. Save results to JSONL files
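The concurrent execution model can be pictured with a minimal asyncio sketch. This is an illustration only, not the project's actual code: `run_task` is a hypothetical stand-in for the real per-task pipeline, and it assumes `config.json` holds a JSON array of task objects like the example in section 4.2.

```python
# Illustrative sketch (not the project's actual code): run tasks from
# config.json concurrently, capping parallelism with a semaphore.
import asyncio
import json


async def run_task(task: dict) -> None:
    # Placeholder for the real per-task pipeline (search, scrape, AI analysis).
    print(f"running task: {task.get('task_name')}")
    await asyncio.sleep(1)  # simulate work


async def main(max_concurrency: int = 3) -> None:
    with open("config.json", encoding="utf-8") as f:
        tasks = json.load(f)

    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(task: dict) -> None:
        async with sem:
            await run_task(task)

    # Only enabled tasks are scheduled; each runs independently.
    await asyncio.gather(*(bounded(t) for t in tasks if t.get("enabled")))


if __name__ == "__main__":
    asyncio.run(main())
```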
### 2.3 Scraping & AI Analysis (src/scraper.py)

This module handles the core scraping functionality:

- Playwright-based web scraping with anti-detection measures
- User profile scraping (seller/buyer ratings, transaction history)
- Image downloading for AI analysis
- Real-time AI analysis pipeline with retry logic
- Notification sending for recommended items

Key functions:

- `scrape_user_profile()`: Collects comprehensive seller information
- `scrape_xianyu()`: Main scraping function that orchestrates the entire process
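A minimal async Playwright sketch of the authenticated-scraping idea: reuse the saved session state so the browser starts already logged in. The search URL and the crude wait are placeholders, not the project's real endpoints or logic.

```python
# Sketch only: open a browser with the saved login state and visit a search page.
# The URL below is a placeholder; the real search/API handling differs.
import asyncio

from playwright.async_api import async_playwright


async def demo_search(keyword: str) -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        # Reuse cookies/localStorage captured during a previous login.
        context = await browser.new_context(storage_state="xianyu_state.json")
        page = await context.new_page()
        await page.goto(f"https://www.goofish.com/search?q={keyword}")
        await page.wait_for_timeout(3000)  # crude wait; real code inspects responses
        print(await page.title())
        await browser.close()


if __name__ == "__main__":
    asyncio.run(demo_search("macbook air m1"))
```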
### 2.4 AI Processing (src/ai_handler.py)

The AI processing module handles:

- Multimodal AI analysis of product images and structured data
- Integration with OpenAI-compatible APIs
- Base64 image encoding for model input
- Response validation and error handling
- Notification services (ntfy, WeChat Work, Bark, Webhook)

Key functions:

- `download_all_images()`: Downloads product images for AI analysis
- `get_ai_analysis()`: Sends product data and images to AI for analysis
- `send_ntfy_notification()`: Sends notifications when items are recommended
- `encode_image_to_base64()`: Encodes images for AI processing
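The multimodal-analysis idea can be sketched with the `openai` Python package pointed at an OpenAI-compatible endpoint: encode an image as base64 and send it together with the product data. This is not the project's actual implementation; the prompt text, the `analyze` helper, and the assumption that the model replies in JSON (as enforced by the base prompt) are illustrative.

```python
# Sketch of multimodal analysis against an OpenAI-compatible chat endpoint.
import base64
import json
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_BASE_URL"),
)


def encode_image_to_base64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def analyze(product: dict, image_path: str) -> dict:
    image_b64 = encode_image_to_base64(image_path)
    response = client.chat.completions.create(
        model=os.environ.get("OPENAI_MODEL_NAME", "gpt-4o-mini"),
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Evaluate this listing:\n" + json.dumps(product, ensure_ascii=False)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    # Assumes the base prompt instructs the model to answer with JSON.
    return json.loads(response.choices[0].message.content)
```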
### 2.5 Configuration Management (src/config.py)

Handles all configuration aspects:

- Environment variable loading with dotenv
- AI client initialization
- File path management
- API URL patterns
- HTTP headers configuration

Key configuration elements:

- AI model settings (API key, base URL, model name)
- Notification service URLs
- Browser settings (headless mode, browser type)
- Debug and feature flags
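A minimal sketch of the dotenv-plus-client-initialization pattern. Variable names follow the `.env` keys listed in section 4.1; the default model name and the single shared `ai_client` are assumptions, not the project's exact code.

```python
# Sketch of dotenv-based configuration loading and AI client initialization.
import os

from dotenv import load_dotenv
from openai import AsyncOpenAI

load_dotenv()  # read .env into the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
OPENAI_BASE_URL = os.getenv("OPENAI_BASE_URL")
OPENAI_MODEL_NAME = os.getenv("OPENAI_MODEL_NAME", "gpt-4o-mini")
RUN_HEADLESS = os.getenv("RUN_HEADLESS", "true").lower() == "true"

# One shared async client for all AI calls; None if no key is configured.
ai_client = (
    AsyncOpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)
    if OPENAI_API_KEY
    else None
)
```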
### 2.6 Task Management (src/task.py)

Manages task data models and persistence:

- Task data models using Pydantic
- File-based storage in `config.json`
- CRUD operations for tasks
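A sketch of a Pydantic (v2) task model mirroring the fields shown in the `config.json` example in section 4.2. Fields, defaults, and the `load_tasks`/`save_tasks` helpers beyond that example are illustrative, not the project's actual schema.

```python
# Sketch of a Pydantic task model with simple config.json load/save helpers.
import json
from typing import Optional

from pydantic import BaseModel


class Task(BaseModel):
    task_name: str
    enabled: bool = True
    keyword: str
    max_pages: int = 3
    personal_only: bool = False
    min_price: Optional[str] = None
    max_price: Optional[str] = None
    cron: Optional[str] = None
    ai_prompt_base_file: str = "prompts/base_prompt.txt"
    ai_prompt_criteria_file: Optional[str] = None
    is_running: bool = False


def load_tasks(path: str = "config.json") -> list[Task]:
    with open(path, encoding="utf-8") as f:
        return [Task(**item) for item in json.load(f)]


def save_tasks(tasks: list[Task], path: str = "config.json") -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump([t.model_dump() for t in tasks], f, ensure_ascii=False, indent=2)
```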
### 2.7 Utility Functions (src/utils.py)

Provides common utility functions:

- Retry mechanism with exponential backoff
- Safe nested dictionary access
- Random sleep for anti-detection
- URL conversion utilities
- Data formatting and saving functions
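The retry-with-exponential-backoff and random-sleep ideas look roughly like the following sketch. The decorator and function names are illustrative, not the project's actual API.

```python
# Sketch of retry with exponential backoff and a randomized anti-detection sleep.
import asyncio
import random
from functools import wraps


async def random_sleep(min_s: float = 2.0, max_s: float = 5.0) -> None:
    """Sleep a random amount of time to make request timing less predictable."""
    await asyncio.sleep(random.uniform(min_s, max_s))


def retry_async(attempts: int = 3, base_delay: float = 1.0):
    """Retry an async function, doubling the delay after each failure."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise
                    await asyncio.sleep(delay)
                    delay *= 2  # exponential backoff
        return wrapper
    return decorator
```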
### 2.8 Prompt Utilities (src/prompt_utils.py)

Handles AI prompt generation and management:

- AI-powered criteria generation from natural language descriptions
- Prompt template management
- Configuration file updates
## 3. Key Features

### 3.1 Web UI for Task Management

The web interface provides a comprehensive management dashboard:

- Task creation and configuration
- Real-time log viewing
- Result browsing with filtering and sorting
- Prompt file editing
- System status monitoring
- Notification settings management

### 3.2 AI-Driven Task Creation

Users can create monitoring tasks using natural language descriptions:

- Describe what you're looking for in plain language
- AI automatically generates complex filtering criteria
- Customizable analysis standards for different product types

### 3.3 Concurrent Multi-Task Monitoring

Supports running multiple monitoring tasks simultaneously:

- Each task has an independent configuration
- Tasks run concurrently without interference
- Resource management to prevent overload

### 3.4 Real-Time Stream Processing

Processes new listings as they appear:

- Immediate analysis of new items
- Real-time notifications
- Continuous monitoring without batch delays
### 3.5 AI Analysis with Multimodal LLMs

Advanced AI analysis combining:

- Product image analysis
- Structured product data evaluation
- Seller profile assessment
- Comprehensive recommendation scoring
### 3.6 Notification Systems

Multiple notification channels:

- **ntfy.sh**: Push notifications to mobile devices
- **WeChat Work bots**: Enterprise messaging integration
- **Bark**: iOS/macOS push notifications
- **Webhook**: Generic webhook support for custom integrations
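For example, an ntfy.sh push boils down to an HTTP POST to the topic URL. This sketch is illustrative, not the project's `send_ntfy_notification()` implementation; `NTFY_TOPIC_URL` is the `.env` key listed in section 4.1, and the `Title`/`Click` headers are standard ntfy options.

```python
# Sketch of an ntfy.sh push: POST the message body to the topic URL.
import os
from typing import Optional

import requests


def send_ntfy_notification(title: str, message: str, link: Optional[str] = None) -> None:
    topic_url = os.environ["NTFY_TOPIC_URL"]  # e.g. https://ntfy.sh/your-topic
    headers = {"Title": title}  # non-ASCII titles may need extra encoding per ntfy docs
    if link:
        headers["Click"] = link  # tapping the notification opens the listing
    requests.post(topic_url, data=message.encode("utf-8"), headers=headers, timeout=10)
```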
### 3.7 Cron-Based Scheduling

Flexible task scheduling using cron expressions:

- Each task can have its own schedule
- Supports complex scheduling patterns
- Automatic task execution
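A minimal sketch of cron scheduling with APScheduler (listed in section 1.2). The job body is a placeholder; in the real application the scheduler launches the spider for the given task rather than printing.

```python
# Sketch of cron-based scheduling with APScheduler's asyncio scheduler.
import asyncio

from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.triggers.cron import CronTrigger


async def run_task_by_name(task_name: str) -> None:
    print(f"would start spider for task: {task_name}")


async def main() -> None:
    scheduler = AsyncIOScheduler()
    # "3 12 * * *" = every day at 12:03, matching the config.json example.
    scheduler.add_job(
        run_task_by_name,
        CronTrigger.from_crontab("3 12 * * *"),
        args=["MacBook Air M1"],
    )
    scheduler.start()
    await asyncio.Event().wait()  # keep the event loop alive


if __name__ == "__main__":
    asyncio.run(main())
```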
### 3.8 Docker Deployment Support

Containerized deployment for easy setup:

- Dockerfile for building images
- docker-compose configuration
- Environment variable configuration
- Volume mounting for persistent data

### 3.9 Anti-Scraping Strategies

Robust anti-detection measures:

- Randomized delays between actions
- Behavior simulation to mimic human users
- Headless browser detection avoidance
- Session management with state preservation

## 4. Configuration Files
### 4.1 .env File for Environment Variables

The `.env` file contains all configuration settings:

- AI model configuration (API key, base URL, model name)
- Notification service URLs and credentials
- Browser settings (headless mode, browser type)
- Debug and feature flags
- Web interface authentication credentials

Key environment variables:

- `OPENAI_API_KEY`: API key for the AI service
- `OPENAI_BASE_URL`: Base URL for the AI service
- `OPENAI_MODEL_NAME`: Name of the AI model to use
- `NTFY_TOPIC_URL`: URL for ntfy.sh notifications
- `WX_BOT_URL`: WeChat Work bot URL
- `BARK_URL`: Bark notification URL
- `RUN_HEADLESS`: Whether to run the browser in headless mode
- `WEB_USERNAME`/`WEB_PASSWORD`: Web interface credentials
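An illustrative `.env` snippet built from the variables above. Every value is a placeholder (including the model name and the base URL); consult `.env.example` in the repository for the authoritative keys and defaults.

```bash
# Illustrative values only; replace each placeholder with your own settings.
OPENAI_API_KEY=sk-your-key-here
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL_NAME=gpt-4o
NTFY_TOPIC_URL=https://ntfy.sh/your-topic
WX_BOT_URL=
BARK_URL=
RUN_HEADLESS=true
WEB_USERNAME=admin
WEB_PASSWORD=change-me
```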
### 4.2 config.json for Task Definitions

The `config.json` file defines all monitoring tasks:

- Task name and enabled status
- Search keywords and filters
- Page limits and personal item preferences
- Price range filters
- Cron scheduling expressions
- AI prompt file references

Example task configuration:

```json
{
  "task_name": "MacBook Air M1",
  "enabled": true,
  "keyword": "macbook air m1",
  "max_pages": 5,
  "personal_only": true,
  "min_price": "3000",
  "max_price": "5000",
  "cron": "3 12 * * *",
  "ai_prompt_base_file": "prompts/base_prompt.txt",
  "ai_prompt_criteria_file": "prompts/macbook_criteria.txt",
  "is_running": false
}
```
### 4.3 xianyu_state.json for Login Session State

The `xianyu_state.json` file stores browser session state for authenticated scraping:

- Cookie information for logged-in sessions
- Browser storage state
- Authentication tokens

This file is generated by the login process and is essential for accessing personalized content on Xianyu.
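Such a state file is typically produced with Playwright by logging in manually in a visible browser and then saving the storage state. This is a generic sketch, not the project's actual login script; the URL and the interactive prompt are assumptions.

```python
# Sketch: capture an authenticated browser session into xianyu_state.json.
import asyncio

from playwright.async_api import async_playwright


async def capture_login_state() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://www.goofish.com")
        input("Log in in the browser window, then press Enter here...")
        # Persist cookies and localStorage for later headless runs.
        await context.storage_state(path="xianyu_state.json")
        await browser.close()


if __name__ == "__main__":
    asyncio.run(capture_login_state())
```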
### 4.4 Prompt Files in prompts/ Directory

The `prompts/` directory contains AI prompt templates:

- `base_prompt.txt`: Base prompt template with output format
- `*_criteria.txt`: Product-specific analysis criteria

The prompt system uses a two-part approach (sketched below):

1. **Base Prompt**: Defines the structure and output format for AI responses
2. **Criteria Files**: Product-specific analysis criteria that are inserted into the base prompt
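A minimal sketch of the two-part idea: load the base template and the per-product criteria file, then combine them into one prompt. How the real code splices the criteria into the template is an assumption; a simple append is shown here, and `build_prompt` is an illustrative name.

```python
# Sketch: combine the base prompt template with product-specific criteria.
from pathlib import Path


def build_prompt(base_file: str, criteria_file: str) -> str:
    base = Path(base_file).read_text(encoding="utf-8")
    criteria = Path(criteria_file).read_text(encoding="utf-8")
    return f"{base}\n\n[Analysis criteria]\n{criteria}"


if __name__ == "__main__":
    prompt = build_prompt("prompts/base_prompt.txt", "prompts/macbook_criteria.txt")
    print(prompt[:200])
```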
## 5. Deployment and Usage

### 5.1 Environment Setup

1. Clone the repository:

```bash
git clone https://github.com/dingyufei615/ai-goofish-monitor
cd ai-goofish-monitor
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Configure environment variables by copying `.env.example` to `.env` and editing the values:

```bash
cp .env.example .env
```

### 5.2 Running the Application

1. Start the web management interface:

```bash
python web_server.py
```

2. Access the web interface at `http://127.0.0.1:8000`

3. Configure tasks through the web interface or by editing `config.json` directly

### 5.3 Docker Deployment

1. Build and start services:

```bash
docker-compose up --build -d
```

2. View logs:

```bash
docker-compose logs -f
```

3. Stop services:

```bash
docker-compose down
```

### 5.4 Development Utilities

Generate new AI criteria files from natural language descriptions:

```bash
python prompt_generator.py
```
## 6. Data Flow

1. Tasks defined in `config.json` are loaded by `spider_v2.py`
2. Playwright performs authenticated searches on Goofish
3. New listings are detected and their detailed information is scraped
4. Product images are downloaded
5. Complete product/seller data is sent to the AI for analysis
6. The AI response determines whether an item meets the criteria
7. Recommended items trigger notifications
8. All results are saved to JSONL files
9. The web UI provides a management interface for all components
## 7. Security and Authentication

The web interface uses HTTP Basic Auth for protection:

- Configurable username and password via environment variables
- All API endpoints and static resources require authentication
- The health check endpoint (`/health`) is publicly accessible
- Default credentials are `admin`/`admin123` (should be changed in production)
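The Basic Auth pattern described above looks roughly like the following FastAPI sketch. Credentials are read from the env vars in section 4.1; the protected route name and the `require_auth` dependency are illustrative, not the application's actual endpoints.

```python
# Sketch of HTTP Basic Auth protection in FastAPI with a public /health endpoint.
import os
import secrets

from fastapi import Depends, FastAPI, HTTPException, status
from fastapi.security import HTTPBasic, HTTPBasicCredentials

app = FastAPI()
security = HTTPBasic()


def require_auth(credentials: HTTPBasicCredentials = Depends(security)) -> str:
    ok_user = secrets.compare_digest(credentials.username, os.getenv("WEB_USERNAME", "admin"))
    ok_pass = secrets.compare_digest(credentials.password, os.getenv("WEB_PASSWORD", "admin123"))
    if not (ok_user and ok_pass):
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid credentials",
            headers={"WWW-Authenticate": "Basic"},
        )
    return credentials.username


@app.get("/health")
def health() -> dict:
    # Public endpoint, no authentication required.
    return {"status": "ok"}


@app.get("/api/tasks")
def list_tasks(user: str = Depends(require_auth)) -> list:
    # Protected endpoint; the real route names may differ.
    return []
```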
## 8. Error Handling and Recovery

The system includes robust error handling:

- Retry mechanisms for network failures
- Graceful degradation when AI services are unavailable
- Automatic cleanup of temporary files
- Detailed logging for troubleshooting
- Recovery from browser automation errors

## 9. Performance Considerations

To maintain good performance and avoid detection:

- Randomized delays between actions
- Concurrent task processing with resource limits
- Efficient image handling with temporary file cleanup
- Asynchronous processing to maximize throughput
- Proper session management to reduce login requirements

This documentation provides a comprehensive overview of the Xianyu Intelligent Monitoring Robot, covering its architecture, components, features, and usage instructions. The system is designed to be flexible, extensible, and user-friendly while providing powerful monitoring and analysis capabilities for second-hand marketplace listings.