Mirror of https://github.com/Usagi-org/ai-goofish-monitor.git (synced 2025-11-25 03:15:07 +08:00)
docs(XIANYU_MONITOR_DOCUMENTATION): add external project documentation

Add a CLAUDE.md file to guide work on the code.
CLAUDE.md
Use Chinese when replying to issue or PR comments.
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is an intelligent monitoring robot for Xianyu (Goofish), a Chinese second-hand marketplace. It uses Playwright for web scraping and AI (multimodal LLMs) for intelligent filtering and analysis of listings. The project features a comprehensive web-based management interface built with FastAPI.
Key features:

- Web UI for task management, AI prompt editing, real-time logs and result browsing
- AI-driven task creation using natural language descriptions
- Concurrent multi-task monitoring with independent configurations
- Real-time stream processing of new listings
- Deep AI analysis combining product images/text and seller profiling
- Instant notifications via ntfy.sh, WeChat Work bots, and Bark
- Cron-based task scheduling
- Docker deployment support
- Robust anti-scraping strategies with randomized delays
## Repository Structure

- `web_server.py`: Main FastAPI application with web UI and task management
- `spider_v2.py`: Core spider script that executes monitoring tasks
- `src/`: Core modules
  - `scraper.py`: Main scraping logic and AI integration
  - `ai_handler.py`: AI analysis and notification functions
  - `parsers.py`: Data parsing for search results and user profiles
  - `config.py`: Configuration management and AI client initialization
  - `utils.py`: Utility functions (retry, random sleep, data formatting)
  - `prompt_utils.py`: AI prompt generation for new tasks
  - `task.py`: Task data models and file operations
- `prompts/`: AI prompt templates and criteria files
- `static/` and `templates/`: Web UI frontend files
- `config.json`: Task configurations
- `xianyu_state.json`: Login session state for authenticated scraping
- `.env`: Environment variables for API keys, notification settings, etc.
## Common Development Commands

### Setting up the environment

```bash
# Install dependencies
pip install -r requirements.txt

# Copy and configure environment variables
cp .env.example .env
# Edit .env with your settings
```

### Running the application

```bash
# Start the web management interface
python web_server.py

# Run spider tasks directly (usually managed through the web UI)
python spider_v2.py
```

### Docker deployment

```bash
# Build and start services
docker-compose up --build -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down
```

### Development utilities

```bash
# Generate new AI criteria files from natural language descriptions
python prompt_generator.py
```
## Core Architecture

### Web Management Interface (web_server.py)

- FastAPI application providing the REST API and serving the web UI
- Authentication using HTTP Basic Auth
- Task lifecycle management (create, update, start, stop, delete)
- Real-time log streaming and result browsing
- AI prompt file editing
- System status monitoring
- Scheduler integration for cron-based tasks

### Spider Engine (spider_v2.py)

- Asynchronous task execution using asyncio
- Playwright integration for browser automation
- Multi-task concurrent processing
- Configurable search filters (keywords, price ranges, personal items only)
- Detailed product and seller information extraction

### Scraping & Analysis (src/scraper.py)

- Playwright-based web scraping with anti-detection measures
- User profile scraping (seller/buyer ratings, transaction history)
- Image downloading for AI analysis
- Real-time AI analysis pipeline with retry logic
- Notification sending for recommended items

### AI Processing (src/ai_handler.py)

- Multimodal AI analysis of product images and structured data
- Integration with OpenAI-compatible APIs
- Base64 image encoding for model input
- Response validation and error handling
- Notification services (ntfy, WeChat Work, Bark, Webhook)
### Data Flow

1. Tasks defined in `config.json` are loaded by `spider_v2.py`
2. Playwright performs authenticated searches on Goofish
3. New listings are detected and their detailed information is scraped
4. Product images are downloaded
5. Complete product/seller data is sent to the AI for analysis
6. The AI response determines whether an item meets the criteria
7. Recommended items trigger notifications
8. All results are saved to JSONL files
9. The web UI provides a management interface for all components
## Key Configuration Files

- `.env`: API keys, notification settings, web auth credentials
- `config.json`: Monitoring task definitions with filters and AI prompts
- `xianyu_state.json`: Browser session state for authenticated scraping
- Prompt files in the `prompts/` directory define AI analysis criteria
Development documentation:

@[XIANYU_MONITOR_DOCUMENTATION.md](XIANYU_MONITOR_DOCUMENTATION.md)
XIANYU_MONITOR_DOCUMENTATION.md (new file, 379 lines)
# Xianyu (Goofish) Intelligent Monitoring Robot - External Documentation

## 1. Project Overview

### 1.1 Main Purpose and Features

The Xianyu (Goofish) Intelligent Monitoring Robot is an advanced tool for monitoring and analyzing second-hand listings on Xianyu (Goofish), a popular Chinese second-hand marketplace. It combines web scraping with AI-powered filtering and analysis to help users find items that match their specific criteria.

Key features include:

- **Web UI for task management**: A comprehensive web-based management interface built with FastAPI
- **AI-driven task creation**: Natural language descriptions to automatically generate monitoring tasks
- **Concurrent multi-task monitoring**: Independent configurations for multiple monitoring tasks
- **Real-time stream processing**: Immediate processing of new listings as they appear
- **Deep AI analysis**: Multimodal AI analysis combining product images/text and seller profiling
- **Instant notifications**: Via ntfy.sh, WeChat Work bots, and Bark
- **Cron-based task scheduling**: Automated execution of monitoring tasks
- **Docker deployment support**: Containerized deployment for easy setup
- **Robust anti-scraping strategies**: Randomized delays and behavior simulation
### 1.2 Key Technologies Used

- **FastAPI**: Web framework for the management interface
- **Playwright**: Web scraping and browser automation
- **AI Integration**: Multimodal LLMs for intelligent filtering and analysis
- **AsyncIO**: Asynchronous programming for concurrent task execution
- **Docker**: Containerization for deployment
- **APScheduler**: Task scheduling

### 1.3 Architecture Overview

The system consists of several core components:

1. **Web Management Interface** (`web_server.py`): FastAPI application providing the REST API and serving the web UI
2. **Spider Engine** (`spider_v2.py`): Core spider script that executes monitoring tasks
3. **Scraping & Analysis** (`src/scraper.py`): Main scraping logic and AI integration
4. **AI Processing** (`src/ai_handler.py`): AI analysis and notification functions
5. **Configuration Management** (`src/config.py`): Configuration management and AI client initialization
6. **Task Management** (`src/task.py`): Task data models and file operations
7. **Utility Functions** (`src/utils.py`): Utility functions for retry logic, random sleep, and data formatting
8. **Prompt Utilities** (`src/prompt_utils.py`): AI prompt generation for new tasks
## 2. Core Components

### 2.1 Web Management Interface (web_server.py)

The web management interface is built with FastAPI and provides:

- Authentication using HTTP Basic Auth
- Task lifecycle management (create, update, start, stop, delete)
- Real-time log streaming and result browsing
- AI prompt file editing
- System status monitoring
- Scheduler integration for cron-based tasks

Key features:

- RESTful API endpoints for all management operations
- Real-time log streaming with incremental updates
- Result browsing with pagination, filtering, and sorting
- Prompt file management for AI analysis customization
- Notification settings configuration
- System status monitoring and diagnostics
### 2.2 Spider Engine (spider_v2.py)

The spider engine is responsible for executing monitoring tasks:

- Asynchronous task execution using asyncio
- Playwright integration for browser automation
- Multi-task concurrent processing
- Configurable search filters (keywords, price ranges, personal items only)
- Detailed product and seller information extraction

The spider processes tasks defined in `config.json` and performs the following steps (a concurrency sketch follows this list):

1. Load task configurations
2. Navigate to search results on Goofish
3. Apply filters (newest first, personal items only, price range)
4. Process search results across multiple pages
5. Extract detailed product information
6. Collect seller profile information
7. Pass data to AI for analysis
8. Send notifications for recommended items
9. Save results to JSONL files
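The concurrent execution model can be pictured with a minimal asyncio sketch. This is an illustration only, not the project's actual code: `run_task` is a hypothetical stand-in for the real per-task pipeline, and it assumes `config.json` holds a JSON array of task objects like the example in section 4.2.

```python
# Illustrative sketch (not the project's actual code): run tasks from
# config.json concurrently, capping parallelism with a semaphore.
import asyncio
import json


async def run_task(task: dict) -> None:
    # Placeholder for the real per-task pipeline (search, scrape, AI analysis).
    print(f"running task: {task.get('task_name')}")
    await asyncio.sleep(1)  # simulate work


async def main(max_concurrency: int = 3) -> None:
    with open("config.json", encoding="utf-8") as f:
        tasks = json.load(f)

    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(task: dict) -> None:
        async with sem:
            await run_task(task)

    # Only enabled tasks are scheduled; each runs independently.
    await asyncio.gather(*(bounded(t) for t in tasks if t.get("enabled")))


if __name__ == "__main__":
    asyncio.run(main())
```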
### 2.3 Scraping & AI Analysis (src/scraper.py)

This module handles the core scraping functionality:

- Playwright-based web scraping with anti-detection measures
- User profile scraping (seller/buyer ratings, transaction history)
- Image downloading for AI analysis
- Real-time AI analysis pipeline with retry logic
- Notification sending for recommended items

Key functions:

- `scrape_user_profile()`: Collects comprehensive seller information
- `scrape_xianyu()`: Main scraping function that orchestrates the entire process
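A minimal async Playwright sketch of the authenticated-scraping idea: reuse the saved session state so the browser starts already logged in. The search URL and the crude wait are placeholders, not the project's real endpoints or logic.

```python
# Sketch only: open a browser with the saved login state and visit a search page.
# The URL below is a placeholder; the real search/API handling differs.
import asyncio

from playwright.async_api import async_playwright


async def demo_search(keyword: str) -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        # Reuse cookies/localStorage captured during a previous login.
        context = await browser.new_context(storage_state="xianyu_state.json")
        page = await context.new_page()
        await page.goto(f"https://www.goofish.com/search?q={keyword}")
        await page.wait_for_timeout(3000)  # crude wait; real code inspects responses
        print(await page.title())
        await browser.close()


if __name__ == "__main__":
    asyncio.run(demo_search("macbook air m1"))
```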
### 2.4 AI Processing (src/ai_handler.py)

The AI processing module handles:

- Multimodal AI analysis of product images and structured data
- Integration with OpenAI-compatible APIs
- Base64 image encoding for model input
- Response validation and error handling
- Notification services (ntfy, WeChat Work, Bark, Webhook)

Key functions:

- `download_all_images()`: Downloads product images for AI analysis
- `get_ai_analysis()`: Sends product data and images to AI for analysis
- `send_ntfy_notification()`: Sends notifications when items are recommended
- `encode_image_to_base64()`: Encodes images for AI processing
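The multimodal-analysis idea can be sketched with the `openai` Python package pointed at an OpenAI-compatible endpoint: encode an image as base64 and send it together with the product data. This is not the project's actual implementation; the prompt text, the `analyze` helper, and the assumption that the model replies in JSON (as enforced by the base prompt) are illustrative.

```python
# Sketch of multimodal analysis against an OpenAI-compatible chat endpoint.
import base64
import json
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_BASE_URL"),
)


def encode_image_to_base64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def analyze(product: dict, image_path: str) -> dict:
    image_b64 = encode_image_to_base64(image_path)
    response = client.chat.completions.create(
        model=os.environ.get("OPENAI_MODEL_NAME", "gpt-4o-mini"),
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Evaluate this listing:\n" + json.dumps(product, ensure_ascii=False)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    # Assumes the base prompt instructs the model to answer with JSON.
    return json.loads(response.choices[0].message.content)
```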
### 2.5 Configuration Management (src/config.py)

Handles all configuration aspects:

- Environment variable loading with dotenv
- AI client initialization
- File path management
- API URL patterns
- HTTP headers configuration

Key configuration elements:

- AI model settings (API key, base URL, model name)
- Notification service URLs
- Browser settings (headless mode, browser type)
- Debug and feature flags
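A minimal sketch of the dotenv-plus-client-initialization pattern. Variable names follow the `.env` keys listed in section 4.1; the default model name and the single shared `ai_client` are assumptions, not the project's exact code.

```python
# Sketch of dotenv-based configuration loading and AI client initialization.
import os

from dotenv import load_dotenv
from openai import AsyncOpenAI

load_dotenv()  # read .env into the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
OPENAI_BASE_URL = os.getenv("OPENAI_BASE_URL")
OPENAI_MODEL_NAME = os.getenv("OPENAI_MODEL_NAME", "gpt-4o-mini")
RUN_HEADLESS = os.getenv("RUN_HEADLESS", "true").lower() == "true"

# One shared async client for all AI calls; None if no key is configured.
ai_client = (
    AsyncOpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)
    if OPENAI_API_KEY
    else None
)
```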
### 2.6 Task Management (src/task.py)

Manages task data models and persistence:

- Task data models using Pydantic
- File-based storage in `config.json`
- CRUD operations for tasks
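A sketch of a Pydantic (v2) task model mirroring the fields shown in the `config.json` example in section 4.2. Fields, defaults, and the `load_tasks`/`save_tasks` helpers beyond that example are illustrative, not the project's actual schema.

```python
# Sketch of a Pydantic task model with simple config.json load/save helpers.
import json
from typing import Optional

from pydantic import BaseModel


class Task(BaseModel):
    task_name: str
    enabled: bool = True
    keyword: str
    max_pages: int = 3
    personal_only: bool = False
    min_price: Optional[str] = None
    max_price: Optional[str] = None
    cron: Optional[str] = None
    ai_prompt_base_file: str = "prompts/base_prompt.txt"
    ai_prompt_criteria_file: Optional[str] = None
    is_running: bool = False


def load_tasks(path: str = "config.json") -> list[Task]:
    with open(path, encoding="utf-8") as f:
        return [Task(**item) for item in json.load(f)]


def save_tasks(tasks: list[Task], path: str = "config.json") -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump([t.model_dump() for t in tasks], f, ensure_ascii=False, indent=2)
```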
### 2.7 Utility Functions (src/utils.py)

Provides common utility functions:

- Retry mechanism with exponential backoff
- Safe nested dictionary access
- Random sleep for anti-detection
- URL conversion utilities
- Data formatting and saving functions
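The retry-with-exponential-backoff and random-sleep ideas look roughly like the following sketch. The decorator and function names are illustrative, not the project's actual API.

```python
# Sketch of retry with exponential backoff and a randomized anti-detection sleep.
import asyncio
import random
from functools import wraps


async def random_sleep(min_s: float = 2.0, max_s: float = 5.0) -> None:
    """Sleep a random amount of time to make request timing less predictable."""
    await asyncio.sleep(random.uniform(min_s, max_s))


def retry_async(attempts: int = 3, base_delay: float = 1.0):
    """Retry an async function, doubling the delay after each failure."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise
                    await asyncio.sleep(delay)
                    delay *= 2  # exponential backoff
        return wrapper
    return decorator
```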
### 2.8 Prompt Utilities (src/prompt_utils.py)

Handles AI prompt generation and management:

- AI-powered criteria generation from natural language descriptions
- Prompt template management
- Configuration file updates
## 3. Key Features

### 3.1 Web UI for Task Management

The web interface provides a comprehensive management dashboard:

- Task creation and configuration
- Real-time log viewing
- Result browsing with filtering and sorting
- Prompt file editing
- System status monitoring
- Notification settings management

### 3.2 AI-Driven Task Creation

Users can create monitoring tasks using natural language descriptions:

- Describe what you're looking for in plain language
- AI automatically generates complex filtering criteria
- Customizable analysis standards for different product types

### 3.3 Concurrent Multi-Task Monitoring

Supports running multiple monitoring tasks simultaneously:

- Each task has an independent configuration
- Tasks run concurrently without interference
- Resource management to prevent overload

### 3.4 Real-Time Stream Processing

Processes new listings as they appear:

- Immediate analysis of new items
- Real-time notifications
- Continuous monitoring without batch delays
### 3.5 AI Analysis with Multimodal LLMs

Advanced AI analysis combining:

- Product image analysis
- Structured product data evaluation
- Seller profile assessment
- Comprehensive recommendation scoring
### 3.6 Notification Systems

Multiple notification channels:

- **ntfy.sh**: Push notifications to mobile devices
- **WeChat Work bots**: Enterprise messaging integration
- **Bark**: iOS/macOS push notifications
- **Webhook**: Generic webhook support for custom integrations
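For example, an ntfy.sh push boils down to an HTTP POST to the topic URL. This sketch is illustrative, not the project's `send_ntfy_notification()` implementation; `NTFY_TOPIC_URL` is the `.env` key listed in section 4.1, and the `Title`/`Click` headers are standard ntfy options.

```python
# Sketch of an ntfy.sh push: POST the message body to the topic URL.
import os
from typing import Optional

import requests


def send_ntfy_notification(title: str, message: str, link: Optional[str] = None) -> None:
    topic_url = os.environ["NTFY_TOPIC_URL"]  # e.g. https://ntfy.sh/your-topic
    headers = {"Title": title}  # non-ASCII titles may need extra encoding per ntfy docs
    if link:
        headers["Click"] = link  # tapping the notification opens the listing
    requests.post(topic_url, data=message.encode("utf-8"), headers=headers, timeout=10)
```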
### 3.7 Cron-Based Scheduling

Flexible task scheduling using cron expressions:

- Each task can have its own schedule
- Supports complex scheduling patterns
- Automatic task execution
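A minimal sketch of cron scheduling with APScheduler (listed in section 1.2). The job body is a placeholder; in the real application the scheduler launches the spider for the given task rather than printing.

```python
# Sketch of cron-based scheduling with APScheduler's asyncio scheduler.
import asyncio

from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.triggers.cron import CronTrigger


async def run_task_by_name(task_name: str) -> None:
    print(f"would start spider for task: {task_name}")


async def main() -> None:
    scheduler = AsyncIOScheduler()
    # "3 12 * * *" = every day at 12:03, matching the config.json example.
    scheduler.add_job(
        run_task_by_name,
        CronTrigger.from_crontab("3 12 * * *"),
        args=["MacBook Air M1"],
    )
    scheduler.start()
    await asyncio.Event().wait()  # keep the event loop alive


if __name__ == "__main__":
    asyncio.run(main())
```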
### 3.8 Docker Deployment Support

Containerized deployment for easy setup:

- Dockerfile for building images
- docker-compose configuration
- Environment variable configuration
- Volume mounting for persistent data

### 3.9 Anti-Scraping Strategies

Robust anti-detection measures:

- Randomized delays between actions
- Behavior simulation to mimic human users
- Headless browser detection avoidance
- Session management with state preservation

## 4. Configuration Files
### 4.1 .env File for Environment Variables

The `.env` file contains all configuration settings:

- AI model configuration (API key, base URL, model name)
- Notification service URLs and credentials
- Browser settings (headless mode, browser type)
- Debug and feature flags
- Web interface authentication credentials

Key environment variables:

- `OPENAI_API_KEY`: API key for the AI service
- `OPENAI_BASE_URL`: Base URL for the AI service
- `OPENAI_MODEL_NAME`: Name of the AI model to use
- `NTFY_TOPIC_URL`: URL for ntfy.sh notifications
- `WX_BOT_URL`: WeChat Work bot URL
- `BARK_URL`: Bark notification URL
- `RUN_HEADLESS`: Whether to run the browser in headless mode
- `WEB_USERNAME`/`WEB_PASSWORD`: Web interface credentials
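An illustrative `.env` snippet built from the variables above. Every value is a placeholder (including the model name and the base URL); consult `.env.example` in the repository for the authoritative keys and defaults.

```bash
# Illustrative values only; replace each placeholder with your own settings.
OPENAI_API_KEY=sk-your-key-here
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL_NAME=gpt-4o
NTFY_TOPIC_URL=https://ntfy.sh/your-topic
WX_BOT_URL=
BARK_URL=
RUN_HEADLESS=true
WEB_USERNAME=admin
WEB_PASSWORD=change-me
```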
### 4.2 config.json for Task Definitions

The `config.json` file defines all monitoring tasks:

- Task name and enabled status
- Search keywords and filters
- Page limits and personal item preferences
- Price range filters
- Cron scheduling expressions
- AI prompt file references

Example task configuration:

```json
{
  "task_name": "MacBook Air M1",
  "enabled": true,
  "keyword": "macbook air m1",
  "max_pages": 5,
  "personal_only": true,
  "min_price": "3000",
  "max_price": "5000",
  "cron": "3 12 * * *",
  "ai_prompt_base_file": "prompts/base_prompt.txt",
  "ai_prompt_criteria_file": "prompts/macbook_criteria.txt",
  "is_running": false
}
```
### 4.3 xianyu_state.json for Login Session State

The `xianyu_state.json` file stores browser session state for authenticated scraping:

- Cookie information for logged-in sessions
- Browser storage state
- Authentication tokens

This file is generated by the login process and is essential for accessing personalized content on Xianyu.
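Such a state file is typically produced with Playwright by logging in manually in a visible browser and then saving the storage state. This is a generic sketch, not the project's actual login script; the URL and the interactive prompt are assumptions.

```python
# Sketch: capture an authenticated browser session into xianyu_state.json.
import asyncio

from playwright.async_api import async_playwright


async def capture_login_state() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://www.goofish.com")
        input("Log in in the browser window, then press Enter here...")
        # Persist cookies and localStorage for later headless runs.
        await context.storage_state(path="xianyu_state.json")
        await browser.close()


if __name__ == "__main__":
    asyncio.run(capture_login_state())
```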
### 4.4 Prompt Files in prompts/ Directory

The `prompts/` directory contains AI prompt templates:

- `base_prompt.txt`: Base prompt template with output format
- `*_criteria.txt`: Product-specific analysis criteria

The prompt system uses a two-part approach (sketched below):

1. **Base Prompt**: Defines the structure and output format for AI responses
2. **Criteria Files**: Product-specific analysis criteria that are inserted into the base prompt
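A minimal sketch of the two-part idea: load the base template and the per-product criteria file, then combine them into one prompt. How the real code splices the criteria into the template is an assumption; a simple append is shown here, and `build_prompt` is an illustrative name.

```python
# Sketch: combine the base prompt template with product-specific criteria.
from pathlib import Path


def build_prompt(base_file: str, criteria_file: str) -> str:
    base = Path(base_file).read_text(encoding="utf-8")
    criteria = Path(criteria_file).read_text(encoding="utf-8")
    return f"{base}\n\n[Analysis criteria]\n{criteria}"


if __name__ == "__main__":
    prompt = build_prompt("prompts/base_prompt.txt", "prompts/macbook_criteria.txt")
    print(prompt[:200])
```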
## 5. Deployment and Usage

### 5.1 Environment Setup

1. Clone the repository:

```bash
git clone https://github.com/dingyufei615/ai-goofish-monitor
cd ai-goofish-monitor
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Configure environment variables by copying `.env.example` to `.env` and editing the values:

```bash
cp .env.example .env
```

### 5.2 Running the Application

1. Start the web management interface:

```bash
python web_server.py
```

2. Access the web interface at `http://127.0.0.1:8000`

3. Configure tasks through the web interface or by editing `config.json` directly

### 5.3 Docker Deployment

1. Build and start services:

```bash
docker-compose up --build -d
```

2. View logs:

```bash
docker-compose logs -f
```

3. Stop services:

```bash
docker-compose down
```

### 5.4 Development Utilities

Generate new AI criteria files from natural language descriptions:

```bash
python prompt_generator.py
```
## 6. Data Flow

1. Tasks defined in `config.json` are loaded by `spider_v2.py`
2. Playwright performs authenticated searches on Goofish
3. New listings are detected and their detailed information is scraped
4. Product images are downloaded
5. Complete product/seller data is sent to the AI for analysis
6. The AI response determines whether an item meets the criteria
7. Recommended items trigger notifications
8. All results are saved to JSONL files
9. The web UI provides a management interface for all components
## 7. Security and Authentication

The web interface uses HTTP Basic Auth for protection:

- Configurable username and password via environment variables
- All API endpoints and static resources require authentication
- The health check endpoint (`/health`) is publicly accessible
- Default credentials are `admin`/`admin123` (should be changed in production)
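The Basic Auth pattern described above looks roughly like the following FastAPI sketch. Credentials are read from the env vars in section 4.1; the protected route name and the `require_auth` dependency are illustrative, not the application's actual endpoints.

```python
# Sketch of HTTP Basic Auth protection in FastAPI with a public /health endpoint.
import os
import secrets

from fastapi import Depends, FastAPI, HTTPException, status
from fastapi.security import HTTPBasic, HTTPBasicCredentials

app = FastAPI()
security = HTTPBasic()


def require_auth(credentials: HTTPBasicCredentials = Depends(security)) -> str:
    ok_user = secrets.compare_digest(credentials.username, os.getenv("WEB_USERNAME", "admin"))
    ok_pass = secrets.compare_digest(credentials.password, os.getenv("WEB_PASSWORD", "admin123"))
    if not (ok_user and ok_pass):
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid credentials",
            headers={"WWW-Authenticate": "Basic"},
        )
    return credentials.username


@app.get("/health")
def health() -> dict:
    # Public endpoint, no authentication required.
    return {"status": "ok"}


@app.get("/api/tasks")
def list_tasks(user: str = Depends(require_auth)) -> list:
    # Protected endpoint; the real route names may differ.
    return []
```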
## 8. Error Handling and Recovery

The system includes robust error handling:

- Retry mechanisms for network failures
- Graceful degradation when AI services are unavailable
- Automatic cleanup of temporary files
- Detailed logging for troubleshooting
- Recovery from browser automation errors

## 9. Performance Considerations

To maintain good performance and avoid detection:

- Randomized delays between actions
- Concurrent task processing with resource limits
- Efficient image handling with temporary file cleanup
- Asynchronous processing to maximize throughput
- Proper session management to reduce login requirements

This documentation provides a comprehensive overview of the Xianyu Intelligent Monitoring Robot, covering its architecture, components, features, and usage instructions. The system is designed to be flexible, extensible, and user-friendly while providing powerful monitoring and analysis capabilities for second-hand marketplace listings.