feat: Initial commit of Clutch-IQ project

2026-02-05 23:26:03 +08:00
commit a355239861
66 changed files with 12922 additions and 0 deletions
--- a/docs/API_INTERFACE_GUIDE.md
+++ b/docs/API_INTERFACE_GUIDE.md
@@ -0,0 +1,76 @@
+# Clutch-IQ Inference API Interface Guide
+
+## Overview
+The Inference Service (`src/inference/app.py`) supports **two types of payloads** to accommodate different use cases: Real-time Game Integration and Strategy Simulation (Dashboard).
+
+## 1. Raw Game State Payload (Game Integration)
+Used when data arrives directly from the CS2 Game State Integration (GSI) or the parser. The server performs the feature engineering.
+
+**Use Case:** Real-time match prediction.
+
+**Payload Structure:**
+```json
+{
+  "game_time": 60.0,
+  "is_bomb_planted": 0,
+  "site": 0,
+  "players": [
+    {
+      "team_num": 2, // 2=T, 3=CT
+      "is_alive": true,
+      "health": 100,
+      "X": -1200, "Y": 500, "Z": 128,
+      "active_weapon_name": "ak47",
+      "balance": 4500,
+      "equip_value": 2700
+    },
+    ...
+  ]
+}
+```
+
+**Processing Logic:**
+- `process_payload` extracts the `players` list.
+- It calculates `t_alive`, `health_diff`, `t_spread`, `pincer_index`, etc.
+- It returns the feature vector.
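+
+The aggregation described above could look roughly like the sketch below. It assumes the payload fields shown in the example; the function name `aggregate_players` and the spread definition (mean 2D distance to the team centroid) are illustrative assumptions, not taken from `src/inference/app.py`. `pincer_index` is left out because its definition is not given in this guide.
+
+```python
+import math
+
+def aggregate_players(players):
+    """Aggregate per-team alive count, health, and spread from a raw players list."""
+    feats = {}
+    for side, team_num in (("t", 2), ("ct", 3)):  # 2=T, 3=CT, as in the payload
+        alive = [p for p in players if p["team_num"] == team_num and p["is_alive"]]
+        feats[f"{side}_alive"] = len(alive)
+        feats[f"{side}_health"] = sum(p["health"] for p in alive)
+        if alive:
+            cx = sum(p["X"] for p in alive) / len(alive)
+            cy = sum(p["Y"] for p in alive) / len(alive)
+            # Assumed definition: mean 2D distance to the team centroid.
+            feats[f"{side}_spread"] = sum(
+                math.hypot(p["X"] - cx, p["Y"] - cy) for p in alive
+            ) / len(alive)
+        else:
+            feats[f"{side}_spread"] = 0.0
+    # health_diff follows the `ct - t` convention used for derived fields.
+    feats["health_diff"] = feats["ct_health"] - feats["t_health"]
+    return feats
+
+players = [
+    {"team_num": 2, "is_alive": True, "health": 100, "X": 0, "Y": 0},
+    {"team_num": 2, "is_alive": True, "health": 80, "X": 6, "Y": 8},
+    {"team_num": 3, "is_alive": True, "health": 90, "X": 100, "Y": 0},
+    {"team_num": 3, "is_alive": False, "health": 0, "X": 0, "Y": 0},
+]
+features = aggregate_players(players)
+```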
+
+---
+
+## 2. Pre-calculated Feature Payload (Dashboard/Simulation)
+Used when the client (e.g., Streamlit Dashboard) manually sets the tactical situation. The server skips feature engineering and uses provided values.
+
+**Use Case:** "What-if" analysis, Strategy Dashboard.
+
+**Payload Structure:**
+```json
+{
+  "t_alive": 2,
+  "ct_alive": 3,
+  "t_health": 180,
+  "ct_health": 290,
+  "t_equip_value": 8500,
+  "ct_equip_value": 14000,
+  "t_total_cash": 1200,
+  "ct_total_cash": 3500,
+  "team_distance": 1500.5,
+  "t_spread": 400.2,
+  "ct_spread": 800.1,
+  "t_area": 40000.0,
+  "ct_area": 64000.0,
+  "t_pincer_index": 0.45,
+  "ct_pincer_index": 0.22,
+  "is_bomb_planted": 0,
+  "site": 0,
+  "game_time": 60.0
+}
+```
+
+**Processing Logic:**
+- `process_payload` detects the presence of `t_alive` / `ct_alive`.
+- It uses the provided values directly.
+- It auto-calculates derived fields such as `health_diff` (`ct - t`) if they are missing.
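+
+As a rough illustration of the detection step, the sketch below classifies a request body and fills in `health_diff`. The helper names `detect_payload_kind` and `fill_derived` are hypothetical, and requiring both `t_alive` and `ct_alive` is an assumption; the real `process_payload` may differ.
+
+```python
+def detect_payload_kind(payload):
+    """Return 'features' or 'raw'; raise ValueError for an unrecognized body."""
+    if not isinstance(payload, dict):
+        raise ValueError("payload must be a JSON object")
+    # Assumption: both alive counts must be present to count as pre-calculated.
+    if "t_alive" in payload and "ct_alive" in payload:
+        return "features"
+    if isinstance(payload.get("players"), list):
+        return "raw"
+    raise ValueError("payload matches neither the raw nor the feature format")
+
+def fill_derived(features):
+    """Return a copy with health_diff (ct - t) added when it is missing."""
+    out = dict(features)
+    if "health_diff" not in out and {"ct_health", "t_health"} <= out.keys():
+        out["health_diff"] = out["ct_health"] - out["t_health"]
+    return out
+
+dashboard = {"t_alive": 2, "ct_alive": 3, "t_health": 180, "ct_health": 290}
+kind = detect_payload_kind(dashboard)
+filled = fill_derived(dashboard)
+```
+
+Rejecting unknown shapes early, as this sketch does, would turn the `NoneType` error described under Error Handling into an explicit validation message.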
+
+## Error Handling
+If you receive `Error: {"error":"Not supported type for data.<class 'NoneType'>"}`:
+- **Cause:** You sent a payload that matches neither format (e.g., missing `players` list AND missing direct features).
+- **Fix:** Ensure your JSON body matches one of the structures above.
--- a/docs/Clutch_Prediction_Implementation_Plan.md
+++ b/docs/Clutch_Prediction_Implementation_Plan.md
@@ -0,0 +1,109 @@
+# Project Clutch-IQ: Implementation Plan for a Real-Time CS2 Win-Probability Prediction System
+
+> **Version**: 3.0 (Final Architecture)
+> **Date**: 2026-01-31
+> **Status**: Ready for Implementation
+
+---
+
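+## 1. Vision
+Build a **professional-grade, physics-aware, tactics-driven** engine that predicts CS2 clutch win probability in real time.
+Beyond the win-probability number itself (e.g., "CT Win 30%"), the system explains the tactical cause (e.g., "no defuse kit and not enough time"), in support of post-match review, broadcast enhancement, and tactical analysis.
+
+---
+
+## 2. Architecture
+
+### 2.1 Three-Layer Pipeline
+1.  **Phase 1: Snapshot Engine** - *ETL layer*
+    - Parses high-frequency, high-precision "tactical slices" from demos.
+2.  **Phase 2: Feature Factory** - *logic layer*
+    - Converts raw data into physical features (path distance) and tactical features (crossfire).
+3.  **Phase 3: Inference Service** - *application layer*
+    - Serves millisecond-level real-time predictions based on XGBoost/LightGBM.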
+
+---
+
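+## 3. Implementation Roadmap
+
+### Phase 1: High-Precision Data Snapshots (The Snapshot Engine)
+
+#### 1.1 Smart Triggers
+To filter out redundant data, the system captures a snapshot only at the following moments:
+*   **Key events**: `Player_Death`, `Bomb_Plant`, `Bomb_Defuse_Start`, `Bomb_Defuse_End`
+*   **Abrupt state change**: any player loses more than 20 HP (captures the outcome of a duel)
+*   **Time heartbeat**: during the clutch phase (≤3v3), a sample is forced every 5 seconds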
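+
+The three trigger rules above can be sketched as a single predicate. The function name and its arguments are hypothetical, and reading "≤3v3" as "at most three players alive on each side" is an assumption.
+
+```python
+KEY_EVENTS = {"Player_Death", "Bomb_Plant", "Bomb_Defuse_Start", "Bomb_Defuse_End"}
+HP_LOSS_THRESHOLD = 20    # HP lost by any single player
+HEARTBEAT_SECONDS = 5.0   # forced sampling interval during a clutch
+
+def should_capture(event, max_hp_loss, t_alive, ct_alive, seconds_since_last):
+    """Decide whether the current tick deserves a snapshot."""
+    if event in KEY_EVENTS:
+        return True
+    if max_hp_loss > HP_LOSS_THRESHOLD:
+        return True
+    # Assumed reading of "<=3v3": both teams have at most three players alive.
+    in_clutch = t_alive <= 3 and ct_alive <= 3
+    return in_clutch and seconds_since_last >= HEARTBEAT_SECONDS
+
+on_death = should_capture("Player_Death", 0, 5, 5, 0.0)
+on_big_hit = should_capture(None, 35, 5, 5, 0.0)
+on_heartbeat = should_capture(None, 0, 2, 3, 5.0)
+quiet_tick = should_capture(None, 10, 5, 4, 9.0)
+```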
+
+#### 1.2 Snapshot Schema
+Each snapshot contains the following categories of core data:
+
+| Category | Field | Description | Source |
+| :--- | :--- | :--- | :--- |
+| **Metadata** | `match_id`, `round`, `tick` | Unique index | Demo |
+| **Situation** | `bomb_state`, `bomb_timer` | C4 state (0: not planted, 1: planted, 2: defused) | Demo |
+| **Situation** | `seconds_remaining` | Round/C4 countdown | Demo |
+| **Players** | `ct_alive`, `t_alive` | Number of players alive | Demo |
+| **Players** | `ct_hp_sum`, `t_hp_sum` | Total team HP | Demo |
+| **Equipment** | `ct_has_kit`, `t_has_c4` | **Key items** (defuse kit / C4) | Demo |
+| **Spatial** | `ct_positions`, `t_positions` | Raw coordinates (used in later computation) | Demo |
+
+---
+
+### Phase 2: Feature Engineering and Fusion
+
+#### 2.1 Physics-Aware Features
+*   **F1: Path Distance (NavMesh Distance)**
+    *   *Innovation*: drop Euclidean distance and compute the real travel distance over the map's navigation graph.
+    *   *Implementation*: precompute `Map_Zone_Distance_Matrix` and query it at runtime.
+*   **F2: Time Pressure Index (TPI)**
+    *   *Formula*: $TPI = \frac{\text{TravelTime} + \text{DefuseTime}}{\text{TimeRemaining}}$
+    *   *Rule*: $TPI > 1.0 \rightarrow$ the win probability is forced to zero.
+*   **F3: Line of Sight and Cover**
+    *   *Features*: `is_blind` (blinded), `is_in_smoke` (inside smoke).
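+
+The TPI formula and rule above can be worked through with a small sketch. The function names are hypothetical; the defuse times are the standard CS values (10 s without a kit, 5 s with one), and applying the rule to the CT win probability is an assumption based on the presence of `DefuseTime` in the formula.
+
+```python
+DEFUSE_TIME_NO_KIT = 10.0    # seconds, standard defuse time without a kit
+DEFUSE_TIME_WITH_KIT = 5.0   # seconds, standard defuse time with a kit
+
+def time_pressure_index(travel_time, time_remaining, has_kit):
+    """TPI = (TravelTime + DefuseTime) / TimeRemaining."""
+    defuse_time = DEFUSE_TIME_WITH_KIT if has_kit else DEFUSE_TIME_NO_KIT
+    if time_remaining <= 0:
+        return float("inf")  # no time left: maximal pressure
+    return (travel_time + defuse_time) / time_remaining
+
+def apply_tpi_rule(ct_win_prob, tpi):
+    """Force the (assumed CT) win probability to zero when TPI > 1.0."""
+    return 0.0 if tpi > 1.0 else ct_win_prob
+
+# 6 s of running with 12 s left on the bomb timer.
+tpi_no_kit = time_pressure_index(6.0, 12.0, has_kit=False)   # 16 / 12
+tpi_with_kit = time_pressure_index(6.0, 12.0, has_kit=True)  # 11 / 12
+prob_no_kit = apply_tpi_rule(0.35, tpi_no_kit)
+prob_with_kit = apply_tpi_rule(0.35, tpi_with_kit)
+```
+
+In this example the kit alone moves the situation from impossible (TPI above 1) to feasible (TPI below 1), which is the kind of "key factor" explanation the plan aims for.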
+
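+#### 2.2 Tactical Features
+*   **F4: Crossfire Coefficient**
+    *   *Logic*: compute the angle that multiple CTs form at the target T. The win-probability bonus is largest when the angle is close to 90°.
+*   **F5: Economy Momentum**
+    *   *Formula*: $\Delta E = \text{CT\_Equip\_Value} - \text{T\_Equip\_Value}$
+    *   *Purpose*: quantifies the equipment advantage in "rifles versus pistols" situations.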
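+
+A possible way to compute the crossfire angle for one pair of CTs is sketched below. The function names are hypothetical, positions are treated as 2D `(x, y)` tuples, and scoring the angle with `sin` (which peaks at 90°) is an assumption; the plan only states that the bonus is largest near 90°.
+
+```python
+import math
+
+def crossfire_angle_deg(target, ct_a, ct_b):
+    """Angle at `target` between the directions to two CT positions, in degrees."""
+    ax, ay = ct_a[0] - target[0], ct_a[1] - target[1]
+    bx, by = ct_b[0] - target[0], ct_b[1] - target[1]
+    norm = math.hypot(ax, ay) * math.hypot(bx, by)
+    if norm == 0:
+        return 0.0  # a CT stands exactly on the target; no meaningful angle
+    # Clamp to [-1, 1] to guard against floating-point drift before acos.
+    cos_theta = max(-1.0, min(1.0, (ax * bx + ay * by) / norm))
+    return math.degrees(math.acos(cos_theta))
+
+def crossfire_coefficient(angle_deg):
+    """Assumed scoring: sin(angle), equal to 1.0 at 90 degrees and 0.0 at 0 or 180."""
+    return math.sin(math.radians(angle_deg))
+
+right_angle = crossfire_angle_deg((0, 0), (10, 0), (0, 10))
+same_line = crossfire_angle_deg((0, 0), (10, 0), (20, 0))
+best = crossfire_coefficient(right_angle)
+worst = crossfire_coefficient(same_line)
+```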
+
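+#### 2.3 Player Profiling
+Use the L3 database to improve the model's understanding of the individual players:
+*   **F6: Star factor**: `max_alive_rating` (the Rating of the strongest player still alive).
+*   **F7: Clutch specialist**: `avg_clutch_win_rate` (the historical clutch win rate of the players still alive).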
+
+---
+
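+### Phase 3: Model Training and Strategy
+
+#### 3.1 Training Configuration
+*   **Algorithm**: **XGBoost** (classifier)
+*   **Objective**: `LogLoss` (optimizes probability accuracy)
+*   **Evaluation metrics**: `AUC` (ranking ability), `Brier Score` (calibration)
+
+#### 3.2 Sample Cleaning
+*   **Filter Save Rounds**:
+    *   If, when the clutch ends: `Damage_Dealt == 0` AND `Dist_To_Enemy > 50m` AND `Weapon_Value > 2000`
+    *   the round is classified as a "voluntary forfeit" and the sample is dropped, so that it does not contaminate the win-probability model.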
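+
+The save-round rule above amounts to a three-part conjunction, sketched here on plain dictionaries. The helper name `is_save_round` is hypothetical, and the sketch assumes `Dist_To_Enemy` has already been converted to metres.
+
+```python
+def is_save_round(sample, dist_threshold_m=50.0, weapon_value_threshold=2000):
+    """True when the clutch player dealt no damage, stayed far away, and kept a valuable weapon."""
+    return (
+        sample["Damage_Dealt"] == 0
+        and sample["Dist_To_Enemy"] > dist_threshold_m
+        and sample["Weapon_Value"] > weapon_value_threshold
+    )
+
+samples = [
+    {"id": "a", "Damage_Dealt": 0, "Dist_To_Enemy": 80.0, "Weapon_Value": 2700},
+    {"id": "b", "Damage_Dealt": 45, "Dist_To_Enemy": 80.0, "Weapon_Value": 2700},
+    {"id": "c", "Damage_Dealt": 0, "Dist_To_Enemy": 80.0, "Weapon_Value": 500},
+]
+# Keep only the samples where the player actually contested the round.
+kept_ids = [s["id"] for s in samples if not is_save_round(s)]
+```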
+
+---
+
+## 4. Deliverables
+
+1.  **`extract_snapshots.py`**
+    *   A Python script based on `demoparser2` that batch-processes demos into a CSV training set.
+2.  **`map_nav_graph.json`**
+    *   Zone-to-zone distance lookup tables for the core maps (Mirage, Inferno, etc.).
+3.  **`Clutch_Predictor_Model.pkl`**
+    *   The trained XGBoost model file.
+4.  **`Win_Prob_Service.py`**
+    *   A simple Flask endpoint: takes the current state as JSON $\rightarrow$ returns `{ "ct_win_prob": 0.35, "key_factor": "time_pressure" }`.
+
+---
+
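+## 5. Action Items
+
+1.  **[High Priority]** Develop a prototype of `extract_snapshots.py` and get the basic data flow running end to end.
+2.  **[Medium Priority]** Build a simple grid-based distance table for Mirage.
+3.  **[Medium Priority]** Integrate the L3 database and generate the player-ability feature table.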
--- a/docs/DATABASE_LOGICAL_STRUCTURE.md
+++ b/docs/DATABASE_LOGICAL_STRUCTURE.md
@@ -0,0 +1,109 @@
+# Database Logical Structure (ER Diagram)
+
+This diagram illustrates the logical relationships and data flow between the storage layers (L1, L2, L3) in the optimized architecture.
+
+```mermaid
+erDiagram
+    %% ==========================================
+    %% L1 LAYER: RAW DATA (Data Lake)
+    %% ==========================================
+    
+    L1A_raw_iframe_network {
+        string match_id PK
+        json content "Raw API Response"
+        timestamp processed_at
+    }
+
+    L1B_tick_snapshots_parquet {
+        string match_id FK
+        int tick
+        int round
+        json player_states "Positions, HP, Equip"
+        json bomb_state
+        string file_path "Parquet File Location"
+    }
+
+    %% ==========================================
+    %% L2 LAYER: DATA WAREHOUSE (Structured)
+    %% ==========================================
+
+    dim_players {
+        string steam_id_64 PK
+        string username
+        float rating
+        float avg_clutch_win_rate
+    }
+
+    dim_maps {
+        int map_id PK
+        string map_name "de_mirage"
+        string nav_mesh_path
+    }
+
+    fact_matches {
+        string match_id PK
+        int map_id FK
+        timestamp start_time
+        int winner_team
+        int final_score_ct
+        int final_score_t
+    }
+
+    fact_rounds {
+        string round_id PK
+        string match_id FK
+        int round_num
+        int winner_side
+        string win_reason "Elimination/Bomb/Time"
+    }
+
+    L2_Spatial_NavMesh {
+        string map_name PK
+        string zone_id
+        binary distance_matrix "Pre-calculated paths"
+    }
+
+    %% ==========================================
+    %% L3 LAYER: FEATURE STORE (AI Ready)
+    %% ==========================================
+
+    L3_Offline_Features {
+        string snapshot_id PK
+        float feature_tpi "Time Pressure Index"
+        float feature_crossfire "Tactical Score"
+        float feature_equipment_diff
+        int label_is_win "Target Variable"
+    }
+
+    %% ==========================================
+    %% RELATIONSHIPS
+    %% ==========================================
+
+    %% L1 -> L2 Flow
+    L1A_raw_iframe_network ||--|{ fact_matches : "Extracts to"
+    L1A_raw_iframe_network ||--|{ dim_players : "Extracts to"
+    L1B_tick_snapshots_parquet }|--|| fact_matches : "Belongs to"
+    L1B_tick_snapshots_parquet }|--|| fact_rounds : "Details"
+
+    %% L2 Relations
+    fact_matches }|--|| dim_maps : "Played on"
+    fact_rounds }|--|| fact_matches : "Part of"
+    
+    %% L2 -> L3 Flow (Feature Engineering)
+    L3_Offline_Features }|--|| L1B_tick_snapshots_parquet : "Computed from"
+    L3_Offline_Features }|--|| L2_Spatial_NavMesh : "Uses Physics from"
+    L3_Offline_Features }|--|| dim_players : "Enriched with"
+```
+
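+## Structure Explanation
+
+1.  **L1 source data layer**:
+    *   **Upper left (L1A)**: a conventional database table that stores match-result metadata.
+    *   **Lower left (L1B)**: **the file system, drawn as a dashed box**. Physically it is a set of Parquet files, but logically it is one huge "tick-level snapshot table" linked to the other layers through `match_id`.
+
+2.  **L2 warehouse layer**:
+    *   **Core (Dim/Fact)**: a standard star schema. `fact_matches` is the central fact table and relates to `dim_players` (who) and `dim_maps` (where).
+    *   **Spatial**: an independent lookup-table layer that provides physical distance computation for each map in `dim_maps`.
+
+3.  **L3 feature layer**:
+    *   **Right side (Features)**: a wide table in which each row corresponds directly to one training sample for the model. It stores no raw data, only **computed values** (such as the TPI), obtained by fusing L1B (positions) + L2 Spatial (distances) + Dim Players (ability).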
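+
+To make the L2 relationships concrete, the sketch below builds a reduced version of `dim_maps`, `fact_matches`, and `fact_rounds` in an in-memory SQLite database and joins them. Only a subset of the columns from the diagram is included, and the SQL column types and sample rows are illustrative assumptions.
+
+```python
+import sqlite3
+
+conn = sqlite3.connect(":memory:")
+conn.executescript("""
+CREATE TABLE dim_maps (
+    map_id INTEGER PRIMARY KEY,
+    map_name TEXT,
+    nav_mesh_path TEXT
+);
+CREATE TABLE fact_matches (
+    match_id TEXT PRIMARY KEY,
+    map_id INTEGER REFERENCES dim_maps(map_id),
+    winner_team INTEGER
+);
+CREATE TABLE fact_rounds (
+    round_id TEXT PRIMARY KEY,
+    match_id TEXT REFERENCES fact_matches(match_id),
+    round_num INTEGER,
+    winner_side INTEGER,
+    win_reason TEXT
+);
+""")
+conn.execute("INSERT INTO dim_maps VALUES (1, 'de_mirage', NULL)")
+conn.execute("INSERT INTO fact_matches VALUES ('m1', 1, 3)")
+conn.executemany(
+    "INSERT INTO fact_rounds VALUES (?, ?, ?, ?, ?)",
+    [("m1_r1", "m1", 1, 3, "Elimination"), ("m1_r2", "m1", 2, 2, "Bomb")],
+)
+
+# Rounds "part of" a match, which was "played on" a map.
+rows = conn.execute("""
+SELECT r.round_num, m.map_name, r.win_reason
+FROM fact_rounds AS r
+JOIN fact_matches AS f ON f.match_id = r.match_id
+JOIN dim_maps AS m ON m.map_id = f.map_id
+ORDER BY r.round_num
+""").fetchall()
+conn.close()
+```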
--- a/docs/OPTIMIZED_ARCHITECTURE.md
+++ b/docs/OPTIMIZED_ARCHITECTURE.md
@@ -0,0 +1,130 @@
+# Clutch-IQ & Data Warehouse Optimized Architecture (v4.0)
+
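+## 0. Repository Directory Mapping
+
+- L1A (raw web-scraped data, SQLite): `database/L1/L1.db`
+- L1B (demo snapshots, Parquet): `data/processed/*.parquet`
+- L2 (structured data warehouse, SQLite): `database/L2/L2.db`
+- L3 (feature store, SQLite): `database/L3/L3.db`
+- Offline ETL: `src/etl/` (Demo → Parquet)
+- Training: `src/training/train.py`
+- Online inference: `src/inference/app.py`
+
+## 1. Core Design Principle: Hybrid Batch/Stream Architecture
+
+To serve both **large-scale historical data analysis** (L2/L3) and **millisecond-level real-time win-probability prediction** (Clutch-IQ), the architecture is reworked along the lines of a modern data platform.
+
+Key changes:
+1.  **Tiered storage**: high-frequency snapshots (Tick/Frame) use **Parquet**; aggregated business and feature data use **SQLite**.
+2.  **Feature decoupling**: a **Feature Store** manages, in one place, the features used by both offline training and online inference.
+3.  **Closed-loop feedback (optional)**: predictions can be written back to L2/L3 for later analysis and iteration.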
+
+---
+
+## 2. Optimized Layered Architecture Diagram
+
+```mermaid
+graph TD
+    %% Data Sources
+    Web[5eplay Web Data] --> L1A
+    Demo[CS2 .dem Files] --> L1B
+    GSI[Real-time GSI Stream] --> Inference
+
+    %% L1 Layer: Data Lake (Raw)
+    subgraph "L1: Data Lake (Raw Ingestion)"
+        L1A[L1A: Metadata Store] -- SQLite --> L1A_DB[(database/L1/L1.db)]
+        L1B[L1B: Telemetry Engine] -- Parquet --> L1B_Files[(data/processed/*.parquet)]
+    end
+
+    %% L2 Layer: Data Warehouse (Clean)
+    subgraph "L2: Data Warehouse (Structured)"
+        L1A_DB --> L2_ETL
+        L1B_Files --> L2_ETL[L2 Processors]
+        L2_ETL --> L2_SQL[(database/L2/L2.db)]
+        L2_ETL --> L2_Spatial[(L2_Spatial: NavMesh/Grids)]
+    end
+
+    %% L3 Layer: Feature Store (Analytics & AI)
+    subgraph "L3: Feature Store (Machine Learning)"
+        L2_SQL --> L3_Offline
+        L2_Spatial --> L3_Offline
+        L3_Offline[Offline Feature Build] --> L3_DB[(database/L3/L3.db)]
+        L3_Offline -- XGBoost --> Model[Clutch Predictor Model]
+        
+        L3_DB --> Inference
+    end
+
+    %% Application Layer
+    subgraph "App: Clutch-IQ Service"
+        Inference[Inference Engine]
+        Model --> Inference
+        Inference --> API[Win Prob API]
+    end
+    
+    API -.->|"Feedback Loop (Log Predictions)"| L2_SQL
+```
+
+---
+
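+## 3. Layer Definitions and Optimizations
+
+### **L1: Data Lake**
+*   **L1A (Web Metadata)**: unchanged.
+    *   *Storage*: SQLite
+    *   *Content*: match metadata and scores.
+*   **L1B (Demo Telemetry) [optimization focus]**:
+    *   *Change*: **Tick/Frame snapshots are no longer written directly into SQLite**. Demo snapshot data is large (64/128 ticks/s), so SQLite tends to bloat and becomes slow to read and write.
+    *   *Optimization*: snapshots are stored as **Parquet** (columnar storage), which suits batch training and analysis.
+    *   *Benefits*: high compression, high throughput, and a good fit with the Pandas/XGBoost training workflow.
+
+### **L2: Data Warehouse**
+*   **L2 Core (Business)**: unchanged.
+    *   *Storage*: SQLite
+    *   *Content*: cleaned data for the player dimension (Dim_Player) and the match dimension (Fact_Match).
+*   **L2 Spatial (Physics) [new]**:
+    *   *Content*: **map navigation meshes (Nav Mesh)**, distance matrices, and map zone partitioning.
+    *   *Purpose*: provides the physical basis for L3 computations (e.g., the real travel time from point A to point B rather than the straight-line distance).
+
+### **L3: Feature Store**
+*   **Definition**: no longer just a DB, but a **feature registry**.
+*   **Offline Store**:
+    *   Aggregates player/team features from L2 and stores them in L3 (for reuse and fast queries).
+    *   Training labels still come from match or round results (e.g., `round_winner`).
+*   **Online Store**:
+    *   Fast lookup data used during online inference (e.g., player ability, precomputed map data).
+    *   *Example*: a map distance matrix (precomputed point-to-point distances) that is looked up at inference time to reduce latency.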
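+
+The online lookup described in the example could be as simple as a dictionary keyed by zone pairs, as in this sketch. The zone names and distances are made-up placeholders, the function names are hypothetical, and the 250 units/s movement speed is an assumed parameter.
+
+```python
+# Placeholder zone-to-zone path distances in game units (not real map data).
+ZONE_DISTANCE = {
+    ("CT_Spawn", "A_Site"): 900.0,
+    ("CT_Spawn", "B_Site"): 1100.0,
+    ("A_Site", "B_Site"): 1700.0,
+}
+RUN_SPEED = 250.0  # assumed movement speed in units per second
+
+def path_distance(zone_a, zone_b):
+    """Look up the precomputed path distance; the table is treated as symmetric."""
+    if zone_a == zone_b:
+        return 0.0
+    if (zone_a, zone_b) in ZONE_DISTANCE:
+        return ZONE_DISTANCE[(zone_a, zone_b)]
+    # Raises KeyError when the pair is unknown in either direction.
+    return ZONE_DISTANCE[(zone_b, zone_a)]
+
+def travel_time(zone_a, zone_b, speed=RUN_SPEED):
+    """Convert the looked-up distance into seconds of running."""
+    return path_distance(zone_a, zone_b) / speed
+
+distance = path_distance("B_Site", "CT_Spawn")
+seconds = travel_time("CT_Spawn", "B_Site")
+```
+
+Because the expensive path search happens offline, the inference path only pays for a dictionary lookup and a division.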
+
+---
+
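+## 4. Comprehensive Evaluation
+
+### ✅ Pros
+
+1.  **Performance**:
+    *   Parquet removes the I/O bottleneck caused by massive tick data.
+    *   Precomputing L2 Spatial data aims to keep real-time prediction latency below 50 ms.
+
+2.  **Scalability**:
+    *   The file-based storage of L1B and L3 supports distributed processing (with a possible future migration to Spark/Dask).
+    *   Adding a map only requires updating L2 Spatial; the model logic is unaffected.
+
+3.  **Real-time Readiness (balancing immediacy and accuracy)**:
+    *   The architecture clearly separates "offline training" (accuracy first, slow processing) from "online inference" (speed first, mostly table lookups).
+
+4.  **Modularity**:
+    *   L1/L2/L3 have clear responsibility boundaries, so the risk of data contamination is low. Clutch-IQ is just one "consumer" of L3 and leaves the existing warehouse structure intact.
+
+### ⚠️ Cons (Potential Challenges)
+
+1.  **Tech-stack complexity**:
+    *   Parquet requires the Python `pyarrow` or `fastparquet` library.
+    *   Two storage paradigms must be maintained: the file system and the database (SQLite).
+
+2.  **Cold-start cost**:
+    *   L2 Spatial needs navigation-mesh data built separately for each map (Mirage, Inferno, Nuke...), which means a large up-front workload.
+
+---
+
+## 5. Conclusion
+
+This optimized architecture moves from a **single-machine analytics setup** toward an **industrial-grade AI production setup**. It supports the current win-probability prediction and also lays the groundwork for future extensions (e.g., anti-cheat behavior analysis, an AI coaching system).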
--- a/docs/README.md
+++ b/docs/README.md
@@ -0,0 +1,11 @@
+# docs/
+
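+Central directory for project documentation.
+
+## Document Index
+
+- OPTIMIZED_ARCHITECTURE.md: overall repository architecture and the L1/L2/L3 layering
+- DATABASE_LOGICAL_STRUCTURE.md: logical structure of the data warehouse (ER diagram and relationships)
+- API_INTERFACE_GUIDE.md: input formats and usage of the online inference endpoint (/predict)
+- Clutch_Prediction_Implementation_Plan.md: implementation roadmap and planned deliverables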
+