Overview
This project systematically collects data from the Kenshi game wiki, covering factions, worldbook entries, and other core game data. It focuses on crawler anonymity and thorough data cleaning.
Crawler Strategy
Anonymous Crawler
Multi-layer anonymity protection:
| Layer | Approach |
|---|---|
| Fingerprint spoofing | curl_cffi simulating Chrome 131 TLS fingerprint |
| IP hiding | Free proxy pool with auto-rotation, optional Tor support |
| Behavior simulation | Randomized 2-5 s delay between requests, with a longer human-like pause on about 30% of them |
| Failure recovery | Automatic dead proxy removal and retry |
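The behavior simulation and failure recovery layers can be illustrated with a small sketch. It is not the project's actual code: the names `next_delay`, `ProxyPool`, and `fetch_with_retry` are hypothetical, the 5-15 s range for the long pause is an assumed value, and "30%" is read here as the probability of adding a long pause. The `fetch` callable stands in for whatever performs the real request (for example a curl_cffi call routed through the chosen proxy).

```python
import random


def next_delay(rng, base=(2.0, 5.0), long_prob=0.3, long_extra=(5.0, 15.0)):
    """Return the number of seconds to wait before the next request.

    base is the 2-5 s range from the table above; long_extra is an
    assumed range for the occasional human-like long pause.
    """
    delay = rng.uniform(*base)
    if rng.random() < long_prob:
        delay += rng.uniform(*long_extra)
    return delay


class ProxyPool:
    """Round-robin proxy pool that drops proxies reported as dead."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._index = 0

    def get(self):
        if not self._proxies:
            raise RuntimeError("proxy pool is empty")
        proxy = self._proxies[self._index % len(self._proxies)]
        self._index += 1
        return proxy

    def mark_dead(self, proxy):
        if proxy in self._proxies:
            self._proxies.remove(proxy)

    def __len__(self):
        return len(self._proxies)


def fetch_with_retry(url, pool, fetch, max_attempts=3):
    """Call fetch(url, proxy) through successive proxies, removing any that fail."""
    last_error = None
    for _ in range(max_attempts):
        proxy = pool.get()
        try:
            return fetch(url, proxy)
        except Exception as exc:  # a real crawler would catch narrower errors
            pool.mark_dead(proxy)
            last_error = exc
    raise RuntimeError(f"all {max_attempts} attempts failed for {url}") from last_error
```

In a crawl loop, `time.sleep(next_delay(random.Random()))` would be called between requests. Passing the random generator in explicitly keeps the delay logic reproducible when seeded.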
Multi-Round Crawling
The crawl runs in progressive rounds (quick crawl → safe crawl → deep crawl → link discovery → supplementary crawl) so that pages missed in one round can be picked up in a later one, aiming for complete data coverage.
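One way to picture the round structure is the sketch below. It is an illustration under assumptions, not the project's implementation: the function `crawl_rounds` and the `fetch` and `extract_links` callables are hypothetical, and what each named round actually does differently is not specified here, so the round name is simply passed to `fetch`, which could use it to choose its settings.

```python
ROUNDS = ["quick", "safe", "deep", "link discovery", "supplementary"]


def crawl_rounds(seed_urls, fetch, extract_links, rounds=ROUNDS):
    """Crawl in rounds, retrying failures and following newly found links.

    fetch(url, round_name) returns page content or raises on failure.
    extract_links(content) returns an iterable of URLs found in the page.
    Returns (pages, still_pending).
    """
    pages = {}
    pending = list(dict.fromkeys(seed_urls))  # de-duplicate, keep order
    seen = set(pending)

    for round_name in rounds:
        if not pending:
            break
        next_pending = []
        for url in pending:
            try:
                content = fetch(url, round_name)
            except Exception:
                next_pending.append(url)  # try again in a later round
                continue
            pages[url] = content
            for link in extract_links(content):
                if link not in seen:
                    seen.add(link)
                    next_pending.append(link)
        pending = next_pending

    return pages, pending
```

Any URLs still listed in the second return value after the last round are the ones that never succeeded, which makes coverage gaps easy to report.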
Data Cleaning
The cleaning pipeline includes:
- Loading and merging results from multiple crawl rounds
- Deduplication (based on page title and content hash)
- Invalid field cleaning and format normalization
- Outputting structured JSON datasets (all_factions.json, kenshi_worldbook.json, etc.)
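The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's pipeline: the record fields `title` and `content` and the helper names are assumptions, and a record is treated as a duplicate only when both its case-insensitive title and its content hash match an earlier record.

```python
import hashlib
import json
from pathlib import Path


def normalize(record):
    """Collapse whitespace in string fields and drop empty fields."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = " ".join(value.split())
        if value in ("", None, [], {}):
            continue
        cleaned[key] = value
    return cleaned


def dedup_key(record):
    """Build the deduplication key from the title and a content hash."""
    title = record.get("title", "").casefold()
    digest = hashlib.sha256(record.get("content", "").encode("utf-8")).hexdigest()
    return title, digest


def merge_rounds(rounds):
    """Merge records from several crawl rounds, keeping the first of each duplicate."""
    seen = set()
    merged = []
    for records in rounds:
        for raw in records:
            record = normalize(raw)
            key = dedup_key(record)
            if key in seen:
                continue
            seen.add(key)
            merged.append(record)
    return merged


def write_json(records, path):
    """Write the cleaned records as a UTF-8 JSON file."""
    text = json.dumps(records, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")
```

A call such as `write_json(merge_rounds([round1, round2]), "all_factions.json")` would then produce one of the output files. Normalizing before hashing means that records differing only in whitespace collapse into one entry.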
Outputs
- all_factions.json: Structured faction data
- kenshi_worldbook.json: Worldbook entries
- kenshi_full_data.json: Complete dataset
- kenshi_clean_database/: Cleaned, categorized output