Project Background
Kenshi is a content-rich game, and its Wiki is a key source of information for players. This project systematically scrapes the Kenshi Wiki, covering factions, worldbook entries, and other core game data, to supply data for later game analysis tools.
Technical Challenges
Anti-Crawler Countermeasures
Wiki platforms typically deploy anti-crawler mechanisms, so the crawler combines several layers of anonymity protection:
| Layer | Solution |
|---|---|
| Fingerprint spoofing | curl_cffi mimicking Chrome 131 TLS fingerprint |
| IP hiding | Free proxy pool auto-rotation, Tor network support |
| Behavior simulation | Random 2-5 second delay, plus a 30% chance of a longer, human-like pause |
| Failure recovery | Auto-detect and remove failed proxies, auto-retry |
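
The behavior-simulation and failure-recovery rows above could be sketched roughly as follows. This is an illustration, not the project's actual code: `human_delay`, `ProxyPool`, the 10-20 second long-pause range, and the failure threshold are all assumptions. The source only specifies the 2-5 second base delay and the 30% long-pause probability.

```python
import random


def human_delay(rng=random, long_pause_prob=0.3):
    """Return a delay in seconds: 2-5 s base, plus an occasional long pause."""
    delay = rng.uniform(2.0, 5.0)
    if rng.random() < long_pause_prob:
        # Assumed range for the long "human" pause; the source does not give one.
        delay += rng.uniform(10.0, 20.0)
    return delay


class ProxyPool:
    """Rotate through proxies and drop ones that keep failing (hypothetical helper)."""

    def __init__(self, proxies, max_failures=2):
        self.failures = {p: 0 for p in proxies}  # proxy URL -> consecutive failures
        self.max_failures = max_failures         # assumed threshold
        self._cursor = 0

    def next_proxy(self):
        """Return the next live proxy in round-robin order, or None if none remain."""
        live = list(self.failures)
        if not live:
            return None
        proxy = live[self._cursor % len(live)]
        self._cursor += 1
        return proxy

    def report_failure(self, proxy):
        """Count a failure and remove the proxy once it reaches the limit."""
        if proxy not in self.failures:
            return
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures:
            del self.failures[proxy]

    def report_success(self, proxy):
        """Reset the failure count after a successful request."""
        if proxy in self.failures:
            self.failures[proxy] = 0
```

A crawler loop would call `time.sleep(human_delay())` between requests, pass `pool.next_proxy()` to the fetch function, and report the outcome back to the pool so that dead proxies stop being selected.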
Crawler Implementation
```python
from curl_cffi import requests

# Mimic Chrome 131 TLS fingerprint
session = requests.Session(impersonate="chrome131")


def crawl_page(url, proxy=None):
    proxies = {"http": proxy, "https": proxy} if proxy else None
    response = session.get(url, proxies=proxies, timeout=30)
    return response.text
```

Multi-Round Crawling Strategy
A progressive, multi-round strategy aims for complete data coverage:
Quick crawl → Safe crawl → Deep crawl → Link discovery → Supplemental crawl
Round Objectives
| Round | Target | Characteristic |
|---|---|---|
| Quick crawl | Main pages | Short delay, fast coverage |
| Safe crawl | Important pages | Long delay, ensure success |
| Deep crawl | Detailed content | Parse sub-pages |
| Link discovery | Newly found links | Expand coverage |
| Supplemental crawl | Retry failures | Fill gaps |
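
Two mechanisms in the table, link discovery and retrying failures, can be illustrated with a small loop. The sketch below does not reproduce the five named rounds or their delays; `run_rounds`, `fetch`, `extract_links`, and `max_rounds` are hypothetical names, and the caller is assumed to supply the fetching and link-parsing functions.

```python
def run_rounds(seed_urls, fetch, extract_links, max_rounds=5):
    """Crawl seeds, then keep following new links and retrying failures.

    fetch(url) returns page text or raises; extract_links(text) returns URLs.
    Returns (pages, leftover), where leftover lists URLs still unfetched.
    """
    pages = {}              # url -> text for every page fetched successfully
    queue = list(seed_urls)
    for _ in range(max_rounds):
        if not queue:
            break
        failed, discovered = [], []
        for url in queue:
            if url in pages:
                continue
            try:
                pages[url] = fetch(url)
            except Exception:
                failed.append(url)  # retried in the next round
                continue
            discovered.extend(extract_links(pages[url]))
        # Next round: retry failures first, then links not fetched yet.
        new_links = [u for u in dict.fromkeys(discovered)
                     if u not in pages and u not in failed]
        queue = failed + new_links
    return pages, queue
```

Each pass over the queue plays the role of one round: pages that failed are queued again (the supplemental crawl), and links found on fetched pages extend the queue (link discovery). Per-round delays could be added by sleeping before each `fetch` call.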
Data Cleaning Process
```python
def clean_data(raw_data):
    # 1. Merge multi-round crawl results
    all_pages = merge_crawl_results(raw_data)
    # 2. Deduplicate (based on page title and content hash)
    unique_pages = deduplicate(all_pages)
    # 3. Clean invalid fields
    cleaned = clean_fields(unique_pages)
    # 4. Normalize structure
    return normalize_structure(cleaned)
```

Cleaning Steps
- Load and merge: Load the results of each crawl round and merge them into a single set
- Deduplication: Remove duplicates based on page title and content hash
- Field cleaning: Remove invalid fields, unify format
- Structured output: Generate JSON datasets
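
The deduplication step could be implemented along these lines. This is one possible version of a `deduplicate` helper, not necessarily the project's: it assumes each page is a dict with `title` and `content` keys, and the choice of SHA-256 and the whitespace and case normalization are illustrative.

```python
import hashlib


def page_key(page):
    """Build a dedup key from the normalized title and a hash of the content."""
    # Assumed record shape: {"title": str, "content": str}
    title = page.get("title", "").strip().lower()
    content = " ".join(page.get("content", "").split())  # collapse whitespace
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return title, digest


def deduplicate(pages):
    """Keep the first page seen for each (title, content hash) key."""
    seen = set()
    unique = []
    for page in pages:
        key = page_key(page)
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique
```

Hashing the content instead of comparing raw strings keeps the `seen` set small, and keying on both title and hash means a page that was re-crawled with changed content is kept as a separate entry rather than silently dropped.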
Output Results
| File | Content |
|---|---|
| all_factions.json | Faction structured data |
| kenshi_worldbook.json | Worldbook entries |
| kenshi_full_data.json | Complete dataset |
| kenshi_clean_database/ | Cleaned categorical output |
Usage
```bash
# Install dependencies
pip install curl_cffi beautifulsoup4

# Run crawler
python run_safe_crawl.py

# Data cleaning
python clean_and_merge.py
```

Notes
- Please follow the Wiki's robots.txt and terms of use
- Recommend running crawler during off-peak hours
- Data is for learning and research purposes only
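
Honoring robots.txt can be automated with Python's standard `urllib.robotparser`. The sketch below is an assumption about how such a check might be wired in, not part of the project: `build_robot_checker` is a hypothetical helper, and it takes the robots.txt text as a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def build_robot_checker(robots_txt, user_agent="*"):
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


# Sample rules for illustration only; a real crawler would load the Wiki's own file.
sample_rules = "User-agent: *\nDisallow: /admin/\n"
allowed = build_robot_checker(sample_rules)
```

In a live crawler, the same parser can load the file directly with `parser.set_url(...)` followed by `parser.read()`, and the crawler would skip any URL for which `allowed(url)` is false.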
Last updated: 2026-03-26