Kenshi Wiki Data Scraping: Anonymous Crawling and Data Cleaning

Project Background

Kenshi is a content-rich game, and its Wiki is an important reference for players. This project systematically scrapes the Kenshi Wiki, covering factions, worldbook entries, and other core game data, to supply data for later game analysis tools.

Technical Challenges

Anti-Crawler Countermeasures

Wiki platforms typically deploy anti-crawler mechanisms, so the crawler applies several layers of anonymity protection:

Layer	Solution
Fingerprint spoofing	curl_cffi mimicking Chrome 131 TLS fingerprint
IP hiding	Auto-rotating pool of free proxies, with Tor network support
Behavior simulation	Random 2-5 second delay between requests, plus a 30% chance of a longer, human-like pause
Failure recovery	Auto-detect and remove failed proxies, auto-retry
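
The behavior-simulation layer can be sketched as a small helper. This is a minimal illustration, not the project's actual code: the function name `human_delay` and the 8-15 second range for the long pause are assumptions, since the table only specifies the 2-5 second base delay and the 30% probability.

```python
import random
import time

def human_delay(base_range=(2.0, 5.0), long_prob=0.3, long_range=(8.0, 15.0),
                rng=random, sleep=time.sleep):
    """Sleep for a randomized interval and return the number of seconds slept."""
    # Base delay: uniformly random between 2 and 5 seconds
    delay = rng.uniform(*base_range)
    # With 30% probability, add a longer pause (the range is an assumed value)
    if rng.random() < long_prob:
        delay += rng.uniform(*long_range)
    sleep(delay)
    return delay
```

Calling `human_delay()` between requests gives irregular spacing. The `rng` and `sleep` parameters exist only so the helper can be exercised without actually waiting.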

Crawler Implementation

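from curl_cffi import requests

# Reuse one session that mimics Chrome 131's TLS fingerprint
session = requests.Session(impersonate="chrome131")

def crawl_page(url, proxy=None):
    """Fetch a page, optionally through a proxy, and return its HTML text."""
    # Route both HTTP and HTTPS traffic through the same proxy when one is given
    proxies = {"http": proxy, "https": proxy} if proxy else None
    response = session.get(url, proxies=proxies, timeout=30)
    # Raise on 4xx/5xx so callers can retry instead of parsing an error page
    response.raise_for_status()
    return response.text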
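
Proxy rotation and failure recovery could be layered on top of a fetch function like `crawl_page`. The sketch below is one possible shape, not the project's implementation: `ProxyPool` and `fetch_with_rotation` are hypothetical names, and `fetch` stands for any callable that takes `(url, proxy)` and raises on failure.

```python
import random

class ProxyPool:
    """Hypothetical in-memory proxy pool."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def pick(self):
        # Random proxy, or None (direct connection) once the pool is empty
        return random.choice(self.proxies) if self.proxies else None

    def discard(self, proxy):
        # Remove a proxy that just failed; ignore None and unknown entries
        if proxy in self.proxies:
            self.proxies.remove(proxy)


def fetch_with_rotation(url, pool, fetch, max_attempts=3):
    """Call fetch(url, proxy) with a new proxy per attempt, dropping proxies that fail."""
    last_error = None
    for _ in range(max_attempts):
        proxy = pool.pick()
        try:
            return fetch(url, proxy)
        except Exception as exc:  # deliberately broad for a sketch
            last_error = exc
            pool.discard(proxy)
    raise RuntimeError(f"all {max_attempts} attempts failed for {url}") from last_error
```

A call such as `fetch_with_rotation(url, pool, crawl_page)` would then retry through different proxies until one succeeds or the attempts run out.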

Multi-Round Crawling Strategy

A progressive, multi-round strategy aims for complete data coverage:

Quick crawl → Safe crawl → Deep crawl → Link discovery → Supplemental crawl

Round Objectives

Round	Target	Characteristic
Quick crawl	Main pages	Short delay, fast coverage
Safe crawl	Important pages	Long delay, ensure success
Deep crawl	Detailed content	Parse sub-pages
Link discovery	Newly found links	Expand coverage
Supplemental crawl	Retry failures	Fill gaps
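
How the rounds might chain together can be sketched as follows. This is a simplified illustration under several assumptions: the function name `run_rounds`, the per-round delay values, and the way deep crawl and link discovery are folded into a single pass are all illustrative choices, and `fetch` and `extract_links` are caller-supplied callables rather than project functions.

```python
def run_rounds(seed_urls, fetch, extract_links, pause=lambda seconds: None):
    """Run the crawl rounds in order; return (pages, failed_urls)."""
    pages = {}      # url -> raw HTML
    failed = set()  # urls that have not succeeded yet

    def crawl(urls, delay):
        for url in urls:
            if url in pages:
                continue
            try:
                pages[url] = fetch(url)
                failed.discard(url)
            except Exception:
                failed.add(url)
            pause(delay)

    crawl(seed_urls, delay=2)        # quick crawl: short delay, fast coverage
    crawl(sorted(failed), delay=10)  # safe crawl: retry slowly
    # Deep crawl / link discovery: follow links found on pages fetched so far
    discovered = {link for html in list(pages.values()) for link in extract_links(html)}
    crawl(sorted(discovered - set(pages)), delay=5)
    crawl(sorted(failed), delay=10)  # supplemental crawl: fill remaining gaps
    return pages, failed
```

Passing `pause=time.sleep` makes the delays real. The default is a no-op so the control flow can be checked quickly.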

Data Cleaning Process

def clean_data(raw_data):
    """Run the cleaning pipeline; the four helpers called here are not shown in this article."""
    # 1. Merge multi-round crawl results
    all_pages = merge_crawl_results(raw_data)

    # 2. Deduplicate (based on page title and content hash)
    unique_pages = deduplicate(all_pages)

    # 3. Remove invalid fields and unify formats
    cleaned = clean_fields(unique_pages)

    # 4. Normalize structure
    return normalize_structure(cleaned)
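
The deduplication step could look roughly like the sketch below. It is not the project's actual `deduplicate`: it assumes each page is a dict with `title` and `content` keys, and the choice of SHA-256 and of case-insensitive title matching are also assumptions.

```python
import hashlib

def deduplicate(pages):
    """Keep the first page seen for each (title, content-hash) pair."""
    seen = set()
    unique = []
    for page in pages:
        # Normalize the title so trivial variations do not defeat the check
        title = page.get("title", "").strip().lower()
        digest = hashlib.sha256(page.get("content", "").encode("utf-8")).hexdigest()
        key = (title, digest)
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique
```

Keying on both title and hash means a page crawled twice with identical content collapses to one entry, while two revisions of the same title with different content are both kept.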

Cleaning Steps

Load and merge: Combine the results of all crawl rounds into one collection
Deduplication: Remove duplicates based on page title and content hash
Field cleaning: Remove invalid fields and unify formats
Structured output: Write the cleaned records as JSON datasets
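
The last two steps can be illustrated with a short sketch. It rests on assumptions: "invalid" is taken to mean empty values, pages are plain dicts, and the names `clean_page` and `write_dataset` are hypothetical rather than taken from the project.

```python
import json
from pathlib import Path

def clean_page(page):
    """Strip whitespace from string values and drop empty fields."""
    cleaned = {}
    for key, value in page.items():
        if isinstance(value, str):
            value = value.strip()
        # Treat empty strings, None, and empty containers as invalid (an assumption)
        if value in ("", None, [], {}):
            continue
        cleaned[key] = value
    return cleaned

def write_dataset(pages, path):
    """Write cleaned pages as pretty-printed UTF-8 JSON and return the output path."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as handle:
        json.dump([clean_page(p) for p in pages], handle, ensure_ascii=False, indent=2)
    return out
```

Setting `ensure_ascii=False` keeps any non-ASCII text readable in the output file instead of escaping it.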

Output Results

File	Content
`all_factions.json`	Faction structured data
`kenshi_worldbook.json`	Worldbook entries
`kenshi_full_data.json`	Complete dataset
`kenshi_clean_database/`	Cleaned categorical output

Usage

# Install dependencies
pip install curl_cffi beautifulsoup4
 
# Run crawler
python run_safe_crawl.py
 
# Data cleaning
python clean_and_merge.py

Notes

Follow the Wiki's robots.txt and terms of use
Run the crawler during off-peak hours where possible
The data is for learning and research purposes only
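
The robots.txt check can be automated with Python's standard library. The sketch below uses `urllib.robotparser`. The helper name `is_allowed` and the sample rules are illustrative only and do not reflect the actual Wiki's robots.txt.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Made-up rules for illustration
sample_rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(sample_rules, "https://example.org/wiki/Factions"))
print(is_allowed(sample_rules, "https://example.org/private/page"))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` loads the file over the network instead of parsing a string.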