Kenshi Wiki Data Scraping & Cleaning

Overview

This project systematically collects data from the Kenshi game wiki, covering factions, worldbook entries, and other core game data. It focuses on two things: keeping the crawler anonymous and cleaning the collected data thoroughly.

Crawler Strategy

Anonymous Crawler

Multi-layer anonymity protection:

Layer	Approach
Fingerprint spoofing	curl_cffi impersonating the Chrome 131 TLS fingerprint
IP hiding	Free proxy pool with auto-rotation, optional Tor support
Behavior simulation	Randomized 2-5 s delay between requests, plus a longer human-like pause on roughly 30% of requests
Failure recovery	Automatic dead proxy removal and retry
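
The delay and failure-recovery layers in the table above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `next_delay`, `ProxyPool`, and `fetch_with_retry` are hypothetical, the 5-15 s range for the long pause is an assumption (the table gives no figure), and the 30% value is read here as a per-request probability.

```python
import random
from collections import deque


def next_delay(rng: random.Random, long_pause_chance: float = 0.3) -> float:
    """Return a wait time in seconds: 2-5 s base, sometimes with a longer pause."""
    delay = rng.uniform(2.0, 5.0)
    if rng.random() < long_pause_chance:
        # Assumed range for the "human-like" long pause; not taken from the source.
        delay += rng.uniform(5.0, 15.0)
    return delay


class ProxyPool:
    """Rotating proxy pool that drops proxies reported as dead."""

    def __init__(self, proxies):
        self._proxies = deque(proxies)

    def get(self) -> str:
        if not self._proxies:
            raise RuntimeError("proxy pool is empty")
        proxy = self._proxies[0]
        self._proxies.rotate(-1)  # move the proxy just handed out to the back
        return proxy

    def mark_dead(self, proxy: str) -> None:
        if proxy in self._proxies:
            self._proxies.remove(proxy)

    def __len__(self) -> int:
        return len(self._proxies)


def fetch_with_retry(url, pool, fetch, max_attempts=3, retry_on=(OSError,)):
    """Call fetch(url, proxy) through successive proxies, removing ones that fail."""
    last_error = None
    for _ in range(max_attempts):
        proxy = pool.get()
        try:
            return fetch(url, proxy)
        except retry_on as exc:
            pool.mark_dead(proxy)  # dead proxy removal
            last_error = exc
    raise RuntimeError(f"all attempts failed for {url}") from last_error
```

In this sketch `fetch` is any callable supplied by the caller. In a real crawler it would presumably wrap a curl_cffi request with a Chrome 131 impersonation target, and the crawler would sleep for `next_delay(...)` seconds between calls.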

Multi-Round Crawling

Crawling proceeds in progressive rounds: quick crawl → safe crawl → deep crawl → link discovery → supplementary crawl. Each later round picks up what earlier rounds missed, ensuring complete data coverage.
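
One way to picture the round sequence is a loop in which each round retries only the pages still missing, and the link discovery round queues titles referenced by pages already collected. The sketch below is an assumption about how such a loop could look, not the project's implementation: `run_rounds`, the `fetch_page(title, mode)` callable, and the page dictionary layout (`"content"`, `"links"`) are all hypothetical.

```python
def run_rounds(seed_titles, fetch_page, rounds):
    """Run crawl rounds in order; each round only fetches pages still missing.

    fetch_page(title, mode) is assumed to return a dict with "content" and
    "links" keys, or None when the fetch fails.
    Returns (pages, pending): collected pages and titles that never succeeded.
    """
    pages = {}
    pending = list(seed_titles)
    for mode in rounds:
        if mode == "link discovery":
            # Queue every linked title that has not been collected yet.
            discovered = {link for page in pages.values() for link in page["links"]}
            pending = sorted((set(pending) | discovered) - set(pages))
            continue
        still_missing = []
        for title in pending:
            page = fetch_page(title, mode)
            if page is None:
                still_missing.append(title)
            else:
                pages[title] = page
        pending = still_missing
    return pages, pending


ROUNDS = ["quick", "safe", "deep", "link discovery", "supplementary"]
```

Calling `run_rounds(seeds, fetch_page, ROUNDS)` returns both the collected pages and any titles that failed in every round, which makes remaining coverage gaps explicit.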

Data Cleaning

The cleaning pipeline includes:

Loading and merging results from multiple crawl rounds
Deduplicating pages (based on page title and content hash)
Removing invalid fields and normalizing formats
Outputting structured JSON datasets (all_factions.json, kenshi_worldbook.json, etc.)
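
The pipeline steps above can be sketched as a few small functions. This is an illustrative sketch under stated assumptions, not the project's actual code: the function names, the `"title"` and `"content"` field names, the use of SHA-256, and the rule that a record counts as a duplicate when either its title or its content hash has already been seen are all assumptions.

```python
import hashlib
import json
from pathlib import Path


def content_hash(text: str) -> str:
    """Hash page content so identical text is detected under different titles."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def clean_record(record: dict) -> dict:
    """Collapse whitespace in string fields and drop empty fields."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = " ".join(value.split())
        if value in (None, "", [], {}):
            continue
        cleaned[key] = value
    return cleaned


def load_rounds(paths):
    """Load one JSON list of records per crawl-round file."""
    return [json.loads(Path(p).read_text(encoding="utf-8")) for p in paths]


def merge_and_dedupe(rounds):
    """Merge records from all rounds, keeping the first copy of each page."""
    seen_titles, seen_hashes, result = set(), set(), []
    for records in rounds:
        for record in records:
            record = clean_record(record)
            title = record.get("title", "").casefold()
            digest = content_hash(record.get("content", ""))
            # Assumed rule: skip untitled records and anything whose title
            # or content hash was already seen.
            if not title or title in seen_titles or digest in seen_hashes:
                continue
            seen_titles.add(title)
            seen_hashes.add(digest)
            result.append(record)
    return result


def write_dataset(records, path):
    """Write the cleaned records as UTF-8 JSON."""
    text = json.dumps(records, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")
```

A typical call chain would be `write_dataset(merge_and_dedupe(load_rounds(paths)), "all_factions.json")`, where `paths` lists the per-round result files.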

Outputs

all_factions.json — Structured faction data
kenshi_worldbook.json — Worldbook entries
kenshi_full_data.json — Complete dataset
kenshi_clean_database/ — Cleaned categorized output