Kenshi Wiki Data Scraping: Anonymous Crawling and Data Cleaning

Systematic scraping, cleaning, and organizing of Kenshi game Wiki data to build structured game knowledge datasets.

Tags: Python, Web Scraping, Data, Proxy, Kenshi

Project Background

Kenshi has rich game content, and its Wiki is an important source of information for players. This project systematically scrapes the Kenshi Wiki, covering factions, worldbook entries, and other core game data, so that later game analysis tools have a structured dataset to build on.

Technical Challenges

Anti-Crawler Countermeasures

Wiki platforms typically run anti-crawler mechanisms, so the crawler combines several layers of anonymity protection:

| Layer | Solution |
| --- | --- |
| Fingerprint spoofing | curl_cffi mimicking Chrome 131 TLS fingerprint |
| IP hiding | Free proxy pool auto-rotation, Tor network support |
| Behavior simulation | 2-5 second smart delay + 30% probability human long delay |
| Failure recovery | Auto-detect and remove failed proxies, auto-retry |
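The behavior-simulation layer can be sketched as a small delay function. This is a minimal sketch, not the project's actual code: the name `next_delay` is hypothetical, and the 8-15 second range for the long pause is an assumption, since the post only gives the 2-5 second base delay and the 30% probability.

```python
import random


def next_delay(rng=random, base_range=(2.0, 5.0), long_prob=0.3, long_range=(8.0, 15.0)):
    """Return how many seconds to wait before the next request."""
    # Base delay: uniformly random within the 2-5 second window
    delay = rng.uniform(*base_range)
    # With 30% probability, add a longer pause (range is an assumed value)
    if rng.random() < long_prob:
        delay += rng.uniform(*long_range)
    return delay
```

A crawler loop would then call `time.sleep(next_delay())` between requests.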

Crawler Implementation

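from curl_cffi import requests

# Reuse one session so every request presents the Chrome 131 TLS fingerprint
session = requests.Session(impersonate="chrome131")

def crawl_page(url, proxy=None):
    """Fetch a page and return its HTML, optionally through a proxy."""
    # Route both schemes through the same proxy when one is given
    proxies = {"http": proxy, "https": proxy} if proxy else None
    response = session.get(url, proxies=proxies, timeout=30)
    # Raise on 4xx/5xx so callers can tell a blocked request from a real page
    response.raise_for_status()
    return response.text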
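Proxy rotation with failure recovery could look roughly like the following. This is a sketch under assumptions: `ProxyPool` and `fetch_with_retry` are hypothetical names, the pool is a plain in-memory list, and any exception is treated as a proxy failure, which is a simplification. The `fetch` argument is any callable taking `(url, proxy)`.

```python
import random


class ProxyPool:
    """Hypothetical in-memory pool that forgets proxies once they fail."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def pick(self):
        # Random choice spreads requests across the remaining proxies
        return random.choice(self.proxies)

    def discard(self, proxy):
        if proxy in self.proxies:
            self.proxies.remove(proxy)


def fetch_with_retry(url, pool, fetch, max_attempts=3):
    """Call fetch(url, proxy), switching to another proxy after each failure."""
    last_error = None
    for _ in range(max_attempts):
        if not pool.proxies:
            break  # nothing left to rotate to
        proxy = pool.pick()
        try:
            return fetch(url, proxy)
        except Exception as exc:  # simplification: every error counts against the proxy
            last_error = exc
            pool.discard(proxy)
    raise RuntimeError(f"could not fetch {url}") from last_error
```

Since `crawl_page` takes `(url, proxy)`, it can be passed directly: `fetch_with_retry(url, pool, crawl_page)`.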

Multi-Round Crawling Strategy

A progressive, multi-round strategy works toward complete data coverage:

Quick crawl → Safe crawl → Deep crawl → Link discovery → Supplemental crawl

Round Objectives

| Round | Target | Characteristic |
| --- | --- | --- |
| Quick crawl | Main pages | Short delay, fast coverage |
| Safe crawl | Important pages | Long delay, ensure success |
| Deep crawl | Detailed content | Parse sub-pages |
| Link discovery | Newly found links | Expand coverage |
| Supplemental crawl | Retry failures | Fill gaps |
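One way to picture how the rounds hand work to each other is the simplified loop below. It is an illustration only: `CrawlRound`, `ROUNDS`, and `run_rounds` are hypothetical names, the delay ranges are made-up values, and the rounds here differ only in their delay, with every round retrying earlier failures and queuing newly found links. The real project may select targets per round differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlRound:
    name: str
    delay_range: tuple  # (min_seconds, max_seconds); illustrative values only


ROUNDS = [
    CrawlRound("quick", (0.5, 1.0)),
    CrawlRound("safe", (5.0, 10.0)),
    CrawlRound("deep", (2.0, 5.0)),
    CrawlRound("link_discovery", (2.0, 5.0)),
    CrawlRound("supplemental", (5.0, 10.0)),
]


def run_rounds(seed_urls, fetch, extract_links, rounds=ROUNDS):
    """Crawl in rounds; each round handles what earlier rounds left undone."""
    pages = {}  # url -> html for every successful fetch
    pending = list(seed_urls)
    for rnd in rounds:
        failed, discovered = [], []
        for url in pending:
            if url in pages:
                continue
            try:
                html = fetch(url, rnd.delay_range)
            except Exception:
                failed.append(url)  # retried in the next round
                continue
            pages[url] = html
            discovered.extend(extract_links(html))
        # Next round: failures first, then new links, without duplicates
        pending = list(dict.fromkeys(u for u in failed + discovered if u not in pages))
    return pages, pending
```

The second return value lists URLs that were still unfetched after the last round, which makes remaining gaps easy to report.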

Data Cleaning Process

def clean_data(raw_data):
    # 1. Merge multi-round crawl results
    all_pages = merge_crawl_results(raw_data)
 
    # 2. Deduplicate (based on page title and content hash)
    unique_pages = deduplicate(all_pages)
 
    # 3. Clean invalid fields
    cleaned = clean_fields(unique_pages)
 
    # 4. Normalize structure
    return normalize_structure(cleaned)

Cleaning Steps

  1. Load and merge: Load the results of every crawl round and merge them into one collection
  2. Deduplication: Remove duplicates based on page title and content hash
  3. Field cleaning: Remove invalid fields and unify formats
  4. Structured output: Generate JSON datasets
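The deduplication step can be sketched as follows. This is a hypothetical version, not the project's own `deduplicate`: it assumes each page is a dict with `title` and `content` keys, normalizes the title by trimming and lowercasing, and uses SHA-256 as the content hash, none of which the post specifies.

```python
import hashlib


def content_key(page):
    """Build a dedup key from the normalized title and a hash of the content."""
    title = page.get("title", "").strip().lower()
    digest = hashlib.sha256(page.get("content", "").encode("utf-8")).hexdigest()
    return title, digest


def deduplicate(pages):
    """Keep the first page seen for each (title, content hash) pair."""
    seen = set()
    unique = []
    for page in pages:
        key = content_key(page)
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique
```

Keying on both title and hash means a page that was re-crawled with changed content is kept as a separate entry rather than silently dropped.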

Output Results

| File | Content |
| --- | --- |
| all_factions.json | Faction structured data |
| kenshi_worldbook.json | Worldbook entries |
| kenshi_full_data.json | Complete dataset |
| kenshi_clean_database/ | Cleaned categorical output |

Usage

# Install dependencies
pip install curl_cffi beautifulsoup4
 
# Run crawler
python run_safe_crawl.py
 
# Data cleaning
python clean_and_merge.py

Notes

  • Follow the Wiki's robots.txt and terms of use
  • Run the crawler during off-peak hours to reduce load on the Wiki
  • The data is for learning and research purposes only
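Checking robots.txt before fetching can be done with Python's standard `urllib.robotparser`. The sketch below is one possible approach: the helper name `build_robots_checker` is hypothetical, and it parses robots.txt text that has already been downloaded, so the example itself makes no network request.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="*"):
    """Return a function that says whether a URL may be fetched."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so the text can come from any source
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)
```

For example, with rules of `User-agent: *` and `Disallow: /private/`, the returned checker rejects any URL whose path starts with `/private/` and allows the rest.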

Last updated: 2026-03-26