Tools · Complete

Kenshi Wiki Data Scraping & Cleaning

Systematic scraping, cleaning, and structuring of Kenshi game wiki data into organized knowledge datasets.

Python · curl_cffi · BeautifulSoup · Proxy

Overview

Systematic data collection from the Kenshi game wiki, covering factions, worldbook entries, and other core game data. The project prioritizes two things: keeping the crawler anonymous and cleaning the collected data thoroughly.

Crawler Strategy

Anonymous Crawler

Multi-layer anonymity protection:

Layer | Approach
Fingerprint spoofing | curl_cffi impersonating the Chrome 131 TLS fingerprint
IP hiding | Free proxy pool with automatic rotation; optional Tor support
Behavior simulation | Randomized 2-5 s delay between requests, plus a longer human-like pause on about 30% of requests
Failure recovery | Automatic removal of dead proxies, followed by a retry
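
The layers above could be combined roughly as follows. This is a minimal sketch, not the project's actual code: the names `ProxyPool`, `next_delay`, `curl_cffi_get`, and `fetch_with_rotation` are hypothetical, the 5-15 s range for the long pause is an assumption, and the proxy addresses would come from whatever free proxy list is in use.

```python
import random
import time


def next_delay(long_pause_chance=0.3):
    """Return a wait in seconds: 2-5 s, plus a longer pause on ~30% of calls."""
    delay = random.uniform(2.0, 5.0)
    if random.random() < long_pause_chance:
        delay += random.uniform(5.0, 15.0)  # assumed range for the long pause
    return delay


class ProxyPool:
    """Hypothetical pool that hands out random proxies and drops dead ones."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def pick(self):
        if not self.proxies:
            raise RuntimeError("proxy pool is empty")
        return random.choice(self.proxies)

    def discard(self, proxy):
        if proxy in self.proxies:
            self.proxies.remove(proxy)


def curl_cffi_get(url, proxy):
    """Fetch one page through a proxy with a Chrome 131 TLS fingerprint."""
    from curl_cffi import requests  # imported lazily; third-party dependency

    response = requests.get(
        url,
        impersonate="chrome131",
        proxies={"http": proxy, "https": proxy},
        timeout=20,
    )
    response.raise_for_status()
    return response.text


def fetch_with_rotation(url, pool, get=curl_cffi_get, max_attempts=5,
                        sleep=time.sleep):
    """Try proxies until one works; remove each proxy that fails."""
    last_error = None
    for _ in range(max_attempts):
        proxy = pool.pick()
        try:
            html = get(url, proxy)
        except Exception as exc:  # any failure marks the proxy as dead
            pool.discard(proxy)
            last_error = exc
            continue
        sleep(next_delay())  # pause before the caller issues the next request
        return html
    raise RuntimeError(f"all attempts failed for {url}") from last_error
```

Passing `get` and `sleep` as parameters keeps the rotation logic testable without network access; Tor support would be one more proxy entry (for example a local SOCKS address) rather than a separate code path.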

Multi-Round Crawling

Progressive strategy: quick crawl → safe crawl → deep crawl → link discovery → supplementary crawl, ensuring complete data coverage.
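
One way to picture the rounds is as a loop over a pending set: each round fetches what is pending, link discovery feeds the next round, and failed pages carry over for a supplementary attempt. The sketch below is an illustration under assumptions, not the project's implementation: `discover_links` and `crawl_rounds` are hypothetical names, and the standard-library `HTMLParser` stands in for BeautifulSoup so the example has no third-party dependencies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover_links(html, base_url):
    """Return same-host absolute URLs found in the page, without fragments."""
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    found = set()
    for href in parser.hrefs:
        absolute = urljoin(base_url, href).split("#", 1)[0]
        if urlparse(absolute).netloc == base_host:
            found.add(absolute)
    return found


def crawl_rounds(seeds, fetch, max_rounds=5):
    """Crawl in rounds; return (pages, urls_still_pending)."""
    pages = {}
    pending = set(seeds)
    for _ in range(max_rounds):
        if not pending:
            break
        next_pending = set()
        for url in sorted(pending):
            try:
                html = fetch(url)
            except Exception:
                next_pending.add(url)  # retry in a later round
                continue
            pages[url] = html
            next_pending |= discover_links(html, url)
        pending = next_pending - set(pages)  # skip pages already collected
    return pages, pending
```

Here `fetch` is any callable that takes a URL and returns HTML; whatever remains in the second return value after the last round is the list of pages that still need a supplementary crawl.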

Data Cleaning

The cleaning pipeline includes:

  • Loading and merging results from multiple crawl rounds
  • Deduplication (based on page title and content hash)
  • Invalid field cleaning and format normalization
  • Outputting structured JSON datasets (all_factions.json, kenshi_worldbook.json, etc.)
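
The merge, deduplication, and normalization steps above might look like the following sketch. Several details are assumptions made for illustration: the field names `title` and `content`, the helper names `normalize`, `content_hash`, and `merge_rounds`, and the rule that a record counts as a duplicate when either its title or its content hash has already been seen.

```python
import hashlib
import json


def normalize(record):
    """Collapse whitespace in strings and drop empty or missing fields."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = " ".join(value.split())
        if value is None or value == "" or value == [] or value == {}:
            continue
        cleaned[key] = value
    return cleaned


def content_hash(text):
    """Stable fingerprint of the page text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def merge_rounds(rounds):
    """Merge records from several crawl rounds, keeping first occurrences."""
    seen_titles = set()
    seen_hashes = set()
    merged = []
    for records in rounds:
        for raw in records:
            record = normalize(raw)
            title = record.get("title", "").casefold()
            content = record.get("content", "")
            digest = content_hash(content) if content else None
            if not title or title in seen_titles:
                continue  # untitled page or repeated title
            if digest is not None and digest in seen_hashes:
                continue  # same text under a different title
            seen_titles.add(title)
            if digest is not None:
                seen_hashes.add(digest)
            merged.append(record)
    return merged


def to_json(records):
    """Serialize the cleaned records as readable JSON text."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Writing the result of `to_json` to a file such as all_factions.json would complete the last step; earlier rounds win on conflicts here, which is a design choice that could just as well be reversed.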

Outputs

  • all_factions.json — Structured faction data
  • kenshi_worldbook.json — Worldbook entries
  • kenshi_full_data.json — Complete dataset
  • kenshi_clean_database/ — Cleaned categorized output