Chapter 3 Data transformation
3.1 Getting readable data
We used Selenium and Beautiful Soup to parse the HLTV website and download the demo files. To avoid putting too much load on the site, we spread the downloads over a full week, which yielded more than 600 demo files. Python notebook - parser and demo downloader
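Spreading downloads over time can be sketched as a loop that pauses after each request. This is a minimal illustration, not the notebook's code: `download_all`, `fetch`, `delay_seconds`, and `sleep` are hypothetical names, and the real scraping relied on Selenium and Beautiful Soup rather than on the injected `fetch` callable used here.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, List


def download_all(
    urls: Iterable[str],
    out_dir: Path,
    fetch: Callable[[str], bytes],
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Path]:
    """Download each URL in turn, pausing after every request to limit load."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Hypothetical naming rule: use the last path segment as the file name.
        target = out_dir / url.rstrip("/").split("/")[-1]
        if target.exists():
            # Already downloaded in an earlier session: no request, no pause.
            saved.append(target)
            continue
        target.write_bytes(fetch(url))
        saved.append(target)
        sleep(delay_seconds)
    return saved
```

Skipping files that already exist lets the job be stopped and resumed across several days without requesting the same demo twice.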
To generate readable data from the demo files, we used a Python library [1] to parse them into JSON files. This produced 625 successful conversions and 20 GB of pre-processed data from the original 80+ GB. Python notebook - pre-processing.
The pre-processed data is still quite dirty: some files were corrupted, and in-game technical pauses, warmup rounds, game restarts, and similar events can break the format and add extra rounds. We were nevertheless able to clean the data using the cleaning functions of the csgo Python library.
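One simple way to set aside unusable files before cleaning is to try loading each one and record the failures. The sketch below rests on an assumption about what "corrupted" means here: a file that does not decode as JSON, or that has no non-empty list under a `gameRounds` key (an assumed layout). `load_valid_matches` is a hypothetical helper; the round-level cleaning itself is left to the csgo library and is not reproduced.

```python
import json
from pathlib import Path
from typing import Dict, List, Tuple


def load_valid_matches(json_dir: Path) -> Tuple[Dict[str, dict], List[str]]:
    """Return (matches keyed by file stem, names of files that were rejected)."""
    matches, rejected = {}, []
    for path in sorted(json_dir.glob("*.json")):
        try:
            with path.open(encoding="utf-8") as fh:
                data = json.load(fh)
        except (json.JSONDecodeError, UnicodeDecodeError):
            # The file is truncated or otherwise not valid JSON text.
            rejected.append(path.name)
            continue
        # Assumed layout: a dict holding a non-empty list under "gameRounds".
        rounds = data.get("gameRounds") if isinstance(data, dict) else None
        if not isinstance(rounds, list) or not rounds:
            rejected.append(path.name)
            continue
        matches[path.stem] = data
    return matches, rejected
```

Keeping the list of rejected file names makes it easy to report how many conversions were usable.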
The following Python notebook explains the data structure of a generated JSON file: Python notebook - data structures
To narrow down the number of data points, we merged all the match JSON files into a small set of combined JSON files containing the information we judged relevant for analysis: kills data, damages data, flashes data, and grenades data. To do so, we iterated through every match JSON file and extracted the keys we wanted to analyze.
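The merge described above can be sketched as a nested loop over matches, rounds, and event types. This is an illustration under assumptions, not the notebook's code: it assumes each parsed match is a dict with a `mapName` and a `gameRounds` list, that each round carries a `roundNum` and per-type event lists, and that `round_tot` is the number of rounds in the match. `collect_events` and `EVENT_KEYS` are hypothetical names.

```python
from typing import Dict, List

# Assumed names of the per-round event lists in a parsed match.
EVENT_KEYS = ("kills", "damages", "flashes", "grenades")


def collect_events(matches: Dict[str, dict]) -> Dict[str, List[dict]]:
    """Flatten per-round event lists from every match into one list per type."""
    combined = {key: [] for key in EVENT_KEYS}
    for match_id, match in matches.items():
        rounds = match.get("gameRounds") or []
        for game_round in rounds:
            for key in EVENT_KEYS:
                # A round may have no events of a type (missing key or None).
                for event in game_round.get(key) or []:
                    row = dict(event)  # copy so the parsed match is not mutated
                    row["match_id"] = match_id
                    row["map_name"] = match.get("mapName")
                    row["round_num"] = game_round.get("roundNum")
                    row["round_tot"] = len(rounds)
                    combined[key].append(row)
    return combined
```

Each resulting list can then be written to its own file, so that every row carries enough context to be traced back to its match and round.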
To combine the JSON files and attach match information such as score, winner, and teams, we cross-referenced them with the parsed HTML from HLTV. Python notebook - combining json files. This last step gives consistent data across all matches and events, which can later be joined or grouped in R.
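Attaching match-level information to each event amounts to a lookup by match identifier. The sketch below assumes the scraped HLTV details are already available as a dict keyed by match id; how matches were paired with HLTV pages is not covered here. `attach_match_info` and `MATCH_FIELDS` are hypothetical names, while the field names follow the columns listed later in this chapter.

```python
from typing import Dict, Iterable, List

MATCH_FIELDS = ("match_num", "match_event", "match_date", "match_team1", "match_team2")


def attach_match_info(events: Iterable[dict], match_info: Dict[str, dict]) -> List[dict]:
    """Add match-level fields to each event row, looked up by its match_id."""
    enriched = []
    for event in events:
        info = match_info.get(event.get("match_id"), {})
        row = dict(event)  # leave the input rows untouched
        for field in MATCH_FIELDS:
            # Missing metadata becomes None so every row has the same keys.
            row[field] = info.get(field)
        enriched.append(row)
    return enriched
```

Giving every row the same set of keys, even when metadata is missing, keeps the combined files rectangular when they are later read into R.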
Lion drive link to the combined JSON files and parsed HTML (1.5 GB)
3.2 Kills data
The kills data, stored in the ALL_kills.json file, contains information on every kill, such as the player positions and the weapon used. It has 103,261 rows and 59 columns. The column names in the kills data set are:
## [1] "tick" "seconds" "clockTime"
## [4] "attackerSteamID" "attackerName" "attackerTeam"
## [7] "attackerSide" "attackerX" "attackerY"
## [10] "attackerZ" "attackerAreaID" "attackerAreaName"
## [13] "attackerViewX" "attackerViewY" "victimSteamID"
## [16] "victimName" "victimTeam" "victimSide"
## [19] "victimX" "victimY" "victimZ"
## [22] "victimAreaID" "victimAreaName" "victimViewX"
## [25] "victimViewY" "assisterSteamID" "assisterName"
## [28] "assisterTeam" "assisterSide" "isSuicide"
## [31] "isTeamkill" "isWallbang" "penetratedObjects"
## [34] "isFirstKill" "isHeadshot" "victimBlinded"
## [37] "attackerBlinded" "flashThrowerSteamID" "flashThrowerName"
## [40] "flashThrowerTeam" "flashThrowerSide" "noScope"
## [43] "thruSmoke" "distance" "isTrade"
## [46] "playerTradedName" "playerTradedTeam" "playerTradedSteamID"
## [49] "weapon" "round_won_by" "match_id"
## [52] "map_name" "round_num" "round_tot"
## [55] "match_num" "match_event" "match_date"
## [58] "match_team1" "match_team2"
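To illustrate how one row per kill can be used, the following sketch computes the share of headshot kills per weapon from the `weapon` and `isHeadshot` columns. It is an example under assumptions rather than part of the original analysis: `headshot_rate_by_weapon` is a hypothetical helper, and the rows are assumed to be available as a list of dicts.

```python
from collections import defaultdict
from typing import Dict, Iterable


def headshot_rate_by_weapon(kills: Iterable[dict]) -> Dict[str, float]:
    """Fraction of kills flagged isHeadshot, computed separately per weapon."""
    totals = defaultdict(int)
    headshots = defaultdict(int)
    for kill in kills:
        weapon = kill["weapon"]
        totals[weapon] += 1
        if kill["isHeadshot"]:
            headshots[weapon] += 1
    return {weapon: headshots[weapon] / totals[weapon] for weapon in totals}


# Toy rows with made-up values, only to show the expected shape of the input.
sample = [
    {"weapon": "AK-47", "isHeadshot": True},
    {"weapon": "AK-47", "isHeadshot": False},
    {"weapon": "AWP", "isHeadshot": False},
]
rates = headshot_rate_by_weapon(sample)
```

With the real data, the input would be the records loaded from ALL_kills.json, assuming the file holds one JSON object per kill.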
3.3 Damages data
The damages data, stored in the ALL_damages.json file, contains information on every instance of damage dealt, such as the player positions, the weapon used, and the amount of damage. It has 337,151 rows and 42 columns. The column names in the damages data set are:
## [1] "tick" "seconds" "clockTime" "attackerSteamID"
## [5] "attackerName" "attackerTeam" "attackerSide" "attackerX"
## [9] "attackerY" "attackerZ" "attackerAreaID" "attackerAreaName"
## [13] "attackerViewX" "attackerViewY" "attackerStrafe" "victimSteamID"
## [17] "victimName" "victimTeam" "victimSide" "victimX"
## [21] "victimY" "victimZ" "victimAreaID" "victimAreaName"
## [25] "victimViewX" "victimViewY" "weapon" "hpDamage"
## [29] "hpDamageTaken" "armorDamage" "armorDamageTaken" "hitGroup"
## [33] "round_won_by" "match_id" "map_name" "round_num"
## [37] "round_tot" "match_num" "match_event" "match_date"
## [41] "match_team1" "match_team2"
3.4 Flashes data
The flashes data, stored in the ALL_flashes.json file, contains information on every flash grenade thrown. It has 232,181 rows and 36 columns. The column names in the flashes data set are:
## [1] "tick" "seconds" "clockTime" "attackerSteamID"
## [5] "attackerName" "attackerTeam" "attackerSide" "attackerX"
## [9] "attackerY" "attackerZ" "attackerAreaID" "attackerAreaName"
## [13] "attackerViewX" "attackerViewY" "playerSteamID" "playerName"
## [17] "playerTeam" "playerSide" "playerX" "playerY"
## [21] "playerZ" "playerAreaID" "playerAreaName" "playerViewX"
## [25] "playerViewY" "flashDuration" "round_won_by" "match_id"
## [29] "map_name" "round_num" "round_tot" "match_num"
## [33] "match_event" "match_date" "match_team1" "match_team2"
3.5 Grenades data
The grenades data, stored in the ALL_grenades.json file, contains information on every grenade thrown that is not a flash grenade. It has 298,613 rows and 30 columns. The column names in the grenades data set are:
## [1] "throwTick" "destroyTick" "throwSeconds" "destroySeconds"
## [5] "throwerSteamID" "throwerName" "throwerTeam" "throwerSide"
## [9] "throwerX" "throwerY" "throwerZ" "throwerAreaID"
## [13] "throwerAreaName" "grenadeType" "grenadeX" "grenadeY"
## [17] "grenadeZ" "grenadeAreaID" "grenadeAreaName" "UniqueID"
## [21] "round_won_by" "match_id" "map_name" "round_num"
## [25] "round_tot" "match_num" "match_event" "match_date"
## [29] "match_team1" "match_team2"
3.6 Common data
All of the JSON data sets above share the same information on the matches (match id, event, date) and teams (team1, team2, winner).
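Because the data sets share these columns, rows from different files can be aligned on a common key. The sketch below counts kills and grenades per round using the pair (`match_id`, `round_num`) as that key. It is only an illustration: the function names are hypothetical, and treating this pair as a unique round identifier is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

RoundKey = Tuple[str, int]


def events_per_round(rows: Iterable[dict]) -> Counter:
    """Count rows for each (match_id, round_num) pair."""
    return Counter((row["match_id"], row["round_num"]) for row in rows)


def kills_and_grenades_per_round(
    kills: Iterable[dict], grenades: Iterable[dict]
) -> Dict[RoundKey, Tuple[int, int]]:
    """Align two data sets on the shared round key; missing counts become 0."""
    kill_counts = events_per_round(kills)
    grenade_counts = events_per_round(grenades)
    keys = sorted(set(kill_counts) | set(grenade_counts))
    return {key: (kill_counts[key], grenade_counts[key]) for key in keys}
```

The same key could serve as the join or grouping columns when the combined files are processed in R.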
[1] Analyzing Counter-Strike: Global Offensive Data, Peter Xenopoulos, (2021), GitHub repository, https://github.com/pnxenopoulos/csgo