Objective

The objective of this project is to perform data cleaning and transformation on the raw, messy fifa21 data set, and then to analyze the data to derive insights such as who are the most valuable players, which players have stayed at a club for the longest time, the most skilled players, and if there’s a relationship between age, weight, and height of a player.

library(tidyverse)
library(skimr)

Import the data set

fifa <- read_csv("F:\\Tutorials\\R tutorials\\Datasets\\fifa21 raw data v2.csv")
View(fifa)
head(fifa)
## # A tibble: 6 x 77
##       ID Name   LongName photoUrl playerUrl Nationality   Age `<U+2193>OVA`   POT Club 
##    <dbl> <chr>  <chr>    <chr>    <chr>     <chr>       <dbl>  <dbl> <dbl> <chr>
## 1 158023 L. Me~ Lionel ~ https:/~ http://s~ Argentina      33     93    93 "\n\~
## 2  20801 Crist~ C. Rona~ https:/~ http://s~ Portugal       35     92    92 "\n\~
## 3 200389 J. Ob~ Jan Obl~ https:/~ http://s~ Slovenia       27     91    93 "\n\~
## 4 192985 K. De~ Kevin D~ https:/~ http://s~ Belgium        29     91    91 "\n\~
## 5 190871 Neyma~ Neymar ~ https:/~ http://s~ Brazil         28     91    91 "\n\~
## 6 188545 R. Le~ Robert ~ https:/~ http://s~ Poland         31     91    91 "\n\~
## # i 67 more variables: Contract <chr>, Positions <chr>, Height <chr>,
## #   Weight <chr>, `Preferred Foot` <chr>, BOV <dbl>, `Best Position` <chr>,
## #   Joined <chr>, `Loan Date End` <chr>, Value <chr>, Wage <chr>,
## #   `Release Clause` <chr>, Attacking <dbl>, Crossing <dbl>, Finishing <dbl>,
## #   `Heading Accuracy` <dbl>, `Short Passing` <dbl>, Volleys <dbl>,
## #   Skill <dbl>, Dribbling <dbl>, Curve <dbl>, `FK Accuracy` <dbl>,
## #   `Long Passing` <dbl>, `Ball Control` <dbl>, Movement <dbl>, ...

Understanding the variables in the data set

This data set contains a unique player ID, their general information such as: Name, Club, Nationality, Age, Height, Weight, and positions, and also football attributes such as Attacking, Skills, Movement, Power, and Mentality

Attacking = Crossing + Finishing + Heading Accuracy + Short Passing + Volleys

Skill = Dribbling + Curve + FK Accuracy + Long Passing + Ball Control

Movement = Acceleration + Sprint speed + Agility + Reactions + Balance

Power = Shot Power + Jumping + Stamina + Strength + Long Shots

Mentality = Aggression + Interceptions + Positioning + Vision + Penalties

Goal Keeping = GK Diving + GK Handling + GK Kicking + GK Positioning + GK Reflexes

Defending = Marking + Standing Tackle + Sliding Tackle

OVA = Overall Average of the player

Exploring the data set first

Use glimpse to check the structure and data types

glimpse(fifa)
## Rows: 18,979
## Columns: 77
## $ ID                 <dbl> 158023, 20801, 200389, 192985, 190871, 188545, 2093~
## $ Name               <chr> "L. Messi", "Cristiano Ronaldo", "J. Oblak", "K. De~
## $ LongName           <chr> "Lionel Messi", "C. Ronaldo dos Santos Aveiro", "Ja~
## $ photoUrl           <chr> "https://cdn.sofifa.com/players/158/023/21_60.png",~
## $ playerUrl          <chr> "http://sofifa.com/player/158023/lionel-messi/21000~
## $ Nationality        <chr> "Argentina", "Portugal", "Slovenia", "Belgium", "Br~
## $ Age                <dbl> 33, 35, 27, 29, 28, 31, 28, 27, 21, 28, 28, 28, 28,~
## $ `<U+2193>OVA`             <dbl> 93, 92, 91, 91, 91, 91, 90, 90, 90, 90, 90, 90, 89,~
## $ POT                <dbl> 93, 92, 93, 91, 91, 91, 90, 91, 95, 93, 91, 90, 89,~
## $ Club               <chr> "\n\n\n\nFC Barcelona", "\n\n\n\nJuventus", "\n\n\n~
## $ Contract           <chr> "2004 ~ 2021", "2018 ~ 2022", "2014 ~ 2023", "2015 ~
## $ Positions          <chr> "RW, ST, CF", "ST, LW", "GK", "CAM, CM", "LW, CAM",~
## $ Height             <chr> "170cm", "187cm", "188cm", "181cm", "175cm", "184cm~
## $ Weight             <chr> "72kg", "83kg", "87kg", "70kg", "68kg", "80kg", "71~
## $ `Preferred Foot`   <chr> "Left", "Right", "Right", "Right", "Right", "Right"~
## $ BOV                <dbl> 93, 92, 91, 91, 91, 91, 90, 90, 91, 90, 90, 90, 89,~
## $ `Best Position`    <chr> "RW", "ST", "GK", "CAM", "LW", "ST", "RW", "GK", "S~
## $ Joined             <chr> "1-Jul-04", "10-Jul-18", "16-Jul-14", "30-Aug-15", ~
## $ `Loan Date End`    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ Value              <chr> "€103.5M", "€63M", "€120M", "€129M", "€132M", "€111~
## $ Wage               <chr> "€560K", "€220K", "€125K", "€370K", "€270K", "€240K~
## $ `Release Clause`   <chr> "€138.4M", "€75.9M", "€159.4M", "€161M", "€166.5M",~
## $ Attacking          <dbl> 429, 437, 95, 407, 408, 423, 392, 114, 408, 118, 31~
## $ Crossing           <dbl> 85, 84, 13, 94, 85, 71, 79, 17, 78, 18, 53, 76, 58,~
## $ Finishing          <dbl> 95, 95, 11, 82, 87, 94, 91, 13, 91, 14, 52, 90, 64,~
## $ `Heading Accuracy` <dbl> 70, 90, 15, 55, 62, 85, 59, 19, 73, 11, 87, 84, 80,~
## $ `Short Passing`    <dbl> 91, 82, 43, 94, 87, 84, 84, 45, 83, 61, 79, 85, 84,~
## $ Volleys            <dbl> 88, 86, 13, 82, 87, 89, 79, 20, 83, 14, 45, 75, 63,~
## $ Skill              <dbl> 470, 414, 109, 441, 448, 407, 406, 138, 394, 144, 3~
## $ Dribbling          <dbl> 96, 88, 12, 88, 95, 85, 90, 27, 92, 21, 70, 91, 69,~
## $ Curve              <dbl> 93, 81, 13, 85, 88, 79, 83, 19, 79, 18, 60, 76, 63,~
## $ `FK Accuracy`      <dbl> 94, 76, 14, 83, 89, 85, 69, 18, 63, 12, 70, 64, 74,~
## $ `Long Passing`     <dbl> 91, 77, 40, 93, 81, 70, 75, 44, 70, 63, 86, 71, 84,~
## $ `Ball Control`     <dbl> 96, 92, 30, 92, 95, 88, 89, 30, 90, 30, 77, 89, 79,~
## $ Movement           <dbl> 451, 431, 307, 398, 453, 407, 460, 268, 458, 254, 3~
## $ Acceleration       <dbl> 91, 87, 43, 77, 94, 77, 94, 56, 96, 38, 72, 95, 60,~
## $ `Sprint Speed`     <dbl> 80, 91, 60, 76, 89, 78, 92, 47, 96, 50, 79, 93, 69,~
## $ Agility            <dbl> 91, 87, 67, 78, 96, 77, 91, 40, 92, 37, 61, 93, 61,~
## $ Reactions          <dbl> 94, 95, 88, 91, 91, 93, 92, 88, 92, 86, 89, 93, 87,~
## $ Balance            <dbl> 95, 71, 49, 76, 83, 82, 91, 37, 82, 43, 53, 86, 66,~
## $ Power              <dbl> 389, 444, 268, 408, 357, 420, 393, 240, 404, 268, 4~
## $ `Shot Power`       <dbl> 86, 94, 59, 91, 80, 89, 80, 64, 86, 66, 81, 84, 88,~
## $ Jumping            <dbl> 68, 95, 78, 63, 62, 84, 69, 52, 77, 79, 90, 86, 87,~
## $ Stamina            <dbl> 72, 84, 41, 89, 81, 76, 85, 32, 86, 35, 75, 88, 90,~
## $ Strength           <dbl> 69, 78, 78, 74, 50, 86, 75, 78, 76, 78, 92, 70, 91,~
## $ `Long Shots`       <dbl> 94, 93, 12, 91, 84, 85, 84, 14, 79, 10, 64, 78, 81,~
## $ Mentality          <dbl> 347, 353, 140, 408, 356, 391, 376, 140, 341, 171, 3~
## $ Aggression         <dbl> 44, 63, 34, 76, 51, 81, 63, 27, 62, 43, 83, 75, 91,~
## $ Interceptions      <dbl> 40, 29, 19, 66, 36, 49, 55, 11, 38, 22, 90, 35, 87,~
## $ Positioning        <dbl> 93, 95, 11, 88, 87, 94, 91, 13, 91, 11, 47, 92, 72,~
## $ Vision             <dbl> 95, 82, 65, 94, 90, 79, 84, 66, 80, 70, 65, 85, 80,~
## $ Penalties          <dbl> 75, 84, 11, 84, 92, 88, 83, 23, 70, 25, 62, 71, 66,~
## $ Composure          <dbl> 96, 95, 68, 91, 93, 88, 90, 65, 84, 70, 90, 84, 84,~
## $ Defending          <dbl> 91, 84, 57, 186, 94, 96, 122, 50, 100, 48, 272, 122~
## $ Marking            <dbl> 32, 28, 27, 68, 35, 35, 38, 15, 34, 25, 93, 42, 84,~
## $ `Standing Tackle`  <dbl> 35, 32, 12, 65, 30, 42, 43, 19, 34, 13, 93, 42, 88,~
## $ `Sliding Tackle`   <dbl> 24, 24, 18, 53, 29, 19, 41, 16, 32, 10, 86, 38, 87,~
## $ Goalkeeping        <dbl> 54, 58, 437, 56, 59, 51, 62, 439, 42, 439, 58, 56, ~
## $ `GK Diving`        <dbl> 6, 7, 87, 15, 9, 15, 14, 86, 13, 88, 13, 10, 13, 84~
## $ `GK Handling`      <dbl> 11, 11, 92, 13, 9, 6, 14, 88, 5, 85, 10, 10, 14, 89~
## $ `GK Kicking`       <dbl> 15, 15, 78, 5, 15, 12, 9, 85, 7, 88, 13, 15, 16, 74~
## $ `GK Positioning`   <dbl> 14, 14, 90, 10, 15, 8, 11, 91, 11, 88, 11, 7, 12, 8~
## $ `GK Reflexes`      <dbl> 8, 11, 90, 13, 11, 10, 14, 89, 6, 90, 11, 14, 12, 8~
## $ `Total Stats`      <dbl> 2231, 2221, 1413, 2304, 2175, 2195, 2211, 1389, 214~
## $ `Base Stats`       <dbl> 466, 464, 489, 485, 451, 457, 470, 490, 466, 484, 4~
## $ `W/F`              <chr> "4 <U+2605>", "4 <U+2605>", "3 <U+2605>", "5 <U+2605>", "5 <U+2605>", "4 <U+2605>", "3 <U+2605>", "3~
## $ SM                 <chr> "4<U+2605>", "5<U+2605>", "1<U+2605>", "4<U+2605>", "5<U+2605>", "4<U+2605>", "4<U+2605>", "1<U+2605>", "5<U+2605>~
## $ `A/W`              <chr> "Medium", "High", "Medium", "High", "High", "High",~
## $ `D/W`              <chr> "Low", "Low", "Medium", "High", "Medium", "Medium",~
## $ IR                 <chr> "5 <U+2605>", "5 <U+2605>", "3 <U+2605>", "4 <U+2605>", "5 <U+2605>", "4 <U+2605>", "3 <U+2605>", "3~
## $ PAC                <dbl> 85, 89, 87, 76, 91, 78, 93, 86, 96, 88, 76, 94, 65,~
## $ SHO                <dbl> 92, 93, 92, 86, 85, 91, 86, 88, 86, 85, 60, 85, 73,~
## $ PAS                <dbl> 91, 81, 78, 93, 86, 78, 81, 85, 78, 88, 71, 80, 76,~
## $ DRI                <dbl> 95, 89, 90, 88, 94, 85, 90, 89, 91, 90, 71, 90, 72,~
## $ DEF                <dbl> 38, 35, 52, 64, 36, 43, 45, 51, 39, 45, 91, 44, 86,~
## $ PHY                <dbl> 65, 77, 90, 78, 59, 82, 75, 91, 76, 88, 86, 76, 91,~
## $ Hits               <chr> "771", "562", "150", "207", "595", "248", "246", "1~

Count total number of rows and columns

nrow(fifa)
## [1] 18979
ncol(fifa)
## [1] 77

There are 18979 rows and 77 columns in the data set

Check for duplicates and missing values

fifa %>% 
  summarize(dups = sum(duplicated(.)))
## # A tibble: 1 x 1
##    dups
##   <int>
## 1     0
#checking on the ID column
sum(duplicated(fifa$ID))
## [1] 0

The data set does not contain any duplicates

#Using skimr to check missing values in the data set
skim(fifa)

The data does not contain missing values

Data Cleaning Tasks

  1. Drop columns that won’t be used in the analysis
  2. Fix the column names
  3. Remove the unnecessary newline characters from all columns that have them
  4. Convert the height and weight columns into numeric
  5. Convert the value, Wage, and release clause columns into numeric, multiply the columns by there respective value, e.g “M” should be multiplied by 1000000
  6. Remove star characters in the W/F, SM, and IR columns
  7. Split the contract column into three separate columns: the year the contract starts, ends, and the type of contract

Data Analysis Tasks

  1. Based on the Joined column, check which players have been playing at one club for more 10 years
  2. Which are the most valuable players, and which players are highly valued but still underpaid
  3. Which positions have the highest average overall ratings (OVA)?
  4. Is there a relationship between age, height, and weight of a player?
  5. What is the distribution of players across different clubs and Which clubs have the highest average overall rating among its players?
  6. How does the length of a player’s contract relate to their market value?

1. Create a new data frame and drop columns that won’t be considered in the analysis

fifa21 <- fifa %>% 
  select(-c(LongName, photoUrl, playerUrl))

View(fifa21)

2. Fix the column names

library(janitor)

fifa21 <- fifa21 %>% 
  clean_names()

3. Remove newline characters from all columns that have them

# Club column
unique(fifa21$club)
fifa21$club <- if_else(str_detect(fifa21$club, "\n{4}"), 
                       str_replace_all(fifa21$club, "\n{4}", ""), fifa21$club)

4. Convert the height and weight columns to numeric

Height Column

unique(fifa21$height)
##  [1] "170cm"  "187cm"  "188cm"  "181cm"  "175cm"  "184cm"  "191cm"  "178cm" 
##  [9] "193cm"  "185cm"  "199cm"  "173cm"  "168cm"  "176cm"  "177cm"  "183cm" 
## [17] "180cm"  "189cm"  "179cm"  "195cm"  "172cm"  "182cm"  "186cm"  "192cm" 
## [25] "165cm"  "194cm"  "167cm"  "196cm"  "163cm"  "190cm"  "174cm"  "169cm" 
## [33] "171cm"  "197cm"  "200cm"  "166cm"  "6'2\""  "164cm"  "198cm"  "6'3\"" 
## [41] "6'5\""  "5'11\"" "6'4\""  "6'1\""  "6'0\""  "5'10\"" "5'9\""  "5'6\"" 
## [49] "5'7\""  "5'4\""  "201cm"  "158cm"  "162cm"  "161cm"  "160cm"  "203cm" 
## [57] "157cm"  "156cm"  "202cm"  "159cm"  "206cm"  "155cm"

First strip the string “cm” then Write a function that removes the non numeric characters, then converts the values to one unit

fifa21$height <- str_replace_all(fifa21$height, "cm", "")

inches_to_cm <- function(height_cm){
  if (str_detect(height_cm, "'")) {
    height_cm = str_replace_all(height_cm, c("'"=".", "\""=""))
    cm = round(as.numeric(height_cm) * 30.48, 0)
    print(cm)
  }
  else{
    print(as.numeric(height_cm))
  }
}

#Use the sapply function 

fifa21$height <- sapply(fifa21$height, inches_to_cm)
unique(fifa21$height)
##  [1] 170 187 188 181 175 184 191 178 193 185 199 173 168 176 177 183 180 189 179
## [20] 195 172 182 186 192 165 194 167 196 163 190 174 169 171 197 200 166 164 198
## [39] 156 155 201 158 162 161 160 203 157 202 159 206

Weight column

unique(fifa21$weight)
##  [1] "72kg"   "83kg"   "87kg"   "70kg"   "68kg"   "80kg"   "71kg"   "91kg"  
##  [9] "73kg"   "85kg"   "92kg"   "69kg"   "84kg"   "96kg"   "81kg"   "82kg"  
## [17] "75kg"   "86kg"   "89kg"   "74kg"   "76kg"   "64kg"   "78kg"   "90kg"  
## [25] "66kg"   "60kg"   "94kg"   "79kg"   "67kg"   "65kg"   "59kg"   "61kg"  
## [33] "93kg"   "88kg"   "97kg"   "77kg"   "62kg"   "63kg"   "95kg"   "100kg" 
## [41] "58kg"   "183lbs" "179lbs" "172lbs" "196lbs" "176lbs" "185lbs" "170lbs"
## [49] "203lbs" "168lbs" "161lbs" "146lbs" "130lbs" "190lbs" "174lbs" "148lbs"
## [57] "165lbs" "159lbs" "192lbs" "181lbs" "139lbs" "154lbs" "157lbs" "163lbs"
## [65] "98kg"   "103kg"  "99kg"   "102kg"  "56kg"   "101kg"  "57kg"   "55kg"  
## [73] "104kg"  "107kg"  "110kg"  "53kg"   "50kg"   "54kg"   "52kg"

Strip the “kg” string then convert lbs to a common unit

fifa21$weight <- str_replace_all(fifa21$weight, "kg", "")

lbs_to_kg <- function(weight_kg){
  if (str_detect(weight_kg, "lbs")) {
    weight_kg = str_replace_all(weight_kg, "lbs", "")
    kgs = round(as.numeric(weight_kg) * 0.454, 0)
    print(kgs)
  }
  else {
    print(as.numeric(weight_kg))
  }
}

fifa21$weight <- sapply(fifa21$weight, lbs_to_kg)
unique(fifa21$weight)
##  [1]  72  83  87  70  68  80  71  91  73  85  92  69  84  96  81  82  75  86  89
## [20]  74  76  64  78  90  66  60  94  79  67  65  59  61  93  88  97  77  62  63
## [39]  95 100  58  98 103  99 102  56 101  57  55 104 107 110  53  50  54  52

5. Convert the value, wage, and release clause columns to numeric

unique(fifa21$value)
unique(fifa21$wage)
unique(fifa21$release_clause)
#First drop the euro sign
fifa21 <- fifa21 %>% 
  mutate(
    across(
      c(value, wage, release_clause),
      .fns = ~str_replace_all(.x, "€", "")
    )
  )


value_mil <- function(value_x){
  if (str_detect(value_x, "M")) {
    value_x = str_replace_all(value_x, "M", "")
    value_x = round(as.numeric(value_x) * 1000000, 0)
    print(value_x)
  }
  else if (str_detect(value_x, "K")) {
    value_x = str_replace_all(value_x, "K", "")
    value_x = round(as.numeric(value_x) * 1000, 0)
    print(value_x)
  }
  else {
    print(as.numeric(value_x))
  }
}

fifa21$value <- sapply(fifa21$value, value_mil)

fifa21$wage <- sapply(fifa21$wage, value_mil)

fifa21$release_clause <- sapply(fifa21$release_clause, value_mil)

6. Remove star characters in the W/F, SM, and IR columns

unique(fifa21$w_f)
## [1] "4 <U+2605>" "3 <U+2605>" "5 <U+2605>" "2 <U+2605>" "1 <U+2605>"
unique(fifa21$sm)
## [1] "4<U+2605>" "5<U+2605>" "1<U+2605>" "2<U+2605>" "3<U+2605>"
unique(fifa21$ir)
## [1] "5 <U+2605>" "3 <U+2605>" "4 <U+2605>" "2 <U+2605>" "1 <U+2605>"
star_fn <- function(star_char){
  if (str_detect(star_char, " ★|★")){
    star_char = str_replace_all(star_char, " ★|★", "")
    print(as.numeric(star_char))
  }}


fifa21$w_f <-  sapply(fifa21$w_f, star_fn)

fifa21$sm <- sapply(fifa21$sm, star_fn)

fifa21$ir <- sapply(fifa21$ir, star_fn)

7. Split the contract column into three separate columns: the year the contract starts, ends, and the type of contract

unique(fifa21$contract)
##   [1] "2004 ~ 2021"          "2018 ~ 2022"          "2014 ~ 2023"         
##   [4] "2015 ~ 2023"          "2017 ~ 2022"          "2017 ~ 2023"         
##   [7] "2018 ~ 2024"          "2014 ~ 2022"          "2018 ~ 2023"         
##  [10] "2016 ~ 2023"          "2013 ~ 2023"          "2011 ~ 2023"         
##  [13] "2009 ~ 2022"          "2005 ~ 2021"          "2011 ~ 2021"         
##  [16] "2015 ~ 2022"          "2017 ~ 2024"          "2010 ~ 2024"         
##  [19] "2012 ~ 2021"          "2019 ~ 2024"          "2015 ~ 2024"         
##  [22] "2017 ~ 2025"          "2020 ~ 2025"          "2019 ~ 2023"         
##  [25] "2008 ~ 2023"          "2015 ~ 2021"          "2020 ~ 2022"         
##  [28] "2012 ~ 2022"          "2016 ~ 2025"          "2013 ~ 2022"         
##  [31] "2011 ~ 2022"          "2012 ~ 2024"          "2016 ~ 2021"         
##  [34] "2012 ~ 2023"          "2008 ~ 2022"          "2019 ~ 2022"         
##  [37] "2017 ~ 2021"          "2013 ~ 2024"          "2020 ~ 2024"         
##  [40] "2010 ~ 2022"          "2020 ~ 2021"          "2011 ~ 2024"         
##  [43] "2020 ~ 2023"          "2014 ~ 2024"          "2013 ~ 2026"         
##  [46] "2016 ~ 2022"          "2010 ~ 2021"          "2013 ~ 2021"         
##  [49] "2019 ~ 2025"          "2018 ~ 2025"          "2016 ~ 2024"         
##  [52] "2018 ~ 2021"          "2009 ~ 2024"          "2007 ~ 2022"         
##  [55] "Jun 30, 2021 On Loan" "2009 ~ 2021"          "2019 ~ 2021"         
##  [58] "2019 ~ 2026"          "Free"                 "2012 ~ 2028"         
##  [61] "2010 ~ 2023"          "2014 ~ 2021"          "2015 ~ 2025"         
##  [64] "2014 ~ 2026"          "2012 ~ 2025"          "2017 ~ 2020"         
##  [67] "2002 ~ 2022"          "2020 ~ 2027"          "2013 ~ 2025"         
##  [70] "Dec 31, 2020 On Loan" "2019 ~ 2020"          "2011 ~ 2025"         
##  [73] "2016 ~ 2020"          "2007 ~ 2021"          "2020 ~ 2026"         
##  [76] "2010 ~ 2025"          "2009 ~ 2023"          "2008 ~ 2021"         
##  [79] "2020 ~ 2020"          "2016 ~ 2026"          "Jan 30, 2021 On Loan"
##  [82] "2012 ~ 2020"          "2014 ~ 2025"          "Jun 30, 2022 On Loan"
##  [85] "2015 ~ 2020"          "May 31, 2021 On Loan" "2018 ~ 2020"         
##  [88] "2014 ~ 2020"          "2013 ~ 2020"          "2006 ~ 2024"         
##  [91] "Jul 5, 2021 On Loan"  "Dec 31, 2021 On Loan" "2004 ~ 2025"         
##  [94] "2011 ~ 2020"          "Jul 1, 2021 On Loan"  "Jan 1, 2021 On Loan" 
##  [97] "2006 ~ 2023"          "Aug 31, 2021 On Loan" "2006 ~ 2021"         
## [100] "2005 ~ 2023"          "2003 ~ 2020"          "2009 ~ 2020"         
## [103] "2002 ~ 2020"          "2005 ~ 2020"          "2005 ~ 2022"         
## [106] "Jan 31, 2021 On Loan" "2010 ~ 2020"          "Dec 30, 2021 On Loan"
## [109] "2008 ~ 2020"          "2007 ~ 2020"          "2003 ~ 2021"         
## [112] "Jun 23, 2021 On Loan" "Jan 3, 2021 On Loan"  "Nov 27, 2021 On Loan"
## [115] "2002 ~ 2021"          "Jan 17, 2021 On Loan" "Jun 30, 2023 On Loan"
## [118] "1998 ~ 2021"          "2003 ~ 2022"          "2007 ~ 2023"         
## [121] "Jul 31, 2021 On Loan" "Nov 22, 2020 On Loan" "May 31, 2022 On Loan"
## [124] "2006 ~ 2020"          "Dec 30, 2020 On Loan" "2007 ~ 2025"         
## [127] "Jan 4, 2021 On Loan"  "Nov 30, 2020 On Loan" "2004 ~ 2020"         
## [130] "2009 ~ 2025"          "Aug 1, 2021 On Loan"

Create a function that checks whether the rows in the contract column fall under (contract, free, loan) contract types

contract_type <- function(type){
  if (str_detect(type, "~")) {
    print("Contract")
  }
  else if (str_detect(type, "Free")){
    print("Free")
  }
  else if (str_detect(type, "On Loan")){
    print("On Loan")
  }
  else {
    print(type)
  }
}


#Use mutate to create a new column that holds the type of contract
#Then use separate to split contract column into start year and end year
fifa21 <- fifa21 %>% 
  mutate(Type_of_contract = sapply(fifa21$contract, contract_type)) %>% 
  separate(contract, into = c("Start_year", "End_year"), sep = "~")


#After separating the contract column, it appears that the Start_year column contains date values ("On Loan" rows) which should belong in the End_year column, thus in this section, we are going to populate the end_year column with this values.

unique(fifa21$Start_year)

fifa21$End_year <- if_else(is.na(fifa21$End_year), fifa21$Start_year, fifa21$End_year)

#Then drop the 'On Loan" and "Free" rows from start_year column
fifa21$Start_year <- if_else(str_detect(fifa21$Start_year, "On Loan|Free"), NA, fifa21$Start_year)



#Populate the NA values in start_year with the right year, refer to the joined column
unique(fifa21$joined)
class(fifa21$joined)

fifa21 <- fifa21 %>% 
  mutate(joined = as_date(joined, format = "%d-%b-%y"),
         YearJoined = as.character(year(joined)),
         Start_year = if_else(is.na(Start_year), YearJoined, Start_year))



#Finally, create a function that replaces "On Loan" string in the end year column and retrieves the date only
unique(fifa21$End_year)


 end_year_fixed <- function(loan_date){
  if (str_detect(loan_date, "On Loan")) {
    loan_date = str_replace_all(loan_date, "On Loan", "")
    loan_date = year(as_date(loan_date, format = "%b-%d-%y"))
    print(loan_date)
  }
  else{
    print(str_trim(loan_date))
  }
}

fifa21$End_year <- sapply(fifa21$End_year, end_year_fixed)



fifa21 <- fifa21 %>% 
  mutate(Loan_Date_End = as_date(joined, format = "%d-%b-%y"),
         Loan_End = as.character(year(joined)),
         End_year = if_else(is.na(End_year), Loan_End, End_year))

 
 
#Trim whitespaces from start year column
fifa21$Start_year <- str_trim(fifa21$Start_year)

Confirm the results

unique(fifa21$Type_of_contract)
## [1] "Contract" "On Loan"  "Free"
unique(fifa21$Start_year)
##  [1] "2004" "2018" "2014" "2015" "2017" "2016" "2013" "2011" "2009" "2005"
## [11] "2010" "2012" "2019" "2020" "2008" "2007" "2002" "2006" "2003" "1998"
unique(fifa21$End_year)
##  [1] "2021" "2022" "2023" "2024" "2025" "2026" "2013" "2010" "2015" "Free"
## [11] "2019" "2018" "2028" "2017" "2016" "2020" "2027" "2014" "2012" "2011"

Data Analysis / Visualization

1. Based on the ‘Joined’ column, check which players have been playing at a club for more than 10 years!

class(fifa21$joined)
## [1] "Date"
fifa21 %>% 
  mutate(duration = as.numeric(difftime(today(), joined, units = "days")) %/% 365.25) %>% 
  filter(duration >= 10) %>% 
  arrange(desc(duration)) %>% 
  select(name, nationality, club, duration) %>% 
  slice_max(duration, n = 10)
## # A tibble: 12 x 4
##    name          nationality         club                duration
##    <chr>         <chr>               <chr>                  <dbl>
##  1 H. Sogahata   Japan               Kashima Antlers           26
##  2 I. Akinfeev   Russia              PFC CSKA Moscow           22
##  3 Kim Kwang Suk Korea Republic      Pohang Steelers           22
##  4 K. Nakamura   Japan               Kawasaki Frontale         21
##  5 D. Lewington  England             Milton Keynes Dons        21
##  6 H. Otani      Japan               Kashiwa Reysol            21
##  7 M. McNulty    Republic of Ireland Cork City                 21
##  8 P. Karlsson   Sweden              AIK                       20
##  9 T. Aoyama     Japan               Sanfrecce Hiroshima       20
## 10 M. Bloomfield England             Wycombe Wanderers         20
## 11 D. Zibung     Switzerland         FC Luzern                 20
## 12 Yang Qipeng   China PR            Tianjin TEDA FC           20

759 players have been playing at a club for more than 10 years. Above are the top 10 players who have stayed at a club for the longest time.

2. Which players are highly valuable but still underpaid (on low wages)? (hint: scatter plot between wage and value

library(ggrepel)
library(PerformanceAnalytics)

#Check the data types of Wage and Value
class(fifa21$value)
## [1] "numeric"
class(fifa21$wage)
## [1] "numeric"
#The correlation chart below shows that there exists a strong positive relationship of (0.81) between the Value and wage of a player

fifa21 %>% 
  select(value, wage, release_clause) %>% 
  chart.Correlation()

#Use histograms to check the distribution of Wage and Value

#Majority of the players earn less than 25,000

fifa21 %>% 
ggplot(aes(wage)) +
  geom_histogram() +
  scale_x_continuous(breaks = seq(0, 500000, 100000),
                     labels = c(0, "100K", "200K", "300K", "400K", "500K"))

fifa21 %>% 
  ggplot(aes(value)) +
  geom_histogram() +
  scale_x_continuous(breaks = seq(0, 200000000, 50000000),
                     labels = c(0, "50M", "100M", "150M", "200M"))

#Scatter plot that shows the relationship between value and wage of players. It displays names of players who are highly valyed but still earn less

fifa21 %>% 
filter(wage <= 60000 & value >= 50000000) %>% 
ggplot(aes(value, wage, label = name)) +
  geom_point(aes(color = best_position), position = "jitter") + 
  geom_text_repel(hjust = 1,
                  nudge_y = 4,
                  direction = "x") +
  scale_y_continuous(breaks = seq(0, 500000, 50000),
                     labels = c(0, "50K", "100K", "150K", "200K", "250K", "300K", "350K", "400K", "450K", "500K")) +
  scale_x_continuous(breaks = seq(0, 200000000, 25000000),
                     labels = c(0,"25M", "50M", "75M", "100M", "125M", "150M", "175M", "200M")) +
  labs(
    title = "Underpaid Players who are Highly Valuable",
    subtitle = "(G. Donnarumma is one of the highest valued players but\n still gets underpaid)"
  ) +
  theme_classic() +
  theme(legend.position = "none",
        title = element_text(size = 14, face = "bold"),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12, face = "bold"),
        plot.subtitle = element_text(colour = "gray30", face = "italic", size = 11))

3. Who are the most valuable players?

library(kableExtra)

Most_Valuable <- fifa21 %>% 
  filter(value > 100000000) %>% 
  select(name, club, value) %>% 
  arrange(desc(value))

Most_Valuable %>% 
  kable(digits = 0, format = "html", 
        caption = "Mbappe and Neymar Jr were the most valuable players") %>% 
  kable_classic("striped", "bordered", full_width = FALSE, html_font = "cambria",
                position = "left",
                fixed_thead = T) %>%
  row_spec(0, bold = TRUE) %>%
  column_spec(1, bold = TRUE, color = "black", background = ("#F0E68C")) %>% 
  column_spec(2, color = "black", background = "lightblue")
Mbappe and Neymar Jr were the most valuable players
name club value
K. Mbappé Paris Saint-Germain 185500000
Neymar Jr Paris Saint-Germain 132000000
K. De Bruyne Manchester City 129000000
J. Sancho Borussia Dortmund 124000000
K. Havertz Chelsea 121000000
M. Salah Liverpool 120500000
S. Mané Liverpool 120500000
J. Oblak Atlético Madrid 120000000
R. Sterling Manchester City 114500000
T. Alexander-Arnold Liverpool 114000000
V. van Dijk Liverpool 113000000
R. Lewandowski FC Bayern München 111000000
M. ter Stegen FC Barcelona 110000000
P. Dybala Juventus 109000000
H. Kane Tottenham Hotspur 109000000
Bruno Fernandes Manchester United 105500000
L. Messi FC Barcelona 103500000
J. Kimmich FC Bayern München 103000000
Alisson Liverpool 102000000

4. Is there a significant relationship between the weight, height, and age of a player? (Does the height and age of a player have a significant impact on the weight?)

#Create a new data frame to hold the variables of interest
df <- fifa21 %>% 
  select(weight, height, age)


#Scatter plot to check relationships between weight and height, weight and age respectively
ggplot(df, aes(height, weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

ggplot(df, aes(age, weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

#Using a correlation matrix

df %>% 
  select_if(is.numeric) %>% 
  chart.Correlation()

#From above scatter plots, there is a strong positive linear relationship between weight and height, where as there is a weak positive relationship between Age and weight. This can also be seen in the correlation matrix. (correlation between age and weight = 0.24).
#Thus to perform multiple linear regression, we convert Age into a categorical variable

df <- df %>% 
   mutate(AgeCat = cut(age, breaks = c(0, 20, 25, 30, 35, 60),
                      labels = c("A", "B", "C", "D", "E")))
View(df)


model1 <- lm(weight ~ height + AgeCat, data = df)
plot(model1)

#Checking assumptions of multiple linear regression

#The residuals vs Fitted plot indicates that there exists an approximately linear relationship since the line is fairly horizontal
#The Normal Q-Q plot shows that the residuals from the model are approximately normally distributed with some slight deviations on both ends of the dashed line
#The Scale-Location plot has a fairly horizontal line, this shows that there is homoscedasticity in the model since the points are almost equally spread from the regression line across the observations
#The fairly horizontal line in the residuals vs Leverage plot indicates that there may be some few outliers in the data

summary(model1)
## 
## Call:
## lm(formula = weight ~ height + AgeCat, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.306  -2.811   0.026   2.744  34.600 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -67.756417   0.837064  -80.94   <2e-16 ***
## height        0.777858   0.004628  168.09   <2e-16 ***
## AgeCatB       1.330939   0.090435   14.72   <2e-16 ***
## AgeCatC       2.714874   0.093601   29.00   <2e-16 ***
## AgeCatD       3.415417   0.115214   29.64   <2e-16 ***
## AgeCatE       4.698167   0.244299   19.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.359 on 18973 degrees of freedom
## Multiple R-squared:  0.6204, Adjusted R-squared:  0.6203 
## F-statistic:  6201 on 5 and 18973 DF,  p-value: < 2.2e-16
# Scatter plot representing the model
ggplot(df, aes(height, weight, color = AgeCat)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Interpretation of Above Model

From the scatter plot, we can notice that Height has an effect on weight, as height increases by one unit we expect the weight to increase by 0.778. This increase is assumed to be the same for all the age categories. The Age category also has an effect on weight, for players in age category A the weight increases by 0.778, in age category B weight increases by 1.33, in age category C weight increases by 2.71, in age category D weight increases by 3.41, while in age category E weight increases by 4.69. The effect is assumed to be the same for all heights.

The model has a residual standard error of 4.359, and a Multiple R-squared value of 0.6204. This indicates that approximately 62.04% of the variability in weight can be explained by the model. The F-statistic also has a p-value of < 0.05 indicating that the model is statistically significant.

5. Which positions have the highest average overall ratings (OVA)

unique(fifa21$best_position)
##  [1] "RW"  "ST"  "GK"  "CAM" "LW"  "CB"  "CDM" "CF"  "CM"  "RB"  "LB"  "LM" 
## [13] "RM"  "LWB" "RWB"
ggplot(fifa21, aes(fct_rev(fct_reorder(best_position, ova)), ova)) +
  geom_boxplot(aes(fill = best_position)) +
  labs(x = "Position",
       y = "Overall average rating",
       title = "Positions with the highest overall average ratings",
       subtitle = "(CF players have the highest average ratings)") +
  theme_classic() +
  theme(plot.title = element_text(size = 14, face = "bold"),
        plot.subtitle = element_text(size = 10, colour = "gray50"),
        axis.title = element_text(size = 12, colour = "black", face = "bold"),
        legend.position = "none")

Central Forward (CF) has the highest median overall rating across all positions, while Goal Keepers (GK) has the least ratings.

6. What is the distribution of players across different clubs and Which clubs have the highest average overall rating among its players?

#Distribution of players across different clubs in percentages

fifa21 %>%
  group_by(club) %>%
  summarize(n = n()) %>%
  group_by(n) %>%
  summarize(count = n()) %>%
  mutate(percentage = count / sum(count) * 100) %>%
  arrange(desc(count)) %>% 
  slice_max(percentage, n = 5)
## # A tibble: 5 x 3
##       n count percentage
##   <int> <int>      <dbl>
## 1    28   152      22.3 
## 2    30   119      17.4 
## 3    27   104      15.2 
## 4    26    64       9.38
## 5    25    58       8.50
# clubs with highest rated players

highest_ratings <- fifa21 %>% 
  group_by(club) %>% 
  summarize(avg_ratings = mean(ova),
            median_ratings = median(ova)) %>% 
  arrange(desc(avg_ratings)) %>% 
  slice_max(avg_ratings, n=10)

highest_ratings %>% 
  kable(digits = 0, format = "html", 
        caption = "Clubs with highest average overall rating of players") %>% 
  kable_classic("striped", "bordered", full_width = FALSE, html_font = "cambria",
                position = "left",
                fixed_thead = T) %>%
  row_spec(0, bold = TRUE) %>%
  column_spec(1, bold = TRUE, color = "black", background = ("#F0E68C")) %>% 
  column_spec(2, color = "black", background = "lightblue")
Clubs with highest average overall rating of players
club avg_ratings median_ratings
FC Bayern München 81 82
Real Madrid 80 81
Inter 79 80
Napoli 79 79
Chelsea 78 79
Manchester United 78 79
Juventus 78 80
Paris Saint-Germain 78 79
Liverpool 77 80
Sevilla FC 77 78

From the data set, 22.28% of the clubs have 28 players each, 17.45% have 30 players, while surprisingly, there are clubs with less than 22 players. Approximately 15 clubs have 22 players and below.

FC Bayern Munchen, Real Madrid, Inter, and Napoli are the teams with the most highly rated players, while Finn Harps and Waerford FC have the lowest rated players.

7. How does the length of a player’s contract relate to their market value?

class(fifa21$Start_year)
## [1] "character"
class(fifa21$End_year)
## [1] "character"
#Convert above variables to numeric
fifa21$Start_year <- as.numeric(fifa21$Start_year)

fifa21$End_year <- if_else(is.na(fifa21$End_year), NA, as.numeric(fifa21$End_year))

mkt_val <- fifa21 %>% 
  mutate(contract_length = End_year - Start_year) 


#Hypotheses Testing

#Null Hypothesis: There is no relationship between a player's contract length and their market value
#Alternative Hypothesis: There is a relationship between a player's contract length and their market value

cor.test(mkt_val$contract_length, mkt_val$value, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  mkt_val$contract_length and mkt_val$value
## S = 8.631e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##      rho 
## 0.213385

There appears to be a weak positive relationship between a player’s contract length and their market value. The Spearman’s rank test yields a correlation coefficient of 0.213385, and a p-value < 0.05, thus we reject the null hypothesis and conclude that there is a relationship between a player’s contract length and their market value. As the contract length increases, the market value is also expected to increase.