The objective of this project is to perform data cleaning and transformation on the raw, messy fifa21 data set, and then to analyze the data to derive insights such as who are the most valuable players, which players have stayed at a club for the longest time, the most skilled players, and if there’s a relationship between age, weight, and height of a player.
library(tidyverse)
library(skimr)
Import the data set
fifa <- read_csv("F:\\Tutorials\\R tutorials\\Datasets\\fifa21 raw data v2.csv")
View(fifa)
head(fifa)
## # A tibble: 6 x 77
## ID Name LongName photoUrl playerUrl Nationality Age `<U+2193>OVA` POT Club
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 158023 L. Me~ Lionel ~ https:/~ http://s~ Argentina 33 93 93 "\n\~
## 2 20801 Crist~ C. Rona~ https:/~ http://s~ Portugal 35 92 92 "\n\~
## 3 200389 J. Ob~ Jan Obl~ https:/~ http://s~ Slovenia 27 91 93 "\n\~
## 4 192985 K. De~ Kevin D~ https:/~ http://s~ Belgium 29 91 91 "\n\~
## 5 190871 Neyma~ Neymar ~ https:/~ http://s~ Brazil 28 91 91 "\n\~
## 6 188545 R. Le~ Robert ~ https:/~ http://s~ Poland 31 91 91 "\n\~
## # i 67 more variables: Contract <chr>, Positions <chr>, Height <chr>,
## # Weight <chr>, `Preferred Foot` <chr>, BOV <dbl>, `Best Position` <chr>,
## # Joined <chr>, `Loan Date End` <chr>, Value <chr>, Wage <chr>,
## # `Release Clause` <chr>, Attacking <dbl>, Crossing <dbl>, Finishing <dbl>,
## # `Heading Accuracy` <dbl>, `Short Passing` <dbl>, Volleys <dbl>,
## # Skill <dbl>, Dribbling <dbl>, Curve <dbl>, `FK Accuracy` <dbl>,
## # `Long Passing` <dbl>, `Ball Control` <dbl>, Movement <dbl>, ...
Understanding the variables in the data set
This data set contains a unique player ID, their general information such as: Name, Club, Nationality, Age, Height, Weight, and positions, and also football attributes such as Attacking, Skills, Movement, Power, and Mentality
Attacking = Crossing + Finishing + Heading Accuracy + Short Passing + Volleys
Skill = Dribbling + Curve + FK Accuracy + Long Passing + Ball Control
Movement = Acceleration + Sprint speed + Agility + Reactions + Balance
Power = Shot Power + Jumping + Stamina + Strength + Long Shots
Mentality = Aggression + Interceptions + Positioning + Vision + Penalties
Goal Keeping = GK Diving + GK Handling + GK Kicking + GK Positioning + GK Reflexes
Defending = Marking + Standing Tackle + Sliding Tackle
OVA = Overall Average of the player
Exploring the data set first
Use glimpse to check the structure and data types
glimpse(fifa)
## Rows: 18,979
## Columns: 77
## $ ID <dbl> 158023, 20801, 200389, 192985, 190871, 188545, 2093~
## $ Name <chr> "L. Messi", "Cristiano Ronaldo", "J. Oblak", "K. De~
## $ LongName <chr> "Lionel Messi", "C. Ronaldo dos Santos Aveiro", "Ja~
## $ photoUrl <chr> "https://cdn.sofifa.com/players/158/023/21_60.png",~
## $ playerUrl <chr> "http://sofifa.com/player/158023/lionel-messi/21000~
## $ Nationality <chr> "Argentina", "Portugal", "Slovenia", "Belgium", "Br~
## $ Age <dbl> 33, 35, 27, 29, 28, 31, 28, 27, 21, 28, 28, 28, 28,~
## $ `<U+2193>OVA` <dbl> 93, 92, 91, 91, 91, 91, 90, 90, 90, 90, 90, 90, 89,~
## $ POT <dbl> 93, 92, 93, 91, 91, 91, 90, 91, 95, 93, 91, 90, 89,~
## $ Club <chr> "\n\n\n\nFC Barcelona", "\n\n\n\nJuventus", "\n\n\n~
## $ Contract <chr> "2004 ~ 2021", "2018 ~ 2022", "2014 ~ 2023", "2015 ~
## $ Positions <chr> "RW, ST, CF", "ST, LW", "GK", "CAM, CM", "LW, CAM",~
## $ Height <chr> "170cm", "187cm", "188cm", "181cm", "175cm", "184cm~
## $ Weight <chr> "72kg", "83kg", "87kg", "70kg", "68kg", "80kg", "71~
## $ `Preferred Foot` <chr> "Left", "Right", "Right", "Right", "Right", "Right"~
## $ BOV <dbl> 93, 92, 91, 91, 91, 91, 90, 90, 91, 90, 90, 90, 89,~
## $ `Best Position` <chr> "RW", "ST", "GK", "CAM", "LW", "ST", "RW", "GK", "S~
## $ Joined <chr> "1-Jul-04", "10-Jul-18", "16-Jul-14", "30-Aug-15", ~
## $ `Loan Date End` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ Value <chr> "€103.5M", "€63M", "€120M", "€129M", "€132M", "€111~
## $ Wage <chr> "€560K", "€220K", "€125K", "€370K", "€270K", "€240K~
## $ `Release Clause` <chr> "€138.4M", "€75.9M", "€159.4M", "€161M", "€166.5M",~
## $ Attacking <dbl> 429, 437, 95, 407, 408, 423, 392, 114, 408, 118, 31~
## $ Crossing <dbl> 85, 84, 13, 94, 85, 71, 79, 17, 78, 18, 53, 76, 58,~
## $ Finishing <dbl> 95, 95, 11, 82, 87, 94, 91, 13, 91, 14, 52, 90, 64,~
## $ `Heading Accuracy` <dbl> 70, 90, 15, 55, 62, 85, 59, 19, 73, 11, 87, 84, 80,~
## $ `Short Passing` <dbl> 91, 82, 43, 94, 87, 84, 84, 45, 83, 61, 79, 85, 84,~
## $ Volleys <dbl> 88, 86, 13, 82, 87, 89, 79, 20, 83, 14, 45, 75, 63,~
## $ Skill <dbl> 470, 414, 109, 441, 448, 407, 406, 138, 394, 144, 3~
## $ Dribbling <dbl> 96, 88, 12, 88, 95, 85, 90, 27, 92, 21, 70, 91, 69,~
## $ Curve <dbl> 93, 81, 13, 85, 88, 79, 83, 19, 79, 18, 60, 76, 63,~
## $ `FK Accuracy` <dbl> 94, 76, 14, 83, 89, 85, 69, 18, 63, 12, 70, 64, 74,~
## $ `Long Passing` <dbl> 91, 77, 40, 93, 81, 70, 75, 44, 70, 63, 86, 71, 84,~
## $ `Ball Control` <dbl> 96, 92, 30, 92, 95, 88, 89, 30, 90, 30, 77, 89, 79,~
## $ Movement <dbl> 451, 431, 307, 398, 453, 407, 460, 268, 458, 254, 3~
## $ Acceleration <dbl> 91, 87, 43, 77, 94, 77, 94, 56, 96, 38, 72, 95, 60,~
## $ `Sprint Speed` <dbl> 80, 91, 60, 76, 89, 78, 92, 47, 96, 50, 79, 93, 69,~
## $ Agility <dbl> 91, 87, 67, 78, 96, 77, 91, 40, 92, 37, 61, 93, 61,~
## $ Reactions <dbl> 94, 95, 88, 91, 91, 93, 92, 88, 92, 86, 89, 93, 87,~
## $ Balance <dbl> 95, 71, 49, 76, 83, 82, 91, 37, 82, 43, 53, 86, 66,~
## $ Power <dbl> 389, 444, 268, 408, 357, 420, 393, 240, 404, 268, 4~
## $ `Shot Power` <dbl> 86, 94, 59, 91, 80, 89, 80, 64, 86, 66, 81, 84, 88,~
## $ Jumping <dbl> 68, 95, 78, 63, 62, 84, 69, 52, 77, 79, 90, 86, 87,~
## $ Stamina <dbl> 72, 84, 41, 89, 81, 76, 85, 32, 86, 35, 75, 88, 90,~
## $ Strength <dbl> 69, 78, 78, 74, 50, 86, 75, 78, 76, 78, 92, 70, 91,~
## $ `Long Shots` <dbl> 94, 93, 12, 91, 84, 85, 84, 14, 79, 10, 64, 78, 81,~
## $ Mentality <dbl> 347, 353, 140, 408, 356, 391, 376, 140, 341, 171, 3~
## $ Aggression <dbl> 44, 63, 34, 76, 51, 81, 63, 27, 62, 43, 83, 75, 91,~
## $ Interceptions <dbl> 40, 29, 19, 66, 36, 49, 55, 11, 38, 22, 90, 35, 87,~
## $ Positioning <dbl> 93, 95, 11, 88, 87, 94, 91, 13, 91, 11, 47, 92, 72,~
## $ Vision <dbl> 95, 82, 65, 94, 90, 79, 84, 66, 80, 70, 65, 85, 80,~
## $ Penalties <dbl> 75, 84, 11, 84, 92, 88, 83, 23, 70, 25, 62, 71, 66,~
## $ Composure <dbl> 96, 95, 68, 91, 93, 88, 90, 65, 84, 70, 90, 84, 84,~
## $ Defending <dbl> 91, 84, 57, 186, 94, 96, 122, 50, 100, 48, 272, 122~
## $ Marking <dbl> 32, 28, 27, 68, 35, 35, 38, 15, 34, 25, 93, 42, 84,~
## $ `Standing Tackle` <dbl> 35, 32, 12, 65, 30, 42, 43, 19, 34, 13, 93, 42, 88,~
## $ `Sliding Tackle` <dbl> 24, 24, 18, 53, 29, 19, 41, 16, 32, 10, 86, 38, 87,~
## $ Goalkeeping <dbl> 54, 58, 437, 56, 59, 51, 62, 439, 42, 439, 58, 56, ~
## $ `GK Diving` <dbl> 6, 7, 87, 15, 9, 15, 14, 86, 13, 88, 13, 10, 13, 84~
## $ `GK Handling` <dbl> 11, 11, 92, 13, 9, 6, 14, 88, 5, 85, 10, 10, 14, 89~
## $ `GK Kicking` <dbl> 15, 15, 78, 5, 15, 12, 9, 85, 7, 88, 13, 15, 16, 74~
## $ `GK Positioning` <dbl> 14, 14, 90, 10, 15, 8, 11, 91, 11, 88, 11, 7, 12, 8~
## $ `GK Reflexes` <dbl> 8, 11, 90, 13, 11, 10, 14, 89, 6, 90, 11, 14, 12, 8~
## $ `Total Stats` <dbl> 2231, 2221, 1413, 2304, 2175, 2195, 2211, 1389, 214~
## $ `Base Stats` <dbl> 466, 464, 489, 485, 451, 457, 470, 490, 466, 484, 4~
## $ `W/F` <chr> "4 <U+2605>", "4 <U+2605>", "3 <U+2605>", "5 <U+2605>", "5 <U+2605>", "4 <U+2605>", "3 <U+2605>", "3~
## $ SM <chr> "4<U+2605>", "5<U+2605>", "1<U+2605>", "4<U+2605>", "5<U+2605>", "4<U+2605>", "4<U+2605>", "1<U+2605>", "5<U+2605>~
## $ `A/W` <chr> "Medium", "High", "Medium", "High", "High", "High",~
## $ `D/W` <chr> "Low", "Low", "Medium", "High", "Medium", "Medium",~
## $ IR <chr> "5 <U+2605>", "5 <U+2605>", "3 <U+2605>", "4 <U+2605>", "5 <U+2605>", "4 <U+2605>", "3 <U+2605>", "3~
## $ PAC <dbl> 85, 89, 87, 76, 91, 78, 93, 86, 96, 88, 76, 94, 65,~
## $ SHO <dbl> 92, 93, 92, 86, 85, 91, 86, 88, 86, 85, 60, 85, 73,~
## $ PAS <dbl> 91, 81, 78, 93, 86, 78, 81, 85, 78, 88, 71, 80, 76,~
## $ DRI <dbl> 95, 89, 90, 88, 94, 85, 90, 89, 91, 90, 71, 90, 72,~
## $ DEF <dbl> 38, 35, 52, 64, 36, 43, 45, 51, 39, 45, 91, 44, 86,~
## $ PHY <dbl> 65, 77, 90, 78, 59, 82, 75, 91, 76, 88, 86, 76, 91,~
## $ Hits <chr> "771", "562", "150", "207", "595", "248", "246", "1~
Count total number of rows and columns
nrow(fifa)
## [1] 18979
ncol(fifa)
## [1] 77
There are 18979 rows and 77 columns in the data set
Check for duplicates and missing values
fifa %>%
summarize(dups = sum(duplicated(.)))
## # A tibble: 1 x 1
## dups
## <int>
## 1 0
#checking on the ID column
sum(duplicated(fifa$ID))
## [1] 0
The data set does not contain any duplicates
#Using skimr to check missing values in the data set
skim(fifa)
The data does not contain missing values
1. Create a new data frame and drop columns that won’t be considered in the analysis
fifa21 <- fifa %>%
select(-c(LongName, photoUrl, playerUrl))
View(fifa21)
2. Fix the column names
library(janitor)
fifa21 <- fifa21 %>%
clean_names()
3. Remove newline characters from all columns that have them
# Club column
unique(fifa21$club)
fifa21$club <- if_else(str_detect(fifa21$club, "\n{4}"),
str_replace_all(fifa21$club, "\n{4}", ""), fifa21$club)
4. Convert the height and weight columns to numeric
Height Column
unique(fifa21$height)
## [1] "170cm" "187cm" "188cm" "181cm" "175cm" "184cm" "191cm" "178cm"
## [9] "193cm" "185cm" "199cm" "173cm" "168cm" "176cm" "177cm" "183cm"
## [17] "180cm" "189cm" "179cm" "195cm" "172cm" "182cm" "186cm" "192cm"
## [25] "165cm" "194cm" "167cm" "196cm" "163cm" "190cm" "174cm" "169cm"
## [33] "171cm" "197cm" "200cm" "166cm" "6'2\"" "164cm" "198cm" "6'3\""
## [41] "6'5\"" "5'11\"" "6'4\"" "6'1\"" "6'0\"" "5'10\"" "5'9\"" "5'6\""
## [49] "5'7\"" "5'4\"" "201cm" "158cm" "162cm" "161cm" "160cm" "203cm"
## [57] "157cm" "156cm" "202cm" "159cm" "206cm" "155cm"
First strip the string “cm” then Write a function that removes the non numeric characters, then converts the values to one unit
fifa21$height <- str_replace_all(fifa21$height, "cm", "")
inches_to_cm <- function(height_cm){
if (str_detect(height_cm, "'")) {
height_cm = str_replace_all(height_cm, c("'"=".", "\""=""))
cm = round(as.numeric(height_cm) * 30.48, 0)
print(cm)
}
else{
print(as.numeric(height_cm))
}
}
#Use the sapply function
fifa21$height <- sapply(fifa21$height, inches_to_cm)
unique(fifa21$height)
## [1] 170 187 188 181 175 184 191 178 193 185 199 173 168 176 177 183 180 189 179
## [20] 195 172 182 186 192 165 194 167 196 163 190 174 169 171 197 200 166 164 198
## [39] 156 155 201 158 162 161 160 203 157 202 159 206
Weight column
unique(fifa21$weight)
## [1] "72kg" "83kg" "87kg" "70kg" "68kg" "80kg" "71kg" "91kg"
## [9] "73kg" "85kg" "92kg" "69kg" "84kg" "96kg" "81kg" "82kg"
## [17] "75kg" "86kg" "89kg" "74kg" "76kg" "64kg" "78kg" "90kg"
## [25] "66kg" "60kg" "94kg" "79kg" "67kg" "65kg" "59kg" "61kg"
## [33] "93kg" "88kg" "97kg" "77kg" "62kg" "63kg" "95kg" "100kg"
## [41] "58kg" "183lbs" "179lbs" "172lbs" "196lbs" "176lbs" "185lbs" "170lbs"
## [49] "203lbs" "168lbs" "161lbs" "146lbs" "130lbs" "190lbs" "174lbs" "148lbs"
## [57] "165lbs" "159lbs" "192lbs" "181lbs" "139lbs" "154lbs" "157lbs" "163lbs"
## [65] "98kg" "103kg" "99kg" "102kg" "56kg" "101kg" "57kg" "55kg"
## [73] "104kg" "107kg" "110kg" "53kg" "50kg" "54kg" "52kg"
Strip the “kg” string then convert lbs to a common unit
fifa21$weight <- str_replace_all(fifa21$weight, "kg", "")
lbs_to_kg <- function(weight_kg){
if (str_detect(weight_kg, "lbs")) {
weight_kg = str_replace_all(weight_kg, "lbs", "")
kgs = round(as.numeric(weight_kg) * 0.454, 0)
print(kgs)
}
else {
print(as.numeric(weight_kg))
}
}
fifa21$weight <- sapply(fifa21$weight, lbs_to_kg)
unique(fifa21$weight)
## [1] 72 83 87 70 68 80 71 91 73 85 92 69 84 96 81 82 75 86 89
## [20] 74 76 64 78 90 66 60 94 79 67 65 59 61 93 88 97 77 62 63
## [39] 95 100 58 98 103 99 102 56 101 57 55 104 107 110 53 50 54 52
5. Convert the value, wage, and release clause columns to numeric
unique(fifa21$value)
unique(fifa21$wage)
unique(fifa21$release_clause)
#First drop the euro sign
fifa21 <- fifa21 %>%
mutate(
across(
c(value, wage, release_clause),
.fns = ~str_replace_all(.x, "€", "")
)
)
value_mil <- function(value_x){
if (str_detect(value_x, "M")) {
value_x = str_replace_all(value_x, "M", "")
value_x = round(as.numeric(value_x) * 1000000, 0)
print(value_x)
}
else if (str_detect(value_x, "K")) {
value_x = str_replace_all(value_x, "K", "")
value_x = round(as.numeric(value_x) * 1000, 0)
print(value_x)
}
else {
print(as.numeric(value_x))
}
}
fifa21$value <- sapply(fifa21$value, value_mil)
fifa21$wage <- sapply(fifa21$wage, value_mil)
fifa21$release_clause <- sapply(fifa21$release_clause, value_mil)
6. Remove star characters in the W/F, SM, and IR columns
unique(fifa21$w_f)
## [1] "4 <U+2605>" "3 <U+2605>" "5 <U+2605>" "2 <U+2605>" "1 <U+2605>"
unique(fifa21$sm)
## [1] "4<U+2605>" "5<U+2605>" "1<U+2605>" "2<U+2605>" "3<U+2605>"
unique(fifa21$ir)
## [1] "5 <U+2605>" "3 <U+2605>" "4 <U+2605>" "2 <U+2605>" "1 <U+2605>"
star_fn <- function(star_char){
if (str_detect(star_char, " ★|★")){
star_char = str_replace_all(star_char, " ★|★", "")
print(as.numeric(star_char))
}}
fifa21$w_f <- sapply(fifa21$w_f, star_fn)
fifa21$sm <- sapply(fifa21$sm, star_fn)
fifa21$ir <- sapply(fifa21$ir, star_fn)
7. Split the contract column into three separate columns: the year the contract starts, ends, and the type of contract
unique(fifa21$contract)
## [1] "2004 ~ 2021" "2018 ~ 2022" "2014 ~ 2023"
## [4] "2015 ~ 2023" "2017 ~ 2022" "2017 ~ 2023"
## [7] "2018 ~ 2024" "2014 ~ 2022" "2018 ~ 2023"
## [10] "2016 ~ 2023" "2013 ~ 2023" "2011 ~ 2023"
## [13] "2009 ~ 2022" "2005 ~ 2021" "2011 ~ 2021"
## [16] "2015 ~ 2022" "2017 ~ 2024" "2010 ~ 2024"
## [19] "2012 ~ 2021" "2019 ~ 2024" "2015 ~ 2024"
## [22] "2017 ~ 2025" "2020 ~ 2025" "2019 ~ 2023"
## [25] "2008 ~ 2023" "2015 ~ 2021" "2020 ~ 2022"
## [28] "2012 ~ 2022" "2016 ~ 2025" "2013 ~ 2022"
## [31] "2011 ~ 2022" "2012 ~ 2024" "2016 ~ 2021"
## [34] "2012 ~ 2023" "2008 ~ 2022" "2019 ~ 2022"
## [37] "2017 ~ 2021" "2013 ~ 2024" "2020 ~ 2024"
## [40] "2010 ~ 2022" "2020 ~ 2021" "2011 ~ 2024"
## [43] "2020 ~ 2023" "2014 ~ 2024" "2013 ~ 2026"
## [46] "2016 ~ 2022" "2010 ~ 2021" "2013 ~ 2021"
## [49] "2019 ~ 2025" "2018 ~ 2025" "2016 ~ 2024"
## [52] "2018 ~ 2021" "2009 ~ 2024" "2007 ~ 2022"
## [55] "Jun 30, 2021 On Loan" "2009 ~ 2021" "2019 ~ 2021"
## [58] "2019 ~ 2026" "Free" "2012 ~ 2028"
## [61] "2010 ~ 2023" "2014 ~ 2021" "2015 ~ 2025"
## [64] "2014 ~ 2026" "2012 ~ 2025" "2017 ~ 2020"
## [67] "2002 ~ 2022" "2020 ~ 2027" "2013 ~ 2025"
## [70] "Dec 31, 2020 On Loan" "2019 ~ 2020" "2011 ~ 2025"
## [73] "2016 ~ 2020" "2007 ~ 2021" "2020 ~ 2026"
## [76] "2010 ~ 2025" "2009 ~ 2023" "2008 ~ 2021"
## [79] "2020 ~ 2020" "2016 ~ 2026" "Jan 30, 2021 On Loan"
## [82] "2012 ~ 2020" "2014 ~ 2025" "Jun 30, 2022 On Loan"
## [85] "2015 ~ 2020" "May 31, 2021 On Loan" "2018 ~ 2020"
## [88] "2014 ~ 2020" "2013 ~ 2020" "2006 ~ 2024"
## [91] "Jul 5, 2021 On Loan" "Dec 31, 2021 On Loan" "2004 ~ 2025"
## [94] "2011 ~ 2020" "Jul 1, 2021 On Loan" "Jan 1, 2021 On Loan"
## [97] "2006 ~ 2023" "Aug 31, 2021 On Loan" "2006 ~ 2021"
## [100] "2005 ~ 2023" "2003 ~ 2020" "2009 ~ 2020"
## [103] "2002 ~ 2020" "2005 ~ 2020" "2005 ~ 2022"
## [106] "Jan 31, 2021 On Loan" "2010 ~ 2020" "Dec 30, 2021 On Loan"
## [109] "2008 ~ 2020" "2007 ~ 2020" "2003 ~ 2021"
## [112] "Jun 23, 2021 On Loan" "Jan 3, 2021 On Loan" "Nov 27, 2021 On Loan"
## [115] "2002 ~ 2021" "Jan 17, 2021 On Loan" "Jun 30, 2023 On Loan"
## [118] "1998 ~ 2021" "2003 ~ 2022" "2007 ~ 2023"
## [121] "Jul 31, 2021 On Loan" "Nov 22, 2020 On Loan" "May 31, 2022 On Loan"
## [124] "2006 ~ 2020" "Dec 30, 2020 On Loan" "2007 ~ 2025"
## [127] "Jan 4, 2021 On Loan" "Nov 30, 2020 On Loan" "2004 ~ 2020"
## [130] "2009 ~ 2025" "Aug 1, 2021 On Loan"
Create a function that checks whether the rows in the contract column fall under (contract, free, loan) contract types
contract_type <- function(type){
if (str_detect(type, "~")) {
print("Contract")
}
else if (str_detect(type, "Free")){
print("Free")
}
else if (str_detect(type, "On Loan")){
print("On Loan")
}
else {
print(type)
}
}
#Use mutate to create a new column that holds the type of contract
#Then use separate to split contract column into start year and end year
fifa21 <- fifa21 %>%
mutate(Type_of_contract = sapply(fifa21$contract, contract_type)) %>%
separate(contract, into = c("Start_year", "End_year"), sep = "~")
#After separating the contract column, it appears that the Start_year column contains date values ("On Loan" rows) which should belong in the End_year column, thus in this section, we are going to populate the end_year column with this values.
unique(fifa21$Start_year)
fifa21$End_year <- if_else(is.na(fifa21$End_year), fifa21$Start_year, fifa21$End_year)
#Then drop the 'On Loan" and "Free" rows from start_year column
fifa21$Start_year <- if_else(str_detect(fifa21$Start_year, "On Loan|Free"), NA, fifa21$Start_year)
#Populate the NA values in start_year with the right year, refer to the joined column
unique(fifa21$joined)
class(fifa21$joined)
fifa21 <- fifa21 %>%
mutate(joined = as_date(joined, format = "%d-%b-%y"),
YearJoined = as.character(year(joined)),
Start_year = if_else(is.na(Start_year), YearJoined, Start_year))
#Finally, create a function that replaces "On Loan" string in the end year column and retrieves the date only
unique(fifa21$End_year)
end_year_fixed <- function(loan_date){
if (str_detect(loan_date, "On Loan")) {
loan_date = str_replace_all(loan_date, "On Loan", "")
loan_date = year(as_date(loan_date, format = "%b-%d-%y"))
print(loan_date)
}
else{
print(str_trim(loan_date))
}
}
fifa21$End_year <- sapply(fifa21$End_year, end_year_fixed)
fifa21 <- fifa21 %>%
mutate(Loan_Date_End = as_date(joined, format = "%d-%b-%y"),
Loan_End = as.character(year(joined)),
End_year = if_else(is.na(End_year), Loan_End, End_year))
#Trim whitespaces from start year column
fifa21$Start_year <- str_trim(fifa21$Start_year)
Confirm the results
unique(fifa21$Type_of_contract)
## [1] "Contract" "On Loan" "Free"
unique(fifa21$Start_year)
## [1] "2004" "2018" "2014" "2015" "2017" "2016" "2013" "2011" "2009" "2005"
## [11] "2010" "2012" "2019" "2020" "2008" "2007" "2002" "2006" "2003" "1998"
unique(fifa21$End_year)
## [1] "2021" "2022" "2023" "2024" "2025" "2026" "2013" "2010" "2015" "Free"
## [11] "2019" "2018" "2028" "2017" "2016" "2020" "2027" "2014" "2012" "2011"
1. Based on the ‘Joined’ column, check which players have been playing at a club for more than 10 years!
class(fifa21$joined)
## [1] "Date"
fifa21 %>%
mutate(duration = as.numeric(difftime(today(), joined, units = "days")) %/% 365.25) %>%
filter(duration >= 10) %>%
arrange(desc(duration)) %>%
select(name, nationality, club, duration) %>%
slice_max(duration, n = 10)
## # A tibble: 12 x 4
## name nationality club duration
## <chr> <chr> <chr> <dbl>
## 1 H. Sogahata Japan Kashima Antlers 26
## 2 I. Akinfeev Russia PFC CSKA Moscow 22
## 3 Kim Kwang Suk Korea Republic Pohang Steelers 22
## 4 K. Nakamura Japan Kawasaki Frontale 21
## 5 D. Lewington England Milton Keynes Dons 21
## 6 H. Otani Japan Kashiwa Reysol 21
## 7 M. McNulty Republic of Ireland Cork City 21
## 8 P. Karlsson Sweden AIK 20
## 9 T. Aoyama Japan Sanfrecce Hiroshima 20
## 10 M. Bloomfield England Wycombe Wanderers 20
## 11 D. Zibung Switzerland FC Luzern 20
## 12 Yang Qipeng China PR Tianjin TEDA FC 20
759 players have been playing at a club for more than 10 years. Above are the top 10 players who have stayed at a club for the longest time.
2. Which players are highly valuable but still underpaid (on low wages)? (hint: scatter plot between wage and value
library(ggrepel)
library(PerformanceAnalytics)
#Check the data types of Wage and Value
class(fifa21$value)
## [1] "numeric"
class(fifa21$wage)
## [1] "numeric"
#The correlation chart below shows that there exists a strong positive relationship of (0.81) between the Value and wage of a player
fifa21 %>%
select(value, wage, release_clause) %>%
chart.Correlation()
#Use histograms to check the distribution of Wage and Value
#Majority of the players earn less than 25,000
fifa21 %>%
ggplot(aes(wage)) +
geom_histogram() +
scale_x_continuous(breaks = seq(0, 500000, 100000),
labels = c(0, "100K", "200K", "300K", "400K", "500K"))
fifa21 %>%
ggplot(aes(value)) +
geom_histogram() +
scale_x_continuous(breaks = seq(0, 200000000, 50000000),
labels = c(0, "50M", "100M", "150M", "200M"))
#Scatter plot that shows the relationship between value and wage of players. It displays names of players who are highly valyed but still earn less
fifa21 %>%
filter(wage <= 60000 & value >= 50000000) %>%
ggplot(aes(value, wage, label = name)) +
geom_point(aes(color = best_position), position = "jitter") +
geom_text_repel(hjust = 1,
nudge_y = 4,
direction = "x") +
scale_y_continuous(breaks = seq(0, 500000, 50000),
labels = c(0, "50K", "100K", "150K", "200K", "250K", "300K", "350K", "400K", "450K", "500K")) +
scale_x_continuous(breaks = seq(0, 200000000, 25000000),
labels = c(0,"25M", "50M", "75M", "100M", "125M", "150M", "175M", "200M")) +
labs(
title = "Underpaid Players who are Highly Valuable",
subtitle = "(G. Donnarumma is one of the highest valued players but\n still gets underpaid)"
) +
theme_classic() +
theme(legend.position = "none",
title = element_text(size = 14, face = "bold"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12, face = "bold"),
plot.subtitle = element_text(colour = "gray30", face = "italic", size = 11))
3. Who are the most valuable players?
library(kableExtra)
Most_Valuable <- fifa21 %>%
filter(value > 100000000) %>%
select(name, club, value) %>%
arrange(desc(value))
Most_Valuable %>%
kable(digits = 0, format = "html",
caption = "Mbappe and Neymar Jr were the most valuable players") %>%
kable_classic("striped", "bordered", full_width = FALSE, html_font = "cambria",
position = "left",
fixed_thead = T) %>%
row_spec(0, bold = TRUE) %>%
column_spec(1, bold = TRUE, color = "black", background = ("#F0E68C")) %>%
column_spec(2, color = "black", background = "lightblue")
| name | club | value |
|---|---|---|
| K. Mbappé | Paris Saint-Germain | 185500000 |
| Neymar Jr | Paris Saint-Germain | 132000000 |
| K. De Bruyne | Manchester City | 129000000 |
| J. Sancho | Borussia Dortmund | 124000000 |
| K. Havertz | Chelsea | 121000000 |
| M. Salah | Liverpool | 120500000 |
| S. Mané | Liverpool | 120500000 |
| J. Oblak | Atlético Madrid | 120000000 |
| R. Sterling | Manchester City | 114500000 |
| T. Alexander-Arnold | Liverpool | 114000000 |
| V. van Dijk | Liverpool | 113000000 |
| R. Lewandowski | FC Bayern München | 111000000 |
| M. ter Stegen | FC Barcelona | 110000000 |
| P. Dybala | Juventus | 109000000 |
| H. Kane | Tottenham Hotspur | 109000000 |
| Bruno Fernandes | Manchester United | 105500000 |
| L. Messi | FC Barcelona | 103500000 |
| J. Kimmich | FC Bayern München | 103000000 |
| Alisson | Liverpool | 102000000 |
4. Is there a significant relationship between the weight, height, and age of a player? (Does the height and age of a player have a significant impact on the weight?)
#Create a new data frame to hold the variables of interest
df <- fifa21 %>%
select(weight, height, age)
#Scatter plot to check relationships between weight and height, weight and age respectively
ggplot(df, aes(height, weight)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
ggplot(df, aes(age, weight)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
#Using a correlation matrix
df %>%
select_if(is.numeric) %>%
chart.Correlation()
#From above scatter plots, there is a strong positive linear relationship between weight and height, where as there is a weak positive relationship between Age and weight. This can also be seen in the correlation matrix. (correlation between age and weight = 0.24).
#Thus to perform multiple linear regression, we convert Age into a categorical variable
df <- df %>%
mutate(AgeCat = cut(age, breaks = c(0, 20, 25, 30, 35, 60),
labels = c("A", "B", "C", "D", "E")))
View(df)
model1 <- lm(weight ~ height + AgeCat, data = df)
plot(model1)
#Checking assumptions of multiple linear regression
#The residuals vs Fitted plot indicates that there exists an approximately linear relationship since the line is fairly horizontal
#The Normal Q-Q plot shows that the residuals from the model are approximately normally distributed with some slight deviations on both ends of the dashed line
#The Scale-Location plot has a fairly horizontal line, this shows that there is homoscedasticity in the model since the points are almost equally spread from the regression line across the observations
#The fairly horizontal line in the residuals vs Leverage plot indicates that there may be some few outliers in the data
summary(model1)
##
## Call:
## lm(formula = weight ~ height + AgeCat, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.306 -2.811 0.026 2.744 34.600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.756417 0.837064 -80.94 <2e-16 ***
## height 0.777858 0.004628 168.09 <2e-16 ***
## AgeCatB 1.330939 0.090435 14.72 <2e-16 ***
## AgeCatC 2.714874 0.093601 29.00 <2e-16 ***
## AgeCatD 3.415417 0.115214 29.64 <2e-16 ***
## AgeCatE 4.698167 0.244299 19.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.359 on 18973 degrees of freedom
## Multiple R-squared: 0.6204, Adjusted R-squared: 0.6203
## F-statistic: 6201 on 5 and 18973 DF, p-value: < 2.2e-16
# Scatter plot representing the model
ggplot(df, aes(height, weight, color = AgeCat)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Interpretation of Above Model
From the scatter plot, we can notice that Height has an effect on weight, as height increases by one unit we expect the weight to increase by 0.778. This increase is assumed to be the same for all the age categories. The Age category also has an effect on weight, for players in age category A the weight increases by 0.778, in age category B weight increases by 1.33, in age category C weight increases by 2.71, in age category D weight increases by 3.41, while in age category E weight increases by 4.69. The effect is assumed to be the same for all heights.
The model has a residual standard error of 4.359, and a Multiple R-squared value of 0.6204. This indicates that approximately 62.04% of the variability in weight can be explained by the model. The F-statistic also has a p-value of < 0.05 indicating that the model is statistically significant.
5. Which positions have the highest average overall ratings (OVA)
unique(fifa21$best_position)
## [1] "RW" "ST" "GK" "CAM" "LW" "CB" "CDM" "CF" "CM" "RB" "LB" "LM"
## [13] "RM" "LWB" "RWB"
ggplot(fifa21, aes(fct_rev(fct_reorder(best_position, ova)), ova)) +
geom_boxplot(aes(fill = best_position)) +
labs(x = "Position",
y = "Overall average rating",
title = "Positions with the highest overall average ratings",
subtitle = "(CF players have the highest average ratings)") +
theme_classic() +
theme(plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10, colour = "gray50"),
axis.title = element_text(size = 12, colour = "black", face = "bold"),
legend.position = "none")
Central Forward (CF) has the highest median overall rating across all positions, while Goal Keepers (GK) has the least ratings.
6. What is the distribution of players across different clubs and Which clubs have the highest average overall rating among its players?
#Distribution of players across different clubs in percentages
fifa21 %>%
group_by(club) %>%
summarize(n = n()) %>%
group_by(n) %>%
summarize(count = n()) %>%
mutate(percentage = count / sum(count) * 100) %>%
arrange(desc(count)) %>%
slice_max(percentage, n = 5)
## # A tibble: 5 x 3
## n count percentage
## <int> <int> <dbl>
## 1 28 152 22.3
## 2 30 119 17.4
## 3 27 104 15.2
## 4 26 64 9.38
## 5 25 58 8.50
# clubs with highest rated players
highest_ratings <- fifa21 %>%
group_by(club) %>%
summarize(avg_ratings = mean(ova),
median_ratings = median(ova)) %>%
arrange(desc(avg_ratings)) %>%
slice_max(avg_ratings, n=10)
highest_ratings %>%
kable(digits = 0, format = "html",
caption = "Clubs with highest average overall rating of players") %>%
kable_classic("striped", "bordered", full_width = FALSE, html_font = "cambria",
position = "left",
fixed_thead = T) %>%
row_spec(0, bold = TRUE) %>%
column_spec(1, bold = TRUE, color = "black", background = ("#F0E68C")) %>%
column_spec(2, color = "black", background = "lightblue")
| club | avg_ratings | median_ratings |
|---|---|---|
| FC Bayern München | 81 | 82 |
| Real Madrid | 80 | 81 |
| Inter | 79 | 80 |
| Napoli | 79 | 79 |
| Chelsea | 78 | 79 |
| Manchester United | 78 | 79 |
| Juventus | 78 | 80 |
| Paris Saint-Germain | 78 | 79 |
| Liverpool | 77 | 80 |
| Sevilla FC | 77 | 78 |
From the data set, 22.28% of the clubs have 28 players each, 17.45% have 30 players, while surprisingly, there are clubs with less than 22 players. Approximately 15 clubs have 22 players and below.
FC Bayern Munchen, Real Madrid, Inter, and Napoli are the teams with the most highly rated players, while Finn Harps and Waerford FC have the lowest rated players.
7. How does the length of a player’s contract relate to their market value?
class(fifa21$Start_year)
## [1] "character"
class(fifa21$End_year)
## [1] "character"
#Convert above variables to numeric
fifa21$Start_year <- as.numeric(fifa21$Start_year)
fifa21$End_year <- if_else(is.na(fifa21$End_year), NA, as.numeric(fifa21$End_year))
mkt_val <- fifa21 %>%
mutate(contract_length = End_year - Start_year)
#Hypotheses Testing
#Null Hypothesis: There is no relationship between a player's contract length and their market value
#Alternative Hypothesis: There is a relationship between a player's contract length and their market value
cor.test(mkt_val$contract_length, mkt_val$value, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: mkt_val$contract_length and mkt_val$value
## S = 8.631e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.213385
There appears to be a weak positive relationship between a player’s contract length and their market value. The Spearman’s rank test yields a correlation coefficient of 0.213385, and a p-value < 0.05, thus we reject the null hypothesis and conclude that there is a relationship between a player’s contract length and their market value. As the contract length increases, the market value is also expected to increase.