Youth Risk Behavior Surveillance

Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. We are working with a random sample of observations for one of the years of the survey.

Load the data

We start by loading and glimpsing the data to get an idea of the variables we are working with.

data(yrbss)
glimpse(yrbss)

## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender                   <chr> "female", "female", "female", "female", "fema…
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race                     <chr> "Black or African American", "Black or Africa…
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight                   <dbl> NA, NA, 84.4, 55.8, 46.7, 67.1, 131.5, 71.2, …
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…

skimr::skim(yrbss)

Table 1: Data summary
Name	yrbss
Number of rows	13583
Number of columns	13
_______________________
Column type frequency:
character	8
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
gender	12	1.00	4	6	2
grade	79	0.99	1	5	5
hispanic	231	0.98	3	8	2
race	2805	0.79	5	41	5
helmet_12m	311	0.98	5	12	6
text_while_driving_30d	918	0.93	1	13	8
hours_tv_per_school_day	338	0.98	1	12	7
school_night_hours_sleep	1248	0.91	1	3	7

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	77	0.99	16.16	1.26	12.00	15.0	16.00	17.00	18.00	▁▂▅▅▇
height	1004	0.93	1.69	0.10	1.27	1.6	1.68	1.78	2.11	▁▅▇▃▁
weight	1004	0.93	67.91	16.90	29.94	56.2	64.41	76.20	180.99	▆▇▂▁▁
physically_active_7d	273	0.98	3.90	2.56	0.00	2.0	4.00	7.00	7.00	▆▂▅▃▇
strength_training_7d	1176	0.91	2.95	2.58	0.00	0.0	3.00	5.00	7.00	▇▂▅▂▅

Exploratory Data Analysis

We will analyse the weight variable and see if exercise influences the weight of participants. To start off, we can see from the skim table that there’s 1004 missing values in the column. The distribution of the remaining values can be seen in the histogram below. The distribution is skewed right, with a mean of 67.91 kilograms.

ggplot(yrbss) +
  aes(x = weight) +
  geom_histogram() +
  theme_minimal()

We consider the relationship between physical acitvity and weight. We will create a variable ‘physical_3plus’ that says whether a high schooler is physically active for at least 3 days a week. See below the number and % of students that are physically active. We have tested whether the count function and groupby-summarise give the same result for the count and the percentage, which is the case.

df_active <- yrbss %>%
  drop_na(physically_active_7d) %>%
  mutate(physical_3plus = ifelse(physically_active_7d > 2, 'yes','no') )

groupby_active <- df_active %>%
  group_by(physical_3plus) %>%
  summarise(nr_physical_3plus = n()) %>%
  mutate(perc_physical_3plus = nr_physical_3plus / sum(nr_physical_3plus))

groupby_active

## # A tibble: 2 × 3
##   physical_3plus nr_physical_3plus perc_physical_3plus
##   <chr>                      <int>               <dbl>
## 1 no                          4404               0.331
## 2 yes                         8906               0.669

count_active <- df_active %>%
  count(physical_3plus) %>%
  mutate(perc_physical_3plus = n / sum(n))

count_active

## # A tibble: 2 × 3
##   physical_3plus     n perc_physical_3plus
##   <chr>          <int>               <dbl>
## 1 no              4404               0.331
## 2 yes             8906               0.669

We have calculated a 95% confidence interval for the proportion of high schoolers who are not active 3 or more days a week through bootstrapping with the infer package, which is 0.3 - 0.326. This means the majority of people are active 3+ days a week.

library(infer)

ci_proportion_activity <- df_active %>% 
  mutate(physical_3plus = factor(physical_3plus)) %>%
  specify(response = physical_3plus, success = 'no') %>% 
  generate(type = 'bootstrap', reps = 1000) %>% 
  calculate(stat = 'prop') %>% 
  get_confidence_interval(level = 0.95)

ci_proportion_activity

## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.323    0.339

To analyse the relationship between weight and physical activity, we are plotting 2 boxplots, 1 for those who are physically active 3+ days a week and those who are not. We would have expected physical activity to be inversely correlated with weight, as activity generally makes you lose weight. However, it appears in the boxplots that this is not the case. To explain that physically active people are heavier, we can think of two things. Firstly, muscle is heavier than fat, and people who exercise often have more muscle. Secondly, people who are heavier might want to exercise more often to lose weight.

ggplot(df_active) + 
  aes(x = physical_3plus, y = weight) +
  geom_boxplot() +
  labs(
    title = 'Physical activity linked to higher average weight',
    y = 'Weight',
    x = 'Physically active 3+ days'
  ) +
  coord_flip()

Confidence Interval

Boxplots show how the medians of the two distributions compare, but we also want to compare the means and see if there is a statistical difference. First we calculate the 95% confidence interval and then do a hypothesis test to confirm the difference.

means_active <- df_active %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE),
            std_weight = sd(weight, na.rm = TRUE),
            count = n(), 
            se = std_weight/sqrt(count),
            critical_t = qt(0.975, count-1),
            lower = mean_weight - critical_t * se,
            upper = mean_weight + critical_t * se)

means_active

## # A tibble: 2 × 8
##   physical_3plus mean_weight std_weight count    se critical_t lower upper
##   <chr>                <dbl>      <dbl> <int> <dbl>      <dbl> <dbl> <dbl>
## 1 no                    66.7       17.6  4404 0.266       1.96  66.2  67.2
## 2 yes                   68.4       16.5  8906 0.175       1.96  68.1  68.8

There is an observed difference of about 1.77kg (68.44 - 66.67), and we notice that the two confidence intervals do not overlap. It seems that the difference is at least 95% statistically significant. Let us also conduct a hypothesis test.

Hypothesis test with formula

We are now going to test whether there’s a difference between the mean weights for those who are physically active at least 3x a week and those we don’t. Our null hypothesis is that there is no difference between the means. The alternative hypothesis is that there is a difference. We perform a t-test below to assess these hypotheses.

t.test( weight ~ physical_3plus, data = df_active)

## 
##  Welch Two Sample t-test
## 
## data:  weight by physical_3plus
## t = -5, df = 7479, p-value = 9e-08
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -2.42 -1.12
## sample estimates:
##  mean in group no mean in group yes 
##              66.7              68.4

According to the t-test, there is a significant difference between the weight of people with 3+ days of phsyical activity and those without. Those who have 3+ days of physical activity have a higher weight.

POINT OF PROGRESS FOR STORYTIME

Hypothesis test with `infer`

Next, we will use the infer package to do a bootstrap simulation of the analysis. This will be used to calculate the difference in means.

obs_diff <- df_active %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

null_dist <- df_active %>%
  # specify variables
  specify(weight ~ physical_3plus) %>%
  
  # assume independence, i.e, there is no difference
  hypothesize(null = "independence") %>%
  
  # generate 1000 reps, of type "permute"
  generate(reps = 1000, type = "permute") %>%
  
  # calculate statistic of difference, namely "diff in means"
  calculate(stat = "diff in means", order = c("yes", "no"))

null_dist

## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate    stat
##        <int>   <dbl>
##  1         1  0.0507
##  2         2 -0.286 
##  3         3 -0.155 
##  4         4 -0.242 
##  5         5  0.0582
##  6         6  0.0122
##  7         7 -0.0598
##  8         8  0.516 
##  9         9 -0.809 
## 10        10  0.154 
## # … with 990 more rows

visualise(null_dist)

We have used hypothesize to set the null hypothesis as a test for independence, i.e. that there is no difference between the two mean Our null hypothesis (that there is no difference between the two means) has been set using the hypothesize function. The alternative hypothesis is that there is a difference in weight between those who exercise three or more days a week and those who don’t.

ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram()

After this visualisation, we need to find the p-value associated with this statistical test to determine whether there is a statistically significant difference.

See below the visualisation of the p-value on top of the distribution. The p-value is also printed, which is 0 in this case. That means that there is a significant difference between the two means.

null_dist %>% visualize() +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided")

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

IMDB ratings: Differences between directors

We have a dataset containing a few thousand movies and their IMBD ratings. We want to explore whether movies directed by Steven Spielberg and Tim Burton are rated the same.

We have loaded the data and shown a glimpse of it below.

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)

## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …

To start assessing this, we create a dataframe with the 95% confidence interval for the mean of the rating for each of the directors, using the t-statistic. Then we plot the confidence intervals on the same graph to see if they overlap a lot.

chosen_directors <- c('Tim Burton','Steven Spielberg')

directors <- movies %>% 
  filter(director %in% chosen_directors) %>%
  group_by(director) %>%
  summarise(mean_rating = mean(rating),
            sd_rating = sd(rating),
            count = n(),
            t_critical = qt(0.975, count-1),
            se_rating = sd_rating/sqrt(count),
            margin_of_error = t_critical * se_rating,
            lower = mean_rating - margin_of_error,
            upper = mean_rating + margin_of_error) %>%
  mutate(labels = round(mean_rating, 2))

directors

## # A tibble: 2 × 10
##   director      mean_…¹ sd_ra…² count t_cri…³ se_ra…⁴ margi…⁵ lower upper labels
##   <chr>           <dbl>   <dbl> <int>   <dbl>   <dbl>   <dbl> <dbl> <dbl>  <dbl>
## 1 Steven Spiel…    7.57   0.695    23    2.07   0.145   0.301  7.27  7.87   7.57
## 2 Tim Burton       6.93   0.749    16    2.13   0.187   0.399  6.53  7.33   6.93
## # … with abbreviated variable names ¹mean_rating, ²sd_rating, ³t_critical,
## #   ⁴se_rating, ⁵margin_of_error

ggplot(data = directors) +
  aes(y = director) +
  geom_errorbarh(aes(xmin = lower, xmax = upper, color = director), size = 2, height = 0.1) +
  geom_point(aes(x = mean_rating, color = director), size = 5) +
  geom_rect(aes(xmin = 7.27, xmax = 7.33, ymin=0,ymax=3), fill = 'grey70', alpha=0.5) +
  geom_text(aes(x = mean_rating, label = labels), vjust = 0, nudge_y = 0.05, overlap=FALSE) +
  labs(
    title = 'Do Tim Burton and Steven Spielberg have the same mean IMDB rating?',
    subtitle = '95% confidence intervals overlap',
    x = 'Rating',
    y = NULL
  ) +
  theme_minimal()

Seeing that the two intervals overalp in the grey highlighted area, we need to also conduct a statistical test to see if the means are statistically significantly different. For this, we run a t-test and use the infer package to bootstrap the data as well.

Our null hypothesis is that the means of the ratings of their films is the same. The alternative hypothesis is that there is a difference between the means of the ratings of Tim Burtons and Steven Spielbergs films. We will be using the t-statistic for the difference between the means. We are looking for a p-value smaller than 0.05 or a t-statistic bigger than 1.75 or smaller than -1.75 (based on df = 15 and p = 0.05). We have chosen the degrees of freedom as the smallest count - 1, which was 16 for Tim Burton.

df_directors <- movies  %>%
  filter(director %in% chosen_directors)

t.test(rating ~ director, data = df_directors)

## 
##  Welch Two Sample t-test
## 
## data:  rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means between group Steven Spielberg and group Tim Burton is not equal to 0
## 95 percent confidence interval:
##  0.16 1.13
## sample estimates:
## mean in group Steven Spielberg       mean in group Tim Burton 
##                           7.57                           6.93

directors_infer <- df_directors %>%
  # specify variables
  specify(rating ~ director) %>%
  
  # assume independence, i.e, there is no difference
  hypothesize(null = "independence") %>%
  
  # generate 1000 reps, of type "permute"
  generate(reps = 1000, type = "permute") %>%
  
  # calculate statistic of difference, namely "diff in means"
  calculate(stat = "diff in means", order = chosen_directors)

directors_infer %>%
  get_p_value(obs_stat = obs_diff, direction = "two-sided")

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

According to both the t.test and the bootstrapped result, there is a signficant difference in the ratings of the movies of Tim Burton and Steven Spielberg. We see that Tim has a lower mean rating than Steven. The p-value for the t.test was 0.01 and the p-value for infer was 0.

Omega Group plc- Pay Discrimination

At the last board meeting of Omega Group Plc., the headquarters of a large multinational company, the issue was raised that women were being discriminated in the company, in the sense that the salaries were not the same for male and female executives.

We were asked to carry out the analysis to find out whether there is a significant difference between the salaries of men and women, and if there are any discrimination factors.

Loading the data

Omega Group Plc. had shared data of 50 employees with us. The data was inspected before conducting our analysis and we have noted that there were no missing/incomplete data points.

omega <- read_csv(here::here("data", "omega.csv"))
glimpse(omega) # examine the data frame

## Rows: 50
## Columns: 3
## $ salary     <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 635…
## $ gender     <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, 3…

Relationship Salary - Gender ?

The data frame omega contains the salaries for the sample of 50 executives in the company.

We have constructed a visualization that shows the salary trend between male and female employees:

#Extra Visual
ggplot(omega) + 
  geom_boxplot() +
  aes(x = salary, y = gender, colour = gender) + 
  labs (x = "Salary", y = "Gender", caption = "Source: Omega Plc salary data", title = "Male employees at Omega Plc have a higher salary than females", subtitle = "Boxplot showing the salary trend between males and females at Omega Plc")

The above graph shows a visual difference between the average distributions of salary between female and male salary’s for Omega Plc. The group decided to extrapolate the data and construct tables depict these initial findings.

# Summary Statistics of salary by gender
mosaic::favstats (salary ~ gender, data=omega)

##   gender   min    Q1 median    Q3   max  mean   sd  n missing
## 1 female 47033 60338  64618 70033 78800 64543 7567 26       0
## 2   male 54768 68331  74675 78568 84576 73239 7463 24       0

#Summary Stats 

summary_stats <- omega %>% 
  group_by(gender) %>% 
  summarise(mean_salary = mean(salary),
            sd_salary = sd(salary),
            count = n(),
            t_critical = qt(0.975, count-1),
            se_salary = sd(salary)/sqrt(count),
            margin_of_error = t_critical * se_salary,
            salary_low = mean_salary - margin_of_error,
            salary_high = mean_salary + margin_of_error)

summary_stats

## # A tibble: 2 × 9
##   gender mean_salary sd_salary count t_critical se_sal…¹ margi…² salar…³ salar…⁴
##   <chr>        <dbl>     <dbl> <int>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
## 1 female      64543.     7567.    26       2.06    1484.   3056.  61486.  67599.
## 2 male        73239.     7463.    24       2.07    1523.   3151.  70088.  76390.
## # … with abbreviated variable names ¹se_salary, ²margin_of_error, ³salary_low,
## #   ⁴salary_high

obs_diff <- omega %>%
  specify(salary ~ gender) %>%
  calculate(stat = "diff in means", order = c("male", "female"))
# Dataframe with two rows (male-female) and having as columns gender, mean, SD, sample size, 
# the t-critical value, the standard error, the margin of error, 
# and the low/high endpoints of a 95% condifence interval

p_val <- summary_stats %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")

p_val

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

From the initial analysis in the table above, it is shown that males have a higher mean salary of 8696 than females. We decided that more analysis is required to determine if this difference is significant and to decide if there are any further analysis required to uncover influencing factors.

Advanced Analysis

The team decided to conduct 2 tests to decide if the difference in mean salary’s between males and females were significant. The following tests were decided upon:

Hypothesis Tesing useing the T Test package
T Test using simulation method and infer packages

Hypothesis Tesing using the T Test package

The T Test package was used to assess if there is a true difference in the salaries, and findings form this built-in test shows that there is indeed a significant difference as indicated by the absolute T-Value being greater than 1.96, and the P-Value being smaller than 5%.

# hypothesis testing using t.test() 
t.test(salary ~ gender, data = omega)

## 
##  Welch Two Sample t-test
## 
## data:  salary by gender
## t = -4, df = 48, p-value = 2e-04
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -12973  -4420
## sample estimates:
## mean in group female   mean in group male 
##                64543                73239

We also ran a hypothesis test, assuming that the mean difference in salaries is zero as the null hypothesis using the simulation method from the infer package.

# hypothesis testing using infer package
library(infer)
set.seed(1234)
infer_stats <- omega %>% 
  specify(salary ~ gender) %>% 
  hypothesize(null = "independence",) %>% 
  generate(reps = 1000, type ="permute") %>% 
  calculate(stat = "diff in means", order = c("male", "female"))

percentile_ci <- infer_stats %>% get_confidence_interval(level = 0.95, type = "percentile")

visualize(infer_stats) + shade_p_value(obs_stat = obs_diff, direction = "two-sided")

percentile_ci

## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1   -5025.    4829.

From both tests conducted, we can conclude that the observed difference in the means of salaries between males and females at Omega Plc is indeed a significant difference.

As depicted in the simulation density visualisation above, we can see that the observed difference is passed the Upper 95% percentile of 4829. This is confirmed by the t.test() performed that states the absolute T-value as 4, which is much bigger than the standard acceptable value of 1.96. Additionally, the p-value was smaller than 1%.

Relationship Experience - Gender?

At the board meeting, someone raised the issue that there was indeed a substantial difference between male and female salaries, but that this was attributable to other reasons such as differences in experience. A questionnaire send out to the 50 executives in the sample reveals that the average experience of the men is approximately 21 years, whereas the women only have about 7 years experience on average (see table below).

# Summary Statistics of experience by gender
favstats (experience ~ gender, data=omega)

##   gender min    Q1 median   Q3 max  mean    sd  n missing
## 1 female   0  0.25    3.0 14.0  29  7.38  8.51 26       0
## 2   male   1 15.75   19.5 31.2  44 21.12 10.92 24       0

#Summary Stats
summary_stats_exp <- omega %>% 
  group_by(gender) %>% 
  summarise(mean_experience = mean(experience),
            sd_experience = sd(experience),
            count = n(),
            t_critical = qt(0.975, count-1),
            se_experience = sd(experience)/sqrt(count),
            margin_of_error = t_critical * se_experience,
            experience_low = mean_experience - margin_of_error,
            experience_high = mean_experience + margin_of_error)

summary_stats_exp

## # A tibble: 2 × 9
##   gender mean_experience sd_expe…¹ count t_cri…² se_ex…³ margi…⁴ exper…⁵ exper…⁶
##   <chr>            <dbl>     <dbl> <int>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 female            7.38      8.51    26    2.06    1.67    3.44    3.95    10.8
## 2 male             21.1      10.9     24    2.07    2.23    4.61   16.5     25.7
## # … with abbreviated variable names ¹sd_experience, ²t_critical,
## #   ³se_experience, ⁴margin_of_error, ⁵experience_low, ⁶experience_high

The above data shows a difference in experience mean of 13.74 years between females and males. To establish if the observation is a significant difference, further analysis is required.

t.test(experience ~ gender, data = omega)

## 
##  Welch Two Sample t-test
## 
## data:  experience by gender
## t = -5, df = 43, p-value = 1e-05
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -19.35  -8.13
## sample estimates:
## mean in group female   mean in group male 
##                 7.38                21.12

Further analysis was conducted in the form of a T Test to assess the if there is any significant differences in experience between genders. The following test show the statistical variables for Omega Plc.The findings show that there is a significant difference in the experience between males and females from Omega Plc, shown by the observed t-value is greater than 1.96 and a small p-value. This finding assists in validating the pervious observation - as the average experience of males suggests reason as to why the salaries of genders are different.

Based on this evidence, can you conclude that there is a significant difference between the experience of the male and female executives? Perform similar analyses as in the previous section. Does your conclusion validate or endanger your conclusion about the difference in male and female salaries?

Relationship Salary - Experience ?

To further substantiate the findings from the previous tests conducted, a final visual check is performed to determine if there have been any discrimination against females at Omega Plc.

ggplot(omega)+
  aes(x = experience, y = salary, color = gender)+
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Experience", y = "Salary", title = "Female's yearly increases are greater than that of men", subtitle = "Scatterplot showing the distribution of salary versus experience for males and females.")

The plot shows that there are a greater proportion of female employees at Omega Plc with less than 10 years of experience as compared to males, while there is a greater proportion of males with more than 10 years of experience than females at the company

Correlation checks between gender, experience, and salary

The following graph shows a visual matrix of how gender and experience affect salary.

omega %>% 
  select(gender, experience, salary) %>% #order variables they will appear in ggpairs()
  ggpairs(aes(colour=gender, alpha = 0.3))+
  theme_bw()

The visual assessing salary and experience for various genders shows the same finding as described in the visual before.

Challenge 1: Brexit plot

Recreating the Brexit Plot.

knitr::include_graphics(here::here("images", "brexit.png"), error = FALSE)

# read data directly off github repo
brexit_results <- read_csv("https://raw.githubusercontent.com/kostis-christodoulou/am01/master/data/brexit_results.csv")

filtered_brexit_data <- brexit_results %>%  
  select(leave_share, con_2015, lab_2015, ld_2015, ukip_2015) %>% 
  pivot_longer(cols = 2:5, names_to = "party", values_to = "percentage") # pivoting the data before preparing the plot

#filtered_brexit_data

ggplot(filtered_brexit_data) + 
  aes(x = percentage, y = leave_share, colour = party) + 
  geom_point(alpha = 0.35) + 
  geom_smooth(method = "lm") +
  scale_color_manual(breaks = c("con_2015", "lab_2015", "ld_2015", "ukip_2015"), values=c("#0087DC", "#E4003B", "#FAA61A", "#FFD700"), labels=c('Conservative', 'Labour', 'Lib Dems', 'UKIP')) +
  labs(x = "Party % in the UK 2015 general election", y = "Party % in the 2016 Brexit referendum", title = "How political affiliation translated to Brexit Voting") +
 theme_bw() +
  theme(legend.position = "bottom", legend.title = element_blank())

#+ theme(aspect.ratio = 1)

Challenge 2:GDP components over time and among countries

We have data from the UN on the GDP among many different countries between 1970 and 2022. We want to test whether the economic definition of GDP is correct in reality. In economics, GDP is the sum of consumption, investment, government spending and net exports. We will be analysing whether that is true for India, the US and Germany.

Below we download and glimpse the data

UN_GDP_data  <-  read_excel(here::here("data", "Download-GDPconstant-USD-countries.xls"), # Excel filename
                sheet="Download-GDPconstant-USD-countr", # Sheet name
                skip=2) # Number of rows to skip

glimpse(UN_GDP_data)

## Rows: 3,685
## Columns: 51
## $ CountryID     <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8,…
## $ Country       <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
## $ IndicatorName <chr> "Final consumption expenditure", "Household consumption …
## $ `1970`        <dbl> 5.56e+09, 5.07e+09, 3.72e+08, 9.85e+08, 9.85e+08, 1.12e+…
## $ `1971`        <dbl> 5.33e+09, 4.84e+09, 3.82e+08, 1.05e+09, 1.05e+09, 1.45e+…
## $ `1972`        <dbl> 5.20e+09, 4.70e+09, 4.02e+08, 9.19e+08, 9.19e+08, 1.73e+…
## $ `1973`        <dbl> 5.75e+09, 5.21e+09, 4.21e+08, 9.19e+08, 9.19e+08, 2.18e+…
## $ `1974`        <dbl> 6.15e+09, 5.59e+09, 4.31e+08, 1.18e+09, 1.18e+09, 3.00e+…
## $ `1975`        <dbl> 6.32e+09, 5.65e+09, 5.98e+08, 1.37e+09, 1.37e+09, 3.16e+…
## $ `1976`        <dbl> 6.37e+09, 5.68e+09, 6.27e+08, 2.03e+09, 2.03e+09, 4.17e+…
## $ `1977`        <dbl> 6.90e+09, 6.15e+09, 6.76e+08, 1.92e+09, 1.92e+09, 4.31e+…
## $ `1978`        <dbl> 7.09e+09, 6.30e+09, 7.25e+08, 2.22e+09, 2.22e+09, 4.57e+…
## $ `1979`        <dbl> 6.92e+09, 6.15e+09, 7.08e+08, 2.07e+09, 2.07e+09, 4.89e+…
## $ `1980`        <dbl> 6.69e+09, 5.95e+09, 6.85e+08, 1.98e+09, 1.98e+09, 4.53e+…
## $ `1981`        <dbl> 6.81e+09, 6.06e+09, 6.97e+08, 2.06e+09, 2.06e+09, 4.60e+…
## $ `1982`        <dbl> 6.96e+09, 6.19e+09, 7.12e+08, 2.08e+09, 2.08e+09, 4.77e+…
## $ `1983`        <dbl> 7.30e+09, 6.49e+09, 7.47e+08, 2.19e+09, 2.19e+09, 4.96e+…
## $ `1984`        <dbl> 7.43e+09, 6.61e+09, 7.60e+08, 2.23e+09, 2.23e+09, 5.06e+…
## $ `1985`        <dbl> 7.45e+09, 6.63e+09, 7.63e+08, 2.23e+09, 2.23e+09, 5.08e+…
## $ `1986`        <dbl> 7.68e+09, 6.83e+09, 7.85e+08, 2.30e+09, 2.30e+09, 5.23e+…
## $ `1987`        <dbl> 6.89e+09, 6.12e+09, 7.05e+08, 2.07e+09, 2.07e+09, 4.69e+…
## $ `1988`        <dbl> 6.32e+09, 5.62e+09, 6.46e+08, 1.90e+09, 1.90e+09, 4.31e+…
## $ `1989`        <dbl> 5.87e+09, 5.22e+09, 6.01e+08, 1.76e+09, 1.76e+09, 4.00e+…
## $ `1990`        <dbl> 5.69e+09, 5.06e+09, 5.82e+08, 1.71e+09, 1.71e+09, 3.88e+…
## $ `1991`        <dbl> 5.28e+09, 4.70e+09, 5.40e+08, 1.51e+09, 1.51e+09, 4.15e+…
## $ `1992`        <dbl> 5.59e+09, 4.97e+09, 5.72e+08, 1.52e+09, 1.52e+09, 4.92e+…
## $ `1993`        <dbl> 4.36e+09, 3.87e+09, 4.46e+08, 1.13e+09, 1.13e+09, 4.22e+…
## $ `1994`        <dbl> 3.52e+09, 3.13e+09, 3.59e+08, 8.70e+08, 8.70e+08, 3.69e+…
## $ `1995`        <dbl> 5.46e+09, 4.86e+09, 5.60e+08, 1.29e+09, 1.29e+09, 6.16e+…
## $ `1996`        <dbl> 5.36e+09, 4.76e+09, 5.48e+08, 1.21e+09, 1.21e+09, 6.42e+…
## $ `1997`        <dbl> 5.25e+09, 4.67e+09, 5.36e+08, 1.14e+09, 1.14e+09, 6.64e+…
## $ `1998`        <dbl> 5.18e+09, 4.60e+09, 5.33e+08, 1.08e+09, 1.08e+09, 6.87e+…
## $ `1999`        <dbl> 5.09e+09, 4.52e+09, 5.17e+08, 1.02e+09, 1.02e+09, 7.04e+…
## $ `2000`        <dbl> 4.95e+09, 4.41e+09, 5.04e+08, 9.53e+08, 9.53e+08, 7.13e+…
## $ `2001`        <dbl> 4.70e+09, 4.17e+09, 4.95e+08, 1.00e+09, 1.00e+09, 6.54e+…
## $ `2002`        <dbl> 7.18e+09, 6.40e+09, 7.02e+08, 1.37e+09, 1.37e+09, 9.49e+…
## $ `2003`        <dbl> 8.87e+09, 7.89e+09, 9.06e+08, 1.54e+09, 1.54e+09, 1.41e+…
## $ `2004`        <dbl> 8.73e+09, 7.66e+09, 1.05e+09, 1.90e+09, 1.90e+09, 1.11e+…
## $ `2005`        <dbl> 1.01e+10, 9.00e+09, 1.06e+09, 2.06e+09, 2.06e+09, 1.14e+…
## $ `2006`        <dbl> 1.07e+10, 9.34e+09, 1.40e+09, 2.06e+09, 2.06e+09, 1.09e+…
## $ `2007`        <dbl> 1.20e+10, 1.04e+10, 1.71e+09, 2.17e+09, 2.17e+09, 1.03e+…
## $ `2008`        <dbl> 1.23e+10, 1.06e+10, 1.73e+09, 2.14e+09, 2.14e+09, 1.24e+…
## $ `2009`        <dbl> 1.29e+10, 1.10e+10, 2.15e+09, 3.12e+09, 3.12e+09, 1.53e+…
## $ `2010`        <dbl> 1.79e+10, 1.57e+10, 2.25e+09, 2.81e+09, 2.81e+09, 1.58e+…
## $ `2011`        <dbl> 1.97e+10, 1.70e+10, 2.69e+09, 2.61e+09, 2.61e+09, 1.72e+…
## $ `2012`        <dbl> 2.91e+10, 2.59e+10, 2.81e+09, 2.85e+09, 2.85e+09, 1.31e+…
## $ `2013`        <dbl> 3.48e+10, 3.14e+10, 2.81e+09, 2.75e+09, 2.75e+09, 8.34e+…
## $ `2014`        <dbl> 3.35e+10, 3.02e+10, 2.76e+09, 2.13e+09, 2.13e+09, 1.20e+…
## $ `2015`        <dbl> 3.53e+10, 3.19e+10, 2.81e+09, 2.29e+09, 2.29e+09, 9.10e+…
## $ `2016`        <dbl> 3.50e+10, 3.16e+10, 2.84e+09, 2.34e+09, 2.34e+09, 7.54e+…
## $ `2017`        <dbl> 3.51e+10, 3.16e+10, 2.85e+09, 2.24e+09, 2.24e+09, 7.60e+…

We tidy the data to make sure it’s in a long format. We also change the indicator names to be more concise, which helps legibility of the graph legend later. We then plot the data to show the GDP and it’s components over time.

tidy_GDP_data  <- UN_GDP_data %>% 
  
  pivot_longer(cols=4:51, 
                names_to="year", 
                values_to = "GDP") %>% 
   mutate(GDP_new = GDP/1e9,
          tidy_IndicatorName = case_when(
     IndicatorName == 'Imports of goods and services' ~ "Imports",
     IndicatorName == 'Household consumption expenditure (including Non-profit institutions serving households)' ~ "Household expenditure",
     IndicatorName == "General government final consumption expenditure" ~ "Government expenditure",
     IndicatorName == "Final consumption expenditure"~ "Final consumption expenditure",
     IndicatorName == "Gross capital formation" ~ "Gross capital formation",
     IndicatorName == "Exports of goods and services" ~ "Exports",
     IndicatorName == "Gross Domestic Product (GDP)" ~ "Gross Domestic Product",
    )) %>%
  drop_na() 

glimpse(tidy_GDP_data)

## Rows: 67,441
## Columns: 7
## $ CountryID          <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ Country            <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgha…
## $ IndicatorName      <chr> "Final consumption expenditure", "Final consumption…
## $ year               <chr> "1970", "1971", "1972", "1973", "1974", "1975", "19…
## $ GDP                <dbl> 5.56e+09, 5.33e+09, 5.20e+09, 5.75e+09, 6.15e+09, 6…
## $ GDP_new            <dbl> 5.56, 5.33, 5.20, 5.75, 6.15, 6.32, 6.37, 6.90, 7.0…
## $ tidy_IndicatorName <chr> "Final consumption expenditure", "Final consumption…

# Let us compare GDP components for these 3 countries
country_list <- c("United States","India", "Germany")
Indicator_list <- c('Gross capital formation', "Exports of goods and services", "General government final consumption expenditure", "Household consumption expenditure (including Non-profit institutions serving households)", "Imports of goods and services"
)
tidy_GDP_data_countrywise <- tidy_GDP_data %>%
  
#  
  filter(Country %in% country_list, IndicatorName %in% Indicator_list) %>% 
  
  group_by(Country) %>% 
   mutate(year = as.integer(year))
  
tidy_GDP_data_countrywise

## # A tibble: 720 × 7
## # Groups:   Country [3]
##    CountryID Country IndicatorName                  year     GDP GDP_new tidy_…¹
##        <dbl> <chr>   <chr>                         <int>   <dbl>   <dbl> <chr>  
##  1       276 Germany Household consumption expend…  1970 8.72e11    872. Househ…
##  2       276 Germany Household consumption expend…  1971 9.19e11    919. Househ…
##  3       276 Germany Household consumption expend…  1972 9.69e11    969. Househ…
##  4       276 Germany Household consumption expend…  1973 9.97e11    997. Househ…
##  5       276 Germany Household consumption expend…  1974 9.95e11    995. Househ…
##  6       276 Germany Household consumption expend…  1975 1.03e12   1032. Househ…
##  7       276 Germany Household consumption expend…  1976 1.08e12   1076. Househ…
##  8       276 Germany Household consumption expend…  1977 1.12e12   1122. Househ…
##  9       276 Germany Household consumption expend…  1978 1.16e12   1162. Househ…
## 10       276 Germany Household consumption expend…  1979 1.20e12   1200. Househ…
## # … with 710 more rows, and abbreviated variable name ¹tidy_IndicatorName

ggplot(tidy_GDP_data_countrywise)+
  aes(x = year, y = GDP_new, color = tidy_IndicatorName)+
  geom_line()+
  facet_wrap(~Country)+
  labs(
    title = 'GDP components over time',
    subtitle = 'In constant 2010 USD',
    
  )+
  ylab('Billion US$')+
  xlab('Year')+
  theme_bw()

We now start testing whether the GDP from the sum of the components is the same as the GDP in reality. Below we calculate the percentage difference between reality and theory.

##         Country total_gdp_by_components total_gdp percentage_difference
## 1       Germany                   67598     43904                  35.1
## 2         India                   18523     13643                  26.3
## 3 United States                  203255    162771                  19.9

There is a percentage difference of 35.1% for Germany, 26.3% for India and 19.9% for United States

In Germany we can see a downward trend in household expenditure and gross capital formation. The net exports have increased starting 2020 and government expenditure has almost remained the same with slight changes in the middle. In India household expenditure has decreased drasitically, by more than 10%. On the oppostie, gross capital formation has seen rapid increase specially after the year 2000. Net exports and Government expenditure have stayed the same over the years. In the United states, Government expenditure has seen the most change, it has decreased by more than 5%. Net exports have reduced by about 1-2%, while both gross capital formation household expenditure have increased by approximately 2.5%. India has the highest levels of gross capital formation and the lowest government expenditure.