Session 2: Homework 1

Rents in San Francisco 2000-2018

# download directly off the tidytuesday github repo

rent <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-05/rent.csv')

What are the variable types? Do they all correspond to what they really are? Which variables have the most missing values?

The variable types are character (chr) and double (dbl). The date column is stored as a double but it should be a date. The descr (listing description) column has the most missing values, followed closely by address and lon.
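If we wanted to fix the date column, a minimal sketch (assuming lubridate is loaded alongside the tidyverse) would convert the numeric yyyymmdd values into proper dates:

library(lubridate)

# convert numeric yyyymmdd (e.g. 20120905) into a Date
rent <- rent %>% 
  mutate(date = ymd(date))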

skimr::skim(rent)
Data summary
Name rent
Number of rows 200796
Number of columns 17
_______________________
Column type frequency:
character 8
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
post_id 0 1.00 9 14 0 200796 0
nhood 0 1.00 4 43 0 167 0
city 0 1.00 5 19 0 104 0
county 1394 0.99 4 13 0 10 0
address 196888 0.02 1 38 0 2869 0
title 2517 0.99 2 298 0 184961 0
descr 197542 0.02 13 16975 0 3025 0
details 192780 0.04 4 595 0 7667 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
date 0 1.00 2.01e+07 44694.07 2.00e+07 2.01e+07 2.01e+07 2.01e+07 2.02e+07 ▁▇▁▆▃
year 0 1.00 2.01e+03 4.48 2.00e+03 2.00e+03 2.01e+03 2.01e+03 2.02e+03 ▁▇▁▆▃
price 0 1.00 2.14e+03 1427.75 2.20e+02 1.30e+03 1.80e+03 2.50e+03 4.00e+04 ▇▁▁▁▁
beds 6608 0.97 1.89e+00 1.08 0.00e+00 1.00e+00 2.00e+00 3.00e+00 1.20e+01 ▇▂▁▁▁
baths 158121 0.21 1.68e+00 0.69 1.00e+00 1.00e+00 2.00e+00 2.00e+00 8.00e+00 ▇▁▁▁▁
sqft 136117 0.32 1.20e+03 5000.22 8.00e+01 7.50e+02 1.00e+03 1.36e+03 9.00e+05 ▇▁▁▁▁
room_in_apt 0 1.00 0.00e+00 0.04 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.00e+00 ▇▁▁▁▁
lat 193145 0.04 3.77e+01 0.35 3.36e+01 3.74e+01 3.78e+01 3.78e+01 4.04e+01 ▁▁▅▇▁
lon 196484 0.02 -1.22e+02 0.78 -1.23e+02 -1.22e+02 -1.22e+02 -1.22e+02 -7.42e+01 ▇▁▁▁▁
rent %>% 
  count(city, sort=TRUE) %>% 
  mutate(prop = n/sum(n)) %>% 
  slice_max(prop, n=20) %>% 
  mutate(city = fct_reorder(city, prop)) %>% 
  ggplot(aes(x=city, y = prop)) +
  coord_flip() +
  geom_col() +
  theme_minimal(base_size=14) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title="San Francisco accounts for more than a quarter of all rental classifieds",
    subtitle="% of Craigslist listongs, 2000-2018",
    caption = "Source: Pennington, Kate (2018). Bay Area Craigslist Rental Housing Posts, 2000-2018",
    x=NULL,
    y=NULL
    )

by_year_sf <- rent %>% 
  filter(city == "san francisco") %>% 
  group_by(year, beds) %>% 
  summarize(median_price = median(price))

by_year_sf %>% 
  filter(beds <=3) %>% 
  ggplot(aes(x=year, y=median_price, colour = factor(beds))) +
  geom_line() +
  labs(title='San Francisco rents have been steadily increasing', subtitle='0 to 3-bed listings, 2000-2018', caption='Source: Pennington, Kate (2018). Bay Area Craigslist Rental Housing Posts, 2000-2018', x='Years', y='Median Prices') +
  theme(legend.position = "none") +
  facet_wrap(~beds, nrow=1)

list_of_top_12_cities=rent %>% 
  count(city, sort=TRUE) %>% 
  mutate(prop = n/sum(n)) %>% 
  slice_max(prop, n=12) %>% 
  mutate(city = fct_reorder(city, prop))

list_of_rent_for_top_12_cities=
  left_join(list_of_top_12_cities,rent,by="city")


list_of_rent_for_top_12_cities %>% 
  filter(beds==1) %>% 
  group_by(city,year) %>% 
  summarise(median=median(price)) %>% 
  ggplot(aes(x=year,y=median, color=city))+geom_line()+
   labs(title='Rental prices for 1-bedroom flats in the Bay Area', caption='Source: Pennington, Kate (2018). Bay Area Craigslist Rental Housing Posts, 2000-2018', x='Years', y='Median Prices') +
  theme(legend.position = "none") +
  facet_wrap(~city)

There is an overall increase in the median rental price across all twelve of these Bay Area cities. San Francisco experienced the highest median rental prices and the steepest upward trend over the 18 years. This could be due to the boom in the tech industry in the Bay Area, which drew more people into the region and pushed up rental prices overall.

Analysis of movies - IMDB dataset

We will look at a subset of movies, taken from the Kaggle IMDB 5000 movie dataset

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …

Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:

  • gross : The gross earnings in the US box office, not adjusted for inflation
  • budget: The movie’s budget
  • cast_facebook_likes: the number of Facebook likes cast members received
  • votes: the number of people who voted for (or rated) the movie in IMDB
  • reviews: the number of reviews for that movie
  • rating: IMDB average rating

Use your data import, inspection, and cleaning skills to answer the following:

  • Are there any missing values (NAs)? Are all entries distinct or are there duplicate entries?

There are no missing values; we can check this using the following code:

skim(movies)
Table 1: Data summary
Name movies
Number of rows 2961
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 1 83 0 2907 0
genre 0 1 5 11 0 17 0
director 0 1 3 32 0 1366 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.00e+03 9.95e+00 1920.0 2.00e+03 2.00e+03 2.01e+03 2.02e+03 ▁▁▁▂▇
duration 0 1 1.10e+02 2.22e+01 37.0 9.50e+01 1.06e+02 1.19e+02 3.30e+02 ▃▇▁▁▁
gross 0 1 5.81e+07 7.25e+07 703.0 1.23e+07 3.47e+07 7.56e+07 7.61e+08 ▇▁▁▁▁
budget 0 1 4.06e+07 4.37e+07 218.0 1.10e+07 2.60e+07 5.50e+07 3.00e+08 ▇▂▁▁▁
cast_facebook_likes 0 1 1.24e+04 2.05e+04 0.0 2.24e+03 4.60e+03 1.69e+04 6.57e+05 ▇▁▁▁▁
votes 0 1 1.09e+05 1.58e+05 5.0 1.99e+04 5.57e+04 1.33e+05 1.69e+06 ▇▁▁▁▁
reviews 0 1 5.03e+02 4.94e+02 2.0 1.99e+02 3.64e+02 6.31e+02 5.31e+03 ▇▁▁▁▁
rating 0 1 6.39e+00 1.05e+00 1.6 5.80e+00 6.50e+00 7.10e+00 9.30e+00 ▁▁▆▇▁

There are 2961 observations, and all rows are distinct:

count(movies) == n_distinct(movies)
##         n
## [1,] TRUE
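Note that while no full row is duplicated, skim() reports only 2907 unique titles across the 2961 rows, so some titles appear more than once. A quick sketch to inspect the repeated titles:

movies %>% 
  count(title, sort = TRUE) %>% 
  filter(n > 1)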
  • Produce a table with the count of movies by genre, ranked in descending order
movies %>% 
  group_by(genre) %>% 
  summarise(count_genre = n()) %>% 
  arrange(desc(count_genre))
## # A tibble: 17 × 2
##    genre       count_genre
##    <chr>             <int>
##  1 Comedy              848
##  2 Action              738
##  3 Drama               498
##  4 Adventure           288
##  5 Crime               202
##  6 Biography           135
##  7 Horror              131
##  8 Animation            35
##  9 Fantasy              28
## 10 Documentary          25
## 11 Mystery              16
## 12 Sci-Fi                7
## 13 Family                3
## 14 Musical               2
## 15 Romance               2
## 16 Western               2
## 17 Thriller              1
  • Produce a table with the average gross earning and budget (gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. Ranked genres by this return_on_budget in descending order
movies %>% 
  group_by(genre) %>% 
  summarise(mean_gross = mean(gross), mean_budget = mean(budget)) %>% 
  mutate(return_on_budget = mean_gross/mean_budget) %>% 
  arrange(desc(return_on_budget))
## # A tibble: 17 × 4
##    genre       mean_gross mean_budget return_on_budget
##    <chr>            <dbl>       <dbl>            <dbl>
##  1 Musical      92084000     3189500          28.9    
##  2 Family      149160478.   14833333.         10.1    
##  3 Western      20821884     3465000           6.01   
##  4 Documentary  17353973.    5887852.          2.95   
##  5 Horror       37713738.   13504916.          2.79   
##  6 Fantasy      42408841.   17582143.          2.41   
##  7 Comedy       42630552.   24446319.          1.74   
##  8 Mystery      67533021.   39218750           1.72   
##  9 Animation    98433792.   61701429.          1.60   
## 10 Biography    45201805.   28543696.          1.58   
## 11 Adventure    95794257.   66290069.          1.45   
## 12 Drama        37465371.   26242933.          1.43   
## 13 Crime        37502397.   26596169.          1.41   
## 14 Romance      31264848.   25107500           1.25   
## 15 Action       86583860.   71354888.          1.21   
## 16 Sci-Fi       29788371.   27607143.          1.08   
## 17 Thriller         2468      300000           0.00823
  • Produce a table that shows the top 15 directors who have created the highest gross revenue in the box office. Don’t just show the total gross amount, but also the mean, median, and standard deviation per director.
movies %>% 
  group_by(director) %>% 
  summarise(total_gross = sum(gross), 
            mean_gross = mean(gross), 
            med_gross = median(gross),
            sd_gross = sd(gross)) %>% 
  slice_max(total_gross, n=15)
## # A tibble: 15 × 5
##    director          total_gross mean_gross  med_gross   sd_gross
##    <chr>                   <dbl>      <dbl>      <dbl>      <dbl>
##  1 Steven Spielberg   4014061704 174524422. 164435221  101421051.
##  2 Michael Bay        2231242537 171634041. 138396624  127161579.
##  3 Tim Burton         2071275480 129454718.  76519172  108726924.
##  4 Sam Raimi          2014600898 201460090. 234903076  162126632.
##  5 James Cameron      1909725910 318287652. 175562880. 309171337.
##  6 Christopher Nolan  1813227576 226653447  196667606. 187224133.
##  7 George Lucas       1741418480 348283696  380262555  146193880.
##  8 Robert Zemeckis    1619309108 124562239. 100853835   91300279.
##  9 Clint Eastwood     1378321100  72543216.  46700000   75487408.
## 10 Francis Lawrence   1358501971 271700394. 281666058  135437020.
## 11 Ron Howard         1335988092 111332341  101587923   81933761.
## 12 Gore Verbinski     1329600995 189942999. 123207194  154473822.
## 13 Andrew Adamson     1137446920 284361730  279680930. 120895765.
## 14 Shawn Levy         1129750988 102704635.  85463309   65484773.
## 15 Ridley Scott       1128857598  80632686.  47775715   68812285.
  • Finally, ratings. Produce a table that describes how ratings are distributed by genre. We don’t want just the mean, but also, min, max, median, SD and some kind of a histogram or density graph that visually shows how ratings are distributed.
# TABLE WITH SUMMARY STATISTICS
movies %>% 
  group_by(genre) %>% 
  summarise(mean_rating  = mean(rating), 
            min_rating   = min(rating),
            max_rating   = max(rating),
            med_rating   = median(rating),
            sd_rating    = sd(rating))
## # A tibble: 17 × 6
##    genre       mean_rating min_rating max_rating med_rating sd_rating
##    <chr>             <dbl>      <dbl>      <dbl>      <dbl>     <dbl>
##  1 Action             6.23        2.1        9         6.3      1.03 
##  2 Adventure          6.51        2.3        8.6       6.6      1.09 
##  3 Animation          6.65        4.5        8         6.9      0.968
##  4 Biography          7.11        4.5        8.9       7.2      0.760
##  5 Comedy             6.11        1.9        8.8       6.2      1.02 
##  6 Crime              6.92        4.8        9.3       6.9      0.849
##  7 Documentary        6.66        1.6        8.5       7.4      1.77 
##  8 Drama              6.73        2.1        8.8       6.8      0.917
##  9 Family             6.5         5.7        7.9       5.9      1.22 
## 10 Fantasy            6.15        4.3        7.9       6.45     0.959
## 11 Horror             5.83        3.6        8.5       5.9      1.01 
## 12 Musical            6.75        6.3        7.2       6.75     0.636
## 13 Mystery            6.86        4.6        8.5       6.9      0.882
## 14 Romance            6.65        6.2        7.1       6.65     0.636
## 15 Sci-Fi             6.66        5          8.2       6.4      1.09 
## 16 Thriller           4.8         4.8        4.8       4.8     NA    
## 17 Western            5.7         4.1        7.3       5.7      2.26
movies %>% 
  ggplot(aes(x = rating, color = genre, fill = genre)) +
  geom_density(size=1, alpha = .1) +
  theme_minimal() +
  facet_wrap(~genre) +
  labs(
    title="Ratings density by Genre",
    subtitle="IMDB Movies Dataset",
    caption = "Source: Kaggle IMDB 5000 movie dataset",
    x="Rating",
    y="% Density"
    ) +
  theme(legend.position = "none")

Use ggplot to answer the following

  • Examine the relationship between gross and cast_facebook_likes. Produce a scatterplot and write one sentence discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X- axes?
movies %>% 
  ggplot(aes(x=cast_facebook_likes, y=gross)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_continuous(limits = c(0, 150000)) + 
  theme_minimal() +
  labs(
    title="Relationship between Facebook Likes and Gross Profit",
    subtitle="IMDB Movies Dataset",
    caption = "Source: Kaggle IMDB 5000 movie dataset",
    x="Cast movie likes",
    y="Gross profit ($M)"
    ) + NULL

We chose to plot cast Facebook likes on the X axis and gross earnings on the Y axis. The number of cast Facebook likes is not a good predictor: there is no clearly discernible correlation between the two variables. A linear fit shows a small positive association, but not one strong enough to be useful. Note that we cut three outliers by limiting the x axis, as they did not help in visualising the general tendency of the data.
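As a rough numerical check of this claim (a sketch; skim() confirms these columns have no missing values), we could compute the correlation directly:

movies %>% 
  summarise(cor_likes_gross = cor(cast_facebook_likes, gross))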

  • Examine the relationship between gross and budget. Produce a scatterplot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.
ggplot(movies, aes(x= budget, y = gross))+
  geom_point()+geom_smooth(method=lm)+ 
  theme_minimal() +
  labs(
    title="Relationship between Budget and Gross Profit",
    subtitle="IMDB Movies Dataset",
    caption = "Source: Kaggle IMDB 5000 movie dataset",
    x="Movie budget ($M)",
    y="Gross profit ($M)"
    ) + NULL

Budget is a good predictor of how much money a movie will make, as shown by the clear positive correlation between the two variables.
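The same quick correlation sketch could be used to back this up numerically:

movies %>% 
  summarise(cor_budget_gross = cor(budget, gross))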

  • Examine the relationship between gross and rating. Produce a scatterplot, faceted by genre and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?
ggplot(movies, aes(x= rating, y = gross))+
  facet_wrap(vars(genre))+
  geom_point()+geom_smooth(method=lm)+ 
  theme_minimal() +
  labs(
    title="Relationship between Rating and Gross Profit",
    subtitle="IMDB Movies Dataset",
    caption = "Source: Kaggle IMDB 5000 movie dataset",
    x="Movie Rating",
    y="Gross profit ($M)"
    ) + NULL

Rating is not a good predictor, as there is no significant relationship between movie rating and gross earnings within most genres. The dataset also does not include enough observations for genres such as Sci-Fi or Family to determine a correlation between gross and rating.

Returns of financial stocks

nyse <- read_csv(here::here("data","nyse.csv"))

Based on this dataset, create a table and a bar plot that shows the number of companies per sector, in descending order

# TABLE OF COMPANIES PER SECTOR:
nyse %>% 
  group_by(sector) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count)) -> companies_per_sector

# PRINT THE TABLE
companies_per_sector
## # A tibble: 12 × 2
##    sector                count
##    <chr>                 <int>
##  1 Finance                  97
##  2 Consumer Services        79
##  3 Public Utilities         60
##  4 Capital Goods            45
##  5 Health Care              45
##  6 Energy                   42
##  7 Technology               40
##  8 Basic Industries         39
##  9 Consumer Non-Durables    31
## 10 Miscellaneous            12
## 11 Transportation           10
## 12 Consumer Durables         8
# PLOT:
companies_per_sector %>% 
  mutate(sector = fct_reorder(sector, count)) %>% 
  ggplot(aes(x=sector, y=count)) +
  geom_col(fill="#001461") + 
  theme_minimal() +
  coord_flip() +
  labs(
    title="Number of companies per sector",
    subtitle="NYSE Stock Data",
    caption = "Source: NYSE & SP500",
    x=NULL,
    y=NULL
  )

Next, let’s choose some stocks and their ticker symbols and download some data. You MUST choose 6 different stocks from the ones listed below; You should, however, add SPY which is the SP500 ETF (Exchange Traded Fund).

# Notice the cache=TRUE argument in the chunk options. Because getting data is time-consuming, 
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd

# We chose to delete DPZ
myStocks <- c("AAPL","JPM","DIS","ANF","TSLA","XOM","SPY" ) %>%
  tq_get(get  = "stock.prices",
         from = "2011-01-01",
         to   = "2022-08-31") %>%
  group_by(symbol) 

glimpse(myStocks) # examine the structure of the resulting data frame
## Rows: 20,545
## Columns: 8
## Groups: symbol [7]
## $ symbol   <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL…
## $ date     <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, …
## $ open     <dbl> 11.6, 11.9, 11.8, 12.0, 11.9, 12.1, 12.3, 12.3, 12.3, 12.4, 1…
## $ high     <dbl> 11.8, 11.9, 11.9, 12.0, 12.0, 12.3, 12.3, 12.3, 12.4, 12.4, 1…
## $ low      <dbl> 11.6, 11.7, 11.8, 11.9, 11.9, 12.0, 12.1, 12.2, 12.3, 12.3, 1…
## $ close    <dbl> 11.8, 11.8, 11.9, 11.9, 12.0, 12.2, 12.2, 12.3, 12.3, 12.4, 1…
## $ volume   <dbl> 4.45e+08, 3.09e+08, 2.56e+08, 3.00e+08, 3.12e+08, 4.49e+08, 4…
## $ adjusted <dbl> 10.05, 10.10, 10.18, 10.18, 10.25, 10.44, 10.42, 10.50, 10.54…

Financial performance analysis depends on returns. If I buy a stock today for 100 and sell it tomorrow for 101.75, my one-day return, assuming no transaction costs, is 1.75%. So, given the adjusted closing prices, our first step is to calculate daily and monthly returns.
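As a quick worked example for the hypothetical prices above, the arithmetic and log versions of that one-day return are:

p0 <- 100
p1 <- 101.75

(p1 - p0) / p0   # arithmetic (simple) return: 0.0175, i.e. 1.75%
log(p1 / p0)     # log return: ~0.0173, the "log" type used for the daily returns below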

#calculate daily returns
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily_returns",
               cols = c(nested.col))  

#calculate monthly  returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly_returns",
               cols = c(nested.col)) 

#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
  group_by(symbol) %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "yearly", 
               type       = "arithmetic",
               col_rename = "yearly_returns",
               cols = c(nested.col))

Create a table where you summarise monthly returns for each of the stocks and SPY; min, max, median, mean, SD.

myStocks_returns_monthly %>% 
  group_by(symbol) %>% 
  summarise(min_return = min(monthly_returns),
            max_return = max(monthly_returns),
            med_return = median(monthly_returns),
            avg_return = mean(monthly_returns),
            sd_return  = sd(monthly_returns))
## # A tibble: 7 × 6
##   symbol min_return max_return med_return avg_return sd_return
##   <chr>       <dbl>      <dbl>      <dbl>      <dbl>     <dbl>
## 1 AAPL       -0.181      0.217    0.0230     0.0230     0.0791
## 2 ANF        -0.421      0.507    0.00105    0.00337    0.146 
## 3 DIS        -0.186      0.234    0.00725    0.0113     0.0721
## 4 JPM        -0.229      0.202    0.0199     0.0119     0.0727
## 5 SPY        -0.125      0.127    0.0146     0.0106     0.0404
## 6 TSLA       -0.224      0.811    0.0117     0.0501     0.177 
## 7 XOM        -0.262      0.241    0.00373    0.00756    0.0697

Plot a density plot, using geom_density(), for each of the stocks

myStocks_returns_monthly %>% 
  ggplot(aes(x = monthly_returns, color = symbol, fill = symbol)) +
  geom_density(size=2, alpha = .1) +
  theme_minimal() +
  labs(
    title="Returns by Symbol - Density Plot",
    subtitle="NYSE Stock Data",
    caption = "Source: NYSE & SP500",
    x="Monthly Return",
    y="% Density"
    )

What can you infer from this plot? Which stock is the riskiest? The least risky?

TSLA has by far the widest and flattest density of monthly returns, and the largest standard deviation in the table above, making it the riskiest of these stocks. ANF is a close second, with a similarly wide distribution but a median return close to zero.

The S&P500 index has clearly the least risky performance, so an investor could choose to invest in ETFs that mimic the S&P500. Another safe stock is JPM, which has a positive median return, and most of its density is positive as well. It also has very low variance.

Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) in the X-axis. Please use ggrepel::geom_text_repel() to label each stock

myStocks_returns_monthly %>%
  group_by(symbol) %>% 
  summarise(mean_return = mean(monthly_returns), sd_return = sd(monthly_returns)) %>% 
  ggplot(aes(x = mean_return, y = sd_return,  color = symbol, fill = symbol)) +
         geom_point(size=10, alpha=.5) + 
         theme_minimal() +
  labs(
    title="Mean VS Standard Deviation Scatterplot",
    subtitle="NYSE Stock Data",
    caption = "Source: NYSE & SP500",
    x="Mean of Return",
    y="Standard Deviation of Return"
  )

What can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?

This plot confirms our previous findings from the density plot. TSLA is the riskiest option but also offers possibly the highest expected return. ANF is also a very risky stock, yet with low expected returns, so investors should avoid it, since other stocks offer better mean returns with less risk. SPY, however, is the least risky investment, with a correspondingly low expected return.

On your own: Spotify

Spotify has an API, an Application Programming Interface. APIs are ways for computer programs to talk to each other. So while we use the Spotify app to look up songs and artists, computers use the Spotify API to talk to the Spotify server. There is an R package that allows R to talk to this API: spotifyr. One of your team members needs to sign up for a Spotify developer account, and then you can download data about one's Spotify usage. A detailed article on how to go about it can be found here: Explore your activity on Spotify with R and spotifyr
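As a minimal sketch of how spotifyr connects to the API (assuming you have created a Spotify developer app; the client ID and secret below are placeholders):

library(spotifyr)

Sys.setenv(SPOTIFY_CLIENT_ID = "your-client-id")
Sys.setenv(SPOTIFY_CLIENT_SECRET = "your-client-secret")

access_token <- get_spotify_access_token()

# e.g. pull audio features for a single artist
artist_features <- get_artist_audio_features("radiohead")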

If you do not want to use the API, you can download a sample of over 32K songs by having a look at https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md

spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

The data dictionary can be found below

variable class description
track_id character Song unique ID
track_name character Song Name
track_artist character Song Artist
track_popularity double Song Popularity (0-100) where higher is better
track_album_id character Album unique ID
track_album_name character Song album name
track_album_release_date character Date when album released
playlist_name character Name of playlist
playlist_id character Playlist ID
playlist_genre character Playlist genre
playlist_subgenre character Playlist subgenre
danceability double Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy double Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key double The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness double The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
mode double Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness double Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness double A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness double Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness double Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence double A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo double The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms double Duration of song in milliseconds

In this dataset there are only 6 values of playlist_genre, but we can still perform some EDA on it.

Produce a one-page summary describing this dataset. Here is a non-exhaustive list of questions:

  1. What is the distribution of songs’ popularity (track_popularity)? Does it look like a Normal distribution?

    It does not look like a Normal distribution; there is a very high frequency of tracks with track_popularity close to 0.

count_track_popularity <- spotify_songs %>%
  select(track_name, track_popularity) %>%
   group_by(track_popularity) %>%
   summarize(count=n()) 
# %>%
count_track_popularity
## # A tibble: 101 × 2
##    track_popularity count
##               <dbl> <int>
##  1                0  2703
##  2                1   575
##  3                2   387
##  4                3   321
##  5                4   240
##  6                5   240
##  7                6   192
##  8                7   189
##  9                8   201
## 10                9   195
## # … with 91 more rows
  ggplot(data=count_track_popularity, aes(x=track_popularity, y=count)) +
   geom_col() +
  labs(
    title="Track Popularity",
    x="Track Popularity",
    y=NULL
  )

The distribution of song popularity does not resemble a Normal distribution: the graph shows two distinct modes. A large share of songs has zero popularity, but once songs gain some traction their popularity is roughly bell-shaped around a mean of about 55.
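One way to check the second mode is to drop the spike of zero-popularity tracks and re-plot the distribution (a sketch):

spotify_songs %>% 
  filter(track_popularity > 0) %>% 
  ggplot(aes(x = track_popularity)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Track Popularity", y = "Count")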

  2. There are 12 audio features for each track, including confidence measures like acousticness, liveness, speechiness and instrumentalness, perceptual measures like energy, loudness, danceability and valence (positiveness), and descriptors like duration, tempo, key, and mode. How are they distributed? Can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?

    According to our visualization, most of the audio features are not normally distributed; liveness and speechiness in particular are far from Normal.

library(GGally)

spotify_songs %>% 
  select(12:23) %>% 
  GGally::ggpairs() +
  theme_minimal(base_size = 8)

The distributions are visualised above. The variables “danceability”, “energy”, “valence” and “duration_ms” are closest to a Normal distribution.
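A rough way to guess at normality from summary statistics alone is to compare the mean and median of each feature, since for a roughly symmetric (Normal-like) variable they should be close (a sketch):

spotify_songs %>% 
  summarise(across(danceability:duration_ms,
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        median = ~ median(.x, na.rm = TRUE))))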

  3. Is there any relationship between valence and track_popularity? What about danceability and track_popularity?
# valence and track_popularity

ggplot(data=spotify_songs, aes(x=valence, y=track_popularity)) +
  geom_point(size=1, alpha=.4,color="#001461") +
  geom_smooth(method = "lm", color="red") +
  labs(
    title="Relation between Valence and Track Popularity",
    x = "Valence",
    y = "Track Popularity",
  ) +
  theme_minimal()

ggplot(data=spotify_songs, aes(x=danceability, y=track_popularity)) +
  geom_point(size=1, alpha=.4, color="#001461") +
  geom_smooth(method = "lm", color="red") +
  labs(
    title="Relation between Danceability and Track Popularity",
    x = "Danceability",
    y = "Track Popularity",
  ) +
  theme_minimal()

spotify_songs
## # A tibble: 32,833 × 23
##    track_id      track…¹ track…² track…³ track…⁴ track…⁵ track…⁶ playl…⁷ playl…⁸
##    <chr>         <chr>   <chr>     <dbl> <chr>   <chr>   <chr>   <chr>   <chr>  
##  1 6f807x0ima9a… I Don'… Ed She…      66 2oCs0D… I Don'… 2019-0… Pop Re… 37i9dQ…
##  2 0r7CVbZTWZgb… Memori… Maroon…      67 63rPSO… Memori… 2019-1… Pop Re… 37i9dQ…
##  3 1z1Hg7Vb0AhH… All th… Zara L…      70 1HoSmj… All th… 2019-0… Pop Re… 37i9dQ…
##  4 75FpbthrwQmz… Call Y… The Ch…      60 1nqYsO… Call Y… 2019-0… Pop Re… 37i9dQ…
##  5 1e8PAfcKUYoK… Someon… Lewis …      69 7m7vv9… Someon… 2019-0… Pop Re… 37i9dQ…
##  6 7fvUMiyapMsR… Beauti… Ed She…      67 2yiy9c… Beauti… 2019-0… Pop Re… 37i9dQ…
##  7 2OAylPUDDfwR… Never … Katy P…      62 7INHYS… Never … 2019-0… Pop Re… 37i9dQ…
##  8 6b1RNvAcJjQH… Post M… Sam Fe…      69 6703SR… Post M… 2019-0… Pop Re… 37i9dQ…
##  9 7bF6tCO3gFb8… Tough … Avicii       68 7CvAfG… Tough … 2019-0… Pop Re… 37i9dQ…
## 10 1IXGILkPm0tO… If I C… Shawn …      67 4Qxzbf… If I C… 2019-0… Pop Re… 37i9dQ…
## # … with 32,823 more rows, 14 more variables: playlist_genre <chr>,
## #   playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## #   loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>, and abbreviated variable names ¹​track_name,
## #   ²​track_artist, ³​track_popularity, ⁴​track_album_id, ⁵​track_album_name,
## #   ⁶​track_album_release_date, ⁷​playlist_name, ⁸​playlist_id

Based on our scatter plots, there seems to be no apparent relationship between valence and track_popularity, and none between danceability and track_popularity either. The lines of best fit support this.
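To put a number on the weak association seen in the scatterplots, a quick correlation sketch (complete cases only, in case of missing values):

spotify_songs %>% 
  summarise(cor_valence      = cor(valence, track_popularity, use = "complete.obs"),
            cor_danceability = cor(danceability, track_popularity, use = "complete.obs"))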

  4. mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. Do songs written in a major scale have higher danceability compared to those in a minor scale? What about track_popularity?
# danceability

spotify_songs %>%
  group_by(mode) %>%
  summarize(median_danceability = median(danceability)) %>%
  ggplot(aes(x = factor(mode), y = median_danceability, fill = factor(mode))) +
  geom_col() +
  theme(legend.position = "none") +
  labs(
    title = "Measure of Danceability with respect to Modality",
    x = "Modality (0 = minor, 1 = major)",
    y = "Median Danceability"
  )

# track_popularity 

spotify_songs %>%
group_by(mode) %>%
summarize(median_track_popularity = median(track_popularity)) %>%
  ggplot(aes(x = factor(mode), y = median_track_popularity, fill = factor(mode))) +
  geom_col() +
  theme(legend.position = "none") +
  labs(
    title = "Measure of Track Popularity with respect to Modality",
    x = "Modality (0 = minor, 1 = major)",
    y = "Median Track Popularity"
  )

Songs written in a major scale have lower danceability than those written in a minor scale. However, songs written in a major scale have higher track_popularity than those in a minor scale.

Challenge 1: Replicating a chart

The purpose of this exercise is to reproduce a plot using your dplyr and ggplot2 skills. It builds on exercise 1, the San Francisco rentals data.

You have to create a graph that calculates the cumulative % change for 0-, 1-, and 2-bed flats between 2000 and 2018 for the top twelve cities in the Bay Area, by number of ads that appeared in Craigslist.

list_of_top_12_cities=rent %>% 
  count(city, sort=TRUE) %>% 
  mutate(prop = n/sum(n)) %>% 
  slice_max(prop, n=12) %>% 
  mutate(city = fct_reorder(city, prop))

list_of_rent_for_top_12_cities=
  left_join(list_of_top_12_cities,rent,by="city")


by_city_year <- list_of_rent_for_top_12_cities %>% 
  filter(beds<=2) %>% 
  group_by(beds,city,year) %>% 
  summarize(median_price=median(price))
  
pct_change <- by_city_year %>% 
  # SUGESTED BY KOSTIS:
  # mutate(
  #   delta = median_price/ lag(median_price),
  #   delta = ifelse(is.na(delta),1,delta),
  #   cumulative_delta = cumprod(delta)
  # ) %>% 
  
  # OUR ATTEMPT:
  mutate(lag.value = dplyr::lag(median_price, n = 1, default = NA)) %>% 
  mutate(pct_change = case_when(is.na(lag.value) ~ 1, lag.value>0 ~ median_price/lag.value)) %>% 
  mutate(cum_prod = cumprod(pct_change)) %>% 
  mutate(cum_prod = cum_prod)

pct_change %>% 
  ggplot(aes(x=year,y=cum_prod, color=city))+
  geom_line()+
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
   labs(title='Cumulative % change in 0, 1 and 2-bed rentals in Bay Area', subtitle='2000-2018', x="", y="") +
  theme(legend.position = "none", axis.text.x = element_text(angle = 90)) +
  facet_grid(vars(beds),vars(city))

Challenge 2: 2016 California Contributors plots

As discussed in class, I would like you to reproduce the plot that shows the top ten cities in highest amounts raised in political contributions in California during the 2016 US Presidential election.

To get this plot, you must join two dataframes; the one you have with all contributions, and data that can translate zipcodes to cities. You can find a file with all US zipcodes, e.g., here http://www.uszipcodelist.com/download.html.

The easiest way would be to create two plots and then place them next to each other. For this, you will need the patchwork package: https://cran.r-project.org/web/packages/patchwork/index.html

While this is ok, what if one asked you to create the same plot for the top 10 candidates and not just the top two? The most challenging part is how to reorder within categories, and for this you will find Julia Silge’s post on REORDERING AND FACETTING FOR GGPLOT2 useful.

# Make sure you use vroom() as it is significantly faster than read.csv()
CA_contributors_2016 <- vroom::vroom(here::here("data","CA_contributors_2016.csv"))

zip_codes_us <- vroom::vroom(here::here("data","zip_code_database.csv"))
CA_contributors_t_and_h <- # TRUMP AND HILLARY DATA
  CA_contributors_2016 %>% 
  filter(cand_nm %in% c('Clinton, Hillary Rodham', 'Trump, Donald J.')) %>% 
  mutate(zip = as.character(zip)) %>% # MAKING SURE THEY HAVE THE SAME TYPE
  left_join(zip_codes_us, by="zip")

library(tidytext)
CA_contributors_t_and_h %>% 
  group_by(cand_nm,primary_city) %>% 
  summarise(total_contrib = sum(contb_receipt_amt)) %>% 
  ungroup %>% 
  group_by(cand_nm) %>% 
  top_n(10) %>% 
  ungroup %>% 
  mutate(candidate = as.factor(cand_nm), 
         city = reorder_within(primary_city, total_contrib, candidate)) %>% 
  ggplot(aes(city, total_contrib, fill=candidate)) + 
    geom_col(show.legend = FALSE) +
    facet_wrap(~candidate, scales = "free") +
    coord_flip() +
    scale_x_reordered() + 
    scale_y_continuous(expand = c(0,0), labels = scales::dollar) +
    scale_fill_manual(values=c("#00AEF3", "#E81B23")) +
    theme_minimal() +
    labs(
      title="Where did candidates raise more money?",
      subtitle = "2016 Elections Contributions",
      caption = "Source: 2016 US Presidential elections database",
      y = "Amount raised ($)",
      x = NULL
    )

If we were to plot the top 10 candidates, this is how we would approach the code:

# TOP 10 CANDIDATES SEARCH:
top_10_candidates <-
  CA_contributors_2016 %>% 
    group_by(cand_nm) %>% 
    summarise(total_contrib = sum(contb_receipt_amt)) %>% 
    slice_max(total_contrib, n=10)

# FILTER BY TOP TEN CANDIDATES WITH JOIN:
CA_contributors_top10 <- 
  left_join(top_10_candidates, CA_contributors_2016, by="cand_nm") %>% 
  mutate(zip = as.character(zip)) %>% # MAKING SURE THEY HAVE THE SAME TYPE
  left_join(zip_codes_us, by="zip")

CA_contributors_top10 %>% 
  group_by(cand_nm,primary_city) %>% 
  summarise(total_contrib = sum(contb_receipt_amt)) %>% 
  ungroup %>% 
  group_by(cand_nm) %>% 
  top_n(10) %>% 
  ungroup %>% 
  mutate(candidate = as.factor(cand_nm), 
         city = reorder_within(primary_city, total_contrib, candidate)) %>% 
  ggplot(aes(city, total_contrib, fill=candidate)) + 
    geom_col(show.legend = FALSE) +
    facet_wrap(~candidate, scales = "free") +
    coord_flip() +
    scale_x_reordered() + 
    scale_y_continuous(expand = c(0,0), labels = scales::dollar) +
    theme_minimal() +
    labs(
      title="Where did candidates raise more money?",
      subtitle = "2016 Elections Contributions",
      caption = "Source: 2016 US Presidential elections database",
      y = "Amount raised ($)",
      x = NULL
    )

Deliverables

There is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.

Details

  • Who did you collaborate with: Ishaan Khetan, Kathlyn Lee, Emilia Moskala, Juan Sanchez-Blanco, Sylvie Zheng, Jingyi Fang
  • Approximately how much time did you spend on this problem set: 10h
  • What, if anything, gave you the most trouble: Both challenges gave us the most trouble

Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?

Yes

Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.