Take Home Exercise 1

Author

Liu Jiaqi

The Task

In this take-home exercise, you are required to apply the concepts and methods you had learned in Lesson 1-4 to reveal the demographic and financial characteristics of the city of Engagement byusing appropriate static and interactive statistical graphics methods. This exercise requires a user-friendly and interactive solution that helps city managers and planners to explore the complex data in an engaging way and reveal hidden patterns.

The Data

For the purpose of this study, two data sets are provided. They are:

Participants.csv

Contains information about the residents of City of Engagement that have agreed to participate in this study.

participantId (integer): unique ID assigned to each participant.
householdSize (integer): the number of people in the participant’s household
haveKids (boolean): whether there are children living in the participant’s household.
age (integer): participant’s age in years at the start of the study.
educationLevel (string factor): the participant’s education level, one of: {“Low”, “HighSchoolOrCollege”, “Bachelors”, “Graduate”}
interestGroup (char): a char representing the participant’s stated primary interest group, one of {“A”, “B”, “C”, “D”, “E”, “F”, “G”, “H”, “I”, “J”}. Note: specific topics of interest have been redacted to avoid bias.
joviality (float): a value ranging from [0,1] indicating the participant’s overall happiness level at the start of the study.

FinancialJournal.csv

Contains information about financial transactions.

participantId (integer): unique ID corresponding to the participant affected
timestamp (datetime): the time when the check-in was logged
category (string factor): a string describing the expense category, one of {“Education”, “Food”, “Recreation”, “RentAdjustment”, “Shelter”, “Wage”}
amount (double): the amount of the transaction

Data Preparation

Installing and loading the required libraries

pacman::p_load(tidyverse, DT, patchwork,ggiraph, plotly, ggrepel, ggstatsplot, ggdist, ggplot2, reshape2, forcats, ggridges)

Importing Data

participants_data <- read_csv("data/participants.csv")
financial_data <- read_csv("data/FinancialJournal.csv")

Preparing Data

Take a look at participants data

glimpse(participants_data)

Rows: 1,011
Columns: 7
$ participantId  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ householdSize  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ haveKids       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ age            <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 27, 34, 18,…
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege", "HighScho…
$ interestGroup  <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", "D", "D", …
$ joviality      <dbl> 0.001626703, 0.328086500, 0.393469590, 0.138063446, 0.8…

Turn education level into ordinal variables

participants_data$educationLevel <- factor(participants_data$educationLevel, levels = c("Low", "HighSchoolOrCollege", "Bachelors", "Graduate"))

Final participants_data df

glimpse(participants_data)

Rows: 1,011
Columns: 7
$ participantId  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ householdSize  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ haveKids       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ age            <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 27, 34, 18,…
$ educationLevel <fct> HighSchoolOrCollege, HighSchoolOrCollege, HighSchoolOrC…
$ interestGroup  <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", "D", "D", …
$ joviality      <dbl> 0.001626703, 0.328086500, 0.393469590, 0.138063446, 0.8…

Take a look at financial_data

glimpse(financial_data)

Rows: 1,513,636
Columns: 4
$ participantId <dbl> 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6,…
$ timestamp     <dttm> 2022-03-01, 2022-03-01, 2022-03-01, 2022-03-01, 2022-03…
$ category      <chr> "Wage", "Shelter", "Education", "Wage", "Shelter", "Educ…
$ amount        <dbl> 2472.50756, -554.98862, -38.00538, 2046.56221, -554.9886…

Format timestamp into year-month-date Add additional column year-month

financial_data$timestamp <- format(as.Date(financial_data$timestamp), "%Y-%m-%d")
financial_data$yearmonth <- format(as.Date(financial_data$timestamp), "%Y-%m")

Final financial_data df

glimpse(financial_data)

Rows: 1,513,636
Columns: 5
$ participantId <dbl> 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6,…
$ timestamp     <chr> "2022-03-01", "2022-03-01", "2022-03-01", "2022-03-01", …
$ category      <chr> "Wage", "Shelter", "Education", "Wage", "Shelter", "Educ…
$ amount        <dbl> 2472.50756, -554.98862, -38.00538, 2046.56221, -554.9886…
$ yearmonth     <chr> "2022-03", "2022-03", "2022-03", "2022-03", "2022-03", "…

Data Visualization

Participants Data

Understanding the variables

Distribution of joviality

Below is the density plot of joviality with mean and median. Mean and median close to each other and they are both slightly lower than 0.5 which means that people are ‘averagely happy’ in this town. There are fewer people who are extremly unhappy or happy but the distribution does not follow normal distribution.

ggplot(data=participants_data, aes(joviality)) +
  geom_density(fill = "lightblue", color = "black") +
  labs(title = "Density plot of joviality", x = "joviality", y ="density") +
  geom_vline(aes(xintercept = mean(joviality)), color = "red", linetype = "dashed", size = 1) +
  geom_vline(aes(xintercept = median(joviality)), color = "blue", linetype = "dashed", size = 1) +
  annotate("text", x = mean(participants_data$joviality), y = 1, label = paste("Mean =", round(mean(participants_data$joviality), 2)), color = "red", vjust = -2) +
  annotate("text", x = median(participants_data$joviality), y = 1, label = paste("Median =", round(median(participants_data$joviality), 2)), color = "blue", vjust = -0.5) +
  theme_minimal()

set.seed(123)

gghistostats(
  data = participants_data,
  x = joviality,
  type = "bayes",
  test.value = 0.5,
  xlab = "joviality"
)

Distribution of Age

From distribution of age for the participants we can see that there’s a large number of young adults at around 30 years old and a relatively smaller proportion of people who are below 20 years old. Other than that, the distribution is not extremely skewed.

ggplot(data=participants_data, 
       aes(x= age)) +
  geom_histogram(bins=20,            
                 color="black",      
                 fill="light blue") +
  ggtitle("Distribution of age")

Distribution of Household Size

Interestingly the max household size is 3. Not sure if this is the case or a limitation of the survey. The distribution of household size is relatively uniform.

ggplot(data=participants_data, 
             aes(x = householdSize)) +
  geom_bar(fill="light blue") + 
  ggtitle("Distribution of householdSize")

Distribution of havingKids

This variable correlates strongly with householdSize. Number of people with householdSize = 3 equals that of havingKids = True.

ggplot(data=participants_data, 
             aes(x = haveKids)) +
  geom_bar(fill="light blue") + 
  ggtitle("Distribution of haveKids")

Distribution of educationLevel

More than half of the participants have educationLevel of High School or College. Low educationLevel has the lowest percentage.

ggplot(data=participants_data, 
             aes(x = educationLevel)) +
  geom_bar(fill="light blue") + 
  ggtitle("Distribution of educationLevel")

Distribution of interestGroup

Each interest group seem to have relatively similar number of participants. None with significantly higher or lower number of people.

ggplot(data=participants_data, 
             aes(x = interestGroup)) +
  geom_bar(fill="light blue") + 
  ggtitle("Distribution of interestGroup")

Understanding Correlation between joviality and other variables

joviality vs haveKids

Having kids or not does not seem to influence people’s joviality. Median of people with kids are slightly higher than those without kids and the mean is very close for the two groups.

ggplot(data=participants_data, 
       aes(y = joviality, 
           x= haveKids)) +
  geom_boxplot(notch=TRUE) +
  geom_point(position="jitter", 
             size = 0.5) +
  stat_summary(geom = "point",       
               fun.y="mean",         
               colour ="red",        
               size=4) +
  ggtitle("Boxplot of joviality vs interestGroup")

ggbetweenstats(
  data = participants_data,
  x = haveKids, 
  y = joviality,
  type = "np",
  messages = FALSE
)

joviality vs educationLevel

educationLevel also does not seem to correlate to joviality. There is a slight increasing trend of mean of joviality with higher education level but the increase is not significant (0.49 to 0.52). But those with low educationLevel has the highest variance in terms of joviality.

tooltip <- function(y, ymax, accuracy = .01) {
  mean <- scales::number(y, accuracy = accuracy)
  sem <- scales::number(ymax - y, accuracy = accuracy)
  paste("Mean joviality:", mean, "+/-", sem)
}

gg_point <- ggplot(data=participants_data, 
                   aes(x = educationLevel),
) +
  stat_summary(aes(y = joviality, 
                   tooltip = after_stat(  
                     tooltip(y, ymax))),  
    fun.data = "mean_se", 
    geom = GeomInteractiveCol,  
    fill = "light blue"
  ) +
  stat_summary(aes(y = joviality),
    fun.data = mean_se,
    geom = "errorbar", width = 0.2, size = 0.2
  ) + 
  ggtitle("Bar chart of joviality vs educationLevel")

girafe(ggobj = gg_point,
       width_svg = 8,
       height_svg = 8*0.618)

ggbetweenstats(
  data = participants_data,
  x = educationLevel, 
  y = joviality,
  type = "p",
  mean.ci = TRUE, 
  pairwise.comparisons = TRUE, 
  pairwise.display = "s",
  p.adjust.method = "fdr",
  messages = FALSE
)

joviality vs age

There seem to be no significant correlation between joviality and age.

ggplot(data=participants_data, 
       aes(x= age, 
           y=joviality)) +
  geom_point() +
  geom_smooth(method=lm, 
              size=0.5) +  
  geom_label_repel(aes(label = participantId), 
                   fontface = "bold") +
  ggtitle("Scatter plot for joviality vs age")

joviality vs interestGroup

There doesn’t seem to be significant difference in interest groups and their participants joviality.

ggplot(participants_data, 
       aes(x = joviality, 
           y = interestGroup)) +
  geom_density_ridges(
    scale = 3,
    rel_min_height = 0.01,
    bandwidth = 3.4,
    fill = "#7097BB",
    color = "white"
  ) +
  scale_x_continuous(
    name = "interestGroup",
    expand = c(0, 0)
    ) +
  scale_y_discrete(name = NULL) +
  ggtitle("Ridge plot of joviality vs interestGroup") +
  theme_ridges()

Financial Data

Aggregate data base on participantId.

aggregated_data <- financial_data %>%
  group_by(participantId, category) %>%
  summarise(total_amount = sum(amount))

pivoted_data <- aggregated_data %>%
  pivot_wider(names_from = category, values_from = total_amount, values_fill = 0)

glimpse(pivoted_data)

Rows: 1,011
Columns: 7
Groups: participantId [1,011]
$ participantId  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ Education      <dbl> -494.0699, -494.0699, -166.5618, -494.0699, -166.5618, …
$ Food           <dbl> -3141.976, -3167.336, -3467.314, -3395.537, -3262.336, …
$ Recreation     <dbl> -4384.067, -6637.511, -4172.006, -4704.288, -6315.910, …
$ Shelter        <dbl> -7214.852, -7214.852, -7235.188, -7214.852, -18410.462,…
$ Wage           <dbl> 109816.59, 96374.93, 85107.52, 82269.33, 106055.17, 232…
$ RentAdjustment <dbl> 0.000, 0.000, 0.000, 0.000, 4809.283, 0.000, 0.000, 0.0…

Merge participants_data and financial data.

merged_data <- left_join(participants_data, pivoted_data, by = "participantId")

Create new variables.

merged_data <- merged_data %>%
  mutate(totalSpending = Education + Food + Recreation + Shelter - RentAdjustment)

merged_data <- merged_data %>%
  mutate(netValue = Wage + totalSpending)

glimpse(merged_data)

Rows: 1,011
Columns: 15
$ participantId  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ householdSize  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ haveKids       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ age            <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 27, 34, 18,…
$ educationLevel <fct> HighSchoolOrCollege, HighSchoolOrCollege, HighSchoolOrC…
$ interestGroup  <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", "D", "D", …
$ joviality      <dbl> 0.001626703, 0.328086500, 0.393469590, 0.138063446, 0.8…
$ Education      <dbl> -494.0699, -494.0699, -166.5618, -494.0699, -166.5618, …
$ Food           <dbl> -3141.976, -3167.336, -3467.314, -3395.537, -3262.336, …
$ Recreation     <dbl> -4384.067, -6637.511, -4172.006, -4704.288, -6315.910, …
$ Shelter        <dbl> -7214.852, -7214.852, -7235.188, -7214.852, -18410.462,…
$ Wage           <dbl> 109816.59, 96374.93, 85107.52, 82269.33, 106055.17, 232…
$ RentAdjustment <dbl> 0.000, 0.000, 0.000, 0.000, 4809.283, 0.000, 0.000, 0.0…
$ totalSpending  <dbl> -15234.97, -17513.77, -15041.07, -15808.75, -32964.55, …
$ netValue       <dbl> 94581.623, 78861.160, 70066.451, 66460.587, 73090.617, …

Understanding joviality vs financial data

joviality vs netValue

Unexpectedly, there seem to be a negative correlation between netValue and joviality. People with more money left over at the end of each month seem to be less happy.

ggplot(data=merged_data, 
       aes(x= netValue, 
           y=joviality)) +
  geom_point() +
  geom_smooth(method=lm, 
              size=0.5) +  
  geom_label_repel(aes(label = participantId), 
                   fontface = "bold") +
  ggtitle("Scatter plot of joviality vs netValue")

joviality vs each spending category

The heatmap shows that Education has negative correlation to joviality. Those who spend more money on education are less happy than those who spend less. Shelter has almost 0 correlation with joviality. People with higher spending on shelter are not happier compared to the rest. Food and recreation have positive correlation to joviality and recreation has a stronger correlation. This makes sense as more money spent on food and recreation brings relaxation and happiness.

new_merged_data_neg <- abs(select(merged_data, joviality, Education, Food, Recreation, Shelter, totalSpending))
cor_matrix <- cor(as.matrix(new_merged_data_neg[,1]),as.matrix(new_merged_data_neg[,-1]))

ggplot(data = melt(cor_matrix), aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "lightblue", mid = "white", high = "orange", midpoint = 0) +
  labs(title = "Correlation Matrix Heatmap - Spending Category", x = "Variable", y = "Variables") +
  geom_text(aes(label = round(value,2))) +
  theme_minimal()

joviality vs each incoming category

RentAdjustment is positively correlated to joviality. This could perhaps be explained as people would feel that they are supported by the government with rent adjustment which makes them happier. Interestingly, wage is negatively correlated to joviality. Those with higher wage are less happy.

new_merged_data_pos <- select(merged_data, joviality, Wage, RentAdjustment)
cor_matrix <- cor(as.matrix(new_merged_data_pos[,1]),as.matrix(new_merged_data_pos[,-1]))
# Plot correlation heatmap
ggplot(data = melt(cor_matrix), aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "lightblue", mid = "white", high = "orange", midpoint = 0) +
  labs(title = "Correlation Matrix Heatmap - Incoming category", x = "Variable", y = "Variable") +
  geom_text(aes(label = round(value,2))) +
  theme_minimal()