Take-home Exercise 2

Author

Liu Jiaqi

Published

28/5/2023

Kickstarter

Getting Started

The code chunk below will be used to install and load the necessary R packages to meet the data preparation, data wrangling, data analysis and visualisation needs.

pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork, lubridate, tidyverse, ggiraph, patchwork, GGally, igraph, igraphdata, visNetwork)

Data Import

In the code chunk below, fromJSON() of jsonlite package is used to import mc2_challenge_graph.json into R environment.

mc2_data <- fromJSON("data/mc2_challenge_graph.json")

Examine the list object created by using RStudio, especially nodes and links data tables.

Data Wrangling

Extracting the nodes

The code chunk is used to extract nodes data table from mc2_data list object and save the output in a tibble data frame object called mc2_nodes.

mc2_nodes <- as_tibble(mc2_data$nodes) %>%
  select(id, shpcountry, rcvcountry)
Thing to learn
  • select() is used not only to select the field needed but also to re-organise the sequent of the fields.

Extracting the edges

The code chunk is used to extract edgess data table from mc2_data list object and save the output in a tibble data frame object called mc2_edges.

mc2_edges <- as_tibble(mc2_data$links) %>%
  mutate(ArrivalDate = ymd(arrivaldate)) %>%
  mutate(Year = year(ArrivalDate)) %>%
  select(source, target, ArrivalDate, Year, hscode, valueofgoods_omu, 
         volumeteu, weightkg, valueofgoodsusd) %>% 
  distinct()
Things to learn
  • mutate() is used two times to create two derive fields.
    • ymd() of lubridate package is used to covert arrivaldate field from character data type into date data type.
    • year() of lubridate package is used to convert the values in ArrivalDate field into year values.
  • select() is used not only to select the field needed but also to re-organise the sequent of the fields.

Preparing edges data table

Things to learn from the code chunk below
  • filter() is used to select records whereby hscode is equal 306170 and Year is equal to 2028.
  • group_by() is used to aggregate values by source, target, hscode, Year.
  • summarise() and n() are used to count the aggregated records.
  • filter() is then used to perform two selections
    • to select all records whereby source are not equal to target, and
    • to select all records whereby the values of their weights field are greater than 20
mc2_edges_aggregated <- mc2_edges %>%
  filter(hscode == "306170" & Year == "2028") %>%
  group_by(source, target, hscode, Year) %>%
    summarise(weights = n()) %>%
  filter(source!=target) %>%
  filter(weights > 20) %>%
  ungroup()

Preparing nodes data

Instead of using the nodes data table extracted from mc2_data, we will prepare a new nodes data table by using the source and target fields of mc2_edges_aggregated data table. This is necessary to ensure that the nodes in nodes data tables include all the source and target values.

id1 <- mc2_edges_aggregated %>%
  select(source) %>%
  rename(id = source)
id2 <- mc2_edges_aggregated %>%
  select(target) %>%
  rename(id = target)
mc2_nodes_extracted <- rbind(id1, id2) %>%
  distinct()

Building the tidy graph data model

The code chunk below is then used to build the tidy graph data model.

mc2_graph <- tbl_graph(nodes = mc2_nodes_extracted,
                       edges = mc2_edges_aggregated,
                       directed = TRUE)

Visualising the network graph with ggraph

In order to check if the tidygraph model has been prepared correctly, we can use selected functions of ggraph package to plot a simple network graph as shown below.

ggraph(mc2_graph,
       layout = "fr") +
  geom_edge_link(aes()) +
  geom_node_point(aes()) +
  theme_graph()

Exporting data objects

Code chunk below will be used to export the data objects prepared in previous section into rds format for subsequent use.

write_rds(mc2_nodes_extracted, "data/mc2_nodes_extracted.rds")
write_rds(mc2_edges_aggregated, "data/mc2_edges_aggregated.rds")
write_rds(mc2_graph, "data/mc2_graph.rds")

Preparing Network Data for visNetwork

Instead of plotting static network graph, we can plot interactive network graph by using visNetwork package. Before we can plot a interactive network graph by using visNetwork package, we are required to prepare two tibble data frames, one for the nodes and the other one for the edges.

Preparing edges tibble data frame

In this example, we assume that you already have created a tidygraph model look similar to the print below.

mc2_graph
# A tbl_graph: 123 nodes and 146 edges
#
# A directed acyclic simple graph with 12 components
#
# A tibble: 123 × 1
  id                                             
  <chr>                                          
1 1 Ltd. Liability Co Cargo                      
2 Adriatic Catch Ltd. Liability Co Transportation
3 Adriatic Tuna AS Solutions                     
4 Amerigo Dockyard S.p.A. Distribution           
5 Aqua Adventures Carriers Seabed                
6 Aqua Azul LC International                     
# ℹ 117 more rows
#
# A tibble: 146 × 5
   from    to hscode  Year weights
  <int> <int> <chr>  <dbl>   <int>
1     1    74 306170  2028      68
2     2    75 306170  2028      30
3     3    76 306170  2028      22
# ℹ 143 more rows

Note that tidygraph model is in R list format. The code chunk below will be used to extract and convert the edges into a tibble data frame.

edges_df <- mc2_graph %>%
  activate(edges) %>%
  as.tibble()
Things to learn from the code chunk above
  • activate() is used to make the edges of mc2_graph1 active. This is necessary in order to extract the correct compontent from the list object.
  • as.tibble() is used to convert the edges list into tibble data frame.
Important

You might be curious to ask why don’t we used mc2_edges, the tibble data frame extracted in the earlier section. If you compare the data structure of both data frames, you will notice the first two field names in edges_df are called from and to instead of source and target. This is conformed to the nodes data structure of igraph object. Also note that the data type of from and to are in numeric data type and not in character data type.

Preparing nodes tibble data frame

In this section, we will prepare a nodes tibble data frame by using the code chunk below.

nodes_df <- mc2_graph %>%
  activate(nodes) %>%
  as.tibble() %>%
  rename(label = id) %>%
  mutate(id=row_number()) %>%
  select(id, label)
Things to learn from the code chunk above
  • activate() is used to make the edges of mc2_graph1 active. This is necessary in order to extract the correct compontent from the list object.
  • as.tibble() is used to convert the edges list into tibble data frame.
  • rename() is used to rename the field name id to label.
  • mutate() is used to create a new field called id and row_number() is used to assign the row number into id values.
  • select() is used to re-organised the field name. This is because visNerwork is expecting the first field is called id and the second field is called label.
Important
  • visNetowrk is expecting a field called id in the tibble data frame. The field must be in numeric data type and it must unique to the values in the from and to field of edges_df.

Plotting a simple interactive network graph

To ensure that the tibble data frames are confirmed to the requirements of visNetwork, we will plot a simple interactive graph by using the code chunk below.

visNetwork(nodes_df,
           edges_df) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visEdges(arrows = "to", 
           smooth = list(enabled = TRUE, 
                         type = "curvedCW"))

Take Home Exercise

Preparing edges data table

Some research indicates that fish products should have hs code starting with ‘03’. However, there is no hs code starting with ‘03’ in the data provided. I see that there are data with hs code starting with ‘3’ and end with ‘0’, moving the 0 in front gives seemingly correct hs codes as I randomly checked a few. For this exercise, I will only extract rows with hs code starting with ‘3’ and end with ‘0’. Due to limited computing power, only those with weights > 20 are selected.

Year is also added as an attribute without additional filtering to help with following temporal analysis.

mc2_edges_aggregated <- mc2_edges %>%
  filter(substr(hscode, 1, 1) == "3" & substr(hscode, 6, 6) == "0") %>%
  group_by(source, target, hscode, Year) %>%
    summarise(weights = n()) %>%
  filter(source!=target) %>%
  filter(weights > 100) %>%
  ungroup()

Preparing nodes data table

Add additional attribute on Year that first transaction occurred and Year that last transaction occurred.

id1 <- mc2_edges_aggregated %>%
  group_by(source) %>%
    summarise(min_year = min(Year), max_year = max(Year)) %>%
  select(source, min_year, max_year) %>%
  rename(id = source)
id2 <- mc2_edges_aggregated %>%
  group_by(target) %>%
    summarise(min_year = min(Year), max_year = max(Year)) %>%
  select(target, min_year, max_year) %>%
  rename(id = target)
mc2_nodes_extracted <- rbind(id1, id2) %>%
  distinct()

Building the tidy graph data model

mc2_graph <- tbl_graph(nodes = mc2_nodes_extracted,
                       edges = mc2_edges_aggregated,
                       directed = TRUE)

Visualising the network graph with ggraph

set.seed (1234)
ggraph(mc2_graph,
       layout = "fr") +
  geom_edge_link(aes(width=weights, colour = Year)) +
  geom_node_point(aes()) +
  theme_graph()

Can’t tell much from the static network. Try interactive.

Visualising the network graph with visNet

merged_nodes_df <- merge(nodes_df, mc2_nodes_extracted, by.x = "label", by.y = "id", all.x = TRUE)%>%
  select(id, label, min_year, max_year)
distinct_nodes_df <- distinct(merged_nodes_df, id, .keep_all = TRUE)
distinct_nodes_df <- distinct_nodes_df %>%
  rename(group = min_year) 
edges_df <- edges_df %>%
  rename(value = weights) 
mc2_graph_3_vis_plot <- visNetwork(nodes = distinct_nodes_df,
           edges = edges_df,
           main = "Interactive visNet (selection by Year of first shipment)") %>%
  visIgraphLayout(layout = "layout_with_fr",
                  smooth = TRUE,
                  physics = TRUE 
                ) %>%
  visNodes(size = 10) %>%
  visEdges(color = list(highlight = "lightgray"), arrows = "to", 
           smooth = list(enabled = TRUE)) %>%
  visOptions(selectedBy = "group",
             highlightNearest = list(enabled = TRUE,
                                     degree = 1,
                                     hover = TRUE,
                                     labelOnly = TRUE),
             nodesIdSelection = list(enabled = TRUE,
                                     values = nodes_df$id)) %>%
  visLegend(width = 0.1) %>%
  visLayout(randomSeed = 1234)

mc2_graph_3_vis_plot

To plot network metrics to explore further.

Visualising network metrics

Betweenness centrality with each

set.seed (123)

g1 <- mc2_graph %>%
  ggraph(layout = "fr") + 
  geom_edge_link(aes(width=weights), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 5)) +
  geom_node_point(aes(colour = min_year, 
                      size = centrality_betweenness()))

g1 + theme_graph()

The two nodes with the highest betweenness centrality seem odd they are not at the center of the network, linked with as many nodes nor have many shipments with other nodes. Use the following code chunk to identify the two nodes.

distinct_Source<-mc2_edges_aggregated%>%distinct(source)
distinct_Target<-mc2_edges_aggregated%>%distinct(target)
nodes_updated_source<-semi_join(mc2_nodes_extracted,distinct_Source,
                          by=c("id"="source"))
nodes_updated_target<-semi_join(mc2_nodes_extracted,
                                          distinct_Target,
                          by=c("id"="target"))
nodes_updated=bind_rows(nodes_updated_source, nodes_updated_target)%>%
  distinct(id,.keep_all = TRUE)
mc2_graph_2<-igraph::graph_from_data_frame(mc2_edges_aggregated, 
                                     vertices = nodes_updated) %>% 
                                      as_tbl_graph()

mc2_graph_2<-mc2_graph_2%>%
  mutate(betweenness=centrality_betweenness())
mc2_graph_df <- as.data.frame(mc2_graph_2)
# Select the top 2 nodes with the highest values in betweenness centrality
top_2 <- mc2_graph_df %>%
  top_n(2, betweenness)

# Output the top 2 rows
top_2
                      name min_year max_year betweenness
1         Isla del Este SE     2028     2032          15
2 Selkie Ltd. Liability Co     2032     2034           3

The two nodes are identified to be Isla del Este SE and Selkie Ltd. Liability Co. It’s rather interesting that the last shipment of Isla del Este SE was in 2032 and Selkie’s first shipment was in 2032 as well. This fits the modes operanti that FishEye mentioned whereas a company with illegal activities may close down and go active again under a new name.

Plotting Ego Network of the identified node

Isla del Este SE

mc2_edges_ego1 <- mc2_edges_aggregated %>%
  filter(source == "Isla del Este SE" | target == "Isla del Este SE")
id1_ego1 <- mc2_edges_ego1 %>%
  select(source) %>%
  rename(id = source)
id2_ego1 <- mc2_edges_ego1 %>%
  select(target) %>%
  rename(id = target)
mc2_nodes_ego1 <- rbind(id1_ego1, id2_ego1) %>%
  distinct()
mc2_graph_ego1 <- tbl_graph(nodes = mc2_nodes_ego1,
                       edges = mc2_edges_ego1,
                       directed = TRUE)
ggraph(mc2_graph_ego1,
       layout = "auto") +
  geom_edge_link(aes(width=weights, colour = Year), arrow = arrow(length = unit(4, 'mm'))) +
  geom_node_point(aes()) +
  geom_node_text(aes(label = mc2_nodes_ego1$id), size = 2, repel = TRUE, vjust = 1, hjust = 1) +
  theme_graph()

Looking at the ego network, looks like that Isla del Este is under the mode of operation where it imports a large amount from a few companies and distribute the imported goods to other companies. Could be suspicous for transshipment.