# An Introduction to Network Analysis in R

with tags igraph ggraph networks network-analysis -With the increasing availability of granular data on the relationships between individual entities - such as persons (social media), countries (internatinal trade) and financial institutions (supervisory reporting) - network analysis offers many possibilities to extract useful information from such data. This post provides an introduction to network analysis in R using the powerful `igraph`

package for the calculation of metrics and `ggraph`

for visualisation. It marks the beginning of a more comprehensive treatment of network analysis on r-econometrics.

## The `igraph`

package

There are multiple packages for the analysis of networks in R. In this post I concentrate on the `igraph`

package, which allows for a broad range of applications. But before we get into it in more detail, it is useful to know that there are two possible ways to represent the *edges*, i.e. the connections, of a network:

**Adjacency matrix**: This is a square matrix, where each row and column corresponds to an entity. If two entities are conencted, the respective field in the matrix takes the value one and zero otherwise.**A list of connections**: In its most simple form this is a list, where each edge of a network is represented by a row with one entity in the first column and the other in the second. For the purpose of this post this is our preferred representation.

## Create an artificial graph

We can illustrate these two representations by looking at an artificial network. Such a network could be generated with certain functions of the `igraph`

package. However, the following code only uses base functionalities of R. It results in a data frame with the names of the connected entities in the first and second row. The third row contains a random indicator for the strength of the connection. It is based on the square of a value from a standard normal distribution. To reduce the number of resulting edges, only values above a certain threshold are kept. Also, the code excludes the connections a node has with itself.

```
# Set seed for reproducibility
set.seed(12345)
# Generate data
raw <- data.frame(byer = rep(letters, 1, each = 26),
sllr = rep(letters, 26),
con = rnorm(26 * 26)^2) # Calculate arbitrary weights
# Drop connections with oneself
raw <- raw[raw$byer != raw$sllr,]
# Only keep strong connections
raw <- raw[abs(raw$con) > 1,]
# Reformat row numbers
rownames(raw) <- NULL
# Look at result
head(raw)
```

```
## byer sllr con
## 1 a f 3.304964
## 2 a l 3.302623
## 3 a s 1.255997
## 4 a v 2.119310
## 5 a x 2.412236
## 6 a y 2.552676
```

Basically, this is a list repesentation of the artifical network. In a next step we convert the data frame to an `igraph`

object.

`igraph`

objects

```
# Load package
library(igraph)
```

The transformation is straightforward and can be done with the `graph_from_data_frame`

function, where for simplicity we set the argument `directed = FALSE`

to indicated that the network is not directed. *Note that it is important that the first and second column of the data frame contain the names of the nodes that have a connection.*

`graph_df <- graph_from_data_frame(raw, directed = FALSE)`

The function `simplify`

can be used to get rid of loops and multiple edges:

`graph_df <- simplify(graph_df)`

Now take a look at the object:

`graph_df`

```
## IGRAPH bdf5bb9 UN-- 26 173 --
## + attr: name (v/c)
## + edges from bdf5bb9 (vertex names):
## [1] a--f a--g a--l a--m a--s a--v a--x a--y a--z b--d b--f b--g b--h b--k
## [15] b--l b--m b--o b--p b--q b--r b--t b--u b--v b--w b--x b--z c--g c--h
## [29] c--j c--l c--m c--n c--p c--q c--r c--s c--u c--v c--w c--x c--z d--e
## [43] d--f d--h d--o d--q d--t d--x d--z e--f e--g e--h e--i e--j e--k e--o
## [57] e--q e--r e--s e--t e--v e--w e--x e--y e--z f--g f--i f--k f--l f--m
## [71] f--n f--q f--r f--u f--z g--l g--m g--p g--s g--t g--u g--w g--x g--y
## [85] g--z h--m h--o h--r h--s h--t h--v h--x h--y h--z i--j i--k i--n i--q
## [99] i--u i--v i--y i--z j--m j--p j--r j--t j--u j--v j--w j--y j--z k--l
## + ... omitted several edges
```

It does not look like the data frame list. However, it can be seen that each element of the list of edges refers to a conenction between to entities.

Let’s compare this list to the adjacency matrix representation. It can be obtained directly from the `igraph`

object by using the `as_adj`

function.

`graph_adj <- as_adj(graph_df)`

Now look at the first six rows and columns of the resulting object.

`graph_adj[1:6, 1:6]`

```
## 6 x 6 sparse Matrix of class "dgCMatrix"
## a b c d e f
## a . . . . . 1
## b . . . 1 . 1
## c . . . . . .
## d . 1 . . 1 1
## e . . . 1 . 1
## f 1 1 . 1 1 .
```

An existing connection between two nodes is indicated by the value one. If there is no connection between them, the field contains a dot. For example, the (undirected) connection between *a* and *f* is described by the 1 in the sixth value of the first column and the sixth value in the first row. This corresponds to the first entry in the list representation of the above `igraph`

object. If the network were directed, only one field would contain a 1 - unless the connection was mutual.

Note that the resulting object is not an object of class `igraph`

. `as_adj`

produces a `dgCMatrix`

object, which comes with the `Matrix`

package and was specifically designed to implement so-called *sparse matrices*. Sparse matrices are simply matrices which contain a lot of zeros. By only considereing non-zero values, sparse matrices allow for a lot of additional computational efficiency, which is beneficial for the analysis of very large networks.

## Summary statistics

This secton covers standard metrics to describe a network. Most of these metrics focus on the *centrality* of a graph’s nodes. The following desciption is based on Jackson (2008).

### Degree

The degree of a node is the number of its connections. The `degree`

function calculates this number for each node of a graph. The node with the highest number is the node with the highest number of connections.

```
g_degree <- degree(graph_df)
g_degree
```

```
## a b c d e f g h i j k l m n o p q r s t u v w x y
## 9 17 15 9 17 14 15 13 10 12 14 10 14 10 14 11 15 16 12 13 14 13 12 16 14
## z
## 17
```

The degree is of a node is an insteresting metric to identify the most connected nodes of a network. However, if the structure of a network is of interest, the *degree distribution* can be an informative metric. It describes the relative frequencey of nodes that have different degrees. In that way it is a histogram of degrees.

`hist(g_degree)`

### Betweenness

*Betweenness* denotes the number of shortest paths passing through a node.

```
g_betw <- betweenness(graph_df)
round(g_betw, 2)
```

```
## a b c d e f g h i j k l
## 2.17 9.50 7.79 1.11 10.10 6.85 7.86 4.25 2.19 4.42 6.48 2.12
## m n o p q r s t u v w x
## 5.67 2.93 6.01 3.99 7.56 8.62 3.84 4.53 5.61 6.19 3.15 10.15
## y z
## 7.21 11.68
```

### Eigenvector centrality

The *eigenvector centrality* (Bonabeau, 1972) belongs to the class of spectral centralities. It is based on the idea that the importance of a node is recusively related to the importance of the nodes pointing to it. For example, your popularity depends on the popularity of your friends, whose popularity depends on their friends etc.

```
g_eigen <- eigen_centrality(graph_df)
round(g_eigen$vector, 2)
```

```
## a b c d e f g h i j k l m n o
## 0.54 1.00 0.87 0.58 0.99 0.81 0.88 0.81 0.60 0.72 0.82 0.61 0.85 0.61 0.84
## p q r s t u v w x y z
## 0.64 0.89 0.94 0.72 0.79 0.85 0.76 0.73 0.92 0.80 0.97
```

Further measures will be added in the future.

## Visualisation

Beside the calculation of summarising network metrics, the visualisation of a graph can also be a very informative step in network analysis. Since visualisations in R usually involve the `ggplot2`

package, I focus on the `ggraph`

package, which is based on the `ggplot2`

architecture. However, the `plot`

function can be used to get a quick impression of a network:

`plot(graph_df)`

When using `ggplot2`

the main challenge in the visualisation of networks is to find suitable x- and y-coordinates for the nodes of a graph. Fortunately, there are some packages out there, which were developed for this purpose. For example, `ggraph`

and `ggnetwork`

take an `igraph`

object and produce tables with suitable coordianates. These coordinates are usually obtained by using special algorithms such as the ones proposed by Fruchterman and Reingold (1991) or Kamada and Kawai (1989). The latter is applied in the code below.

`ggraph`

diagrams follow the `ggplot2`

logic, where the function `ggplot`

is replaced by `ggraph`

to set up the graph. The function is then followed by the layers the user wants to add. In this example we add a layer of edges with `geom_edge_link`

and on top of this a layer of nodes, which are represented by points, using `geom_node_point`

.

```
# Load packages
library(ggraph)
library(ggplot2)
ggraph(graph_df, layout = "kk") + # Generate graph
geom_edge_link(colour = "darkgrey") + # Add a layer of edges
geom_node_point(size = 3) + # Add a layer of nodes
theme_graph() # Get rid of gray background colour, grid and axes
```

Further introductions to `ggraph`

can be found on https://www.data-imaginist.com/tags/ggraph and in the package’s vignettes, of course.

## References

Bonabeau, E. (1972). Factoring and weighting approaches to status scores and clique identification. *Journal of Mathematical Sociology 2*, 113-120.

Csardi G., & Nepusz, T. (2006). The igraph software package for complex network research, *InterJournal* Complex Systems, 1695.

Fruchterman, T. M. J., & Reingold, E. M. (1991). Graph Drawing by Force-directed Placement. *Software - Practice and Experience, 21*(11), 1129-1164.

Jackson, M. O. (2008). *Social and economic networks*. Princeton: University Press.

Kamada, T., & Kawai, S. (1989). An algorithm for drawing general undirected graphs. *Information Processing Letters 31*(1), 7-15.