require(gen3sis)
require(terra)
require(ape)
require(phytools)
require(picante)
require(here)
setwd(here())
4 🖥️ Silicodiversity
Exploring outputs
In this practical we will explore the outputs from gen3sis using the island simulations we ran yesterday. We will learn how to use this data with common R packages for phylogenetic comparative methods, community phylogenetics, biogeography and much more.
The goal for today will be to produce:
A map of Species Richness from the simulation summary object
A plot of Lineages Through Time from the phylogeny
Plot of species trait values on a phylogenetic tree, by linking the species objects to the phylogeny
Maps of Phylogenetic Diversity by linking species objects to the landscape and phylogeny
First, let’s make sure we have the necessary packages loaded
Simulation summary object (sgen3sis.rds)
The first object we will look at today is the sgenesis.rds file. This file contains a summary of the simulation. Which we can plot (as we did yesterday) with the plot_summary function. This is the same object, that you would have in memory by running a simulation with run_simulation.
<- "output/islands"
output_dir
<- readRDS(file.path(output_dir, "config_islands_simple_Day1Prac3_M3/sgen3sis.rds"))
sim
# look at what the simulation summary contains
names(sim)
[1] "summary" "flag" "system" "parameters"
The first element is the sim summary. This contains a record of the history of speciation, extinction, and species richness through time (phylo_summary), a history of the number of total grid cells occupied during the simulation through time (occupancy) and the species richness of each grid cell at the final time step.
str(sim$summary)
List of 3
$ phylo_summary : num [1:52, 1:4] 4 4 4 4 4 4 4 4 4 4 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:52] "initial" "50" "49" "48" ...
.. ..$ : chr [1:4] "total" "alive" "speciations" "extinctions"
$ occupancy : Named num [1:52] 1 0.523 0.527 0.525 0.51 ...
..- attr(*, "names")= chr [1:52] "initial" "50" "49" "48" ...
$ richness-final: num [1:3600, 1:3] 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3600] "1" "2" "3" "4" ...
.. ..$ : chr [1:3] "x" "y" "0"
# phylo summary
head(sim$summary$phylo_summary)
total alive speciations extinctions
initial 4 4 4 0
50 4 4 0 0
49 4 4 0 0
48 4 4 0 0
47 4 4 0 0
46 4 4 0 0
# occupancy
head(sim$summary$occupancy)
initial 50 49 48 47 46
1.0000000 0.5225677 0.5270793 0.5249294 0.5099819 0.5013193
# occupancy
head(sim$summary$`richness-final`)
x y 0
1 0.5 59.5 0
2 1.5 59.5 0
3 2.5 59.5 0
4 3.5 59.5 0
5 4.5 59.5 0
6 5.5 59.5 0
These data can be visualized with the plot_summary function
# Visualize the outputs
plot_summary(sim)
We can also use these data to map out patterns of species richness
# make sure the landscape is loaded
<- readRDS(file.path("data", "landscapes", "islands","landscapes.rds"))
lc
# can remove cells with elevation below sea level at timestep 0 (present-day) to see the outlines of the islands
<- is.na(lc$elevation[,"0"])
na_mask <- sim$summary$`richness-final`
rich 3] <- NA
rich[na_mask,
# turn richness summary into a raster
<- rast(rich, type="xyz")
richness
# plot, the sea is darblue, given by hexadecimal color code
plot(richness, col=c("grey", gen3sis::color_richness(12)), colNA="#000033")
The next part of the simulation summary object is the flag which will tell us if the simulation ran successfully. It should give “OK”
$flag sim
[1] "OK"
Next is the system summary. This is information about the R version, R packages, and operating system used in the simulation. This ensures complete repeatability. It also tells us the runtime of the simulation.
names(sim$system)
[1] "runtime-hours" "gen3sis-version" "R-version"
[4] "OS" "session-information"
$system[1:2] sim
$`runtime-hours`
[1] 0.01416649
$`gen3sis-version`
[1] '1.5.11'
Finally the summary object contains the config information and model parameters used.
names(sim$parameters)
[1] "gen3sis" "user" "directories"
names(sim$parameters$gen3sis)
[1] "general" "initialization" "dispersal" "speciation"
[5] "mutation" "ecology"
Phylogeny object (phy.nex)
The phylogeny object is pretty straight forward. It is a nexus file containing the relationships between the species.
# read phy
<- read.nexus(file.path(output_dir, "config_islands_simple_Day1Prac3_M3/phy.nex"))
phy # plot phy
plot(phy, cex=0.4)
axisPhylo()
From this object we can look at lineages through time plots and estimate trends in diversification. The gamma statistic is one way to detect diversification slowdowns or speedups, with positive values indicating nodes are more closely pushed up towards the tips (speed up) and negative values indicating nodes are closer to the root (slowdown) (Pybus and Harvey 2000). There are lots of other kinds of measures of phylogenetic tree shape, such as metrics of tree imbalance (Colless’s Index, Beta-Splitting Paramater, etc.)
# plot an ltt
<-ltt(phy) ltt_M1
# look at the gamma statistic
print(paste0("Gamma = ", round(ltt_M1$gamma, 2)))
[1] "Gamma = 0.27"
# is there a significant deviation from constant rates?
print(paste0("P = ", round(ltt_M1$p, 2)))
[1] "P = 0.79"
Species objects (phy.nex)
Now its time to get into the meat of the gen3sis outputs. Most of the information from the simulation is stored in the species objects. These are .rds files that contain a list which includes information on every species, extinct or extant, that existed during the simulation. These are saved per default at every time step, but can be fine tuned according to your needs. The naming convention is “species_t_0.rds” for time step 0 (present-day), and “species_t_50” for timestep 50, etc.
# load object
<- readRDS(file.path(output_dir, "config_islands_simple_Day1Prac3_M3", "species", "species_t_0.rds"))
species_t_0
# look at object class and length
class(species_t_0)
[1] "list"
length(species_t_0)
[1] 37
# compare to number of tips in the phylogeny
Ntip(phy)
[1] 37
There are 14 elements in the list, representing our 14 species and this number matches the number of species in our phylogeny. No species went extinct in this particular simulation, but if they did, they would match extinct tips in the phylogeny.
Lets look at a single species
names(species_t_0[[1]])
[1] "id" "abundance" "traits" "divergence"
The species has an ID which allows us to match it to the phylogeny.
1]]$id species_t_0[[
[1] "1"
The species’ abundances are also linked to grid cells in the landscape, which can be matched with their corresponding names.
1]]$abundance species_t_0[[
249 250 251 252 289 290 291 306 307 308 316 349 352 353 354 364 365 366 377 378
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
406 407 408 414 415 423 439 440 465 466 467 476 482 500 524 525 560 561 584 597
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
619 620 645 656 657 662 678 679 705 706 716 717 723 737 738 766 776 783 784 785
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
796 797 827 835 845 846 847 848 853 854 888 889 890 893 894 909 910 951 953
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
In this case the species has abundance values of 1 in all populations from the cells that it occupies. This is because we set abundances to be binary: 1=present, 0=absent. We can see the species occupies cells 247, 248, 249, etc.
The species also has values of it’s traits for each of its populations.
head(species_t_0[[1]]$traits)
dispersal temp_niche_centre temp_niche_width start_island
249 10 20.81994 0.5 1
250 10 20.91671 0.5 1
251 10 20.84718 0.5 1
252 10 20.83687 0.5 1
289 10 20.83059 0.5 1
290 10 20.83772 0.5 1
We can see the population that each row is linked by the rownames. Here you can see 247, 248, 249, etc. Each of these populations have a dispersal trait of 5 (because we didn’t vary this) slightly different temp_mean traits (because these evolved stochastically under a Brownian motion model) and a temp width of 1 (again, we didn;t vary this in model 1).
So, if we want to map out the distribution of species 1 at timestep 0, we just need to link those cell names (267/268/etc) to the landscape object and make a raster. Why don;t we try and see what islands species 1 is found on.
par(mfrow=c(1,3))
# first plot the islands out from the landscape object
<- rast(lc$patch[,c("x", "y", "0")], type="xyz")
patch_xyz plot(patch_xyz, main="Island Patches", col=palette()[c(2,3,4,6)], colNA="#000033")
# pull out values of the landscape where the species is found
<- lc$patch[which(rownames(lc$patch) %in% rownames(species_t_0[[1]]$traits)), c("x", "y", "0")]
species1_xyz
# turn it into a raster
<- rast(species1_xyz, type="xyz")
species1_xyz <- extend(species1_xyz, patch_xyz)
species1_xyz plot(species1_xyz, main= "Species 1 Distribution", colNA="#000033")
# alternatively we can use the plot_species function on gen3sis..
# for that we load the landscape object of the respective time-step
<- readRDS(file.path(output_dir, "config_islands_simple_Day1Prac3_M1", "landscapes", "landscape_t_0.rds"))
landscape_t_0 ::plot_species_presence(species_t_0[[1]], landscape_t_0) gen3sis
Linking the species object and phylogeny
Lets try and link the species trait data to the phylogeny to start learning something about what exactly took place during our simulation! Let’s get the mean trait value of the temperature niche and also the islands that each species belongs too.
# lets create a data frame
<- data.frame("id"= paste0("species",sapply(species_t_0, function(x)x$id)),
daf "mean_temp"=NA,
"island1"=0,
"island2"=0,
"island3"=0,
"island4"=0,
"island_start"=NA)
# take a look
head(daf)
id mean_temp island1 island2 island3 island4 island_start
1 species1 NA 0 0 0 0 NA
2 species2 NA 0 0 0 0 NA
3 species3 NA 0 0 0 0 NA
4 species4 NA 0 0 0 0 NA
5 species5 NA 0 0 0 0 NA
6 species6 NA 0 0 0 0 NA
# Use sapply on the species object to get their mean trait values
$mean_temp <- sapply(species_t_0, function(x){
daf# x <- species_t_0[[1]]
mean(x$traits[, "temp_niche_centre"], na.rm=T)
})
# Get the island patch id values in a for loop
for(i in 1:length(species_t_0)){
#I'm here because it is a good idea to test code on a single species before running it on all species
# i <- 1
# as before we get the lanscape values of each species
<- lc$patch[which(rownames(lc$patch) %in% rownames(species_t_0[[i]]$traits)), c("x", "y", "0")]
speciesi_xyz
# then pull out the unique values (note that species might occur on more than one island)
<- unique(speciesi_xyz[, 3])
islands
# then if species are on the island, give that column a value of 1
$island1[i] <- ifelse(1 %in% islands, 1, 0)
daf$island2[i] <- ifelse(2 %in% islands, 1, 0)
daf$island3[i] <- ifelse(3 %in% islands, 1, 0)
daf$island4[i] <- ifelse(4 %in% islands, 1, 0)
daf
}
# get the starting island from the traits object since we recorded this in the initialization step
$island_start <- sapply(species_t_0, function(xasa) unique(xasa$traits[, "start_island"])) daf
Now look again at the data frame
head(daf)
id mean_temp island1 island2 island3 island4 island_start
1 species1 20.85480 1 1 0 0 1
2 species2 20.97846 1 0 0 0 2
3 species3 21.05220 0 0 1 0 3
4 species4 21.98701 0 0 0 1 4
5 species5 20.92045 1 1 0 0 2
6 species6 20.82620 1 1 0 0 1
Plot out the continuously evolving temperature niche trait
# create a named vector for the log of the temperature niche trait
<- daf$mean_temp
temp_niche names(temp_niche) <- daf$id
# plot it out with dots = trait
dotTree(phy,temp_niche,ftype="i", length=8, fsize=1.2, standardize=T)
Not much variation in that trait, due to the combined effects of trait homogenization and a low rate of change. Lets plot the tip states of the islands on the phylogeny using the phytools package
par(mfrow=c(1,2))
# format each island
<- function(island, phy=phy, daf=daf){
formatIsland <- as.factor(daf[, island])
islandf names(islandf) <- daf$id
<-to.matrix(islandf,levels(islandf))
islandmat<-islandmat[phy$tip.label,]
islandmatreturn(list(islandf, islandmat))
}
plotTree(phy,ftype="i",offset=1,fsize=0.9, xlim=c(0, 75))
tiplabels(pie=formatIsland(island="island1", phy=phy, daf=daf)[[2]],piecol=palette()[c(8,2)],cex=0.6, adj=12+1)
tiplabels(pie=formatIsland(island="island2", phy=phy, daf=daf)[[2]],piecol=palette()[c(8,3)],cex=0.6, adj=12+3)
tiplabels(pie=formatIsland(island="island3", phy=phy, daf=daf)[[2]],piecol=palette()[c(8,4)],cex=0.6, adj=12+5)
tiplabels(pie=formatIsland(island="island4", phy=phy, daf=daf)[[2]],piecol=palette()[c(8,6)],cex=0.6, adj=12+7)
plot(patch_xyz, main="Island Patches", col=palette()[c(2,3,4,6)])
Interesting. What do you notice about the distribution of species on islands? Could you predict which island each lineage began on?
We actually know which islands each lineage started on because we recorded this as a trait (we could also look at the past species objects to figure this out but we have used a shortcut though the traits).
par(mfrow=c(1,2))
plotTree(phy,ftype="i",offset=1,fsize=0.4, xlim=c(0, 75))
=0.9
my_cex# add the starting island as the first colum
tiplabels(pie=formatIsland(island="island_start", phy=phy, daf=daf)[[2]],piecol=palette()[c(2,3,4,6)],cex=my_cex*3, adj=12)
tiplabels(pie=formatIsland(island="island1", phy=phy, daf=daf)[[2]],piecol=palette()[c(8,2)],cex=my_cex, adj=12+1)
tiplabels(pie=formatIsland(island="island2", phy=phy, daf=daf)[[2]],piecol=palette()[c(8,3)],cex=my_cex, adj=12+3)
tiplabels(pie=formatIsland(island="island3", phy=phy, daf=daf)[[2]],piecol=palette()[c(8,4)],cex=my_cex, adj=12+5)
tiplabels(pie=formatIsland(island="island4", phy=phy, daf=daf)[[2]],piecol=palette()[c(8,6)],cex=my_cex, adj=12+7)
# add island plot
plot(patch_xyz, main="Island Patches", col=palette()[c(2,3,4,6)])
So whats really apparent here is that the clade that originated on the green island has speciated allopatrically into the red island multiple times in the recent past. The same is true for the red clade, however the deeper divergence between species9 and species3 have had enough time to recolonize both islands.
4.0.0.1 Linking the species object, landscape, and phylogeny
Common spatial biodiversity analyses link information measured at the species level to maps of their distribution in space using presence-absence matrices or PAMs. PAMs typically are data frame with each row representing a site, could be an island or could be a grid cell, and each column representing a species. Values of 1 are given if the species is present in the site, if not a value of 0 is given.
# grid cell level PAM
# create an empty data frame with the dimensions of species x sites
<- data.frame(matrix(0, nrow=nrow(lc$elevation), ncol=length(species_t_0)))
PAM
# given names to rows and columns
rownames(PAM) <- rownames(lc$elevation)
colnames(PAM) <- paste0("species", sapply(species_t_0, function(x)x$id))
# loop over species and add value of 1 to all sites the species is present
for(i in 1:length(species_t_0)){
which(rownames(PAM) %in% names(species_t_0[[i]]$abundance)), i] <- 1
PAM[
}
# how does it look?
print(PAM[1:10, 1:10])
species1 species2 species3 species4 species5 species6 species7 species8
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0
species9 species10
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
10 0 0
# we can estimate Phylogenetic diversity
<- pd(PAM, phy)
pd_islands
# we can also measure mean phylogenetic distance and mean nearest neighbour distance
<- mpd(PAM, cophenetic(phy))
mpd_islands <- mntd(PAM,cophenetic(phy))
mntd_islands
# link these back with the landscape by joining to the landscape lat/long
<- cbind(lc$elevation[, c("x", "y")], pd_islands, mpd_islands, mntd_islands)
community_phylo
par(mfrow=c(2,2))
<- rast(community_phylo[, c("x", "y", "SR")], type="xyz")
sr_ras <- rast(community_phylo[, c("x", "y", "PD")], type="xyz")
pd_ras <- rast(community_phylo[, c("x", "y", "mpd_islands")], type="xyz")
mpd_ras <- rast(community_phylo[, c("x", "y", "mntd_islands")], type="xyz")
mntd_ras
plot(sr_ras, main="Species Richness", na.color ="red")
plot(pd_ras, main="Phylogenetic Diversity")
plot(mpd_ras, main="Mean Phylogenetic Pairwise Distance")
plot(mntd_ras, main= "Mean Nearest Neighbour Phylogenetic Distance")
Sensitivity analysis
In this part we are going to load in a data set from Skeels et al. (2022) SystBiol in which we simulated data with Gen3sis under four alternative models to test the evolutionary speed hypothesis (ESH). The ESH hypothesizes that faster rates of evolution occurs in lineages from warm regions like the tropics because they are have higher mutagenesis from faster life histories associated with warm temperatures and smaller body sizes. The four models used there were
M0 - the null. population divergence is independent of temperature and body size
M1 - Temperature Trailblazer. environmental temperature drives rate of population divergence
M2 - Size Shaper. body size drives the rate of population divergence
M3 - Synergistic Drivers. environmental temperature and body size drives the rate of population divergence
Not only did we change the overall model of evolution, we also varied key parameters for rates of niche evolution (simga_squared_t), rates of body size evolution (sigma_squared_bs), dispersal, and the temperature niche breadth (omega), the exponent of the divergence factor with temp/body size (lambda), and the divergence threshold. Load in the data and take a look, the first 6 columns are the model parameters we varied.
<- read.csv("data/simulated_summary_statistics.csv")
sim_data
# look at the first few columns
head(sim_data)
divergence_threshold lambda omega sigma_squared_bs sigma_squared_t dispersal
1 6 3.497 0.022 0.010 0.008 2.748
2 8 2.743 0.029 0.006 0.012 2.119
3 4 4.251 0.016 0.015 0.004 3.376
4 5 3.120 0.026 0.003 0.013 3.691
5 9 4.629 0.013 0.013 0.006 2.433
6 7 2.365 0.019 0.008 0.003 3.062
m n_extant_diversity richness_lat_cor richness_temp0mya_cor temp_skewness
1 m0 3763 -0.5123028 0.4603989 -0.3377488
2 m0 2514 -0.4425833 0.3879755 -0.4301348
3 m0 2881 -0.6090030 0.5980889 -0.5485366
4 m0 3259 -0.2692291 0.2709599 -0.4056149
5 m0 1720 -0.6320276 0.5569583 -0.2034315
6 m0 2040 -0.6901237 0.7143851 -0.5818263
temp_kurtosis rs_skewness rs_kurtosis bodysize_temp_cor
1 2.404199 7.054563 70.57922 0.0001267958
2 2.781235 7.065556 73.60686 -0.0573601538
3 2.686734 6.568041 66.80411 0.0627634742
4 2.578726 6.372164 56.40946 -0.0099951252
5 2.039092 5.943670 62.21730 -0.0491801922
6 2.692505 5.557220 43.19415 -0.1425113733
bodysize_rangesize_cor rangesize_temp_cor bodysize_MRDa_cor bodysize_MRDn_cor
1 -0.021888162 0.3166909 -0.22987475 -0.22987475
2 -0.004099039 0.3750050 0.11431674 0.11431674
3 -0.007583409 0.1745926 -0.17385167 -0.17385167
4 0.019477523 0.3315223 0.15882106 0.15882106
5 -0.020735305 0.3106268 -0.01367189 -0.01367189
6 -0.068035906 0.2386339 -0.25345879 -0.25345879
bodysize_MRDs_cor bodysize_ED_cor bodysize_ES_cor bodysize_DR_cor
1 -0.22987475 -0.004220075 -0.001379360 0.004220206
2 0.11431674 -0.048918445 -0.047336234 0.048918445
3 -0.17385167 0.156214848 0.124189954 -0.156214848
4 0.15882106 0.025569939 0.020867045 -0.025569939
5 -0.01367189 -0.091018579 -0.071841554 0.091018579
6 -0.25345879 -0.011839185 -0.009998883 0.011839185
temp_MRDa_cor temp_MRDn_cor temp_MRDs_cor temp_ED_cor temp_ES_cor temp_DR_cor
1 -0.26576712 -0.26576712 -0.26576712 0.26183743 0.2202009 -0.26183732
2 -0.17175772 -0.17175772 -0.17175772 0.24312164 0.2193826 -0.24312164
3 -0.26126033 -0.26126033 -0.26126033 0.20306519 0.2015201 -0.20306519
4 -0.18047283 -0.18047283 -0.18047283 0.09627562 0.0921575 -0.09627562
5 -0.29855804 -0.29855804 -0.29855804 0.51772365 0.4230468 -0.51772365
6 0.05847105 0.05847105 0.05847105 0.11024859 0.1080941 -0.11024859
rangesize_MRDa_cor rangesize_MRDn_cor rangesize_MRDs_cor rangesize_ED_cor
1 -0.06934918 -0.06934918 -0.06934918 0.08511133
2 -0.10091990 -0.10091990 -0.10091990 0.12507118
3 0.04022504 0.04022504 0.04022504 -0.09452590
4 -0.04660310 -0.04660310 -0.04660310 0.07933631
5 -0.13259037 -0.13259037 -0.13259037 0.18191779
6 0.06267093 0.06267093 0.06267093 -0.01025940
rangesize_ES_cor rangesize_DR_cor collessI sackinI gamma lat_pd_cor
1 0.077434173 -0.08511127 1.2941150 1.2446979 -2.842964 0.31465895
2 0.122658910 -0.12507118 0.2469526 0.3055442 -3.432895 0.25315279
3 -0.069705157 0.09452590 1.7248814 1.6445494 3.552396 0.18284230
4 0.087443271 -0.07933631 0.1092457 -0.1254847 -2.856134 0.11539323
5 0.147346956 -0.18191779 1.9250034 1.8373325 1.221135 -0.06489522
6 0.006120436 0.01025940 1.4952264 1.3605311 -5.120426 0.53662583
lat_mpd_cor lat_mntd_cor temp_pd_cor temp_mpd_cor temp_mntd_cor
1 0.03036122 0.33145967 -0.24781657 0.01816540 -0.268472337
2 0.10829373 0.24772423 -0.19630776 -0.09235074 -0.184665665
3 0.18551711 0.24143456 -0.18427695 -0.19302604 -0.243213118
4 -0.03477720 0.15145822 -0.09837927 0.03810852 -0.129764794
5 -0.20454388 0.09798374 0.14050779 0.24629073 -0.006957113
6 0.46224942 0.53924111 -0.54984407 -0.47923803 -0.551691573
richness_pd_cor richness_mpd_cor richness_mntd_cor temp_bsm_cor temp_bssd_cor
1 -0.7653721 -0.41190740 -0.7499123 0.13029352 0.10296607
2 -0.5019377 -0.13404370 -0.4901978 0.02405246 0.32276835
3 -0.4206957 -0.37006059 -0.5361401 0.29116267 0.28959998
4 -0.6620540 -0.41176821 -0.6769946 -0.05320780 0.04856645
5 -0.1328847 0.06086838 -0.2905060 -0.08286640 0.08193951
6 -0.8328187 -0.71895993 -0.8346579 0.31747185 0.34586235
richness_bsm_cor richness_bssd_cor pd_bsm_cor mpd_bsm_cor mntd_bsm_cor
1 0.11558582 0.2699839 0.003325102 0.2260060 -0.05039301
2 -0.04537885 0.3499518 -0.263036186 -0.2484294 -0.25064129
3 0.23668036 0.3628398 0.515713659 0.4795434 0.49672112
4 0.04042515 0.1778522 -0.052266148 -0.1187109 -0.03605613
5 0.31046985 0.3722813 0.061612978 0.1029425 0.06221910
6 0.45747017 0.3837205 -0.038428520 0.3474728 -0.12562991
# look at the models
unique(sim_data$m)
[1] "m0" "m1" "m2" "m3"
This data set has 27 metrics used in our paper to measure patterns in the distribution of species, such as range size metrics, or correlations between temperature and diversity, as well as phylogenetic tree shape metrics, such as gamma, and measures of functional diversity, like body size variance. We predicted that these different models of evolution (M0-M4) should leave discernible signatures in these metrics. We can plot a few associations between biodiversity metrics and these model parameters to test this hypothesis.
require(ggplot2)
# how is diversity related to the dispersal ability of a clade?
ggplot(sim_data, aes(x=m, y=log(n_extant_diversity), fill=m))+
geom_point(alpha=0.7, position = "jitter")+
geom_boxplot(alpha=0.7)+
stat_smooth()+
theme_classic()
# how about the temperature~diversity gradient?
ggplot(sim_data, aes(x=m, y=richness_temp0mya_cor, fill=m))+
geom_point(alpha=0.7, position = "jitter")+
geom_boxplot(alpha=0.7)+
stat_smooth()+
theme_classic()
# how about the relationship between body size and diversification rate?
ggplot(sim_data, aes(x=m, y=bodysize_DR_cor, fill=m))+
geom_point(alpha=0.7, position = "jitter")+
geom_boxplot(alpha=0.7)+
stat_smooth()+
theme_classic()
# how about the relationship between temperature and diversification rate?
ggplot(sim_data, aes(x=m, y=temp_DR_cor, fill=m))+
geom_point(alpha=0.7, position = "jitter")+
geom_boxplot(alpha=0.7)+
stat_smooth()+
theme_classic()
Do these patterns fit our predictions? We can also look at how some of these metrics varied with the continuous model parameters such as dispersal ability.
# how is diversity related to the dispersal ability of a clade?
ggplot(sim_data, aes(x=dispersal, y=log(n_extant_diversity)))+
geom_point(alpha=0.7)+
stat_smooth()+
theme_classic()
# how about the latitude diversity gradient?
ggplot(sim_data, aes(x=dispersal, y=temp_DR_cor))+
geom_point(alpha=0.7)+
stat_smooth()+
theme_classic()
To perform a simple kind of sensitivity test we might ask how each of the model parameters predicts linear changes in the distribution of a biodiversity patterns using a multiple regression model. One example where we have a good idea of what the relationship should be is the how variance in the distribution of temperature niches across species (e.g., skewness of the distribution) relates to model paramaters. We expect that this should scale with the rate of rate of temperature niche evolution - faster rates of change = more variation in the trait = more kutosis.
# first scale parameters to be in the same units
<- sim_data
sim_data_scaled 1:6] <- scale(sim_data_scaled[,1:6] )
sim_data_scaled[,
# fit the multiple regression
<- lm(temp_kurtosis ~ divergence_threshold+lambda+omega+sigma_squared_bs+sigma_squared_t+dispersal, data=sim_data_scaled)
lm1
# look at model coefficients
summary(lm1)
Call:
lm(formula = temp_kurtosis ~ divergence_threshold + lambda +
omega + sigma_squared_bs + sigma_squared_t + dispersal, data = sim_data_scaled)
Residuals:
Min 1Q Median 3Q Max
-2.7214 -0.7281 -0.3103 0.2481 22.9740
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.26301 0.04670 69.872 < 2e-16 ***
divergence_threshold 0.48515 0.04978 9.745 < 2e-16 ***
lambda 0.22304 0.04717 4.729 2.49e-06 ***
omega 0.07447 0.04769 1.562 0.119
sigma_squared_bs 0.02415 0.04677 0.516 0.606
sigma_squared_t 0.29197 0.04690 6.225 6.37e-10 ***
dispersal 0.26180 0.04915 5.326 1.17e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.737 on 1377 degrees of freedom
Multiple R-squared: 0.1076, Adjusted R-squared: 0.1037
F-statistic: 27.68 on 6 and 1377 DF, p-value: < 2.2e-16
What do you see? Are our expectations met? Any surprises? Let’s try a few other biodiversity patterns where the predictions are less clear.
# fit the multiple regressions
# Gamma statsitic for phylogenetic tree shape
<- lm(gamma ~ divergence_threshold+lambda+omega+sigma_squared_bs+sigma_squared_t+dispersal, data=sim_data_scaled)
lm2
# skewness of the range size distribution of species
<- lm(rs_skewness ~ divergence_threshold+lambda+omega+sigma_squared_bs+sigma_squared_t+dispersal, data=sim_data_scaled)
lm3
# correlation between species range sizes and temperature
<- lm(rangesize_temp_cor ~ divergence_threshold+lambda+omega+sigma_squared_bs+sigma_squared_t+dispersal, data=sim_data_scaled)
lm4
# look at model coefficients
summary(lm2)
Call:
lm(formula = gamma ~ divergence_threshold + lambda + omega +
sigma_squared_bs + sigma_squared_t + dispersal, data = sim_data_scaled)
Residuals:
Min 1Q Median 3Q Max
-14.6901 -3.3262 -0.3125 3.0625 19.1039
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.58975 0.13209 12.035 < 2e-16 ***
divergence_threshold -1.67245 0.14081 -11.877 < 2e-16 ***
lambda -0.36850 0.13341 -2.762 0.00582 **
omega 1.13745 0.13489 8.432 < 2e-16 ***
sigma_squared_bs 0.04682 0.13230 0.354 0.72347
sigma_squared_t -0.17454 0.13266 -1.316 0.18848
dispersal 0.99940 0.13903 7.188 1.07e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.914 on 1377 degrees of freedom
Multiple R-squared: 0.2175, Adjusted R-squared: 0.2141
F-statistic: 63.78 on 6 and 1377 DF, p-value: < 2.2e-16
summary(lm3)
Call:
lm(formula = rs_skewness ~ divergence_threshold + lambda + omega +
sigma_squared_bs + sigma_squared_t + dispersal, data = sim_data_scaled)
Residuals:
Min 1Q Median 3Q Max
-5.5219 -1.7186 -0.2771 1.3923 21.2181
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.14631 0.06701 91.717 < 2e-16 ***
divergence_threshold -2.11448 0.07144 -29.598 < 2e-16 ***
lambda -0.60219 0.06768 -8.897 < 2e-16 ***
omega -0.31157 0.06844 -4.553 5.76e-06 ***
sigma_squared_bs 0.22387 0.06712 3.335 0.000875 ***
sigma_squared_t -0.57043 0.06730 -8.476 < 2e-16 ***
dispersal -1.33498 0.07053 -18.927 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.493 on 1377 degrees of freedom
Multiple R-squared: 0.4552, Adjusted R-squared: 0.4528
F-statistic: 191.8 on 6 and 1377 DF, p-value: < 2.2e-16
summary(lm4)
Call:
lm(formula = rangesize_temp_cor ~ divergence_threshold + lambda +
omega + sigma_squared_bs + sigma_squared_t + dispersal, data = sim_data_scaled)
Residuals:
Min 1Q Median 3Q Max
-0.72559 -0.08895 0.01488 0.09329 0.68488
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.197748 0.004238 46.658 < 2e-16 ***
divergence_threshold -0.029361 0.004518 -6.498 1.13e-10 ***
lambda -0.022882 0.004281 -5.345 1.05e-07 ***
omega -0.042760 0.004328 -9.879 < 2e-16 ***
sigma_squared_bs -0.001199 0.004245 -0.283 0.778
sigma_squared_t 0.041124 0.004256 9.662 < 2e-16 ***
dispersal -0.019362 0.004461 -4.340 1.53e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1577 on 1377 degrees of freedom
Multiple R-squared: 0.1388, Adjusted R-squared: 0.135
F-statistic: 36.99 on 6 and 1377 DF, p-value: < 2.2e-16
In all these cases there are interesting associations between model parameters and the biodiversity metrics. However, look at the R-squared values and what do you find? They are highly variable and in some cases quite low. This means that lots of variance in these biodiversity metrics are not explained by a linear combination of our model parameters. This is important when interpreting the effect. There are other ways of inferring more complex relationships between model parameters and biodiversity metrics, such as by including quadratic effects, exploring interaction terms, fitting non-linear models such as generalised additive models (GAMs), or even using machine learning methods such as neural networks which allow for highly-dimension non-linear effects. We won’t cover these today but they are all useful options to explore during sensitivity analysis.
4.0.1 Model Selection
Once we have established that some biodiversity metrics showed predictable relationships with model parameters or the generative model (e.g., M0-M4) we can use these metrics to perform model selection on empirical data. Here we are using the match between observed biodiversity patterns and simulated biodiversity patterns, to ask which model might be most likely to have generated the real patterns.
To do this we are going to perform a linear discriminant analysis which is a classification tool that can fit fit fairly quickly compared to some of the other models. We want to validate how well the model performs so we will perform a 10-fold cross validation repeated 10 times. Here we train the model on a subset of the data, repeating the process and optimising the predictive capacity. Then we predict how good a job our classifier is on a witheld portion of the data (test data). If the model performs well we should be able to accurately predict what models generated what biodiversity metrics.
require(caret)
# get just the biodiversity metrics
<- sim_data[,7:ncol(sim_data)]
sim_data_ms
# remove MRDs variables from sim data
<- sim_data_ms[, -which(grepl("MRDs",colnames(sim_data_ms)))]
sim_data_ms
# partition into a training and testing dataset for cross validation
<- createDataPartition(sim_data_ms$m, p = .66, list = FALSE, times = 1)
train_index <- sim_data_ms[ train_index ,]
train_data <- sim_data_ms[-train_index ,]
test_data
# preprocess values - we will scale and center values
<- preProcess(train_data, method = c("center", "scale"))
preprocessed_values <- predict(preprocessed_values, train_data)
train_transformed <- predict(preprocessed_values, test_data )
test_transformed
# configure the cross-validation paramaters
<- trainControl( method = "repeatedcv", number = 10, repeats = 10, classProbs = TRUE, savePredictions = TRUE)
train_control
# FIT MODELS
<- formula(paste("m ~ ", paste(names(sim_data_ms)[2:c(length(names(sim_data_ms)))], collapse=" + ")))
f1
## LINEAR DISCRIMINANT ANALYSIS
# note we'll run an LDA because they're quick - many other kinds of models to choose from
<- train(f1, data=train_transformed, method = "lda", trControl = train_control, verbose = T)
lda_train
#predict on test data
<- predict(lda_train, test_transformed)
lda_test
# classification accuracy
<- confusionMatrix(data = lda_test, reference = as.factor(test_transformed$m), mode = "prec_recall")
lda_cm
# how well did the model perform
lda_cm
Confusion Matrix and Statistics
Reference
Prediction m0 m1 m2 m3
m0 102 18 1 6
m1 6 91 3 10
m2 0 1 81 24
m3 9 6 32 77
Overall Statistics
Accuracy : 0.7516
95% CI : (0.7098, 0.7902)
No Information Rate : 0.2505
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6688
Mcnemar's Test P-Value : 0.09666
Statistics by Class:
Class: m0 Class: m1 Class: m2 Class: m3
Precision 0.8031 0.8273 0.7642 0.6210
Recall 0.8718 0.7845 0.6923 0.6581
F1 0.8361 0.8053 0.7265 0.6390
Prevalence 0.2505 0.2484 0.2505 0.2505
Detection Rate 0.2184 0.1949 0.1734 0.1649
Detection Prevalence 0.2719 0.2355 0.2270 0.2655
Balanced Accuracy 0.9002 0.8652 0.8104 0.7619
Whats the overall accuracy? Are some models predicted better than others?
Now we’ll use this model to predict the possible model of diversification in orders of terrestrial vertebrates.
# first subset and clean the colnames of the empirical data
<- read.csv("data/order_empirical_summary_statistics.csv")
empirical_data
# just want to look at diverse clades
<- na.omit(empirical_data[which(empirical_data$n_species >= 20),])
empirical_data
# make labels lower case
$taxon <- tolower(empirical_data$taxon)
empirical_data
# clean up a few names because I'm a grub
colnames(empirical_data)[which(colnames(empirical_data) == "taxon")] <- "m"
colnames(empirical_data)[which(colnames(empirical_data)=="rs_kutosis")] <- "rs_kurtosis"
colnames(empirical_data)[which(colnames(empirical_data)=="n_species")] <- "n_extant_diversity"
colnames(empirical_data) <- gsub("_p_cor", "_cor",colnames(empirical_data)) # change _p_cor for posterior samplescould also change _m_cor to use MCC samples
colnames(empirical_data) <- gsub("DivRate", "DR",colnames(empirical_data))
colnames(empirical_data)[which(colnames(empirical_data) == "taxon")] <- "m"
colnames(empirical_data)[which(colnames(empirical_data) == "collessI_posterior")] <- "collessI"
colnames(empirical_data)[which(colnames(empirical_data) == "sackinI_mcc")] <- "sackinI"
colnames(empirical_data)[which(colnames(empirical_data) == "gamma_mcc")] <- "gamma"
# ok these should be able to be matched now
<- empirical_data[, which(colnames(empirical_data) %in% colnames(sim_data_ms))]
empirical_data <- empirical_data[, match(colnames(sim_data_ms), colnames(empirical_data))]
empirical_subset
# check names match
colnames(sim_data_ms)[which(!colnames(sim_data_ms) %in% colnames(empirical_subset))]
character(0)
colnames(empirical_subset)[which(!colnames(empirical_subset) %in% colnames(sim_data_ms))]
character(0)
colnames(empirical_subset) == colnames(sim_data_ms)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# Process empirical data in the same way as for the simulated data
<- predict(preprocessed_values, sim_data_ms )
simulated_transformed <- predict(preprocessed_values, empirical_subset )
empirical_transformed
<- list(lda_train)
model_set
# predict on empirical
<- predict(model_set, newdata = empirical_transformed, type = "raw", na.action = na.omit)
class_predictions <- predict(model_set, newdata = empirical_transformed, type = "prob", na.action = na.omit)
class_probabilities
# sum the classes
colSums(do.call(rbind, class_predictions)) # TODO define class prediction table
[1] 3 4 1 2 1 1 4 1 3 3 1 4 2 3 1 4 4 4 3 2 1 1 3 2 1 1 4 4 4 3 4 1 4 4 3 4 1 4
[39] 1 4 1 3 4 3 4 2 4 1
# now we can see which model is the most supported for each order
cbind(empirical_transformed$m, as.character(class_predictions[[1]]))
[,1] [,2]
[1,] "serpentes" "m2"
[2,] "anguimorpha" "m3"
[3,] "dibamia" "m0"
[4,] "gekkota" "m1"
[5,] "iguania" "m0"
[6,] "lacertoidea" "m0"
[7,] "scincoidea" "m3"
[8,] "afrosoricida" "m0"
[9,] "carnivora" "m2"
[10,] "cetartiodactyla" "m2"
[11,] "chiroptera" "m0"
[12,] "cingulata" "m3"
[13,] "dasyuromorphia" "m1"
[14,] "didelphimorphia" "m2"
[15,] "diprotodontia" "m0"
[16,] "eulipotyphla" "m3"
[17,] "lagomorpha" "m3"
[18,] "primates" "m3"
[19,] "rodentia" "m2"
[20,] "scandentia" "m1"
[21,] "accipitriformes" "m0"
[22,] "anseriformes" "m0"
[23,] "apodiformes" "m2"
[24,] "bucerotiformes" "m1"
[25,] "caprimulgiformes" "m0"
[26,] "charadriiformes" "m0"
[27,] "columbiformes" "m3"
[28,] "coraciiformes" "m3"
[29,] "cuculiformes" "m3"
[30,] "falconiformes" "m2"
[31,] "galliformes" "m3"
[32,] "gruiformes" "m0"
[33,] "musophagiformes" "m3"
[34,] "otidiformes" "m3"
[35,] "passeriformes" "m2"
[36,] "pelecaniformes" "m3"
[37,] "piciformes" "m0"
[38,] "procellariiformes" "m3"
[39,] "psittaciformes" "m0"
[40,] "strigiformes" "m3"
[41,] "suliformes" "m0"
[42,] "tinamiformes" "m2"
[43,] "trogoniformes" "m3"
[44,] "anura" "m2"
[45,] "caudata" "m3"
[46,] "gymnophiona" "m1"
[47,] "crocodylia" "m3"
[48,] "testudines" "m0"
table(class_predictions[[1]])
m0 m1 m2 m3
15 5 10 18
# can also look at the variation in this
cbind(empirical_transformed$m, round(class_probabilities[[1]], 3))
empirical_transformed$m m0 m1 m2 m3
1 serpentes 0.081 0.000 0.919 0.000
2 anguimorpha 0.000 0.000 0.000 1.000
3 dibamia 1.000 0.000 0.000 0.000
4 gekkota 0.000 0.655 0.000 0.345
5 iguania 1.000 0.000 0.000 0.000
6 lacertoidea 1.000 0.000 0.000 0.000
7 scincoidea 0.000 0.000 0.000 1.000
8 afrosoricida 1.000 0.000 0.000 0.000
9 carnivora 0.000 0.000 1.000 0.000
10 cetartiodactyla 0.176 0.000 0.824 0.000
11 chiroptera 0.880 0.000 0.120 0.000
12 cingulata 0.000 0.000 0.000 1.000
13 dasyuromorphia 0.000 1.000 0.000 0.000
15 didelphimorphia 0.000 0.000 1.000 0.000
16 diprotodontia 1.000 0.000 0.000 0.000
17 eulipotyphla 0.000 0.000 0.000 1.000
19 lagomorpha 0.000 0.000 0.000 1.000
28 primates 0.000 0.000 0.000 1.000
30 rodentia 0.000 0.000 1.000 0.000
31 scandentia 0.002 0.998 0.000 0.000
33 accipitriformes 1.000 0.000 0.000 0.000
34 anseriformes 1.000 0.000 0.000 0.000
35 apodiformes 0.000 0.000 1.000 0.000
37 bucerotiformes 0.000 1.000 0.000 0.000
38 caprimulgiformes 1.000 0.000 0.000 0.000
41 charadriiformes 0.972 0.028 0.000 0.000
44 columbiformes 0.000 0.001 0.000 0.999
45 coraciiformes 0.000 0.000 0.000 1.000
46 cuculiformes 0.000 0.000 0.000 1.000
48 falconiformes 0.016 0.000 0.984 0.000
49 galliformes 0.000 0.000 0.000 1.000
51 gruiformes 1.000 0.000 0.000 0.000
53 musophagiformes 0.000 0.000 0.000 1.000
54 otidiformes 0.000 0.000 0.000 1.000
55 passeriformes 0.000 0.000 1.000 0.000
56 pelecaniformes 0.000 0.000 0.032 0.968
59 piciformes 0.999 0.000 0.001 0.000
61 procellariiformes 0.000 0.000 0.000 1.000
62 psittaciformes 0.672 0.328 0.000 0.000
66 strigiformes 0.000 0.000 0.000 1.000
67 suliformes 1.000 0.000 0.000 0.000
68 tinamiformes 0.000 0.000 0.775 0.225
69 trogoniformes 0.000 0.000 0.000 1.000
70 anura 0.000 0.000 0.941 0.059
71 caudata 0.000 0.000 0.000 1.000
72 gymnophiona 0.031 0.968 0.000 0.001
73 crocodylia 0.000 0.000 0.000 1.000
74 testudines 1.000 0.000 0.000 0.000
What patterns do you see? Which models are the most supported?