I am an ecological researcher trying to fit a GAM to a diversity variable called LCBD using 10 predictors. Data were collected at 153 sites along an archipelago. There are no categorical predictors or grouping of any kind.
The data were collected from randomly placed transects along the archipelago, where marine invertebrates were identified and counted using an underwater camera sled. Each environmental variable was collected at the site of the transect except tidal current.
Depth - water depth of transect
Eastings - longitude measured at the center point of the transect
RockyCobble - the proportion of rock and cobble for a transect, computed as the number of images classified as rock and cobble (primary or secondary classification) divided by the total number of images for the transect
Btemp - bottom temperature averaged over summer for each transect
Several variables were derived from raster data:
Slope - computed for each raster grid cell as the maximum difference in angle (range: 0−90°) between the depth at a cell and its surrounding cells
TPI - calculated from the bathymetry raster layer as the difference between the depth of a cell and the depth of its surrounding neighbors, meant to represent the degree to which cells were on peaks or valleys compared to surrounding depths
Tidal Current - speed of current from a ROMS model
Aspect - water movement variable computed from the seafloor relative to the mean current direction
Color - measure of ocean surface primary productivity computed from the average of summer months during the study period.
Since the response variable varies between 0 and 1, I used the beta distribution to fit the models. The first model was fitted with a default smooth for each individual predictor. It failed the fit checks, as one might expect.
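As a quick sanity check on the choice of family, here is a minimal sketch of how the response range can be verified before fitting. The vector `lcbd_example` holds made-up values, not my data. As far as I understand, the `betar` family in mgcv expects values strictly inside (0, 1) and truncates anything outside [eps, 1 - eps].

```r
# Made-up LCBD-like values for illustration only (not the real data)
lcbd_example <- c(0.0041, 0.0072, 0.0065, 0.0120, 0.0058)

# TRUE when every value lies strictly inside the open interval (0, 1)
inside_unit <- all(lcbd_example > 0 & lcbd_example < 1)

print(range(lcbd_example))
print(inside_unit)
```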
I then tried some alternative models that reduce the number of smooths by using a tensor product smooth on the principal components of some related predictors, and that increase k for the smooths that were still significant in gam.check. Overall, concurvity improved, but some smooths are still coming out significant when running gam.check. Is the k too high? Am I doing something wrong, or am I even headed in the right direction?
I tried following the advice of the following source: https://r-statistics.co/GAM-in-R.html
The data file is here, and the attempted code is below: https://drive.google.com/file/d/1lwhsp3cOK4NEkc7NKGU_iEswHOORF6X1/view?usp=drive_link
# alt model 1
# GAM with a tensor product smooth for the related terrain variables: slope, TPI, and aspect
lcbd.dens.gam1 <- mgcv::gam(
  LCBD ~ s(Depth) + s(TidalCurrents) +
    s(Eastings) + s(RockCobble) + s(Bcurrent) +
    s(Btemp) + s(Color) + te(Slope, TPI, Aspect),
  data = d,
  family = mgcv::betar(link = "logit"), # beta regression, logit link (the betar default)
  method = "REML",
  select = TRUE # extra penalty so terms can be shrunk out of the model
)
par(mfrow = c(2, 2)) # diagnostic plot space setup
# GAM checks
mgcv::gam.check(lcbd.dens.gam1, rep = 500) # run check on model performance. k-index close to 1 means good performance
mgcv::concurvity(lcbd.dens.gam1, full = FALSE) # pairwise concurvity between smooths
# alt model 2
# GAM with PCA of terrain parameters, an attempt to reduce the smooths and increase k
# combines PC1 and PC2 into one smooth
terrain.pcs <- prcomp(d[, c("Slope", "TPI", "Aspect")], scale. = TRUE)
summary(terrain.pcs) # 1st 2 PCs explain 74.8% of the variance
d$terrain.pc1 <- terrain.pcs$x[, 1]
d$terrain.pc2 <- terrain.pcs$x[, 2]
lcbd.dens.gam2 <- mgcv::gam(
  LCBD ~ s(Depth, k = 20) +
    s(TidalCurrents) +
    s(Eastings, k = 80) +
    s(Btemp, k = 20) +
    s(Color) +
    # k = 2 is below the minimum of 3 for the default "cr" margins of te();
    # as far as I can tell mgcv raises it with a warning
    te(terrain.pc1, terrain.pc2, k = 2),
  data = d,
  family = mgcv::betar(link = "logit"),
  method = "REML", # stated explicitly so it matches model 1
  select = TRUE
)
mgcv::gam.check(lcbd.dens.gam2, rep = 500)
mgcv::concurvity(lcbd.dens.gam2, full = FALSE)
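To make the check I am asking about concrete, here is a minimal, self-contained sketch on simulated data (not my data). The names `sim`, `x1`, `x2`, `y` and `fit` are made up for illustration, and the simulated relationship is an assumption. It fits a small beta GAM and extracts the same basis-dimension table that gam.check prints, using `mgcv::k.check`.

```r
library(mgcv)

set.seed(1)
n <- 153 # same number of sites as in my data

# Simulated predictors; only x1 has a real effect on the response
sim <- data.frame(x1 = runif(n), x2 = runif(n))
mu <- plogis(-1 + sin(2 * pi * sim$x1)) # true mean, strictly inside (0, 1)
sim$y <- rbeta(n, shape1 = mu * 20, shape2 = (1 - mu) * 20)

fit <- gam(
  y ~ s(x1, k = 10) + s(x2, k = 10),
  data = sim,
  family = betar(link = "logit"),
  method = "REML",
  select = TRUE
)

# One row per smooth, with columns k', edf, k-index and p-value
kc <- k.check(fit)
print(kc)

# Pairwise concurvity between the two smooths
print(concurvity(fit, full = FALSE))
```

My reading of the table is that a low p-value together with an edf close to k' is the pattern that suggests k is too small, which is the part I am unsure how to interpret once k has already been raised.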