summary_analysis.Rmd

---
title: "phytoplankton class prediction"
author: "Flavien"
date: "14/06/2021"
output: html_document
---

The 3X1M sensor is a fluorometer with 3 different exciation channels (440, 470 and 532nm). The fluorescence signal correspond to different accesories pigment, that can be used as biomarker of different taxa. Experimentations have been conducted in the SBR to characterize the response of different mediterranean taxa of phytoplankton. From those data, it is possible to develope a model that classify the signal response directly in situ. 

# Experimentation results ###

As we calculated the chlorophyll from spectrophotmetri in first place, we compared it with hplc measurement afterward.

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE,  fig.align="center", fig.width=7, fig.height=5, fig.cap="Figure: Chla HPLC compared to Chla estimated with spectro"}
library(tidyverse)
library(scales)
results <- read_csv("resultats/results_reshape.csv") %>% rename("f440_mean" = "440nm_mean", "f470_mean" = "470nm_mean", "f532_mean" = "532nm_mean")

results_with_hplc <- filter(results, chla_qa != 'NA')

results_chla <- results_with_hplc %>% dplyr::select(strain ,dilution, replicate, chla_conc_spectro, total_chlorophyll_a) %>% 
  pivot_longer(4:5, names_to = 'chla', values_to = 'concentration') %>% filter(dilution %in% c('2', 'Cmere'))


ggplot(results_with_hplc)+
  geom_point(aes(x = total_chlorophyll_a, y = chla_conc_spectro, colour = strain))+
  geom_smooth(aes(x = total_chlorophyll_a, y = chla_conc_spectro), method = 'lm', se = FALSE)+
  ylab("Chla spectro")+
  xlab("Chla HPLC")+
  ggtitle("Rsq = 0.94")+
  theme_bw()+
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_color_brewer(name = "Souches", palette = "Paired")
```


```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE,  fig.align="center", fig.width=7, fig.height=5, fig.cap="Figure: Distribution of phytoplankton groups"}
result <- results_with_hplc %>% mutate(chl_content = total_chlorophyll_a/number_cells,
                                       f440_content = f440_mean/number_cells,
                                       f470_content = f470_mean/number_cells,
                                       f532_content = f532_mean/number_cells,
                                       f440_f470 = f440_mean/f470_mean,
                                       f440_f532 = f440_mean/f532_mean,
                                       f532_f440 = f532_mean/f440_mean,
                                       f470_f440 = f470_mean/f440_mean,
                                       f532_f470 = f532_mean/f470_mean,
                                       f440_ratio = f440_mean/total_chlorophyll_a,
                                       f470_ratio = f470_mean/total_chlorophyll_a,
                                       f532_ratio = f532_mean/total_chlorophyll_a,
                                       d470_d440 = f470_ratio/f440_ratio,
                                       d532_d440 = f532_ratio/f440_ratio) 


result <- result %>% mutate(common_name = case_when(
  strain %in% c("RCC1717", "RCC76", "RCC4213") ~ "Diatom",
  strain %in% c("RCC162", "RCC156", "PCC9511") ~ "Prochlorococcus",
  strain %in% c("RCC2379", "RCC2374", "RCC2319") ~ "Synechococcus",
  strain %in% c("RCC100") ~ "Pelagophyte",
  strain %in% c("RCC3006") ~ "Dino"
))

ggplot(result)+
  geom_point(aes(x = d532_d440, y = d470_d440, colour = common_name))+
  xlab("fluo 532 / fluo 440")+
  ylab("fluo 470 / fluo 440")+
  theme_bw()
```

This figure indicated that it may be possible to dsicriminate the different phytoplankton groups with a linear discrimination analysis (LDA).

# LDA ####

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE,  fig.align="center", fig.width=7, fig.height=5, fig.cap="Figure: LDA plot of the dataset"}
library(MASS)

DTrain <- result %>% filter(dilution != "Cmere") %>% dplyr::select(common_name, f470_f440, f532_f440, f532_f470)

mlda <- lda(common_name ~ ., data = DTrain)

Dplot <- cbind(scale(as.matrix(DTrain[,-1]),scale=FALSE) %*% mlda$scaling,DTrain[,1,drop=FALSE])

ggplot(data=Dplot)+
  geom_point(aes(x = LD1, y = LD2, colour = common_name))+
  theme_bw()
```

A bootstrap technique was used (2/3 train 1/3 test) to create a confusion matrix to understand which class can be well predict or not.

```{r echo=FALSE, message=FALSE, warning=FALSE}
ytest <- c()
prediction <- c()
for(i in c(1:1000)){
  training.samples <- caret::createDataPartition(DTrain$common_name, p = 0.8, list = FALSE)
  training.samples <- sample(nrow(DTrain), 31)
  
  train.data  <- DTrain[training.samples, ]
  test.data <- DTrain[-training.samples, ] %>% dplyr::select(-common_name)
  ytest <- c(ytest, DTrain[-training.samples,]$common_name)
  
  mlda <- lda(common_name ~ ., data = train.data)
  
  predTest <- predict(mlda, newdata=test.data)
  
  predicted <- as.character(predTest$class)

  prediction <- c(prediction, predicted)
    
}

mc <- table(ytest, prediction)
print(mc)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE,  fig.align="center", fig.width=7, fig.height=5, fig.cap="Figure: Percentage of good prediction from the LDA"}
result_pred <- data.frame("real" = ytest, "predict" = prediction) %>% 
  mutate("classification" = case_when(
    real == predict ~ "ok",
    real != predict ~ "nok"
  )) %>% 
  group_by(real, classification) %>% 
  summarise(n = n()) %>% 
  mutate(percent = n/sum(n) * 100)

ggplot(result_pred)+
  geom_bar(aes(x = real, y = percent, fill = classification), stat = "identity")+
  theme_bw()+
  xlab("Phytoplankton group")+
  ylab("%age")+
  scale_fill_brewer(palette = "Set1")
```

The model is good to predict Diatom, Pelagophyte and Synechococcus, but is not powerfull enough to detect dinoflagellate and Prochlorococcus.

# Application ####

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE,  fig.align="center", fig.width=8, fig.height=5, fig.cap="Figure: Prediction of phytoplankton group on a Boussole profile"}

b227_desc2 <- read_csv("resultats/bouss_prof/b227_desc2.csv")


b227 <- dplyr::select(b227_desc2, pres, fluo_440, fluo_470, fluo_532) %>% na.omit() %>% mutate(depth = round(pres)) %>% 
  group_by(depth) %>% 
  mutate(f470_f440 = fluo_470/fluo_440,
             f532_f440 = fluo_532/fluo_440,
             f532_f470 = fluo_532/fluo_470) %>% 
  summarise_all(mean) %>% 
  ungroup() %>% 
  mutate(prof = "b227")


test2 <- b227 %>% 
  dplyr::select(f470_f440, f532_f440, f532_f470) 

mlda <- lda(common_name~., data = DTrain)

pred227 <- predict(mlda, newdata = test2)


b227$taxa <- pred227$class
b227_prob <- data.frame(pred227$posterior) %>% mutate(depth = b227$depth) %>% 
  pivot_longer(1:5, names_to = "taxa", values_to = "probability") %>% 
  group_by(depth) %>% 
  summarise(m = max(probability)) %>% 
  pull(m)
b227$prob <- b227_prob


ggplot(filter(b227, taxa != "Dino"))+
  geom_point(aes(x = fluo_470*0.007, y = - depth, colour = taxa, size = prob))+
  xlab("Chla concentration (ug.L-1)")+
  ylab("Depth")+
  ylim(-250,0)+
  xlim(0,1.1)+
  scale_colour_brewer(palette = "Set1")+
  ggtitle("Boussole 227 (February)")+
  theme_bw()

b227_fluo <- dplyr::select(b227, depth, fluo_440, fluo_470, fluo_532) %>% pivot_longer(2:4, names_to = "Wavelength", values_to = "Counts")

ggplot(b227_fluo)+
  geom_point(aes(x = Counts, y = - depth, colour = Wavelength))+
  scale_color_brewer(palette = "Paired")+
  theme_bw()

```

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE,  fig.align="center", fig.width=8, fig.height=5, fig.cap="Figure: Prediction of phytoplankton group on a Boussole profile"}
b229_desc2 <- read_csv("resultats/bouss_prof/b229_desc2.csv")


b229 <- dplyr::select(b229_desc2, pres, fluo_440, fluo_470, fluo_532) %>% na.omit() %>% mutate(depth = round(pres)) %>% 
  mutate(f470_f440 = fluo_470/fluo_440,
         f532_f440 = fluo_532/fluo_440,
         f532_f470 = fluo_532/fluo_470) %>% 
  group_by(depth) %>% 
  summarise_all(mean) %>% 
  ungroup()


b229_test <- b229 %>% 
  dplyr::select(f470_f440, f532_f440, f532_f470)
pred229 <- predict(mlda, newdata= b229_test)

b229$taxa <- pred229$class

b229_prob <- data.frame(pred229$posterior) %>% mutate(depth = b229$depth) %>% 
  pivot_longer(1:5, names_to = "taxa", values_to = "probability") %>% 
  group_by(depth) %>% 
  summarise(m = max(probability)) %>% 
  pull(m)
b229$prob <- b229_prob

ggplot(b229)+
  geom_point(aes(x = fluo_470*0.007, y = - depth, colour = taxa, size = prob))+
  xlab("Chla concentration (ug.L-1)")+
  ylab("Depth")+
  ylim(-250,0)+
  xlim(0,1.1)+
  scale_colour_brewer(palette = "Set1")+
  ggtitle("Boussole 229 (April)")+
  theme_bw()

b229_fluo <- dplyr::select(b229, depth, fluo_440, fluo_470, fluo_532) %>% pivot_longer(2:4, names_to = "Wavelength", values_to = "Counts")

ggplot(b229_fluo)+
  geom_point(aes(x = Counts, y = - depth, colour = Wavelength))+
  scale_color_brewer(palette = "Paired")+
  theme_bw()
```

The plots show that the LDA discriminate different communities along the profile. Especially on the Chl maximum structure. However the community structure should be examined to determine the accuracy of this output. It seems that Synechoccus are over represented in the output, which can be due to better discrimination of this taxa with the 532 channel.