I’m using random forest models to make predictions on the “humanized” germfree mouse experiments.

Problem definition

Want to predict:

  • Day 1 cdifficile levels based on day 0 community
  • Level of toxin based on community
  • Community to predict death or not–try using just last day of each or include each day with its outcome of death or not
    • Categorical random forest to predict low colonization, high/sustained colonization, or death

Plan

Maybe a function to generally take the previous day and subsequent day’s cdiff predictions

  • Inputs:
    • day of interest
    • Predicted variable (cdiff, toxin levels, death)

Results

I ran a full random forest model over all 10 days predicting level of cdiff colonization. This model was built on OTUs not including cdiff OTU 8, OTUs were greater than 1% for at least 1 day by cage. This ultimately was based on 141 OTUs. The model explained 62.4% of the variation: Full Random Forest

OTU 110 Coriobacteriaceae is by far the most important OTU. Interestingly this OTU is only present at above 1% in the “NP1” group over time and a couple “NP2” days. This “NP1” group is one of the few groups with little to no cdiff colonization, which only happens late in the time course.

Blast results of top 5:

OTU_Name Blast Name Max Score Total Score Query Coverage E value Identity Accession
OTU110_Coriobacteriaceae Eggerthella lenta strain DSM 2243 383 383 100% 2e-106 94% NR_074377.1
OTU42_Lachnospiraceae Blautia obeum strain ATCC 29174 429 429 100% 3e-120 97% NR_118692.1
OTU44_Clostridium Clostridium subterminale strain JCM 1417, Clostridium thiosulfatireducens strain DSM 13105, Clostridium subterminale strain DSM 6970, Clostridium sulfidigenes strain SGB2, Clostridium thiosulfatireducens strain LUP 21, Clostridium subterminale strain DSM 6970 468 468 100% 6e-132 100% NR_113027.1, NR_112656.1, NR_112653.1, NR_044161.1, NR_042718.1, NR_041795.1
OTU17_Clostridium_XlVa [Clostridium] clostridioforme strain ATCC 25537, [Clostridium] bolteae strain JCM 12243, [Clostridium] bolteae strain 16351 468 468 100% 6e-132 100% NR_118128.1, NR_113410.1, NR_025567.1
OTU5_Akkermansia Akkermansia muciniphila strain ATCC BAA-835, Akkermansia muciniphila strain Muc 468 468 100% 6e-132 100% NR_074436.1, NR_042817.1


When I break down the fit of this model by cage I get the following: Full Random Forest Fit


The partial plot including the top 12 OTUs only from the previous full model explained 59.2% of the variation. The relative importance among the top 12 also shifts a bit, placing greater importance on Enterococcus OTU2 levels: Partial Random Forest


I also looked at a model from just Day 0. This model explained only 39.6% of the variation in the data. Interestingly, there are several Clostridium XIVa OTUs that were important in this model. Clostridium XIVa includes many butyrate producing bacteria which in vitro have close association with the mucus layer. Day 0 Random Forest


When I break down the fit of this model by cage I get the following: Day 0 Random Forest Fit


The binary model (yes/no colonization) for Day 0 to Day 1 is not that great. The OOB is 14.52%, which is pretty good. However, while it accurately classifies all true yes samples, the non-colonized (less than 10^2 CFU/g feces) samples were miss classifed 90% of the time. There were only 10 “no” samples, so the n is pretty small. Day 0 Random Forest Classification

What’s next

  • Remove cdiff OTU 8
  • incorporate other variables into the model, such as toxin activity for cdiff levels
  • only look at cdiff 431!!
  • Make function to pick days included
  • Blast the top 4 OTUs in full RF model