I introduced the hidden species problem in the form of four related
questions. We have answered the first two by computing the posterior
distribution for n
and the prevalence
of each species.
The other two questions are:
If we are planning to collect additional reads, can we predict how many new species we are likely to discover?
How many additional reads are needed to increase the fraction of observed species to a given threshold?
To answer predictive questions like this we can use the posterior distributions to simulate possible future events and compute predictive distributions for the number of species, and fraction of the total, we are likely to see.
The kernel of these simulations looks like this:
Choose n
from its posterior
distribution.
Choose a prevalence for each species, including possible unseen species, using the Dirichlet distribution.
Generate a random sequence of future observations.
Compute the number of new species, num_new
, as a function of the number of
additional reads, k
.
Repeat the previous steps and accumulate the joint distribution
of num_new
and
k
.
And here’s the code. RunSimulation
runs a single simulation:
# class Subject def RunSimulation(self, num_reads): m, seen = self.GetSeenSpecies() n, observations = self.GenerateObservations(num_reads) curve = [] for k, obs in enumerate(observations): seen.add(obs) num_new = len(seen) - m curve.append((k+1, num_new)) return curve
num_reads
is the number of additional reads to simulate. ...
No credit card required