I introduced the hidden species problem in the form of four related
questions. We have answered the first two by computing the posterior
n and the prevalence
of each species.
The other two questions are:
If we are planning to collect additional reads, can we predict how many new species we are likely to discover?
How many additional reads are needed to increase the fraction of observed species to a given threshold?
To answer predictive questions like this we can use the posterior distributions to simulate possible future events and compute predictive distributions for the number of species, and fraction of the total, we are likely to see.
The kernel of these simulations looks like this:
n from its posterior
Choose a prevalence for each species, including possible unseen species, using the Dirichlet distribution.
Generate a random sequence of future observations.
Compute the number of new species,
num_new, as a function of the number of
Repeat the previous steps and accumulate the joint distribution
And here’s the code.
RunSimulation runs a single simulation:
# class Subject def RunSimulation(self, num_reads): m, seen = self.GetSeenSpecies() n, observations = self.GenerateObservations(num_reads) curve =  for k, obs in enumerate(observations): seen.add(obs) num_new = len(seen) - m curve.append((k+1, num_new)) return curve
num_reads is the number of additional reads to simulate. ...