| Type: | Package | 
| Title: | Predicting Species Accumulation Curves | 
| Version: | 4.0.0 | 
| Date: | 2018-06-27 | 
| Author: | Chao Deng, Timothy Daley and Andrew D. Smith | 
| Maintainer: | Chao Deng <chaodeng@usc.edu> | 
| Description: | Originally as an R version of Preseq <doi:10.1038/nmeth.2375>, the package has extended its functionality to predict the r-species accumulation curve (r-SAC), which is the number of species represented at least r times as a function of the sampling effort. When r = 1, the curve is known as the species accumulation curve, or the library complexity curve in high-throughput genomic sequencing. The package includes both parametric and nonparametric methods, as described by Deng C, et al. (2018) <doi:10.48550/arXiv.1607.02804>. | 
| License: | GPL-3 | 
| Imports: | polynom, graphics, stats | 
| NeedsCompilation: | no | 
| Packaged: | 2018-06-27 10:02:27 UTC; JAKE | 
| Repository: | CRAN | 
| Date/Publication: | 2018-06-27 10:52:56 UTC | 
Predicting r-species accumulation curves
Description
The functionality of this package is to predict r-species
accumulaiton curves. The method is based on a nonparametric empirical Bayes approach
with rational function approximation. The estimator is
excellent in accuracy for both large values of r and long-range
extrapolations, which are essential to large-scale applications. Some
examples are predicting the molecular complexity of sequencing
libraries, estimating the minimum sufficient sequencing depths for
whole-exome sequencing experiments and optimizing depths for single-cell
whole-genome sequencing experiments. 
Details
main functions:
preseqR.rSAC
preseqR.rSAC.bootstrap
preseqR.optimal.sequencing
preseqR.rSAC.sequencing.rmdup
preseqR.sample.cov
preseqR.sample.cov.bootstrap
Author(s)
Chao Deng, Timothy Daley, and Andrew D. Smith
Maintainer: Chao Deng <chaodeng@usc.edu>
References
Baker, G. A., & Graves-Morris, P. (1996). Pade approximants (Encyclopedia of Mathematics and its Applications vol 59).
Boneh, S., Boneh, A., & Caron, R. J. (1998). Estimating the prediction function and the number of unseen species in sampling with replacement. Journal of the American Statistical Association, 93(441), 372-379.
Chao, A., & Shen, T. J. (2004). Nonparametric prediction in species sampling. Journal of agricultural, biological, and environmental statistics, 9(3), 253-269.
Cohen Jr, A. C. (1960). Estimating the parameters of a modified Poisson distribution. Journal of the American Statistical Association, 55(289), 139-143.
Daley, T., & Smith, A. D. (2013). Predicting the molecular complexity of sequencing libraries. Nature methods, 10(4), 325-327.
Deng C, Daley T & Smith AD (2015). Applications of species accumulation curves in large-scale biological data analysis. Quantitative Biology, 3(3), 135-144. URL http://dx.doi.org/10.1007/s40484-015-0049-7.
Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804v3.
Efron, B., & Thisted, R. (1976). Estimating the number of unseen species: How many words did Shakespeare know?. Biometrika, 63(3), 435-447.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. The annals of Statistics, 1-26.
Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.
Fisher, R. A., Corbet, A. S., and Williams, C. B. ,1943, The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population, Journal of Animal Ecology, 12, 42-58.
Good, I. J., & Toulmin, G. H. (1956). The number of new species, and the increase in population coverage, when a sample is increased. Biometrika, 43(1-2), 45-63.
Heck Jr, K. L., van Belle, G., & Simberloff, D. (1975). Explicit calculation of the rarefaction diversity measurement and the determination of sufficient sample size. Ecology, 1459-1461.
Kalinin V (1965). Functionals related to the poisson distribution and statistical structure of a text. Articles on Mathematical Statistics and the Theory of Probability pp. 202-220.
Dickens' vocabulary
Description
Words frequencies of a collection of Charles Dickens from Project Gutenberg
Details
A two-column matrix.  
The first column is the frequency j = 1,2,\dots; and the second column
is N_j, the number of unique words appeared exactly j
times in a collection of Charles Dickens.
References
http://zipfr.r-forge.r-project.org/
Examples
##load library
library(preseqR)
##load data
data(Dickens)
Fisher's butterfly data
Description
Frequencies data of butterflies collected in the Malay peninsula was from Fisher, R. A., Corbet, A. S., & Williams, C. B. (1943).
Details
A two-column matrix.  
The first column is the frequency j = 1,2,\dots; and the second column
is n_j, the number of butterflies captured j
times in the sample.
References
Fisher, R. A., Corbet, A. S., and Williams, C. B. ,1943, The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population, Journal of Animal Ecology, 12, 42-58, Table 1,2.
Examples
##load library
library(preseqR)
##load data
data(FisherButterfly)
k-mer counts of a metagenomic data
Description
The k-mer counts are based on a metagenome sequencing data from
Human Microbiome Project with the accession number
SRR061157. Only forward reads are used to generate the k-mer counts.
Details
A two-column matrix.  
The first column is the frequency j = 1,2,\dots; and the second column
is N_j, the number of 31-mers observed exactly j times.
References
Human Microbiome Project (https://hmpdacc.org/).
Examples
##load library
library(preseqR)
##load data
data(SRR061157_k31)
Coverage histogram of a WES data
Description
The coverage histogram is based on a whole-exome sequencing (WES) data from Simons Foundation Autism Research Initiative with the accession number SRR1301329. One million reads are randomly sampled from the raw data to generate this coverage histogram.
Details
A two-column matrix.  
The first column is the frequency j = 1,2,\dots; and the second column
is N_j, the number of nucleotides in the genome covered exactly
j times.
References
Simons Foundation Autism Research Initiative (https://www.sfari.org/).
Examples
##load library
library(preseqR)
##load data
data(SRR1301329_1M_base)
Read counts of a WES data
Description
The read counts are based on a whole-exome sequencing (WES) data from Simons Foundation Autism Research Initiative with the accession number SRR1301329. One million reads are randomly sampled from the raw data to generate the read counts.
Details
A two-column matrix.  
The first column is the frequency j = 1,2,\dots; and the second column
is N_j, the number of reads observed exactly j times in the
data.
References
Simons Foundation Autism Research Initiative (https://www.sfari.org/).
Examples
##load library
library(preseqR)
##load data
data(SRR1301329_1M_read)
Coverage histogram of a WES data
Description
The coverage histogram is based on a whole-exome sequencing (WES) data from Simons Foundation Autism Research Initiative with the accession number SRR1301329. Only forward reads are used to generate the coverage histogram.
Details
A two-column matrix.  
The first column is the frequency j = 1,2,\dots; and the second column
is N_j, the number of nucleotides in the genome covered exactly
j times.
References
Simons Foundation Autism Research Initiative (https://www.sfari.org/).
Examples
##load library
library(preseqR)
##load data
data(SRR1301329_base)
Read counts of a WES data
Description
The read counts are based on a whole-exome sequencing data from Simons Foundation Autism Research Initiative with the accession number SRR1301329. Only forward reads are used to generate the read counts.
Details
A two-column matrix.  
The first column is the frequency j = 1,2,\dots; and the second column
is N_j, the number of reads observed exactly j times in the
data.
References
Simons Foundation Autism Research Initiative (https://www.sfari.org/).
Examples
##load library
library(preseqR)
##load data
data(SRR1301329_read)
Coverage histogram of a scWGS data
Description
The coverage histogram is based on a single-cell whole-genome sequencing data (scWGS) through MALBAK protocol. The accession number of the raw data is SRR1301329. Only forward reads are used to generate the coverage histogram.
Details
A two-column matrix.  
The first column is the frequency j = 1,2,\dots; and the second column
is N_j, the number of nucleotides in the genome covered exactly
j times.
References
Zong, C., Lu, S., Chapman, A. R., & Xie, X. S. (2012). Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science, 338(6114), 1622-1626.
Examples
##load library
library(preseqR)
##load data
data(SRR611492)
Coverage histogram of a scWGS data
Description
The coverage histogram is based on a single-cell whole-genome sequencing (scWGS) data through MALBAK protocol. The accession number of the raw data is SRR1301329. Five million reads are randomly sampled from the raw data to generate this coverage histogram.
Details
A two-column matrix.  
The first column is the frequency j = 1,2,\dots; and the second column
is N_j, the number of nucleotides in the genome covered exactly
j times.
References
Zong, C., Lu, S., Chapman, A. R., & Xie, X. S. (2012). Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science, 338(6114), 1622-1626.
Examples
##load library
library(preseqR)
##load data
data(SRR1301329_5M)
Shakespeare's word type frequencies
Description
The Shakespeare's word type frequencies data was from Efron, B., & Thisted, R. (1976).
Details
A two-column matrix.  
The first column is the frequency j = 1,2,\dots; and the second column
is n_j, the number of unique words appeared j
times in Shakespeare's work.
References
Efron, B., & Thisted, R. (1976). Estimating the number of unseen species: How many words did Shakespeare know?. Biometrika, 63(3), 435-447.
Examples
##load library
library(preseqR)
##load data
data(Shakespeare)
Social network
Description
Following relationships of Twitter's social network
Details
A two-column matrix.  
The first column is the frequency j = 1,2,\dots; and the second column
is n_j, the number of users with exactly j followers.
References
Zafarani R, Liu H (2009) Social computing data repository at ASU.
Examples
##load library
library(preseqR)
##load data
data(Twitter)
Fisher's butterfly data
Description
Frequencies data of butterflies collected in the Malay peninsula was from Fisher, R. A., Corbet, A. S., & Williams, C. B. (1943).
Details
A two-column matrix.  
The first column is the frequency j = 1,2,\dots; and the second column
is n_j, the number of butterflies captured exactly j
times in the sample.
References
Fisher, R. A., Corbet, A. S., and Williams, C. B. ,1943, The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population, Journal of Animal Ecology, 12, 42-58, Table 3.
Examples
##load library
library(preseqR)
##load data
data(WillButterfly)
BBC estimator
Description
bbc.rSAC predicts the expected number of species represented at least 
r times in a random sample, based on the initial sample. 
The estimator was originally proposed by Boneh et al. (1998) for estimating
the SAC. We generalize this estimator for predicting the r-SAC.
Usage
bbc.rSAC(n, r=1)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| r | A positive integer. Default is 1. | 
Value
The estimator for the r-SAC. The input of the estimator is a vector of
sampling efforts t, i.e., the relative sample sizes comparing with the initial
sample. For example, t = 2 means a random sample that is twice the size of
the initial sample.
Author(s)
Chao Deng
References
Boneh, S., Boneh, A., & Caron, R. J. (1998). Estimating the prediction function and the number of unseen species in sampling with replacement. Journal of the American Statistical Association, 93(441), 372-379.
Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804v3.
Examples
## load library
library(preseqR)
## import data
data(FisherButterfly)
## construct the estimator for SAC
bbc1 <- bbc.rSAC(FisherButterfly, r=1)
## The number of species represented at least once in a sample, 
## when the sample size is 10 or 20 times of the initial sample
bbc1(c(10, 20))
## construct the estimator for r-SAC
bbc2 <- bbc.rSAC(FisherButterfly, r=2)
## The number of species represented at least twice in a sample, 
## when the sample size is 50 or 100 times of the initial sample
bbc2(c(50, 100))
CS estimator
Description
cs.rSAC predicts the expected number of species represented at least 
r times in a random sample, based on the initial sample. 
The estimator was originally proposed by Chao and Shen (2004) for estimating
the SAC. We generalize this estimator for predicting the r-SAC.
Usage
cs.rSAC(n, r=1, k=10)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| r | A positive integer. Default is 1. | 
| k | A cutoff for common species. Default is 10. | 
Value
The estimator for the r-SAC. The input of the estimator is a vector of
sampling efforts t, i.e., the relative sample sizes comparing with the initial
sample. For example, t = 2 means a random sample that is twice the size of
the initial sample.
Author(s)
Chao Deng
References
Chao, A., & Shen, T. J. (2004). Nonparametric prediction in species sampling. Journal of agricultural, biological, and environmental statistics, 9(3), 253-269.
Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804v3.
Examples
## load library
library(preseqR)
## import data
data(FisherButterfly)
## construct the estimator for SAC
chao1 <- cs.rSAC(FisherButterfly, r=1)
## The number of species represented at least once in a sample, 
## when the sample size is 10 or 20 times of the initial sample
chao1(c(10, 20))
## construct the estimator for r-SAC
chao2 <- cs.rSAC(FisherButterfly, r=2)
## The number of species represented at least twice in a sample, 
## when the sample size is 50 or 100 times of the initial sample
chao2(c(50, 100))
RFA estimator
Description
ds.rSAC predicts the expected number of species represented at least 
r times in a random sample, based on the initial sample. 
Usage
ds.rSAC(n, r=1, mt=20)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| mt | An positive integer constraining possible rational function approximations. Default is 20. | 
| r | A positive integer. Default is 1. | 
Details
The estimator is based on an empirical Bayes approach using rational function approximation (RFA), as described in the paper in the references section.
ds.rSAC is the fast version of ds.rSAC.bootstrap.
The function does not provide the confidence interval. To obtain the
confidence interval along with the estimates, one should use the function
ds.rSAC.bootstrap.
Value
The estimator for the r-SAC. The input of the estimator is a vector of
sampling efforts t, i.e., the relative sample sizes comparing with the initial
sample. For example, t = 2 means a random sample that is twice the size of
the initial sample.
Author(s)
Chao Deng
References
Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804v3.
Examples
## load library
library(preseqR)
## import data
data(FisherButterfly)
## construct the estimator for SAC
ds1 <- ds.rSAC(FisherButterfly, r=1)
## The number of species represented at least once in a sample, 
## when the sample size is 10 or 20 times of the initial sample
ds1(c(10, 20))
## construct the estimator for r-SAC
ds2 <- ds.rSAC(FisherButterfly, r=2)
## The number of species represented at least twice in a sample, 
## when the sample size is 50 or 100 times of the initial sample
ds2(c(50, 100))
RFA estimator with bootstrap
Description
ds.rSAC.bootstrap predicts the expected number of species 
represented at least r times in a random sample, 
based on the initial sample. 
Usage
ds.rSAC.bootstrap(n, r=1, mt=20, times=30, conf=0.95)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| r | A positive integer. Default is 1. | 
| mt | An positive integer constraining possible rational function approximations. Default is 20. | 
| times | The number of bootstrap samples. Default is 30. | 
| conf | The confidence level. Default is 0.95 | 
Details
This is the bootstrap version of ds.rSAC. The bootstrap
sample is generated by randomly sampling the initial sample with replacement.
For each bootstrap sample, we construct an estimator. The median of
estimates is used as the prediction for the number of species
represented at least r times in a random sample. 
The confidence interval is constructed based on a lognormal distribution.
Value
| f | The estimator for the number of species represented at least  | 
| se | The standard error for the estimator. The input is a vector of sampling 
efforts  | 
| lb | The lower bound of the confidence interval.The input is a vector of sampling 
efforts  | 
| ub | The upper bound of the confidence interval.The input is a vector of sampling 
efforts  | 
Author(s)
Chao Deng
References
Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.
Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804v3.
Examples
## load library
# library(preseqR)
## import data
# data(FisherButterfly)
## construct the estimator for SAC
# ds1 <- ds.rSAC.bootstrap(FisherButterfly, r=1)
## The number of species represented at least once in a sample, 
## when the sample size is 10 or 20 times of the initial sample
# ds1$f(c(10, 20))
## The standard error of the estiamtes
# ds1$se(c(10, 20))
## The confidence interval of the estimates
# lb <- ds1$lb(c(10, 20))
# ub <- ds1$ub(c(10, 20))
# matrix(c(lb, ub), byrow=FALSE, ncol=2)
## construct the estimator for SAC
# ds2 <- ds.rSAC.bootstrap(FisherButterfly, r=2)
## The number of species represented at least twice in a sample, 
## when the sample size is 50 or 100 times of the initial sample
# ds2$f(c(50, 100))
## The standard error of the estiamtes
# ds2$se(c(50, 100))
## The confidence interval of the estimates
# lb <- ds2$lb(c(50, 100))
# ub <- ds2$ub(c(50, 100))
# matrix(c(lb, ub), byrow=FALSE, ncol=2)
Parameter alpha in the logseries estimator
Description
fisher.alpha estimates the parameter alpha in the logseries estimator by
Fisher, R. A., et al. (1943) based on an initial sample.
Usage
fisher.alpha(n)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
Value
A double, the estimated value of the parameter alpha
Author(s)
Chao Deng
References
Fisher, R., Corbet, A., & Williams, C. (1943). The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population. Journal of Animal Ecology, 12(1), 42-58. doi:10.2307/1411
Examples
## load library
library(preseqR)
## import data
data(WillButterfly)
## estimating alpha
fisher.alpha <- fisher.alpha(WillButterfly)
Logseries estimator
Description
fisher.rSAC estimates the expected number of species represented at least 
r times in a random sample, based on the initial sample.
The estimator was originally proposed by Fisher et al. (1943) for estimating
the SAC. We generalize this estimator for predicting the r-SAC.
Usage
fisher.rSAC(n, r=1)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| r | A positive integer. Default is 1. | 
Value
The estimator for the r-SAC. The input of the estimator is a vector of
sampling efforts t, i.e., the relative sample sizes comparing with the initial
sample. For example, t = 2 means a random sample that is twice the size of
the initial sample.
Author(s)
Chao Deng
References
Fisher, R., Corbet, A., & Williams, C. (1943). The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population. Journal of Animal Ecology, 12(1), 42-58. doi:10.2307/1411
Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804v3.
Examples
## load library
library(preseqR)
## import data
data(WillButterfly)
## construct the estimator for SAC
fisher1 <- fisher.rSAC(WillButterfly, r=1)
## The number of species represented at least once in a sample, 
## when the sample size is 10 or 20 times of the initial sample
fisher1(c(10, 20))
## construct the estimator for r-SAC
fisher2 <- fisher.rSAC(WillButterfly, r=2)
## The number of species represented at least twice in a sample, 
## when the sample size is 50 or 100 times of the initial sample
fisher2(c(50, 100))
Fraction of k-mers observed at least r times 
Description
kmer.frac.curve predicts the expected fraction of k-mers observed at
least r times in a high-throughput sequencing experiment given the 
amount of sequencing
Usage
  kmer.frac.curve(n, k, read.len, seq, r=2, mt=20)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| k | The number of nucleotides in a  | 
| read.len | The average length of a read. | 
| seq | The amount of nucleotides sequenced.. | 
| r | A positive integer. Default is 1. | 
| mt | An positive integer constraining possible rational function approximations. Default is 20. | 
Details
kmer.frac.curve is mainly designed for metagenomics to evaluate how 
saturated a metagenomic data is.
kmer.frac.curve is the fast version of kmer.frac.curve.bootstrap.
The function does not provide the confidence interval. To obtain the
confidence interval along with the estimates, one should use the function
kmer.frac.curve.bootstrap.
Value
A two-column matrix. The first column is the amount of sequencing in an 
experiment. The second column is the estimate of the fraction of k-mers observed at least
r times in the experiment.
Author(s)
Chao Deng
References
Deng, C and Smith, AD (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804
Examples
## load library
library(preseqR)
## import data
data(SRR061157_k31)
## the fraction of 31-mers represented at least 10 times in an experiment when
## sequencing 1M, 10M, 100M, 1G, 10G, 100G, 1T nucleotides
kmer.frac.curve(n=SRR061157_k31, k=31, read.len=100, seq=10^(6:12), r=10, mt=20) 
Fraction of k-mers observed at least r times with bootstrap
Description
kmer.frac.curve predicts the expected fraction of k-mers observed at
least r times in a high-throughput sequencing experiment given the 
amount of sequencing
Usage
kmer.frac.curve.bootstrap(n, k, read.len, seq, r=2, mt=20, times=30, conf=0.95)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| k | The number of nucleotides in a  | 
| read.len | The average length of a read. | 
| seq | The amount of nucleotides sequenced. | 
| r | A positive integer. Default is 1. | 
| mt | An positive integer constraining possible rational function approximations. Default is 20. | 
| times | The number of bootstrap samples. | 
| conf | The confidence level. Default is 0.95 | 
Details
This is the bootstrap version of kmer.frac.curve. The bootstrap
sample is generated by randomly sampling the initial sample with replacement.
For each bootstrap sample, we construct an estimator. The median of
estimates is used as the prediction for the number of species
represented at least r times in a random sample. 
The confidence interval is constructed based on a lognormal distribution.
Value
A four-column matrix. The first column is the amount of sequencing in an 
experiment.
The second column is the estimate of the fraction of k-mers observed at least
r times in the experiment. The third and fourth columns are the lower
bounds and the upper bounds of the confidence intervals.
Author(s)
Chao Deng
References
Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.
Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804v3.
Examples
## load library
# library(preseqR)
## import data
# data(SRR061157_k31)
## the fraction of 31-mers represented at least 10 times in an experiment when
## sequencing 1M, 10M, 100M, 1G, 10G, 100G, 1T nucleotides
# kmer.frac.curve.bootstrap(n=SRR061157_k31, k=31, read.len=100, 
#                          seq=10^(6:12), r=10, mt=20) 
Interpolation
Description
Interpolating the number of species represented at least r times
in a subsample given an initial sample
Usage
preseqR.interpolate.rSAC(n, ss, r=1)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| ss | A positive double equal to the step size between subsamples. | 
| r | A positive integer. Default is 1 | 
Details
The expected number of species represented at least r
times in the subsample is estimated based on an expended version of the
formula by Heck Jr, KL. et al. (1975).
Value
A two-column matrix for the number of species represented at least r
times in a random sample. The first column is the size of the random sample; 
the second column is the expected number of species represented at least r
times in the sample.
NULL if failed.
Author(s)
Chao Deng
References
Heck Jr, K. L., van Belle, G., & Simberloff, D. (1975). Explicit calculation of the rarefaction diversity measurement and the determination of sufficient sample size. Ecology, 1459-1461.
Examples
## load library
library(preseqR)
## import data
data(Shakespeare)
## The expected number of distinct words represented twice or more in the
## subsample 
preseqR.interpolate.rSAC(n=Shakespeare, ss=1e5, r=2)
Sampling
Description
Generating a histogram by subsampling without replacement.
Usage
  preseqR.nonreplace.sampling(n, size)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| size | An positive integer representing the size of the subsample. | 
Details
preseqR.nonreplace.sampling generates a subsample by sampling the
initial sample without replacement.
sample in R is used to implement the function. We wrap up
this function in such a way that both the input and the output are histograms. 
Value
A two-column matrix as a subsample.
The first column is the frequency j = 1,2,\dots; and the second column
is N_j, the number of species represented j times in the subsample. 
Author(s)
Chao Deng
References
https://stat.ethz.ch/R-manual/R-patched/library/base/html/sample.html
Examples
## load library
library(preseqR)
## import data
data(FisherButterfly)
## generate a subsample of size 1000.
preseqR.nonreplace.sampling(n=FisherButterfly, size=1000)
Optimal amount of sequencing for scWGS
Description
preseqR.optimal.sequencing predicts the optimal amount of sequencing in
a single-cell whole-genome sequencing (scWGS) experiment based on a shallow sequencing experiment.
Usage
preseqR.optimal.sequencing(n, efficiency=0.05, bin=1e8, r=1, mt=20, 
                           times=30, conf=0.95)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| efficiency | The minimum benefit-cost ratio | 
| bin | One unit of sequencing effort. Default is 1e8. | 
| r | A positive integer. Default is 1. | 
| mt | An positive integer constraining possible rational function approximations. Default is 20. | 
| times | The number of bootstrap samples. | 
| conf | The confidence level. Default is 0.95 | 
Details
preseqR.optimal.sequencing predicts the optimal amount of sequencing
in a scWGS experiment. The term optimal is interpreted as the maximum 
amount of sequencing with its benefit-cost ratio greater than a given threshold.
The benefit-cost ratio is defined as the probability of a new nucleotide in the
genome represented at least r times when one more base is sequenced. 
In order to improve the numeric stability, we use the mean of new nucleotdies
with coverage at least r in one unit of sequencing effort to approximate the
ratio. The amount of sequences in one unit of sequencing effort is defined by
the variable bin.
Note that the benefit-cost ratio is not monotonic. The ratio first increases and then decrease as the amount of sequencing increase. To predicte the optimal amount of sequencing, we consider only the areas after the peak, where the ratio starts to decrease.
Value
A vector of three dimensions. The first coordinate is the optimal amount of sequencing. The second and the third coordinates are the lower and upper bound of the confidence interval.
Author(s)
Chao Deng
References
Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804v3.
Examples
## load library
#library(preseqR)
## import data
# data(SRR611492_5M)
## the optimal amount of sequencing with the benefit-cost ratio greater than 
## 0.05 for r = 4
# preseqR.optimal.sequencing(n=SRR611492_5M, efficiency=0.05, bin=1e8, r=4)
## the optimal amount of sequencing with the benefit-cost ratio greater than 
## 0.05 for r = 10
# preseqR.optimal.sequencing(n=SRR611492_5M, efficiency=0.05, bin=1e8, r=10)
Best practice for r-SAC – a fast version
Description
preseqR.rSAC predicts the expected number of species represented at least 
r times in a random sample based on the initial sample.
Usage
preseqR.rSAC(n, r=1, mt=20, size=SIZE.INIT, mu=MU.INIT)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| mt | A positive integer constraining possible rational function approximations. Default is 20. | 
| r | A positive integer. Default is 1. | 
| size | A positive double, the initial value of the parameter  | 
| mu | A positive double, the initial value of the parameter  | 
Details
preseqR.rSAC combines the nonparametric approach using the rational
function approximation and the parametric approach using the
zero-truncated negative binomial (ZTNB). For a given initial sample, if the sample
is from a heterogeneous population, the function calls
ds.rSAC; otherwise it calls ztnb.rSAC. The degree
of heterogeneity is measured by the coefficient of variation, which is
estimated by the ZTNB approach.
preseqR.rSAC is the fast version of preseqR.rSAC.bootstrap.
The function does not provide the confidence interval. To obtain the
confidence interval along with the estimates, one should use the function
preseqR.rSAC.bootstrap.
Value
The estimator for the r-SAC. The input of the estimator is a vector of
sampling efforts t, i.e., the relative sample sizes comparing with the initial
sample. For example, t = 2 means a random sample that is twice the size of
the initial sample.
Author(s)
Chao Deng
References
Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804v3.
Examples
## load library
library(preseqR)
## import data
data(FisherButterfly)
## construct the estimator for SAC
estimator1 <- preseqR.rSAC(FisherButterfly, r=1)
## The number of species represented at least once in a sample, 
## when the sample size is 10 or 20 times of the initial sample
estimator1(c(10, 20))
## construct the estimator for r-SAC
estimator2 <- preseqR.rSAC(FisherButterfly, r=2)
## The number of species represented at least twice in a sample, 
## when the sample size is 50 or 100 times of the initial sample
estimator2(c(50, 100))
Best practice for r-SAC
Description
preseqR.rSAC.bootstrap predicts the expected number of species 
represented at least r times in a random sample based on the initial sample.
Usage
preseqR.rSAC.bootstrap(n, r=1, mt=20, size=SIZE.INIT, mu=MU.INIT, times=30,
                       conf=0.95)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| r | A positive integer. Default is 1. | 
| mt | An positive integer constraining possible rational function approximations. Default is 20. | 
| times | The number of bootstrap samples. | 
| size | A positive double, the initial value of the parameter  | 
| mu | A positive double, the initial value of the parameter  | 
| conf | The confidence level. Default is 0.95 | 
Details
This is the bootstrap version of preseqR.rSAC. The bootstrap
sample is generated by randomly sampling the initial sample with replacement.
For each bootstrap sample, we construct an estimator. The median of
estimates is used as the prediction for the number of species
represented at least r times in a random sample. 
The confidence interval is constructed based on a lognormal distribution.
Value
| f | The estimator for the  | 
| se | The standard error for the estimator. The input is a vector of sampling efforts t. | 
| lb | The lower bound of the confidence interval.The input is a vector of sampling efforts t. | 
| ub | The upper bound of the confidence interval.The input is a vector of sampling efforts t. | 
Author(s)
Chao Deng
References
Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.
Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804v3.
Examples
## load library
# library(preseqR)
## import data
# data(FisherButterfly)
## construct estimator for SAC
# estimator1 <- preseqR.rSAC.bootstrap(FisherButterfly, r=1)
## The number of species represented at least once in a sample, 
## when the sample size is 10 or 20 times of the initial sample
# estimator1$f(c(10, 20))
## The standard error of the estiamtes
# estimator1$se(c(10, 20))
## The confidence interval of the estimates
# lb <- estimator1$lb(c(10, 20))
# ub <- estimator1$ub(c(10, 20))
# matrix(c(lb, ub), byrow=FALSE, ncol=2)
## construct estimator for r-SAC
# estimator2 <- preseqR.rSAC.bootstrap(FisherButterfly, r=2)
## The number of species represented at least twice in a sample, 
## when the sample size is 50 or 100 times of the initial sample
# estimator2$f(c(50, 100))
## The standard error of the estiamtes
# estimator2$se(c(50, 100))
## The confidence interval of the estimates
# lb <- estimator2$lb(c(50, 100))
# ub <- estimator2$ub(c(50, 100))
# matrix(c(lb, ub), byrow=FALSE, ncol=2)
Predicting r-SAC in WES/WGS
Description
preseqR.rSAC.sequencing.rmdup predicts the expected number of 
nucleotides in the genome sequenced at least r times in a sequencing
experiment, based on a shallow sequencing experiment.
Usage
preseqR.rSAC.sequencing.rmdup(n_base, n_read, r=1, mt=20, times=30, conf=0.95)
Arguments
| n_base | A two-column matrix.  
The first column is the frequency  | 
| n_read | A two-column matrix.  
The first column is the frequency  | 
| r | A positive integer. Default is 1. | 
| mt | An positive integer constraining possible rational function approximations. Default is 20. | 
| times | The number of bootstrap samples. Default is 30. | 
| conf | The confidence level. Default is 0.95 | 
Details
preseqR.rSAC.sequencing.rmdup is designed for sequencing experiments, 
where duplicate reads are removed. The procedure is commonly used in 
whole-exome sequencing experiments and sometimes appeared in WGS as well.
To use the function, one must have two histograms. The first histogram
is the coverage histogram, which is based on distinct reads.
The second histogram is the counts of reads with exactly j duplicates.
Value
| f | The estimator for the expected number of nucleotides in the genome
sequenced at least  | 
| se | The standard error for the estimator. The input is a vector of sequencing efforts t. | 
| lb | The lower bound of the confidence interval.The input is a vector of sequencing efforts t. | 
| ub | The upper bound of the confidence interval.The input is a vector of sequencing efforts t. | 
Author(s)
Chao Deng
References
Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804v3.
Examples
## load library
#library(preseqR)
## import data
# data(SRR1301329_1M_base)
# data(SRR1301329_1M_read)
# construct the estimator
# estimator1 <- preseqR.rSAC.sequencing.rmdup(
#                 n_base=SRR1301329_1M_base, n_read=SRR5365359_5M_read, 
#                 r=4, mt=20, times=100, conf=0.95)
## The number of nucleotides in the genome covered at least 4 times, when the
## amount of sequencing is 10 or 20 times of the intial experiment
## 10 or 20 times of the initial sample
# estimator1$f(c(10, 20))
## The standard error of the estiamtes
# estimator1$se(c(10, 20))
## The confidence interval of the estimates
# lb <- estimator1$lb(c(10, 20))
# ub <- estimator1$ub(c(10, 20))
# matrix(c(lb, ub), byrow=FALSE, ncol=2)
# construct the estimator
# estimator2 <- preseqR.rSAC.sequencing.rmdup(
#                 n_base=SRR1301329_1M_base, n_read=SRR5365359_5M_read, 
#                 r=10, mt=20, times=100, conf=0.95)
## The number of nucleotides in the genome covered at least 10 times, when the
## amount of sequencing is 10 or 20 times of the intial experiment
## 10 or 20 times of the initial sample
# estimator2$f(c(10, 20))
## The standard error of the estiamtes
# estimator2$se(c(10, 20))
## The confidence interval of the estimates
# lb <- estimator2$lb(c(10, 20))
# ub <- estimator2$ub(c(10, 20))
# matrix(c(lb, ub), byrow=FALSE, ncol=2)
Predicting generalized sample coverage
Description
preseqR.sample.cov predicts the probability of observing a species
represented at least r times in a random sample.
Usage
  preseqR.sample.cov(n, r=1, mt=20)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| r | A positive integer. Default is 1. | 
| mt | A positive integer constraining possible rational function approximations. Default is 20. | 
Details
Suppose a sample is given and one more individual is randomly drawn from the
population. preseqR.sample.cov estimates the probability of the
species, which represents the individual, has been observed at least
r times in the
sample. When r = 1, the probability is called the sample coverage.
Let N_j be the number of species represented exactly j times in 
a sample. The probability of observing a species represented at
least r times in the sample is estimated as 
\sum_{j=r+1}^\infty jN_j / \sum_{j=1}^\infty jN_j. The theory is
described by Mao and Lindsay (2002). For a random sample
where N_j is unknown, a modified rational function approximation is
first used to predict the value of N_j. Then the estimates are
substituted to obtain an estimator for the probability of observing a species
represented at least r times in the sample.
This function is the fast version of preseqR.sample.cov.bootstrap.
The function does not provide the confidence interval. To obtain the
confidence interval along with the estimates, one should use the function
preseqR.sample.cov.bootstrap.
Value
The estimator for the probability of observing a species represented at least
r times in a random sample.
The input of the estimator is a vector of sampling efforts t, i.e.,
the relative sample sizes comparing with the initial sample. 
For example, t = 2 means a random sample that is twice the size of
the initial sample.
Author(s)
Chao Deng
References
Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4), 237-264.
Mao, C. X. and Lindsay, B. G. (2002). A Poisson model for the coverage problem with a genomic application. Biometrika, 89(3), 669-682.
Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804v3.
Examples
## load library
library(preseqR)
## import data
data(FisherButterfly)
## construct the estimator for the sample coverage
estimator1 <- preseqR.sample.cov(FisherButterfly, r=1) 
## Given a sample that is 10 times or 20 times the size of an initial samples,
## suppose one randomly draws one more individual from the population. The
## value of the function is the probability that the representing species 
## has been observed in the sample
estimator1(c(10, 20))
## construct the estimator
estimator2 <- preseqR.sample.cov(FisherButterfly, r=2)
## the probability a species represented at least twice when the sample size
## is 50 times or 100 times of the initial sample
estimator2(c(50, 100))
Predicting generalized sample coverage with bootstrap
Description
preseqR.sample.cov.bootstrap predicts the probability of observing a species
represented at least r times in a random sample.
Usage
  preseqR.sample.cov.bootstrap(n, r=1, mt=20, times=30, conf=0.95)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| r | A positive integer. Default is 1. | 
| mt | A positive integer constraining possible rational function approximations. Default is 20. | 
| times | The number of bootstrap samples. Default is 30. | 
| conf | The confidence level. Default is 0.95 | 
Details
This is the bootstrap version of preseqR.sample.cov. The bootstrap
sample is generated by randomly sampling the initial sample with replacement.
For each bootstrap sample, we construct an estimator. The median of
estimates is used as the prediction for the number of species
represented at least r times in a random sample. 
The confidence interval is constructed based on a lognormal distribution.
Value
| f | The estimator for the probability of observing a species represented at least
 | 
| se | The standard error for the estimator. The input is a vector of sampling efforts t. | 
| lb | The lower bound of the confidence interval.The input is a vector of sampling efforts t. | 
| ub | The upper bound of the confidence interval.The input is a vector of sampling efforts t. | 
Author(s)
Chao Deng
References
Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.
Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804v3.
Examples
## load library
#library(preseqR)
## import data
#data(FisherButterfly)
## construct the estimator for the sample coverage
# estimator1 <- preseqR.sample.cov.bootstrap(FisherButterfly, r=1)
## Given a sample that is 10 times or 20 times the size of an initial samples,
## suppose one randomly draws one more individual from the population. The
## value of the function is the probability that the representing species 
## has been observed in the sample
# estimator1$f(c(10, 20))
## The standard error of the estiamtes
# estimator1$se(c(10, 20))
## The confidence interval of the estimates
# lb <- estimator1$lb(c(10, 20))
# ub <- estimator1$ub(c(10, 20))
# matrix(c(lb, ub), byrow=FALSE, ncol=2)
## construct the estimator
# estimator2 <- preseqR.rSAC.bootstrap(FisherButterfly, r=2)
## the probability when the sample size is 50 times or 100 times of the initial
## sample
# estimator2$f(c(50, 100))
## The standard error of the estiamtes
# estimator2$se(c(50, 100))
## The confidence interval of the estimates
# lb <- estimator2$lb(c(50, 100))
# ub <- estimator2$ub(c(50, 100))
# matrix(c(lb, ub), byrow=FALSE, ncol=2)
Simulation
Description
Generating a histogram based on a Poisson mixture model.
Usage
  preseqR.simu.hist(L=1e8, N, FUN)
Arguments
| L | A positive integer, the number of species in a population. | 
| N | A positive interger, the simulated sample size. | 
| FUN | An RNG generating non negative real number. | 
Details
preseqR.simu.hist uses a mixture of Poisson distributions to generate
a sample, which size is defined by the variable N. 
The statistical assumption is that for each species the number of individuals
captured in a sample follows a Poisson process. 
The Poisson rates among species are generated by a given
function FUN per unit of sampling effort.
FUN must take an argument indicating the number of random
numbers generated and return a vector of generated numbers. 
Value
A two-column matrix.  
The first column is the frequency j = 1,2,\dots; and the second column
is N_j, the number of species with each species represented exactly j
times in the initial sample. The first column must be sorted in an
ascending order.
Author(s)
Chao Deng
Examples
## load library
library(preseqR)
## construct a RNG
f <- function(n) {
  rgamma(n, shape=0.5, scale=1)
}
## sample 10,000 individuals
preseqR.simu.hist(L=1e5, N=10000, f)
Fitting a zero-truncated negative binomial distribution
Description
preseqR.ztnb.em fits a zero-truncated negative binomial (ZTNB)
distribution to the initial sample.
Since the species with zero observations are missed in the sample, an
EM algorithm is used to estimate the parameters assuming the number of
individuals for each species follows a Negative Binomial distribution
with the zero counts as a missing latent data.
Usage
preseqR.ztnb.em(n, size = SIZE.INIT, mu = MU.INIT)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| size | A positive double setting the initial value of the parameter  | 
| mu | A positive double setting the initial value of the parameter  | 
Details
See the supplement of Daley and Smith (2013).
Value
| size | The estimate of the parameter  | 
| mu | The estimate of the parameter  | 
| loglik | Log-likelihood under estimated ZTNB. | 
Author(s)
Chao Deng
Examples
## load library
library(preseqR)
## import data
data(FisherButterfly)
## print the parameters of a fitting negative binomial distribution
preseqR.ztnb.em(FisherButterfly)
ZTNB estimator
Description
ztnb.rSAC predicts the expected number of species represented at least 
r times in a random sample, based on the initial sample. 
Usage
ztnb.rSAC(n, r=1, size=SIZE.INIT, mu=MU.INIT)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| r | A positive integer. Default is 1. | 
| size | A positive double, the initial value of the parameter  | 
| mu | A positive double, the initial value of the parameter  | 
Details
The statistical assumption is that for each species the number of individuals
in a sample follows a Poisson distribution. The Poisson rate lambda
are numbers generated from a gamma distribution. So the random variable
X, which is the number of species represented x (x > 0) times in the
sample, follows a zero-truncated negative binomial distribution. The
unknown parameters are estimated by the function preseqR.ztnb.em based
on the initial sample. Using the estimated distribution, we calculate the 
expected number of species represented at least r times in a random sample.
Details of the estimation procedure can be found in the supplement of 
Daley T. and Smith AD. (2013).
Value
The estimator for the r-SAC. The input of the estimator is a vector of
sampling efforts t, i.e., the relative sample sizes comparing with the initial
sample. For example, t = 2 means a random sample that is twice the size of
the initial sample.
Author(s)
Chao Deng
References
Daley, T., & Smith, A. D. (2013). Predicting the molecular complexity of sequencing libraries. Nature methods, 10(4), 325-327.
Deng C, Daley T & Smith AD (2015). Applications of species accumulation curves in large-scale biological data analysis. Quantitative Biology, 3(3), 135-144.
See Also
Examples
## load library
library(preseqR)
## import data
data(FisherButterfly)
## construct the estimator for SAC
ztnb1 <- ztnb.rSAC(FisherButterfly, r=1)
## The number of species represented at least once in a sample, 
## when the sample size is 10 or 20 times of the initial sample
ztnb1(c(10, 20))
## construct the estimator for r-SAC
ztnb2 <- ztnb.rSAC(FisherButterfly, r=2)
## The number of species represented at least twice in a sample, 
## when the sample size is 50 or 100 times of the initial sample
ztnb2(c(50, 100))
ZTP estimator
Description
ztp.rSAC predicts the expected number of species represented at least 
r times in a random sample, based on the initial sample. 
Usage
ztp.rSAC(n, r=1)
Arguments
| n | A two-column matrix.  
The first column is the frequency  | 
| r | A positive integer. Default is 1. | 
Details
The statistical assumption is that for each species the number of individuals
in a sample follows a Poisson distribution. The Poisson rate lambda
is the same among all species. So the random variable X, which is 
the number of species represented x (x > 0) times, follows a zero-truncated 
Poisson distribution. The unknown parameters are estimated by 
Cohen (1960). Based on the estimated distribution,
we calculate the expected number of species in a random sample. 
Value
The estimator for the r-SAC. The input of the estimator is a vector of
sampling efforts t, i.e., the relative sample sizes comparing with the initial
sample. For example, t = 2 means a random sample that is twice the size of
the initial sample.
Author(s)
Chao Deng
References
Cohen, A. Clifford. "Estimating the parameter in a conditional Poisson distribution." Biometrics 16, no. 2 (1960): 203-211.
Examples
## load library
library(preseqR)
## import data
data(FisherButterfly)
## construct the estimator for SAC
ztp1 <- ztp.rSAC(FisherButterfly, r=1)
## The number of species represented at least once in a sample, 
## when the sample size is 10 or 20 times of the initial sample
ztp1(c(10, 20))
## construct the estimator for r-SAC
ztp2 <- ztp.rSAC(FisherButterfly, r=2)
## The number of species represented at least once in a sample, 
## when the sample size is 10 or 20 times of the initial sample
ztp2(c(50, 100))