pydistinct APIs
Usage
# import all the estimators first
from pydistinct.stats_estimators import *
bootstrap_estimator([1,2,3,3])
# returns 3.9923...
APIs
-
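pydistinct.stats_estimators.bootstrap_estimator(sequence)
Implementation of a bootstrap estimator of D, the number of distinct values (Smith and van Belle, 1984; Haas et al., 1995).
$\hat{D}_{Boot} = d + \sum_{j:\, n_j > 0} (1 - n_j/n)^n$, where d is the number of distinct values in the sample, n is the sample size, and n_j is the number of times value j occurs in the sample.
Parameters: sequence (array of ints) – sample sequence of integers
Returns: estimated distinct count
Return type: float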
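As a minimal sketch of the bootstrap formula above, the estimate can be computed directly from the value counts. This is not the library's source, and the library's result may differ if it applies further adjustments; the name `bootstrap_sketch` is hypothetical.

```python
from collections import Counter


def bootstrap_sketch(sequence):
    """Hypothetical helper: d + sum over observed values of (1 - n_j/n)^n."""
    n = len(sequence)
    if n == 0:
        return 0.0
    counts = Counter(sequence)      # n_j for each observed value j
    d = len(counts)                 # distinct values seen in the sample
    # Each term approximates the chance that value j is missed by a resample.
    correction = sum((1 - n_j / n) ** n for n_j in counts.values())
    return d + correction


estimate = bootstrap_sketch([1, 1, 2, 3, 3, 3])
```

The correction term is always non-negative, so this sketch never returns less than the observed distinct count d.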
-
pydistinct.stats_estimators.chao_estimator(sequence)
Implementation of Chao’s estimator (Chao, 1984), which uses the counts of values that appear exactly once (f_1) and exactly twice (f_2) in the sample.
$\hat{D}_{Chao} = d + f_1^2 / (2 f_2)$
Returns the birthday problem solution if no value in the sample has frequency 2 (for example, when each distinct value observed is never seen again), since the formula is undefined when f_2 = 0.
The estimate can also be far larger than d (on the order of 10x) when almost every observed value is unique. Whether that is desirable depends on the data.
Parameters: sequence (array of ints) – sample sequence of integers
Returns: estimated distinct count
Return type: float
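The Chao formula above can be sketched as follows. The name `chao_sketch` is hypothetical, and the f_2 = 0 branch is an assumption: it uses the bias-corrected Chao1 form rather than the birthday problem fallback the library describes.

```python
from collections import Counter


def chao_sketch(sequence):
    """Hypothetical helper for d + f_1^2 / (2 * f_2)."""
    counts = Counter(sequence)                  # occurrences of each value
    freq_of_freq = Counter(counts.values())     # f_i: values seen exactly i times
    d = len(counts)
    f1, f2 = freq_of_freq[1], freq_of_freq[2]
    if f2 == 0:
        # Assumption: bias-corrected Chao1 form, NOT the library's
        # birthday-problem fallback; it stays finite when f_2 = 0.
        return d + f1 * (f1 - 1) / 2
    return d + f1 ** 2 / (2 * f2)


with_doubletons = chao_sketch([1, 2, 3, 3, 4, 4])   # f_1 = 2, f_2 = 2
no_doubletons = chao_sketch([1, 2, 3])              # f_1 = 3, f_2 = 0
```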
-
pydistinct.stats_estimators.chao_lee_estimator(sequence)
Implementation of Chao and Lee’s estimator (Chao and Lee, 1992), which uses a natural estimator of sample coverage.
$\hat{\gamma}^2$ is an estimator of the squared coefficient of variation of the frequencies.
Parameters: sequence (array of ints) – sample sequence of integers
Returns: estimated distinct count
Return type: float
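A rough sketch of a coverage-based estimate is shown below. It assumes the coverage estimate C = 1 - f_1/n and the form d/C + n(1 - C)/C * gamma^2 from the Chao and Lee paper; the library's exact choice of gamma hat is not documented here, so the gamma^2 expression and the name `chao_lee_sketch` are assumptions.

```python
from collections import Counter


def chao_lee_sketch(sequence):
    """Hypothetical helper: coverage-based estimate d/C + n(1 - C)/C * gamma^2."""
    n = len(sequence)
    if n == 0:
        return 0.0
    counts = Counter(sequence)
    freq_of_freq = Counter(counts.values())     # f_i: values seen exactly i times
    d = len(counts)
    coverage = 1 - freq_of_freq[1] / n          # natural coverage estimate C
    if coverage == 0:
        raise ValueError("every value is a singleton; coverage estimate is 0")
    d_cov = d / coverage                        # estimate under equal frequencies
    # Assumed form of gamma^2 (squared coefficient of variation), floored at 0.
    spread = sum(i * (i - 1) * f for i, f in freq_of_freq.items())
    gamma_sq = max(d_cov * spread / (n * (n - 1)) - 1, 0.0)
    return d_cov + n * (1 - coverage) / coverage * gamma_sq


uniform_like = chao_lee_sketch([1, 1, 2, 2, 3, 4])
skewed = chao_lee_sketch([1] * 8 + [2, 3])
```

When the frequencies look uniform, gamma^2 is floored at 0 and only the d/C term remains; skew in the frequencies raises the estimate.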
-
pydistinct.stats_estimators.goodmans_estimator(sequence)
Implementation of Goodman’s estimator (Goodman, 1949). Throws an error if N is too high, because of the numerical complexity of the computation.
Parameters: sequence (array of ints) – sample sequence of integers
Returns: estimated distinct count
Return type: float
-
pydistinct.stats_estimators.horvitz_thompson_estimator(sequence, pop_estimator=<function <lambda>>, n_pop=None)
Implementation of the Horvitz-Thompson estimator of D (Särndal, Swensson, and Wretman, 1992; Haas et al., 1995).
n_j is the number of times attribute value j occurs in the sample.
Parameters: - sequence (array of ints) – sample sequence of integers
- pop_estimator (function that takes in the length of sequence (int) and outputs the estimated population size (int)) – function to estimate population size if possible
- n_pop (int) – estimate of population size if available, will be used over pop_estimator function
Returns: estimated distinct count
Return type: float
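The sketch below illustrates how the population-size parameters described above could interact, with `n_pop` taking precedence over `pop_estimator`, and how a Horvitz-Thompson style sum could use them. All function names are hypothetical, and the weighting (each observed value counts 1 / P(value appears in the sample), with its population frequency scaled up as N * n_j / n) is an assumption rather than the library's code.

```python
from collections import Counter
from math import exp, lgamma


def resolve_population_size(n, pop_estimator=None, n_pop=None):
    """Hypothetical helper: n_pop, when given, takes precedence over pop_estimator."""
    if n_pop is not None:
        return n_pop
    if pop_estimator is not None:
        return pop_estimator(n)
    raise ValueError("need either n_pop or pop_estimator")


def miss_probability(big_n, n, x):
    """P(a value with x of big_n rows is absent from a sample of n rows)."""
    if big_n - n - x + 1 <= 0:
        return 0.0      # too few other rows: the value must be sampled
    log_p = (lgamma(big_n - x + 1) + lgamma(big_n - n + 1)
             - lgamma(big_n - n - x + 1) - lgamma(big_n + 1))
    return exp(log_p)


def horvitz_thompson_sketch(sequence, pop_estimator=None, n_pop=None):
    n = len(sequence)
    big_n = resolve_population_size(n, pop_estimator, n_pop)
    total = 0.0
    for n_j in Counter(sequence).values():
        x = big_n * n_j / n     # scaled-up population frequency of value j
        total += 1 / (1 - miss_probability(big_n, n, x))
    return total


full_scan = horvitz_thompson_sketch([1, 1, 2, 3], n_pop=4)
partial = horvitz_thompson_sketch([1, 1, 2, 3], pop_estimator=lambda n: 10 * n)
```

When the sample is the whole population, every miss probability is 0 and the sketch returns the observed distinct count.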
-
pydistinct.stats_estimators.hybrid_estimator(sequence, pop_estimator=<function <lambda>>, n_pop=None)
Hybrid estimator that uses Shlosser’s estimator when the data is skewed and the smoothed jackknife estimator when it is not. Skew is detected with an approximate chi-square test for uniformity.
Parameters: - sequence (array of ints) – sample sequence of integers
- pop_estimator (function that takes in the length of sequence (int) and outputs the estimated population size (int)) – function to estimate population size if possible
- n_pop (int) – estimate of population size if available, will be used over pop_estimator function
Returns: estimated distinct count
Return type: float
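One way to picture the skew test is the sketch below. It assumes the statistic compares each observed count with the mean count n/d and that the caller supplies the chi-square critical value for d - 1 degrees of freedom; both function names are hypothetical and the library's actual test may differ.

```python
from collections import Counter


def uniformity_statistic(sequence):
    """Chi-square style statistic comparing observed counts with the mean count n/d."""
    counts = Counter(sequence)
    n, d = len(sequence), len(counts)
    if d == 0:
        return 0.0
    expected = n / d        # count per value if frequencies were uniform
    return sum((n_j - expected) ** 2 / expected for n_j in counts.values())


def choose_estimator(sequence, critical_value):
    """Hypothetical dispatch: name the estimator the hybrid rule would pick."""
    if uniformity_statistic(sequence) > critical_value:
        return "shlossers_estimator"            # skewed data
    return "smoothed_jackknife_estimator"       # roughly uniform data


# 5.991 is the 95% chi-square quantile for 2 degrees of freedom (3 distinct values).
skewed_choice = choose_estimator([1] * 10 + [2, 3], 5.991)
uniform_choice = choose_estimator([1, 1, 2, 2, 3, 3], 5.991)
```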
-
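pydistinct.stats_estimators.jackknife_estimator(sequence)
Jackknife scheme for estimating D (Ozsoyoglu et al., 1991). Works well in regimes where the sample size is close to the actual number of points.
$\hat{D}_{CJ} = d_n - (n - 1)(d_{(n-1)} - d_n)$, where d_n is the number of distinct values in the full sample and d_{(n-1)} is the average number of distinct values over the n samples obtained by leaving one item out.
Parameters: sequence (array of ints) – sample sequence of integers
Returns: estimated distinct count
Return type: float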
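The jackknife formula above can be sketched by computing the leave-one-out distinct counts explicitly. The name `jackknife_sketch` is hypothetical and this is an illustration of the formula, not the library's code.

```python
from collections import Counter


def jackknife_sketch(sequence):
    """Hypothetical helper: d_n - (n - 1) * (d_(n-1) - d_n)."""
    n = len(sequence)
    if n == 0:
        return 0.0
    counts = Counter(sequence)
    d_n = len(counts)
    # Dropping one sampled item loses a distinct value only if that value is a singleton.
    leave_one_out = [d_n - 1 if counts[value] == 1 else d_n for value in sequence]
    d_n_minus_1 = sum(leave_one_out) / n
    return d_n - (n - 1) * (d_n_minus_1 - d_n)


estimate = jackknife_sketch([1, 2, 3, 3])
```

Because only singletons change the leave-one-out count, the result equals d_n + (n - 1) * f_1 / n, where f_1 is the number of values seen exactly once.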
-
pydistinct.stats_estimators.method_of_moments_estimator(sequence)
Simple method-of-moments estimator of D (Haas et al., 1995). The solver could be optimised further (training rate, stopping value).
Solve for d_moments in $d = d_{moments}\,(1 - e^{-n/d_{moments}})$.
Parameters: sequence (array of ints) – sample sequence of integers
Returns: estimated distinct count
Return type: float
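The equation above has no closed-form solution, so it has to be solved numerically. The sketch below uses plain bisection, which is an assumption made for clarity and not necessarily the solver the library uses; the function names are hypothetical.

```python
from math import expm1


def expected_distinct(big_d, n):
    """Right-hand side of the moment equation: D * (1 - exp(-n / D))."""
    return -big_d * expm1(-n / big_d)


def method_of_moments_sketch(sequence):
    """Hypothetical helper: solve d = D * (1 - exp(-n / D)) for D by bisection."""
    n, d = len(sequence), len(set(sequence))
    if d == 0:
        return 0.0
    if d == n:
        return float("inf")     # all values unique: no finite solution
    low, high = float(d), 2.0 * d
    while expected_distinct(high, n) < d:   # grow the bracket until it holds the root
        high *= 2
    for _ in range(100):                    # fixed iteration count, always terminates
        mid = (low + high) / 2
        if expected_distinct(mid, n) < d:
            low = mid
        else:
            high = mid
    return (low + high) / 2


estimate = method_of_moments_sketch([1, 1, 2, 2, 3, 4])
```

Bisection works here because the right-hand side increases with D, starting below d at D = d and approaching n as D grows.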
-
pydistinct.stats_estimators.method_of_moments_v2_estimator(sequence, pop_estimator=<function <lambda>>, n_pop=None)
Method-of-moments estimator that keeps the equal-frequency assumption while still sampling from a finite relation (Haas et al., 1995).
Parameters: - sequence (array of ints) – sample sequence of integers
- pop_estimator (function that takes in the length of sequence (int) and outputs the estimated population size (int)) – function to estimate population size if possible
- n_pop (int) – estimate of population size if available, will be used over pop_estimator function
Returns: estimated distinct count
Return type: float
-
pydistinct.stats_estimators.method_of_moments_v3_estimator(sequence, pop_estimator=<function <lambda>>, n_pop=None)
Method-of-moments estimator without the equal-frequency assumption (Haas et al., 1995).
Parameters: - sequence (array of ints) – sample sequence of integers
- pop_estimator (function that takes in the length of sequence (int) and outputs the estimated population size (int)) – function to estimate population size if possible
- n_pop (int) – estimate of population size if available, will be used over pop_estimator function
Returns: estimated distinct count
Return type: float
-
pydistinct.stats_estimators.shlossers_estimator(sequence, pop_estimator=<function <lambda>>, n_pop=None)
Implementation of Shlosser’s estimator (Shlosser, 1981), which assumes a Bernoulli sampling scheme.
Note: q, the probability of a row being included in the sample, is hard to determine.
Parameters: - sequence (array of ints) – sample sequence of integers
- pop_estimator (function that takes in the length of sequence (int) and outputs the estimated population size (int)) – function to estimate population size if possible
- n_pop (int) – estimate of population size if available, will be used over pop_estimator function
Returns: estimated distinct count
Return type: float
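As a sketch of how q enters the estimate, the code below assumes q = n / n_pop and the form of Shlosser's estimator given in Haas et al. (1995), $d + f_1 \sum_i (1-q)^i f_i / \sum_i i\,q\,(1-q)^{i-1} f_i$. The name `shlosser_sketch` is hypothetical and the library may choose q differently.

```python
from collections import Counter


def shlosser_sketch(sequence, n_pop):
    """Hypothetical helper for Shlosser's formula with sampling fraction q = n / n_pop."""
    n = len(sequence)
    freq_of_freq = Counter(Counter(sequence).values())  # f_i: values seen exactly i times
    d = sum(freq_of_freq.values())
    f1 = freq_of_freq[1]
    if f1 == 0:
        return float(d)     # no singletons: the correction term vanishes
    q = n / n_pop           # assumed inclusion probability
    numerator = sum((1 - q) ** i * f for i, f in freq_of_freq.items())
    denominator = sum(i * q * (1 - q) ** (i - 1) * f for i, f in freq_of_freq.items())
    return d + f1 * numerator / denominator


half_sampled = shlosser_sketch([1, 2, 3, 3], n_pop=8)
fully_sampled = shlosser_sketch([1, 2, 3, 3], n_pop=4)
```

As q approaches 1 the correction shrinks to zero, so a full scan returns the observed distinct count.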
-
pydistinct.stats_estimators.sichel_estimator(sequence)
Implementation of Sichel’s parametric estimator (Sichel, 1986a, 1986b, and 1992), which uses a zero-truncated generalized inverse Gaussian-Poisson distribution to estimate D.
The implementation uses Broyden’s second method (broyden 2) to solve the equations and searches a linear space for a good solution.
Parameters: sequence (array of ints) – sample sequence of integers
Returns: estimated distinct count
Return type: float
-
pydistinct.stats_estimators.smoothed_jackknife_estimator(sequence, pop_estimator=<function <lambda>>, n_pop=None)
Jackknife scheme for estimating D that accounts for true bias structures (Haas et al., 1995).
Parameters: - sequence (array of ints) – sample sequence of integers
- pop_estimator (function that takes in the length of sequence (int) and outputs the estimated population size (int)) – function to estimate population size if possible
- n_pop (int) – estimate of population size if available, will be used over pop_estimator function
Returns: estimated distinct count
Return type: float