pydistinct APIs

Usage

# import all the estimators first
from pydistinct.stats_estimators import *

bootstrap_estimator([1, 2, 3, 3])
# returns 3.9923...

APIs

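pydistinct.stats_estimators.bootstrap_estimator(sequence)

Implementation of a bootstrap estimator to estimate D, the number of distinct values (Smith and van Belle 1984; Haas et al., 1995)

D_boot = d + sum_{j : n_j > 0} (1 - n_j/n)^n

where d is the number of distinct values in the sample, n is the sample size, and n_j is the number of times value j appears in the sample.

Parameters: sequence (array of ints) – sample sequence of integers
Returns: estimated distinct count
Return type: float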
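
The bootstrap formula can be transcribed almost literally. The following is a minimal sketch, not pydistinct's own code: the name `bootstrap_sketch` is hypothetical, and the library's return value may differ from what this transcription produces.

```python
from collections import Counter


def bootstrap_sketch(sequence):
    """Hypothetical helper: d + sum over observed values j of (1 - n_j/n)^n."""
    n = len(sequence)
    if n == 0:
        return 0.0
    counts = Counter(sequence)  # n_j for every value j seen in the sample
    d = len(counts)             # number of distinct values in the sample
    return d + sum((1 - n_j / n) ** n for n_j in counts.values())
```

Values that dominate the sample contribute almost nothing to the correction term, while rare values push the estimate above d.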

pydistinct.stats_estimators.chao_estimator(sequence)

Implementation of Chao’s estimator (Chao 1984), which uses the counts of values that appear exactly once (f_1) and exactly twice (f_2)

d_chao = d + (f_1)^2/(2*(f_2))

Returns the birthday problem solution if no value is observed with frequency 2 (f_2 = 0), for example when each distinct value observed is never seen again.

Also makes very aggressive estimates (10x) when almost every observed value is unique, which may be good or bad.

Parameters: sequence (array of ints) – sample sequence of integers
Returns: estimated distinct count
Return type: float
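
As a rough illustration of how f_1 and f_2 enter the estimate, here is a minimal sketch. The name `chao_sketch` is hypothetical, and the birthday-problem fallback for f_2 = 0 is not reproduced because its details are not given here; the sketch returns `None` in that case.

```python
from collections import Counter


def chao_sketch(sequence):
    """Hypothetical helper: d + f_1^2 / (2 * f_2), or None when f_2 == 0."""
    counts = Counter(sequence)               # value -> number of occurrences
    freq_of_freq = Counter(counts.values())  # i -> f_i (values seen exactly i times)
    d = len(counts)
    f1, f2 = freq_of_freq[1], freq_of_freq[2]
    if f2 == 0:
        return None  # assumption: the library's fallback is not modelled here
    return d + f1 ** 2 / (2 * f2)


# Nine singletons and one pair: 10 distinct values observed.
print(chao_sketch(list(range(1, 10)) + [10, 10]))  # 50.5
```

With nine singletons and a single pair, the estimate is 50.5 from only 10 observed distinct values, which shows how quickly the estimate grows when almost every value is unique.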

pydistinct.stats_estimators.chao_lee_estimator(sequence)

Implementation of Chao and Lee’s estimator (Chao and Lee, 1992) using a natural estimator of coverage

gamma hat is an estimator of the squared coefficient of variation of the frequencies

Parameters: sequence (array of ints) – sample sequence of integers
Returns: estimated distinct count
Return type: float
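
The formula itself is not spelled out here, so the sketch below assumes the commonly cited coverage-based form: C = 1 - f_1/n, D = d/C + n(1 - C)/C * gamma^2, with gamma^2 = max((d/C) * sum_i i(i-1) f_i / (n(n-1)) - 1, 0). The name `chao_lee_sketch` is hypothetical, and pydistinct may compute these quantities differently.

```python
from collections import Counter


def chao_lee_sketch(sequence):
    """Hypothetical helper for the assumed coverage-based estimate."""
    n = len(sequence)
    if n < 2:
        return None  # gamma^2 needs at least two observations
    counts = Counter(sequence)
    freq_of_freq = Counter(counts.values())  # i -> f_i
    d = len(counts)
    coverage = 1 - freq_of_freq[1] / n       # C: estimated sample coverage
    if coverage == 0:
        return None  # every value is a singleton, so the estimate is undefined
    d_cov = d / coverage
    pair_sum = sum(i * (i - 1) * f_i for i, f_i in freq_of_freq.items())
    # Squared coefficient of variation of the frequencies, floored at zero.
    gamma_sq = max(d_cov * pair_sum / (n * (n - 1)) - 1, 0.0)
    return d_cov + n * (1 - coverage) / coverage * gamma_sq
```

When the frequencies look uniform, gamma^2 is zero and the estimate reduces to d divided by the coverage.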

pydistinct.stats_estimators.goodmans_estimator(sequence)

Implementation of Goodman’s estimator (Goodman 1949). Throws an error if N is too high, because of numerical complexity.

Parameters: sequence (array of ints) – sample sequence of integers
Returns: estimated distinct count
Return type: float

pydistinct.stats_estimators.horvitz_thompson_estimator(sequence, pop_estimator=<function <lambda>>, n_pop=None)

Implementation of the Horvitz-Thompson Estimator to estimate D (Särndal, Swensson, and Wretman 1992; Haas et al., 1995)

n_j = number of times value j appears in the sample

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • pop_estimator (function that takes the length of sequence (int) and returns the estimated population size (int)) – function used to estimate the population size, if possible
  • n_pop (int) – estimate of the population size, if available; takes precedence over pop_estimator
Returns: estimated distinct count
Return type: float
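
The interplay between `pop_estimator` and `n_pop` can be pictured with a small stand-alone sketch. `resolve_population_size` and `ten_percent` are hypothetical names used only for illustration; they mirror the documented precedence rule and are not part of pydistinct.

```python
def resolve_population_size(sequence, pop_estimator, n_pop=None):
    """Hypothetical helper: n_pop wins when given, else pop_estimator(len(sequence))."""
    if n_pop is not None:
        return n_pop
    return pop_estimator(len(sequence))


sample = [1, 2, 3, 3]


def ten_percent(n):
    # Assumed estimator: the sample is 10% of the population.
    return n * 10


print(resolve_population_size(sample, ten_percent))             # 40
print(resolve_population_size(sample, ten_percent, n_pop=250))  # 250
```

Passing `n_pop` is the more direct choice when the population size is known; a `pop_estimator` function is useful when only the sampling rate is known.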

pydistinct.stats_estimators.hybrid_estimator(sequence, pop_estimator=<function <lambda>>, n_pop=None)

Hybrid estimator that uses Shlosser’s estimator when the data is skewed and the smoothed jackknife estimator when it is not. Skew is detected with an approximate chi-square test for uniformity.

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • pop_estimator (function that takes the length of sequence (int) and returns the estimated population size (int)) – function used to estimate the population size, if possible
  • n_pop (int) – estimate of the population size, if available; takes precedence over pop_estimator
Returns: estimated distinct count
Return type: float
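
The branching idea can be sketched as follows. This is a sketch under assumptions: the statistic is a plain Pearson-style sum against equal expected frequencies, the critical value is supplied by the caller, and `uniformity_statistic` and `choose_estimator` are hypothetical names. The exact test and threshold used by pydistinct are not stated here.

```python
from collections import Counter


def uniformity_statistic(sequence):
    """Sum over observed values of (n_j - n/d)^2 / (n/d); assumes a non-empty sequence."""
    counts = Counter(sequence)
    expected = len(sequence) / len(counts)  # n / d under equal frequencies
    return sum((n_j - expected) ** 2 / expected for n_j in counts.values())


def choose_estimator(sequence, critical_value):
    """Hypothetical: name of the branch a hybrid scheme would take."""
    skewed = uniformity_statistic(sequence) > critical_value
    return "shlossers_estimator" if skewed else "smoothed_jackknife_estimator"


# 5.991 is the 95% chi-square critical value for 2 degrees of freedom.
print(choose_estimator([1] * 20 + [2, 3], 5.991))    # shlossers_estimator
print(choose_estimator([1, 1, 2, 2, 3, 3], 5.991))   # smoothed_jackknife_estimator
```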

pydistinct.stats_estimators.jackknife_estimator(sequence)

Jackknife scheme for estimating D (Ozsoyoglu et al., 1991). Performs well in regimes where the sample size is close to the actual number of points.

D^hat_cj = d_n - (n - 1)(d_(n-1) - d_n)

where d_n is the number of distinct values in the full sample and d_(n-1) is the average number of distinct values when one sampled point is left out.

Parameters: sequence (array of ints) – sample sequence of integers
Returns: estimated distinct count
Return type: float
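
A minimal sketch of the jackknife scheme, taking d_(n-1) to be the average distinct count over the n leave-one-out samples (an assumption about the notation) and using an (n - 1) scaling factor. The name `jackknife_sketch` is hypothetical and the code is not pydistinct's implementation.

```python
from collections import Counter


def jackknife_sketch(sequence):
    """Hypothetical helper: d_n - (n - 1) * (d_(n-1) - d_n)."""
    n = len(sequence)
    if n == 0:
        return 0.0
    counts = Counter(sequence)
    d_n = len(counts)
    # Leaving out one point lowers the distinct count only if its value is a singleton.
    leave_one_out = [d_n - 1 if counts[x] == 1 else d_n for x in sequence]
    d_n_minus_1 = sum(leave_one_out) / n
    return d_n - (n - 1) * (d_n_minus_1 - d_n)


print(jackknife_sketch([1, 2, 3, 3]))  # 4.5
```

Under this reading the estimate equals d + ((n - 1)/n) * f_1, so only the singletons drive the correction.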

pydistinct.stats_estimators.method_of_moments_estimator(sequence)

Simple Method-of-Moments Estimator to estimate D (Haas et al., 1995). The solver can be optimised (training rate, stopping value).

Solve for d_moments in

d = d_moments(1 - e^(-n/d_moments))

Parameters: sequence (array of ints) – sample sequence of integers
Returns: estimated distinct count
Return type: float
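
One way to solve the moment equation, assuming it reads d = D(1 - e^(-n/D)), is plain bisection. This is a sketch rather than the library's solver (which mentions a training rate and stopping value); `method_of_moments_sketch` is a hypothetical name.

```python
import math


def method_of_moments_sketch(n, d, iterations=200):
    """Hypothetical helper: solve d = D * (1 - exp(-n / D)) for D by bisection."""
    if d <= 0:
        return 0.0
    if d >= n:
        return math.inf  # no finite root when every sampled value is distinct

    def g(big_d):
        return big_d * (1 - math.exp(-n / big_d)) - d

    lo = hi = float(d)   # g(d) < 0, so d is a valid lower bound
    while g(hi) < 0:     # g increases towards n - d > 0, so doubling brackets the root
        hi *= 2
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For example, `method_of_moments_sketch(4, 3)` solves the equation for a sample of 4 points with 3 distinct values and lands between 6 and 7.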

pydistinct.stats_estimators.method_of_moments_v2_estimator(sequence, pop_estimator=<function <lambda>>, n_pop=None)

Method-of-Moments Estimator with equal frequency assumption while still sampling from a finite relation (Haas et al., 1995)

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • pop_estimator (function that takes the length of sequence (int) and returns the estimated population size (int)) – function used to estimate the population size, if possible
  • n_pop (int) – estimate of the population size, if available; takes precedence over pop_estimator
Returns: estimated distinct count
Return type: float

pydistinct.stats_estimators.method_of_moments_v3_estimator(sequence, pop_estimator=<function <lambda>>, n_pop=None)

Method-of-Moments Estimator without equal frequency assumption (Haas et al., 1995)

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • pop_estimator (function that takes the length of sequence (int) and returns the estimated population size (int)) – function used to estimate the population size, if possible
  • n_pop (int) – estimate of the population size, if available; takes precedence over pop_estimator
Returns: estimated distinct count
Return type: float

pydistinct.stats_estimators.shlossers_estimator(sequence, pop_estimator=<function <lambda>>, n_pop=None)

Implementation of Shlosser’s Estimator (Shlosser 1981) using a Bernoulli sampling scheme

Note: q (the probability of being included in the sample) is hard to determine

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • pop_estimator (function that takes the length of sequence (int) and returns the estimated population size (int)) – function used to estimate the population size, if possible
  • n_pop (int) – estimate of the population size, if available; takes precedence over pop_estimator
Returns: estimated distinct count
Return type: float
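
To show where q enters, the sketch below assumes the commonly cited form of the estimator, D = d + f_1 * sum_i (1 - q)^i f_i / sum_i i q (1 - q)^(i-1) f_i, which is not written out here. `shlosser_sketch` is a hypothetical name, and approximating q by the sampling fraction n / N is likewise an assumption.

```python
from collections import Counter


def shlosser_sketch(sequence, q):
    """Hypothetical helper for the assumed form, with inclusion probability 0 < q <= 1."""
    counts = Counter(sequence)
    freq_of_freq = Counter(counts.values())  # i -> f_i
    d = len(counts)
    f1 = freq_of_freq[1]
    if f1 == 0:
        return float(d)  # no singletons, so the correction term vanishes
    numerator = sum((1 - q) ** i * f_i for i, f_i in freq_of_freq.items())
    denominator = sum(i * q * (1 - q) ** (i - 1) * f_i
                      for i, f_i in freq_of_freq.items())
    return d + f1 * numerator / denominator


sample = [1, 2, 3, 3]
n_pop = 8  # assumed population size, giving q = 0.5
print(shlosser_sketch(sample, q=len(sample) / n_pop))
```

As q approaches 1 the sample covers the whole population and the estimate falls back to the observed distinct count d.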

pydistinct.stats_estimators.sichel_estimator(sequence)

Implementation of Sichel’s Parametric Estimator (Sichel 1986a, 1986b and 1992), which uses a zero-truncated generalized inverse Gaussian-Poisson distribution to estimate D

The implementation uses Broyden’s second method (broyden 2) to solve, and searches a linear space for a good solution.

Parameters: sequence (array of ints) – sample sequence of integers
Returns: estimated distinct count
Return type: float

pydistinct.stats_estimators.smoothed_jackknife_estimator(sequence, pop_estimator=<function <lambda>>, n_pop=None)

Jackknife scheme for estimating D that accounts for true bias structures (Haas et al., 1995)

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • pop_estimator (function that takes the length of sequence (int) and returns the estimated population size (int)) – function used to estimate the population size, if possible
  • n_pop (int) – estimate of the population size, if available; takes precedence over pop_estimator
Returns: estimated distinct count
Return type: float