pydistinct APIs

Usage

# import all the estimators first
from pydistinct.stats_estimators import *
bootstrap_estimator([1,2,3,3])
>>> 3.9923...

APIs

Ensemble Estimator

pydistinct.ensemble_estimators.full_median_estimator(sequence=None, attributes=None)[source]

Takes median result from all statistical estimators. 10 times slower than normal median estimator

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – dictionary with keys as the unique elements and values as counts of those elements
Returns:

median value of all estimator

Return type:

float

pydistinct.ensemble_estimators.median_estimator(sequence=None, attributes=None)[source]

Takes median result from faster and generally more reliable statistical estimators

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – dictionary with keys as the unique elements and values as counts of those elements
Returns:

median value of all estimator

Return type:

float

Statistical Estimators

pydistinct.stats_estimators.bootstrap_estimator(sequence=None, attributes=None, cache=None)[source]

Implementation of a bootstrap estimator to estimate D (Smith and Van Bell 1984; Haas et al, 1995)

DBoot = d + sigma (1-nj/n)^n

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – dictionary with keys as the unique elements and values as counts of those elements
  • cache (dictionary with 4 elements {"n":no_elements,"d":no_unique_elements,"attr": attribute_counts, "freq":frequency_dictionary}) – argument used by median methods to avoid recomputation of variables
Returns:

estimated distinct count

Return type:

float

pydistinct.stats_estimators.chao_estimator(sequence=None, attributes=None, cache=None)[source]

Implementation of Chao’s estimator from Chao 1984, using counts of values that appear exactly once and twice

d_chao = d + (f_1)^2/(2*(f_2)) returns birthday problem solution if there are no sequences observed of frequency 2 (ie each distinct value observed is never seen again)

also makes insane bets (10x) when every point observed is almost unique. could be good or bad

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – dictionary with keys as the unique elements and values as counts of those elements
  • cache (dictionary with 4 elements {"n":no_elements,"d":no_unique_elements,"attr": attribute_counts, "freq":frequency_dictionary}) – argument used by median methods to avoid recomputation of variables
Returns:

estimated distinct count

Return type:

float

pydistinct.stats_estimators.chao_lee_estimator(sequence=None, attributes=None, cache=None)[source]

Implementation of Chao and Lee’s estimator (Chao and Lee, 1984) using a natural estimator of coverage

gamma hat is an estimator for the squared coefficient of variation of the frequencies

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – dictionary with keys as the unique elements and values as counts of those elements
  • cache (dictionary with 4 elements {"n":no_elements,"d":no_unique_elements,"attr": attribute_counts, "freq":frequency_dictionary}) – argument used by median methods to avoid recomputation of variables
Returns:

estimated distinct count

Return type:

float

pydistinct.stats_estimators.goodmans_estimator(sequence=None, attributes=None, cache=None)[source]

Implementation of goodmans estimator from Goodman 1949 : throws an error if N is too high due to numerical complexity

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – dictionary with keys as the unique elements and values as counts of those elements
  • cache (dictionary with 4 elements {"n":no_elements,"d":no_unique_elements,"attr": attribute_counts, "freq":frequency_dictionary}) – argument used by median methods to avoid recomputation of variables
Returns:

estimated distinct count

Return type:

float

pydistinct.stats_estimators.horvitz_thompson_estimator(sequence=None, attributes=None, pop_estimator=<function <lambda>>, n_pop=None, cache=None)[source]

Implementation of the Horvitz-Thompson Estimator to estimate D (Sarndal, Swensson, and Wretman 1992; Haas et al, 1995)

n_j = attribute count of value j

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – dictionary with keys as the unique elements and values as counts of those elements
  • cache (dictionary with 4 elements {"n":no_elements,"d":no_unique_elements,"attr": attribute_counts, "freq":frequency_dictionary}) – argument used by median methods to avoid recomputation of variables
  • pop_estimator (function that takes in the length of sequence (int) and outputs the estimated population size (int)) – function to estimate population size if possible
  • n_pop (int) – estimate of population size if available, will be used over pop_estimator function
Returns:

estimated distinct count

Return type:

float

pydistinct.stats_estimators.hybrid_estimator(sequence=None, attributes=None, pop_estimator=<function <lambda>>, n_pop=None, cache=None)[source]

hybrid_estimator : Hybrid Estimator that uses Shlosser’s estimator when data is skewed and Smooth jackknife estimator when data is not. Skew is computed by using an approximate chi square test for uniformity

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – dictionary with keys as the unique elements and values as counts of those elements
  • cache (dictionary with 4 elements {"n":no_elements,"d":no_unique_elements,"attr": attribute_counts, "freq":frequency_dictionary}) – argument used by median methods to avoid recomputation of variables
  • pop_estimator (function that takes in the length of sequence (int) and outputs the estimated population size (int)) – function to estimate population size if possible
  • n_pop (int) – estimate of population size if available, will be used over pop_estimator function
Returns:

estimated distinct count

Return type:

float

pydistinct.stats_estimators.jackknife_estimator(sequence=None, attributes=None, cache=None)[source]

Jackknife scheme for estimating D (Ozsoyoglu et al., 1991) good at regimes where sample size is close to actual number of points

D^hat_c_j = d_n - (n - l)(d_(n-1)- d_n).

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – dictionary with keys as the unique elements and values as counts of those elements
  • cache (dictionary with 4 elements {"n":no_elements,"d":no_unique_elements,"attr": attribute_counts, "freq":frequency_dictionary}) – argument used by median methods to avoid recomputation of variables
Returns:

estimated distinct count

Return type:

float

pydistinct.stats_estimators.method_of_moments_estimator(sequence=None, attributes=None, cache=None)[source]

Simple Method-of-Moments Estimator to estimate D (Haas et al, 1995) can be optimised (training rate, stopping value)

d = d_moments(l - e^(-n))/d_moments)

solve for d_moments in d = d_moments(l - e^(-n))/d_moments)

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – dictionary with keys as the unique elements and values as counts of those elements
  • cache (dictionary with 4 elements {"n":no_elements,"d":no_unique_elements,"attr": attribute_counts, "freq":frequency_dictionary}) – argument used by median methods to avoid recomputation of variables
Returns:

estimated distinct count

Return type:

float

pydistinct.stats_estimators.method_of_moments_v2_estimator(sequence=None, attributes=None, pop_estimator=<function <lambda>>, n_pop=None, cache=None)[source]
Method-of-Moments Estimator with equal frequency assumption while still sampling
from a finite relation (Haas et al, 1995)
Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – dictionary with keys as the unique elements and values as counts of those elements
  • cache (dictionary with 4 elements {"n":no_elements,"d":no_unique_elements,"attr": attribute_counts, "freq":frequency_dictionary}) – argument used by median methods to avoid recomputation of variables
  • pop_estimator (function that takes in the length of sequence (int) and outputs the estimated population size (int)) – function to estimate population size if possible
  • n_pop (int) – estimate of population size if available, will be used over pop_estimator function
Returns:

estimated distinct count

Return type:

float

pydistinct.stats_estimators.method_of_moments_v3_estimator(sequence=None, attributes=None, pop_estimator=<function <lambda>>, n_pop=None, cache=None)[source]

Method-of-Moments Estimator without equal frequency assumption (Haas et al, 1995)

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – dictionary with keys as the unique elements and values as counts of those elements
  • cache (dictionary with 4 elements {"n":no_elements,"d":no_unique_elements,"attr": attribute_counts, "freq":frequency_dictionary}) – argument used by median methods to avoid recomputation of variables
  • pop_estimator (function that takes in the length of sequence (int) and outputs the estimated population size (int)) – function to estimate population size if possible
  • n_pop (int) – estimate of population size if available, will be used over pop_estimator function
Returns:

estimated distinct count

Return type:

float

pydistinct.stats_estimators.shlossers_estimator(sequence=None, attributes=None, pop_estimator=<function <lambda>>, n_pop=None, cache=None)[source]

Implementation of Shlosser’s Estimator (Shlosser 1981) using a Bernoulli Sampling scheme

Note : Hard to determine q (probability of being included)

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – dictionary with keys as the unique elements and values as counts of those elements
  • cache (dictionary with 4 elements {"n":no_elements,"d":no_unique_elements,"attr": attribute_counts, "freq":frequency_dictionary}) – argument used by median methods to avoid recomputation of variables
  • pop_estimator (function that takes in the length of sequence (int) and outputs the estimated population size (int)) – function to estimate population size if possible
  • n_pop (int) – estimate of population size if available, will be used over pop_estimator function
Returns:

estimated distinct count

Return type:

float

pydistinct.stats_estimators.sichel_estimator(sequence=None, attributes=None, cache=None)[source]

Implementation of Sichel’s Parametric Estimator (Sichel 1986a, 1986b and 1992) which uses a zero-truncated generalized inverse Gaussian-Poisson to estimate D

implementation uses broyden 2 to solve and search linear space for good solution

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – dictionary with keys as the unique elements and values as counts of those elements
  • cache (dictionary with 4 elements {"n":no_elements,"d":no_unique_elements,"attr": attribute_counts, "freq":frequency_dictionary}) – argument used by median methods to avoid recomputation of variables
Returns:

estimated distinct count

Return type:

float

pydistinct.stats_estimators.smoothed_jackknife_estimator(sequence=None, attributes=None, pop_estimator=<function <lambda>>, n_pop=None, cache=None)[source]

Jackknife scheme for estimating D that accounts for true bias structures (Haas et al, 1995)

Parameters:
  • sequence (array of ints) – sample sequence of integers
  • attributes (dictionary where keys can be any type, values must be integers) – attribute count -> dictionary with keys as the unique elements and values as counts of those elements
  • cache (dictionary with 4 elements {"n":no_elements,"d":no_unique_elements,"attr": attribute_counts, "freq":frequency_dictionary}) – argument used by median methods to avoid recomputation of variables
  • pop_estimator (function that takes in the length of sequence (int) and outputs the estimated population size (int)) – function to estimate population size if possible
  • n_pop (int) – estimate of population size if available, will be used over pop_estimator function
Returns:

estimated distinct count

Return type:

float

Bootstrap

Functions that allow one to create bootstrapped confidence intervals

pydistinct.bootstrap.bootstrap(stat_func=<function smoothed_jackknife_estimator>, sequence=None, attributes=None, alpha=0.05, num_iterations=1000, iteration_batch_size=10, is_pivotal=True, num_threads=1, return_distribution=False)[source]

Returns bootstrap estimate. Args:

stat_func: statistical estimator to use - good choices are median_estimator and smoothed_jackknife_estimator sequence: sample sequence of integers attributes: dictionary with keys as the unique elements and values as counts of those elements alpha: alpha value representing the confidence interval. Defaults to 0.05, i.e., 95th-CI. num_iterations: number of bootstrap iterations to run. The higher this

number the more sure you can be about the stability your bootstrap. By this - we mean the returned interval should be consistent across runs for the same input. This also consumes more memory and makes analysis slower. Defaults to 10000.
iteration_batch_size: The bootstrap sample can generate very large
matrices. This argument limits the memory footprint by batching bootstrap rounds. If unspecified the underlying code will produce a matrix of len(values) x num_iterations. If specified the code will produce sets of len(values) x iteration_batch_size (one at a time) until num_iterations have been simulated. Defaults to 10. Passing None will calculate the full simulation in one step.
is_pivotal: if true, use the pivotal method for bootstrapping confidence
intervals. If false, use the percentile method.
num_threads: The number of therads to use. This speeds up calculation of
the bootstrap. Defaults to 1. If -1 is specified then multiprocessing.cpu_count() is used instead.
Returns:
BootstrapResults representing CI and estimated value.