Reservoir Sampling

1.Introduction.

Reservoir Sampling is an online algorithm designed for picking up small amount of sample with uniform probability. To be specific, given a sample of size K with N items processed so far, the chance for any item to be selected is K/N. When the next item comes in, current sample has a chance to survive K/N*N/(N+1)=K/(N+1) while the new item has chance K/(N+1) to be selected.

The mathematical prove is as follows:

Figure 1

Reservoir sample algorithm works as follows:

    1. Create an array samples[0..k-1] and copy first k items of stream[] to it.
    1. When a new data comes in (with index greater than k - 1), generate a random number from 0 to i where i is index of current item in stream[].
    1. Generated another random number j with in range 0 to k-1, replace samples[j] with the new data.

The code for this aglroithm is also straightforward.

import random  
from typing import List,Generator  
class ReservoirSampling:  
    def __init__(self, capacity: int) -> None:  
        self._capacity:int = capacity  
        self._samples:List[int] = []  
        self._pointer:int = 0  

    def reservoir_sampling(self) -> Generator[List[int], int, None]:  
        while True:  
            next_val:int = yield self._samples  

            if len(self._samples) < self._capacity:  
                self._samples.append(next_val)  
            else:  
                rand_pos: int = random.randint(0,self._pointer)  # [0, M - 1]  
                # probability = K / M  
                if rand_pos < self._capacity: # belonging to [0, K - 1]  
                    self._samples[rand_pos] = next_val  

            self._pointer += 1   


if __name__ == '__main__':
    rs:ReservoirSampling = ReservoirSampling(5)  
    rs_generator = rs.reservoir_sampling()  
    print("init list:{}".format(next(rs_generator)))  
    for x in range(10):  
        print(rs_generator.send(x)) 

Result:

init list:[]
[0]
[0, 1]
[0, 1, 2]
[0, 1, 2, 3]
[0, 1, 2, 3, 4]
[0, 1, 5, 3, 4]
[6, 1, 5, 3, 4]
[6, 7, 5, 3, 4]
[6, 7, 5, 3, 8]
[6, 7, 9, 3, 8]

Author: Liang Tan
Reprint policy: All articles in this blog are used except for special statements CC BY 4.0 reprint polocy. If reproduced, please indicate source Liang Tan !
  TOC