1.Introduction.
Reservoir Sampling is an online algorithm designed for picking up small amount of sample with uniform probability. To be specific, given a sample of size K with N items processed so far, the chance for any item to be selected is K/N. When the next item comes in, current sample has a chance to survive K/N*N/(N+1)=K/(N+1) while the new item has chance K/(N+1) to be selected.
The mathematical prove is as follows:
Reservoir sample algorithm works as follows:
- Create an array samples[0..k-1] and copy first k items of stream[] to it.
- When a new data comes in (with index greater than k - 1), generate a random number from 0 to i where i is index of current item in stream[].
- Generated another random number j with in range 0 to k-1, replace samples[j] with the new data.
The code for this aglroithm is also straightforward.
import random
from typing import List,Generator
class ReservoirSampling:
def __init__(self, capacity: int) -> None:
self._capacity:int = capacity
self._samples:List[int] = []
self._pointer:int = 0
def reservoir_sampling(self) -> Generator[List[int], int, None]:
while True:
next_val:int = yield self._samples
if len(self._samples) < self._capacity:
self._samples.append(next_val)
else:
rand_pos: int = random.randint(0,self._pointer) # [0, M - 1]
# probability = K / M
if rand_pos < self._capacity: # belonging to [0, K - 1]
self._samples[rand_pos] = next_val
self._pointer += 1
if __name__ == '__main__':
rs:ReservoirSampling = ReservoirSampling(5)
rs_generator = rs.reservoir_sampling()
print("init list:{}".format(next(rs_generator)))
for x in range(10):
print(rs_generator.send(x))
Result:
init list:[]
[0]
[0, 1]
[0, 1, 2]
[0, 1, 2, 3]
[0, 1, 2, 3, 4]
[0, 1, 5, 3, 4]
[6, 1, 5, 3, 4]
[6, 7, 5, 3, 4]
[6, 7, 5, 3, 8]
[6, 7, 9, 3, 8]