Implements reservoir sampler randomly sampling stream of features #33
Such a beautiful implementation! ❤️
                i = random.randint(0, size - 1)
                self.reservoir[i] = v

        self.pushed += 1
First we designate a counter, which will be incremented for every data point seen.
    '''Randomly samples k items from a stream of unknown n items.
    '''

    def __init__(self, capacity):
The reservoir is generally a list or array of predefined size.
        self.reservoir = []
        self.pushed = 0

    def push(self, v):
Now we can begin adding data.
        size = len(self.reservoir)

        if size < self.capacity:
            self.reservoir.append(v)
Until we have seen as many elements as the reservoir can hold, each element is added directly to the reservoir.
            assert size == self.capacity
            assert size <= self.pushed

            p = self.capacity / self.pushed
Once the reservoir is full, each incoming data point has a size / counter chance of replacing an existing sample point.
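A short induction shows why that ratio gives a uniform sample. This is the textbook argument, not something stated in the diff, and it assumes the counter t already includes the incoming item. Here k is the reservoir size, and the replaced slot is chosen uniformly among the k slots. At t = k every item is in the reservoir with probability k / k = 1. For t > k:

```latex
\begin{align*}
P(\text{item } t \text{ in reservoir after step } t) &= \frac{k}{t} \\
P(\text{item } j < t \text{ in reservoir after step } t)
  &= \underbrace{\frac{k}{t-1}}_{\text{in after step } t-1}
     \cdot \underbrace{\left(1 - \frac{k}{t}\cdot\frac{1}{k}\right)}_{\text{not evicted at step } t}
   = \frac{k}{t-1}\cdot\frac{t-1}{t} = \frac{k}{t}
\end{align*}
```

So after n items every item seen so far sits in the reservoir with the same probability k / n.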
For #7. Work in progress.
This changeset implements a randomized online algorithm, "reservoir sampling", for randomly sampling k items from a stream of unknown length n. We can use this to randomly sample, e.g., k building features in the osmium handlers without having to store all features first or make two passes.
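For illustration, one-pass sampling over a stream might look roughly like the sketch below. It is not the code in this changeset: the `rng` parameter, the `sample_stream` helper, and the `building-…` feature names are hypothetical, added only to make the example self-contained and seedable. The sketch counts the incoming item before computing the replacement probability, so the n-th item is accepted with probability k / n.

```python
import random


class ReservoirSampler:
    """Keeps a uniform random sample of up to `capacity` items from a stream."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.reservoir = []
        self.pushed = 0  # number of items seen so far
        # `rng` is a hypothetical extra parameter so that runs can be seeded.
        self.rng = rng if rng is not None else random.Random()

    def push(self, v):
        # Count the incoming item first so `pushed` includes it.
        self.pushed += 1

        if len(self.reservoir) < self.capacity:
            # Fill phase: the first `capacity` items are always kept.
            self.reservoir.append(v)
        elif self.rng.random() < self.capacity / self.pushed:
            # Replacement phase: overwrite a uniformly chosen slot.
            i = self.rng.randrange(self.capacity)
            self.reservoir[i] = v


def sample_stream(stream, k, seed=None):
    """Hypothetical helper: sample k items from any iterable in a single pass."""
    sampler = ReservoirSampler(k, random.Random(seed))
    for item in stream:
        sampler.push(item)
    return sampler.reservoir


# Stand-in for features yielded by a handler; the names are made up.
features = (f"building-{i}" for i in range(10_000))
picked = sample_stream(features, k=5, seed=0)
```

Because the stream is a generator, the features are never all held in memory at once; only the k sampled items are stored.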