Undersampling project proposal #34
Create directory for project proposals
Creating Sklearn proposal
Sklearn Example
* copies the sklearn example from TFX source and updates it to be run from a cloned tfx-addons repo instead
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with `@googlebot I signed it!`.
What to do if you already signed the CLA
Individual signers
Corporate signers
ℹ️ Googlers: Go here for more info.
@googlebot I signed it!
This looks good overall! I would discourage the idea of using BigQuery for this, since that would limit it to use on the Google Cloud only.
Hello! Just a quick update: I've completed an initial version of the undersampling component, which can be found here. Constructive criticism is much appreciated!
## Project Description
This project will be a custom function-based component that takes as input an artifact of `tf.Example`s in `tfRecord` format and randomly undersamples it, reducing every class to the size of the lowest-frequency class. It will primarily use an underlying Apache Beam pipeline that will be wrapped inside the TensorFlow component.
(non blocking)
How much do you imagine this implementation will be tied to either the format or the tf.Example payload? If it becomes a different container, or a different payload, would sample still work?
While this implementation is tied to the tf.Example payload, this is the standard pipeline data flow for pipelines that are downstream from ExampleGen. We are planning for data format logic to be handled by ExampleGen; as long as your ExampleGen component or a custom component can ingest a requested data format, the Undersampler can use it.
## Project Implementation
At a high level, the plan is to use Apache Beam to injest a `tfRecord` of `tf.Example`s, shuffle them, convert them into a key-value `PCollection` with class values as keys and data points as values, and then use `Sample.FixedSizePerKey()` to perform the actual undersampling. The algorithm will be written as an Apache Beam pipeline, which will be wrapped into a TensorFlow custom function component for use in TFX pipelines. The component would take a `tfRecord` artifact as input and export a similar `tfRecord` artifact, so it can be placed almost anywhere in a pipeline. If necessary, the component may also be changed to a fully-custom component, albeit one where only the executor is edited.
Typo: "injest" should be "ingest".
What's the use case of TensorFlow custom function, vs just another Python function?
This is actually outdated, but the use of a TFX custom component function would be to allow its use in TFX pipelines as a component. The current implementation uses a fully-custom component with a custom executor and spec, which fulfills a similar role while allowing more complexity.
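For readers following the thread, the undersampling logic described in the Project Implementation paragraph (key each example by class, then keep a fixed-size random sample per key equal to the smallest class) can be sketched in plain Python. This is only an illustration under assumptions, not the component's code: it uses the standard library instead of Apache Beam, plain dicts instead of `tf.Example`s, and the names `undersample`, `label_of`, and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def undersample(examples, label_of, seed=None):
    """Randomly undersample so each class keeps as many items as the rarest class.

    `label_of` is a hypothetical callback that extracts the class value
    from one example; it stands in for reading a label feature.
    """
    rng = random.Random(seed)

    # Group examples by class value (the "key-value" step in the proposal).
    by_class = defaultdict(list)
    for example in examples:
        by_class[label_of(example)].append(example)
    if not by_class:
        return []

    # The target size is the size of the lowest-frequency class.
    target = min(len(items) for items in by_class.values())

    # Fixed-size random sample per key, without replacement.
    sampled = []
    for items in by_class.values():
        sampled.extend(rng.sample(items, target))

    # Shuffle so the output is not ordered by class.
    rng.shuffle(sampled)
    return sampled


# Toy data: 90 examples of class 0 and 10 of class 1.
data = [{"label": 0, "x": i} for i in range(90)]
data += [{"label": 1, "x": i} for i in range(10)]
balanced = undersample(data, label_of=lambda ex: ex["label"], seed=0)
```

In a Beam pipeline the per-key sampling step would presumably be handled by `Sample.FixedSizePerKey()`, as the proposal states; the sketch above only shows the intended result on an in-memory list.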
Just a quick update: a finished initial version of the component can be found here.
Closed accidentally
Proposal for the undersampling project discussed in #33.