Undersampling project proposal #34

Closed
wants to merge 27 commits into from

Conversation

kindalime
Contributor

Proposal for the undersampling project discussed in #33.

@google-cla

google-cla bot commented Jun 14, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@kindalime kindalime changed the title Add undersampling project proposal Undersampling project proposal Jun 14, 2021

@kindalime
Contributor Author

@googlebot I signed it!

Collaborator

@rcrowe-google rcrowe-google left a comment

This looks good overall! I would discourage the idea of using BigQuery for this, since that would limit it to use on the Google Cloud only.

@rcrowe-google
Collaborator

@kindalime
Contributor Author

Hello! Just a quick update: I've completed an initial version of the undersampling component, which can be found here. Constructive criticism is much appreciated!


## Project Description

This project will be a custom function-based component that takes an artifact of `tf.Example`s in `TFRecord` format as input and randomly undersamples it, reducing every class to the size of the lowest-frequency class. It will primarily use an underlying Apache Beam pipeline that will be wrapped inside the TensorFlow component.
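
To make the intended behavior concrete, here is a minimal plain-Python sketch of random undersampling to the lowest-frequency class. It is only an illustration and not the component's implementation: it does not use Beam or `tf.Example`, and the names `undersample`, `label_of`, and `seed` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample(examples, label_of, seed=None):
    """Randomly downsample every class to the size of the rarest class.

    `label_of` is a hypothetical callable that returns the class of one example.
    """
    rng = random.Random(seed)

    # Group the examples by class label.
    by_class = defaultdict(list)
    for ex in examples:
        by_class[label_of(ex)].append(ex)
    if not by_class:
        return []

    # The target size is the size of the lowest-frequency class.
    target = min(len(group) for group in by_class.values())

    # Draw `target` examples without replacement from each class.
    sampled = []
    for group in by_class.values():
        sampled.extend(rng.sample(group, target))
    rng.shuffle(sampled)
    return sampled


# Toy data: 8 "neg" examples and 3 "pos" examples.
data = [{"x": i, "label": "neg"} for i in range(8)]
data += [{"x": i, "label": "pos"} for i in range(3)]

balanced = undersample(data, label_of=lambda ex: ex["label"], seed=0)
print(Counter(ex["label"] for ex in balanced))  # each class ends up with 3 examples
```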

(non blocking)

How much do you imagine this implementation will be tied to either the format or the tf.Example payload? If it becomes a different container, or a different payload, would sample still work?

Contributor Author

While this implementation is tied to the tf.Example payload, this is the standard pipeline data flow for pipelines that are downstream from ExampleGen. We are planning for data format logic to be handled by ExampleGen; as long as your ExampleGen component or a custom component can ingest a requested data format, the Undersampler can use it.


## Project Implementation

At a high level, the plan is to use Apache Beam to ingest a `TFRecord` of `tf.Example`s, shuffle them, convert them into a key-value `PCollection` with class values as keys and data points as values, and then use `Sample.FixedSizePerKey()` to perform the actual undersampling. The algorithm will be written as an Apache Beam pipeline, which will be wrapped in a TensorFlow custom function component for use in TFX pipelines. The component will take a `TFRecord` artifact as input and export a similar `TFRecord` artifact, so it can be placed almost anywhere in a pipeline. If necessary, the component may instead be written as a fully custom component, albeit one where only the executor is edited.

ingest

What's the use case of TensorFlow custom function, vs just another Python function?

Contributor Author

This is actually outdated, but the use of a TFX custom component function would be to allow its use in TFX pipelines as a component. The current implementation uses a fully-custom component with a custom executor and spec, which fulfills a similar role while allowing more complexity.

@kindalime
Contributor Author

Just a quick update: a finished initial version of the component can be found here.

@kindalime kindalime marked this pull request as ready for review July 7, 2021 18:13
@rcrowe-google
Collaborator

Closed accidentally
