
chore: modify test dataset to contain edge cases for testing purposes #873

Open
seve opened this issue Apr 2, 2024 · 6 comments
Labels
P2 Priority 2 - Improvement with narrower impact, fix within a month testing

Comments

seve (Contributor) commented Apr 2, 2024

  • Differing number of cells for spatial vs non-spatial embedding
  • Singleton categorical column
  • Singleton continuous column

Related Tickets: #872 #870
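
A minimal sketch of how a notebook might inject these edge cases into an .h5ad with anndata; the file name, column names, and the NaN approach for the spatial/non-spatial cell-count mismatch are illustrative assumptions, not a settled design:

```python
import numpy as np
import pandas as pd
import anndata as ad

adata = ad.read_h5ad("local.h5ad")  # placeholder filename

# Singleton categorical column: a single category shared by every cell.
adata.obs["singleton_category"] = pd.Categorical(["only_value"] * adata.n_obs)

# Singleton continuous column: a single numeric value shared by every cell.
adata.obs["singleton_continuous"] = np.float32(1.0)

# obsm arrays must keep n_obs rows, so one way to approximate a spatial embedding
# covering fewer cells than the non-spatial one is to NaN out a subset of rows.
if "X_spatial" in adata.obsm:
    spatial = adata.obsm["X_spatial"].astype("float64")
    spatial[: adata.n_obs // 10, :] = np.nan
    adata.obsm["X_spatial"] = spatial

adata.write_h5ad("local_with_edge_cases.h5ad")
```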

seve added the testing label Apr 2, 2024
seve self-assigned this Apr 2, 2024
seve (Contributor, Author) commented Apr 2, 2024

Related: chanzuckerberg/cellxgene#1578

seve (Contributor, Author) commented Apr 15, 2024

Aligned with timmy on this ticket: we have a great opportunity right now to consolidate all three datasets (pbmc3k, super-cool-spatial, and truncation-test) into a single one!

kaloster (Contributor) commented May 8, 2024

Following a good discussion with @seve here - summarizing my take on the subject:

On a single consolidated dataset:

  • Having one consolidated dataset has the benefit of being lean: it eliminates the need to keep multiple datasets committed to the repo, with a single dataset covering all edge cases.
  • The downsides would be:
    • We would have a manually modified dataset, diverged from its original form and detached from schema updates and migrations, and the modifications would need to be documented
    • Realistically, that means it would probably go stale quickly with future schema changes and migrations
    • There is a risk of losing context on what was changed and how (even with a process documented in a README, which would be another thing to maintain)
    • Manually modifying a dataset to cover edge cases outside of the validation/migration workflow is generally not good practice and can introduce other issues
    • Overall, it would be one more thing we would have to maintain and document

The option of using actual prod datasets:

  • Find a couple of datasets that cover all the edge cases we need to dev/test against, and keep them unmodified in their original state, committed with their original dataset IDs as CXGs under example-dataset
  • Whenever there's a schema update, download the refreshed h5ad from the assets bucket and regenerate the test CXGs from it (or even just download the regenerated CXG from the hosted bucket and commit it to the repo)
  • No need to maintain and document a modified dataset: just refresh the existing assets
  • Dev and test would be done against up-to-date prod data
  • Tapping into an existing, robust workflow vs creating a new one is generally preferred
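
To make the refresh step concrete, a minimal sketch assuming boto3, with purely illustrative bucket, key, and dataset-id names (the real assets bucket and layout may differ):

```python
import boto3

# Illustrative names only: the real assets bucket, key layout, and dataset id may differ.
dataset_id = "REPLACE_WITH_DATASET_ID"
s3 = boto3.client("s3")
s3.download_file("example-assets-bucket", f"{dataset_id}/local.h5ad", "local.h5ad")
```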

EDIT: I'm fully aligned with the approach Seve suggested below, leaving this here for posterity

tihuan added the P2 Priority 2 - Improvement with narrower impact, fix within a month label May 8, 2024
seve (Contributor, Author) commented May 10, 2024

What if we do the following:

  1. we select a spatial dataset from prod
  2. download the local.h5ad
  3. create a notebook that adds any necessary testing data to the dataset
  4. store that notebook somewhere in the explorer repo
  5. run the notebook against the .h5ad
  6. run the conversion script against the .h5ad
  7. commit the generated .cxg to the repo
  8. repeat as needed (not every schema migration)
  9. nice to have: create a GHA cron job that pulls the dataset from prod weekly, runs the notebook, and commits the cxg to /example-dataset
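
A rough sketch of how steps 5 and 6 could be wired together, assuming papermill for notebook execution; the notebook path, parameter name, and conversion script are placeholders rather than real paths in the repo:

```python
import subprocess
import papermill as pm

# Step 5: execute the edge-case notebook against the downloaded .h5ad.
pm.execute_notebook(
    "test-data/add_edge_cases.ipynb",        # hypothetical notebook from steps 3-4
    "test-data/add_edge_cases.out.ipynb",
    parameters={"h5ad_path": "local.h5ad"},
)

# Step 6: run the h5ad -> CXG conversion script (script name is illustrative).
subprocess.run(
    ["python", "scripts/convert_h5ad_to_cxg.py", "local_with_edge_cases.h5ad"],
    check=True,
)
```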

seve (Contributor, Author) commented May 10, 2024

I believe it's worth noting that our current main test dataset (pbmc3k.cxg) is rarely updated, and I'm not sure it's ever gone through a schema migration.

https://github.com/chanzuckerberg/single-cell-explorer/commits/main/example-dataset/pbmc3k.cxg

seve removed their assignment May 10, 2024
kaloster (Contributor) commented:

I like it and I'm fully aligned with this approach.

Thank you for coming up with the workflow @seve. The only thing I have to contribute is that maybe, as a second phase, it would be cool to implement this as a CLI command as well.

> What if we do the following:
>
>   1. we select a spatial dataset from prod
>   2. download the local.h5ad
>   3. create a notebook that adds any necessary testing data to the dataset
>   4. store that notebook somewhere in the explorer repo
>   5. run the notebook against the .h5ad
>   6. run the conversion script against the .h5ad
>   7. commit the generated .cxg to the repo
>   8. repeat as needed (not every schema migration)
>   9. nice to have: create a GHA cron job that pulls the dataset from prod weekly, runs the notebook, and commits the cxg to /example-dataset
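
On the CLI idea above, a minimal argparse skeleton that could wrap the same steps; all names and defaults are illustrative, and convert_h5ad_to_cxg.py is the same hypothetical script as in the earlier sketch:

```python
import argparse
import subprocess

def regenerate(h5ad: str, notebook: str, out_dir: str) -> None:
    # Hypothetical wiring: execute the edge-case notebook, then convert to CXG.
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", notebook],
        check=True,
    )
    subprocess.run(
        ["python", "scripts/convert_h5ad_to_cxg.py", h5ad, "--output", out_dir],
        check=True,
    )

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Regenerate the example test dataset with edge cases."
    )
    parser.add_argument("--h5ad", default="local.h5ad")
    parser.add_argument("--notebook", default="test-data/add_edge_cases.ipynb")
    parser.add_argument("--out-dir", default="example-dataset")
    args = parser.parse_args()
    regenerate(args.h5ad, args.notebook, args.out_dir)

if __name__ == "__main__":
    main()
```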
