Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

176 review srcdata paths and remove everything besides valid and invalid if possible #180

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

turbomam
Copy link
Member

@turbomam turbomam commented Jan 25, 2024

  • removes src/data/examples
  • decreases the number of example data files in data/unexpected_pass
    • notes on each remaining file are provided below for review by others

@turbomam turbomam marked this pull request as draft January 25, 2024 19:28
@turbomam
Copy link
Member Author

turbomam commented Jan 25, 2024

.../SampleData-jgi_mg_data-colon-dna_sample_name.yaml

in src/data/unexpected_pass/

jgi_mg_data:
  - analysis_type:
      - metagenomics
      - metatranscriptomics
    dna_concentration: 1.23
    dna_cont_type: plate
    dna_cont_well: C3
    dna_container_id: xxx
    dna_dnase: "no"
    dna_isolate_meth: xxx
    dna_project_contact: xxx
    dna_samp_id: xxx
    dna_sample_format: DNAStable
    dna_sample_name: DNA:0546789 # contains colon and should not be allowed 'a-z, A-Z, 0-9, - and _ only'
    dna_seq_project: xxx
    dna_seq_project_name: xxx
    dna_seq_project_pi: xxx
    dna_volume: 0
    proposal_dna: xxx
    samp_name: xxx

The slot definition in src/nmdc_submission_schema/schema/nmdc_submission_schema.yaml mentions the legal characters for dna_sample_name, but doesn't actually define a pattern

      dna_sample_name:
        name: dna_sample_name
        description: Give the DNA sample a name that is meaningful to you. Sample names must be unique across all JGI projects and contain a-z, A-Z, 0-9, - and _ only.
        title: DNA sample name
        examples:
          - value: JGI_pond_041618
        from_schema: https://w3id.org/nmdc/nmdc
        rank: 4
        string_serialization: '{text}'
        owner: Biosample
        domain_of:
          - Biosample
        slot_group: jgi_metagenomics_section
        range: string
        required: true
        recommended: true
        multivalued: false

@turbomam
Copy link
Member Author

turbomam commented Jan 25, 2024

.../SampleData-jgi_mt_data-illegal-rna_seq_project.yaml

same logic for

  • SampleData-jgi_mt_data-illegal-rna_seq_project_name.yaml
  • SampleData-jgi_mt_data-illegal-rna_seq_project_pi.yaml
jgi_mt_data:
  - analysis_type:
      - metatranscriptomics
    dnase_rna: "no"
    proposal_rna: '504000'
    rna_concentration: 100.3
    rna_cont_type: plate
    rna_cont_well: H7
    rna_container_id: cat_ht_1999
    rna_isolate_meth: phenol/cloroform extraction
    rna_project_contact: Leslie Ann Levine
    rna_samp_id: '123456'
    rna_sample_format: MDA reaction buffer
    rna_sample_name: JGI_lagoon_14343
    rna_seq_project: An RNA Sequencing Project # Values will be prefilled by NMDC. Appears to accept any string.
    rna_seq_project_name: JGI Lagoon metatranscritpomics
    rna_seq_project_pi: Patty Smith
    rna_volume: 25.1
    samp_name: sample name
    source_mat_id: MPI:012345

But the schema doesn't actually prevent users from entering data in any of those slots. I believe there may be some separate issue about "locking" some slots/columns, but the schema doesn't know about that.

      rna_seq_project:
        name: rna_seq_project
        title: RNA seq project ID
        comments:
          - Do not edit these values. A template will be provided by NMDC in which these values have been pre-filled.
        examples:
          - value: '1191234'
        from_schema: https://w3id.org/nmdc/nmdc
        rank: 1
        string_serialization: '{text}'
        owner: Biosample
        domain_of:
          - Biosample
        slot_group: jgi_metatranscriptomics_section
        range: string
        required: true
        recommended: true
        multivalued: false

@turbomam
Copy link
Member Author

@bmeluch this build on something we were discussion recently

.../SampleData-jgi_mt_data-too-small-value-rna_volume.yaml

jgi_mt_data:
  analysis_type:
    - metatranscriptomics
  dnase_rna: "no"
  proposal_rna: '504000'
  rna_concentration: 100.3
  rna_cont_type: plate
  rna_cont_well: H7
  rna_container_id: cat_ht_1999
  rna_isolate_meth: phenol/cloroform extraction
  rna_project_contact: Leslie Ann Levine
  rna_samp_id: '123456'
  rna_sample_format: MDA reaction buffer
  rna_sample_name: JGI_lagoon_14343
  rna_seq_project: '123456789'
  rna_seq_project_name: JGI Lagoon metatranscritpomics
  rna_seq_project_pi: Patty Smith
  rna_volume: 24 # According to documention: Values <25 only permitted by special permission.
  samp_name: sample name
  source_mat_id: MPI:012345

the schema says:

      rna_volume:
        name: rna_volume
        title: RNA volume in ul
        comments:
          - Units must be in uL. Enter the numerical part only. Value must be 0-1000. This form accepts values < 25, but JGI may refuse to process them unless permission has been granted by a project manager
        examples:
          - value: '25'
        from_schema: https://w3id.org/nmdc/nmdc
        rank: 6
        string_serialization: '{float}'
        owner: Biosample
        domain_of:
          - Biosample
        slot_group: jgi_metatranscriptomics_section
        range: float
        required: true
        recommended: true
        minimum_value: 0
        maximum_value: 1000

24 isn't below the minimum_value so is valid. This example data file could be moved to the valid directory or deleted.

@turbomam
Copy link
Member Author

turbomam commented Jan 25, 2024

.../SampleData-soil-data-illegal-env_package.yaml

soil_data:
  - elev: 123
    analysis_type:
      - metagenomics
      - metatranscriptomics
    collection_date: "2022-01-15"
    depth: "0 - 1"
    env_broad_scale: "______urban biome [ENVO:01000249]"
    env_local_scale: "________tunnel [ENVO:00000068]"
    env_medium: "bare soil [ENVO:01001616]"
    geo_loc_name: "USA: Nowhere, Oklahoma"
    growth_facil: greenhouse
    lat_lon: 35.211445 -98.464612
    samp_name: b
    samp_store_temp: "-80 Celsius"
    store_cond: frozen

env_package is defined as a slot in nmdc-schema, but not in the submission schema, and this file doesn't assert one.

It looks like most Biosample records in MongoDB don't have an env_package value. Those that do use the has_raw_value of a TextValue to capture an EnvO CURIe.

I recommend that we delete this file.

@turbomam
Copy link
Member Author

turbomam commented Jan 25, 2024

.../SampleData-soil-data-long-decimal-lat_lon.yaml

sediment_data:
  - analysis_type:
      - metagenomics
      - metatranscriptomics
    collection_date: "2022-01-15"
    depth: "0-.1"
    elev: 0
    env_broad_scale: "______urban biome [ENVO:01000249]"
    env_local_scale: "________tunnel [ENVO:00000068]"
    env_medium: "bare soil [ENVO:01001616]"
    geo_loc_name: "USA: Nowhere, Oklahoma"
    lat_lon: 35.2114459 -98.4646126787809678678
    samp_name: a
    samp_store_temp: "-80 Celsius"

I treid to assert a pattern that would limit the number of digits after the decimal point but maybe I got something wrong.

      lat_lon:
        name: lat_lon
        description: The geographical origin of the sample as defined by latitude and longitude. The values should be reported in decimal degrees and in WGS84 system
        title: geographic location (latitude and longitude)
        notes:
          - This is currently a required field but it's not clear if this should be required for human hosts
        from_schema: https://w3id.org/nmdc/nmdc
        rank: 5
        is_a: environment field
        string_serialization: '{lat lon}'
        slot_uri: MIXS:0000009
        multivalued: false
        owner: Biosample
        domain_of:
          - FieldResearchSite
          - Biosample
        slot_group: mixs_modified_section
        range: string
        required: true
        pattern: ^[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?)\s[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)$

regexr agrees that the lat_lon value matches the pattern. Back to the drawing board.

@turbomam turbomam marked this pull request as ready for review January 25, 2024 20:16
@turbomam turbomam marked this pull request as draft January 25, 2024 20:21
@mslarae13
Copy link
Contributor

decreasing the number of examples that pass even though they should be invalid

Have less directories in https://github.com/microbiomedata/submission-schema/tree/main/src/data

  • remove examples, move that file into valid or invalid
  • Resolve 'unexpected pass'

@turbomam will check out this branch and determine if it can be merged

@mslarae13
Copy link
Contributor

@turbomam did you get a chance to evaluate this before the freeze?

@bmeluch was going to look at example files. If not, maybe she can help?

@ssarrafan
Copy link

@bmeluch @turbomam can you please respond to @mslarae13 questions from 2 weeks ago on this issue? Thanks!
I'll move this to the current sprint.

@ssarrafan
Copy link

Removing this one from the sprint. No updates in the last month.

@ssarrafan ssarrafan added the backlog backlogged to be reprioritized in the future label Jun 28, 2024
@turbomam
Copy link
Member Author

turbomam commented Jul 3, 2024

I'm doing the cleanup now

@mslarae13
Copy link
Contributor

Hey @turbomam is this one ready for review?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlog backlogged to be reprioritized in the future
Projects
None yet
Development

Successfully merging this pull request may close these issues.

review src/data paths and remove everything besides valid and invalid if possible
3 participants