
[New Feature]: DISP-S1 Support for Validator Tool #944

Open
riverma opened this issue Aug 14, 2024 · 10 comments
@riverma
Collaborator

riverma commented Aug 14, 2024

Checked for duplicates

Yes - I've already checked

Alternatives considered

Yes - and alternatives don't suffice

Related problems

The DSWx-S1 validator tool currently supports only DSWx-S1 products. We'd like to ensure DISP-S1 products are supported as well.

Describe the feature request

Sample logic:

  1. Query for a set of CSLC products available between START_TIME and END_TIME
  2. For each CSLC product that is available, for example with the ID t087_012345_iw2, use the burst_to_frame.json file to locate the frame IDs that correspond to this CSLC product.
    • Result: The frame ID is 45.
  3. Next, use the frame_to_burst.json file to identify all the CSLC burst IDs expected for this frame.
    • There are 27 bursts associated with this frame, for example: [t087_012340_iw1, …].
  4. Verify whether all 27 CSLC burst products have been generated or are available.
    • If all 27 CSLC bursts have been generated, we expect a corresponding DISP-S1 product with frame 45 that references those bursts.
    • If any burst is missing, then we expect generation of the corresponding DISP-S1 product with frame 45 to have been skipped.

Some key resources needed:

  • Access to CMR for CSLC and DISP-S1 queries
  • The frame_to_burst.json or burst_to_frame.json
  • Metadata in DISP-S1 product that lists input CSLCs used
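
A minimal sketch of steps 1-4 in Python is below. The helpers query_cmr_cslc, expect_disp_s1_product, and expect_no_disp_s1_product, and the exact JSON layouts, are assumptions for illustration, not the real tool's API:

    import json

    def validate_frames(start_time, end_time):
        with open("burst_to_frame.json") as f:
            burst_to_frame = json.load(f)
        with open("frame_to_burst.json") as f:
            frame_to_burst = json.load(f)

        # Step 1: query CMR for available CSLC burst IDs, e.g. "t087_012345_iw2"
        # (hypothetical helper)
        available = set(query_cmr_cslc(start_time, end_time))

        for cslc_id in available:
            # Step 2: locate the frame ID(s) this CSLC product corresponds to
            for frame_id in burst_to_frame[cslc_id]:
                # Step 3: all CSLC burst IDs expected for this frame (27 in the example)
                expected = set(frame_to_burst[str(frame_id)])
                # Step 4: complete coverage -> a DISP-S1 product should exist;
                # otherwise its generation should have been skipped
                if expected <= available:
                    expect_disp_s1_product(frame_id, expected)
                else:
                    expect_no_disp_s1_product(frame_id)
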
@riverma added the "enhancement" (New feature or request) and "needs triage" (Issue that requires triage) labels Aug 14, 2024
@riverma
Collaborator Author

riverma commented Aug 14, 2024

@philipjyoon - did I capture the logic correctly? The above would apply for FWD or HIST regardless, assuming enough time has passed.

@philipjyoon
Contributor

@riverma There are a few more dimensions to this:

  1. Not all frames are produced using 27 bursts. Instead of burst_to_frame.json and frame_to_burst.json, we should use opera-disp-s1-consistent-burst-ids-with-datetimes.json, which contains the real burst-pattern information. This is the file OPERA PCM uses - it does not use the former two files.
  2. In addition to Frame ID, we also need to group by, and reason over, the acquisition time index. Within a 12+ day window you can end up with more than one set of CSLC bursts that share the same Frame ID but belong to different DISP-S1 products.
  3. We need to account for Compressed CSLC availability before deciding whether or not a CSLC should have been part of existing DISP-S1 products. We can perhaps get around this by requiring the end date of the validation window to be, say, at least 24 days before the current date. That way, some of the lag in the system will have been resolved by the time of validation.

We discussed one more dimension, which we hadn't decided was worth the complexity: verifying the K and M files used as inputs in producing the DISP-S1 products. There are two ways to look at those K and M input files:

  1. They are just another type of ancillary file, like DEM and Ionosphere files. In this case we can follow the precedent set in previous CMR audits and not validate these ancillary files.
  2. K and M files are uniquely critical to DISP-S1 product quality, beyond the other ancillary files. In this view, we should reason over and validate the input files listed in the DISP-S1 product metadata.

(to be continued... I'll write out what I think should be the overall logic tomorrow morning)

@philipjyoon
Contributor

Sample logic:

  1. Query for a set of CSLC products available between START_TIME and END_TIME
  2. For each CSLC product that is available, we want to group them by Frame ID and then by Acquisition Day Index. I would use a Python dictionary of a dictionary of a list to do this. This function can be used to determine those two: https://github.com/nasa/opera-sds-pcm/blob/develop/data_subscriber/cslc_utils.py#L312
    • So you would have something like: { 45: {600: [CSLC1, CSLC2, ...], 612: [CSLC1, ...]}, 48: {...}}
    • We could also push all of this logic into the data structure and make it a dict of a dict of a dict: the innermost dict would map Burst ID to CSLC Native ID, which would simplify Step 3 below a bit.
  3. Next, iterate over that data structure per Frame ID per Acquisition Day Index. You will end up with a list of CSLC IDs that may be complete for DISP-S1 triggering. So we evaluate each of these lists:
    • For every item in each list, determine the Burst ID, using the function above, and then create a unique hashset of them
      • If we wish to also validate the input files of DISP-S1 products, we would use a dict mapping Burst ID to CSLC Native ID instead of a hashset here. We would also have to evaluate the production time at the time of insertion, to make sure we track the latest CSLC file in case of a Burst ID collision.
    • Look up that Frame ID in opera-disp-s1-consistent-burst-ids-with-datetimes.json and determine whether the number of bursts found matches the number required. If so, a corresponding DISP-S1 product should have been created.
    • A DISP-S1 Native ID from CMR looks something like this: OPERA_L3_DISP-S1_IW_F03050_VV_20240709T000000Z_20240814T000000Z_v0.3_20240815T133432Z. This is documented here: https://github.com/nasa/opera-sds-pcm/blob/develop/conf/pge_outputs.yaml#L152 The two important fields are the Frame ID and the "sec_time", which I believe is the Sensing Time or the Acquisition Time.
    • We can then concoct a native-ID pattern to find the corresponding DISP-S1 product from CMR. It will be something like OPERA_L3_DISP-S1_IW_F03050_VV_20240709T000000Z*. Use that pattern to query CMR to find that product.
      • A tricky part: note that the acquisition time used here only has day precision - the time has been stripped away. Each CSLC burst is acquired within tens of seconds of the others, so it's possible that some may cross the day boundary. Therefore, if we don't find a DISP-S1 product using the exact day, we should also search +/- one day. This is rare but possible.
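
A rough sketch of the grouping and CMR lookup above; parse_frame_and_day_index, burst_id_of, acquisition_date, cmr_find_native_id, and report_missing_disp_s1 are hypothetical stand-ins for the real helpers in cslc_utils.py and the CMR client:

    from collections import defaultdict
    from datetime import timedelta

    def check_disp_s1_triggering(cslc_ids, consistent_burst_ids):
        # Step 2: group into {Frame ID: {Acquisition Day Index: [CSLC ID, ...]}}
        grouped = defaultdict(lambda: defaultdict(list))
        for cslc_id in cslc_ids:
            frame_id, acq_day_index = parse_frame_and_day_index(cslc_id)
            grouped[frame_id][acq_day_index].append(cslc_id)

        # Step 3: evaluate each per-frame, per-day-index list
        for frame_id, by_day in grouped.items():
            # Expected bursts per opera-disp-s1-consistent-burst-ids-with-datetimes.json
            expected = consistent_burst_ids[str(frame_id)]
            for acq_day_index, cslc_list in by_day.items():
                burst_ids = {burst_id_of(c) for c in cslc_list}  # unique hashset
                if len(burst_ids) != len(expected):
                    continue  # incomplete frame: no DISP-S1 product expected
                # Complete frame: search CMR by native-ID pattern, retrying
                # +/- one day for bursts that cross the day boundary
                acq_date = acquisition_date(frame_id, acq_day_index)
                for offset in (0, -1, 1):
                    day = acq_date + timedelta(days=offset)
                    pattern = f"OPERA_L3_DISP-S1_IW_F{frame_id:05d}_VV_{day:%Y%m%d}T000000Z*"
                    if cmr_find_native_id(pattern):
                        break
                else:
                    report_missing_disp_s1(frame_id, acq_day_index)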

Comparison Options:

  1. As mentioned previously, if we also want to take the M dependency into account, we would have to expand this logic accordingly.
  2. The above logic only checks whether the right DISP-S1 product has been produced; it does not check whether it was produced using all the correct CSLC input files. To perform the latter, we need to obtain the full metadata, which is not available from CMR to my knowledge. We can obtain it in two ways:
    1. Download the actual DISP-S1 product from ASF DAAC, open it up, and extract the full metadata. This is costly, since these are large files, and we would also need to write code to open them. The PCM does not have such a code base right now.
    2. If we are running this validation on a cluster that contains these products, it would be much better to query the GRQ ES for each product's metadata. This would be orders of magnitude cheaper, faster, and easier than the option above.
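
A sketch of option 2, assuming a GRQ Elasticsearch endpoint is reachable; the index pattern, document fields, and metadata key are guesses, not the real GRQ schema:

    import requests

    def disp_s1_input_cslcs(grq_es_url, native_id):
        # Look up one DISP-S1 product document by its native ID (assumed field name)
        query = {"query": {"term": {"id.keyword": native_id}}, "size": 1}
        resp = requests.get(f"{grq_es_url}/grq_*_l3_disp_s1*/_search", json=query)
        resp.raise_for_status()
        hits = resp.json()["hits"]["hits"]
        # Assumed metadata field listing the input CSLC granules
        return hits[0]["_source"]["metadata"]["input_granules"] if hits else []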

Some key resources needed:

  • Access to CMR for CSLC and DISP-S1 queries
  • The frame_to_burst.json or burst_to_frame.json
  • (Possibly) Metadata in DISP-S1 product that lists input CSLCs used

@riverma
Collaborator Author

riverma commented Aug 16, 2024

@philipjyoon - thank you so much for writing out these excellent and clear points! Extremely helpful.

I have a few follow-up questions:

We can then concoct a native-ID pattern to find the corresponding DISP-S1 product from CMR. It will be something like OPERA_L3_DISP-S1_IW_F03050_VV_20240709T000000Z*. Use that pattern to query CMR to find that product.
A tricky part: note that the acquisition time used here only has day precision - the time has been stripped away. Each CSLC burst is acquired within tens of seconds of the others, so it's possible that some may cross the day boundary. Therefore, if we don't find a DISP-S1 product using the exact day, we should also search +/- one day. This is rare but possible.

Hmm, can't we just use the same strategy we did for DSWx-S1? Namely:

  1. Get a listing of all CSLC products between START and END, call this "LIST A"
  2. Go through the logic above to get a list of DISP-S1 frames (grouped by acquisition time) that have complete CSLC coverage. Call the full list of CSLCs that cover complete frames "LIST B"
  3. Query CMR for all DISP-S1 products between START and END with the same acquisition (sensing) time as the earliest and latest CSLCs from LIST A. Aggregate the list of CSLCs mentioned within the metadata field "InputGranules" from all available DISP-S1 products in this window, and call this list of CSLCs "LIST C"
  4. Compare LIST B with LIST C and note any discrepancies (see the sketch after this list):
    • If LIST B has more CSLCs than LIST C, then we have incomplete DISP-S1 products
    • If LIST C has more CSLCs than LIST B, then we used too many, or the wrong, CSLCs for processing
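
A sketch of the LIST B / LIST C comparison in step 4; "InputGranules" is the CMR metadata field named above, and the argument shapes are illustrative:

    def compare_cslc_lists(list_b_cslcs, disp_s1_granules):
        list_b = set(list_b_cslcs)  # CSLCs covering complete frames (step 2)
        list_c = set()
        for granule in disp_s1_granules:  # CMR UMM-G records from the step-3 query
            list_c.update(granule.get("umm", {}).get("InputGranules", []))
        missing = list_b - list_c  # incomplete DISP-S1 products
        extra = list_c - list_b    # too many / wrong CSLCs used for processing
        return missing, extra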

The above logic only checks whether the right DISP-S1 product has been produced; it does not check whether it was produced using all the correct CSLC input files. To perform the latter, we need to obtain the full metadata, which is not available from CMR to my knowledge. We can obtain it in two ways:

The logic I mentioned in the above quote would tell us exactly which CSLCs we should have used. Am I missing something? How would we not know this?

This function can be used to determine those two: https://github.com/nasa/opera-sds-pcm/blob/develop/data_subscriber/cslc_utils.py#L312

Do you have a recommendation on how to import your code? I'm assuming we don't have published packages. Currently the auditing tools live within /report.

@philipjyoon
Contributor

@riverma I did not realize that the CMR query also returns InputGranules. If that's the case, yes, what you've outlined would work.

You can use the code here as a general guideline for using cslc_utils.py: https://github.com/nasa/opera-sds-pcm/blob/develop/tests/data_subscriber/test_cslc_util.py
You can import it via from data_subscriber import cslc_utils on a deployed system that already has the data_subscriber package installed. If you wish to install this package independently of deploying a cluster, we'd have to do a bit of research (I think it's possible).
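
For example (a sketch; the call shapes are assumptions modeled on test_cslc_util.py and should be verified there):

    # Assumes a deployed cluster where the data_subscriber package is importable
    from data_subscriber import cslc_utils

    # Hypothetical signature -- check test_cslc_util.py for the real one
    disp_burst_map, burst_to_frames, datetimes = cslc_utils.localize_disp_frame_burst_hist()

    def frame_ids_for(cslc_native_id):
        # Hypothetical signature -- check test_cslc_util.py for the real one
        return cslc_utils.parse_cslc_native_id(cslc_native_id, burst_to_frames, disp_burst_map)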

@riverma riverma self-assigned this Aug 16, 2024
@riverma
Collaborator Author

riverma commented Aug 27, 2024

Next steps based on discussions:

  • Utilize opera-disp-s1-consistent-burst-ids-with-datetimes.json rather than opera-s1-disp-0.5.0-frame-to-burst.json, since the former is a subset of the latter (just the CSLC products we need to match)
  • Validation beyond a single orbit's worth of data is currently not supported. Is this a use case we want to support? If so, we'll need to update map_cslc_bursts_to_frames to include the acquisition date
  • The function def validate_disp_s1(smallest_date, greatest_date, endpoint, df): needs to be updated to reflect testing once ASF DAAC has successfully ingested DISP-S1 products to UAT. That way we can test the tool in real-world conditions.
  • Refactor the codebase to utilize @philipjyoon's PCM DISP-S1 triggering logic utils

@philipjyoon
Contributor

This tool must also take blackout dates into account.

@philipjyoon
Contributor

We've been told by ASF that the input CSLC granule list CANNOT be stored in CMR because doing so would break CMR. So we will need to implement this logic without that information, as @philipjyoon described in a comment on Aug 15, 2024.

philipjyoon added a commit that referenced this issue Nov 22, 2024
philipjyoon added a commit that referenced this issue Dec 11, 2024
philipjyoon added a commit that referenced this issue Dec 16, 2024
…rmining triggering logic for DISP-S1 processing for validation purposes
philipjyoon added a commit that referenced this issue Dec 18, 2024
…ate DataFrame. Create DataFrame from CSLC analysis result
philipjyoon added a commit that referenced this issue Dec 18, 2024
philipjyoon added a commit that referenced this issue Dec 19, 2024
@philipjyoon
Contributor

While DISP-S1 forward processing still needs some work, we can use it to test the high-level functionality of the validator. To do so, we run
python ~/mozart/ops/opera-pcm/data_subscriber/daac_data_subscriber.py query -c OPERA_L2_CSLC-S1_V1 --start-date=2024-12-15T08:00:00Z --end-date=2024-12-15T09:00:00Z --chunk-size=2 --k=2 --m=1 --job-queue=opera-job_worker-cslc_data_download --processing-mode=forward
which submits 33 download jobs. When all the products have been produced, we can run validation using the --disp_s1_validate_with_grq functionality, without having to deliver products to the DAAC.

@philipjyoon
Contributor

I tested DISP-S1 historical-mode validation in two ways:

  1. The INT POP1 cluster had processed the first two runs of frame 11116 in historical mode and delivered them to the ASF UAT DAAC. I ran the validator against it, and it matched up all CSLC files to all products.
  2. On my dev cluster I created a historical batch_proc that runs over 2 years using --k=2 --m=1 (for faster processing), and then validated without using the delivery DAAC, via the --disp_s1_validate_with_grq option. All CSLC granules matched up with all the products.
