
[New Feature]: DISP-S1 Support for Validator Tool #944

Open
riverma opened this issue Aug 14, 2024 · 10 comments
@riverma
Collaborator

riverma commented Aug 14, 2024

Checked for duplicates

Yes - I've already checked

Alternatives considered

Yes - and alternatives don't suffice

Related problems

The DSWx-S1 validator tool currently supports only DSWx-S1 products. We'd like to ensure DISP-S1 products are supported as well.

Describe the feature request

Sample logic:

  1. Query for a set of CSLC products available between START_TIME and END_TIME
  2. For each CSLC product that is available, for example with the ID t087_012345_iw2, use the burst_to_frame.json file to locate the frame IDs that correspond to this CSLC product.
    • Result: The frame ID is 45.
  3. Next, use the frame_to_burst.json file to identify all the CSLC burst IDs expected for this frame.
    • There are 27 bursts associated with this frame, for example: [t087_012340_iw1, …].
  4. Verify whether all 27 CSLC burst products have been generated or are available.
    • If all 27 CSLC bursts have been generated, we expect a corresponding DISP-S1 product with frame 45 that references those bursts.
    • If any burst is missing, then we expect generation of the corresponding DISP-S1 product with frame 45 to have been skipped.

Some key resources needed:

  • Access to CMR for CSLC and DISP-S1 queries
  • The frame_to_burst.json or burst_to_frame.json
  • Metadata in DISP-S1 product that lists input CSLCs used
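
A minimal sketch of steps 1-4 in Python is below. The helpers query_cmr_cslc, expect_disp_s1_product, and expect_no_disp_s1_product, and the exact JSON layouts, are assumptions for illustration, not the real tool's API:

    import json

    def validate_frames(start_time, end_time):
        with open("burst_to_frame.json") as f:
            burst_to_frame = json.load(f)
        with open("frame_to_burst.json") as f:
            frame_to_burst = json.load(f)

        # Step 1: query CMR for available CSLC burst IDs, e.g. "t087_012345_iw2"
        # (hypothetical helper)
        available = set(query_cmr_cslc(start_time, end_time))

        for cslc_id in available:
            # Step 2: locate the frame ID(s) this CSLC product corresponds to
            for frame_id in burst_to_frame[cslc_id]:
                # Step 3: all CSLC burst IDs expected for this frame (27 in the example)
                expected = set(frame_to_burst[str(frame_id)])
                # Step 4: complete coverage -> a DISP-S1 product should exist;
                # otherwise its generation should have been skipped
                if expected <= available:
                    expect_disp_s1_product(frame_id, expected)
                else:
                    expect_no_disp_s1_product(frame_id)
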
@riverma added the "enhancement" (New feature or request) and "needs triage" (Issue that requires triage) labels Aug 14, 2024
@riverma
Collaborator Author

riverma commented Aug 14, 2024

@philipjyoon - did I capture the logic correctly? The above would apply for FWD or HIST regardless, assuming enough time has passed.

@philipjyoon
Contributor

@riverma There are a few more dimensions to this:

  1. Not all frames are produced using 27 bursts. Instead of burst_to_frame.json and frame_to_burst.json, we should use opera-disp-s1-consistent-burst-ids-with-datetimes.json, which contains the real burst-pattern information. This is the file OPERA PCM uses - it does not use the former two files.
  2. In addition to Frame ID, we also need to group by, and reason over, the acquisition time index. Within a 12+ day window you can end up with more than one set of CSLC bursts that share the same Frame ID but belong to different DISP-S1 products.
  3. We need to account for Compressed CSLC availability before deciding whether or not a CSLC should have been part of existing DISP-S1 products. We can perhaps get around this by requiring the end date of the validation window to be, say, at least 24 days before the current date. That way, some of the lag in the system will have been resolved by the time of validation.

We discussed one more dimension, which we hadn't decided was worth the complexity: verifying the K and M files used as inputs in producing the DISP-S1 products. There are two ways to look at those K and M input files:

  1. They are just another type of ancillary file, like DEM and Ionosphere files. In this case we can follow the precedent set in previous CMR audits and not validate these ancillary files.
  2. K and M files are uniquely critical to DISP-S1 product quality, beyond the other ancillary files. In this view, we should reason over and validate the input files listed in the DISP-S1 product metadata.

(to be continued... I'll write out what I think should be the overall logic tomorrow morning)

@philipjyoon
Contributor

Sample logic:

  1. Query for a set of CSLC products available between START_TIME and END_TIME
  2. For each CSLC product that is available, we want to group them by Frame ID and then by Acquisition Day Index. I would use a Python dictionary of a dictionary of a list to do this. This function can be used to determine those two: https://github.com/nasa/opera-sds-pcm/blob/develop/data_subscriber/cslc_utils.py#L312
    • So you would have something like: { 45: {600: [CSLC1, CSLC2, ...], 612: [CSLC1, ...]}, 48: {...}}
    • We could also push all of this logic into the data structure and make it a dict of a dict of a dict: the innermost dict would map Burst ID to CSLC Native ID, which would simplify Step 3 below a bit.
  3. Next, iterate over that data structure per Frame ID per Acquisition Day Index. You will end up with a list of CSLC IDs that may be complete for DISP-S1 triggering. So we evaluate each of these lists:
    • For every item in each list, determine the Burst ID, using the function above, and then create a unique hashset of them
      • If we wish to also validate the input files of DISP-S1 products, we would use a dict mapping Burst ID to CSLC Native ID instead of a hashset here. We would also have to evaluate the production time at the time of insertion, to make sure we track the latest CSLC file in case of a Burst ID collision.
    • Look up that Frame ID in opera-disp-s1-consistent-burst-ids-with-datetimes.json and determine whether the number of bursts found matches the number required. If so, a corresponding DISP-S1 product should have been created.
    • A DISP-S1 Native ID from CMR looks something like this: OPERA_L3_DISP-S1_IW_F03050_VV_20240709T000000Z_20240814T000000Z_v0.3_20240815T133432Z. This is documented here: https://github.com/nasa/opera-sds-pcm/blob/develop/conf/pge_outputs.yaml#L152 The two important fields are the Frame ID and the "sec_time", which I believe is the Sensing Time or the Acquisition Time.
    • We can then concoct a native-ID pattern to find the corresponding DISP-S1 product from CMR. It will be something like OPERA_L3_DISP-S1_IW_F03050_VV_20240709T000000Z*. Use that pattern to query CMR to find that product.
      • A tricky part: note that the acquisition time used here only has day precision - the time has been stripped away. Each CSLC burst is acquired within tens of seconds of the others, so it's possible that some may cross the day boundary. Therefore, if we don't find a DISP-S1 product using the exact day, we should also search +/- one day. This is rare but possible.
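
A rough sketch of the grouping and CMR lookup above; parse_frame_and_day_index, burst_id_of, acquisition_date, cmr_find_native_id, and report_missing_disp_s1 are hypothetical stand-ins for the real helpers in cslc_utils.py and the CMR client:

    from collections import defaultdict
    from datetime import timedelta

    def check_disp_s1_triggering(cslc_ids, consistent_burst_ids):
        # Step 2: group into {Frame ID: {Acquisition Day Index: [CSLC ID, ...]}}
        grouped = defaultdict(lambda: defaultdict(list))
        for cslc_id in cslc_ids:
            frame_id, acq_day_index = parse_frame_and_day_index(cslc_id)
            grouped[frame_id][acq_day_index].append(cslc_id)

        # Step 3: evaluate each per-frame, per-day-index list
        for frame_id, by_day in grouped.items():
            # Expected bursts per opera-disp-s1-consistent-burst-ids-with-datetimes.json
            expected = consistent_burst_ids[str(frame_id)]
            for acq_day_index, cslc_list in by_day.items():
                burst_ids = {burst_id_of(c) for c in cslc_list}  # unique hashset
                if len(burst_ids) != len(expected):
                    continue  # incomplete frame: no DISP-S1 product expected
                # Complete frame: search CMR by native-ID pattern, retrying
                # +/- one day for bursts that cross the day boundary
                acq_date = acquisition_date(frame_id, acq_day_index)
                for offset in (0, -1, 1):
                    day = acq_date + timedelta(days=offset)
                    pattern = f"OPERA_L3_DISP-S1_IW_F{frame_id:05d}_VV_{day:%Y%m%d}T000000Z*"
                    if cmr_find_native_id(pattern):
                        break
                else:
                    report_missing_disp_s1(frame_id, acq_day_index)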

Comparison Options:

  1. As mentioned previously, if we also want to take the M dependency into account, we would have to expand this logic accordingly.
  2. The above logic only checks whether the right DISP-S1 product has been produced; it does not check whether it was produced using all the correct CSLC input files. To perform the latter, we need to obtain the full metadata, which is not available from CMR to my knowledge. We can obtain it in two ways:
    1. Download the actual DISP-S1 product from ASF DAAC, open it up, and extract the full metadata. This is costly, since these are large files, and we would also need to write code to open them. The PCM does not have such a code base right now.
    2. If we are running this validation on a cluster that contains these products, it would be much better to query the GRQ ES for each product's metadata. This would be orders of magnitude cheaper, faster, and easier than the option above.
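
A sketch of option 2, assuming a GRQ Elasticsearch endpoint is reachable; the index pattern, document fields, and metadata key are guesses, not the real GRQ schema:

    import requests

    def disp_s1_input_cslcs(grq_es_url, native_id):
        # Look up one DISP-S1 product document by its native ID (assumed field name)
        query = {"query": {"term": {"id.keyword": native_id}}, "size": 1}
        resp = requests.get(f"{grq_es_url}/grq_*_l3_disp_s1*/_search", json=query)
        resp.raise_for_status()
        hits = resp.json()["hits"]["hits"]
        # Assumed metadata field listing the input CSLC granules
        return hits[0]["_source"]["metadata"]["input_granules"] if hits else []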

Some key resources needed:

  • Access to CMR for CSLC and DISP-S1 queries
  • The frame_to_burst.json or burst_to_frame.json
  • (Possibly) Metadata in DISP-S1 product that lists input CSLCs used

@riverma
Collaborator Author

riverma commented Aug 16, 2024

@philipjyoon - thank you so much for writing out these excellent and clear points! Extremely helpful.

I have a few follow-up questions:

We can then concoct a native-ID pattern to find the corresponding DISP-S1 product from CMR. It will be something like OPERA_L3_DISP-S1_IW_F03050_VV_20240709T000000Z*. Use that pattern to query CMR to find that product.
A tricky part: note that the acquisition time used here only has day precision - the time has been stripped away. Each CSLC burst is acquired within tens of seconds of the others, so it's possible that some may cross the day boundary. Therefore, if we don't find a DISP-S1 product using the exact day, we should also search +/- one day. This is rare but possible.

Hmm, can't we just use the same strategy we did for DSWx-S1? Namely:

  1. Get a listing of all CSLC products between START and END, call this "LIST A"
  2. Go through the logic above to get a list of DISP-S1 frames (grouped by acquisition time) that have complete CSLC coverage. Call the full list of CSLCs that cover complete frames "LIST B"
  3. Query CMR for all DISP-S1 products between START and END with the same acquisition (sensing) time as the earliest and latest CSLCs from LIST A. Aggregate the list of CSLCs mentioned within the metadata field "InputGranules" from all available DISP-S1 products in this window, and call this list of CSLCs "LIST C"
  4. Compare LIST B with LIST C and note any discrepancies (see the sketch after this list):
    • If LIST B has more CSLCs than LIST C, then we have incomplete DISP-S1 products
    • If LIST C has more CSLCs than LIST B, then we used too many, or the wrong, CSLCs for processing
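
A sketch of the LIST B / LIST C comparison in step 4; "InputGranules" is the CMR metadata field named above, and the argument shapes are illustrative:

    def compare_cslc_lists(list_b_cslcs, disp_s1_granules):
        list_b = set(list_b_cslcs)  # CSLCs covering complete frames (step 2)
        list_c = set()
        for granule in disp_s1_granules:  # CMR UMM-G records from the step-3 query
            list_c.update(granule.get("umm", {}).get("InputGranules", []))
        missing = list_b - list_c  # incomplete DISP-S1 products
        extra = list_c - list_b    # too many / wrong CSLCs used for processing
        return missing, extra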

The above logic only checks whether the right DISP-S1 product has been produced; it does not check whether it was produced using all the correct CSLC input files. To perform the latter, we need to obtain the full metadata, which is not available from CMR to my knowledge. We can obtain it in two ways:

The logic I mentioned in the above quote would tell us exactly which CSLCs we should have used. Am I missing something? How would we not know this?

This function can be used to determine those two: https://github.com/nasa/opera-sds-pcm/blob/develop/data_subscriber/cslc_utils.py#L312

Do you have a recommendation on how to import your code? I'm assuming we don't have published packages. Currently the auditing tools live within /report.

@philipjyoon
Contributor

@riverma I did not realize that the CMR query also returns InputGranules. If that's the case, yes, what you've outlined would work.

You can use the code here as a general guideline for using cslc_utils.py: https://github.com/nasa/opera-sds-pcm/blob/develop/tests/data_subscriber/test_cslc_util.py
You can import it via from data_subscriber import cslc_utils on a deployed system that already has the data_subscriber package installed. If you wish to install this package independently of deploying a cluster, we'd have to do a bit of research (I think it's possible).
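
For example (a sketch; the call shapes are assumptions modeled on test_cslc_util.py and should be verified there):

    # Assumes a deployed cluster where the data_subscriber package is importable
    from data_subscriber import cslc_utils

    # Hypothetical signature -- check test_cslc_util.py for the real one
    disp_burst_map, burst_to_frames, datetimes = cslc_utils.localize_disp_frame_burst_hist()

    def frame_ids_for(cslc_native_id):
        # Hypothetical signature -- check test_cslc_util.py for the real one
        return cslc_utils.parse_cslc_native_id(cslc_native_id, burst_to_frames, disp_burst_map)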

@riverma riverma self-assigned this Aug 16, 2024
@riverma
Collaborator Author

riverma commented Aug 27, 2024

Next steps based on discussions:

  • Utilize opera-disp-s1-consistent-burst-ids-with-datetimes.json rather than opera-s1-disp-0.5.0-frame-to-burst.json, since the former is a subset of the latter (just the CSLC products we need to match)
  • Validation beyond a single orbit's worth of data is currently not supported. Is this a use case we want to support? If so, we'll need to update map_cslc_bursts_to_frames to include the acquisition date
  • The function def validate_disp_s1(smallest_date, greatest_date, endpoint, df): needs to be updated to reflect testing once ASF DAAC has successfully ingested DISP-S1 products to UAT. That way we can test the tool in real-world conditions.
  • Refactor the codebase to utilize @philipjyoon's PCM DISP-S1 triggering logic utils

@philipjyoon
Contributor

This tool must also take blackout dates into account.

@philipjyoon
Contributor

We've been told by ASF that the input CSLC granule list CANNOT be stored in CMR because doing so would break CMR. So we will need to implement this logic without that information, as @philipjyoon described in a comment on Aug 15, 2024.

philipjyoon added a commit that referenced this issue Nov 22, 2024
philipjyoon added a commit that referenced this issue Dec 11, 2024
philipjyoon added a commit that referenced this issue Dec 16, 2024
…rmining triggering logic for DISP-S1 processing for validation purposes
philipjyoon added a commit that referenced this issue Dec 18, 2024
…ate DataFrame. Create DataFrame from CSLC analysis result
philipjyoon added a commit that referenced this issue Dec 18, 2024
philipjyoon added a commit that referenced this issue Dec 19, 2024
@philipjyoon
Contributor

While DISP-S1 forward processing still needs some work, we can use it to test the high-level functionality of the validator. To do so, we run
python ~/mozart/ops/opera-pcm/data_subscriber/daac_data_subscriber.py query -c OPERA_L2_CSLC-S1_V1 --start-date=2024-12-15T08:00:00Z --end-date=2024-12-15T09:00:00Z --chunk-size=2 --k=2 --m=1 --job-queue=opera-job_worker-cslc_data_download --processing-mode=forward
which submits 33 download jobs. When all the products have been produced, we can run validation using the --disp_s1_validate_with_grq functionality, without having to deliver products to the DAAC.

@philipjyoon
Contributor

I tested DISP-S1 historical-mode validation in two ways:

  1. The INT POP1 cluster had processed the first two runs of frame 11116 in historical mode and delivered them to the ASF UAT DAAC. I ran the validator against it, and it matched up all CSLC files to all products.
  2. On my dev cluster I created a historical batch_proc that runs over 2 years using --k=2 --m=1 (for faster processing), and then validated without using the delivery DAAC, via the --disp_s1_validate_with_grq option. All CSLC granules matched up with all the products.
