Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change default filename for cf writer to be compatible with satpy_cf_nc reader #1637

Merged
merged 11 commits into from
Jun 24, 2024

Conversation

BENR0
Copy link
Collaborator

@BENR0 BENR0 commented Apr 13, 2021

The default filename pattern of the cf writer is changed to the format the satpy_cf_nc reader expects. Also the "instrument" attribute which the satpy_cf_nc reader expects is now added by the reader.

I have been trying out the satpy_cf_nc reader and stumbled over some things.

  1. The reader expects a certain filename format which is documented. While this is not a big point never the less I think to be consistent the cf writer should by default use the same filename format because as a user I would expect to be able to read a file written with the default settings.
  2. The reader also expects a global attribute instrument in the get_sensor function of the reader. This is currently not documented and is not added by the cf writer but needs to be added by the user by supplying the header_attrs parameter. With the same argument I think the reader should add this.

Apart from this I am wondering why the attribute is named instrument in the satpy_cf_nc reader while everywhere else it seems to be named sensor? Maybe this should be renamed?
Additionally I was wondering why this is retrieved from the global attributes since the reader relies on dynamic datasets anyway and retrieves some attributes anyway in _assign_ds_info which could also be used to get the sensor. I guess the sensor information might not be available in all readers?
However this is not to say that global attributes should not be used. Actually maybe the cf writer should write attributes common to all datasets to the global attributes similar to the to_xarray_dataset scene method?

Note: a lot of the cf writer tests are failing due to the instrument attribute addition. I wanted to hear your thoughts on this before fixing these.

@djhoese
Copy link
Member

djhoese commented May 15, 2021

I think the main purpose of this PR has been handled now in some other PR. You can see the current file types here:

- '{platform_name}-{sensor}-{resolution_type}-{start_time:%Y%m%d%H%M%S}-{end_time:%Y%m%d%H%M%S}.nc'
- '{platform_name}-{sensor}-{start_time:%Y%m%d%H%M%S}-{end_time:%Y%m%d%H%M%S}.nc'

I think we can close this? If you don't think so, let me know and we can reopen it.

@djhoese djhoese closed this May 15, 2021
@BENR0
Copy link
Collaborator Author

BENR0 commented May 17, 2021

Well I would have to see the PR. The file type patterns above still don't match the patterns the cf writer uses by default. Additionally the cf reader needs the instrument attribute in the metadata of the netcdf which also does not get written by default. So I think it would be good to keep this open.

@djhoese
Copy link
Member

djhoese commented May 17, 2021

Here's what your (this) PR has:

'{platform_name}-{sensor}-{start_time:%Y%m%d%H%M%S}-{end_time:%Y%m%d%H%M%S}.nc'

And here's what is there currently:

'{platform_name}-{sensor}-{start_time:%Y%m%d%H%M%S}-{end_time:%Y%m%d%H%M%S}.nc' 

I think those are exactly the same, right?

@BENR0
Copy link
Collaborator Author

BENR0 commented May 17, 2021

:-D yes exactly the filename pattern from the reader the and writer should be the same. With my changes they are. Currently the filename pattern of the writer is:

filename: '{name}_{start_time:%Y%m%d_%H%M%S}.nc'

@djhoese
Copy link
Member

djhoese commented May 17, 2021

Oh! 🤦 Your changes are to the writer. @TAlonglong @martin @sfinkens what do you think? Should the CF writer's default output filename be the more complex name in this PR (see my comment above) or should the CF reader be updated to also match the current simpler writer default filename (see above).

@mraspaud
Copy link
Member

Yes, the more complete name is better imo

@sfinkens
Copy link
Member

Thanks for looking into this @BENR0! I agree that the more complete name is better.

Regarding the attribute name: sensor is certainly more consistent in the Satpy world. You could argue that the netCDF file is for people from the CF community. And they are probably more familiar with instrument. At least that is what's recommended by ACDD. I have a tendency to keep the instrument name.


Thinking a bit further: There's actually a standardized vocabulary for platforms, instruments etc (GCMD). For example you would specify

platform = "Earth Observation Satellites > METEOSAT > METEOSAT-9"
instrument = "Earth Remote Sensing Instruments > Passive Remote Sensing > Spectrometers/Radiometers > Imaging Spectrometers/Radiometers > SEVIRI > Spinning Enhanced Visible and Infrared Imager"
instrument_vocabulary = "GCMD Instruments, Version 8.1"
platform_vocabulary = "GCMD Platforms, Version 8.1"

I think it would be nice if all the readers followed that convention, if possible. But that's clearly beyond the scope of this PR.

@BENR0
Copy link
Collaborator Author

BENR0 commented May 18, 2021

@sfinkens thanks for the info.

Some more thoughts from my side about the instrument attribute. In this PR I just added writing of this attribute to the cf writer because the cf reader complains otherwise. That said from what I grasp the cf reader does not really depend on that information itself but just for the sensor_names property. So maybe the cf reader should be changed to not need the instrument attribute?
If the writer should add that attribute it would be necessary to ensure that every dataset has it (I am not sure if currently it is possible that a dataset does not have it?).
Furthermore in this PR the sensor attribute of the first dataset is used. It might be necessary to refactor that to account for cases where not all datasets in the scene are from the same sensor.

@sfinkens
Copy link
Member

In this PR I just added writing of this attribute to the cf writer because the cf reader complains otherwise

Yes that's good, reader and writer should be consistent!

So maybe the cf reader should be changed to not need the instrument attribute?

The sensor_names property is prescribed by the base class so we better keep it :)

Furthermore in this PR the sensor attribute of the first dataset is used. It might be necessary to refactor that to account for cases where not all datasets in the scene are from the same sensor.

I think that's fine for now.

@BENR0
Copy link
Collaborator Author

BENR0 commented May 25, 2022

@djhoese @mraspaud @sfinkens working on #1695 I stumbled about this problem again and I think this is still relevant. Especially the instrument attribute is not very intuitive to users. For the default filenames in the reader I think it would also be good to even add {filename}.nc to the list like for the generic_image reader since a user might just want to save to a custom name which he can not read back in since the cf reader does not find any files.

@BENR0
Copy link
Collaborator Author

BENR0 commented Aug 16, 2022

@djhoese @mraspaud @sfinkens this still is an issue. I have merged the current main again and pushed to this branch but since this PR got closed it does not show up. Should I open another PR?

Just saw #2176 which is related to this PR regarding the missing instrument.

@djhoese djhoese reopened this Aug 19, 2022
@djhoese djhoese added enhancement code enhancements, features, improvements component:readers labels Aug 19, 2022
Copy link
Member

@djhoese djhoese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on updating this again. I've re-opened the PR. Looks like I closed it and forgot I had done that even though the conversation clearly led to it needing to be reopened.

I've made a few comments and things I'm questioning. There are few things in the writer that I don't think are needed anymore. Sorry for taking so long to get to this PR that part of it were handled by others.

satpy/writers/cf_writer.py Outdated Show resolved Hide resolved
satpy/writers/cf_writer.py Outdated Show resolved Hide resolved
satpy/etc/readers/satpy_cf_nc.yaml Outdated Show resolved Hide resolved
@djhoese
Copy link
Member

djhoese commented Aug 22, 2022

Not sure how but it seems the tests are failing because pygrib is missing?

@djhoese
Copy link
Member

djhoese commented Aug 22, 2022

Oh the pygrib warnings can be ignored.

The problem is that group_files in the failing test is matching the CF reader instead of the abi_l1b reader. I think the reader could be specified on the group files call. Do you want to try adding reader='abi_l1b' to that test.

 ________________________ TestGroupFiles.test_no_reader _________________________

self = <satpy.tests.test_readers.TestGroupFiles testMethod=test_no_reader>

    def test_no_reader(self):
        """Test that reader does not need to be provided."""
        from satpy.readers import group_files
    
        # without files it's going to be an empty result
        assert group_files([]) == []
        groups = group_files(self.g16_files)
>       self.assertEqual(6, len(groups))
E       AssertionError: 6 != 1

satpy/tests/test_readers.py:723: AssertionError

@codecov
Copy link

codecov bot commented Aug 22, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.94%. Comparing base (417c768) to head (82b858e).
Report is 390 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1637      +/-   ##
==========================================
- Coverage   95.94%   95.94%   -0.01%     
==========================================
  Files         366      366              
  Lines       53613    53504     -109     
==========================================
- Hits        51441    51332     -109     
  Misses       2172     2172              
Flag Coverage Δ
behaviourtests 4.04% <ø> (+<0.01%) ⬆️
unittests 96.03% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@coveralls
Copy link

coveralls commented Aug 22, 2022

Coverage Status

Coverage remained the same at 94.747% when pulling e82d4fd on BENR0:feat_default_filename_cf_writer into 5cc09e0 on pytroll:main.

@BENR0
Copy link
Collaborator Author

BENR0 commented Oct 13, 2022

@djhoese @mraspaud just looking through some PR's to clean up some open stuff. Is there something else than merging main again which needs to be done here or can this be merged?

@gerritholl
Copy link
Member

@djhoese @mraspaud just looking through some PR's to clean up some open stuff. Is there something else than merging main again which needs to be done here or can this be merged?

It's marked as draft. Draft pull requests cannot be merged.

@BENR0 BENR0 marked this pull request as ready for review October 13, 2022 13:48
@BENR0 BENR0 requested a review from mraspaud as a code owner October 13, 2022 13:48
@@ -17,3 +17,4 @@ file_types:
file_patterns:
- '{platform_name}-{sensor}-{resolution_type}-{start_time:%Y%m%d%H%M%S}-{end_time:%Y%m%d%H%M%S}.nc'
- '{platform_name}-{sensor}-{start_time:%Y%m%d%H%M%S}-{end_time:%Y%m%d%H%M%S}.nc'
- '{filename}.nc'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with the GRIB2 reader, could this be {stem}.nc? stem would also be more accurate, as .nc is part of the filename.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gerritholl I would be ok with that. I just added filename based on the generic_image reader config.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine by me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for stem also. The generic_image can be updated later.

Copy link
Member

@mraspaud mraspaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR, I just have a question regarding the group file test.

@@ -751,7 +751,7 @@ def test_no_reader(self):

# without files it's going to be an empty result
assert group_files([]) == []
groups = group_files(self.g16_files)
groups = group_files(self.g16_files, reader='abi_l1b')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what this has to do with this PR. Moreover, the docstring of this test says that the we want to check that we can skip the reader argument, and with change does the opposite?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BENR0 can you comment on this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mraspaud what you are saying makes sense to me. I honestly didn't check what that test does I just made this change based on @djhoese suggestion so maybe he can shed more light on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoa I did not realize what the test was actually trying to accomplish. This is one downside of the generic filename in the reader. It matches everything and has the change to "steal" matches from other readers. Either:

  1. This test gets updated to expect 1 group of files (not great).
  2. The reader gets updated with group keys to sort by start time...which might work and fix this test.
  3. Remove the default file pattern from the reader

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well obviously I am against option 3. :-D. I think removing a "feature" because a test fails is not the way to go. I am not sure which of the remaining options is the best way to move forward.

Copy link
Member

@gerritholl gerritholl Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Satpy needs a strategy to choose a reader if multiple readers match a file pattern. I think the current strategy is that the first reader with a match wins, but the order of readers is not explicitly defined?

Copy link
Member

@djhoese djhoese Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reader order may come down to:

reader_configs = glob_config(os.path.join('readers', '*.yaml'), search_dirs=paths)

in most cases. There may be others. It may get even more complicated when we look at that glob_configs function since it is searching not only Satpy's configs but any user configured paths too:

satpy/satpy/_config.py

Lines 156 to 184 in c0b69be

def config_search_paths(filename, search_dirs=None, **kwargs):
"""Get series of configuration base paths where Satpy configs are located."""
if search_dirs is None:
search_dirs = get_config_path_safe()[::-1]
paths = [filename, os.path.basename(filename)]
paths += [os.path.join(search_dir, filename) for search_dir in search_dirs]
paths += [os.path.join(PACKAGE_CONFIG_PATH, filename)]
paths = [os.path.abspath(path) for path in paths]
if kwargs.get("check_exists", True):
paths = [x for x in paths if os.path.isfile(x)]
paths = list(OrderedDict.fromkeys(paths))
# flip the order of the list so builtins are loaded first
return paths[::-1]
def glob_config(pattern, search_dirs=None):
"""Return glob results for all possible configuration locations.
Note: This method does not check the configuration "base" directory if the pattern includes a subdirectory.
This is done for performance since this is usually used to find *all* configs for a certain component.
"""
patterns = config_search_paths(pattern, search_dirs=search_dirs,
check_exists=False)
for pattern_fn in patterns:
for path in glob.iglob(pattern_fn):
yield path

Either way, it clearly uses iglob which isn't going to be sorted.

As for my option 3 @BENR0, it isn't that the test fails, it is that the failing test represents a use case (as @gerritholl says): which reader gets chosen when a user doesn't specify a reader (they don't know what one they want). With this PR they seem to always get the CF reader even though there is a much better choice.

@gerritholl besides alphabetical order, I'm not sure we want to go down the performance-lossy road of checking all readers for a match against a filename. We could implement something that knows that the CF reader and the generic image reader should be a last resort:

# pseudo code
generic_choice = None
for reader, reader_info in ...:
    if reader.matches(filename):
        if not reader_info["generic"]:
            return reader
        generic_choice = reader
return generic_choice

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. This discussion reminded me of the conversation regarding overlapping file patterns in #913 which might be related.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar, but they are separate bits of code. The overlapping file patterns will likely always be a little difficult. This reader choice one can have a solution, but it might slow some operations down.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I exchanged "filename" for "stem" which obviously was the easy part. @djhoese @mraspaud @sfinkens @gerritholl What is the proposed way, based on the discussion above, to move this forward?

@@ -17,3 +17,4 @@ file_types:
file_patterns:
- '{platform_name}-{sensor}-{resolution_type}-{start_time:%Y%m%d%H%M%S}-{end_time:%Y%m%d%H%M%S}.nc'
- '{platform_name}-{sensor}-{start_time:%Y%m%d%H%M%S}-{end_time:%Y%m%d%H%M%S}.nc'
- '{filename}.nc'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine by me.

@coveralls
Copy link

coveralls commented May 25, 2023

Pull Request Test Coverage Report for Build 9516196907

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 96.04%

Totals Coverage Status
Change from base Build 9513521871: 0.0%
Covered Lines: 51561
Relevant Lines: 53687

💛 - Coveralls

@@ -2,6 +2,6 @@ writer:
name: cf
description: Generic netCDF4/CF Writer
writer: !!python/name:satpy.writers.cf_writer.CFWriter
filename: '{name}_{start_time:%Y%m%d_%H%M%S}.nc'
filename: '{platform_name}-{sensor}-{start_time:%Y%m%d%H%M%S}-{end_time:%Y%m%d%H%M%S}.nc'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure it was mentioned in the comments, but does anyone know/remember where this - separate filename came from? As you mentioned it seems to have existed in the reader first, but it isn't clear to me why/where from.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my question here still stands, but not sure we're going to get anyone who has a real answer.

@djhoese
Copy link
Member

djhoese commented May 25, 2023

I don't use find_files_and_readers so one of the other maintainers or users will have to decide on the {stem}.nc file pattern added to the reader. Since this would match any netcdf file input I could see it breaking a lot of users workflows so I'm hesitant to keep it. The rest of this PR I think I'm fine with.

@BENR0
Copy link
Collaborator Author

BENR0 commented Jun 14, 2024

I removed the stem from the filename patterns for the reader again because at this point it is causing to much trouble and should be solved in a separate PR. I will add an issue. This is ready to go then.

Copy link
Member

@djhoese djhoese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -2,6 +2,6 @@ writer:
name: cf
description: Generic netCDF4/CF Writer
writer: !!python/name:satpy.writers.cf_writer.CFWriter
filename: '{name}_{start_time:%Y%m%d_%H%M%S}.nc'
filename: '{platform_name}-{sensor}-{start_time:%Y%m%d%H%M%S}-{end_time:%Y%m%d%H%M%S}.nc'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my question here still stands, but not sure we're going to get anyone who has a real answer.

@mraspaud mraspaud merged commit cade277 into pytroll:main Jun 24, 2024
18 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:readers enhancement code enhancements, features, improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants