Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix for cris-fsr memory corruption (issue #755) #767

Merged
merged 1 commit into from
Jul 9, 2024

Conversation

jswhit
Copy link
Contributor

@jswhit jswhit commented Jul 7, 2024

one line change from @jderber-NOAA to fix memory corruption issue in correlated_obsmod.F90 for cris-fsr data found in reanalysis scout runs. Detailed description found in issue #755.

Resolves #755

Checklist

  • [x ] My code follows the style guidelines of this project
  • [x ] I have performed a self-review of my own code
  • [ x] I have commented my code, particularly in hard-to-understand areas
  • New and existing tests pass with my changes
  • [ x] Any dependent changes have been merged and published

@RussTreadon-NOAA
Copy link
Contributor

@ADCollard, @emilyhcliu , @xincjin-NOAA , or @azadeh-gh : If any of you have time, would you please review this simple one line change from @jderber-NOAA ? John fixed a bug in correlated_obsmod.F90 for cris-fsr data.

If none of you have time to review the change, any suggestions as to who we can ask?

Copy link
Contributor

@emilyhcliu emilyhcliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked through the issues documented in Issue #755. I learned a lot by reading the issue about how to trace the stepsize problem in the minimization.

@emilyhcliu
Copy link
Contributor

It seems that the number of active channels (nch_active) stored in the correlated obs file is not consistent with the number of active channels (counts) from the sat info file.

if (corr_obs) then
lu = luavail()
open(lu,file=trim(fname),convert='little_endian',form='unformatted')
if (GMAO_ObsErrorCov) then
read(lu,IOSTAT=ioflag) nch_active, iprec
else
read(lu,IOSTAT=ioflag) nch_active, nctot, iprec
endif
if(ioflag/=0) call die(myname_,' failed to read nch from '//trim(fname))
coun=0
couns=0
istart=0
nctotf=0
do ii=1,jpch_rad
if (nusis(ii)==ErrorCov%instrument) then
if (couns==0) then
istart=ii-1
couns=1
endif
if (iuse_rad(ii)>0) then
coun=coun+1
endif
nctotf=nctotf+1
endif
enddo
! if no data available, turn off Correlated Error
if (coun==0) then
if (iamroot_) write(6,*) 'WARNING: ',trim(ErrorCov%instrument), &
' is not initiallized. Turning off Correlated Error'
return
endif
ErrorCov%nch_active = coun
if (.not.GMAO_ObsErrorCov) ErrorCov%nctot = nctot
call create_(coun,ErrorCov)
allocate(indxRf(nch_active),indxR(nch_active),Rcov(nctot,nctot))

@RussTreadon-NOAA
Copy link
Contributor

WCOSS2 ctests
Build jswhit:cris-fsr-fix at 7c63e06 on Dogwood. Run ctests against develop at 529bb79. All tests pass

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr767/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  726.99 sec
2/6 Test #6: global_enkf ......................   Passed  850.40 sec
3/6 Test #2: rtma .............................   Passed  977.23 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1155.09 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  1274.74 sec
6/6 Test #1: global_4denvar ...................   Passed  1682.85 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1682.86 sec

@xincjin-NOAA
Copy link
Contributor

It seems that the number of active channels (nch_active) stored in the correlated obs file is not consistent with the number of active channels (counts) from the sat info file.

if (corr_obs) then
lu = luavail()
open(lu,file=trim(fname),convert='little_endian',form='unformatted')
if (GMAO_ObsErrorCov) then
read(lu,IOSTAT=ioflag) nch_active, iprec
else
read(lu,IOSTAT=ioflag) nch_active, nctot, iprec
endif
if(ioflag/=0) call die(myname_,' failed to read nch from '//trim(fname))
coun=0
couns=0
istart=0
nctotf=0
do ii=1,jpch_rad
if (nusis(ii)==ErrorCov%instrument) then
if (couns==0) then
istart=ii-1
couns=1
endif
if (iuse_rad(ii)>0) then
coun=coun+1
endif
nctotf=nctotf+1
endif
enddo
! if no data available, turn off Correlated Error
if (coun==0) then
if (iamroot_) write(6,*) 'WARNING: ',trim(ErrorCov%instrument), &
' is not initiallized. Turning off Correlated Error'
return
endif
ErrorCov%nch_active = coun
if (.not.GMAO_ObsErrorCov) ErrorCov%nctot = nctot
call create_(coun,ErrorCov)
allocate(indxRf(nch_active),indxR(nch_active),Rcov(nctot,nctot))

Does this mean that the nch_active should be equal to coun?

@jderber-NOAA
Copy link
Contributor

jderber-NOAA commented Jul 8, 2024 via email

@emilyhcliu
Copy link
Contributor

It seems that the number of active channels (nch_active) stored in the correlated obs file is not consistent with the number of active channels (counts) from the sat info file.

if (corr_obs) then
lu = luavail()
open(lu,file=trim(fname),convert='little_endian',form='unformatted')
if (GMAO_ObsErrorCov) then
read(lu,IOSTAT=ioflag) nch_active, iprec
else
read(lu,IOSTAT=ioflag) nch_active, nctot, iprec
endif
if(ioflag/=0) call die(myname_,' failed to read nch from '//trim(fname))
coun=0
couns=0
istart=0
nctotf=0
do ii=1,jpch_rad
if (nusis(ii)==ErrorCov%instrument) then
if (couns==0) then
istart=ii-1
couns=1
endif
if (iuse_rad(ii)>0) then
coun=coun+1
endif
nctotf=nctotf+1
endif
enddo
! if no data available, turn off Correlated Error
if (coun==0) then
if (iamroot_) write(6,*) 'WARNING: ',trim(ErrorCov%instrument), &
' is not initiallized. Turning off Correlated Error'
return
endif
ErrorCov%nch_active = coun
if (.not.GMAO_ObsErrorCov) ErrorCov%nctot = nctot
call create_(coun,ErrorCov)
allocate(indxRf(nch_active),indxR(nch_active),Rcov(nctot,nctot))

Does this mean that the nch_active should be equal to coun?

I think so!

@jderber-NOAA
Copy link
Contributor

jderber-NOAA commented Jul 8, 2024

I think with this change coun and nch_active do not have to be equal. If coun < nch_active the code will delete those parts of the correlation matrix that use channels not in coun. If nch_active < coun then the code will add the additional channels found in coun into the covariance matrix using the variances (no correlated error) in the satinfo file (not positive where the variances come from).

The original code was set up to do this, just the one array was too small (when coun > nch_active) and created the overwriting of the memory.

Copy link
Contributor

@jderber-NOAA jderber-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes are consistent with what I gave Jeff. Approved.

jswhit pushed a commit to jswhit/GSI that referenced this pull request Jul 8, 2024
@RussTreadon-NOAA RussTreadon-NOAA self-requested a review July 9, 2024 12:40
Copy link
Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve

@RussTreadon-NOAA
Copy link
Contributor

@jderber-NOAA and @emilyhcliu : Are we good with the changes in this PR? If "yes", I will work with the GSI Review Committee to merge this PR into develop.

@jderber-NOAA
Copy link
Contributor

jderber-NOAA commented Jul 9, 2024 via email

@RussTreadon-NOAA RussTreadon-NOAA self-assigned this Jul 9, 2024
@emilyhcliu
Copy link
Contributor

@jderber-NOAA and @emilyhcliu : Are we good with the changes in this PR? If "yes", I will work with the GSI Review Committee to merge this PR into develop.

@RussTreadon-NOAA Yes!!

@RussTreadon-NOAA
Copy link
Contributor

Hera & Hercules ctests
Install jswhit:cris-fsr-fix at 7c63e06 on Hera and Hercules. Run ctests with following results

Hera

Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr767/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed  614.03 sec
2/6 Test #6: global_enkf ......................   Passed  814.13 sec
3/6 Test #2: rtma .............................   Passed  972.84 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1107.29 sec
5/6 Test #4: hafs_4denvar_glbens ..............***Failed  1222.67 sec
6/6 Test #1: global_4denvar ...................***Failed  2046.18 sec

50% tests passed, 3 tests failed out of 6

Total Test time (real) = 2046.22 sec

The following tests FAILED:
          1 - global_4denvar (Failed)
          3 - rrfs_3denvar_rdasens (Failed)
          4 - hafs_4denvar_glbens (Failed)

The rrfs_3denvar_rdasens failure is due to

The runtime for rrfs_3denvar_rdasens_loproc_updat is 157.287149 seconds.  This has exceeded maximum allowable threshold ti\
me of 129.578203 seconds, resulting in Failure time-thresh of the regression test.

A check of the gsi.x wall times confirms that the updat wall times exceed the contrl

rrfs_3denvar_rdasens_hiproc_contrl/stdout:The total amount of wall time                        = 86.790942
rrfs_3denvar_rdasens_hiproc_updat/stdout:The total amount of wall time                        = 108.434967
rrfs_3denvar_rdasens_loproc_contrl/stdout:The total amount of wall time                        = 86.385469
rrfs_3denvar_rdasens_loproc_updat/stdout:The total amount of wall time                        = 157.287149

Correlate observation error for radiances is not active in the rrfs_3denvar_rdasens test. The correction of the array allocation bug is the only difference between updat and contrl. Thus, the above timing difference is not due to the bug fix since this code is not exercised.

The hafs_4denvar_glbens failure is due to

The runtime for hafs_4denvar_glbens_loproc_updat is 314.056492 seconds.  This has exceeded maximum allowable threshold time of 312.580009 seconds, resulting in Failure time-thresh of the regression test.

This is not a fatal fail.

The global_4denvar failed due to

The runtime for global_4denvar_hiproc_updat is 483.159129 seconds.  This has exceeded maximum allowable threshold time of 400.916883 seconds, resulting in Failure of timethresh2 the regression test.

This is not a fatal fail.

A rerun of the failed tests resulted in Passed for each test

Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr767/build
    Start 1: global_4denvar
1/1 Test #1: global_4denvar ...................   Passed  1986.62 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 1986.64 sec

Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr767/build
    Start 3: rrfs_3denvar_rdasens
1/1 Test #3: rrfs_3denvar_rdasens .............   Passed  496.82 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 497.23 sec

Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr767/build
    Start 4: hafs_4denvar_glbens
1/1 Test #4: hafs_4denvar_glbens ..............   Passed  1287.73 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 1287.74 sec

Hercules

Test project /work/noaa/da/rtreadon/git/gsi/pr767/build/regression
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed  486.98 sec
2/6 Test #6: global_enkf ......................***Failed  729.88 sec
3/6 Test #2: rtma .............................***Failed  965.90 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1153.19 sec
5/6 Test #4: hafs_4denvar_glbens ..............***Failed  1218.99 sec
6/6 Test #1: global_4denvar ...................   Passed  1681.87 sec

33% tests passed, 4 tests failed out of 6

Total Test time (real) = 1681.88 sec

The following tests FAILED:
          2 - rtma (Failed)
          3 - rrfs_3denvar_rdasens (Failed)
          4 - hafs_4denvar_glbens (Failed)
          6 - global_enkf (Failed)

The rrfs_3denvar_rdasens failure is due to

The runtime for rrfs_3denvar_rdasens_hiproc_updat is 105.944369 seconds.  This has exceeded maximum allowable threshold time of 104.912751 seconds, resulting in Failure of timethresh2 the regression test.

This is not a fatal fail.

The global_enkf failure is due to

The runtime for global_enkf_loproc_updat is 160.504755 seconds.  This has exceeded maximum allowable threshold time of 157.186910 seconds, resulting in Failure timethresh of the regression test.

This is not a fatal fail.

The rtma failure is due to

The runtime for rtma_loproc_updat is 215.266764 seconds.  This has exceeded maximum allowable threshold time of 211.326822 seconds, resulting in Failure time-thresh of the regression test.

This is not a fatal fail.

The hafs_4denvar_glbens failure is due to

The runtime for hafs_4denvar_glbens_loproc_updat is 352.022764 seconds.  This has exceeded maximum allowable threshold time of 292.013010 seconds, resulting in Failure time-thresh of the regression test.

This is not a fatal fail.

A rerun of the ctests on Hercules using /work2 as the run directory yields Passed for two of the four failed tests

Test project /work/noaa/da/rtreadon/git/gsi/pr767/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #6: global_enkf ......................   Passed  751.87 sec
2/6 Test #2: rtma .............................   Passed  966.78 sec
3/6 Test #5: hafs_3denvar_hybens ..............   Passed  1098.22 sec
4/6 Test #4: hafs_4denvar_glbens ..............***Failed  1102.22 sec
5/6 Test #1: global_4denvar ...................   Passed  2222.37 sec

hafs_4denvar_glbens fails due to a timing and reproducibility checks. This is interesting. The previous run of this test in /work only failed due to timing checks. The analysis results were identical. GSI issue #697 discusses non-reproducible results from regional ctests on Hercules. Today's test indicates that this problem remains.

rrfs_3denvar_rdasens did not finish due to the intermittent gsi.x hang reported in GSI issue #766.

@RussTreadon-NOAA RussTreadon-NOAA merged commit a5e2a43 into NOAA-EMC:develop Jul 9, 2024
4 checks passed
DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this pull request Sep 6, 2024
* origin/develop:
  Move to contrib spack-stack on Jet (NOAA-EMC#787)
  a quick workaround for increasing the mpi task numbers on orion for ctest :: rrfs_3denvar_rdasens  (NOAA-EMC#788)
  Recover the capability of handling model fields from operation gfs.v16.3 (NOAA-EMC#785)
  fix a bug in deter_sfc_gmi (NOAA-EMC#781)
  add safeguard to thompson_reff (NOAA-EMC#779)
  Fix incorrect usage of real(i_kind) in mg_input.f90  (NOAA-EMC#760)
  Transition to Thompson Microphysics for Microwave All-sky Assimilation (NOAA-EMC#743)
  Format changes for EUMETSAT metop-sg and CADS debug fix (NOAA-EMC#773)
  Update global_4denvar and global_enkf ctests to reflect GFS v17 (NOAA-EMC#774)
  fix for cris-fsr memory corruption (NOAA-EMC#767)
  Gnssrwnd1.0 (NOAA-EMC#747)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GSI convergence problems in scout runs in 2020
6 participants