WIP Quda work clover force #549

Merged: 38 commits from quda_work_clover_force into quda_work on Jan 27, 2024
Conversation

kostrzewa
Member

No description provided.

@kostrzewa changed the title from "Quda work clover force" to "WIP Quda work clover force" on Oct 6, 2022
@Marcogarofalo
Contributor

I think that the call to reorder_spinor_eo_toQuda is necessary. However, the quda function reorder_gauge_toQuda returns an error.

@Marcogarofalo
Contributor

I asked the quda developers for help: lattice/quda#1330 (comment)

@kostrzewa
Member Author

I think that the call to reorder_spinor_eo_toQuda is necessary. However, the quda function reorder_gauge_toQuda returns an error.

reorder_gauge_toQuda is our own function: what kind of error can it return?

It seems to me that where computeCloverForceQuda's third argument should be an array of arrays, you are simply passing spinorIn instead. As a quick workaround, I guess passing &spinorIn might work when only a single field is required for x, right?

@Marcogarofalo
Contributor

I think that the call to reorder_spinor_eo_toQuda is necessary. However, the quda function reorder_gauge_toQuda returns an error.

reorder_gauge_toQuda is our own function: what kind of error can it return?

Sorry, I made a typo in the comment: the function that returns an error is computeCloverForceQuda, see lattice/quda#1330.

It seems to me that where computeCloverForceQuda's third argument should be an array of arrays, you are simply passing spinorIn instead. As a quick workaround, I guess passing &spinorIn might work when only a single field is required for x, right?

yes I agree
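
For reference, a minimal sketch of the workaround agreed on above. The names and the commented-out call are illustrative only; the actual signature of computeCloverForceQuda is deliberately not reproduced here.

  /* Sketch only: where QUDA expects an array of spinor-field pointers for x,
   * a single field can be passed as a one-element array, i.e. &spinorIn. */
  static void call_clover_force_single_field(void *spinorIn) {
    void **x = &spinorIn;  /* one-element "array of arrays" */
    (void)x;               /* placeholder: silences the unused-variable warning */
    /* computeCloverForceQuda(..., x, ...);  -- signature abbreviated on purpose */
  }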

@Marcogarofalo
Contributor

@kostrzewa can I merge the other way around, i.e. quda_work into this quda_work_clover_force to get the automatic integration test working again?

@kostrzewa
Member Author

@kostrzewa can I merge the other way around, i.e. quda_work into this quda_work_clover_force to get the automatic integration test working again?

yes, that would be the way to go anyway to synchronise before merging everything together

@kostrzewa
Member Author

@Marcogarofalo using 020dfa3 (tmLQCD) and c310d9c53431e2cf75a86e2cf0956ebcc2b9e12e (QUDA), I get the following behaviour on Juwels Booster:

00001033 0.540737337562 3039327.268655220047 0.000000e+00 101 1861 1115 9364 0 3.845400e+02 3.080185e-01
00001034 0.540737337562 3033882.783575160429 0.000000e+00 100 1850 1098 9277 0 1.325076e+02 3.080185e-01
00001035 0.540737337562 3023880.860261276364 0.000000e+00 101 1868 1111 9355 0 1.325901e+02 3.080185e-01

In a test run which works fine with the current quda_work head commit:

00001033 0.539109865241 -0.139158694074 1.149306e+00 101 1864 1026 8308 1 4.022610e+02 3.051663e-01
00001034 0.537638696155 -0.114508185536 1.121322e+00 100 1860 1023 8298 1 2.138116e+02 3.026756e-01
00001035 0.536272387032 -0.084111206234 1.087750e+00 102 1870 1047 8383 1 2.137449e+02 3.004461e-01

using this input file: /p/scratch/isdlqcd/kostrzewa2/benchmarks/tmLQCD_QUDA_HMC/quda-tm_force_64c128/64c128_n8mpi4nt24.input

@kostrzewa
Member Author

@Marcogarofalo using 020dfa3 (tmLQCD) and c310d9c53431e2cf75a86e2cf0956ebcc2b9e12e (QUDA) ...

@Marcogarofalo, which was the last QUDA commit that you tested with? It might be that some of the recent changes have messed with our interfacing with the code or, more likely, I did something wrong.

@kostrzewa
Member Author

I've started some more test runs using the functionality that you added in aeffd7a / 5d3caec and #577 on 48c96 lattices at two quark masses. One of the two runs has started and is not showing any differences > 1e-9 so far. The problem is that it always took a while for the instability to be triggered so we need to wait and see.

move compare_derivative into its own compilation unit and make the strictness variable, specify a 'name' argument
@kostrzewa
Member Author

@Marcogarofalo I think I might have at least figured out why one set of my test jobs was misbehaving. To be more specific, the ones that I referred to in:

I have two nf=2+1+1 test runs on 24c48 and 48c96 lattices in which, after a few integration steps, solvers begin to diverge (in one case the MG fails to converge, in another run a simple CG with a large rho fails to converge, and finally, in yet another, a multi-shift solve diverges and produces a NaN residual).

I was using fc31c7d before you had added the 07fc55e fix.

…ke sure to not miss anything by doing a MPI_MAX reduction on rank 0.
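
For reference, a hedged sketch of the pattern described by the two commit messages above; the function below is an illustrative stand-in, not the actual compare_derivative in tmLQCD. Each rank computes its local maximum deviation between host and device derivatives, and an MPI_MAX reduction collects the global maximum on rank 0 so that no deviation is missed.

  #include <math.h>
  #include <mpi.h>
  #include <stdio.h>

  /* Sketch only: compare two derivative (force) fields and report the global
   * maximum deviation on rank 0. 'strictness' and 'name' mirror the arguments
   * mentioned in the commit messages; the real implementation may differ. */
  static int compare_derivative_sketch(const double *host, const double *device,
                                       size_t n, double strictness, const char *name) {
    double local_max = 0.0, global_max = 0.0;
    for (size_t i = 0; i < n; i++) {
      double diff = fabs(host[i] - device[i]);
      if (diff > local_max) local_max = diff;
    }
    /* MPI_MAX reduction so that rank 0 sees the largest deviation on any rank */
    MPI_Reduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0 && global_max > strictness) {
      printf("compare_derivative (%s): max deviation %.3e exceeds %.3e\n",
             name, global_max, strictness);
      return 1;
    }
    return 0;
  }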
@kostrzewa
Member Author

FYI I have a 64c128 test job running on Booster in which there appear to be no problems.

/p/scratch/isdlqcd/kostrzewa2/benchmarks/tmLQCD_QUDA_HMC/quda-tm_force_64c128/jobscript/logs/log_64c128_tm_force_quda_full_n8_9214292.out

I will perform another longer run (this one has artificially short trajectories), but so far it seems that all is well.

@Marcogarofalo
Contributor

Hi, thanks for the good news. I am also running a test of the 64x128 lattice in
/qbigwork2/garofalo/builds/tmLQCD_fermionic_forces/phys_test/
So far the plaquette is OK, while $\Delta H$ has started to show differences:

::::::::::::::
device/output.data
::::::::::::::
00001033 0.539124377721 -0.181064112112 1.198492e+00 99 1850 1008 8217 1 7.052586e+02 3.051752e-01
00001034 0.537658649417 -0.136323347688 1.146052e+00 102 1869 1036 8334 1 6.922392e+02 3.026848e-01
00001035 0.536295420353 -0.072135765105 1.074801e+00 101 1888 1037 8464 1 6.917731e+02 3.004625e-01
00001036 0.535011552829 -0.135401181877 1.144996e+00 101 1875 1025 8362 1 7.069782e+02 2.983943e-01
00001037 0.533843994664 0.000084528700 9.999155e-01 101 1875 1033 8372 1 6.941697e+02 2.965856e-01
00001038 0.532752973206 -0.088634211570 1.092681e+00 100 1872 1018 8313 1 7.003413e+02 2.948908e-01
00001039 0.531704500459 -0.006030147895 1.006048e+00 101 1866 1025 8286 1 6.922550e+02 2.933619e-01
00001040 0.530735472711 -0.022664899006 1.022924e+00 101 1883 1028 8354 1 6.910189e+02 2.919589e-01
::::::::::::::
host/output.data
::::::::::::::
00001033 0.539124377721 -0.181256163865 1.198722e+00 99 1850 1010 8218 1 1.087587e+03 3.051752e-01
00001034 0.537658649417 -0.136414341629 1.146157e+00 102 1866 1037 8334 1 1.079608e+03 3.026848e-01
00001035 0.536289797778 -0.074235649779 1.077061e+00 101 1885 1037 8444 1 1.114638e+03 3.004565e-01
00001036 0.535016437923 -0.062977313995 1.065003e+00 100 1877 1019 8385 1 1.103632e+03 2.984491e-01
00001037 0.533841333976 -0.071600986645 1.074227e+00 100 1848 1009 8207 1 1.101621e+03 2.966407e-01

@kostrzewa
Member Author

Are you running the same number of trajectories per job? Otherwise the random numbers won't be identical and you are bound to see differences quite early on. After some number of trajectories there's of course nothing that you can do: the runs will diverge somewhere around 20-30 trajectories or even slightly earlier.

@kostrzewa
Member Author

kostrzewa commented Jan 18, 2024

In my test using a full 2+1+1 run on a 64c128 lattice things are looking good so far (the top seven trajectories are with the force offloading enabled):

# n8 with quda force full | quda_version quda-develop-25d85b | slurm_job_id 9214636
00001033 0.540752271381 0.043441347778 9.574887e-01 100 14226 175 12811 34394 0 1010 61291 1214 43192 932 3870 109 6159 4976 77081 1 4.504216e+03 3.080557e-01
00001034 0.540752271381 0.565656803548 5.679870e-01 102 14220 179 12795 35007 0 1040 61200 1267 43109 932 3848 109 5938 5168 78220 0 4.421921e+03 3.080557e-01
00001035 0.540713299781 0.471504589543 6.240626e-01 99 14225 175 12849 34024 0 1005 61374 1210 43367 933 3822 107 5853 4949 77301 1 4.452976e+03 3.080064e-01
00001036 0.540689619806 -0.210317837074 1.234070e+00 199 28422 278 25633 48012 0 1920 122603 1901 86464 975 7664 169 11686 7556 153829 1 8.718310e+03 3.079745e-01
00001037 0.540717490165 -0.093398010358 1.097899e+00 99 14214 174 12805 34093 0 1010 61197 1213 43164 934 3835 106 5917 4964 77340 1 4.464691e+03 3.080230e-01
00001038 0.540701838357 -0.041311362758 1.042177e+00 100 14174 176 12763 34198 0 1012 61049 1224 42988 932 3843 109 5865 4968 76331 1 4.421069e+03 3.080025e-01
00001039 0.540722992628 -0.098816601560 1.103864e+00 98 14308 174 12939 33863 0 998 61671 1214 43740 935 3829 108 5950 4894 76960 1 4.427378e+03 3.080321e-01
# n8 with quda no_force full | quda_version quda-develop-25d85b | slurm_job_id 9217535
00001033 0.540752271382 0.043261365965 9.576611e-01 100 14224 175 12813 34388 0 1010 61298 1214 43175 932 3874 109 6158 4974 77093 1 5.152877e+03 3.080557e-01
00001034 0.540752271382 0.565140699968 5.682802e-01 101 14221 180 12794 35034 0 1038 61215 1250 43091 933 3846 109 5942 5161 78216 0 5.070436e+03 3.080557e-01
00001035 0.540713299781 0.471584610641 6.240127e-01 99 14226 175 12850 34019 0 1006 61371 1209 43347 933 3822 107 5851 4957 77339 1 5.078761e+03 3.080064e-01
00001036 0.540689619817 -0.210258966312 1.233998e+00 199 28417 278 25631 48004 0 1922 122607 1908 86476 975 7666 169 11701 7541 153884 1 9.954171e+03 3.079745e-01
00001037 0.540717490182 -0.093846568838 1.098391e+00 99 14217 174 12807 34072 0 1007 61186 1211 43166 934 3838 107 5913 4956 77344 1 5.083100e+03 3.080230e-01
00001038 0.540701838370 -0.041504077613 1.042377e+00 100 14175 177 12764 34232 0 1013 61043 1225 43011 931 3841 108 5864 4971 76314 1 5.042457e+03 3.080025e-01
00001039 0.540722992685 -0.097678773105 1.102609e+00 98 14306 175 12936 33853 0 998 61648 1214 43753 935 3830 108 5942 4923 76966 1 5.047067e+03 3.080321e-01

I will update the above when the second set of trajectories with force offloading disabled has completed.

@kostrzewa
Member Author

I think whatever problems I seemed to have were related to the one oversight of employing fc31c7d before you had added the 07fc55e fix, plus, later on, perhaps some mismatch between the tmLQCD and QUDA versions that I was using.

@Marcogarofalo
Contributor

Are you running the same number of trajectories per job?

no, I'll redo the test then.

@kostrzewa
Member Author

The test above #549 (comment) looks fine to me. Could you add something to the documentation about UseExternalLibrary = quda having to be set to use the offloaded fermionic force? I think we can then merge this and start using it in production. Thanks a lot!
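
For the record, a hedged sketch of what such a documentation snippet could look like; the monomial type is illustrative and the usual monomial parameters are omitted, since only the two external-library lines are the point here:

  BeginMonomial CLOVERDETRATIO
    UseExternalInverter = quda
    UseExternalLibrary = quda
  EndMonomial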

@Marcogarofalo
Contributor

After doing the test with the same random numbers in the host and device runs, I got good agreement after 30 trajectories:

==> device/output.data <==
00001033 0.539124377721 -0.181064112112 1.198492e+00 99 1850 1008 8217 1 6.779430e+02 3.051752e-01
00001034 0.537658649417 -0.136323347688 1.146052e+00 102 1869 1036 8334 1 6.657363e+02 3.026848e-01
...
00001063 0.516763973986 -0.057475058362 1.059159e+00 101 1924 867 7139 1 6.612186e+02 2.728876e-01

==> host/output.data <==
00001033 0.539124377721 -0.181048505008 1.198473e+00 99 1850 1008 8220 1 1.075596e+03 3.051752e-01
00001034 0.537658649417 -0.136431051418 1.146176e+00 102 1869 1034 8337 1 1.071386e+03 3.026848e-01
...
00001063 0.516763973986 -0.057487478480 1.059172e+00 101 1928 867 7138 1 1.080475e+03 2.728876e-01

@kostrzewa
Member Author

A comment to 44449ad: the DET and DETRATIO monomials do support EnableExternalInverter (or at least they should).

@Marcogarofalo
Contributor

we never tested:

  useexternalinverter = no
  UseExternalLibrary = quda

I don't see an application for this setup; at the current stage it is likely to crash.

@kostrzewa
Member Author

we never tested:

true ... I guess the check for "UseExternalLibrary" should include a simultaneous check for "UseExternalInverter" and spit out an error if the two are not set together...
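
A minimal sketch of such a combined check; the struct, field names and error handling below are illustrative assumptions, not the actual read_input / monomial code in tmLQCD:

  #include <stdio.h>
  #include <stdlib.h>

  /* Sketch only: reject the untested combination at input-parsing time. */
  typedef struct {
    int external_library;   /* set if UseExternalLibrary = quda  */
    int external_inverter;  /* set if UseExternalInverter = quda */
  } monomial_sketch;

  static void check_external_library(const monomial_sketch *mnl) {
    if (mnl->external_library && !mnl->external_inverter) {
      fprintf(stderr, "UseExternalLibrary = quda requires UseExternalInverter = quda\n");
      exit(1);
    }
  }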

@kostrzewa
Member Author

Note that I've merged with the current quda_work as there was a conflict in the documentation here.

@kostrzewa
Member Author

I'm a bit stuck testing this on LUMI-G because of lattice/quda#1432

I can test using an unofficial stack based on ROCm 5.6.1, but I've already found that there are issues when both GDR and P2P are enabled with my current production code (relatively recent tmLQCD quda_work and quda-develop-02391b12). I see solvers diverging, so I first need to find a combination of ROCm 5.6.1 × GDR=[0,1] × P2P=[0,3] which works correctly before I can move on to testing the offloaded fermionic force.

@kostrzewa
Member Author

I was able to test this on LUMI-G now and it works perfectly fine. 10% speedup overall in a large run on 128 nodes which should translate to more than that at a smaller scale.

@kostrzewa merged commit 28295ac into quda_work on Jan 27, 2024
3 checks passed
@kostrzewa
Member Author

@Marcogarofalo this is really awesome, thanks!

@Marcogarofalo
Contributor

Thank you for the help!
