How to accelerate training with large dataset #395

Closed
hhlim12 opened this issue Mar 9, 2024 · 30 comments

Comments

@hhlim12 commented Mar 9, 2024

Dear FLARE developers,

Thank you so much for the great development (and the user-friendliness) of the package.
However, I have a problem: as the training set grows, updating the SGP becomes slower, as expected from the larger covariance matrix. I'd appreciate your comments and answers to the following questions:

  1. Does the current version of FLARE implement MPI parallelization? If so, would you mind telling me how to install it?
  2. I have tried OpenMP parallelization by setting OMP_NUM_THREADS, but it does not seem to improve the speed significantly. Is there a particular setting I should enable for parallelization (as in the old version of FLARE)?
  3. If I run several OTF trainings independently, is it possible to combine the resulting SGP models at once?

Thanks in advance!

Harry

@YuuuXie (Collaborator) commented Mar 30, 2024

Hi @hhlim12

Does the current version of FLARE implement MPI parallelization? If so, would you mind telling me how to install it?

Yes, we have a very experimental version. It is better suited to cases where you already have a big dataset to feed in directly, rather than running active learning.

I have tried OpenMP parallelization by setting OMP_NUM_THREADS, but it does not seem to improve the speed significantly. Is there a particular setting I should enable for parallelization (as in the old version of FLARE)?

OpenMP parallelization is used automatically within flare++, so the speed is already optimized in that sense. I don't think you can get it much faster.

If I run several OTF trainings independently, is it possible to combine the resulting SGP models at once?

Yes. Usually we set up multiple trainings and collect data for different configurations, and then combine those data to train one final model. This is achieved by offline training using FakeMD, which is introduced in the tutorial.
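For illustration, here is a minimal sketch of that idea: stack previously collected frames into one SGP and fit it in a single pass. This is not the attached fpp_train script or the FakeMD workflow itself; the file name, species map, and hyperparameters are placeholders, and the calls simply mirror the add_training_structure / add_all_environments / update_matrices_QR / build_map snippets that appear elsewhere in this thread.

import numpy as np
from ase.io import read

from flare.bffs.sgp import SGP_Wrapper
from flare.bffs.sgp.calculator import SGP_Calculator
from flare.bffs.sgp._C_flare import B2, NormalizedDotProduct, Structure

# Placeholder descriptor, kernel, and noise settings.
cutoff = 5.0
species_map = {14: 0}  # e.g. Si -> 0
descriptors = [B2("chebyshev", "quadratic", [0, cutoff], [], [len(species_map), 8, 3])]
kernels = [NormalizedDotProduct(2.0, 2)]
sgp = SGP_Wrapper(kernels, descriptors, cutoff, 0.01, 0.05, 0.005, species_map,
                  variance_type="local")

# Frames collected from several independent runs, concatenated into one file.
for atoms in read("combined_frames.xyz", index=":"):
    struc = Structure(atoms.cell, [species_map[z] for z in atoms.numbers],
                      atoms.positions, cutoff, descriptors)
    struc.energy = np.array([[atoms.get_potential_energy()]])
    struc.forces = atoms.get_forces().reshape(-1)

    sgp.sparse_gp.add_training_structure(struc)
    sgp.sparse_gp.add_all_environments(struc)
    sgp.atom_indices.append([-1])        # wrapper bookkeeping, as in the loops below
    sgp.rel_efs_noise.append([1, 1, 1])

# Fit once at the end instead of after every frame.
sgp.sparse_gp.update_matrices_QR()

# Map the model to a LAMMPS coefficient file.
SGP_Calculator(sgp).build_map("lmp.flare", "combined_model")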

@hhlim12 (Author) commented Mar 30, 2024

Hi @YuuuXie, thank you for the kind reply.

I understand your answer. We already have the dataset and want to go straight for the final model. Can the MPI version be accessed through this repository (there are mpifix and mpi branches), or should I contact you by email? I can perform some tests and send you feedback about the performance.

About the offline training using FakeMD: I think it still iterates and updates the model frame by frame. This might be inefficient if, say, only a few local environments are extracted from each frame, because the SGP model would then be updated roughly as many times as there are frames, right? My question is more about combining all those sub-models in one go, similar to batch learning in NNs. I wonder whether that is possible with a GP.

@YuuuXie (Collaborator) commented Mar 30, 2024

  1. You can do offline training using FakeMD in direct mode, i.e. without active learning. This saves some time, since it does not need to predict uncertainties for active learning. As long as the memory does not blow up, you can just let it run for a few days and see if the model finishes.

  2. For the MPI code, you can check the branch feature/yu/mpi, but I have to say it is a beta version. I have used it, but I cannot guarantee that it works stably without any issues. I may also not have time to maintain it, but I can give you some instructions here. The source code mainly used is src/parallel_sgp.cpp and flare/bffs/parallel_sgp.py.

This is what I used to compile it on my cluster; you may need to load the corresponding modules to enable compilation:

module load cmake/3.17.3-fasrc01 python/3.6.3-fasrc01 gcc/9.3.0-fasrc01
module load intel-mkl/2017.2.174-fasrc01 openmpi/4.0.5-fasrc01

git clone https://github.com/mir-group/flare.git
cd flare

git checkout feature/yu/mpi
mkdir build
cd build
CC=mpicc CXX=mpic++ FC=mpif90 cmake .. -DSCALAPACK_LIB=NOTFOUND
make -j

Attached below is a Python script as an example of usage.

fpp_train.txt

@hhlim12 (Author) commented Jul 18, 2024

@YuuuXie, thank you very much for the kind response! I finally managed to install it on my cluster, though I had to troubleshoot some MKL library problems.

Even though I can run fpp_train.txt without problems, no SGP model is dumped at the end. Do you know what additional script is needed to output the model (e.g., lmp.flare or the JSON file)? Do you have a reference for that? Thank you in advance!

Edit:

I appended the following to fpp_train.txt, and it dumps the JSON and lmp files.

from flare.bffs.sgp.calculator import SGP_Calculator

# Map the trained sparse GP to a LAMMPS coefficient file.
sgp_calc = SGP_Calculator(sparse_gp)
sgp_calc.build_map("lmp.flare", "my_name")

I also found that training with two elements works fine, but increasing the number of elements beyond two gives me this error:

Hello from rank 0 = (0,0)
Traceback (most recent call last):
  File "/work/k0107/k010716/flareMPI/test/fpp_train.py", line 225, in <module>
    sparse_gp.update_db(
  File "/home/k0107/k010716/flareMPI/flare/flare/bffs/sgp/parallel_sgp.py", line 204, in update_db
    self.build(self.training_structures, self._training_sparse_indices, update=update)
  File "/home/k0107/k010716/flareMPI/flare/flare/bffs/sgp/parallel_sgp.py", line 128, in build
    self.sparse_gp.build(
RuntimeError: Number of clusters of kernel 0 and type 2 is 0

@YuuuXie (Collaborator) commented Aug 8, 2024 via email

@hhlim12 (Author) commented Aug 8, 2024

Hi, thank you for the kind response.

I'm not sure where I should specify the "write_model" parameter in fpp_train.txt, but I appended the following to the code, and it dumps the model correctly:

sgp_calc = SGP_Calculator(sparse_gp)
sgp_calc.build_map("lmp.flare", "my_name")

However, I still could not solve the training problem with more than two elements (I put the error message in my previous comment).

Thank you for your kind attention,

Harry

@YuuuXie (Collaborator) commented Aug 8, 2024 via email

@rbjiawen commented

Thanks to the developers for making such an excellent code. Would it be possible to provide an additional interface, like GAP's, that uses CUR matrix decomposition for sparsification and then fits the model in one go, rather than adding data frame by frame? (I found that both OTF without uncertainty calculation and the offline methods in the tutorial were slow when fitting more than 4000 crystal structures.) Alternatively, it would be great if the offline-training MPI version could be improved and verified.

@hhlim12 (Author) commented Sep 24, 2024

Hi Harry, you could probably try increasing the amount of initial data, such as the number of frames and the number of sparse points. It could be that the model was initialized with an insufficient number of data points, so it does not distribute correctly, i.e. some processes get an empty set.

Hi @YuuuXie, thanks for the kind reply. I just want to confirm that the error was indeed solved when I increased the initial data, as you suggested. Thanks again for the kind response.

Best,

Harry

@jonpvandermause (Collaborator) commented

Hi @rbjiawen,

We're not actively working on new features, but note that you don't need to refit the model after each structure is added. Slightly modifying the static training example given in this tutorial, you can pull the call to update_matrices_QR, which fits the GP, out of the for loop so that it is only called once:

for m in range(training_size):
  train_struc = training_strucs[m]

  # Add training structure and sparse environments.
  gp_model.add_training_structure(train_struc)
  gp_model.add_all_environments(train_struc)

gp_model.update_matrices_QR()

You're also free to choose how many sparse points you want to include. The fewer you choose, the faster the fitting step will be, at the cost of model accuracy.
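One crude way to do that, reusing only the calls from the snippet above, is to take sparse environments from a subset of the training structures; the stride of 5 below is an arbitrary placeholder:

for m in range(training_size):
  train_struc = training_strucs[m]
  gp_model.add_training_structure(train_struc)

  # Only every 5th structure contributes sparse environments.
  if m % 5 == 0:
    gp_model.add_all_environments(train_struc)

gp_model.update_matrices_QR()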

Hope that helps,
Jon

@rbjiawen commented

Thanks for your reply. Yes, I did it this way: after adding some atomic cluster environments, I updated the covariance matrix, and I am now randomly selecting sparse points to see the results.

@rbjiawen commented Oct 6, 2024

Hi @jonpvandermause,
Thanks for your previous advice.
A "Segmentation fault" now blocks me from using FLARE.
The following are my descriptors, kernels, and sparse GP:

cutoff = 5.8
n_species = 3   # Al Gd Cu
N = 8 
lmax = 6  
radial_basis = "chebyshev"  
cutoff_name = "quadratic"  
radial_hyps = [0, cutoff] 
cutoff_hyps = []
descriptor_settings = [n_species, N, lmax]

B1_descriptor = B1(radial_basis, cutoff_name, radial_hyps, cutoff_hyps,
                 [n_species, 10]) 
B2_descriptor = B2(radial_basis, cutoff_name, radial_hyps, cutoff_hyps,
                 [n_species, N, lmax])
descriptors = [B1_descriptor,B2_descriptor]

power = 2
dot_product_kernel_B1 = NormalizedDotProduct(3.7, power)
dot_product_kernel_B2 = NormalizedDotProduct(2.1, power)

kernels = [dot_product_kernel_B1,dot_product_kernel_B2]

sigma_e = 0.05
sigma_f = 0.1
sigma_s = 0.003

bounds = [(0,5),
          (0,5),
          (0.001,0.04), (0.05, 0.20), (0.0006, 0.03)]

max_iterations = 5  
opt_method = "L-BFGS-B" 
variance_type = "local" 

sparse_gp = SGP_Wrapper(
    kernels,
    descriptors,
    cutoff,
    sigma_e,
    sigma_f,
    sigma_s,
    species_map,
    variance_type=variance_type,
    energy_training=True,
    force_training=True,
    stress_training=True,
    single_atom_energies = single_atom_energies,
    opt_method=opt_method,
    max_iterations=max_iterations,
    bounds = bounds
)

sparse_gp.Kuu_jitter = 1e-8

index_end= 200
train_list=[]

for i, atoms in enumerate(train_structures[:index_end]):
    train_list.append(atoms)
    struc_pp = structure_with_B1_B2_descriptors(atoms)
    sparse_gp.sparse_gp.add_training_structure(struc_pp)  
    sparse_gp.atom_indices.append([-1])  
    sparse_gp.rel_efs_noise.append([1, 1, 1])
    if int(len(atoms)) > 8:
       sparse_size=int(0.2*len(atoms))+7
    else:
       sparse_size=len(atoms)
    sparse_gp.sparse_gp.add_uncertain_environments(struc_pp,[sparse_size,sparse_size])
    if (i+1)<10:
       sparse_gp.sparse_gp.update_matrices_QR()
    else:
       if (i+1) % 5 ==0 :
           print("Current atoms (Index+1): ",i+1,flush=True)
           sparse_gp.sparse_gp.update_matrices_QR()
       if (i+1) % 20 == 0 :
           print(f"Current atoms (index+1):{i+1}, Starting test train error",flush=True)
           test(sparse_gp, train_list, g_add)

Every time the loop reaches more than about one hundred structures, either sparse_gp.sparse_gp.update_matrices_QR() (updating the covariance matrix) or sparse_gp.train() (optimizing the hyperparameters) triggers a segmentation fault in the C++ code.
When I checked the error state, the hyperparameters and likelihood were unstable and the likelihood gradient was large, as shown below, even though the hundred or so structures added previously were fine and their likelihoods and hyperparameters looked fine:

Current atoms (Index+1):  135
Current atoms (Index+1):  140
Current atoms (index+1):140, Starting test train error
Current atoms (Index+1):  145
Current atoms (Index+1):  150
Current atoms (index+1):155, Starting optimize hyperparameter
Precomputing KnK for hyps optimization
Done precomputing. Time: 11.746763229370117
Hyperparameters:
[2.30625121 1.54094759 0.02014179 0.07265018 0.01062098]
Likelihood gradient:
[   26432.67739506    -3060.81510271 -1284895.17996955   208258.85007577
  1136451.06377194]
Likelihood:
-9664.171852322932


Hyperparameters:
[3.30625121 1.5408893  0.02014107 0.0726531  0.01062172]
Likelihood gradient:
[ 1.13019243e+05  1.57413786e+03 -9.31567568e+06  1.87940397e+05
  1.02464055e+06]
Likelihood:
-7333.137032631232


Hyperparameters:
[7.3062512  1.54065611 0.02013817 0.07266481 0.01062465]
Likelihood gradient:
[-1.01099623e+06  1.93722198e+03 -9.34280279e+06  1.59977911e+05
  8.74061306e+05]
Likelihood:
-4335.725554929026


Hyperparameters:
[3.70847762 1.54086585 0.02014078 0.07265428 0.01062201]
Likelihood gradient:
[ 1.56000119e+05  2.44247261e+03 -2.03842259e+07  1.83068748e+05
  9.96269808e+05]
Likelihood:
-6760.620564703135


Hyperparameters:
[6.01300024 1.5407315  0.02013911 0.07266103 0.0106237 ]
Likelihood gradient:
[-3.31156299e+05  2.60441510e+03  1.60053049e+07  1.66153963e+05
  9.03155107e+05]
Likelihood:
-4814.6254093852185


Hyperparameters:
[6.93965683 1.54067748 0.02013844 0.07266374 0.01062438]
Likelihood gradient:
[-2.22363851e+04  1.77844270e+03 -2.49113819e+07  1.61346227e+05
  8.81384447e+05]
Likelihood:
-4371.572751716198


Hyperparameters:
[7.59079914 1.54063934 0.01443556 0.07273944 0.01102778]
Likelihood gradient:
[ 3.64473894e+04  5.81132146e+03 -2.77258055e+08  1.61504156e+05
  8.07922487e+05]
Likelihood:
-4033.861447481686


Hyperparameters:
[9.12484178e+00 1.54054950e+00 1.00000000e-03 7.29177984e-02
 1.19781671e-02]
Likelihood gradient:
[ 8.90959258e+08 -1.82912088e+06 -3.76943731e+12  1.58662691e+05
  6.66949658e+05]
Likelihood:
112774.35683702157

Current atoms (Index+1):  155
/var/spool/slurm/slurmd/job38236/slurm_script: line 17: 255644 Segmentation fault      (core dumped) python test.py

I tried different intervals for optimizing the hyperparameters and updating the covariance matrix, but the error persists. My train_structures should contain no errors, and more than 4,000 structures are included; I can get good results with other codes, e.g. GAP and NNP.
Could you give me some suggestions on how to optimize the hyperparameters or select the sparse atomic environments so that my loop completes? By the way, the error disappears if I replace add_uncertain_environments with adding all environments, but that is very slow.

@jonpvandermause (Collaborator) commented

Can you send me a complete script so I can try to reproduce? For example I'm not sure how structure_with_B1_B2_descriptors is defined.

Two things you might try:

  1. Bump sparse_gp.Kuu_jitter to see if that stabilizes the likelihood.
  2. Monitor your RAM usage while the program is running (see the sketch below). The sparse GP object consumes a lot of memory, and it's possible you're maxing out your machine's resources.
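A quick sketch of both checks inside your training loop; psutil is an extra dependency, and the jitter value is only an example, not a recommendation:

import psutil

sparse_gp.Kuu_jitter = 1e-6  # e.g. bumped up from 1e-8

# Inside the loop over structures:
rss_gb = psutil.Process().memory_info().rss / 1e9
print(f"RSS after structure {i+1}: {rss_gb:.1f} GB", flush=True)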

@rbjiawen commented Oct 6, 2024 via email

@jonpvandermause (Collaborator) commented

I don't see the files. You can paste the entire test script into the comment box, and I'll try to reproduce with my own structures.

@rbjiawen commented Oct 6, 2024

import numpy as np
import matplotlib.pyplot as plt

# flare++ imports
from flare.bffs.sgp import SGP_Wrapper
from flare.bffs.sgp.calculator import SGP_Calculator
from flare.bffs.sgp._C_flare import NormalizedDotProduct, Structure, B1, B2, B3

from ase.calculators.eam import EAM
from ase.visualize import view
from ase.io import read


train_structures = read("train.xyz", index=":")
num_atoms = len(train_structures)

def structure_with_B1_B2_descriptors(atoms, **kwargs):
    coded_species = [species_map[spec] for spec in atoms.numbers]
    # create c++ structure
    struc_pp = Structure(
        atoms.cell,
        coded_species,
        atoms.positions,
        cutoff,
        descriptors
    )

    single_atom_sum = 0
    if single_atom_energies is not None:
        for spec in struc_pp.species:
            single_atom_sum += single_atom_energies[spec]
    corrected_energy = atoms.get_potential_energy() - single_atom_sum
    struc_pp.energy = np.array([[corrected_energy]])
    struc_pp.stresses = - atoms.get_stress()[[0, 5, 4, 1, 3, 2]]
    #print(struc_pp.stresses,flush=True)
    struc_pp.forces = atoms.get_forces().reshape(-1)

    return struc_pp

def compute_mae(atoms, sgp):
    dft_e = atoms.get_potential_energy()
    dft_f = atoms.get_forces()
    dft_s = atoms.get_stress()

    struc_pp = structure_with_B1_B2_descriptors(atoms)
    gp_calc = SGP_Calculator(sgp)
    gp_calc.predict_on_structure(struc_pp)

    sgp_e = gp_calc.get_potential_energy(atoms)
    sgp_f = gp_calc.get_forces(atoms)
    sgp_s = gp_calc.get_stress(atoms)

    mae_e = np.mean(np.abs(dft_e - sgp_e))
    mae_e_peratom = np.mean(np.abs(dft_e - sgp_e))/len(atoms)
    mae_f = np.mean(np.abs(dft_f - sgp_f))
    mae_s = np.mean(np.abs(dft_s - sgp_s))


    return mae_e, mae_e_peratom, mae_f, mae_s

def test(sparse_gp, test_structures, g):
    #pre_test = time.time()

    force_errors = []
    energy_errors = []
    energy_errors_peratom=[]
    stress_errors = []
    likelihood = sparse_gp.likelihood
    for j, test_struc in enumerate(test_structures):
        mae_e,mae_e_peratom, mae_f, mae_s = compute_mae(test_struc, sparse_gp)
        energy_errors.append(mae_e)
        energy_errors_peratom.append(mae_e_peratom)
        force_errors.append(mae_f)
        stress_errors.append(mae_s)

    g.write(f"{i+1},   {likelihood},   {1000*np.mean(force_errors)},   "
        f"{1000*np.mean(energy_errors)},   {1000*np.mean(energy_errors_peratom)},   "
        f"{np.mean(stress_errors)}\n")
    g.flush()

species_map = {12: 0, 39: 1, 30:2}                    

# Define many-body descriptor.
cutoff = 5.8 
n_species = 3 
N = 8  
lmax = 4  
radial_basis = "chebyshev" 
cutoff_name = "quadratic"  
radial_hyps = [0, cutoff] 
cutoff_hyps = []
descriptor_settings = [n_species, N, lmax]

# two body
B1_descriptor = B1(radial_basis, cutoff_name, radial_hyps, cutoff_hyps,
                 [n_species, 10])   # n_species  n_max  arg5 is  cutoffs array, different cutoff for different species


# Define a three-body descriptor object.
B2_descriptor = B2(radial_basis, cutoff_name, radial_hyps, cutoff_hyps,
                 [n_species, 8, 6])

# four body
B3_descriptor = B3(radial_basis,cutoff_name, radial_hyps, cutoff_hyps,
                 [n_species, 6, 3])   #  the l_max < 5

single_atom_energies = {0:0.01111762, 1:-2.08215113, 2:0.00049665} 
descriptors = [B1_descriptor,B2_descriptor]



power = 2
dot_product_kernel_B1 = NormalizedDotProduct(3.7, power)
dot_product_kernel_B2 = NormalizedDotProduct(2.1, power)    #  sigma variable 

K=2
kernels = [dot_product_kernel_B1,dot_product_kernel_B2]


# Define sparse GP.
ave_noa= sum(len(atoms) for atoms in train_structures) / len(train_structures)
print("The average length of atoms",ave_noa,flush=True)
sigma_e = 0.004 * ave_noa
sigma_f = 0.1 
sigma_s = 0.003  


bounds = [(0,5),
          (0,5),
          (0.001,0.04), (0.05, 0.20), (0.0006, 0.03)]

max_iterations = 5
opt_method = "L-BFGS-B"  
variance_type = "local" 

sparse_gp = SGP_Wrapper(
    kernels,
    descriptors,
    cutoff,
    sigma_e,
    sigma_f,
    sigma_s,
    species_map,
    variance_type=variance_type,
    energy_training=True,
    force_training=True,
    stress_training=True,
    single_atom_energies = single_atom_energies,
    opt_method=opt_method,
    max_iterations=max_iterations,
    bounds = bounds
)

sparse_gp.Kuu_jitter = 1e-6


g_add = open(f'train_learning_curve_K{K}_N{N}_L{lmax}_R{cutoff}_P2.txt','w')
g_add.write("frame,   likelihood,   force_mae(meV/Å),   energy_mae(meV),   energy_mae_peratom(meV/atom),   stress_mae(eV/Å3)\n")



# train
index_end=  num_atoms
train_list=[]

hypers_optimize_list=[]

for i, atoms in enumerate(train_structures[:index_end]):
    train_list.append(atoms)
    struc_pp = structure_with_B1_B2_descriptors(atoms)
    sparse_gp.sparse_gp.add_training_structure(struc_pp)  
    sparse_gp.atom_indices.append([-1]) 
    sparse_gp.rel_efs_noise.append([1, 1, 1])
    if int(len(atoms)) > 8:
       sparse_size=int(0.2*len(atoms))+7
    else:
       sparse_size=len(atoms)

    if (i+1) in hypers_optimize_list:
        print(f"Current atoms (index+1):{i+1}, Starting optimize hyperparameter",flush=True)
        sparse_gp.train()

    sparse_gp.sparse_gp.add_uncertain_environments(struc_pp,[sparse_size,sparse_size])
    if (i+1)<10:
       sparse_gp.sparse_gp.update_matrices_QR()
    else:
       if (i+1) % 5 ==0 :
           print("Current atoms (Index+1): ",i+1,flush=True)
           sparse_gp.sparse_gp.update_matrices_QR()
       if (i+1) % 20 == 0 :
           print(f"Current atoms (index+1):{i+1}, Starting test train error",flush=True)
           test(sparse_gp, train_list, g_add)

@rbjiawen commented Oct 6, 2024 via email

@rbjiawen commented Oct 6, 2024

I have attached the files via Gmail, thanks.

@rbjiawen commented Oct 8, 2024

Hi @jonpvandermause,
Another thing I'm not sure about: when I use two (or more) descriptors and kernels, this corresponds to more than one mapped beta-matrix file (lmp.flare). In LAMMPS, does pair_style flare support reading multiple lmp.flare files?

@jonpvandermause (Collaborator) commented Oct 8, 2024

No, pair_style flare doesn't support this, but in principle you may be able to use LAMMPS' hybrid/overlay functionality to combine potentials.

On the seg fault issue: can you give me a bit more information? Do you reliably hit the seg fault when a certain number of structures get added, or is it random? Do you also hit it when only one descriptor is being used? How much RAM does your machine have?

Thanks,
Jon

@rbjiawen commented Oct 8, 2024

  1. The RAM ("free -h"):

              total    used    free    shared    buff/cache    available
     Mem:     251G     10G     239G    838M      2.5G          239G
     Swap:    0B       0B      0B

  2. On the seg fault issue: the error is random, but when I lower sparse_size, it seems that more structures can be added. At the very beginning, I added all the atomic environments, and that seemed to make the model more stable.
    But whenever I optimize the hyperparameters frequently, I basically always get this error.
    I've tried many possibilities and have encountered this error with the B2 descriptor alone as well as with B1+B2+B3.

Thanks!

@rbjiawen commented Oct 8, 2024

When I added all the environments and updated the covariance matrix every 100 steps without optimizing the hyperparameters, 2100 structures were added without any problems, and the training errors were as follows:

frame,   likelihood,   force_mae(meV/Å),   energy_mae(meV),   energy_mae_peratom(meV/atom),   stress_mae(eV/Å3)
100,   2.529049689672762e-52,   7.1662654032310655,   2.080874178066461,   0.3083742393072436,   0.0006211965511722758
200,   2.529049689672762e-52,   63.95955906719454,   20.44542192038009,   5.815188775462623,   0.006406908778709346
300,   2.529049689672762e-52,   110.025878516276,   35.083780420510614,   8.885791084585652,   0.009334833223752883
400,   2.529049689672762e-52,   193.98093660738994,   89.02938683091598,   24.207908134382627,   0.016381608740697925
500,   2.529049689672762e-52,   185.94119840594382,   82.0375207534216,   21.722715427665587,   0.014552824516494093
600,   2.529049689672762e-52,   169.54159060827683,   75.57464928530239,   19.75699580731157,   0.01266391417244684
700,   2.529049689672762e-52,   163.59505840915546,   70.28925459551186,   18.403311348311,   0.011468366796893358
800,   2.529049689672762e-52,   151.79013635029676,   65.89270817489262,   17.384213170394407,   0.010282641622288669
900,   2.529049689672762e-52,   148.2278907852693,   64.79946141597546,   17.133297494694506,   0.009652270362709105
1000,   2.529049689672762e-52,   137.73818660336514,   62.938367789392466,   16.753821908394073,   0.00905860832254719
1100,   2.529049689672762e-52,   132.5126385426517,   60.58443727464342,   16.205118119077127,   0.008390705945029343
1200,   2.529049689672762e-52,   127.74529989271139,   61.41779107521394,   16.04558163334672,   0.008050892834853702
1300,   2.529049689672762e-52,   126.07090870530621,   60.90735320926744,   15.75305422267996,   0.007744445080534072
1400,   2.529049689672762e-52,   123.96954393904919,   62.05062944485583,   15.769959652864026,   0.007445319935496585
1500,   2.529049689672762e-52,   118.37216259662819,   61.89730423482478,   15.54465695017052,   0.0071046276095730505
1600,   2.529049689672762e-52,   115.74543309362268,   61.22275762325991,   15.338709704584183,   0.006871721545158356
1700,   2.529049689672762e-52,   114.54815670484098,   61.03937181659982,   15.22959850574314,   0.006644277223621648
1800,   2.529049689672762e-52,   111.84414782302042,   60.58685187669883,   15.049929261252306,   0.0063436168768693174
1900,   2.529049689672762e-52,   112.63956930652662,   61.80984032574967,   15.310919171670266,   0.0063335205656999535
2000,   2.529049689672762e-52,   110.23441156121223,   62.04392739511916,   15.10610614314543,   0.0061106588404012995
2100,   2.529049689672762e-52,   107.7868650174919,   61.408939988734595,   14.836713080351755,   0.005924881927304646

@rbjiawen commented Oct 8, 2024

The error does not occur only when all environments are added; it does seem to occur when adding environments by uncertainty, even without optimizing the hyperparameters.

@jonpvandermause (Collaborator) commented

Interesting, thanks. Just to confirm - you're running with the current master branch? If not, can you confirm that you still hit this on master?

@rbjiawen commented Oct 8, 2024

I need to confirm this, thanks for the heads up!

@jonpvandermause (Collaborator) commented

@rbjiawen, when you get a chance, can you please email me your .extxyz file? It didn't attach above, probably because it's not one of GitHub's allowed file formats. It will be easier to reproduce the segfault with your exact system. My email address is at the top of the sparse GP tutorial.

@rbjiawen commented

Thanks again, I have sent it to [email protected].

@jonpvandermause (Collaborator) commented

Thanks @rbjiawen. I've reproduced the segfault on my machine, and I believe I've root caused the issue.

To safely make calls to add_uncertain_environments, you need to make sure that the internal matrices of the sparse GP are up to date, since these internal matrices (specifically the L_inv matrix) are needed to evaluate uncertainties.

This is the exact line from sparse_gp.cpp that triggers the segfault:

    Eigen::MatrixXd L_inverse_block =
        L_inv.block(sparse_count, sparse_count, n_clusters, n_clusters);

When L_inv isn't up to date, you are attempting to access a part of the matrix that doesn't exist yet, which gives undefined behavior and occasionally results in a segfault.

If you call update_matrices_QR() after every call to add_uncertain_environments, you should avoid this.
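In terms of your script above, that means keeping the refit inside the loop whenever environments are selected by uncertainty. A minimal sketch using the same variable names:

for atoms in train_structures[:index_end]:
    struc_pp = structure_with_B1_B2_descriptors(atoms)
    sparse_gp.sparse_gp.add_training_structure(struc_pp)
    sparse_gp.atom_indices.append([-1])
    sparse_gp.rel_efs_noise.append([1, 1, 1])

    sparse_size = int(0.2 * len(atoms)) + 7 if len(atoms) > 8 else len(atoms)
    sparse_gp.sparse_gp.add_uncertain_environments(
        struc_pp, [sparse_size, sparse_size]  # one count per kernel
    )
    # Refresh Kuu, L_inv, etc. right away so the next
    # add_uncertain_environments call sees up-to-date matrices.
    sparse_gp.sparse_gp.update_matrices_QR()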

I will update the code to require L_inv to be up to date when add_uncertain_environments is called.

Nice find!

@rbjiawen commented Oct 12, 2024 via email

@jonpvandermause (Collaborator) commented

Yes, that's right. And more generally, before making predictions, the covariance matrices must be current. I've placed assertions in the code in #419 that throw runtime errors if this isn't the case, which should make this easier to catch moving forward.

I'm going to mark this as resolved. Feel free to open another ticket if you encounter other issues.
