suncat-center / catlearn

A machine learning environment for atomic-scale modeling in surface science and catalysis.

Home Page: http://catlearn.readthedocs.io/

License: GNU General Public License v3.0

Python 99.80% Shell 0.02% Dockerfile 0.18%
atomistic-machine-learning catalysis catalyst computational-chemistry machine-learning materials-informatics materials-science nanotechnology python

catlearn's People

Contributors

dependabot[bot], doylead, graph-theory-natcatal, ikowalec, jagarridotorres, jianglst, mamunm, mhangaard, mhoffman, pcjennings, raulf2012, schlexer, vieri2006, vladislavivanistsev, ziyun-wang


catlearn's Issues

Update Gradients Tutorials

The gradients tutorials need updating to Jupyter notebook format, with some additional discussion of what is going on and what to expect.

PLOTNEB problem with plot's text positional requirement

I am currently testing the tutorials, in particular tutorials/11_NEB/04_CO_Cu111/nebCO.py.
Everything ran properly except the PLOTNEB module.
I encountered the following output:

The ML-NEB algorithm required  19.636363636363637 times less number of function evaluations than the standard NEB algorithm.
Energy barrier: 0.05631016623911789 eV
Traceback (most recent call last):
  File "/home/krojas/student/rai/catlearn_test/test03/nebCO.py", line 137, in <module>
    plotneb(trajectory='ML-NEB.traj', view_path=False)
  File "/home/krojas/APPS/mambaforge/envs/catlearn/lib/python3.10/site-packages/catlearn/optimize/tools.py", line 50, in plotneb
    ax.annotate(s=str(np.round(e_barrier, 3))+' eV',
TypeError: Axes.annotate() missing 1 required positional argument: 'text'

Method to replicate:

  1. Create a conda environment: conda create -n pycatlearn python=3.10 catlearn
  2. Download and run tutorials/11_NEB/04_CO_Cu111/nebCO.py

May I ask how to fix this?
Should I specify the matplotlib version?
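
For what it's worth, the error comes from Matplotlib 3.x, which renamed the first argument of Axes.annotate() from s to text and later removed the s= keyword, so the call in catlearn/optimize/tools.py no longer matches the current signature. Below is a minimal sketch of the fixed call (the barrier value and coordinates are illustrative); patching tools.py to pass the string positionally, or pinning an older Matplotlib (e.g. matplotlib<3.3), should both avoid the TypeError.

import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, just to check the call
import matplotlib.pyplot as plt

e_barrier = 0.056  # illustrative value; the real one is computed inside plotneb

fig, ax = plt.subplots()
# Matplotlib 3.x expects the annotation string as the positional `text` argument
# (the old `s=` keyword was removed), so this call succeeds:
ax.annotate(str(np.round(e_barrier, 3)) + ' eV', xy=(0.5, 0.5))
fig.savefig('annotate_check.png')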

Force prediction in the Gaussian process model

Hi all,

I have noticed that CatLearn includes both energy and forces in the training of a Gaussian process model, but only predicts energy from the GP model. The predicted forces are computed using finite differences according to Phys. Rev. Lett. 122, 156001 (2019). However, predicting forces directly from the GP model is also straightforward once forces are included in the training, just as in J. Chem. Phys. 147, 152720 (2017). Why not predict forces directly from the GP model in CatLearn? Is there any benefit to using the finite-difference approach?

Best,
Zeyuan

CI Docs

It would be good if the CI could generate the sphinx docs on-the-fly so we didn't have to keep updating things. It would probably be as simple as calling:

sphinx-apidoc -o docs catlearn

within an additional post-build step of the CI.

MLNeb hangs with latest ASE from git

The tutorial described in 11_NEB/00_Tutorial/Tutorial_MLNEB.ipynb hangs when using the latest ASE git head due to changes in the Dynamics class located in ase/optimize/optimize.py. Specifically, this loop can repeat forever:

while ml_converged is False:
    # Save prev. positions:
    prev_save_positions = []
    for i in self.images:
        prev_save_positions.append(i.get_positions())
    neb_opt.run(fmax=(fmax * 0.85), steps=1)
    get_results_predicted_path(self)
    unc_ml = np.max(self.uncertainty_path[1:-1])
    e_ml = np.max(self.e_path[1:-1])
    if e_ml >= self.max_target + 0.2:
        for i in range(0, self.n_images):
            self.images[i].positions = prev_save_positions[i]
        if self.fullout is True:
            print('Pred. energy above max. energy. '
                  'Early stop.')
        ml_converged = True
    if unc_ml >= max_step:
        for i in range(0, self.n_images):
            self.images[i].positions = prev_save_positions[i]
        if self.fullout is True:
            print('Maximum uncertainty reach. Early stop.')
        ml_converged = True
    if neb_opt.converged():
        ml_converged = True
    n_steps_performed = neb_opt.__dict__['nsteps']
    if np.isnan(ml_neb.emax):
        sp = str(-self.n_images) + ':'
        self.images = read('./all_predicted_paths.traj', sp)
        for i in self.images:
            i.get_potential_energy()
        n_steps_performed = 10000
    if n_steps_performed > ml_steps-1:
        if self.fullout is True:
            print('Not converged yet...')
        ml_converged = True

This happens when neb_opt doesn't immediately converge, because neb_opt.run(..., steps=1) now returns before performing any steps if neb_opt.max_steps has been reached. Additionally, neb_opt.nsteps isn't incremented by neb_opt.run(...) beyond max_steps, so the bailout condition of n_steps_performed > ml_steps-1 never evaluates to True.

A simple workaround would be to manually set neb_opt.nsteps = 0 immediately before neb_opt.run(..). There's probably a more elegant way of telling ASE's NEB class to iterate once, but that would require more changes to the code, and the workaround I describe seems to work for me.
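
For reference, a minimal sketch of that workaround as applied inside the loop quoted above (assuming the rest of the loop is unchanged):

    # Workaround sketch: reset the optimizer's step counter each iteration so that
    # run(..., steps=1) always performs one step, even after ASE's Dynamics class
    # would otherwise consider max_steps reached.
    neb_opt.nsteps = 0
    neb_opt.run(fmax=(fmax * 0.85), steps=1)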

As an aside, I don't understand why CatLearn accesses nsteps through neb_opt's __dict__ attribute (L407). Is there a reason for this atypical access pattern?

Convergence issue with newer ASE

Workaround: ML-NEB is still stable and compatible with ASE 3.17.0.

The two latest stable ASE releases, however, break ML-NEB, causing each iteration to slow down dramatically and possibly preventing convergence.

Help is wanted in identifying the bug.

sklearn deprecated Imputer breaking `clean_data.py` module

The most recent version of sklearn (0.22) has removed the Imputer class from sklearn.preprocessing (previously implemented in preprocessing.imputation), and as a result the following traceback is raised:

~/TEMP/CatLearn/catlearn/preprocess/clean_data.py in <module>
      2 import numpy as np
      3 from collections import defaultdict
----> 4 from sklearn.preprocessing import Imputer
      5 from scipy.stats import skew
      6 

ImportError: cannot import name 'Imputer'

The following deprecation message is located in the old preprocessing.imputation file:

@deprecated("Imputer was deprecated in version 0.20 and will be "                      
    "removed in 0.22. Import impute.SimpleImputer from "                       
    "sklearn instead.")                

It looks like simply replacing Imputer with SimpleImputer would be sufficient, but we should make sure that these classes are in fact equivalent before fixing this.
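
The two classes are close but not identical: SimpleImputer lives in sklearn.impute, defaults to np.nan for missing_values, and no longer has an axis argument. A guarded import keeps both old and new scikit-learn working; a sketch, assuming clean_data.py only relies on the shared strategy/fit/transform interface:

import numpy as np

# SimpleImputer replaces the Imputer class removed in scikit-learn 0.22.
try:
    from sklearn.impute import SimpleImputer as Imputer
except ImportError:  # scikit-learn < 0.20
    from sklearn.preprocessing import Imputer

# Shared interface: mean imputation of missing values.
data = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 6.0]])
clean = Imputer(strategy='mean').fit_transform(data)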

Evaluate diagonal only on predict std

Predicting the mean and covariance on a test set X_test scales as N**2, because we construct K(X_test, X_test). When X_test is large, we would prefer not to calculate the full covariance matrix, but only the standard deviation based on its diagonal (see gpfunctions.uncertainty).
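
For reference, a sketch of the diagonal-only evaluation (the names below are hypothetical, not CatLearn's internal ones): only the prior variance of each test point and a row-wise quadratic form are needed, so the N x N test covariance is never formed.

import numpy as np

def predict_std_diag(ktb, cinv, kxx_diag):
    """Return only the predictive standard deviation (diagonal of the covariance).

    ktb      : (N_test, N_train) kernel between test and training data (hypothetical name)
    cinv     : (N_train, N_train) inverse training covariance (hypothetical name)
    kxx_diag : (N_test,) prior variance k(x*, x*) of each test point (hypothetical name)
    """
    # Row-wise quadratic form: diag(ktb @ cinv @ ktb.T) without building the N_test x N_test matrix.
    quad = np.einsum('ij,jk,ik->i', ktb, cinv, ktb)
    var = np.maximum(kxx_diag - quad, 0.0)  # clip tiny negative values from round-off
    return np.sqrt(var)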

VASP internal relaxation for CatLearn NEB

Hi all,
I am just wondering if it is possible to run CatLearn ML-NEB without using the ASE VASP calculator, instead using VASP's internal relaxation (i.e. optimizing the initial and final end-points within VASP).

Update Docstrings

  • Add docstrings to all functions.
  • Make sure every docstring has a Returns section.
  • Document class attributes in the docstrings.

Question: Correct parallelization usage

Hi, I would like to ask how to run the CatLearn code with proper parallelization.

I tested CatLearn and compared it with the traditional neb.x code of Quantum ESPRESSO.
With the same system and number of images, CatLearn (single node, 64 cores) and neb.x (5 nodes, 64 cores each, image-parallelized) take the same wall time, which means CatLearn is more efficient with resources.

I would like to expand on this by using more nodes for the DFT calculation, say 5 nodes for one DFT evaluation.
When I run CatLearn this way (5 nodes, 64 cores each), the calculation becomes rather slow.
The 5-node setup is applied to the DFT calculation via ASE_ESPRESSO_COMMAND.
I think the bottleneck may be how CatLearn handles the parallelization across 5 nodes (at least the automatic treatment does not seem correct).

May I ask how to do this properly?
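
For reference, a hedged sketch of keeping the multi-node MPI launch confined to the DFT step: the pw.x command goes into ASE_ESPRESSO_COMMAND (the node and core counts below are purely illustrative), while the CatLearn driver script itself is started as a single serial Python process rather than under mpirun.

import os

# Illustrative only: run each DFT evaluation on 5 nodes x 64 cores,
# using ASE's PREFIX.pwi/PREFIX.pwo convention for the Espresso calculator.
os.environ['ASE_ESPRESSO_COMMAND'] = (
    'mpirun -np 320 pw.x -in PREFIX.pwi > PREFIX.pwo'
)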

Error in docs on featurizing

Just a small error, but the docs say that one should use:

from catlearn.fingerprint.setup import FeatureGenerator

whereas the FeatureGenerator class now seems to live in catlearn.featurize.setup.
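
A guarded import keeps scripts written against either layout working (a sketch, assuming only the module path changed):

# FeatureGenerator moved from catlearn.fingerprint.setup to catlearn.featurize.setup.
try:
    from catlearn.featurize.setup import FeatureGenerator
except ImportError:  # older CatLearn releases
    from catlearn.fingerprint.setup import FeatureGenerator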

Parallel Testing

There are some issues with parallelism in Python 2.7 with the adsorbate fingerprinting, and possibly elsewhere. This is specific to 2.7 and does not affect Python 3+.

In general I think it would be best for tests to be run with nprocs=None to pick up these errors in any code with parallelism.

TravisCI has a server for parallel testing that could be used, I think, but specific tests would need to be written for this. Otherwise, I think we only have access to a single core by default, so even when nprocs=None everything is still run in serial.

Highly pedantic `requirements.txt`

Right now the requirements.txt looks as follows:

ase==3.16.0
click==6.7
cycler==0.10.0
decorator==4.3.0
flask==1.0.2
h5py==2.7.1
itsdangerous==0.24
jinja2==2.10
kiwisolver==1.0.1
markupsafe==1.0
matplotlib==2.2.2
networkx==2.1.0
numpy==1.14.3
pandas==0.23.0 
pyparsing==2.2.0
python-dateutil==2.7.3
pytz==2018.4
scikit-learn==0.19.1
scipy==1.1.0
six==1.11.0
tqdm==4.23.3
werkzeug==0.14.1

This means that, in order to install CatLearn with pip install catlearn, my system needs to match all of these packages down to the patch level, or pip will refuse to install it. Case in point: if I upgrade my numpy today I get version 1.14.4, but pip then refuses to install CatLearn because it thinks CatLearn requires exactly 1.14.3. This leaves me with two options: either downgrade all my other packages to match CatLearn's pins exactly (and potentially break other packages), or escape into a virtualenv or Docker image to spin up exactly those versions.

Would it be possible to relax some of these version numbers using >= or ~=? >= requires a version greater than or equal to the stated number and allows trailing digits to be omitted, so numpy>=1.4 would accept anything greater than or equal to 1.4.0. ~= is the compatible-release specifier, which is one step looser. To quote the essential part of the pip documentation:

Mopidy-Dirble ~= 1.1        # Compatible release. Same as >= 1.1, == 1.*

The pip documentation lists the different possible version specifiers: https://pip.pypa.io/en/stable/reference/pip_install/#example-requirements-file . Better yet, unless there is a specific reason, I would never pin the third (patch) version number at all: assuming the dependency sticks to semantic versioning, patch releases only fix bugs and do not break backwards compatibility.

Tests generate 200000 PendingDeprecationWarnings, which causes failure.

This is due to use of np.mat or np.matrix in ASE. ASE has fixed this in the master branch, but the warnings will crash our tests until the next ASE release.

All warnings were supposed to be filtered to "once" or "ignore", but unfortunately either unittest or pytest overrides this.

MLMIN bug with ASE 3.19

There is a major bug in MLMIN when used with ASE 3.19. It behaves like the previous NEB bug, if I'm not mistaken.
The bug: the geometry optimization iteration using GPR (line 283) does not work well.

I compared runs with ASE 3.17 and ASE 3.19; the one that optimizes quickly is the ASE 3.17 run.
[Screenshots from 2020-07-05 comparing the two optimization runs]

Requirements for PyPI.

The setup.py file currently has the requirements defined.

    install_requires=['ase==3.16.0',
                      'h5py==2.7.1',
                      'networkx==2.1.0',
                      'numpy==1.14.2',
                      'pandas==0.22.0',
                      'pytest-cov==2.5.1',
                      'scikit-learn==0.19.1',
                      'scipy==1.0.1',
                      'tqdm==4.20.0',
                      ],

This is a bad idea as they need to be kept updated along with the requirements.txt file. At some point it is highly likely these will diverge.

There needs to be a way to automatically parse requirements.txt when setup.py is run that is also compatible with the uploaded PyPI package.
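
One possible approach (a sketch, not the repository's current setup.py) is to read requirements.txt at build time; note that the file then has to be shipped with the sdist, e.g. via MANIFEST.in, for the uploaded PyPI package to install.

# setup.py sketch: derive install_requires from requirements.txt instead of duplicating pins.
from pathlib import Path
from setuptools import setup, find_packages

requirements = [
    line.strip()
    for line in Path('requirements.txt').read_text().splitlines()
    if line.strip() and not line.startswith('#')
]

setup(
    name='CatLearn',
    packages=find_packages(),
    install_requires=requirements,
    # plus the remaining metadata from the existing setup.py
)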
