
mat_discover's People

Contributors

dependabot[bot], hasan-sayeed, sgbaird, sgbaird-alt, surgearrester


mat_discover's Issues

Google Colab disc.plot error (towards end)

This happens in the "Open In Colab (PyPI)" notebook:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-c8f4453d23be> in <module>()
----> 1 disc.plot()

4 frames
/usr/local/lib/python3.7/dist-packages/mat_discover/mat_discover_.py in plot(self, return_pareto_ind)
   1256                 x: self.val_frac.ravel(),
   1257                 y: self.cluster_avg.ravel(),
-> 1258                 "cluster ID": self.unique_labels,
   1259             }
   1260         )

/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    612         elif isinstance(data, dict):
    613             # GH#38939 de facto copy defaults to False only in non-dict cases
--> 614             mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    615         elif isinstance(data, ma.MaskedArray):
    616             import numpy.ma.mrecords as mrecords

/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in dict_to_mgr(data, index, columns, dtype, typ, copy)
    463 
    464     return arrays_to_mgr(
--> 465         arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
    466     )
    467 

/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype, verify_integrity, typ, consolidate)
    117         # figure out the index, if necessary
    118         if index is None:
--> 119             index = _extract_index(arrays)
    120         else:
    121             index = ensure_index(index)

/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in _extract_index(data)
    633             lengths = list(set(raw_lengths))
    634             if len(lengths) > 1:
--> 635                 raise ValueError("All arrays must be of the same length")
    636 
    637             if have_dicts:

ValueError: All arrays must be of the same length
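
For context, the ValueError is raised by pandas rather than by mat_discover directly: plot() builds a DataFrame from val_frac, cluster_avg, and unique_labels (see the frame above), so those arrays must have ended up with different lengths. Below is a minimal, self-contained reproduction of the pandas behaviour; the array values are made up for illustration.

# Minimal reproduction of the pandas error seen above (illustrative values,
# not mat_discover output): a dict of unequal-length arrays cannot become a DataFrame.
import numpy as np
import pandas as pd

val_frac = np.array([0.2, 0.5, 0.3])      # stand-in for disc.val_frac.ravel()
cluster_avg = np.array([1.0, 2.0, 3.0])   # stand-in for disc.cluster_avg.ravel()
unique_labels = np.array([-1, 0, 1, 2])   # one extra label, e.g. an unmatched "-1" noise label

try:
    pd.DataFrame({"x": val_frac, "y": cluster_avg, "cluster ID": unique_labels})
except ValueError as err:
    print(err)  # -> All arrays must be of the same length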

LLVM error on CPU

Hello,

After fitting and training on my data set, I keep getting this error on my machine:

 The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @jit(inline=inline)

Model architecture: out_dims, d_model, N, heads
3, 512, 3, 4
Running on compute device: cpu
Model size: 11987206 parameters

Generating EDM: 100%|██████████████████████| 598/598 [00:00<00:00, 197402.31formulae/s]
loading data with up to 7 elements in the formula
training with batchsize 128 (2**7.000)
stepping every 50 training passes, cycling lr every 10 epochs
checkin at 20 epochs to match lr scheduler
Epoch: 0/40 --- train mae: 0.0135 val mae: 0.0135
Epoch: 19/40 --- train mae: 0.00895 val mae: 0.00895
Epoch: 39/40 --- train mae: 0.00653 val mae: 0.00653
Saving network (test-property) to models/trained_models/test-property.pth
[train-CrabNet]
Elapsed: 104.44437

Generating EDM: 100%|██████████████████████| 598/598 [00:00<00:00, 161392.05formulae/s]
loading data with up to 7 elements in the formula
Generating EDM: 100%|██████████████████████| 205/205 [00:00<00:00, 123716.88formulae/s]
loading data with up to 7 elements in the formula
val RMSE:  0.017233173185924926
Fitting mod_petti kernel matrix
Constructing distances
LLVM ERROR: Symbol not found: __powidf2
Aborted

I don't have CUDA; I am only using the CPU version. I searched on Google but could not find a solution.

If you know how to resolve this, please let me know. Thanks.
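
One way to narrow this down (a guess on my part, not a confirmed fix) is to check whether numba/llvmlite can compile anything at all on this machine, independently of mat_discover. If the tiny jitted function below aborts with the same "Symbol not found: __powidf2" error, the problem is in the numba/llvmlite installation itself; the printed versions would also help when reporting it upstream.

# Hedged diagnostic: exercise numba's JIT outside of mat_discover.
import numba
import llvmlite
import numpy as np
from numba import njit

print("numba", numba.__version__, "| llvmlite", llvmlite.__version__, "| numpy", np.__version__)

@njit
def powi_check(x):
    # float ** small integer can lower to the powi intrinsic implicated in
    # the error above (an assumption about where the missing symbol comes from)
    return x ** 3

print(powi_check(2.0))  # expected: 8.0 on a healthy install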

Error while using `nearest_neigh_props`

I'm trying to predict using a trained model and am facing the following error:

InvalidParameterError                     Traceback (most recent call last)
Cell In[16], line 13
     11 for i, val_df in enumerate(val_dfs):
     12     jd = jd + 1
---> 13     disc.predict(val_df, umap_random_state=42)
     14     dens_score.append(disc.dens_score_df.sample(frac=1))
     15     peak_score.append(disc.peak_score_df.sample(frac=1))

File c:\Users\hasan\miniconda3\envs\li_ml\lib\site-packages\mat_discover\mat_discover_.py:665, in Discover.predict(self, val_df, plotting, umap_random_state, pred_weight, proxy_weight, dummy_run, count_repeats, return_peak)
    663 # compound-wise scores (i.e. individual compounds)
    664 with self.Timer("nearest-neighbor-properties"):
--> 665     self.rad_neigh_avg_targ, self.k_neigh_avg_targ = nearest_neigh_props(
    666         self.dm, pred, n_neighbors=self.n_peak_neighbors
    667     )
    668     self.val_rad_neigh_avg = self.rad_neigh_avg_targ[val_ids]
    669     self.val_k_neigh_avg = self.k_neigh_avg_targ[val_ids]

File c:\Users\hasan\miniconda3\envs\li_ml\lib\site-packages\mat_discover\utils\nearest_neigh.py:48, in nearest_neigh_props(X, target, r_strength, radius, n_neighbors, metric, **NN_kwargs)
      6 def nearest_neigh_props(
      7     X,
      8     target,
   (...)
     13     **NN_kwargs,
     14 ):
...
     98     f"The {param_name!r} parameter of {caller_name} must be"
     99     f" {constraints_str}. Got {param_val!r} instead."
    100 )

InvalidParameterError: The 'radius' parameter of NearestNeighbors must be a float in the range [0, inf] or None. Got -0.28528690338134766 instead.
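
For what it's worth, the exception itself comes from scikit-learn's parameter validation: NearestNeighbors only accepts radius >= 0 (or None), so somewhere upstream a negative radius is being derived. The snippet below reproduces the constraint and shows the kind of guard that would sidestep it; it is illustrative only, not a claim about how nearest_neigh_props actually computes its radius.

# Minimal reproduction of the scikit-learn constraint (sklearn >= 1.2,
# where parameter validation runs at fit time).
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(10, 3)

try:
    NearestNeighbors(radius=-0.285).fit(X)  # same failure mode as in the traceback
except ValueError as err:  # InvalidParameterError subclasses ValueError
    print(err)

# A defensive guard of the kind that would avoid the error:
radius = -0.28528690338134766
NearestNeighbors(radius=max(radius, 0.0)).fit(X)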

FileNotFoundError while splitting val_df into chunks

import numpy as np
from os.path import join

val_dfs = np.array_split(val_df, 3)

print("##########chunking done##########")

# %% directories
table_dirs = [join("tables", val) for val in ["val1", "val2", "val3"]]
figure_dirs = [join("figures", val) for val in ["val1", "val2", "val3"]]
# %% fit, predict, plot
disc.fit(train_df)

print("##########fitting done##########")

for i, val_df in enumerate(val_dfs):
    # change output dirs
    disc.table_dir = table_dirs[i]
    disc.figure_dir = figure_dirs[i]

    # %% predict
    score = disc.predict(val_df, umap_random_state=42)

    print("##########predictions done##########")

    # %% plot and save
    disc.plot()

This is code that I got from @mliu7051 to help me split my validation data into chunks in DiSCoVeR. Fitting and predicting work, but I get the following error when running disc.plot():

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-32-7f8d2c805a2e> in <module>()
     16 
     17     # %% plot and save
---> 18     disc.plot()
     19     disc.save(dummy=dummy)
     20 

7 frames
/usr/local/lib/python3.7/dist-packages/mat_discover/mat_discover_.py in plot(self, return_pareto_ind)
   1318             Pareto front indices for the peak and density proxies, respectively.
   1319         """
-> 1320         fig, pk_pareto_ind = self.pf_peak_proxy()
   1321         fig, frac_pareto_ind = self.pf_train_contrib_proxy()
   1322 

/usr/local/lib/python3.7/dist-packages/mat_discover/mat_discover_.py in pf_peak_proxy(self)
   1544             x_unit=self.target_unit,
   1545             y_unit=self.target_unit,
-> 1546             use_plotly_offline=self.use_plotly_offline,
   1547         )
   1548 

/usr/local/lib/python3.7/dist-packages/mat_discover/utils/pareto.py in pareto_plot(df, x, y, color, x_unit, y_unit, color_unit, hover_data, fpath, reverse_x, parity_type, pareto_front, color_continuous_scale, color_discrete_map, xrange, use_plotly_offline)
    214 
    215     if fpath is not None:
--> 216         fig.write_html(fpath + ".html")
    217 
    218     fig, scale = matplotlibify(fig)

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in write_html(self, *args, **kwargs)
   3706         import plotly.io as pio
   3707 
-> 3708         return pio.write_html(self, *args, **kwargs)
   3709 
   3710     def to_image(self, *args, **kwargs):

/usr/local/lib/python3.7/dist-packages/plotly/io/_html.py in write_html(fig, file, config, auto_play, include_plotlyjs, include_mathjax, post_script, full_html, animation_opts, validate, default_width, default_height, auto_open, div_id)
    534     # Write HTML string
    535     if path is not None:
--> 536         path.write_text(html_str)
    537     else:
    538         file.write(html_str)

/usr/lib/python3.7/pathlib.py in write_text(self, data, encoding, errors)
   1238             raise TypeError('data must be str, not %s' %
   1239                             data.__class__.__name__)
-> 1240         with self.open(mode='w', encoding=encoding, errors=errors) as f:
   1241             return f.write(data)
   1242 

/usr/lib/python3.7/pathlib.py in open(self, mode, buffering, encoding, errors, newline)
   1206             self._raise_closed()
   1207         return io.open(self, mode, buffering, encoding, errors, newline,
-> 1208                        opener=self._opener)
   1209 
   1210     def read_bytes(self):

/usr/lib/python3.7/pathlib.py in _opener(self, name, flags, mode)
   1061     def _opener(self, name, flags, mode=0o666):
   1062         # A stub for the opener argument to built-in open()
-> 1063         return self._accessor.open(self, flags, mode)
   1064 
   1065     def _raw_open(self, flags, mode=0o777):

FileNotFoundError: [Errno 2] No such file or directory: 'figures/val1/pf-peak-proxy.html'

I know it has something to do with the directories I created; I just need plot() to recognize all three chunks.
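
My best guess (an assumption, since I haven't traced plot() fully) is that the figures are written into disc.figure_dir but the directory is never created, so creating the per-chunk output folders up front may be all that's missing. The paths below mirror the ones in the snippet above.

# Hedged fix: create the output directories before the predict/plot loop so
# that fig.write_html(...) has somewhere to write.
import os
from os.path import join

table_dirs = [join("tables", val) for val in ["val1", "val2", "val3"]]
figure_dirs = [join("figures", val) for val in ["val1", "val2", "val3"]]

for d in table_dirs + figure_dirs:
    os.makedirs(d, exist_ok=True)  # no-op if the directory already exists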

add facilities for crystal structures and polymers

Feature request

Requires valid distance metrics for crystal structures and polymers that encode chemo-structural novelty and polymeric novelty, respectively, as well as structure-based regression models. After that, it's just some basic plumbing.

Additional performance/novelty metric

Great suggestion by @computron after @sp8rks's Materials Project seminar:

At the end you mentioned a few metrics to test whether your "novel" compounds were actually novel. I would also suggest something like a min-max distance from top performers. For example, you predict compound X. Take the distance of compound X from the top 20 known good performers and see which known good performer is the closest. The most novel compounds are the ones with the maximum distance from any known top performer.
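
A rough sketch of that metric (my reading of the suggestion, not an existing mat_discover feature): for each candidate, compute the distance to its nearest compound among the top-k known performers, then rank candidates by that value in descending order, so the most novel candidates are those farthest from every known top performer. The feature matrices and distance metric below are placeholders.

# Hedged sketch of the suggested "min-max" (maximin) novelty metric.
# candidate_feats (n x d) and top_feats (k x d) are hypothetical featurizations.
import numpy as np
from scipy.spatial.distance import cdist

def maximin_novelty(candidate_feats, top_feats, metric="euclidean"):
    dists = cdist(candidate_feats, top_feats, metric=metric)  # n x k distances
    return dists.min(axis=1)  # distance to the closest known top performer

rng = np.random.default_rng(0)
scores = maximin_novelty(rng.random((100, 8)), rng.random((20, 8)))
ranking = np.argsort(scores)[::-1]  # most novel (farthest from all top performers) first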

Conda installation: Unsatisfiable Error

Hi!
I'd like to report a problem with the conda installation. After creating a new conda environment and launching the installation with

conda update conda
conda create --name mat_discover python==3.9.*
conda activate mat_discover
conda install -c sgbaird mat_discover

I get the following output:

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: |
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Any ideas about what may be going on?
Thank you!

'Discover' object has no attribute 'dens_score_df'

Hi!
I'd like to open a fitted model in order to explore its clusters, and I found the disc.load() method.
I load the model with

>>> disc = Discover()
>>> disc.load(fpath='C:/Users/anton/OneDrive/Desktop/matdisc_colab/formula_discharge vs capacity_grav/disc.pkl')
<mat_discover.mat_discover_.Discover object at 0x0000015006F56D90>

but when I try to use methods like disc.merge() or disc.cluster_avg(), it appears that the Discover object lacks several attributes.
Some examples:

>>> disc.cluster_avg()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Discover' object has no attribute 'cluster_avg'
>>> disc.merge()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\anton\miniconda3\envs\matdisc\lib\site-packages\mat_discover\mat_discover_.py", line 848, in merge
    dens_score_df = self.dens_score_df.rename(columns={"score": "Density Score"})
AttributeError: 'Discover' object has no attribute 'dens_score_df'
>>> disc.compute_log_density()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\anton\miniconda3\envs\matdisc\lib\site-packages\mat_discover\mat_discover_.py", line 1261, in compute_log_density
    r_orig = self.std_r_orig
AttributeError: 'Discover' object has no attribute 'std_r_orig'

The folder where disc.pkl is stored contains the three folders "figures", "models", and "tables", plus disc.pkl itself. The same occurs for other folders structured similarly.
Am I doing something wrong here, or is there a possible bug in the load() method?
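
Two possibilities I haven't been able to rule out (stated as assumptions, not facts about the API) are that load() returns the restored object rather than filling disc in place, and that dens_score_df, cluster_avg, and std_r_orig are only populated during predict(). A minimal sketch covering both:

# Hedged sketch, not a confirmed fix.
from mat_discover.mat_discover_ import Discover

disc = Discover()
loaded = disc.load(fpath="disc.pkl")  # capture the return value, in case load() returns the object
if loaded is not None:
    disc = loaded

# If the score DataFrames are only built during predict(), they would need
# to be regenerated before calling merge() or the plotting methods:
# disc.predict(val_df, umap_random_state=42)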

Adaptive design with mat_discover

Feature request

mat_discover can be adapted into an adaptive design scheme by iteratively moving the top-ranked compound from the validation set to the training set and repeating the fit/predict. This is costly, as producing a top-10 list will now take 10x longer. While this could be mitigated by using algorithms that allow for "online" updating of the training data (probably out-of-scope for mat_discover), the extra computational cost is probably worth it for any kind of low-throughput simulation or experiment-based adaptive design scheme. It could also take the form of adding new simulation results to mat_discover and re-running.

The former would be an implementation within mat_discover accompanied by a docs example, while the latter would probably exist only as a docs example.
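
A bare-bones version of that loop might look like the sketch below. This is only an illustration of the idea, not existing mat_discover functionality; the dens_score_df attribute, its "formula" column, its sort order, and the train_df/val_df DataFrames are assumptions carried over from the other issues here.

# Hedged sketch of the proposed adaptive-design loop: fit/predict, then move
# the single top-ranked validation compound into the training set and repeat.
import pandas as pd
from mat_discover.mat_discover_ import Discover

disc = Discover()
suggestions = []

for _ in range(10):  # e.g. build a "top-10" list one compound at a time
    disc.fit(train_df)
    disc.predict(val_df, umap_random_state=42)
    best = disc.dens_score_df.iloc[0]  # assumed: top-ranked candidate sits in row 0
    suggestions.append(best)
    # pretend the candidate was "measured" and move it into the training data
    mask = val_df["formula"] == best["formula"]
    train_df = pd.concat([train_df, val_df[mask]], ignore_index=True)
    val_df = val_df[~mask].reset_index(drop=True)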

add unique elements and unique templates from extraordinary compounds as a measure of both performance + novelty

For the extraordinary adaptive design study/example, the performance and novelty metrics are separate. In other words, it only shows that high-performing materials are being found and that novel materials are being found. It could be finding high-performing, traditional materials and low-performing, novel materials while never finding high-performing, novel compounds.

To address that, consider plotting additional rows:

  1. number of unique elements added by the extraordinary compounds that are discovered (novelty + proxy)
  2. the number of unique templates added by the extraordinary compounds that are discovered (novelty + proxy)

Also:

  1. number of unique elements that are added during the addition of a unique template (novelty)
  2. same as (1) directly above but with the additional constraint that it was also an extraordinary compound (novelty + proxy)

With so many rows, it might be nice to have an interactive figure with a dropdown or a few dropdowns.

See https://github.com/sparks-baird/mat_discover/blob/main/examples/adaptive_design_compare.py
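
As a concrete starting point for metric (1) in the first list above, the sketch below counts how many new chemical elements the discovered extraordinary compounds contribute, iteration by iteration. The formula lists and the crude element-extraction helper are illustrative assumptions, and the "unique templates" version would need an analogous prototype-labelling step.

# Hedged sketch: cumulative count of unique elements contributed by the
# extraordinary compounds discovered at each adaptive-design iteration.
import re

def elements_in(formula):
    # crude element extraction from a formula string, e.g. "Nb3Sn" -> {"Nb", "Sn"}
    return set(re.findall(r"[A-Z][a-z]?", formula))

# one list of discovered extraordinary formulas per iteration (stand-in data)
discovered = [["Nb3Sn"], ["MgB2", "Nb3Ge"], ["YBa2Cu3O7"]]

seen = set()
unique_element_counts = []
for formulas in discovered:
    for formula in formulas:
        seen |= elements_in(formula)
    unique_element_counts.append(len(seen))

print(unique_element_counts)  # -> [2, 5, 9]; one value per plotted iteration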

Aside: There's also the question of whether "top 2% of performers" is too tight a constraint given the additional constraints above.

Numpy inconsistency in mat_discover

Hi!
After pip-installing mat_discover in a new conda environment, I found the following potential problem:

>>> from mat_discover.mat_discover_ import Discover   
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\anton\anaconda3\envs\mat-disc\lib\site-packages\mat_discover\mat_discover_.py", line 36, in <module>
    import umap
  File "C:\Users\anton\anaconda3\envs\mat-disc\lib\site-packages\umap\__init__.py", line 2, in <module>
    from .umap_ import UMAP
  File "C:\Users\anton\anaconda3\envs\mat-disc\lib\site-packages\umap\umap_.py", line 28, in <module>
    import numba
  File "C:\Users\anton\anaconda3\envs\mat-disc\lib\site-packages\numba\__init__.py", line 200, in <module>
    _ensure_critical_deps()
  File "C:\Users\anton\anaconda3\envs\mat-disc\lib\site-packages\numba\__init__.py", line 140, in _ensure_critical_deps
    raise ImportError("Numba needs NumPy 1.21 or less")
ImportError: Numba needs NumPy 1.21 or less

I then downgraded my numpy to 1.21.0 via pip install and was shown the following error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
matminer 0.7.6 requires numpy>=1.22.0, but you have numpy 1.21.0 which is incompatible.

I am currently on numpy 1.21.0, and everything seems to work, but perhaps this incompatibility is worth a thought?
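
In case it helps with debugging version pins like this, the short check below prints the versions involved and whether numba imports cleanly in the current environment; the compatible NumPy range changes between numba releases, so this is only a diagnostic, not a statement of which pin is correct.

# Hedged diagnostic: report the installed versions involved in the conflict
# and whether numba's import-time NumPy check passes in this environment.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["numpy", "numba", "matminer", "mat_discover"]:
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")

try:
    import numba  # raises ImportError if the installed NumPy is unsupported
    print("numba imports cleanly")
except ImportError as err:
    print("numba import failed:", err)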
