
mat_discover's People

Contributors

dependabot[bot], hasan-sayeed, sgbaird, sgbaird-alt, surgearrester


mat_discover's Issues

Google Colab disc.plot error (towards end)

This happens in the "Open In Colab (PyPI)" notebook:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-c8f4453d23be> in <module>()
----> 1 disc.plot()

4 frames
/usr/local/lib/python3.7/dist-packages/mat_discover/mat_discover_.py in plot(self, return_pareto_ind)
   1256                 x: self.val_frac.ravel(),
   1257                 y: self.cluster_avg.ravel(),
-> 1258                 "cluster ID": self.unique_labels,
   1259             }
   1260         )

/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    612         elif isinstance(data, dict):
    613             # GH#38939 de facto copy defaults to False only in non-dict cases
--> 614             mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    615         elif isinstance(data, ma.MaskedArray):
    616             import numpy.ma.mrecords as mrecords

/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in dict_to_mgr(data, index, columns, dtype, typ, copy)
    463 
    464     return arrays_to_mgr(
--> 465         arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
    466     )
    467 

/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype, verify_integrity, typ, consolidate)
    117         # figure out the index, if necessary
    118         if index is None:
--> 119             index = _extract_index(arrays)
    120         else:
    121             index = ensure_index(index)

/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in _extract_index(data)
    633             lengths = list(set(raw_lengths))
    634             if len(lengths) > 1:
--> 635                 raise ValueError("All arrays must be of the same length")
    636 
    637             if have_dicts:

ValueError: All arrays must be of the same length
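
For context, the ValueError is raised by pandas rather than by mat_discover directly: plot() builds a DataFrame from val_frac, cluster_avg, and unique_labels (see the frame above), so those arrays must have ended up with different lengths. Below is a minimal, self-contained reproduction of the pandas behaviour; the array values are made up for illustration.

# Minimal reproduction of the pandas error seen above (illustrative values,
# not mat_discover output): a dict of unequal-length arrays cannot become a DataFrame.
import numpy as np
import pandas as pd

val_frac = np.array([0.2, 0.5, 0.3])      # stand-in for disc.val_frac.ravel()
cluster_avg = np.array([1.0, 2.0, 3.0])   # stand-in for disc.cluster_avg.ravel()
unique_labels = np.array([-1, 0, 1, 2])   # one extra label, e.g. an unmatched "-1" noise label

try:
    pd.DataFrame({"x": val_frac, "y": cluster_avg, "cluster ID": unique_labels})
except ValueError as err:
    print(err)  # -> All arrays must be of the same length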

LLVM error on CPU

Hello,

After fitting and training on my data set, I keep getting this error on my machine:

 The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @jit(inline=inline)

Model architecture: out_dims, d_model, N, heads
3, 512, 3, 4
Running on compute device: cpu
Model size: 11987206 parameters

Generating EDM: 100%|██████████████████████| 598/598 [00:00<00:00, 197402.31formulae/s]
loading data with up to 7 elements in the formula
training with batchsize 128 (2**7.000)
stepping every 50 training passes, cycling lr every 10 epochs
checkin at 20 epochs to match lr scheduler
Epoch: 0/40 --- train mae: 0.0135 val mae: 0.0135
Epoch: 19/40 --- train mae: 0.00895 val mae: 0.00895
Epoch: 39/40 --- train mae: 0.00653 val mae: 0.00653
Saving network (test-property) to models/trained_models/test-property.pth
[train-CrabNet]
Elapsed: 104.44437

Generating EDM: 100%|██████████████████████| 598/598 [00:00<00:00, 161392.05formulae/s]
loading data with up to 7 elements in the formula
Generating EDM: 100%|██████████████████████| 205/205 [00:00<00:00, 123716.88formulae/s]
loading data with up to 7 elements in the formula
val RMSE:  0.017233173185924926
Fitting mod_petti kernel matrix
Constructing distances
LLVM ERROR: Symbol not found: __powidf2
Aborted

I don't have CUDA; I am only using the CPU version. I searched on Google but could not find a solution.

If you know how to resolve this, please let me know. Thanks.
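
One way to narrow this down (a guess on my part, not a confirmed fix) is to check whether numba/llvmlite can compile anything at all on this machine, independently of mat_discover. If the tiny jitted function below aborts with the same "Symbol not found: __powidf2" error, the problem is in the numba/llvmlite installation itself; the printed versions would also help when reporting it upstream.

# Hedged diagnostic: exercise numba's JIT outside of mat_discover.
import numba
import llvmlite
import numpy as np
from numba import njit

print("numba", numba.__version__, "| llvmlite", llvmlite.__version__, "| numpy", np.__version__)

@njit
def powi_check(x):
    # float ** small integer can lower to the powi intrinsic implicated in
    # the error above (an assumption about where the missing symbol comes from)
    return x ** 3

print(powi_check(2.0))  # expected: 8.0 on a healthy install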

Error while using `nearest_neigh_props`

I'm trying to predict using a trained model and am facing the following error:

InvalidParameterError                     Traceback (most recent call last)
Cell In[16], line 13
     11 for i, val_df in enumerate(val_dfs):
     12     jd = jd + 1
---> 13     disc.predict(val_df, umap_random_state=42)
     14     dens_score.append(disc.dens_score_df.sample(frac=1))
     15     peak_score.append(disc.peak_score_df.sample(frac=1))

File c:\Users\hasan\miniconda3\envs\li_ml\lib\site-packages\mat_discover\mat_discover_.py:665, in Discover.predict(self, val_df, plotting, umap_random_state, pred_weight, proxy_weight, dummy_run, count_repeats, return_peak)
    663 # compound-wise scores (i.e. individual compounds)
    664 with self.Timer("nearest-neighbor-properties"):
--> 665     self.rad_neigh_avg_targ, self.k_neigh_avg_targ = nearest_neigh_props(
    666         self.dm, pred, n_neighbors=self.n_peak_neighbors
    667     )
    668     self.val_rad_neigh_avg = self.rad_neigh_avg_targ[val_ids]
    669     self.val_k_neigh_avg = self.k_neigh_avg_targ[val_ids]

File c:\Users\hasan\miniconda3\envs\li_ml\lib\site-packages\mat_discover\utils\nearest_neigh.py:48, in nearest_neigh_props(X, target, r_strength, radius, n_neighbors, metric, **NN_kwargs)
      6 def nearest_neigh_props(
      7     X,
      8     target,
   (...)
     13     **NN_kwargs,
     14 ):
...
     98     f"The {param_name!r} parameter of {caller_name} must be"
     99     f" {constraints_str}. Got {param_val!r} instead."
    100 )

InvalidParameterError: The 'radius' parameter of NearestNeighbors must be a float in the range [0, inf] or None. Got -0.28528690338134766 instead.
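
For what it's worth, the exception itself comes from scikit-learn's parameter validation: NearestNeighbors only accepts radius >= 0 (or None), so somewhere upstream a negative radius is being derived. The snippet below reproduces the constraint and shows the kind of guard that would sidestep it; it is illustrative only, not a claim about how nearest_neigh_props actually computes its radius.

# Minimal reproduction of the scikit-learn constraint (sklearn >= 1.2,
# where parameter validation runs at fit time).
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(10, 3)

try:
    NearestNeighbors(radius=-0.285).fit(X)  # same failure mode as in the traceback
except ValueError as err:  # InvalidParameterError subclasses ValueError
    print(err)

# A defensive guard of the kind that would avoid the error:
radius = -0.28528690338134766
NearestNeighbors(radius=max(radius, 0.0)).fit(X)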

FileNotFoundError while splitting val_df into chunks

import numpy as np
from os.path import join

val_dfs = np.array_split(val_df, 3)

print("##########chunking done##########")

# %% directories
table_dirs = [join("tables", val) for val in ["val1", "val2", "val3"]]
figure_dirs = [join("figures", val) for val in ["val1", "val2", "val3"]]
# %% fit, predict, plot
disc.fit(train_df)

print("##########fitting done##########")

for i, val_df in enumerate(val_dfs):
    # change output dirs
    disc.table_dir = table_dirs[i]
    disc.figure_dir = figure_dirs[i]

    # %% predict
    score = disc.predict(val_df, umap_random_state=42)

    print("##########predictions done##########")

    # %% plot and save
    disc.plot()

This is code that I got from @mliu7051 to help me split my validation data into chunks in DiSCoVeR. Fitting and predicting work, but I get the following error when running disc.plot():

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-32-7f8d2c805a2e> in <module>()
     16 
     17     # %% plot and save
---> 18     disc.plot()
     19     disc.save(dummy=dummy)
     20 

7 frames
/usr/local/lib/python3.7/dist-packages/mat_discover/mat_discover_.py in plot(self, return_pareto_ind)
   1318             Pareto front indices for the peak and density proxies, respectively.
   1319         """
-> 1320         fig, pk_pareto_ind = self.pf_peak_proxy()
   1321         fig, frac_pareto_ind = self.pf_train_contrib_proxy()
   1322 

/usr/local/lib/python3.7/dist-packages/mat_discover/mat_discover_.py in pf_peak_proxy(self)
   1544             x_unit=self.target_unit,
   1545             y_unit=self.target_unit,
-> 1546             use_plotly_offline=self.use_plotly_offline,
   1547         )
   1548 

/usr/local/lib/python3.7/dist-packages/mat_discover/utils/pareto.py in pareto_plot(df, x, y, color, x_unit, y_unit, color_unit, hover_data, fpath, reverse_x, parity_type, pareto_front, color_continuous_scale, color_discrete_map, xrange, use_plotly_offline)
    214 
    215     if fpath is not None:
--> 216         fig.write_html(fpath + ".html")
    217 
    218     fig, scale = matplotlibify(fig)

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in write_html(self, *args, **kwargs)
   3706         import plotly.io as pio
   3707 
-> 3708         return pio.write_html(self, *args, **kwargs)
   3709 
   3710     def to_image(self, *args, **kwargs):

/usr/local/lib/python3.7/dist-packages/plotly/io/_html.py in write_html(fig, file, config, auto_play, include_plotlyjs, include_mathjax, post_script, full_html, animation_opts, validate, default_width, default_height, auto_open, div_id)
    534     # Write HTML string
    535     if path is not None:
--> 536         path.write_text(html_str)
    537     else:
    538         file.write(html_str)

/usr/lib/python3.7/pathlib.py in write_text(self, data, encoding, errors)
   1238             raise TypeError('data must be str, not %s' %
   1239                             data.__class__.__name__)
-> 1240         with self.open(mode='w', encoding=encoding, errors=errors) as f:
   1241             return f.write(data)
   1242 

/usr/lib/python3.7/pathlib.py in open(self, mode, buffering, encoding, errors, newline)
   1206             self._raise_closed()
   1207         return io.open(self, mode, buffering, encoding, errors, newline,
-> 1208                        opener=self._opener)
   1209 
   1210     def read_bytes(self):

/usr/lib/python3.7/pathlib.py in _opener(self, name, flags, mode)
   1061     def _opener(self, name, flags, mode=0o666):
   1062         # A stub for the opener argument to built-in open()
-> 1063         return self._accessor.open(self, flags, mode)
   1064 
   1065     def _raw_open(self, flags, mode=0o777):

FileNotFoundError: [Errno 2] No such file or directory: 'figures/val1/pf-peak-proxy.html'

I know it has something to do with the directories I created; I just need plot() to recognize all three chunks.
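
My best guess (an assumption, since I haven't traced plot() fully) is that the figures are written into disc.figure_dir but the directory is never created, so creating the per-chunk output folders up front may be all that's missing. The paths below mirror the ones in the snippet above.

# Hedged fix: create the output directories before the predict/plot loop so
# that fig.write_html(...) has somewhere to write.
import os
from os.path import join

table_dirs = [join("tables", val) for val in ["val1", "val2", "val3"]]
figure_dirs = [join("figures", val) for val in ["val1", "val2", "val3"]]

for d in table_dirs + figure_dirs:
    os.makedirs(d, exist_ok=True)  # no-op if the directory already exists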

add facilities for crystal structures and polymers

Feature request

Requires valid distance metrics for crystal structures and polymers that encode chemo-structural novelty and polymeric novelty, respectively, as well as structure-based regression models. After that, it's just some basic plumbing.

Additional performance/novelty metric

Great suggestion by @computron after @sp8rks's Materials Project seminar:

At the end you mentioned a few metrics to test whether your "novel" compounds were actually novel. I would also suggest something like a min-max distance from top performers. For example, you predict compound X. Take the distance of compound X from the top 20 known good performers and see which known good performer is the closest. The most novel compounds are the ones with the maximum distance from any known top performer.
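
A rough sketch of that metric (my reading of the suggestion, not an existing mat_discover feature): for each candidate, compute the distance to its nearest compound among the top-k known performers, then rank candidates by that value in descending order, so the most novel candidates are those farthest from every known top performer. The feature matrices and distance metric below are placeholders.

# Hedged sketch of the suggested "min-max" (maximin) novelty metric.
# candidate_feats (n x d) and top_feats (k x d) are hypothetical featurizations.
import numpy as np
from scipy.spatial.distance import cdist

def maximin_novelty(candidate_feats, top_feats, metric="euclidean"):
    dists = cdist(candidate_feats, top_feats, metric=metric)  # n x k distances
    return dists.min(axis=1)  # distance to the closest known top performer

rng = np.random.default_rng(0)
scores = maximin_novelty(rng.random((100, 8)), rng.random((20, 8)))
ranking = np.argsort(scores)[::-1]  # most novel (farthest from all top performers) first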

Conda installation: Unsatisfiable Error

Hi!
I'd like to report a problem with the conda installation. After creating a new conda environment and launching the installation with

conda update conda
conda create --name mat_discover python==3.9.*
conda activate mat_discover
conda install -c sgbaird mat_discover

I get the following output:

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: |
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Any ideas about what may be going on?
Thank you!

'Discover' object has no attribute 'dens_score_df'

Hi!
I'd like to open a fitted model in order to explore its clusters, and I found the disc.load() method.
I load the model with

>>> disc = Discover()
>>> disc.load(fpath='C:/Users/anton/OneDrive/Desktop/matdisc_colab/formula_discharge vs capacity_grav/disc.pkl')
<mat_discover.mat_discover_.Discover object at 0x0000015006F56D90>

but when I try to use methods like disc.merge() or disc.cluster_avg(), it appears that the Discover object lacks several attributes.
Some examples:

>>> disc.cluster_avg()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Discover' object has no attribute 'cluster_avg'
>>> disc.merge()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\anton\miniconda3\envs\matdisc\lib\site-packages\mat_discover\mat_discover_.py", line 848, in merge
    dens_score_df = self.dens_score_df.rename(columns={"score": "Density Score"})
AttributeError: 'Discover' object has no attribute 'dens_score_df'
>>> disc.compute_log_density()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\anton\miniconda3\envs\matdisc\lib\site-packages\mat_discover\mat_discover_.py", line 1261, in compute_log_density
    r_orig = self.std_r_orig
AttributeError: 'Discover' object has no attribute 'std_r_orig'

The folder where disc.pkl is stored contains the three folders "figures", "models", and "tables", plus disc.pkl itself. The same occurs for other folders structured similarly.
Am I doing something wrong here, or is there a possible bug in the load() method?
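
Two possibilities I haven't been able to rule out (stated as assumptions, not facts about the API) are that load() returns the restored object rather than filling disc in place, and that dens_score_df, cluster_avg, and std_r_orig are only populated during predict(). A minimal sketch covering both:

# Hedged sketch, not a confirmed fix.
from mat_discover.mat_discover_ import Discover

disc = Discover()
loaded = disc.load(fpath="disc.pkl")  # capture the return value, in case load() returns the object
if loaded is not None:
    disc = loaded

# If the score DataFrames are only built during predict(), they would need
# to be regenerated before calling merge() or the plotting methods:
# disc.predict(val_df, umap_random_state=42)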

Adaptive design with mat_discover

Feature request

mat_discover can be adapted into an adaptive design scheme by iteratively moving the top-ranked compound from the validation set to the training set and repeating the fit/predict. This is costly, as producing a top-10 list will now take 10x longer. While this could be mitigated by using algorithms that allow for "online" updating of the training data (probably out-of-scope for mat_discover), the extra computational cost is probably worth it for any kind of low-throughput simulation or experiment-based adaptive design scheme. It could also take the form of adding new simulation results to mat_discover and re-running.

The former would be an implementation within mat_discover accompanied by a docs example, while the latter would probably exist only as a docs example.
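
A bare-bones version of that loop might look like the sketch below. This is only an illustration of the idea, not existing mat_discover functionality; the dens_score_df attribute, its "formula" column, its sort order, and the train_df/val_df DataFrames are assumptions carried over from the other issues here.

# Hedged sketch of the proposed adaptive-design loop: fit/predict, then move
# the single top-ranked validation compound into the training set and repeat.
import pandas as pd
from mat_discover.mat_discover_ import Discover

disc = Discover()
suggestions = []

for _ in range(10):  # e.g. build a "top-10" list one compound at a time
    disc.fit(train_df)
    disc.predict(val_df, umap_random_state=42)
    best = disc.dens_score_df.iloc[0]  # assumed: top-ranked candidate sits in row 0
    suggestions.append(best)
    # pretend the candidate was "measured" and move it into the training data
    mask = val_df["formula"] == best["formula"]
    train_df = pd.concat([train_df, val_df[mask]], ignore_index=True)
    val_df = val_df[~mask].reset_index(drop=True)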

add unique elements and unique templates from extraordinary compounds as a measure of both performance + novelty

For the extraordinary adaptive design study/example, the performance and novelty metrics are separate. In other words, it only shows that high-performing materials are being found and that novel materials are being found. It could be finding high-performing, traditional materials and low-performing, novel materials while never finding high-performing, novel compounds.

To address that, consider plotting additional rows:

  1. number of unique elements added by the extraordinary compounds that are discovered (novelty + proxy)
  2. the number of unique templates added by the extraordinary compounds that are discovered (novelty + proxy)

Also:

  1. number of unique elements that are added during the addition of a unique template (novelty)
  2. same as (1) directly above but with the additional constraint that it was also an extraordinary compound (novelty + proxy)

With so many rows, it might be nice to have an interactive figure with a dropdown or a few dropdowns.

See https://github.com/sparks-baird/mat_discover/blob/main/examples/adaptive_design_compare.py
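
As a concrete starting point for metric (1) in the first list above, the sketch below counts how many new chemical elements the discovered extraordinary compounds contribute, iteration by iteration. The formula lists and the crude element-extraction helper are illustrative assumptions, and the "unique templates" version would need an analogous prototype-labelling step.

# Hedged sketch: cumulative count of unique elements contributed by the
# extraordinary compounds discovered at each adaptive-design iteration.
import re

def elements_in(formula):
    # crude element extraction from a formula string, e.g. "Nb3Sn" -> {"Nb", "Sn"}
    return set(re.findall(r"[A-Z][a-z]?", formula))

# one list of discovered extraordinary formulas per iteration (stand-in data)
discovered = [["Nb3Sn"], ["MgB2", "Nb3Ge"], ["YBa2Cu3O7"]]

seen = set()
unique_element_counts = []
for formulas in discovered:
    for formula in formulas:
        seen |= elements_in(formula)
    unique_element_counts.append(len(seen))

print(unique_element_counts)  # -> [2, 5, 9]; one value per plotted iteration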

Aside: There's also the question of whether "top 2% of performers" is too tight a constraint given the additional constraints above.

Numpy inconsistency in mat_discover

Hi!
After pip-installing mat_discover in a new conda environment, I found the following potential problem:

>>> from mat_discover.mat_discover_ import Discover   
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\anton\anaconda3\envs\mat-disc\lib\site-packages\mat_discover\mat_discover_.py", line 36, in <module>
    import umap
  File "C:\Users\anton\anaconda3\envs\mat-disc\lib\site-packages\umap\__init__.py", line 2, in <module>
    from .umap_ import UMAP
  File "C:\Users\anton\anaconda3\envs\mat-disc\lib\site-packages\umap\umap_.py", line 28, in <module>
    import numba
  File "C:\Users\anton\anaconda3\envs\mat-disc\lib\site-packages\numba\__init__.py", line 200, in <module>
    _ensure_critical_deps()
  File "C:\Users\anton\anaconda3\envs\mat-disc\lib\site-packages\numba\__init__.py", line 140, in _ensure_critical_deps
    raise ImportError("Numba needs NumPy 1.21 or less")
ImportError: Numba needs NumPy 1.21 or less

I then downgraded my numpy to 1.21.0 via pip install and was shown the following error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
matminer 0.7.6 requires numpy>=1.22.0, but you have numpy 1.21.0 which is incompatible.

I am currently on numpy 1.21.0, and everything seems to work, but perhaps this incompatibility is worth a thought?
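
In case it helps with debugging version pins like this, the short check below prints the versions involved and whether numba imports cleanly in the current environment; the compatible NumPy range changes between numba releases, so this is only a diagnostic, not a statement of which pin is correct.

# Hedged diagnostic: report the installed versions involved in the conflict
# and whether numba's import-time NumPy check passes in this environment.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["numpy", "numba", "matminer", "mat_discover"]:
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")

try:
    import numba  # raises ImportError if the installed NumPy is unsupported
    print("numba imports cleanly")
except ImportError as err:
    print("numba import failed:", err)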
