cbfv's Introduction

CBFV Package

A tool to quickly create composition-based feature vectors (CBFVs) from materials data files.

Installation

The source code is currently hosted on GitHub at: https://github.com/kaaiian/CBFV

Binary installers for the latest released version are available at the Python Package Index (PyPI):

# PyPI
pip install CBFV

Making the composition-based feature vector

The CBFV package assumes your data is stored in a pandas dataframe of the following structure:

formula target
Tc1V1 248.539
Cu1Dy1 66.8444
Cd3N2 91.5034

To featurize this data, the generate_features function can be called as follows:

from CBFV import composition
X, y, formulae, skipped = composition.generate_features(df)
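
As a minimal sketch of the expected input, the table above can be built as a pandas DataFrame with a formula column and a target column. The featurize wrapper is a hypothetical helper introduced only for this illustration; it imports CBFV lazily so the DataFrame part can be run without the package installed.

```python
import pandas as pd

# Example rows copied from the table above.
df = pd.DataFrame(
    {
        "formula": ["Tc1V1", "Cu1Dy1", "Cd3N2"],
        "target": [248.539, 66.8444, 91.5034],
    }
)


def featurize(frame):
    """Hypothetical wrapper: import CBFV only when called, then featurize."""
    from CBFV import composition

    return composition.generate_features(frame)
```

Calling featurize(df) returns the same four values as the call shown above: X, y, formulae, skipped.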

Extended Functionality

The featurization scheme can be adjusted by setting the elem_prop parameter. The following featurization schemes are included within CBFV:

  • jarvis
  • magpie
  • mat2vec
  • oliynyk (default)
  • onehot
  • random_200

Duplicate formula handling is controlled by the drop_duplicates parameter. It is set to False by default to preserve data points that differ in something other than their formula, for example heat capacity measurements performed on the same material at different temperatures.

The extend_features parameter specifies whether columns other than ['formula', 'target'] should be considered during featurization. It is set to False by default to exclude nuisance information. Setting extend_features=True preserves the additional columns (e.g. ['temperature', 'pressure']).

The sum_feat parameter specifies whether to calculate the sum features when generating the CBFVs for the chemical formulae. It is set to False by default.
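
The README does not define the sum features. The toy example below assumes they are count-weighted sums of an element property, in contrast to a fraction-weighted average. This is an illustration of that assumption, not CBFV's implementation; the only property used is the atomic number of Cd and N.

```python
# Atomic numbers of Cd and N, used as the single "element property" here.
atomic_number = {"Cd": 48, "N": 7}

# Element counts for the formula Cd3N2.
counts = {"Cd": 3, "N": 2}

total_atoms = sum(counts.values())
fractions = {el: n / total_atoms for el, n in counts.items()}

# Fraction-weighted average of the property over the formula.
avg_feature = sum(fractions[el] * atomic_number[el] for el in counts)

# Count-weighted sum of the property (the assumed meaning of a "sum feature").
sum_feature = sum(counts[el] * atomic_number[el] for el in counts)
```

Under this reading, the sum depends on how the formula is written (Cd3N2 and Cd6N4 give different sums), while the average does not.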

Calling generate_features with these parameters can be implemented as follows:

formula target temp
Tc1V1 248.539 373
Tc1V1 66.8444 473
Cd3N2 91.5034 273

from CBFV import composition
X, y, formulae, skipped = composition.generate_features(df,
                                                        elem_prop='magpie',
                                                        drop_duplicates=False,
                                                        extend_features=True,
                                                        sum_feat=True)

cbfv's People

Contributors

andrewfalkowski, anthony-wang, kaaiian

cbfv's Issues

`generate_features(..., extend_features=...)` InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Processing Input Data: 100%|██████████| 1794/1794 [00:00<00:00, 7378.49it/s]
	Featurizing Compositions...
Assigning Features...: 100%|██████████| 1778/1778 [00:00<00:00, 3426.03it/s]
NOTE: Your data contains formula with exotic elements. These were skipped.
	Creating Pandas Objects...

---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
<ipython-input-45-22826a03d387> in <module>()
      1 from CBFV import composition
----> 2 X, y, formulae, skipped = composition.generate_features(df, extend_features="R")

4 frames
/usr/local/lib/python3.7/dist-packages/CBFV/composition.py in generate_features(df, elem_prop, drop_duplicates, extend_features, sum_feat, mini)
    307         extended = pd.DataFrame(extra_features, columns=features)
    308         extended = extended.set_index('formula', drop=True)
--> 309         X = pd.concat([X, extended], axis=1)
    310 
    311     # reset dataframe indices

/usr/local/lib/python3.7/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

/usr/local/lib/python3.7/dist-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    305     )
    306 
--> 307     return op.get_result()
    308 
    309 

/usr/local/lib/python3.7/dist-packages/pandas/core/reshape/concat.py in get_result(self)
    526                     obj_labels = obj.axes[1 - ax]
    527                     if not new_labels.equals(obj_labels):
--> 528                         indexers[ax] = obj_labels.get_indexer(new_labels)
    529 
    530                 mgrs_indexers.append((obj._mgr, indexers))

/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
   3440 
   3441         if not self._index_as_unique:
-> 3442             raise InvalidIndexError(self._requires_unique_msg)
   3443 
   3444         if not self._should_compare(target) and not is_interval_dtype(self.dtype):

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
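
One way to check whether repeated formulae are involved in this error is sketched below. It assumes the non-unique index comes from duplicate entries in the formula column, which the traceback alone does not confirm. duplicated_formulas is a hypothetical helper, not part of CBFV.

```python
from collections import Counter


def duplicated_formulas(formulas):
    """Return {formula: count} for every formula that appears more than once."""
    counts = Counter(formulas)
    return {formula: n for formula, n in counts.items() if n > 1}


# Toy data with one repeated formula.
example = ["Tc1V1", "Tc1V1", "Cd3N2"]
repeats = duplicated_formulas(example)
```

The helper accepts any iterable of strings, so a DataFrame column such as df["formula"] can be passed directly.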

CompositionError: ( is an invalid formula!

Processing Input Data: 92%|█████████▏| 1263/1377 [00:00<00:00, 20629.89it/s]

CompositionError Traceback (most recent call last)
in <cell line: 1>()
1 for f in['jarvis','magpie','mat2vec','oliynyk','onehot','random_200']:
----> 2 X_train_unscaled,y_train,formulae_train,skipped_train = generate_features(df, elem_prop=f, drop_duplicates=False, extend_features=False, sum_feat=True)
3 #it has to be tested again with bg data which I have created by deleting the duplicates
4
5 SEED=42

4 frames
/usr/local/lib/python3.10/dist-packages/CBFV/composition.py in generate_features(df, elem_prop, drop_duplicates, extend_features, sum_feat, mini)
281 if 'x' in formula:
282 continue
--> 283 l1, l2 = _element_composition_L(formula)
284 formula_mat.append(l1)
285 count_mat.append(l2)

/usr/local/lib/python3.10/dist-packages/CBFV/composition.py in _element_composition_L(formula)
97
98 def _element_composition_L(formula):
---> 99 comp_frac = _element_composition(formula)
100 atoms = list(comp_frac.keys())
101 counts = list(comp_frac.values())

/usr/local/lib/python3.10/dist-packages/CBFV/composition.py in _element_composition(formula)
86
87 def _element_composition(formula):
---> 88 elmap = parse_formula(formula)
89 elamt = {}
90 natoms = 0

/usr/local/lib/python3.10/dist-packages/CBFV/composition.py in parse_formula(formula)
62 expanded_formula = formula.replace(m.group(), expanded_sym)
63 return parse_formula(expanded_formula)
---> 64 sym_dict = get_sym_dict(formula, 1)
65 return sym_dict
66

/usr/local/lib/python3.10/dist-packages/CBFV/composition.py in get_sym_dict(f, factor)
26 f = f.replace(m.group(), "", 1)
27 if f.strip():
---> 28 raise CompositionError(f'{f} is an invalid formula!')
29 return sym_dict
30

CompositionError: ( is an invalid formula!

`Me` element missing, not accounted for in "exotic" elements checking

Exception has occurred: ValueError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
'Me' is not in list
  File "C:\Users\sterg\miniconda3\envs\vickers\Lib\site-packages\composition_based_feature_vector\composition.py", line 133, in _assign_features
    row = elem_index[elem_symbols.index(elem)]
  File "C:\Users\sterg\miniconda3\envs\vickers\Lib\site-packages\composition_based_feature_vector\composition.py", line 295, in generate_features
    feats, targets, formulae, skipped = _assign_features(matrices,
  File "C:\Users\sterg\Documents\GitHub\sparks-baird\VickersHardnessPrediction\vickers_hardness\utils\mpds.py", line 12, in <module>
    X, y, formulae, skipped = generate_features(df)
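
Until placeholder symbols such as Me are handled by the library, one possible workaround is to screen formulae against the periodic table before featurizing. The sketch below is an assumption-laden illustration: unknown_symbols is a hypothetical helper, and the element property tables shipped with CBFV may cover fewer elements than the full list used here.

```python
import re

# Symbols of the 118 known elements.
ELEMENTS = set(
    """
    H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar
    K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr
    Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe
    Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu
    Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn
    Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr
    Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
    """.split()
)


def unknown_symbols(formula):
    """Return element-like tokens in `formula` that are not real element symbols."""
    tokens = re.findall(r"[A-Z][a-z]?", formula)
    return [tok for tok in tokens if tok not in ELEMENTS]


# "Me" is not an element symbol, so it is reported; "O" is not.
flagged = unknown_symbols("Me2O3")
```

Rows for which unknown_symbols returns a non-empty list could be dropped or inspected before calling generate_features.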

Invalid formulas during generate_features()

Some formulas in my datasets are occasionally not recognized, and I get the following error:

raise CompositionError(f'{f} is an invalid formula!')
CompositionError: ,65 is an invalid formula!

This happens in the get_sym_dict() function. Is there a way to automatically drop non-recognized symbols?
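
A coarse pre-filter along these lines could be applied before calling generate_features. This is a sketch only: looks_like_formula is a hypothetical helper, not part of CBFV, and it checks nothing beyond the allowed characters and balanced parentheses. It does not verify that each symbol is a real element.

```python
import re

# Allowed pieces: an element-like token (capital letter plus optional
# lowercase letter), a digit, a decimal point, or a parenthesis.
_ALLOWED = re.compile(r"^(?:[A-Z][a-z]?|[0-9.()])+$")


def looks_like_formula(text):
    """Return True if `text` contains only formula-like characters
    and its parentheses are balanced."""
    text = text.strip()
    if not _ALLOWED.match(text):
        return False
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing parenthesis without an opening one
                return False
    return depth == 0


candidates = ["Cd3N2", "Ca(OH)2", "Fe2O3,65", "("]
kept = [f for f in candidates if looks_like_formula(f)]
```

With a DataFrame, the same function could be used as a row mask, for example df[df["formula"].map(looks_like_formula)].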

Accompanying paper

What is the paper that accompanies this? Maybe include in the README?

Usage instructions for pip-installed cbfv (possible bug)

The PyPI page has no description, and the usage is not obvious from the README. Taylor and I are both having trouble with it. In its own conda environment:

(cbfv) C:\Users\sterg>pip install cbfv
Collecting cbfv
  Downloading cbfv-1.0.0-py3-none-any.whl (5.0 kB)
Collecting numpy
  Using cached numpy-1.21.2-cp38-cp38-win_amd64.whl (14.0 MB)
Collecting pytest
  Downloading pytest-6.2.5-py3-none-any.whl (280 kB)
     |████████████████████████████████| 280 kB 819 kB/s
Collecting pandas
  Using cached pandas-1.3.3-cp38-cp38-win_amd64.whl (10.2 MB)
Collecting tqdm
  Downloading tqdm-4.62.3-py2.py3-none-any.whl (76 kB)
     |████████████████████████████████| 76 kB 5.5 MB/s
Collecting python-dateutil>=2.7.3
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pytz>=2017.3
  Downloading pytz-2021.3-py2.py3-none-any.whl (503 kB)
     |████████████████████████████████| 503 kB 2.2 MB/s
Collecting iniconfig
  Downloading iniconfig-1.1.1-py2.py3-none-any.whl (5.0 kB)
Collecting atomicwrites>=1.0
  Downloading atomicwrites-1.4.0-py2.py3-none-any.whl (6.8 kB)
Collecting py>=1.8.2
  Downloading py-1.10.0-py2.py3-none-any.whl (97 kB)
     |████████████████████████████████| 97 kB 2.2 MB/s
Collecting pluggy<2.0,>=0.12
  Downloading pluggy-1.0.0-py2.py3-none-any.whl (13 kB)
Collecting colorama
  Using cached colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting packaging
  Downloading packaging-21.0-py3-none-any.whl (40 kB)
     |████████████████████████████████| 40 kB ...
Collecting attrs>=19.2.0
  Downloading attrs-21.2.0-py2.py3-none-any.whl (53 kB)
     |████████████████████████████████| 53 kB 1.0 MB/s
Collecting toml
  Using cached toml-0.10.2-py2.py3-none-any.whl (16 kB)
Collecting six>=1.5
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting pyparsing>=2.0.2
  Using cached pyparsing-2.4.7-py2.py3-none-any.whl (67 kB)
Installing collected packages: six, pyparsing, toml, pytz, python-dateutil, py, pluggy, packaging, numpy, iniconfig, colorama, attrs, atomicwrites, tqdm, pytest, pandas, cbfv
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
skrebate 0.6 requires scikit-learn, which is not installed.
skrebate 0.6 requires scipy, which is not installed.
automatminer 1.0.3.20200727 requires matminer==0.6.2, which is not installed.
automatminer 1.0.3.20200727 requires pymatgen==2020.01.28, which is not installed.
automatminer 1.0.3.20200727 requires scikit-learn==0.22.2, which is not installed.
automatminer 1.0.3.20200727 requires tpot==0.11.0, which is not installed.
auto-xrd 0.0.1 requires pymatgen, which is not installed.
auto-xrd 0.0.1 requires scipy, which is not installed.
tensorflow 2.5.0rc1 requires absl-py~=0.10, which is not installed.
tensorflow 2.5.0rc1 requires astunparse~=1.6.3, which is not installed.
tensorflow 2.5.0rc1 requires flatbuffers~=1.12.0, which is not installed.
tensorflow 2.5.0rc1 requires gast==0.4.0, which is not installed.
tensorflow 2.5.0rc1 requires google-pasta~=0.2, which is not installed.
tensorflow 2.5.0rc1 requires grpcio~=1.34.0, which is not installed.
tensorflow 2.5.0rc1 requires h5py~=3.1.0, which is not installed.
tensorflow 2.5.0rc1 requires keras-nightly~=2.5.0.dev, which is not installed.
tensorflow 2.5.0rc1 requires keras-preprocessing~=1.1.2, which is not installed.
tensorflow 2.5.0rc1 requires typing-extensions~=3.7.4, which is not installed.
dtw-python 1.1.10 requires scipy>=1.1, which is not installed.
tensorboard 2.4.1 requires absl-py>=0.4, which is not installed.
tensorboard 2.4.1 requires google-auth<2,>=1.6.3, which is not installed.
tensorboard 2.4.1 requires google-auth-oauthlib<0.5,>=0.4.1, which is not installed.
tensorboard 2.4.1 requires grpcio>=1.24.3, which is not installed.
tensorboard 2.4.1 requires markdown>=2.6.8, which is not installed.
tensorboard 2.4.1 requires requests<3,>=2.21.0, which is not installed.
tensorboard 2.4.1 requires tensorboard-plugin-wit>=1.6.0, which is not installed.
tensorboard 2.4.1 requires werkzeug>=0.11.15, which is not installed.
tensorflow 2.5.0rc1 requires numpy~=1.19.2, but you have numpy 1.21.2 which is incompatible.
tensorflow 2.5.0rc1 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
Successfully installed atomicwrites-1.4.0 attrs-21.2.0 cbfv-1.0.0 colorama-0.4.4 iniconfig-1.1.1 numpy-1.21.2 packaging-21.0 pandas-1.3.3 pluggy-1.0.0 py-1.10.0 pyparsing-2.4.7 pytest-6.2.5 python-dateutil-2.8.2 pytz-2021.3 six-1.16.0 toml-0.10.2 tqdm-4.62.3

(cbfv) C:\Users\sterg>python3
Python 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import cbfv
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'cbfv'
