Comments (4)
Note that I don't think (1) is ultimately an issue. It is merely a scaling factor and has no effect on most trend analysis. That said, I agree it is usually normalized by number density.
More generally, I find this to be a very poor and overly complicated implementation of a RDF. You do not actually need to do the whole complicated dist_rdf[dist] += 1 (first off, a better implementation of that would use a Counter). You simply need to create a sequence of all distances with no counting at all, then use numpy.histogram to bin all the distances by dist, i.e., somehting like
# Use Structure.get_all_neighbors instead of get_neighbors
# Compile the distances into a list.
# Concatenate the distances into one long sequence.
# Create histogram of distances (numpy will even normalize it properly for you if you specify normed=True)
I think this will accomplish the RDF in less than 5 lines, and will probably be 10x or more faster because you use the inherent efficiencies in get_all_distances and numpy's histogramming (rather than the very inefficient manual historgramming).
from matminer.
Hi all,
Just catching up with this thread. I haven't been actively developing or reviewing matminer development because we're trying to push through "atomate" at the moment and my full focus (when it's possible for me to code...) is on atomate.
@WardLT - thanks for getting involved with matminer and for your suggestion about normalization. For now I have pulled your PR especially since it helps with setup. In the future it would suggest to separate the PR for setup.py vs the partial RDF but again I am glad that you are taking the time to contribute.
@shyuep - I agree with your implementation suggestions (actually, the get_all_neighbors + numpy histogram method was my suggestion to the original implementor - I guess it was ignored). If @WardLT wants to try that implementation I would be all for it. Otherwise, I will definitely make sure to clean up the code when I get around to making a push for matminer.
from matminer.
I agree, the scaling factor isn't important if you use the RDF as a descriptor for trend analysis/machine learning. However, in some cases (e.g., using the RDF to estimate coordination numbers) the scaling factor is important and I don't want matminer users to be surprised that the scaling is off from what they expected.
I also agree that the implementation leaves some room for improvement. Beyond the performance issues you point out, having a floating point as the key to the dictionary leads to problematic issues with floating point equality.
I'll go ahead and revamp both the RDF and PRDF code to use this histogram method (thanks for the advice). Any objections to me changing both methods to return a numpy array rather than a dictionary?
@computron - Sorry for conflating the edits to setup.py and the PRDF addition. I'll make sure to keep things separate in the future.
from matminer.
Hi all,
I'm closing this issue for now. Feel free to reopen if needed
from matminer.
Related Issues (20)
- Materials Project time split dataset - `load_data_from_json` returns `None` during debugging (conditionally)
- `matminer.datasets.utils._validate_dataset()` flaky on Windows? HOT 5
- [FEATURE REQUEST] SkipAtom compositional featurizer
- AttributeError: 'DensityFeatures' object has no attribute 'desired_features' HOT 1
- SOAP features HOT 2
- Suggestion: OPTIMADE data retriever HOT 1
- Fail to approach MPData HOT 4
- Handling NaNs from ElementProperty HOT 3
- CI failing due to broken mongo service
- Fixing matminer's multiprocessing problem HOT 1
- New release 0.9.0? HOT 2
- WenAlloy wrong valence electron counts
- mp-api for MPDataRetrieval needs upgrade badly HOT 1
- Missing compatibility with pandas v2 HOT 11
- Issue link to matsci.org broken
- compatibility request: pandas-2+ HOT 2
- Error in import composition HOT 6
- Simple composition-based featurization fails due to an upgrade in pymatgen HOT 2
- Re-enable tests that are skipped in CI HOT 3
- when I import matminer,mistake as following:ValueError: Unexpected atomic number Z=119。 HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from matminer.