I think there are a few issues with the RDF calculation method (<a href="https://githu

Potential Issue with RDF Method about matminer HOT 4 CLOSED

WardLT commented on August 16, 2024

Potential Issue with RDF Method

from matminer.

Comments (4)

shyuep commented on August 16, 2024

Note that I don't think (1) is ultimately an issue. It is merely a scaling factor and has no effect on most trend analysis. That said, I agree it is usually normalized by number density.

More generally, I find this to be a very poor and overly complicated implementation of an RDF. You do not actually need the whole dist_rdf[dist] += 1 approach (first off, a better implementation of that would use a Counter). You simply need to build a sequence of all distances with no counting at all, then use numpy.histogram to bin them, i.e., something like

# Use Structure.get_all_neighbors instead of get_neighbors
# Compile the distances into a list.
# Concatenate the distances into one long sequence.
# Create histogram of distances (numpy will even normalize it to a probability density for you if you specify density=True; older NumPy releases called this argument normed)

I think this will accomplish the RDF in fewer than 5 lines, and will probably be 10x or more faster because you use the inherent efficiencies in get_all_neighbors and numpy's histogramming (rather than the very inefficient manual histogramming).
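As a rough illustration of the histogram approach described above, here is a minimal sketch that runs with NumPy only. It is not the matminer code. The helper `neighbor_distances` is a hypothetical brute-force stand-in for the neighbor search (with pymatgen, the distances would instead come from Structure.get_all_neighbors), and the function names, arguments, and the simple cubic test cell are all assumptions made for the example. The normalization divides by the number of atoms, the number density, and the spherical shell volume, which is the usual RDF convention rather than the probability density that density=True would give.

```python
import itertools

import numpy as np


def neighbor_distances(lattice, frac_coords, cutoff):
    """Hypothetical stand-in for a neighbor search: return every site-to-site
    distance <= cutoff under periodic boundary conditions (brute force)."""
    lattice = np.asarray(lattice, dtype=float)
    cart = np.asarray(frac_coords, dtype=float) @ lattice
    volume = abs(np.linalg.det(lattice))
    # Spacing between lattice planes along each direction decides how many
    # periodic images are needed to cover the cutoff sphere.
    heights = [
        volume / np.linalg.norm(np.cross(lattice[(i + 1) % 3], lattice[(i + 2) % 3]))
        for i in range(3)
    ]
    reps = [range(-int(np.ceil(cutoff / h)), int(np.ceil(cutoff / h)) + 1) for h in heights]
    shifts = np.array(list(itertools.product(*reps)), dtype=float) @ lattice
    # diffs[i, j, s] = r_j + shift_s - r_i
    diffs = cart[None, :, None, :] + shifts[None, None, :, :] - cart[:, None, None, :]
    dists = np.linalg.norm(diffs, axis=-1).ravel()
    # Drop the zero distance of each site to itself.
    return dists[(dists > 1e-8) & (dists <= cutoff)]


def radial_distribution(distances, n_atoms, volume, cutoff, bin_size):
    """Bin the distances with numpy.histogram, then normalize each bin by
    the number of atoms, the number density, and the shell volume."""
    n_bins = int(round(cutoff / bin_size))
    edges = np.linspace(0.0, cutoff, n_bins + 1)
    counts, _ = np.histogram(distances, bins=edges)
    shell_volumes = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    number_density = n_atoms / volume
    g = counts / (n_atoms * number_density * shell_volumes)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, g


# Simple cubic cell with one atom and unit lattice parameter (assumed test case).
lattice = np.eye(3)
distances = neighbor_distances(lattice, [[0.0, 0.0, 0.0]], cutoff=2.4)
r, g = radial_distribution(distances, n_atoms=1, volume=1.0, cutoff=2.4, bin_size=0.3)
```

The function returns arrays of bin centers and RDF values rather than a dictionary, so no floating-point distance is ever used as a key.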

computron commented on August 16, 2024

Hi all,

Just catching up with this thread. I haven't been actively developing or reviewing matminer development because we're trying to push through "atomate" at the moment and my full focus (when it's possible for me to code...) is on atomate.

@WardLT - thanks for getting involved with matminer and for your suggestion about normalization. For now I have pulled your PR, especially since it helps with setup. In the future I would suggest separating the PR for setup.py from the one for the partial RDF, but again I am glad that you are taking the time to contribute.

@shyuep - I agree with your implementation suggestions (actually, the get_all_neighbors + numpy histogram method was my suggestion to the original implementor - I guess it was ignored). If @WardLT wants to try that implementation I would be all for it. Otherwise, I will definitely make sure to clean up the code when I get around to making a push for matminer.

WardLT commented on August 16, 2024

I agree, the scaling factor isn't important if you use the RDF as a descriptor for trend analysis/machine learning. However, in some cases (e.g., using the RDF to estimate coordination numbers) the scaling factor is important and I don't want matminer users to be surprised that the scaling is off from what they expected.
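To make the coordination-number point concrete, the standard relation between the RDF and the number of neighbors within a cutoff can be written as below. The symbols are assumptions introduced here, not notation from matminer: $g(r)$ is the RDF normalized by number density, $\rho = N/V$ is the number density, and $r_c$ is the cutoff radius.

```latex
n(r_c) = 4 \pi \rho \int_0^{r_c} g(r)\, r^2 \, dr
```

If $g(r)$ is off by a constant factor, the estimated coordination number $n(r_c)$ is off by the same factor, which is why the scaling matters for this use even though it does not affect trend analysis.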

I also agree that the implementation leaves some room for improvement. Beyond the performance issues you point out, using a floating-point number as the dictionary key leads to problems with floating-point equality.
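A small illustration of the floating-point key problem, using only the standard library. This is a generic sketch, not the matminer code; the variable names and the bin size are assumptions chosen for the example.

```python
from collections import Counter

bin_size = 0.1  # assumed bin width for the illustration

# Two ways of arriving at "the same" distance that differ in the last bit.
d1 = 0.1 + 0.2
d2 = 0.3

# Keyed by the raw float, the two distances land under separate keys.
float_keyed = Counter([d1, d2])

# Keyed by an integer bin index, they fall into the same bin.
index_keyed = Counter(int(round(d / bin_size)) for d in (d1, d2))

print(len(float_keyed), dict(index_keyed))
```

Binning by integer index (or letting numpy.histogram do the binning) sidesteps the equality comparison entirely.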

I'll go ahead and revamp both the RDF and PRDF code to use this histogram method (thanks for the advice). Any objections to me changing both methods to return a numpy array rather than a dictionary?

@computron - Sorry for conflating the edits to setup.py and the PRDF addition. I'll make sure to keep things separate in the future.

computron commented on August 16, 2024

Hi all,

I'm closing this issue for now. Feel free to reopen if needed.
