Code Monkey home page Code Monkey logo

Comments (9)

rkurchin avatar rkurchin commented on June 24, 2024 2

Thanks (as always) for a thoughtful and detailed perspective. I'll make a branch and start sketching this out and make a draft PR once I think I have something vaguely sensible, and we can comment/iterate from there.

from chemistryfeaturization.jl.

DhairyaLGandhi avatar DhairyaLGandhi commented on June 24, 2024 1

calculating of the value of the feature (e.g. calling the MolecularGraph function) and the actual encoding of that value (e.g. calling the onehot-encoder).

To clarify, the design was to separate out the concerns and enforce it via API, the encoding etc are details that can (and should) be made part of the complete system. What happens inside the function is something that is completely up to us after all. Consider another design where everything is lazy and the featurizations only output a nested thunk which is materialized at the end. In this case, the encoding can happen efficiently at the end.

From an AD pov, I would suggest doing what makes sense for the problem. Memory overhead can catch us in AD sometimes, but hopefully we don't have too much memory to hold on to.

from chemistryfeaturization.jl.

rkurchin avatar rkurchin commented on June 24, 2024

Okay so @DhairyaLGandhi and I finally have an occasion to actually do this (for that most noblest of goals of completely trouncing deepchem's performance using AGN), and we had a chat about it earlier and realized some things. @thazhemadam, would love your input...

At the moment, the (implicit) presumption is that the workflow looks like (presuming that each step has access only to the outputs of the previous one)

structure (e.g. SMILES, CIF, etc...) --> some intermediate structural representation such as Crystal or GraphMol --> ChemistryFeaturization representation (e.g. AtomGraph) --> encode and store features in FeaturizedAtoms

However, for the case of SpeciesFeature, we (currently) intend to do the encoding via MolecularGraph.jl functions, which act on GraphMol objects. This suggests that the above ordering doesn't really work, because you can't encode on an AtomGraph without converting it back to a GraphMol, which is really clunky. In this case, what would make more sense would be

structure (SMILES) --> GraphMol --> compute features --> convert to AtomGraph and store in FeaturizedAtoms

and the API as presently designed does not easily accommodate this. I could see two generic forms of solution to this:

  1. make the API more flexible to alternative featurization workflows, potentially at the cost of some of the assurances we have now (e.g. that the encoded_features stored in any FeaturizedAtoms object were actually encoded via the featurization acting on the atoms
  2. have Atoms objects store (potentially via some a type parameter?) their "original" representations (GraphMol, Crystal, etc. objects), increasing memory footprint, but ensuring we always have access to it, and maintaining the guarantees on "integrity" of FeaturizedAtoms as well as an even deeper notion of "data provenance"

A related question (especially if we follow route 1 above) is whether the type hierarchy of feature descriptors as we've constructed it is actually adding anything. In the preliminary solution @DhairyaLGandhi had sketched out, he basically defined new structs for each feature (hybridization, aromaticity, valence, etc. etc.) without ever making use of the SpeciesFeatureDescriptor type at all. Personally, I like the organization we have from a philosophical/conceptual perspective, and I maintain hope that it can ultimately reduce duplication (Dhairya was also separately calling one-hot encoding each time), but if it adds too much overhead without really making anything that much easier, then it may not be worth keeping. (Personally, I currently think/hope it's worth keeping, but felt this was important to bring up)

If we were to do this via SpeciesFeatureDescriptor (and thus likely follow route 2 above), I think the type needs a new field to store the function actually used to compute the value of the feature (I thought about putting this in the codec, but I think it's a distinct concept, as this function is basically taking the place of the lookup table in ElementFeatureDescriptor. What I could then imagine would be, similar to the built-in atom_data_df, we'd have a built-in list of functions from MolecularGraph so that for at least some common subset of species features, the user wouldn't have to worry about implementing/linking up things at all. In this case, we'd probably need to dispatch these functions (TBD what to call them #ntih) on the type parameter alluded to in the above paragraph, since I don't think there will in general be an easy way to convert an AtomGraph to a GraphMol if it was originally constructed from a Crystal (there's a lot of ancillary information stored that would be hard to generate).

Finally, the other thing I like about option 2 is that it maintains relative ease in changing around featurizations after the fact (e.g. adding/removing FD's) without completely starting from scratch. If you carry along the structure representation, you have everything you need to compute additional features, etc.

Okay, that was a lot, but I think writing it up helped me to clarify some things in my head about possible ways to proceed. But please LMK your thoughts! If we have general consensus on an approach, I can make a branch and start sketching things out.

from chemistryfeaturization.jl.

DhairyaLGandhi avatar DhairyaLGandhi commented on June 24, 2024

I think the design goal of maintaining the "decodability" present in FeaturizedAtoms is good (I like it too). What would we need to preserve it and get a bit more flexibility? Ideally AtomGraph constructed from any source would be equivalent, but storing the source would in effect delineate them. There are pros and cons to this. It makes decoding easier, but adds overheads at all times (I'm unsure if that's worth it tbh, but we wouldn't know that for some time yet I think)

from chemistryfeaturization.jl.

rkurchin avatar rkurchin commented on June 24, 2024

Just to clarify, the decodability isn't an issue with either of the routes outlined above, because that's just a matter of "inverting" the one-hot encoder, which we can already do.

The other kind of separation of concerns that your approach didn't have was the separation between the calculating of the value of the feature (e.g. calling the MolecularGraph function) and the actual encoding of that value (e.g. calling the onehot-encoder). I think it's important to maintain this as well, because it allows experimentation with different parameters of encoding (e.g. different numbers of bins for continuous-valued features) or even different encoding approaches (e.g. direct numerical encoding, something I've wanted to experiment with).

Those things being said, I definitely think you're asking the right kinds of questions. Ultimately, I think both routes also can offer the flexibility we need, but with different balances between fixed and variable costs of implementation:

option 1 = low fixed cost, high variable cost (fairly easy to get up and running, but more work required each time we want to encode a different feature)
option 2 = higher fixed cost, lower variable cost (some infrastructural changes up-front, but then much easier to just provide another function for a new feature and have everything else "just work")

In general in cases like this, I tend to lean towards the higher fixed cost, lower variable cost option...of course, as you also point out, option 2 also entails an increase in overhead, but it's unclear how much or if that matters. I think overhead that lives purely within ChemistryFeaturization is generally not an issue, but anything that might potentially be part of a model (as we're currently playing around with in the AD stuff) obviously has to be performant. However, it's also true that storing the structure could have benefits in the AD case, since it will make it simpler to propagate "all the way" back...

from chemistryfeaturization.jl.

thazhemadam avatar thazhemadam commented on June 24, 2024

However, for the case of SpeciesFeature, we (currently) intend to do the encoding via MolecularGraph.jl functions, which act on GraphMol objects. This suggests that the above ordering doesn't really work, because you can't encode on an AtomGraph without converting it back to a GraphMol, which is really clunky.

I agree. This would also break our logical flow requirement of AtomGraphs being the type that is primarily involved in feature generation.

  1. make the API more flexible to alternative featurization workflows, potentially at the cost of some of the assurances we have now (e.g. that the encoded_features stored in any FeaturizedAtoms object were actually encoded via the featurization acting on the atoms

I feel like the logically generic and deterministic guarantee we provide here right now might be worth holding on to.

A related question (especially if we follow route 1 above) is whether the type hierarchy of feature descriptors as we've constructed it is actually adding anything.

That's a good question. I think SpeciesFeatureDescriptor should be present as a type (for the logical grouping of different FDs, if not for any other reason). However, if the fields that different types of SpeciesFeatures need aren't as similar/flexible as we initially thought, it might also be a good idea to make it an abstract type instead, or even better, a trait.

(I thought about putting this in the codec, but I think it's a distinct concept, as this function is basically taking the place of the lookup table in ElementFeatureDescriptor)

Agreed. Codecs are used for actually encoding the features themselves, not for encoding (or even decoding, I suppose) what will be used for encoding the features. While in an abstract sense it may make sense, I feel like that's a level of complexity we shouldn't need to deal with.

Overall, I think Option 2. might be the better option. If the original representation is stored, then our current workflow wouldn't break, and we can simply apply FeatureDescriptors onto an Atoms object, and let a specialized Codec determine how the particular structure/representation needs to be handled.

TLDR;
I like Option 2 better.

from chemistryfeaturization.jl.

thazhemadam avatar thazhemadam commented on June 24, 2024

I also don't think the amount of memory we would be dealing with would be too high, and if storing the initial representation also would benefit in AD, then the tradeoff might be worth it. But, if cumulative cost proves that isn't, I'd assume we could just as easily extract the "pure" AtomGraph representation (which is basically the same structure, without the "original" GraphMol/Crystal, etc representation), create a FeaturizedAtoms object using it and use that in the models instead.

from chemistryfeaturization.jl.

DhairyaLGandhi avatar DhairyaLGandhi commented on June 24, 2024

I may be missing something but what is the benefit to AD? Holding on to memory is one of the things to avoid in AD usually, unless we can avoid computing something expensive.

from chemistryfeaturization.jl.

rkurchin avatar rkurchin commented on June 24, 2024

I just meant that it could be cleaner to propagate the forces back to the original structure if it's attached to the same object. Not that it makes the AD itself easier/faster.

from chemistryfeaturization.jl.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.