
safe's Introduction

🦺 SAFE

Sequential Attachment-based Fragment Embedding (SAFE) is a novel molecular line notation that represents molecules as an unordered sequence of fragment blocks to improve molecule design using generative models.



Paper | Docs | 🤗 Model | 🤗 Training Dataset




Overview of SAFE

SAFE is a molecular representation designed for deep learning. It is an encoding that leverages a peculiarity in the decoding scheme of SMILES to represent molecules as a contiguous sequence of connected fragments. SAFE strings are valid SMILES strings and thus preserve the same amount of information. The intuitive representation of molecules as a sequence of connected fragments greatly simplifies the following tasks, which are often encountered in molecular design:

  • de novo design
  • superstructure generation
  • scaffold decoration
  • motif extension
  • linker generation
  • scaffold morphing

The construction of a SAFE string requires a molecular fragmentation algorithm. By default, we use BRICS, but any other fragmentation algorithm can be used. The image below illustrates the process of building a SAFE string, and a short code sketch follows. The resulting string is a valid SMILES that can be read by datamol or RDKit.
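As a minimal sketch of the encoding step (the SAFEConverter class and its slicer argument also appear in the issues below; the encoder method name and the example molecule are assumptions):

import safe as sf

# BRICS-based fragmentation is the default; another slicer can be supplied.
converter = sf.SAFEConverter(slicer="brics")

safe_str = converter.encoder("CC(=O)Nc1ccc(O)cc1")  # acetaminophen, an example input
print(safe_str)  # a valid SMILES made of dot-separated connected fragments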


News 🚀

💥 2024/01/15 💥

  1. @IanAWatson has a C++ implementation of SAFE in LillyMol that is quite fast and uses a custom fragmentation algorithm. Follow the installation instructions in that repo and check out the CLI docs here: docs/Molecule_Tools/SAFE.md

Installation

You can install safe using pip:

pip install safe-mol

You can use conda/mamba:

mamba install -c conda-forge safe-mol

Datasets and Models

| Type | Name | Infos | Size | Comment |
| --- | --- | --- | --- | --- |
| Model | datamol-io/safe-gpt | 87M params | 350M | Default model |
| Training Dataset | datamol-io/safe-gpt | 1.1B rows | 250GB | Training dataset |
| Drug Benchmark Dataset | datamol-io/safe-drugs | 26 rows | 20 kB | Benchmarking dataset |

Usage

Please refer to the documentation, which contains tutorials for getting started with safe and detailed descriptions of the functions provided, as well as an example of how to get started with SAFE-GPT.

API

We summarize some key functions provided by the safe package below.

| Function | Description |
| --- | --- |
| safe.encode | Translates a SMILES string into its corresponding SAFE string. |
| safe.decode | Translates a SAFE string into its corresponding SMILES string. The SAFE decoder simply augments RDKit's Chem.MolFromSmiles with an optional correction argument to handle missing hydrogens. |
| safe.split | Tokenizes a SAFE string, e.g. to build a generative model. |

Examples

Translation between SAFE and SMILES representations

import safe

ibuprofen = "CC(Cc1ccc(cc1)C(C(=O)O)C)C"

# SMILES -> SAFE -> SMILES translation
try:
    ibuprofen_sf = safe.encode(ibuprofen)  # c12ccc3cc1.C3(C)C(=O)O.CC(C)C2
    ibuprofen_smi = safe.decode(ibuprofen_sf, canonical=True)  # CC(C)Cc1ccc(C(C)C(=O)O)cc1
except safe.EncoderError:
    pass
except safe.DecoderError:
    pass

ibuprofen_tokens = list(safe.split(ibuprofen_sf))

Training/Finetuning a (new) model

A command line interface is available to train a new model; please run safe-train --help. You can also provide an existing checkpoint to continue training or finetune on your own dataset.

For example:

safe-train --config <path to config> \
    --model-path <path to model> \
    --tokenizer  <path to tokenizer> \
    --dataset <path to dataset> \
    --num_labels 9 \
    --torch_compile True \
    --optim "adamw_torch" \
    --learning_rate 1e-5 \
    --prop_loss_coeff 1e-3 \
    --gradient_accumulation_steps 1 \
    --output_dir "<path to outputdir>" \
    --max_steps 5

References

If you use this repository, please cite the following related paper:

@misc{noutahi2023gotta,
      title={Gotta be SAFE: A New Framework for Molecular Design},
      author={Emmanuel Noutahi and Cristian Gabellini and Michael Craig and Jonathan S. C. Lim and Prudencio Tossou},
      year={2023},
      eprint={2310.10773},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

The training dataset is licensed under CC BY 4.0. See DATA_LICENSE for details. This code base is licensed under the Apache-2.0 license. See LICENSE for details.

Note that the model weights of SAFE-GPT are exclusively licensed for research purposes (CC BY-NC 4.0).

Development lifecycle

Setup dev environment

mamba env create -n safe -f env.yml
mamba activate safe

pip install --no-deps -e .

Tests

You can run tests locally with:

pytest

safe's People

Contributors

anri-lombard, hadim, kjappelbaum, maclandrol, mercuryseries, zhu0619


safe's Issues

Vocabulary Size

Hey everyone,

Just a heads up: in the paper, the tokenizer section mentions a vocabulary size of 1180, which should be 1880.

Feel free to close this; it's more a notification than an issue.

Non deterministic output

It came to my attention that the design methods of SAFE do not respect the random_seed.

This is because random_seed only controls the SAFE-related algorithms; to get deterministic sampling output, you need to call transformers.set_seed(your_seed) before your call.
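A minimal sketch of the workaround (transformers.set_seed is the real HF helper; SAFEDesign.load_default and de_novo_generation are assumed from the safe docs):

import safe as sf
from transformers import set_seed

set_seed(42)  # seeds the Python/NumPy/Torch RNGs used by HF sampling
designer = sf.SAFEDesign.load_default(verbose=False)  # assumed loading API from the docs
generated = designer.de_novo_generation(n_samples_per_trial=10)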

[BUG] Max length not always followed

If you use de novo generation (I haven't tested the other techniques), you can get outputs longer than the specified max length. I tried varying the temperature, and at higher temperatures like 1.3 some generated sequences exceeded my specified max length. A possible workaround is sketched below.
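A workaround, a sketch rather than official behavior, is to filter post hoc (designer being a SAFEDesign instance as in the docs):

# Hypothetical post-hoc length filter; max_len is an illustrative value.
max_len = 100
generated = designer.de_novo_generation(n_samples_per_trial=1000, sanitize=True)
kept = [smi for smi in generated if len(smi) <= max_len]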

Cannot resume from checkpoint

A Slurm error caused the model to stop training (not an issue with the library); when trying to resume from the checkpoint, an error occurs. Here is the command:

safe-train --config $config_path \
  --resume_from_checkpoint $checkpoint_path \
  --tokenizer $tokenizer_path \
  --dataset $dataset_path \
  --text_column "SAFE" \
  --optim "adamw_torch" \
  --learning_rate 5e-4 \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 32 \
  --gradient_accumulation_steps 2 \
  --report_to "wandb" \
  --warmup_steps 20000 \
  --logging_first_step True \
  --logging_steps 100 \
  --eval_accumulation_steps 1000 \
  --save_steps 1000 \
  --eval_steps 1000 \
  --eval_strategy "steps" \
  --wandb_project "SAFE_small" \
  --num_train_epochs 10 \
  --save_total_limit 1 \
  --output_dir $output_dir \
  --overwrite_output_dir True \
  --do_train True \
  --do_eval True \
  --save_safetensors True \
  --gradient_checkpointing True \
  --num_train_epochs 10 \
  --prediction_loss_only True \
  --max_grad_norm 1.0 \
  --include_descriptors False

Here is the specific error:

  File "/home/lmbanr001/.local/bin/safe-train", line 8, in <module>
    sys.exit(main())
  File "/home/lmbanr001/.local/lib/python3.10/site-packages/safe/trainer/cli.py", line 410, in main
    train(model_args, data_args, training_args)
  File "/home/lmbanr001/.local/lib/python3.10/site-packages/safe/trainer/cli.py", line 360, in train
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/lmbanr001/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/home/lmbanr001/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2236, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home/lmbanr001/.local/lib/python3.10/site-packages/accelerate/data_loader.py", line 665, in __iter__
    self.set_epoch(self.iteration)
  File "/home/lmbanr001/.local/lib/python3.10/site-packages/accelerate/data_loader.py", line 742, in set_epoch
    if hasattr(self.batch_sampler.sampler, "set_epoch"):
AttributeError: 'SkipBatchSampler' object has no attribute 'sampler'

My versions of transformers and accelerate:

Name: accelerate
Version: 0.33.0

Name: transformers
Version: 4.43.3

Just to verify, this is an accelerate issue, right? I've searched around their issues and it seems to have popped up in Oct 2023 and been fixed. It can't be an issue with the collator or dataloader on the safe library's side, right? If it's not an issue here, feel free to close it; I can open an issue in the accelerate library if needed 👍

Parameters for smaller model

I trained the safe-gpt model with the 20M model configuration specified in the paper, namely:

{
  "activation_function": "gelu_new",
  "attn_pdrop": 0.1,
  "bos_token_id": 10000,
  "embd_pdrop": 0.1,
  "eos_token_id": 1,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-5,
  "model_type": "gpt2",
  "n_embd": 768,
  "n_head": 8,
  "n_inner": null,
  "n_layer": 6,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": "tanh",
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_hidden_size": 128,
  "summary_use_proj": true,
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 10000,
  "num_labels": 9
}

After training, it ended up being a 40M parameter model. We are trying to reproduce your results for our paper; did you perhaps retrain a smaller tokenizer with a reduced vocabulary, or did you use the same tokenizer as for the 87M model? If the latter, I'm not sure how the parameter count ended up so far beyond the paper's claims. A rough count from the config above is sketched below.
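For reference, a back-of-envelope GPT-2 parameter count from the config above (an approximation that ignores bias terms and weight-tying details) already lands well above 20M:

vocab, d, n_layer, n_pos = 10000, 768, 6, 1024
embed = vocab * d + n_pos * d            # token + position embeddings (~8.5M)
per_layer = 12 * d * d + 13 * d          # attention + MLP + layer norms (~7.1M)
total = embed + n_layer * per_layer
print(f"{total / 1e6:.1f}M parameters")  # ~51.0M with this rough count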

Validity Calculation

The paper claims a validity very close to 1 (and in fragment constrained generation exactly 1 on average). Was this calculation done by looking at the validity of generated molecules, or how many molecules were valid from the eventually generated molecules?

For example:

  • If I generate 1000 molecules and 952 are valid, but running the returned molecules through RDKit shows 100% of them are valid, is validity 0.952 or 1.0? The reason I ask is that after training the model from scratch and replicating your de novo generation results, the fragment-constrained generation results do not seem to reach the claimed validity. I'm curious whether this is my mistake. (The two readings are contrasted in the snippet below.)
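To make the two readings concrete (a sketch; generated is whatever list of SMILES the designer returned):

from rdkit import Chem

def both_validities(generated, n_requested):
    # `generated` holds the SMILES strings actually returned by the designer
    valid = [m for m in (Chem.MolFromSmiles(s) for s in generated) if m is not None]
    validity_vs_requested = len(valid) / n_requested    # 0.952 in the example above
    validity_vs_returned = len(valid) / len(generated)  # 1.0 if every returned string parses
    return validity_vs_requested, validity_vs_returned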

For some more context, here is a snippet from my notebook:

import os
import sys

import numpy as np
import pandas as pd
from datasets import load_dataset
from tqdm import tqdm
from rdkit import Chem, DataStructs
from rdkit.Chem import rdMolDescriptors, RDConfig

# sascorer ships in RDKit's contrib directory
sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer

ds = load_dataset("datamol-io/safe-drugs")
benchmark_df = pd.DataFrame(ds['train'])
benchmark_df.info()

def calculate_diversity(mols):
    fps = [rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048) for mol in mols]
    similarities = []
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            similarities.append(1 - DataStructs.TanimotoSimilarity(fps[i], fps[j]))
    return np.mean(similarities)
    
def calculate_distance_to_original(generated_mols, original_mol):
    original_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(original_mol, 2, nBits=2048)
    generated_fps = [rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048) for mol in generated_mols]
    distances = [1 - DataStructs.TanimotoSimilarity(original_fp, gen_fp) for gen_fp in generated_fps]
    return np.mean(distances)
    
def run_fragment_constrained_benchmark(designer, benchmark_df, n_samples=1000, n_trials=1):
    results = []
    
    for _, row in tqdm(benchmark_df.iterrows(), total=len(benchmark_df)):
        original_mol = Chem.MolFromSmiles(row['smiles'])
        
        # Linker design
        linkers = designer.linker_generation(*row['morphing'].split('.'), n_samples_per_trial=n_samples, n_trials=n_trials, sanitize=True)
        linker_mols = [Chem.MolFromSmiles(smi) for smi in linkers if smi]
        
        # Motif extension
        motifs = designer.motif_extension(row['motif'], n_samples_per_trial=n_samples, n_trials=n_trials, sanitize=True)
        motif_mols = [Chem.MolFromSmiles(smi) for smi in motifs if smi]
        
        # Scaffold decoration
        scaffolds = designer.scaffold_decoration(row['scaffold'], n_samples_per_trial=n_samples, n_trials=n_trials, sanitize=True)
        scaffold_mols = [Chem.MolFromSmiles(smi) for smi in scaffolds if smi]
        
        # Scaffold morphing
        morphs = designer.scaffold_morphing(side_chains=row['morphing'].split('.'), n_samples_per_trial=n_samples, n_trials=n_trials, sanitize=True)
        morph_mols = [Chem.MolFromSmiles(smi) for smi in morphs if smi]
        
        # Superstructure generation
        superstructures = designer.super_structure(row['superstructure'], n_samples_per_trial=n_samples, n_trials=n_trials, sanitize=True)
        superstructure_mols = [Chem.MolFromSmiles(smi) for smi in superstructures if smi]
        
        tasks = ['Linker design', 'Motif extension', 'Scaffold decoration', 'Scaffold morphing', 'Superstructure']
        mol_lists = [linker_mols, motif_mols, scaffold_mols, morph_mols, superstructure_mols]
        
        for task, mols in zip(tasks, mol_lists):
            if mols:
                validity = len(mols) / n_samples
                uniqueness = len(set([Chem.MolToSmiles(mol) for mol in mols])) / len(mols) if mols else 0
                diversity = calculate_diversity(mols)
                distance = calculate_distance_to_original(mols, original_mol)
                sa_scores = [sascorer.calculateScore(mol) for mol in mols]
                sa_mean = np.mean(sa_scores)
                
                results.append({
                    'Drug': row['pref_name'],
                    'Task': task,
                    'Validity': validity,
                    'Uniqueness': uniqueness,
                    'Diversity': diversity,
                    'Distance': distance,
                    'SA score': sa_mean
                })
            else:
                results.append({
                    'Drug': row['pref_name'],
                    'Task': task,
                    'Validity': 0,
                    'Uniqueness': 0,
                    'Diversity': 0,
                    'Distance': 0,
                    'SA score': 0
                })
    
    return pd.DataFrame(results)
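I then call it like this (designer being a SAFEDesign instance loaded beforehand; shown for completeness):

results_df = run_fragment_constrained_benchmark(designer, benchmark_df, n_samples=1000, n_trials=1)
print(results_df.groupby("Task")[["Validity", "Uniqueness", "Diversity", "SA score"]].mean())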

(Disclaimer: this is with the small 20M model, so that could be the cause now that I ponder this)

Goal-directed generative capabilities

Hi,

Firstly, thank you for making this excellent repository. The SAFE paper has some really interesting insights on molecular design.

As I was going over the codebase, I did not find any code for the reinforcement learning part. Could you please let me know if I am missing something?
Also, how are you getting the advantage estimates in the PPO loss? Is there an additional value network?

Apologies if the questions are too obvious. I would really appreciate your insights on this.

About generation evaluation

Hello, how do you evaluate the model on validity, uniqueness, and diversity?

I followed your tutorial, but I get 83% validity for de novo generation.

Mol rings and attachment number with multiple rings

Amazing work guys! Very impressed.

We had an issue with the encode function on the following molecule:

import safe as sf
eh=sf.encode('CC1=CC=C(C=C1)C#CC1=NC=C(C=C1)C(F)(F)F')
print(eh)
#Cc1ccc(C#Cc2ccc2cn2)cc1.C2(F)(F)F
#expecting:
#Cc1ccc(C#Cc2ccc3cn2)cc1.C3(F)(F)F
sf.to_image('Cc1ccc(C#Cc2ccc3cn2)cc1.C3(F)(F)F')

Thank you!

Attachment points numbered differently by version

Hello. This might be an oddly specific question.

Running the Getting Started with SAFE tutorial, the decoded example for safe_fragment[0] in version 0.1.3 is:

'FC(F)(F)c1cc([*:2])n([*:3])n1'

However, this changes in version 0.1.4, where safe_fragment[0] is now:

'FC(F)(F)c1cc([*:4])n([*:3])n1'

Does version 0.1.4 read molecules differently?

Grad Norm and SAFE encoding Misunderstanding

When training a model on a different dataset, in this case https://huggingface.co/datasets/sagawa/ZINC-canonicalized (somewhat larger than MOSES and quite a bit smaller than SAFE-GPT's training set), the perplexity ends up very bad:

{
    "epoch": 1.0,
    "eval_runtime": 3393.0287,
    "eval_samples_per_second": 677.64,
    "eval_steps_per_second": 84.705,
    "perplexity": Infinity,
    "total_flos": 5.026540798656038e+16,
    "train_loss": 0.6743136047124862,
    "train_runtime": 6681.5609,
    "train_samples_per_second": 383.144,
    "train_steps_per_second": 2.993
}

Looking into it further, I discovered that the grad_norm is very large despite explicitly setting max_grad_norm:

"log_history": [
    {
      "epoch": 5e-05,
      "grad_norm": 57.65608215332031,
      "learning_rate": 5.0000000000000004e-08,
      "loss": 7.5185,
      "step": 1
    },
    {
      "epoch": 0.025,
      "grad_norm": 14.687941551208496,
      "learning_rate": 2.5e-05,
      "loss": 2.2887,
      "step": 500
    },
    {
      "epoch": 0.05,
      "grad_norm": 3.8548357486724854,
      "learning_rate": 5e-05,
      "loss": 1.0481,
      "step": 1000
    },
    {
      "epoch": 0.075,
      "grad_norm": 3.50759220123291,
      "learning_rate": 7.500000000000001e-05,
      "loss": 0.8689,
      "step": 1500
    },

The model then does not generate any valid molecules and seems to overfit:
[training convergence plot]

I tried adjusting the library myself and realised that transformers sets max_grad_norm to 1.0 by default, which made sense when I replicated your small-model results, since the grad norm stayed between 0 and 1 throughout and gave good results at the end.

Do you have a solution in mind? It might be that the default is ignored during warmup steps, but I haven't found any evidence for this while reading through the Trainer code (https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py).

For more context, this is a 50M parameter model with a learning rate of 1e-4, and the dataset is ~20M ZINC molecules. Do you have any intuition about what the problem might be?

output is not deterministic (?)

Thanks for packaging up SAFE in such a nice package!

I've been trying

safe.encode(smiles, seed=42, canonical=True, randomize=False)

with smiles = "CC(Cc1ccc(cc1)C(C(=O)O)C)C" (using the latest PyPI release),

but I sometimes receive c12ccc3cc1.C3(C)C(=O)O.CC(C)C2 and other times c13ccc2cc1.C2(C)C(=O)O.CC(C)C3.

I realize that the representations are equivalent, but wondered whether the canonical output was supposed to be deterministic?
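A quick repro loop, using the same call as above:

import safe

smiles = "CC(Cc1ccc(cc1)C(C(=O)O)C)C"
outputs = {safe.encode(smiles, seed=42, canonical=True, randomize=False) for _ in range(20)}
print(outputs)  # more than one element confirms in-process non-determinism;
                # separate runs may still differ even if this set has one element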

Cheers,
Kevin

embedding layer

Hi,
What is the easiest way to extract the embedding layer so we can get a 768-dim (or x-dim) vector for every molecule?
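For context, a sketch of what I have in mind, assuming the checkpoint loads as a standard transformers GPT-2 model (the safe library's own SAFETokenizer may be needed instead of AutoTokenizer):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datamol-io/safe-gpt")  # assumption: may need SAFETokenizer
model = AutoModel.from_pretrained("datamol-io/safe-gpt")

inputs = tokenizer("c12ccc3cc1.C3(C)C(=O)O.CC(C)C2", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
embedding = hidden.mean(dim=1)                  # one mean-pooled vector per molecule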

Identifying the starting number for the bond fragments

Hi Emmanuel, it is Ulrich (I said hello to you on LinkedIn).
Thanks for the amazing work. I wanted to find out whether there is a way to identify the starting_number used for the new rings in a SAFE-encoded string (https://github.com/datamol-io/safe/blob/main/safe/converter.py#L330). I want to experiment with a slightly different representation of SAFE, but I need a neat mechanism to convert from SAFE to my representation and back.

I took this SMILES as an example, the one in the paper:

repr = encode("O=C(C#CCN1CCCCC1)Nc1ccc2ncnc(Nc3cccc(Br)c3)c2c1", canonical=False)

Running the above I get 'O=C7C#CC6.N16CCCCC1.N74.c14ccc2ncnc8c2c1.N85.c15cccc(Br)c1'.

The starting_number here is 4, but is there a function I can use to find this out?
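In the meantime, a heuristic sketch of what I'm doing (not an official API): the new attachment ring numbers appear to start just above the largest ring-closure label already used in the input SMILES.

import re

def used_ring_numbers(smiles: str):
    # Drop bracket atoms ([NH3+], [C@@H], ...) so their digits don't count,
    # then collect %nn two-digit and single-digit ring-closure labels.
    s = re.sub(r"\[[^\]]*\]", "", smiles)
    tokens = re.findall(r"%\d{2}|\d", s)
    return [int(t[1:]) if t.startswith("%") else int(t) for t in tokens]

smi = "O=C(C#CCN1CCCCC1)Nc1ccc2ncnc(Nc3cccc(Br)c3)c2c1"
print(max(used_ring_numbers(smi)) + 1)  # -> 4, matching the example above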

Inquiry Regarding Reverse Molecular Design and Comparison of Models

Hi Emmanuel Noutahi,

I trust this message finds you well. I recently came across your article on the impressive performance of the representation SAFE in reverse molecular design. I have a few questions and would greatly appreciate your insights.

Firstly, in your comparison of the performance of different large pretrained models on molecules, I noticed the absence of MOLGPT, which is known for its exceptional performance. Given MOLGPT's ability to conduct conditional generation on targeted fragments or properties, my first question is about the performance comparison between SAFE and MOLGPT (e.g., Table 2).

Secondly, could you shed some light on the comparison between SAFE and MOLGPT in terms of their capabilities for conditional generation on targeted fragments or properties?

Lastly, I am curious about the choice not to employ conditional generation, as seen in MOLGPT, and instead adopt Proximal Policy Optimization (PPO) for goal-directed generative tasks. Additionally, it appears that the PPO-related programs are not open-sourced. Could you provide some insights into the rationale behind this choice?

Thank you in advance for your time and consideration. I look forward to hearing from you soon.

Best,

Yan Chen

Error in Fused Ring Systems

Thanks for the amazing work. I found that in certain cases the encoding and decoding of SMILES strings is not consistent. For example, take the canonical string:

'CC1CCC2(C)C(CCC3C2CCC2(C)C(C(=O)CO)CCC32)C1'

In canonical form, this string uses only 3 distinct ring-closure digits despite having 4 rings, while the generated fragments use 4 digits, so the assignment of the attachment integers fails. Once you decode the SAFE string, you end up with a different molecule that has a 7- and a 4-membered ring instead of the original 6- and 5-membered rings.

Here is a small code snippet to reproduce the issue:

import safe 
import datamol as dm
test_string = 'CC1CCC2(C)C(CCC3C2CCC2(C)C(C(=O)CO)CCC32)C1'

output_string = safe.decode(safe.encode(test_string))

print(output_string == test_string)
dm.viz.to_image([dm.to_mol(test_string), dm.to_mol(output_string)])

Original molecule: [image]

Decoded molecule: [image]

To do

  • Revisit README
  • Clean code
  • Build docs
  • Model availability
  • Training scripts
  • Application demo

Decoding fragments containing square brackets fails

I am using the decoder to decode individual fragments, and I notice that decoding fails when the fragment SMILES contains square brackets:

import safe as sf
import datamol as dm

example_failed = """
O=C(CN1CC[NH2+]CC1)N1CCCCC1
[NH3+]Cc1ccccc1
c1cc2c(cc1[C@@H]1CCC[NH2+]1)OCCO2
"""


safe_obj = sf.SAFEConverter(slicer="brics", require_hs=False)

safe_str = sf.encode("c1cc2c(cc1[C@@H]1CCC[NH2+]1)OCCO2", canonical=True)
safe_fragment = safe_str.split(".")

with dm.without_rdkit_log():
    for frag in safe_fragment:
        f = safe_obj.decoder(
            frag,
            as_mol=False,
            canonical=False,
            fix=True,
            remove_dummies=True,
            remove_added_hs=True,
        )
        if f is None:
            print(frag)

This returns None, and I suspect it is due to the way square brackets are parsed in the decoder.

Thank you.

Cannot train model from scratch

When running the following script:

config_path="../trainer/configs/small_config.json"
tokenizer_path="../tokenizer.json"
dataset_path="../../Datasets/MOSES/datasets"
output_dir="./trained/SAFE_small"

safe-train --config $config_path \
  --tokenizer $tokenizer_path \
  --dataset $dataset_path \
  --text_column "SAFE" \
  --torch_compile True \
  --optim "adamw_torch" \
  --learning_rate 5e-4 \
  --prop_loss_coeff 1e-3 \
  --gradient_accumulation_steps 1 \
  --output_dir $output_dir \
  --num_labels 9 \
  --num_train_epochs 3 \
  --per_device_train_batch_size 8 \
  --lr_scheduler_type "linear" \
  --warmup_steps 500 \
  --logging_steps 100 \
  --evaluation_strategy "steps" \
  --eval_steps 500 \
  --save_steps 500 \
  --load_best_model_at_end True \
  --metric_for_best_model "eval_loss" \
  --greater_is_better False

I get the error:

Traceback (most recent call last):
File "/home/lmbanr001/.local/bin/safe-train", line 8, in
sys.exit(main())
File "/home/lmbanr001/.local/lib/python3.10/site-packages/safe/trainer/cli.py", line 406, in main
train(model_args, data_args, training_args)
File "/home/lmbanr001/.local/lib/python3.10/site-packages/safe/trainer/cli.py", line 335, in train
trainer = SAFETrainer(
File "/home/lmbanr001/.local/lib/python3.10/site-packages/safe/trainer/trainer_utils.py", line 19, in init
self.accelerator.dispatch_batches = dispatch_batches
AttributeError: can't set attribute 'dispatch_batches'

As I understand it, dispatch_batches is set when using another form of ingesting the data. Is there some intuition as to why my code is trying to set dispatch_batches? (A possible compatibility shim is sketched below.)
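If it helps, a sketch of a compatibility shim for safe/trainer/trainer_utils.py (assuming accelerate >= 0.28 moved these flags into DataLoaderConfiguration, which would match the error):

# Inside SAFETrainer.__init__, replacing the failing assignment:
if hasattr(self.accelerator, "dataloader_config"):
    # accelerate >= 0.28: the flag lives on DataLoaderConfiguration
    self.accelerator.dataloader_config.dispatch_batches = dispatch_batches
else:
    self.accelerator.dispatch_batches = dispatch_batches  # older accelerate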

Decoding fragment fails on double bond, [possible bug]?

I noticed that if the slicer (in this case BRICS) breaks double bonds, the resulting fragment cannot be properly decoded.

Using such a molecule and following the documentation:

import safe as sf

safe_str = sf.encode("C(=C/c1ccccc1)\CCc1ccccc1", canonical=True)
print(safe_str)

safe_fragment = safe_str.split(".")
sf.decode(safe_fragment[0], as_mol=True)

I get: SAFEDecodeError: Failed to decode C(=2)c1ccccc1

I think the more appropriate output might be C(=[*])c1ccccc1?

mc_labels

Hi, awesome stuff.

What does mc_labels contain? There's no info about it. Thank you.

Strange artifacts when wildcart (*) present in SMILES

Hi, I found a few artifacts when dealing with SMILES that contain the wildcard *, especially in rings. In the attached screenshot, you can see that a ring becomes a 'square' after encoding and decoding. This happens when converting from BRICS to SAFE.

[screenshot]
