
Comments (7)

longemen3000 commented on June 27, 2024

Hi Caleb,

Given the old and new versions, I could program a manual diff to see what's changed. I'm going to start with this and let you know what I find.

from chemicals.

longemen3000 commented on June 27, 2024

From a preliminary parse, the new database has more synonyms than the old one:

Old

julia> CC.load_db!(:inorganic_old2)
[ Info: :inorganic_old2 arrow file not generated, processing...
syms_i = 6326 # number of synonyms
syms_unique = 6325 # unique synonyms (one synonym is repeated; I have yet to find which)
(Arrow.Table with 153 rows, 9 columns, and schema:
.....

New

julia> CC.load_db!(:inorganic_new)
[ Info: :inorganic_new database file not found, downloading from https://github.com/CalebBell/chemicals/files/6912649/Inorganic.db.csv       
[ Info: :inorganic_new database file downloaded.
[ Info: :inorganic_new arrow file not generated, processing...
syms_i = 9461
syms_unique = 9438
(Arrow.Table with 164 rows, 9 columns, and schema:
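The gap between syms_i and syms_unique means some synonyms occur more than once. A minimal Python sketch of how the repeated entries could be located, assuming each record has already been reduced to a list of synonym strings (the function name and input shape are hypothetical, not part of either database loader):

```python
from collections import Counter

def synonym_stats(records):
    """Return (total synonyms, unique synonyms, sorted repeated synonyms).

    `records` is assumed to be an iterable of synonym lists, one list per
    compound. This input shape is an illustration, not the real loader's.
    """
    counts = Counter(syn for synonyms in records for syn in synonyms)
    total = sum(counts.values())
    repeated = sorted(syn for syn, n in counts.items() if n > 1)
    return total, len(counts), repeated

# Tiny made-up example: "water" is listed under two different records.
example = [["water", "oxidane"], ["silane", "water"]]
print(synonym_stats(example))
```

Comparisons here are exact string matches; whether names should be case-folded first is a separate decision.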

comparing the differences, by InChI:

InChI contained in the old database, not present in the new database

  "InChI=1S/CH2.Co/h1H2;/q-1;+1"
  "InChI=1S/Cr.2H2Si/h;2*1H2"
  "InChI=1S/H4Si/h1H4"
  "InChI=1S/Na.H3O4P/c;1-5(2,3)4/h;(H3,1,2,3,4…
  "InChI=1S/F6Si.2H3N/c1-7(2,3,4,5)6;;/h;2*1H3…
  "InChI=1S/Bi.2ClH.2H/h;2*1H;;/q+2;;;;/p-2"
  "InChI=1S/Al.Na.2O.2H/q-1;+1;;;;"
  "InChI=1S/BrHO3.Cs/c2-1(3)4;/h(H,2,3,4);/q;+…
  "InChI=1S/2Na.H3O4P/c;;1-5(2,3)4/h;;(H3,1,2,…
  "InChI=1S/2BH2.Ti/h2*1H2;"
  "InChI=1S/F6Si.2Na/c1-7(2,3,4,5)6;;/q-2;2*+1"
  ""
  "InChI=1S/2Na.3H2O4S/c;;3*1-5(2,3)4/h;;3*(H2…

InChI contained in the new database, not present in the old database

  "InChI=1S/Cl2S2/c1-3-4-2"
  "InChI=1S/O.Pr"
  "InChI=1S/Bi.2ClH/h;2*1H/q+2;;/p-2"
  "InChI=1S/Cr.2Si"
  "InChI=1S/C32H16N8.Cu/c1-2-10-18-17(9-1)25-33-26(18)38-28-21-13-5-6-14-22(21)30(35-28)40-32-24-…
  "InChI=1S/Al.Na.2O/q-1;+1;;"
  "InChI=1S/Al.La.O"
  "InChI=1S/2B.Ti"
  "InChI=1S/Na.H3O4P/c;1-5(2,3)4/h;(H3,1,2,3,4)"
  "InChI=1S/C.Co/q-1;+1"
  "InChI=1S/3O.2Yb/q3*-2;2*+3"
  "InChI=1S/2HI.Sm/h2*1H;/q;;+2/p-2"
  "InChI=1S/3ClH.Ru/h3*1H;/q;;;+3/p-3"
  "InChI=1S/H2O/h1H2"
  "InChI=1S/2B.Zr"
  "InChI=1S/10CO.2Re/c10*1-2;;"
  "InChI=1S/H3NO.H2O4S/c1-2;1-5(2,3)4/h2H,1H2;(H2,1,2,3,4)"
  "InChI=1S/Li.H"
  "InChI=1S/Na.H2O4S/c;1-5(2,3)4/h;(H2,1,2,3,4)"
  "InChI=1S/C.2W/q+1;;-1"
  "InChI=1S/6Al.2O2Si.9O/c;;;;;;2*1-3-2;;;;;;;;;"
  "InChI=1S/B.Li.O"
  "InChI=1S/Cd.2FH/h;2*1H/q+2;;/p-2"


longemen3000 commented on June 27, 2024

doing the same thing with the formulas:

julia> setdiff(set_new,set_old)
Set{String} with 21 elements:
  "Cl3Ru"
  "O3Yb2"
  "H2O" # water is in the new inorganics database
  "AlLaO"
  "I2Sm"
  "B2Zr"
  "H3NaO4P"
  "HLi"
  "Al6O13Si2"
  "Cl2S2"
  "As2H12O3"
  "CW2"
  "C32H16CuN8"
  "OPr"
  "ClH2Tl"
  "H5NO5S"
  "C10O10Re2"
  "BLiO"
  "H2NaO4S"
  "BrH2Tl"
  "CdF2"
julia> setdiff(set_old,set_new)
Set{String} with 11 elements:
  "HNa2O4P"
  "ClTl"
  "H4Si"
  "H4Na2O12S3"
  "As2O3"
  "BrCsO3"
  "BrTl"
  "H2NaO4P"
  "F6H8N2Si"
  "F6Na2Si"
  "D2Se"
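Both comparisons above, by InChI and by formula, are two-way set differences on a single column. A rough Python equivalent, assuming each database has been loaded as a list of dicts (the keys "inchi" and "formula" are hypothetical names chosen for this sketch):

```python
def column_diff(old_rows, new_rows, column):
    """Return (values only in old, values only in new) for one column."""
    old_values = {row[column] for row in old_rows}
    new_values = {row[column] for row in new_rows}
    return sorted(old_values - new_values), sorted(new_values - old_values)

# Two-row stand-ins built from entries listed in this thread.
old_rows = [
    {"inchi": "InChI=1S/H4Si/h1H4", "formula": "H4Si"},
    {"inchi": "InChI=1S/Cr.2Si", "formula": "CrSi2"},
]
new_rows = [
    {"inchi": "InChI=1S/H2O/h1H2", "formula": "H2O"},
    {"inchi": "InChI=1S/Cr.2Si", "formula": "CrSi2"},
]

only_old, only_new = column_diff(old_rows, new_rows, "formula")
print(only_old)  # formulas dropped from the new file
print(only_new)  # formulas added in the new file
```

Running the same function on several columns helps separate compounds that were really added or removed from compounds whose InChI was merely regenerated while the formula stayed the same.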


CalebBell commented on June 27, 2024

Hi Andrés,
Like all unmaintained software, bits and pieces of the chemical-metadata repository have rotted away. I cannot get the inchi module in rdkit to work for me, and I am having issues building rdkit.
Thanks for letting me know about the issue. I'm afraid we may have to manually patch the file for now.
Sincerely,
Caleb


CalebBell commented on June 27, 2024

Hi Andrés,
I found a version of rdkit which works on Linux, and it's on PyPI! One step closer to being able to update the database again. I think I actually need to port chemical-metadata to Python 3 as well.

Sincerely,
Caleb


longemen3000 commented on June 27, 2024

What do you think of adding ; as an additional separator? The main problem would be checking whether any names actually have ; as part of the name.
Maybe adding:

line = line.replace(';','\t')

before this line

values = line.rstrip('\n').split('\t')

could solve the problem temporarily?

Also, I noticed (from a quick look, nothing exhaustive) that the synonyms separated by ';' are always at the end of the list.

Edit: the split on ; must always be done after parsing the InChI, since InChI strings themselves contain ;.
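One way to respect that constraint is to split on tabs first and then apply the ; split only to the trailing synonym fields. A minimal sketch, assuming the identifier columns (including the InChI) occupy the first n_fixed tab-separated fields; the function name and the default of 9 are assumptions for illustration, not the loader's actual code:

```python
def split_record(line, n_fixed=9, extra_sep=";"):
    """Split a tab-separated record, applying extra_sep to synonyms only.

    The first n_fixed fields are returned untouched, so an InChI that
    contains ';' survives intact. n_fixed=9 is an assumed column count.
    """
    values = line.rstrip("\n").split("\t")
    fixed, raw_synonyms = values[:n_fixed], values[n_fixed:]
    synonyms = []
    for field in raw_synonyms:
        for name in field.split(extra_sep):
            name = name.strip()
            if name:  # drop empty pieces left by trailing separators
                synonyms.append(name)
    return fixed, synonyms

# Placeholder record: only the InChI is taken from this thread, and the
# synonym names are invented.
inchi = "InChI=1S/F6Si.2Na/c1-7(2,3,4,5)6;;/q-2;2*+1"
fields = ["cid", "cas", "formula", "mw", "smiles", inchi, "key", "iupac", "common"]
line = "\t".join(fields + ["name a", "name b;name c"]) + "\n"

fixed, synonyms = split_record(line)
print(fixed[5])   # the InChI, semicolons preserved
print(synonyms)   # ['name a', 'name b', 'name c']
```

This still splits any synonym that legitimately contains ;, so the check mentioned above is needed before relying on it.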


CalebBell commented on June 27, 2024

Hi Andrés,
I have fixed a lot in the chemical-metadata repository and generated a new inorganic file without this particular issue. I have attached it.

The difficulty is that the online data has changed so much that I can't even use a diff program to see what changed. Because of that, it's hard to replace the current file with the new one. Do you want to look at it?

Sincerely,
Caleb

Inorganic db.csv

