datastew's Introduction

datastew

Datastew is a Python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.

Installation

pip install datastew

Usage

Harmonizing excel/csv resources

You can import common data models, terminology sources or data dictionaries for harmonization directly from a csv, tsv or excel file. An example of how to match two separate variable descriptions is shown in datastew/scripts/mapping_excel_example.py:

from datastew.process.parsing import DataDictionarySource
from datastew.process.mapping import map_dictionary_to_dictionary

# Variable and description refer to the corresponding column names in your excel sheet
source = DataDictionarySource("source.xlsx", variable_field="var", description_field="desc")
target = DataDictionarySource("target.xlsx", variable_field="var", description_field="desc")

df = map_dictionary_to_dictionary(source, target)
df.to_excel("result.xlsx")

The resulting file contains the pairwise variable mapping based on the closest similarity for all possible matches as well as a similarity measure per row.
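As a rough illustration of what "closest similarity" means here, the sketch below pairs each source variable with the target variable whose vector has the highest cosine similarity. It is not datastew's implementation: the helper names (cosine_similarity, closest_matches), the toy two-dimensional vectors, and the choice of cosine similarity as the measure are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_matches(source, target):
    """For each source variable, return (source, best target, similarity)."""
    rows = []
    for s_name, s_vec in source.items():
        # Pick the target whose vector is most similar to this source vector.
        best = max(target, key=lambda t: cosine_similarity(s_vec, target[t]))
        rows.append((s_name, best, cosine_similarity(s_vec, target[best])))
    return rows


# Toy 2-dimensional "embeddings" (hypothetical); real embeddings have
# hundreds of dimensions and come from an embedding model.
source = {"age_years": [1.0, 0.1], "body_weight": [0.1, 1.0]}
target = {"AGE": [0.9, 0.2], "WEIGHT": [0.2, 0.9]}

for row in closest_matches(source, target):
    print(row)
```

Each printed tuple corresponds to one row of the result: source variable, best-matching target variable, and the similarity between them.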

By default, map_dictionary_to_dictionary uses the local MPNet model, which may not yield the best results. If you have an OpenAI API key, you can use their embedding API instead. To use your key, create a GPT4Adapter model and pass it to the function:

from datastew.embedding import GPT4Adapter

embedding_model = GPT4Adapter(key="your_api_key")
df = map_dictionary_to_dictionary(source, target, embedding_model=embedding_model)

Creating and using stored mappings

A simple example of how to initialize an in-memory database and compute a similarity mapping is shown in datastew/scripts/mapping_db_example.py:

from datastew.repository.sqllite import SQLLiteRepository
from datastew.repository.model import Terminology, Concept, Mapping
from datastew.embedding import MPNetAdapter

# omit mode to create a permanent db file instead
repository = SQLLiteRepository(mode="memory")
embedding_model = MPNetAdapter()

terminology = Terminology("snomed CT", "SNOMED")

text1 = "Diabetes mellitus (disorder)"
concept1 = Concept(terminology, text1, "Concept ID: 11893007")
mapping1 = Mapping(concept1, text1, embedding_model.get_embedding(text1))

text2 = "Hypertension (disorder)"
concept2 = Concept(terminology, text2, "Concept ID: 73211009")
mapping2 = Mapping(concept2, text2, embedding_model.get_embedding(text2))

repository.store_all([terminology, concept1, mapping1, concept2, mapping2])

text_to_map = "Sugar sickness"
embedding = embedding_model.get_embedding(text_to_map)
mappings, similarities = repository.get_closest_mappings(embedding, limit=2)
for mapping, similarity in zip(mappings, similarities):
    print(f"Similarity: {similarity} -> {mapping}")

output:

Similarity: 0.47353370635583486 -> Concept ID: 11893007 : Diabetes mellitus (disorder) | Diabetes mellitus (disorder)
Similarity: 0.20031612264852067 -> Concept ID: 73211009 : Hypertension (disorder) | Hypertension (disorder)

You can also import data from file sources (csv, tsv, xlsx) or from a public API like OLS. An example script that downloads SNOMED from EBI OLS and computes embeddings for it can be found in datastew/scripts/ols_snomed_retrieval.py.

datastew's People

Contributors

mehmetcanay, tiadams

datastew's Issues

Add DB adapter for Weaviate (vector db)

Implement a DB adapter for Weaviate:
https://weaviate.io/developers/weaviate

Use the local in-memory / file-based DB in a first implementation.

It should be possible to:

Store a computed embedding together with

  • A terminology label / ID (String)
  • Label of the Model used for generating this embedding (String)
  • The original String
  • A concept label / ID (String)

Retrieve an embedding

  • Based on the highest (cosine) similarity
  • Up to limit=n most similar vectors

Retrieve limit=n random vectors from the DB for visualization
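The requirements above can be sketched with a plain in-memory store. This is only a stand-in to make the expected behaviour concrete, not a Weaviate adapter and not part of datastew: the class names (StoredEmbedding, InMemoryVectorStore), the method names, and the "toy-model" label are all hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class StoredEmbedding:
    terminology: str  # terminology label / ID
    model: str        # label of the model that produced the vector
    text: str         # the original string
    concept: str      # concept label / ID
    vector: list      # the embedding itself


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class InMemoryVectorStore:
    def __init__(self):
        self._items = []

    def store(self, item):
        self._items.append(item)

    def closest(self, vector, limit=5):
        """Return up to `limit` items, most similar first."""
        ranked = sorted(self._items, key=lambda it: cosine(vector, it.vector), reverse=True)
        return ranked[:limit]

    def sample(self, limit=5, seed=None):
        """Return up to `limit` randomly chosen items, e.g. for plotting."""
        rng = random.Random(seed)
        return rng.sample(self._items, min(limit, len(self._items)))


store = InMemoryVectorStore()
store.store(StoredEmbedding("SNOMED", "toy-model", "Diabetes mellitus (disorder)", "11893007", [0.9, 0.1]))
store.store(StoredEmbedding("SNOMED", "toy-model", "Hypertension (disorder)", "73211009", [0.1, 0.9]))

for hit in store.closest([1.0, 0.0], limit=1):
    print(hit.concept, hit.text)
```

A real adapter would delegate the ranking to the vector database instead of sorting in Python; the linear scan here is only meant to pin down the interface.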

Target name mismatching in output

In row 8 of the target table below, the 'var' name is 4220292. After the text mapping (with both GPT4Adapter and MPNetAdapter), however, the output table shows 4220292Â for the same target variable.

Source:

    var          desc
0   edmmtyp      Multiples Myelom (symptomatisch)
1   edmmtyp      Smouldering Myeloma (asymptomatisch)
2   edmmtyp      MGUS - monoklonale Gammopathie unklarer Signifikanz
3   edmmtyp      Solitäres Plasmozytom
4   edmmtyp      Plasmazell-Leukämie
5   *sympmws     Schmerzenim Bereich der mittleren WS
6   *sympuws     Schmerzen im Bereich der unteren WS
7   *sympknoch   Knochenschmerzen
8   *sympleist   Leistungsverlust
9   *sympmued    Müdigkeit
10  *sympschwae  Schwäche

Target:

    var      desc
0   437233   Multiple myeloma
1   4184985  Smoldering myeloma
2   4082463  Monoclonal gammopathy of uncertain significance
3   4216139  Plasmacytoma
4   133154   Plasma cell leukemia
5   4169580  Pain in spine
6   4169580  Pain in spine
7   4129418  Bone pain
8   4220292  Impaired psychomotor performance
9   4223659  Fatigue
10  437113   Asthenia

Output:

    Source Variable  Target Variable  Similarity
0   edmmtyp1         437233           0.898915
1   edmmtyp2         4184985          0.910589
2   edmmtyp3         4082463          0.903709
3   edmmtyp4         4216139          0.847672
4   edmmtyp5         133154           0.897126
5   *sympmws1        4169580          0.813263
6   *sympuws2        4169580          0.81713
7   *sympknoch       4169580          0.844623
8   *sympleist       4220292Â         0.790407
9   *sympmued        4223659          0.899128
10  *sympschwae      437113           0.828928
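One plausible cause, not confirmed for this report, is a trailing non-breaking space (U+00A0) in the target cell: its UTF-8 bytes are 0xC2 0xA0, and reading those bytes as Latin-1 produces "Â" followed by a non-breaking space. The sketch below reproduces that effect and shows a possible clean-up step; clean_identifier is a hypothetical helper, not a datastew function.

```python
NBSP = "\u00a0"  # non-breaking space, invisible in most spreadsheet views

# Encode as UTF-8, then decode with the wrong codec (Latin-1):
# the two bytes 0xC2 0xA0 become the two characters "Â" and NBSP.
garbled = ("4220292" + NBSP).encode("utf-8").decode("latin-1")
print(repr(garbled))


def clean_identifier(value):
    """Drop the 'Â' + NBSP mojibake pair and surrounding whitespace.

    str.strip() already treats U+00A0 as whitespace, so a bare
    trailing non-breaking space is removed as well.
    """
    return value.replace("\u00c2" + NBSP, " ").strip()


print(clean_identifier(garbled))
```

Applying such a clean-up to the variable column before mapping would keep identifiers comparable, assuming the stray character really is a non-breaking space.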

AttributeError: module 'pkgutil' has no attribute 'ImpImporter'

I tried to install the package into PDataViewer. The virtual environment uses Python 3.12.1 and pip 24.1.2. I tried pip install --upgrade virtualenv and then virtualenv venv --python=python3.12, with and without --reset-app-data, but this does not solve the issue. Installing with Python 3.11 produces no error, so the problem is likely tied to the Python or pip version. Is there a way to mitigate this error without rolling back to Python 3.11?

  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [33 lines of output]
      Traceback (most recent call last):
        File "/home/may/git/PDataViewer/backend/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/home/may/git/PDataViewer/backend/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/home/may/git/PDataViewer/backend/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 112, in get_requires_for_build_wheel
          backend = _build_backend()
                    ^^^^^^^^^^^^^^^^
        File "/home/may/git/PDataViewer/backend/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 77, in _build_backend
          obj = import_module(mod_path)
                ^^^^^^^^^^^^^^^^^^^^^^^
        File "/home/may/.pyenv/versions/3.12.1/lib/python3.12/importlib/__init__.py", line 90, in import_module
          return _bootstrap._gcd_import(name[level:], package, level)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
        File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
        File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
        File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
        File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
        File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
        File "<frozen importlib._bootstrap_external>", line 994, in exec_module
        File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
        File "/tmp/pip-build-env-8q0wgqek/overlay/lib/python3.12/site-packages/setuptools/__init__.py", line 16, in <module>
          import setuptools.version
        File "/tmp/pip-build-env-8q0wgqek/overlay/lib/python3.12/site-packages/setuptools/version.py", line 1, in <module>
          import pkg_resources
        File "/tmp/pip-build-env-8q0wgqek/overlay/lib/python3.12/site-packages/pkg_resources/__init__.py", line 2172, in <module>
          register_finder(pkgutil.ImpImporter, find_on_path)
                          ^^^^^^^^^^^^^^^^^^^
      AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
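For context, pkgutil.ImpImporter was removed in Python 3.12, and the traceback shows the pkg_resources copy inside pip's isolated build environment still referencing it. That suggests the build is pulling in a setuptools release that predates Python 3.12 support, although the project's build requirements were not checked here. The small diagnostic below only reports whether the running interpreter still provides the attribute; build_backend_hint is a hypothetical helper name.

```python
import pkgutil
import sys


def build_backend_hint():
    """Report whether this interpreter still exposes pkgutil.ImpImporter."""
    if hasattr(pkgutil, "ImpImporter"):
        return "pkgutil.ImpImporter is available; older pkg_resources can import."
    return (
        "pkgutil.ImpImporter is missing; a pkg_resources that calls "
        "register_finder(pkgutil.ImpImporter, ...) will fail on import."
    )


print(sys.version_info[:2], build_backend_hint())
```

If the attribute is missing, the likely direction is a newer setuptools in the build requirements rather than a change to the virtual environment, since pip installs build dependencies into a separate temporary environment.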
