ispyb / py-ispyb Goto Github PK

View Code? Open in Web Editor NEW

11.0 11.0 13.0 12.07 MB

ISPyB backend server based on FastAPI

License: GNU Lesser General Public License v3.0

Python 99.37% Shell 0.52% Dockerfile 0.11%

fastapi jwt python sqlalchemy

py-ispyb's Introduction

ISPyB Project

Installing
Database creation and update
Database schema

Installing

Clone the ISPyB repository:

git clone https://github.com/ispyb/ISPyB.git

ISPyB needs the third-party libraries provided in the dependencies directory. These don't exist in a public repository, so install them to the local Maven repository so that Maven can find them:
```
cd dependencies && mvn initialize
```

Build ISPyB using Maven:

mvn clean install

By default, ISPyB builds for the GENERIC site and the development environment. These can be changed with the ispyb.site and ispyb.env system properties. For example, to build for the ESRF site and the test environment:

mvn -Dispyb.site=ESRF -Dispyb.env=test clean install

If the build succeeds, a summary message will be printed like this:

[INFO] Reactor Summary:
[INFO]
[INFO] ispyb-parent ...................................... SUCCESS [0.251s]
[INFO] ispyb-ejb3 ........................................ SUCCESS [10.243s]
[INFO] ispyb-ws .......................................... SUCCESS [1.751s]
[INFO] ispyb-ui .......................................... SUCCESS [7.212s]
[INFO] ispyb-ear ......................................... SUCCESS [5.048s]
[INFO] ispyb-bcr ......................................... SUCCESS [2.217s]
[INFO] ispyb-bcr-ear ..................................... SUCCESS [1.806s]

Database creation and update

Run the following creation scripts from the ispyb-ejb module (note that this requires the pxadmin database user to exist and have full permissions):

ispyb-ejb/db/scripts/pyconfig.sql: This corresponds to the menu options and contains both structure and data.
ispyb-ejb/db/scripts/pydb.sql: This corresponds to the ISPyB metadata and contains only the database structure.
ispyb-ejb/db/scripts/schemastatus.sql: This corresponds to the SchemaStatus table and contains both structure and data. The entries indicate which update scripts have been run.
ispyb-ejb/db/scripts/ispybAutoprocAttachment.sql: This corresponds to the type and names of different autoPROC attachments.

The creation scripts are normally updated for each tag, but if you are using the master branch, you may have to run the update scripts in ispyb-ejb/db/scripts/ahead.

Check the entries in the SchemaStatus table to know which scripts to execute. The scripts already run for the current tag are in ispyb-ejb/db/scripts/passed.

Creating an update script

The first line must be:

insert into SchemaStatus (scriptName, schemaStatus) values ('2017_06_06_blabla.sql','ONGOING');

Then comes the actual updates of the script:

-- SQL statements here

And the last line must be:

update SchemaStatus set schemaStatus = 'DONE' where scriptName = '2017_06_06_blabla.sql';

The last line updates the SchemaStatus table to mark the script as having been run.

Database schema

A patch or commit that changes the database schema must be accompanied by a corresponding change to the schema file to keep it up to date. This file can be edited using MySQL Workbench (a free tool from MySQL).

py-ispyb's People

Contributors

Stargazers

Watchers

Forkers

drnasmith dsanchez-cells antolinos clemenbor stufisher herouxa rjgildea qinwenming stuartcampbell tomdiamond1 hormiai76 fabcor-maxiv alessandro-olivo

py-ispyb's Issues

Split setup.py -> setup.cfg

Move metadata from setup.py into setup.cfg, use setup.py only for python commands

Input Validation

Another important part of ISPyB is input and validation. This is great that you are already using marshmallow and the likes, however its worth looking a bit deeper into the validation. I find that string is usually not a sufficient validator. Protein acronym often ends up in the file system and should be constrained to something along the lines of an [A-z0-9]+ regexp. These are easy to build with marshmallow, so we should probably build a compilation of useful ones.

I use something along the lines of:

def ValidatedRegexp(pattern, **kwargs):
    expressions = {
        "word": "^[A-z0-9]+$",
        "word-dash": "^[A-z0-9-]+$",
        "word-dash-space": "^[\w\s-]+$",
    }

    if pattern in expressions:
        validates = kwargs.pop("validate", None)
        if validates and type(validates) != list:
            validates = [validates]

        return fields.Str(
            pattern=expressions[pattern],
            validate=[
                validate.Regexp(
                    expressions[pattern],
                    flags=0,
                    error="Input {input} does not match {regex}",
                )
            ]
            + (validates if validates else []),
            **kwargs
        )
    else:
        raise Exception(f"Cant find regular expression for {pattern}")

I was also thinking about how say shippingid, proposalid, etc, is always the same, and it would be good to build an assortment of these that can be reused. Not quite sure how to do this, but it would be nice to be able to define:

propsalid = fields.Int()

and reuse that through the models. Some thought needed on how to organise that.

Add contributing.md

Add a contributing.md with instructions on project guidelines, code style, review process, etc.

Endpoint Design

Thanks for putting this together, this is looking good. Below some general discussion on endpoint design, related to the document than @drnasmith put together for api spec.

    /prop/list
    /prop/get_by_person
    /prop/<proposalid>

These three resources all return the same payload. The design of using different resources for different filters is historic and it is not very REST like. It also makes consuming the api complicated because you must use different endpoints to ultimatly access the same resource.

Most REST guidelines suggest these should be a single resource, with query parameter filters. Right now the queries are very simple, but once you add authorisation these will grow in complexity, by making these a single resource you will significantly reduce code duplication later. In synchweb we have a single resource per table so there is only ever one source of truth, it makes maintainace much simpler.

Also if you read REST guidlines they suggest using plural nouns, rather than things like /list, /get and the likes so we should go for something like

/proposals
/proposals/<proposalid>
/proposals?login=<login>
/proposals?active=1

With sqlaclhemy you can actually share the query for the single and list case with some pretty succint code:

    query = Proposals.query

    if login:
        query = (query
            .join(PersonModel, PersonModel.personid=Propsals.personid)
            .filter_by(PersonModel.login=login)
        )


    if proposalid:
        query = query.filter_by(proposalid=proposalid)
        proposal = query.first()
        if not proposal:
            return { "error": "No such proposal" }, 404
        else
            return proposal
    else:
        proposals = query.all()
        return ...

I would also suggest that right from the start you paginate every resource, ive found throughout writing synchweb that if you start with non paginated resources they all grow to the point they need paginating. Best to do this from the off, and its really easy with sqlalchemy:

    total = query.count()
    query = query.limit(lims["per_page"]).offset(lims["st"])
    proposals = query.all()

    return {"total": total, "rows": proposals}

Looks like there is actually a paginate function in flask_sqlalchemy?

The endpoints also return simple structures at the moment, these will also grow in complexity, the best part of using a relational database is we know the relations, with proposals you want to know how many sessions, maybe samples, datacollections, etc. This is also really nicely done with sqlalchemy, it will often be able to infer the relationship, so you can just do:

    query = Proposals.query

    query = (
        .join(BLSession).add_columns(func.count(BLSession.session).label("sessions"))
    )

Simplify configuration

For example, use cfg files for site specific configuration

AbstractAuthentication

Add AbstractAuthentication class

Generate API documentation in CI

Would be nice to generate API documentation in CI and push to docs with swagger or redoc like:
https://ui.gitlab-pages.esrf.fr/daiquiri/api/spec/
https://gitlab.esrf.fr/ui/daiquiri/-/blob/master/.gitlab-ci.yml#L162

We use a hacky method in daiquiri to create the spec.json, this might be easier with FastAPI

Add a folder for the minutes

We need a place where the minutes are stored and made available. As it is continuity of the ISPyB project it also includes the minutes from:
https://github.com/ispyb/ISPyB/tree/master/webpages

Next step might be to implement something similar to:
https://mxcube.github.io/mxcube/

This is the proposed structure for the collaboration meetings:

├── 2018
│   ├── December_3_2018.pdf
│   ├── ESRF-DLS-Cryo-EM-Tablesv01102018.pdf
│   ├── February_26_2018.pdf
│   ├── July_2_2018.pdf
│   ├── June_4_2018.pdf
│   ├── March_26_2018.pdf
│   ├── May_3_2018.pdf
│   ├── meetings2018.html
│   ├── November_5_2018.pdf
│   └── October_1_2018.pdf
├── 2019
│   ├── 2019-01-14_meeting_minutes.pdf
│   ├── 2019-02-12_meeting_minutes.pdf
│   ├── 2019-03-05_minutes.pdf
│   ├── 2019-04-29_meeting_minutes.pdf
│   ├── 2019-07-09_meeting_minutes.pdf
│   ├── 2019-10-07-minutes.pdf
│   ├── Developers_20191202_final.pdf
│   └── meetings2019.html
├── 2020
│   ├── 2020-03-09_meeting_minutes.pdf
│   ├── Developers_20200120_TI.pdf
│   ├── ISPyB_developers_20200804.docx
│   ├── ISPyB_Developers_20200909_v2.pdf
│   ├── ISPyB_Developpers_Web_meeting_Minutes_20200309.pdf
│   ├── ISPyB_Developpers_Web_meeting_minutes_20200406.pdf
│   ├── ISPyB_Developpers_Web_meeting_minutes_20200511.pdf
│   └── meetings2020.html
└── 2022
    ├── 2022_03_17
    │   ├── PyISPyB_backend_minutes.pdf
    │   ├── PyISPyB_backend_presentation.pdf
    │   ├── PyISPyB_frontend_minutes.pdf
    │   └── PyISPyB_frontend_presentation.pdf
    ├── 2022_03_31
    │   ├── fastAPI_presentation.pdf
    │   ├── flask.pdf
    │   ├── introduction.pdf
    │   └── minutes.pdf
    └── 2022_04_20
        └── minutes.pdf

Version requirement on marshmallow_jsonschema

There is no version constraint for the requirement on marshmallow_jsonschema.
This can cause problem as some versions are not compatible with the specified version of marsmallow (for instance 0.13.0 generates the error bellow).
I suggest having a constraint (marshmallow_jsonschema==0.10.0 works just fine) to make sure we end up with a compatible version.

Error with marshmallow_jsonschema 0.13.0:
ImportError: cannot import name 'INCLUDE' from 'marshmallow' (/usr/local/lib/python3.8/dist-packages/marshmallow/init.py)

Improvements for database schema file (ispyb_ssx_db.sql)

A couple of things that could be done better in the database schema file:

The PK columns don't need to specify UNIQUE as they are already unique by virtue of being primary keys.
The PK columns that have datatype int should be int unsigned.
It should be possible to declare the JSON columns as datatype JSON in MariaDB since version 10.2.7 and in MySQL since version 5.7.8. Note that in MariaDB the JSON keyword is just an alias for longtext something-something with a constraint that checks that it's valid JSON. This is fine, and should work as expected.
I see there are table comments for a couple of the tables. It would be good to have something for all of them.

I would also advise you to consider importing the current schema into a database, then exporting that using mysqldump (e.g. mysqldump ispyb_ssx > ispyb_ssx_db.sql), and then using that schema file in the repo instead of the one you have currently. The reason is that: a) the current schema file has separated the foreign key constraints from the create table statements, which makes it hard to read. b) Also, the ints don't have a specified number of digits (int(10), int(11) etc), and different versions of MariaDB / MySQL might have different defaults for this, which could lead to errors. (The number of digits is really only used for presentation purposes, but it can still break foreign key constraint creation if the number is different between foreign key column and the column it's referencing.)

Session / proposal decorators should return 403 not 401

These two decorators are returning 401 at the moment which is unauthenticated. The client has a token but is not allowed access to the resource, so it should return 403 instead. I think its good to differentiate these two cases

https://github.com/ispyb/py-ispyb/blob/master/pyispyb/core/routes/legacy/dependencies.py#L43
https://github.com/ispyb/py-ispyb/blob/master/pyispyb/core/routes/legacy/dependencies.py#L78

Theres a case here for 404 too, depending how hidden the routes should be:
https://auth0.com/blog/forbidden-unauthorized-http-status-codes/

code definitions:
https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

Define module structure

Currently project has the following structure:

app
| --- extensions: system related extensions like logging, access to db, authentication
| --- + modules: marshmallows schemas and flask models.
| --- person
| --- proposals
| --- routes: endpoints
| --- autoproc

Currently one module reflects one database table and routes use several modules. For example route auto_proc uses auto_proc_status, auto_proc_attachment, and etc

Supporting different databases

I was thinking one way to support dls and esrf database structures would be to create two new packages

ispyb-models-dls
ispyb-models-pydb

which both provide ispyb.models, pyispyb should depend on one of these being installed. By installing the right one you get the relevant models.

These two packages can build the sqlalchemy models automatically in CI from the relevant database structures, for dls there is already a github workflow (azure, needs to be converted):
https://github.com/DiamondLightSource/ispyb-api/blob/master/.azure-pipelines/update-orm.yml
https://github.com/diamondlightsource/ispyb-database

I suggest extracting the pydb sql definitions out of the java project into a new project.

ispyb-models-dls provides ispyb.models -> generated from https://github.com/diamondlightsource/ispyb-database
ispyb-models-pydb provides ispyb.models -> generated from ispyb/pydb (new project)

Or we could generate both within py-ispyb CI and select with an env var?

We can also build docker containers based on each of the packages.
In the long term we should aim to remerge (they are not a million miles away)

Thoughts @mgaonach ?

Pagination

Original issue: #36

    total = query.count()
    query = query.limit(lims["per_page"]).offset(lims["st"])
    proposals = query.all()

    return {"total": total, "rows": proposals}

Looks like there is actually a paginate function in flask_sqlalchemy?

    query = Proposals.query

    query = (
        .join(BLSession).add_columns(func.count(BLSession.session).label("sessions"))
    )

Error Handling / Status Codes

Reading through i think you have some of this already, but its also worth getting error handling set up in a standardised way, in another project i have a

ErrorSchema(Schema):
    error = fields.Str()

MessageSchema(Schema):
    message = fields.Str()

Now if you use the same schema everywhere we have standardised response types.

We should think about http status codes too:
422 = malformed payload, you should get this from an invalid marshmallow model i think?
401 = unauthorised
403 = no proposal maybe?
404 = not found
201 = created (say new proposal)
200 = ok, normal

we should avoid sending 500, the user should never see a 500, this is an uncaught exception.

Update Readme with the need of installing libmariadbclient-dev on ubuntu

At least in Ubuntu distributions the command sudo pip install -r requirements.txt will fail if libmariadbclient-dev is not installed with this message:

    ERROR: Command errored out with exit status 1:
     command: /home/ale/miniconda3/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-nhuytjq7/mysqlclient/setup.py'"'"'; __file__='"'"'/tmp/pip-install-nhuytjq7/mysqlclient/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-gpaelu57
         cwd: /tmp/pip-install-nhuytjq7/mysqlclient/
    Complete output (12 lines):
    /bin/sh: 1: mysql_config: not found
    /bin/sh: 1: mariadb_config: not found
    /bin/sh: 1: mysql_config: not found
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-nhuytjq7/mysqlclient/setup.py", line 15, in <module>
        metadata, options = get_config()
      File "/tmp/pip-install-nhuytjq7/mysqlclient/setup_posix.py", line 65, in get_config
        libs = mysql_config("libs")
      File "/tmp/pip-install-nhuytjq7/mysqlclient/setup_posix.py", line 31, in mysql_config
        raise OSError("{} not found".format(_mysql_config_path))
    OSError: mysql_config not found
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

In order to prevent this error someone could:

apt-get install libmariadbclient-dev

I wonder if it would be worth to mention in the Readme page

Add a site-specific-config.cfg file parsed with config.py

Site specific configuration will be stored in site-specific-config.cfg (ignored in git) and parsed in config.py.

Add LDAP authentication

Clean up logging

Clear up the logging mechanism, log file, kong log.

Authentication mechanism

Hi @IvarsKarpics,

Have you already thought about authentication and authorization mechanisms? Something I really don't like from ISPyB today is that the code is inside ISPyB itself or it is managed by Wildfly.
Could we envisage that ISPyB connects to an authentication module that lives outside of ISPyB?

Thanks,
A.

Create a test DB

Create a test DB:

that is light enough to be pushed
set up locally for tests (as easily as possible)
set up automatically for CI tests

Authentication mechanisms

Gather information and requirements for site specific user authentication. In comments please add necessary mechanisms.

Update documentation for FastAPI

Documentation was written for the Flask project - update it to match with FastAPI project

Cleanup unnecessary files

Cleanup all unnecessary files left from previous Flask versions

JSON Schema

You can transform all of the marshmallow models into json-schema models automatically using:
https://github.com/fuhrysteve/marshmallow-jsonschema

This is awesome because now we have a single source of models, in synchweb we have to maintain a model in the front end, and php api to enforce validation.

You can now use these json-schema models to automatically generate forms with the proper validation:
https://github.com/rjsf-team/react-jsonschema-form

Decide how to handle site-specific features

We need to decide how to handle site-specific features:

Have them in this repository
- how to separate them from the generic code?
Have them in external repositories
- how to use this external code?

Fix docker container

Remove db token

As noticed here #139 (comment), the authentication module is creating a database token.

This was an experimentation to create a database token, which is used by the Java ISPyB backend. py-ispyb just creates it and then never uses it.
With this, we were able to log-in into py-ispyb and then use both backends simultaneously without having to handle two different logging in.
But then it was decided to use legacy routes in py-ispyb instead, so this should be removed.

Keep tokens after restaring the server

Should we somehow (db, redis) keep tokens after restarting the server?

Remove ESRF references from LDAP authentication plugin

There are some ESRF references in the LDAP authentication plugin that should be removed:

py-ispyb/pyispyb/app/extensions/auth/LdapAuthentication.py

Line 94 in a4d0711

+ ",ou=People,dc=esrf,dc=fr))"

Authorisation

At the moment the queries dont have any authorisation, this is a big part of ISPyB. For every resource you need to enforce a linkage right through to either:

Session_has_Person to know what user has access to what.
Some sort of admin/staff check (further discussion below)

In synchweb i check for a dcid or proposal parameter, and use this to check the user has access before getting in to doing a resource query. I still always join the proposal right through as well in the queries, you can check this out in the synchweb source code. It would be worth working out how to do this in a generic way and apply it to all routes.

This means that all queries are tied to a proposal, and in synchweb we find that sometimes we want to access resources in an administrative way i.e. see all shipments from all proposals.

The basic authorisation scheme is:

A user can only access data if they were on the session i.e. Session_has_Person
We follow this for proposals too, the user must exist in Session_has_Person in order to access the proposal
There is also a Proposal_has_Person table, which is unused in Synchweb, we should think about whether this should be implemented. There are cases where you might want to register a shipment before a session exists.

I dont know if you use proposal login at EMBL, but this is a special case and should not be the norm. If it is really needed it should rely on a Session_has_Person entry for a person with login of the proposal. Really, we should enforce user login, so we know who is logged in and what they are doing.

Roles / UserGroup:

At the moment the official java api uses roles to identify staff types, in my opinion these are way too primitive. At diamond we use UserGroup, Permission, UserGroup_has_person, and UserGroup_has_permission to enable fine grained access to specific resources. If you are making a scalable python api from scratch i would suggest you use this approach and add a specific permission to each endpoint. This lets you build complex access rights cleanly and simply. You can also admin them from within the api as we do in synchweb

Bigger scale tests

Current tests (#123) allow for to detection of breaking changes in the code but do not represent real-life use cases, since they use a very light database.

Make it possible to run bigger scale tests on each site's infrastructure to make sure the code keeps scaling up with good performance and reliability.

Those tests will not be part of CI and will probably be partially site-specific.
Need to make it as generic as possible.

Review and complete auth.md

I am just checking if the auth.md is clear and wondering if it could be completed.

Safer dummy auth plugin

The dummy auth plugin allows to impersonate any user and to get any permission, which is very useful for dev and test.
But if this plugin ends up enabled in a production environment by mistake, it would be a big security issue.

We should find a way to make sure this cannot happen. Some ideas:

Ideally, find some way to make it impossible to enable it in production
Otherwise, add a safety layer (ask for confirmation when enabled at startup?)

Any other idea?

Save server and client configuration in db

There is a long standing ticket in SynchWeb to move the configuration out of some horrible php config file and into the database., similar to how wordpress works (https://wordpress.com/support/dashboard/). This way the only configuration that is needed is a couple of env vars for the db and the auth config to boot the backend. Then everything else can be configured at run time which doesnt require your devops team to redeploy your container / server when you want to change a config option. AdminVar could be used for this purpose storing dumped json in the value to allow more complex data structures.

It would be sensible for the UI to read this configuration information as well so it doesn't need its own config file. This is how daiquiri works, the server sends it the relevant config information, this means there is a single source of truth.

Move to src layout

As per https://blog.ionelmc.ro/2014/05/25/python-packaging/

Problem with yaml load for auth settings

File "/users/gaonach/git/py-ispyb/./pyispyb/config.py", line 43, in
yaml_settings.update(yaml.safe_load(f, Loader=yaml.FullLoader))
TypeError: safe_load() got an unexpected keyword argument 'Loader'

Fresh installation will fail because of database credentials

By following the guide I have wrongly assumed that I could edit ispyb_core_config.yml freely.
This might be the case when there is an existing database however when creating the DB from scracth by using the script create_core_db.sh it forces to use root and then it creates the user mxuser independently of ispyb_core_config.yml

get DataColletions bad result

There is an issue with the get method on DataColletions: the returned result always is an object with only null fields.

When I comment the following line in DataColletions.get (pyispyb/core/routes/data_collections.py), I get a correct result:
@api.marshal_list_with(data_collection_schemas.f_schema, skip_none=False, code=HTTPStatus.OK)

flake8 plugin for naming conventions

From #142 we should be able to build a custom flake8 plugin for our naming conventions built on:
https://github.com/PyCQA/pep8-naming

background:
https://flake8.pycqa.org/en/latest/plugin-development/index.html#video-tutorial

Rename `master` branch to `main`

https://github.com/github/renaming
https://sfconservancy.org/news/2020/jun/23/gitbranchname/

Implement DESY authentication plugin

Since DESY is not using LDAP or keycloack to authenticate external users, a custom plugin needs to be created.

Mock of Database

Hi @IvarsKarpics,

For unit and integration testing, would it be convenient to be capable to mock a MariaDB instance with synthetic data?
We could enrich such database with different use cases. If so, I can have a look at it.

Cheers,
A.

Naming convention

We currently have two naming conventions in the code

snake case (python standard)
mixed case (used to refer to DB columns names) - see #135

The linting for variable names has been deactivated (see #137 ) to cope with this difference.

We need to enforce a standard through linting so that the code does not become a mix of everything after some time.

We need to decide with the collaboration to either :

Enforce python standard snake case everywhere, and decouple code naming from db names
Override python standard and enforce mixed case everywhere, and thus have a consistent convention in db, backend and frontend (is there a flake8 plugin to do so?)

Add inbuilt query diagnostics

It would be good to be able to automatically time, print, and potentially EXPLAIN queries running in production or testing to aid debugging. I think it should be easy to add a part of the json response. It's currently possible to enable query debugging with an env variable (QUERY_DEBUG) which will dump the query and execution time, expand upon this.

Implement User Office data synchronization

As recommended by Alejandro (21.03.2022), the idea will be to create a specific JSON data structure which will include all the information about a proposal (sessions, participants, etc). Every site will adapt their data coming from the User Office into the defined py-ispyb JSON structure.
In this way, we can create a general synchronization solution which every site will be able to use, and also we may include automatic tests.