geodata-mart's Introduction

Geodata Mart

Search, select, clip, and deliver spatial data sources.

https://data.kartoza.com

geodatamart-preview

Run

The development stack is managed with Docker Compose.

Spin up development environment

cp .env.example .env
docker compose up -d --build

Bring down the stack

docker compose down -v

Running commands

Because the Django app runs within an isolated Docker container and may not have access to the project's declared environment variables, run the provided helper script from within a container to configure the environment:

source /app/setenv.sh

Once this script has run and defined the environment, Django commands can be run as normal.

python /app/manage.py shell

Using the Docker extension with VS Code, along with the docker.commands.attach command available in the supplied settings.json file, will automatically run this script when attaching a shell to a container.

Deploy

Deployment is done with docker compose (for now).

Some notes/ caveats on the 0.1 release deployment:

  • Run docker compose with root-level permissions: otherwise running and creating users on PostgreSQL may fail.
  • Use lowercase for the PostgreSQL user: if you generate a random username, make sure everything is cast to lowercase.
  • File permission issues: Celery workers need to run as root (for now) because the QGIS and local user configuration is not 100% resolved. As a result, the geodata/ qgis directory probably needs to be owned by root with 777 permissions, since Django and related services run under the django user. This may also cause issues with data removal from the Django admin UI.
  • Cascading reference removal: many foreign key fields relate to managed file objects, which include lifecycle hooks to keep the filesystem and the database models in some sort of sync. A bug there causes an error on cascading deletes, so during the architecture/ conception phase the default for most foreign keys was set to "do nothing" (see the sketch after this list). The same default is used for accounts and other objects whose lifecycle is not yet defined. This needs review and re-evaluation: for now, even removing a user may result in a meaningless "500/ Oops" error, because accounts and all other related items need to be removed first. The same goes for projects and similar objects.
  • Docker logging configuration: it is currently the default, which will likely cause system bloat over time.
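For context, a minimal sketch of the foreign key pattern described above is shown below. The model and field names are hypothetical and only illustrate the "do nothing" default, not the project's actual schema.

# Sketch only: illustrates the "do nothing" on_delete default described above.
# Model and field names are hypothetical, not the actual geodata-mart schema.
from django.db import models


class ManagedFile(models.Model):
    """A file object with lifecycle hooks that sync the filesystem and the DB."""

    file_object = models.FileField(upload_to="geodata/")  # hypothetical path


class Project(models.Model):
    # DO_NOTHING avoids triggering the buggy cascade/lifecycle hooks, but it
    # means related records must be removed manually, in the right order.
    config_file = models.ForeignKey(
        ManagedFile, on_delete=models.DO_NOTHING, null=True, blank=True
    )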

Adding projects

There's a fairly serious bug with regard to adding new project and config items. While developing and iterating over the schema structure and deciding which fields or requirements needed to go into the system, the foreign key field was defined on the project, which then related to the managed file objects. When the change was made to use dynamic file paths based on project details, it became possible to associate a file with the project, and when calling save() (as demonstrated in the seed operation) the file would be uploaded to the project path. This keeps relative paths intact and prevents collisions.

The problem is that this is not exposed through the admin UI: when uploading a config file, it is not possible to specify the reverse relation to the project in order to get the dynamic file path. There may be a clever hack for this, but I have not been able to find one, and a schema restructure seems like the better option; it is a bit late in the day for that to enter the deployment, though. New project definitions will therefore have to be managed programmatically for the time being, or have auxiliary files placed in the relevant geodata root directory (a rough sketch of the programmatic approach follows).
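The sketch below illustrates the programmatic flow roughly as the seed operation is described above. The app, model, field names, and file paths are assumptions for illustration, not the actual implementation.

# Sketch only: programmatically creating a project and attaching a config file
# so that save() uploads the file under the project's dynamic path.
# All names and paths here are hypothetical.
from django.core.files import File

from maps.models import Project, ManagedFile  # hypothetical app/model names


config = ManagedFile()
config.save()  # save the record first so the project can reference it
project = Project.objects.create(name="sample", qgis_project_file=config)

with open("/app/seed/sample.qgs", "rb") as fh:  # hypothetical seed file path
    # Saving the file contents after the project exists uploads them under the
    # project-specific dynamic path, mirroring the seed operation.
    config.file_object.save("sample.qgs", File(fh), save=True)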

Development

Development and the stack are managed using Docker. Note that there are multiple "environments" for application development, including:

  • Local development environment: a Python environment that includes prerequisites such as pre-commit, black, and other linting/ testing/ code quality tools. This can be the system environment, but using a venv is recommended.
  • Development environment: the dev.txt requirements are used by the dev Dockerfile, which provides a Django environment with a number of development and debug tools. This is the docker-compose stack environment used for development.
  • Production environment: the production.txt requirements are used by the production Dockerfile, which provides a Django environment intended to be pushed to a container repository and deployed with Kubernetes.
  • Note: on Windows, add FORKED_BY_MULTIPROCESSING=1 to your .env to prevent the Celery worker from failing.

Prerequisites

Development systems should have:

  • git
  • pre-commit
  • editorconfig
  • python 3.8

Dev environment venv creation

python -m venv venv
source venv/bin/activate
python -m pip install -e .[dev]

Framework

This application was modified from cookiecutter-django, and the project docs may be helpful to developers.

Settings

It is desirable to aim for dev-prod parity. The production.py settings exclude much of what dev.py includes, such as debug and development tools, and use a different email service configuration. The frontend.py settings replicate dev, but disable WhiteNoise and django-compressor's offline caching to allow more dynamic loading and modification of frontend assets without having to restart the stack.
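As a rough illustration (assuming a cookiecutter-django style settings layout), frontend.py might look something like the sketch below; the exact setting names and import paths used by the project may differ.

# frontend.py sketch: replicate the dev settings, but serve assets dynamically.
# Assumes a cookiecutter-django style settings package; actual contents may differ.
from .dev import *  # noqa: F401,F403

# Bypass WhiteNoise's cached/manifest static storage so edited assets are
# picked up without restarting the stack.
STATICFILES_STORAGE = "django.contrib.staticfiles.storage.StaticFilesStorage"

# Disable django-compressor's offline cache so bundles are rebuilt on request.
COMPRESS_OFFLINE = False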

Note that changes to the Celery tasks and related code require rebuilding the Docker image before they are reflected in the running tasks and scripts.

geodata-mart's People

Contributors

nyakudyaa, zacharlie


geodata-mart's Issues

Limit application user capabilities on datastores

The principle of least privilege should apply, to prevent breaches or abuse and to enforce rate limiting as outlined in #27. This is of particular importance to ensuring that #23 is implemented in an appropriate manner; however, this process needs to be robust and properly documented so that third-party vendors are capable of providing similar protections for their data and services.

Define "project source" management strategy

In line with #17, it stands to reason that each "project" (see #32) may require a particular "data source" definition which needs to be managed accordingly.

For PostGIS, this might be a service definition. For a QGIS backend, this will likely be an uploaded QGIS Project file with all the appropriate layers, styling, and data source credentials (like a pgservice definition etc).

This needs to be properly modelled, and have the lifecycle of a project taken into account.

Define initial data sources

The application is expected to deliver key data collections for the initial proof-of-concept delivery. This possibly includes data sources from ZANGI hosted by Kartoza, and possibly OSM data (via Docker-OSM/ Overpass, etc.). We need to collect a list of specific layers from specific sources and identify how they might be included in the product.

Create credit system

One simple solution to #27 may be to create a "credit" system which attaches a balance to a user account and limits the number of operations they are able to perform based on that balance.

This scales as well, as new users might simply start with a credit balance, and manually adding new credits to an account should be trivial for administrators. Once proper billing is implemented, a more sophisticated approach to this might be used.

Although I am typically not a proponent of microtransactions, in this case I believe the approach may have significant merit. The typical ethical issues that affect many microtransaction implementations can be mitigated by pegging the value of credits to a legitimate currency or basket and offering reimbursement options. This prevents skewing of user value perceptions, and may simply be made implementation-specific if necessary.
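A minimal sketch of what such a credit model might look like is shown below; the model and field names are illustrative only and do not reflect an actual implementation.

# Sketch only: a minimal credit-balance model as described above.
# Model and field names are illustrative, not an actual implementation.
from django.conf import settings
from django.db import models


class CreditAccount(models.Model):
    user = models.OneToOneField(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    # Pegged to a real currency/basket so user value perceptions are not skewed.
    balance = models.DecimalField(max_digits=12, decimal_places=2, default=0)

    def can_afford(self, cost):
        return self.balance >= cost

    def debit(self, cost):
        """Deduct credits for an operation; callers should check can_afford first."""
        self.balance -= cost
        self.save(update_fields=["balance"])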

Define metering apparatus

Multiple levels of metering will likely be required in order to track, bill, and limit (#27) user actions which may result in costly processing and storage operations. This is unlikely to be meaningful exclusively at the infrastructure level, and most likely some metering information will need to be captured on each operation (such as the size of the region to be clipped, or the total amount of data processed in MB).
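For illustration, a per-operation metering record might capture something like the following; the model, fields, and units are assumptions, not a decided design.

# Sketch only: a per-operation metering record; names and fields are assumptions.
from django.conf import settings
from django.db import models


class MeteringRecord(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.DO_NOTHING)
    operation = models.CharField(max_length=64)       # e.g. "clip"
    region_area_km2 = models.FloatField(null=True)    # size of the region clipped
    data_processed_mb = models.FloatField(null=True)  # total data processed
    created_at = models.DateTimeField(auto_now_add=True)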

Create application API

The core application backend should be exposed via an API to ensure integration capabilities outlined in #1

Define initial query parameters

The utility or API will require a few parameters in order to perform its work. These need to be clearly defined and documented, including which parameters are required or optional and what defaults might be used.
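As a starting point, the parameters already used by the clip script (see the qgis_process example further below) suggest something like the following. The required/optional split and the defaults shown here are illustrative, not decided.

# Sketch only: candidate query parameters for the clip operation, based on the
# existing gdmclip script inputs. The split and defaults are illustrative.
REQUIRED_PARAMS = {
    "LAYERS": "comma-separated layer names to clip",
    "CLIP_GEOM": "clip polygon as WKT",
    "PROJECT_PATH": "QGIS project that defines the data sources",
    "OUTPUT": "output name/path for the packaged result",
}

OPTIONAL_PARAMS = {
    "OUTPUT_CRS": "EPSG:4326",  # illustrative default output projection
    "BUFFER_DIST_KM": 0,        # optional buffer around the clip geometry
}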

Define billing operations

Billing will likely be a critical consideration for multiple aspects of the system, as it will serve to ensure platform sustainability, promote growth, and provide incentive for maintaining platform data by vendors.

Billing is also non-trivial and has multiple points of contact beyond a simple "subscriber" type system, including managing vendor costs, storage, processing, metering, and transfers to vendors where appropriate.

User login and ACL

The user authentication framework will need to be coupled with the application to enforce ACL rules for resource access.

Define user model

User attributes should be clearly defined, especially to ensure that the storage and processing of user data is done in compliance with the relevant legislation (GDPR/ POPIA, etc.).

Define project URL

In line with the requirements for #12, a web address will be critical to the application's exposure and brand.

Define data object models

Data objects will need to be properly defined, ensuring that their lifecycle and properties can be handled effectively.

The nomenclature should probably also be standardised (e.g. data object/ element/ source/ project/ collection etc.)

Collect additional data sources

Aside from the data required from #22, it makes sense to try to identify potential data sources that might be made available within the product or framework.

Resolve WPS processing issues

Ideally we should be in a position to leverage the py-qgis-wps framework for a scalable and powerful approach to backend processing. It provides a status API and many of the benefits of a generic WPS, along with direct support for QGIS providers, plugins, scripts, and models.

A simple stack can be set up with the following docker-compose.yaml:

version: "3"

services:
  wps:
    image: 3liz/qgis-wps:ltr-rc
    platform: linux/x86_64
    environment:
      QGSWPS_SERVER_PARALLELPROCESSES: "2"
      QGSWPS_SERVER_LOGSTORAGE: REDIS
      QGSWPS_REDIS_HOST: wpsredis
      QGSWPS_PROCESSING_PROVIDERS_MODULE_PATH: /processing
      QGSWPS_CACHE_ROOTDIR: /projects
      QGSWPS_SERVER_WORKDIR: /srv/data
      # QGSWPS_USER: 1000:1000
      QGSRV_SERVER_RESPONSE_TIMEOUT: 1800
      QGSRV_SERVER_CROSS_ORIGIN: "yes"
      QGSWPS_LOGLEVEL: DEBUG
    volumes:
      - ./projects:/projects
      - ./processing:/processing
      - ./output:/srv/data
    ports:
      - "9999:8080"

  wpsredis:
    image: redis:5-alpine

The initial script developed for testing is on my development branch under commit c42ca95, which can be downloaded directly from GitHub.

The "test" directory contains a small sample data set that can be used for evaluation:

Note that QGIS Processing requires the script to be added to the default profile, as the processing framework does not currently support profiles. The script can be run locally with QGIS Processing, e.g.:

C:\OSGeo4W\bin\qgis_process-qgis-ltr.bat run script:gdmclip --distance_units=meters --area_units=m2 --ellipsoid=EPSG:7030 --LAYERS='world, dem' --CLIP_GEOM='Polygon ((28.5 -28.0, 28.5 -29.0, 29.5 -29.0, 29.5 -28.0, 28.5 -28.0))' --OUTPUT_CRS='EPSG:4326' --BUFFER_DIST_KM=50 --PROJECT_PATH='c:/test/projects/sample.qgs' --OUTPUT='c:/test/output/geodatamart'

This produces a zip file with a GeoPackage and the associated rasters.

Resetting raster paths in the output project seems to be a bit buggy; this will be fixed.

Storing the raster to GeoPackage does not work within the processing framework. When using the CLI, it throws an error. When using the same processing script from the QGIS GUI (3.26), it throws an error but still writes the output raster to the GeoPackage. When using the processing tools directly in QGIS, it works. The RASTER_TABLE parameter also does not work properly unless used as TABLE.

Instead of a single clip and transform with GDAL, a multi-step process has been used as a workaround to help identify and address issues, with the GeoPackage output raster step commented out and replaced with a flat-file output raster that is included in the zip.

Assessing the WPS capabilities, operations, and input and output requirements can be performed with OWSLib, or alternatively, with a little more work, with vanilla requests as outlined by the py-qgis-wps tests.

OWSLib is easily installed with conda and is utilised as demonstrated below:

from owslib.wps import WebProcessingService, printInputOutput

# https://geopython.github.io/OWSLib/usage.html#wps


wps = WebProcessingService(
    "http://127.0.0.1:9999/ows/?service=WPS&MAP=sample", verbose=False, skip_caps=True
)
wps.getcapabilities()

# Basic service metadata
print(wps.identification.type)
print(wps.identification.title)
print(wps.identification.abstract)

for operation in wps.operations:
    print(operation.name)

# List the available processes
for process in wps.processes:
    print(process.identifier, process.title)

# Describe each process and print its outputs
for process in list(wps.processes):
    print(process.identifier)
    description = wps.describeprocess(process.identifier)
    for output in description.processOutputs:
        printInputOutput(output)
    print("----------")

Execute process:

from owslib.wps import WebProcessingService, monitorExecution
import re
import uuid

# Random output name so repeated runs do not collide
output_name = uuid.uuid4().hex
wps = WebProcessingService(
    "http://127.0.0.1:9999/ows/?service=WPS&MAP=sample", verbose=False, skip_caps=True
)
processid = "script:gdmclip"
inputs = [
    ("LAYERS", "world, dem"),
    ("CLIP_GEOM", "POLYGON((29.0 29.0,29.0 30.0,30.0 30.0,30.0 29.0,29.0 29.0))"),
    ("OUTPUT_CRS", "4326"),
    ("BUFFER_DIST_KM", "50"),
    ("OUTPUT", "OUTPUT"),
]

execution = wps.execute(processid, inputs, output=output_name)

# Extract the job uuid from the execute response so the status can be tracked
response = str(execution.response)
job_id = re.search(r"uuid=(.*?)\"", response).group(1)
print(job_id)

monitorExecution(execution)

The result is an "internal error" which has not been resolved despite various attempted workarounds and experiments.

Using the template processing script from QGIS (a simple vector buffer) works with the WPS and the compose stack provided.

In the meantime, the plan is to use QGIS Processing directly in the django/ celery containers and revisit the WPS utilisation later (a rough sketch of that interim approach follows).
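As a rough illustration of that interim approach, the sketch below shows a Celery task that shells out to qgis_process inside the worker container. The task name, binary path, and parameter values are assumptions based on the CLI example above, not the project's actual implementation.

# Sketch only: a Celery task wrapping the qgis_process CLI call shown above.
# The task name, qgis_process path, and fixed parameters are assumptions.
import subprocess

from celery import shared_task


@shared_task
def clip_project_data(project_path, layers, clip_geom, output_path):
    """Run the gdmclip processing script against a QGIS project."""
    cmd = [
        "qgis_process", "run", "script:gdmclip",
        "--distance_units=meters",
        "--area_units=m2",
        "--ellipsoid=EPSG:7030",
        f"--LAYERS={layers}",
        f"--CLIP_GEOM={clip_geom}",
        "--OUTPUT_CRS=EPSG:4326",
        "--BUFFER_DIST_KM=50",
        f"--PROJECT_PATH={project_path}",
        f"--OUTPUT={output_path}",
    ]
    # Raise if the processing run fails so Celery records the error
    subprocess.run(cmd, check=True, capture_output=True, text=True)
    return output_path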


Outline product ecosystem

Outline existing tools, frameworks, utilities, and processes that may be used alongside #1 to determine product and process structure

Provide capacity for processing external sources

This may be specific to certain data collection types or processing backends, but it is likely that certain models and application details (such as definitions of layer objects) will need to be managed internally by the application or specified by data vendors.

The backend may already accommodate this, such as adding WFS layers to QGIS projects for processing, but it is likely that implementation-specific details will need to be considered, such as ensuring the correct credentials are accessible by the project and the system.

Define rate limiting process

For access to data, as well as to limit the processing abilities of individual users and prevent degradation of services or abuse, it may be necessary to implement various rate limiting operations on different parts of the system. The strategy for these implementations should be clearly defined.

Define policy on support for external sources

It should be decided whether external resources (such as metadata objects stored in a STAC catalog, GeoNetwork, or similar) are able to be defined on the platform, and what sort of distinguishing metadata might be attached to such items.

Define backend processing management framework

Multiple processing backends are expected to be supported in order to properly deliver on the #1 Roadmap, so we need to answer:

  • What backends are supported?
  • How are they managed?
  • How is this model extended?

Define privacy policy

A privacy policy will be required for storing and managing user data. In addition, if user actions are logged as outlined in #43, it is likely that this must also be covered by the policy.

Define product roadmap

Define a product development road map that outlines expected milestones, targets, and strategy for development

Expose data objects to external systems

It would probably make a lot of sense to expose data components to external systems through industry standard APIs such as WCS or other OGC APIs where possible

Outline required policies

Aside from typical policies such as #44 and #45, additional policies are likely required in order for the platform to function efficiently and in line with its goals. Many of these would typically be required when operating a service on behalf of data vendors. The first step would be to outline the required policies.

Some such examples might include:

  • data removal
  • data use, licensing, and relicensing
  • data quality standards
  • data access

Create Checkout Workflow

Map forms will be used to create processing jobs, but an "ecommerce" type workflow and UX should be used to (see the sketch after this list):

  • ensure that processes are only run when desired
  • ensure that operations/ orders can be inspected and double-checked against costs (e.g. which layers are to be processed)
  • ensure that processing costs are accurately evaluated and available for introspection
  • ensure that credit balances are correct and can be debited at runtime
  • allow advanced validation prior to processing, preventing frontend bugs from affecting backend processes
  • allow orders to be cancelled, or stored for later processing
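One way to picture that flow, purely for illustration, is an "order" wrapper around a processing job with explicit states; the model, states, and fields below are not a decided design.

# Sketch only: an "order" style wrapper around a processing job, illustrating
# the checkout flow described above. States and names are not decided.
from django.db import models


class Order(models.Model):
    class Status(models.TextChoices):
        DRAFT = "draft"          # built from the map form, open for inspection
        CONFIRMED = "confirmed"  # costs reviewed, credits reserved
        PROCESSING = "processing"
        COMPLETE = "complete"
        CANCELLED = "cancelled"

    status = models.CharField(max_length=16, choices=Status.choices, default=Status.DRAFT)
    estimated_cost = models.DecimalField(max_digits=12, decimal_places=2, default=0)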

Define vendor model

Data vendors will be distinct from users and will need to be handled appropriately to ensure control of data sources

Define data collection/ categorization methodology

It is highly probable that at some point the ability to group or otherwise categorize data will be required to provide a good UX. This includes items such as allowing users to select "layer groups" from a data source, or browsing and searching the data catalogue.

This may include some options such as layer tagging or other metadata management processes which should be outlined earlier rather than later, as each distinct UX may have implementation specific concerns.

Create sign up procedure

A sign up procedure should be created which ensures that users agree to the specified terms. If the service remains a closed service, information regarding account creation and the process to create accounts should be made available

Define project definition management strategy

While the definition of how to manage the details for connecting to a project's data sources may be covered in #34, it is still necessary to establish a model and strategy for how this data might be exposed or published for end users.

This includes the definition of specific layers that are published, how they might be tested for consistency, definition of available extents for data or layers, and some form of access control to limit which users might be exposed to a particular data collection.

In addition, some form of tests against requests for out of bounds data might be required as well.
