magda-io / magda

A federated, open-source data catalog for all your big data and small data

License: Apache License 2.0

Scala 8.81% Shell 0.06% HTML 0.08% JavaScript 28.98% Java 0.02% TypeScript 18.29% Dockerfile 0.04% Open Policy Agent 0.81% SCSS 4.23% Mustache 0.27% Smarty 0.01% Raku 38.40%

open-data scala nodejs kubernetes elasticsearch postgresql

magda's Issues

Correct or remove harvested datasource and spatial dataset counts

Enable HTTPS

For magda-dev.terria.io and search.data.gov.au.

Add an account screen to change user details

Remove "spatial coverage" section from dataset page

Since we don't currently have spatial coverage (we'll add it before long).

CKAN: Datasets can be missed due to paging

The CKAN Connector executes multiple package_search queries for a page of datasets at a time. CKAN's default search order is first by relevance (presumably all datasets have equal relevance when there is no query) and second by modified date, descending. This puts the most recently modified datasets at the beginning.

If new datasets are added while the crawler is running, they'll be missed, while other datasets will potentially be examined multiple times. This is acceptable-ish because crawling is inherently a point-in-time activity. The new datasets will be picked up on the next crawl.

More concerning, though, is if an existing dataset is modified during crawling. This will move it to the first page of results after the crawler has already moved to later pages. Therefore, it will be missed. Missing a dataset that existed prior to the start of crawling is not acceptable (-ish or otherwise).

It would be nice if CKAN datasets had a simple auto increment ID type of thing that we could use as the sort order, guaranteeing stable order and that new datasets added during crawling would be added at the end. But since there is nothing like that (AFAICT), it's tricky.

One idea is to sort by ascending modified date, purposely request overlap in the pages, verify that overlap actually exists, and if it doesn't, re-request the page with an earlier start index until we find our overlap.
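The overlap idea could be sketched roughly as follows. This is only an illustration under assumptions: `fetch_page(start, rows)` is a hypothetical stand-in for a package_search call that returns dataset IDs sorted by ascending modified date, and the catalog in `demo` is simulated in memory.

```python
from typing import Callable, Dict, List


def crawl_with_overlap(fetch_page: Callable[[int, int], List[str]],
                       rows: int = 4, overlap: int = 1) -> List[str]:
    """Collect dataset IDs page by page, insisting that each page overlaps
    what has already been seen. fetch_page is a hypothetical callback."""
    if not 0 < overlap < rows:
        raise ValueError("overlap must be between 1 and rows - 1")
    seen: Dict[str, None] = {}  # dict used as an insertion-ordered set
    start = 0
    while True:
        page = fetch_page(start, rows)
        if not page:
            break
        if seen and start > 0 and not any(ds_id in seen for ds_id in page):
            # No overlap: something moved while we were crawling, so step
            # back and re-request with an earlier start index.
            start = max(0, start - overlap)
            continue
        for ds_id in page:
            seen.setdefault(ds_id, None)
        if len(page) < rows:
            break  # short page means we reached the end
        start += rows - overlap
    return list(seen)


def demo() -> List[str]:
    """Simulate a dataset being modified (moved to the end) mid-crawl."""
    catalog = [f"ds{i}" for i in range(10)]
    calls = 0

    def fetch_page(start: int, rows: int) -> List[str]:
        nonlocal calls
        calls += 1
        if calls == 2:
            catalog.append(catalog.pop(1))  # ds1 is modified during the crawl
        return catalog[start:start + rows]

    return crawl_with_overlap(fetch_page)
```

In the simulation, a plain offset crawl would skip ds4 after ds1 moves; the overlap check notices the gap and steps back. Datasets can still be visited more than once, which the connector would need to tolerate.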

Improve Search

Adding tasks from workshop on 31 May 17:

fix issue with "Business names by ASIC as CSV" example search query - @AlexGilleran
fix issue with "trees in Victoria" example search query - @AlexGilleran
search/query syntax help page needs to be created and linked from search examples box - @cam-grant @chloeleichen
review of filter UI - @cam-grant
publishers sidebar in search results: use list of publishers selected in publishers filter, take top 5 and display details for each - @chloeleichen @cam-grant
publisher filter - clear command should close popover panel - @chloeleichen
Investigate & fix (if broken) date range filter - @AlexGilleran @chloeleichen
Investigate and fix (if broken) location filter - @AlexGilleran @chloeleichen

Make indexer use fast registry paging

Using page tokens instead of offsets will be much faster.
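As a rough illustration of why tokens help: with an offset the store has to walk past every skipped row, while a token lets it seek straight to the first row after the last one returned. The sketch below assumes records carry an increasing sequence number; `RECORDS`, `page_by_token`, and `read_all` are hypothetical names, not the registry's API.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical in-memory stand-in for registry records, sorted by sequence number.
RECORDS: List[Tuple[int, str]] = [(seq, f"record-{seq}") for seq in range(1, 11)]
SEQS: List[int] = [seq for seq, _ in RECORDS]


def page_by_token(token: Optional[int], limit: int):
    """Return (page, next_token). The token is the last sequence number seen,
    so the lookup is a seek (here a binary search) rather than a scan."""
    start = 0 if token is None else bisect_right(SEQS, token)
    page = RECORDS[start:start + limit]
    next_token = page[-1][0] if len(page) == limit else None
    return page, next_token


def read_all(limit: int) -> List[str]:
    """Follow page tokens until the store reports there are no more pages."""
    names: List[str] = []
    token: Optional[int] = None
    while True:
        page, token = page_by_token(token, limit)
        names.extend(name for _, name in page)
        if token is None:
            break
    return names
```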

CSW Connector for the registry

Read from a CSW server, write records and aspects to the MAGDA registry.

UI: Implement collaboration / project view page

Purpose of projects (data requests) is unclear

API Doc and Feedback links at the bottom of the web UI are hardcoded addresses

One is the dev IP and the other is localhost.

facet not adding up

In the first search, on selecting a publisher I got 3 results.

When I select the second publisher, there should be 3 + 31 = 34 results, but it tells me it has no matching result, and the facet hit count has changed:

Is it because the publisher facet is using AND logic? If we are using AND, then why don't we just force the user to select only one option at any given time?
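To make the AND-versus-OR question concrete, here is a small sketch with made-up data: 3 datasets from one publisher and 31 from another, each dataset having exactly one publisher. The publisher names and both filter functions are hypothetical.

```python
from typing import Dict, List

# Made-up data matching the counts in the report: 3 + 31 datasets.
DATASETS: List[Dict[str, str]] = (
    [{"id": f"a{i}", "publisher": "Publisher A"} for i in range(3)]
    + [{"id": f"b{i}", "publisher": "Publisher B"} for i in range(31)]
)


def filter_and(datasets: List[Dict[str, str]], publishers: List[str]) -> List[Dict[str, str]]:
    """Keep datasets whose publisher equals every selected publisher."""
    return [d for d in datasets if all(d["publisher"] == p for p in publishers)]


def filter_or(datasets: List[Dict[str, str]], publishers: List[str]) -> List[Dict[str, str]]:
    """Keep datasets whose publisher equals any selected publisher."""
    return [d for d in datasets if d["publisher"] in publishers]
```

Because a dataset has a single publisher here, AND across two publishers can never match anything, while OR gives the expected 34.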

Remove Elmo

Unicode error in dataset descriptions

Apostrophes are coming up as ' - this may be a magda-web issue or magda-metadata

They're encoded OK on the source CKAN portal.

inaccurate hitcount

If I search http://terria.io/magda-web/build/?q=advisers+by+Australian+securities, the facet for Commonwealth of Australia (Geoscience Australia) shows one result, but as soon as I select that facet, its hit count becomes 8125 and the result is match-part. This is confusing because it told me there was one result (see screenshot).

Add ability to exclude publisher from results

In this scenario - the dominant publisher adds no value to the results, and should be excluded:

Assuming this is easy in ES, a UI tweak to +/- the line and maybe a rework of the x to remove may be needed?

Add project open/close flag and UI to control it

Only admin users can change the state of the flag.

Distribution page looks just like the dataset page

It even says "Home / Dataset" in the breadcrumbs. The distinction should be made more clear.

Enable registry authorization

It's just a matter of changing this value in magda-registry-api's application.conf:

authorization {
    skip = true
}

But once we do that the connectors won't be able to write to the registry because they don't authenticate themselves. So we should authenticate and authorize the connectors.

Future issue roadmap

Access control for the registry
System-level logging (using ELK Stack maybe)

Are we able to filter out sources via a URL to the search API?

Thinking down the track - we currently operate a "Search Results Sharing Partnership" - currently with SA and NSW. This is achieved by the CKAN portals running a modified ckanext-harvest (https://github.com/datagovau/ckanext-harvest) plugin to remove the redirect loop it creates.

What if we were able to facilitate a CKAN plugin that did the same role, but excluded the requestor's portal, thus giving users of the plugin access to the full (new) data.gov.au results?

States (eg "nsw", "queensland") don't seem to work in region selector

I expect these will be common spatial filters on searches (esp for state govts) so it would be good to support searching on them.

Add Data61 logo to the web site

Top level nav items are duplicated

Search, Projects, Publishers, About top-level items appear twice.

Remove Subscribe link on the dataset page

It currently goes to the data.gov.au blog RSS feed, and we don't currently have a way to subscribe.

Share button on dataset summary on search page doesn't do anything

Just remove it for now.

UI: Implement home / landing page

WebHooks for dataset create-update-delete events

We'll use these to run the indexer, sleuths, and maybe user notifications eventually.

Indexer needs to be able to attribute source of records from the registry

Currently it considers the source of all records that come from the MAGDA registry as "MAGDA Registry", but the real source should be the CKAN or whatever server that the record originally came from.

Make GNAF and Admin Boundaries featured datasets

Investigate use of Google Cloud Functions and other options for connectors and indexers

UI: Implement dataset details page

Update footer navigation

Revised list of footer nav links:

Search (category header)

Search syntax -> /page/search-syntax
Data sources -> page listing all harvested sources - available?

Projects

Browse projects -> /projects
Start a project -> /project/new

Publishers

Publisher index -> /publishers
Open data toolkit -> https://toolkit.data.gov.au/

Developers

Content in this section to be confirmed with Kevin and Alex. If not enough time to create content, leave this section out for the time being.

Architecture -> page with info about MAGDA architecture
API Docs -> page with API doc info

About

About data.gov.au -> http://data.gov.au/about
Blog -> https://blog.data.gov.au/

Feedback

Send feedback -> mailto:[email protected]

Store comments, links to datasets, and links to user for projects

The collaboration system will evolve, but at a minimum we need to persist the basic information in the title.

Create production deployment of new architecture

SA Harvest portal not excluded from data.gov.au harvest

http://search.data.gov.au/?q=%2A+by+South+Australian+Governments returns all the harvested SA datasets in dga, leading to dupes (http://search.data.gov.au/?q=+FINTCBP01)

Make region import robust even with messy polygons

Region polygons often have problems, such as:

Unclosed linear rings (the first and last point are not the same, even though GeoJSON says they must be).
Duplicate positions in a linear ring, other than the first and last. This indicates a loop in the ring, which is not allowed by the OGC simple feature specification, even though it is allowed and common in Esri shapefiles. Elasticsearch seems to expect GeoJSON polygons to conform to the simple feature specification.
Self-intersecting segments in the polygon.

In a perfect world, whoever provided the regions would fix these problems. But in reality our system needs to be robust in the face of imperfect data. To that end, we should automatically clean up these problems in the regions we load. This recent paper, A triangulation-based approach to automatically repair GIS polygons, proposes a very promising approach:
http://dx.doi.org/10.1016/j.cageo.2014.01.009
http://www.sciencedirect.com/science/article/pii/S009830041400020X

There is an implementation of this approach, but it is not suitable for use in Magda because it uses the GPL licence. It should be straightforward to implement it ourselves, though.
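Before any full repair, the first two problems can at least be detected or patched cheaply. The sketch below is not the triangulation-based approach from the paper; it only closes unclosed rings and reports duplicate positions, and both function names are hypothetical.

```python
from typing import List, Tuple

Position = Tuple[float, float]


def close_ring(ring: List[Position]) -> List[Position]:
    """Return a copy of the ring whose last point equals its first."""
    if ring and ring[0] != ring[-1]:
        return ring + [ring[0]]
    return list(ring)


def duplicate_positions(ring: List[Position]) -> List[Position]:
    """List positions that occur more than once, ignoring the closing point.
    Any hit indicates a loop in the ring."""
    closed = len(ring) > 1 and ring[0] == ring[-1]
    interior = ring[:-1] if closed else ring
    seen = set()
    duplicates: List[Position] = []
    for position in interior:
        if position in seen and position not in duplicates:
            duplicates.append(position)
        seen.add(position)
    return duplicates
```

Self-intersecting segments are not covered here; they need a proper geometric test or the repair approach described above.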

data.json sources

List of data.json sources already available from data.gov.au

data.json end point	harvest_source_title
http://data.logancity.opendata.arcgis.com/data.json	Logan City ArcGIS JSON Harvest
http://opendata.launceston.tas.gov.au/data.json	City of Launceston ArcGIS JSON Harvester
http://data-1.hobartcc.opendata.arcgis.com/data.json	City of Hobart ArcGIS JSON Harvest
https://data.melbourne.vic.gov.au/data.json	Melbourne JSON Harvester
http://www.data.act.gov.au/data.json	ACT JSON Harvester
http://data.moretonbay.qld.gov.au/data.json	Moreton Bay ArcGIS JSON Harvest
http://data.esta000.opendata.arcgis.com/data.json	ESTA Emergency Marker ArcGIS JSON Harvester

Add feedback button at the top of the page

We want lots of feedback!

Redirect after project creation is broken

The project is created successfully, but the user is left on the "create" page.

Make location filter map pop-over wider

The location filter (pop-over map) is too thin, making it hard to browse. Please change so that it is at least as wide as the search results, and a little taller.

Confusing page numbering

Page 1 of the search results displays this:

A person seeing this might reasonably assume it's telling them they're on page 2, when really it's telling them that the next page is page 2. Can we make this clearer?

Remove "Add to Project" link on dataset page

We don't currently have a way to add datasets to projects.

About page goes nowhere

Clicking it leaves you on the current page.

Create data.json connector

The data.json format (https://github.com/GSA/ckanext-datajson) is used extensively around open data portals - most specifically (in Australia) to connect to Socrata and ArcGIS online portals, but also extensively in the USA. Currently we harvest into data.gov.au (although this is not reliable).

For example:
Socrata:
http://www.data.act.gov.au/data.json
https://data.melbourne.vic.gov.au/data.json
https://data.sunshinecoast.qld.gov.au/data.json
ArcGIS Online:
http://data-1.hobartcc.opendata.arcgis.com/data.json
http://opendata.launceston.tas.gov.au/data.json
http://data.logancity.opendata.arcgis.com/data.json
http://data.moretonbay.qld.gov.au/data.json
http://data.goldcoast.opendata.arcgis.com/data.json
http://vicroadsopendata.vicroadsmaps.opendata.arcgis.com/data.json
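A connector would mostly be a mapping from the data.json catalog into registry records. Here is a minimal sketch, assuming the Project Open Data v1.1 field names (`dataset`, `identifier`, `title`, `publisher.name`, `distribution[].downloadURL` / `accessURL`); the sample catalog and the output shape are invented for illustration.

```python
import json
from typing import Any, Dict, List

# Invented sample in the assumed data.json (Project Open Data v1.1) shape.
SAMPLE_CATALOG: Dict[str, Any] = json.loads("""
{
  "dataset": [
    {
      "identifier": "example-1",
      "title": "Example dataset",
      "publisher": {"name": "Example Council"},
      "distribution": [
        {"downloadURL": "http://example.com/data.csv", "mediaType": "text/csv"},
        {"accessURL": "http://example.com/api"}
      ]
    }
  ]
}
""")


def datasets_from_catalog(catalog: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a data.json catalog into simple dataset dictionaries."""
    results = []
    for dataset in catalog.get("dataset", []):
        results.append({
            "id": dataset.get("identifier"),
            "title": dataset.get("title"),
            "publisher": (dataset.get("publisher") or {}).get("name"),
            "distribution_urls": [
                dist.get("downloadURL") or dist.get("accessURL")
                for dist in dataset.get("distribution", [])
            ],
        })
    return results
```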

Add example search queries + "learn more" link to homepage

Please add the following sample queries + a "Learn about the new search" link to the homepage (see attached drawing). The "Learn more" link should point to the search syntax static page (http://magda-dev.terria.io/page/search-syntax).

Example queries:

Business Names by ASIC as CSV
Taxation Statistics from 2013
Trees in Victoria

Style the login page

Registry: Record PATCH explodes when removing an aspect

When doing a PATCH request for a record, an exception is thrown when a Remove patch tries to remove an entire aspect. The error is Cannot delete an empty path.

Also, when doing a PUT of a record, missing aspects should be left alone rather than removed. Currently, this scenario gets turned into the one above.
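The intended behaviour might look something like the sketch below. It models a record as a plain dictionary with an `aspects` map, which is an assumption made for illustration; `put_record` and `remove_aspect` are hypothetical helpers, not the registry's code.

```python
from typing import Any, Dict

Record = Dict[str, Any]


def put_record(stored: Record, incoming: Record) -> Record:
    """PUT semantics as requested: aspects present in the incoming record
    replace the stored ones; aspects missing from it are left alone."""
    aspects = dict(stored.get("aspects", {}))
    aspects.update(incoming.get("aspects", {}))
    return {**stored, **incoming, "aspects": aspects}


def remove_aspect(stored: Record, aspect_id: str) -> Record:
    """Removing a whole aspect (the PATCH case that currently throws)
    simply drops that key; removing an absent aspect is a no-op."""
    aspects = {k: v for k, v in stored.get("aspects", {}).items() if k != aspect_id}
    return {**stored, "aspects": aspects}
```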

consider change from "by" for publisher syntax

I think we should reconsider the use of "by" in the query language for limiting the search to a particular publisher. That is because:
a) "by" isn't an exact fit for the relationship between a dataset and a publisher (the publisher is making the data available - they may not have created it - it may have been created by a different agency). "From" is closer to the meaning we want (there may also be other possibilities); and
b) I think we'll later need to use "by" for influencing display of results (eg for sorting and/or grouping - eg "population by state") and it would be good if people didn't get in the habit of using it.

I think that overloading "from" to allow a trailing publisher or date is fine and easy to disambiguate. I think that it won't be possible to do so between "by" and the likely future use of "by" I mentioned above.

So, I think we should use "from" for publisher instead of "by". What do others think?
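To show that overloading "from" can be disambiguated, here is a toy sketch. It assumes a date is written as a four-digit year and treats anything else after "from" as a publisher; the real query grammar is richer, and `parse_from` is a hypothetical name.

```python
import re
from typing import Dict, Union


def parse_from(query: str) -> Dict[str, Union[str, int]]:
    """Split 'text from X' and decide whether X is a year or a publisher."""
    match = re.search(r"^(.*?)\s+from\s+(.+)$", query, re.IGNORECASE)
    if not match:
        return {"text": query.strip()}
    text, tail = match.group(1).strip(), match.group(2).strip()
    if re.fullmatch(r"\d{4}", tail):
        return {"text": text, "date_from": int(tail)}
    return {"text": text, "publisher": tail}
```

For example, "Taxation Statistics from 2013" would be read as a date filter and "trees from Geoscience Australia" as a publisher filter.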

Replace GA FIND harvester with Harvest Nodes

Here are the CSW endpoints to replace the GA-FIND harvester with - not a current priority, but a next step.

We should also update the ignoreHarvestSources for data.gov.au as follows:

ignoreHarvestSources = [
"FIND (http://find.ga.gov.au) CSW Harvester",
"Brisbane City Council CKAN Harvester",
"Data NSW CKAN Harvester",
"Data SA CKAN Harvester",
"Australian Institute of Marine Science CSW Harvester",
"Navy Meteorology and Oceanography (METOC) CSW Harvester",
"Mineral Resources Tasmania CSW Harvester",
"Tasmania Department of Primary Industries, Parks, Water and Environment CSW Harvester"
]
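For illustration, applying such a list amounts to dropping any dataset whose harvest source title is on it. The sketch below assumes each dataset exposes a flat `harvest_source_title` field, which is a simplification, and uses only part of the list; `should_index` is a hypothetical name.

```python
from typing import Any, Dict, List

# Subset of the ignore list above, for illustration only.
IGNORE_HARVEST_SOURCES = {
    "FIND (http://find.ga.gov.au) CSW Harvester",
    "Data NSW CKAN Harvester",
    "Data SA CKAN Harvester",
}


def should_index(dataset: Dict[str, Any]) -> bool:
    """Keep a dataset unless it was harvested from an ignored source.
    Datasets with no harvest source (published directly) are always kept."""
    return dataset.get("harvest_source_title") not in IGNORE_HARVEST_SOURCES


def filter_datasets(datasets: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return only the datasets that should be indexed."""
    return [d for d in datasets if should_index(d)]
```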

And here are the replacement CSWs. There will be a fuller list of CSWs coming from GA at some point.
There is also a harvest node (QLD spatial) that isn't CSW - we'll need to work that out when we build new connectors.

Atlas of Living Australia

http://spatial.ala.org.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

Lower Priority

Mineral Resources Tasmania

http://www.mrt.tas.gov.au/web-catalogue/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

Tasmania TheList

https://data.thelist.tas.gov.au:443/datagn/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

NSW Land and Property

https://sdi.nsw.gov.au/csw?Request=getCapabilities&Version=2.0.2&Service=CSW

Australian Bureau of Meteorology

Australian Institute of Marine Science

Australian Oceans Data Network

Geoscience Australia

CSIRO Marlin

Terrestrial Ecosystem Research Network
