magda-io / magda

A federated, open-source data catalog for all your big data and small data

License: Apache License 2.0

Scala 8.82% Shell 0.06% HTML 0.08% JavaScript 28.98% Java 0.02% TypeScript 18.29% Dockerfile 0.04% Open Policy Agent 0.81% SCSS 4.23% Mustache 0.27% Smarty 0.01% Raku 38.40%

open-data scala nodejs kubernetes elasticsearch postgresql

magda's Issues

Remove "spatial coverage" section from dataset page

Since we don't currently have spatial coverage (we'll add it before long).

WebHooks for dataset create-update-delete events

We'll use these to run the indexer, sleuths, and maybe user notifications eventually.

Redirect after project creation is broken

The project is created successfully, but the user is left on the "create" page.

Make location filter map pop-over wider

The location filter (pop-over map) is too thin, making it hard to browse. Please change it so that it is at least as wide as the search results, and a little taller.

Create production deployment of new architecture

Style the login page

Purpose of projects (data requests) is unclear

UI: Implement collaboration / project view page

Add an account screen to change user details

Investigate use of Google Cloud Functions and other options for connectors and indexers

Add example search queries + "learn more" link to homepage

Please add the following sample queries + a "Learn about the new search" link to the homepage (see attached drawing). The "Learn more" link should point to the search syntax static page (http://magda-dev.terria.io/page/search-syntax).

Example queries:

Business Names by ASIC as CSV
Taxation Statistics from 2013
Trees in Victoria

Future issue roadmap

Access control for the registry
System-level logging (using ELK Stack maybe)

Distribution page looks just like the dataset page

It even says "Home / Dataset" in the breadcrumbs. The distinction should be made more clear.

data.json sources

List of data.json sources already available from data.gov.au

data.json end point	harvest_source_title
http://data.logancity.opendata.arcgis.com/data.json	Logan City ArcGIS JSON Harvest
http://opendata.launceston.tas.gov.au/data.json	City of Launceston ArcGIS JSON Harvester
http://data-1.hobartcc.opendata.arcgis.com/data.json	City of Hobart ArcGIS JSON Harvest
https://data.melbourne.vic.gov.au/data.json	Melbourne JSON Harvester
http://www.data.act.gov.au/data.json	ACT JSON Harvester
http://data.moretonbay.qld.gov.au/data.json	Moreton Bay ArcGIS JSON Harvest
http://data.esta000.opendata.arcgis.com/data.json	ESTA Emergency Marker ArcGIS JSON Harvester

Correct or remove harvested datasource and spatial dataset counts

facet not adding up

In the first search, on selecting a publisher I got 3 results.

When I select the second publisher, there should be 3 + 31 = 34 results, but it tells me there are no matching results, and the facet hitcount has changed:

Is it because the publisher facet is using AND logic? If we are using AND, then why don't we just force the user to select only one option at any given time?
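To make the arithmetic concrete, here is a small sketch of the two ways selected publisher facets could be combined. The dataset ids are made up to match the counts reported above, and it assumes each dataset has exactly one publisher; it says nothing about how the search API actually builds its query.

```python
# Hypothetical data: dataset ids per publisher, sized to match the reported counts.
publisher_a = {f"a-{i}" for i in range(3)}    # 3 datasets
publisher_b = {f"b-{i}" for i in range(31)}   # 31 datasets


def combine(selected, mode):
    """Return the dataset ids matching the selected publisher facets."""
    if not selected:
        return set()
    if mode == "or":
        # A dataset matches if it belongs to any selected publisher.
        return set().union(*selected)
    # "and": a dataset must belong to every selected publisher at once.
    return set.intersection(*selected)


or_hits = combine([publisher_a, publisher_b], "or")
and_hits = combine([publisher_a, publisher_b], "and")
print(len(or_hits), len(and_hits))  # 34 0
```

If a dataset can only have one publisher, AND across two publishers can never match anything, which would be consistent with the "no matching results" message.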

Update footer navigation

Revised list of footer nav links:

Search (category header)

Search syntax -> /page/search-syntax
Data sources -> page listing all harvested sources - available?

Projects

Browse projects -> /projects
Start a project -> /project/new

Publishers

Publisher index -> /publishers
Open data toolkit -> https://toolkit.data.gov.au/

Developers

Content in this section to be confirmed with Kevin and Alex. If not enough time to create content, leave this section out for the time being.

Architecture -> page with info about MAGDA architecture
API Docs -> page with API doc info

About

About data.gov.au -> http://data.gov.au/about
Blog -> https://blog.data.gov.au/

Feedback

Send feedback -> mailto:[email protected]

Unicode error in dataset descriptions

apostrophes are coming up as ' - this may be a magda-web issue or magda-metadata

They're encoded OK on the source CKAN portal.

Create data.json connector

The data.json format (https://github.com/GSA/ckanext-datajson) is used extensively around open data portals - most specifically (in Australia) to connect to Socrata and ArcGIS online portals, but also extensively in the USA. Currently we harvest into data.gov.au (although this is not reliable).

For example:
Socrata:
http://www.data.act.gov.au/data.json
https://data.melbourne.vic.gov.au/data.json
https://data.sunshinecoast.qld.gov.au/data.json
ArcGIS Online:
http://data-1.hobartcc.opendata.arcgis.com/data.json
http://opendata.launceston.tas.gov.au/data.json
http://data.logancity.opendata.arcgis.com/data.json
http://data.moretonbay.qld.gov.au/data.json
http://data.goldcoast.opendata.arcgis.com/data.json
http://vicroadsopendata.vicroadsmaps.opendata.arcgis.com/data.json
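As a rough sketch of what the reading half of such a connector might do, the snippet below flattens a data.json catalog into simple records. The field names (`dataset`, `identifier`, `title`, `publisher`, `distribution`, `downloadURL`, `accessURL`) follow the Project Open Data metadata schema as far as I know; the output record shape and the sample catalog are assumptions for illustration, not the MAGDA registry format.

```python
import json


def parse_data_json(text):
    """Extract a flat list of dataset summaries from a data.json document."""
    catalog = json.loads(text)
    # Some older catalogs are a bare list of datasets rather than an object.
    datasets = catalog if isinstance(catalog, list) else catalog.get("dataset", [])
    records = []
    for ds in datasets:
        publisher = ds.get("publisher")
        if isinstance(publisher, dict):
            # Newer catalogs nest the publisher name inside an object.
            publisher = publisher.get("name")
        records.append({
            "id": ds.get("identifier"),
            "title": ds.get("title"),
            "publisher": publisher,
            "distributions": [
                d.get("downloadURL") or d.get("accessURL")
                for d in ds.get("distribution", [])
            ],
        })
    return records


# Hypothetical sample catalog, not taken from any of the portals listed above.
sample = json.dumps({
    "dataset": [{
        "identifier": "trees-1",
        "title": "Street trees",
        "publisher": {"name": "Example City Council"},
        "distribution": [{"downloadURL": "http://example.com/trees.csv"}],
    }]
})
records = parse_data_json(sample)
print(records[0]["title"], records[0]["publisher"])
```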

inaccurate hitcount

If I search http://terria.io/magda-web/build/?q=advisers+by+Australian+securities, it shows one result for Commonwealth of Australia (Geoscience Australia). But as soon as I select that facet, the hitcount for that facet becomes 8125 and the results are match-part. It is confusing because it told me there was one result (see screenshot).

Replace GA FIND harvester with Harvest Nodes

Here are the CSW endpoints to replace the GA-FIND harvester with - not a current priority, but a next step.

We should also update the ignoreHarvestSources for data.gov.au as follows:

ignoreHarvestSources = [
"FIND (http://find.ga.gov.au) CSW Harvester",
"Brisbane City Council CKAN Harvester",
"Data NSW CKAN Harvester",
"Data SA CKAN Harvester",
"Australian Institute of Marine Science CSW Harvester",
"Navy Meteorology and Oceanography (METOC) CSW Harvester",
"Mineral Resources Tasmania CSW Harvester",
"Tasmania Department of Primary Industries, Parks, Water and Environment CSW Harvester"
]

And here are the replacement CSWs. There will be a fuller list of CSWs coming from GA at some point.
There is also a harvest node (QLD spatial) that isn't CSW. We'll need to work that out when we build new connectors.

Atlas of Living Australia
http://spatial.ala.org.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

Lower Priority

Mineral Resources Tasmania

http://www.mrt.tas.gov.au/web-catalogue/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

Tasmania TheList

https://data.thelist.tas.gov.au:443/datagn/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

NSW Land and Property

https://sdi.nsw.gov.au/csw?Request=getCapabilities&Version=2.0.2&Service=CSW

CSW Connector for the registry

Read from a CSW server, write records and aspects to the MAGDA registry.
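A minimal sketch of the "read" side, assuming the server supports CSW 2.0.2 GetRecords over HTTP GET with key-value parameters. The function names and page size are hypothetical, and parsing the XML response and writing records and aspects to the registry are not shown.

```python
from urllib.parse import urlencode


def get_records_url(base_url, start_position=1, max_records=100):
    """Build a CSW 2.0.2 GetRecords URL for one page of records."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "full",
        "resultType": "results",
        "startPosition": start_position,  # CSW positions are 1-based
        "maxRecords": max_records,
    }
    return f"{base_url}?{urlencode(params)}"


def page_starts(total_records, page_size):
    """Return the startPosition value for each page needed to cover all records."""
    return list(range(1, total_records + 1, page_size))


# Hypothetical endpoint, used only to show the resulting URL shape.
url = get_records_url("http://example.org/csw", start_position=101)
print(url)
print(page_starts(250, 100))  # [1, 101, 201]
```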

Indexer needs to be able to attribute source of records from the registry

Currently it considers the source of all records that come from the MAGDA registry as "MAGDA Registry", but the real source should be the CKAN or whatever server that the record originally came from.

Make indexer use fast registry paging

Using page tokens instead of offsets will be much faster.
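For illustration, here is a sketch of token-based paging over an in-memory list. It assumes records carry a monotonically increasing sequence id that can serve as the token; the names `page_by_token` and `read_all` are hypothetical and not the registry API. The point is that the server can seek straight to "id greater than token" instead of counting past `offset` rows on every request.

```python
import bisect

# Hypothetical records, sorted by an increasing sequence id.
RECORDS = [(i, f"record-{i}") for i in range(1, 11)]
IDS = [seq for seq, _ in RECORDS]


def page_by_token(page_token, limit):
    """Return (records, next_token) for records whose id is greater than page_token."""
    # Seek directly to the first id after the token (an index lookup in a real store).
    start = bisect.bisect_right(IDS, page_token)
    page = RECORDS[start:start + limit]
    # A short page means there is nothing more to read.
    next_token = page[-1][0] if len(page) == limit else None
    return page, next_token


def read_all(limit):
    """Read every record by following page tokens from the beginning."""
    token, out = 0, []
    while token is not None:
        page, token = page_by_token(token, limit)
        out.extend(page)
    return out


print(len(read_all(4)))  # 10
```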

Add Data61 logo to the web site

Make GNAF and Admin Boundaries featured datasets

About page goes nowhere

Clicking it leaves you on the current page.

Add ability to exclude publisher from results

In this scenario - the dominant publisher adds no value to the results, and should be excluded:

Assuming this is easy in ES, a UI tweak to +/- the line and maybe a rework of the x to remove may be needed?

Make region import robust even with messy polygons

Region polygons often have problems, such as:

Unclosed linear rings (the first and last point are not the same, even though GeoJSON says they must be).
Duplicate positions in a linear ring, other than the first and last. This indicates a loop in the ring, which is not allowed by the OGC simple feature specification, even though it is allowed and common in Esri shapefiles. Elasticsearch seems to expect GeoJSON polygons to conform to the simple feature specification.
Self-intersecting segments in the polygon.

In a perfect world, whoever provided the regions would fix these problems. But in reality our system needs to be robust in the face of imperfect data. To that end, we should automatically clean up these problems in the regions we load. This recent paper, A triangulation-based approach to automatically repair GIS polygons, proposes a very promising approach:
http://dx.doi.org/10.1016/j.cageo.2014.01.009
http://www.sciencedirect.com/science/article/pii/S009830041400020X

There is an implementation of this approach, but it is not suitable for use in Magda because it uses the GPL licence. It should be straightforward to implement it ourselves, though.
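The first two problems in the list above can at least be detected, and the first repaired, with very simple checks. The sketch below is not the triangulation-based repair from the paper; it only closes unclosed rings and reports duplicate positions. Rings are assumed to be lists of `[x, y]` positions as in GeoJSON, and the function names are hypothetical.

```python
def close_ring(ring):
    """Return a copy of the ring, appending the first position if it is not closed."""
    if ring and ring[0] != ring[-1]:
        return ring + [ring[0]]
    return list(ring)


def duplicate_positions(ring):
    """Return positions that occur more than once in a closed ring.

    The closing position is ignored, since it must repeat the first one.
    """
    seen, dupes = set(), []
    for pos in ring[:-1]:
        key = tuple(pos)
        if key in seen and key not in dupes:
            dupes.append(key)
        seen.add(key)
    return dupes


square = close_ring([[0, 0], [4, 0], [4, 4], [0, 4]])
# A ring that pinches at (2, 2), which indicates a loop in the ring.
pinched = [[0, 0], [2, 2], [4, 0], [4, 4], [2, 2], [0, 4], [0, 0]]
print(square[-1], duplicate_positions(square), duplicate_positions(pinched))
```

Self-intersecting segments need a proper geometric test and are left out here.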

UI: Implement home / landing page

Enable HTTPS

For magda-dev.terria.io and search.data.gov.au.

Remove Subscribe link on the dataset page

It currently goes to the data.gov.au blog RSS feed, and we don't currently have a way to subscribe.

Improve Search

Adding tasks from workshop on 31 May 17:

fix issue with "Business names by ASIC as CSV" example search query - @AlexGilleran
fix issue with "trees in Victoria" example search query - @AlexGilleran
search/query syntax help page needs to be created and linked from search examples box - @cam-grant @chloeleichen
review of filter UI - @cam-grant
publishers sidebar in search results: use list of publishers selected in publishers filter, take top 5 and display details for each - @chloeleichen @cam-grant
publisher filter - clear command should close popover panel - @chloeleichen
Investigate & fix (if broken) date range filter - @AlexGilleran @chloeleichen
Investigate and fix (if broken) location filter - @AlexGilleran @chloeleichen

SA Harvest portal not excluded from data.gov.au harvest

http://search.data.gov.au/?q=%2A+by+South+Australian+Governments returns all the harvested SA datasets in dga, leading to dupes (http://search.data.gov.au/?q=+FINTCBP01)

consider change from "by" for publisher syntax

I think we should reconsider the use of "by" in the query language for limiting the search to a particular publisher. That is because:
a) "by" isn't an exact fit for the relationship between a dataset and a publisher (the publisher is making the data available - they may not have created it - it may have been created by a different agency). "From" is closer to the meaning we want (there may also be other possibilities); and
b) I think we'll later need to use "by" for influencing display of results (eg for sorting and/or grouping - eg "population by state") and it would be good if people didn't get in the habit of using it.

I think that overloading "from" to allow a trailing publisher or date is fine and easy to disambiguate. I think that it won't be possible to do so between "by" and the likely future use of "by" I mentioned above.

So, I think we should use "from" for publisher instead of "by". What do others think?
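As a rough illustration of how a trailing "from" could be disambiguated, the sketch below treats a four-digit year after "from" as a date and anything else as a publisher. The function name, the year pattern, and the result shape are assumptions for this example only; the real query language handles more than this.

```python
import re

# Assumption: a date after "from" is a plain four-digit year.
YEAR = re.compile(r"^(19|20)\d{2}$")


def parse_query(query):
    """Split a query into free text, publisher and start year using a trailing 'from'."""
    result = {"text": query.strip(), "publisher": None, "from_year": None}
    head, sep, tail = query.rpartition(" from ")
    tail = tail.strip()
    if not sep or not tail:
        return result  # no trailing "from" clause
    result["text"] = head.strip()
    if YEAR.match(tail):
        result["from_year"] = int(tail)
    else:
        result["publisher"] = tail
    return result


print(parse_query("Taxation Statistics from 2013"))
print(parse_query("Business Names from ASIC"))
print(parse_query("Trees in Victoria"))
```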

CKAN: Datasets can be missed due to paging

The CKAN Connector executes multiple package_search queries for a page of datasets at a time. CKAN's default search order is first by relevance (presumably all datasets have equal relevance when there is no query) and second by modified date, descending. This puts the most recently modified datasets at the beginning.

If new datasets are added while the crawler is running, they'll be missed, while other datasets will potentially be examined multiple times. This is acceptable-ish because crawling is inherently a point-in-time activity. The new datasets will be picked up on the next crawl.

More concerning, though, is if an existing dataset is modified during crawling. This will move it to the first page of results after the crawler has already moved to later pages. Therefore, it will be missed. Missing a dataset that existed prior to the start of crawling is not acceptable (-ish or otherwise).

It would be nice if CKAN datasets had a simple auto increment ID type of thing that we could use as the sort order, guaranteeing stable order and that new datasets added during crawling would be added at the end. But since there is nothing like that (AFAICT), it's tricky.

One idea is to sort by ascending modified date, purposely request overlap in the pages, verify that overlap actually exists, and if it doesn't, re-request the page with an earlier start index until we find our overlap.
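The overlap idea can be sketched against a fake source as follows. `FakeCkan` is a hypothetical stand-in for `package_search` sorted by ascending modified date; it simulates dataset "b" being modified after the first request, which moves it to the end of the ordering. This is a sketch of the idea under those assumptions, not the connector's code.

```python
class FakeCkan:
    """Stand-in for package_search sorted by ascending modified date (hypothetical)."""

    def __init__(self, ids):
        self.ids = list(ids)
        self.calls = 0

    def fetch(self, start, rows):
        page = self.ids[start:start + rows]
        self.calls += 1
        if self.calls == 1:
            # Simulate "b" being modified mid-crawl: it moves to the end.
            self.ids.remove("b")
            self.ids.append("b")
        return page


def naive_crawl(fetch, rows=4):
    """Page by fixed offsets with no overlap check."""
    out, start = [], 0
    while True:
        page = fetch(start, rows)
        out.extend(page)
        if len(page) < rows:
            return out
        start += rows


def crawl(fetch, rows=4, overlap=1):
    """Page with overlap; step back whenever the previous page's last id is missing."""
    assert 0 < overlap < rows
    seen, seen_set = [], set()
    start, anchor = 0, None
    while True:
        page = fetch(start, rows)
        if anchor is not None and anchor not in page and start > 0:
            # Overlap not found: the ordering shifted, so retry from an earlier index.
            start = max(0, start - (rows - overlap))
            continue
        for dataset_id in page:
            if dataset_id not in seen_set:
                seen_set.add(dataset_id)
                seen.append(dataset_id)
        if len(page) < rows:
            return seen
        anchor = page[-1]
        start += rows - overlap


ids = ["a", "b", "c", "d", "e", "f", "g", "h"]
print(sorted(set(naive_crawl(FakeCkan(ids).fetch))))  # "e" is missing
print(sorted(crawl(FakeCkan(ids).fetch)))             # all eight ids
```

If the anchor dataset itself is modified during the crawl, this version steps all the way back to the start and re-reads everything, which is correct but wasteful.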

States (eg "nsw", "queensland") don't seem to work in region selector

I expect these will be common spatial filters on searches (esp for state govts) so it would be good to support searching on them.

Top level nav items are duplicated

Search, Projects, Publishers, About top-level items appear twice.

Add feedback button at the top of the page

We want lots of feedback!

Are we able to filter out sources via a URL to the search API?

Thinking down the track - we currently operate a "Search Results Sharing Partnership" - currently with SA and NSW. This is achieved by the CKAN portals running a modified ckanext-harvest (https://github.com/datagovau/ckanext-harvest) plugin to remove the redirect loop it creates.

What if we were able to facilitate a CKAN plugin that played the same role, but excluded the requestor's portal, thus giving users of the plugin access to the full (new) data.gov.au results?

UI: Implement dataset details page

API Doc and Feedback links at the bottom of the web UI are hardcoded addresses

One is the dev IP and the other is localhost.

Share button on dataset summary on search page doesn't do anything

Just remove it for now.

Store comments, links to datasets, and links to user for projects

The collaboration system will evolve, but at a minimum we need to persist the basic information in the title.

Enable registry authorization

It's just a matter of changing this value in magda-registry-api's application.conf:

authorization {
    skip = true
}

But once we do that the connectors won't be able to write to the registry because they don't authenticate themselves. So we should authenticate and authorize the connectors.


