magda's People

Contributors

alexgilleran, azerella, benjaminleighton, chloeleichen, gitter-badger, gordjw, hun220, jevy-wangfei, jyucsiro, ketikat, kring, maxious, mwu2018, nahidakbar, nf-s, nghamilton, nilfanif, rowanwins, sajidanower23, sandra-arato, soyarsauce, stackedsax, stephencannings, steve9164, sukhrajghuman, t83714, tkeuneman, tobybellwood, tristochief, yayalu

magda's Issues

Enable HTTPS

For magda-dev.terria.io and search.data.gov.au.

CKAN: Datasets can be missed due to paging

The CKAN Connector executes multiple package_search queries, each requesting one page of datasets. CKAN's default search order is first by relevance (presumably all datasets have equal relevance when there is no query) and second by modified date, descending. This puts the most recently modified datasets at the beginning.

If new datasets are added while the crawler is running, they'll be missed, while other datasets will potentially be examined multiple times. This is acceptable-ish because crawling is inherently a point-in-time activity. The new datasets will be picked up on the next crawl.

More concerning, though, is if an existing dataset is modified during crawling. This will move it to the first page of results after the crawler has already moved to later pages. Therefore, it will be missed. Missing a dataset that existed prior to the start of crawling is not acceptable (-ish or otherwise).

It would be nice if CKAN datasets had a simple auto increment ID type of thing that we could use as the sort order, guaranteeing stable order and that new datasets added during crawling would be added at the end. But since there is nothing like that (AFAICT), it's tricky.

One idea is to sort by ascending modified date, purposely request overlap in the pages, verify that overlap actually exists, and if it doesn't, re-request the page with an earlier start index until we find our overlap.
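The overlap idea could be sketched roughly as follows. This is only an illustration, not the connector's actual code: `fetch_page` is a hypothetical callable standing in for a package_search request sorted by ascending modified date (for example with CKAN's `sort`, `start` and `rows` parameters), and the overlap check used here ("the first dataset on the page has already been seen") is one possible way to verify that pages connect.

```python
from typing import Callable, Dict, List

Dataset = Dict[str, str]
# Hypothetical fetcher: returns up to `rows` datasets starting at index `start`,
# sorted by modified date ascending.
FetchPage = Callable[[int, int], List[Dataset]]


def crawl_with_overlap(fetch_page: FetchPage, page_size: int = 100, overlap: int = 10) -> List[Dataset]:
    """Crawl all datasets, backing up whenever a page does not overlap what was already seen."""
    if not 0 < overlap < page_size:
        raise ValueError("overlap must be positive and smaller than page_size")

    seen: Dict[str, Dataset] = {}
    start = 0
    while True:
        page = fetch_page(start, page_size)
        if not page:
            break
        if start > 0 and page[0]["id"] not in seen:
            # No overlap: datasets were modified and moved to the end, shifting
            # everything left. Re-request with an earlier start index.
            start = max(0, start - overlap)
            continue
        for dataset in page:
            seen[dataset["id"]] = dataset  # de-duplicates datasets seen twice
        if len(page) < page_size:
            break  # a short page is the last page
        # Deliberately start the next page inside the current one.
        start += len(page) - overlap
    return list(seen.values())
```

Datasets modified during the crawl move to the end of the ordering, so they are seen again on the last pages; keying `seen` by ID keeps only the latest copy.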

Improve Search

Adding tasks from workshop on 31 May 17:

facet not adding up

In the first search, on selecting a publisher I got 3 results:
[screenshot: 2016-12-08, 11:17:43 pm]
When I select the second publisher, there should be 3 + 31 = 34 results, but it tells me there is no matching result, and the facet hit count has changed:
[screenshot: 2016-12-08, 11:17:53 pm]

Is it because the publisher facet is using AND logic? If we are using AND, then why don't we just force the user to select only one option at any given time?
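A small sketch of why AND and OR give such different counts. It assumes each dataset has exactly one publisher, and the publisher names and the `count_matches` helper are made up for illustration; this is not how the search API is implemented.

```python
from typing import Iterable, List, Set


def count_matches(dataset_publishers: List[Set[str]], selected: Iterable[str], mode: str) -> int:
    """Count datasets matching the selected publishers under AND or OR logic."""
    wanted = set(selected)
    if mode == "and":
        # Every selected publisher must be present on the dataset.
        return sum(1 for pubs in dataset_publishers if wanted <= pubs)
    if mode == "or":
        # At least one selected publisher must be present on the dataset.
        return sum(1 for pubs in dataset_publishers if wanted & pubs)
    raise ValueError("mode must be 'and' or 'or'")


# Illustrative data: 3 datasets from one publisher, 31 from another.
datasets = [{"Publisher A"}] * 3 + [{"Publisher B"}] * 31

and_count = count_matches(datasets, ["Publisher A", "Publisher B"], "and")
or_count = count_matches(datasets, ["Publisher A", "Publisher B"], "or")
```

With single-publisher datasets, AND across two publishers can never match anything, while OR gives the 3 + 31 = 34 results expected above.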

Add ability to exclude publisher from results

In this scenario - the dominant publisher adds no value to the results, and should be excluded:
[image]
Assuming this is easy in ES, a UI tweak to +/- the line, and maybe a rework of the x to remove, may be needed?
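On the Elasticsearch side, exclusion could plausibly be expressed with a `bool` query and a `must_not` clause. The sketch below only builds the request body; the field name `publisher.name` and the `build_search_body` helper are assumptions for illustration, not Magda's actual index mapping or API.

```python
from typing import Any, Dict, Iterable


def build_search_body(text: str, excluded_publishers: Iterable[str],
                      publisher_field: str = "publisher.name") -> Dict[str, Any]:
    """Build an Elasticsearch query body that excludes the given publishers."""
    bool_query: Dict[str, Any] = {
        "must": [{"query_string": {"query": text}}],
    }
    excluded = list(excluded_publishers)
    if excluded:
        # Documents matching any must_not clause are dropped from the results.
        bool_query["must_not"] = [{"terms": {publisher_field: excluded}}]
    return {"query": {"bool": bool_query}}


body = build_search_body("water", ["Geoscience Australia"])
```

A `terms` clause needs an exact-match (keyword-style) field, so the real field name would depend on how publishers are indexed.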

Enable registry authorization

It's just a matter of changing this value in magda-registry-api's application.conf:

authorization {
    skip = true
}

But once we do that the connectors won't be able to write to the registry because they don't authenticate themselves. So we should authenticate and authorize the connectors.

Are we able to filter out sources via a URL to the search API?

Thinking down the track - we currently operate a "Search Results Sharing Partnership" - currently with SA and NSW. This is achieved by the CKAN portals running a modified ckanext-harvest (https://github.com/datagovau/ckanext-harvest) plugin to remove the redirect loop it creates.

What if we were able to facilitate a CKAN plugin that performed the same role, but excluded the requestor's portal, thus giving users of the plugin access to the full (new) data.gov.au results?

Update footer navigation

Revised list of footer nav links:

Search (category header)

  • Search syntax -> /page/search-syntax
  • Data sources -> page listing all harvested sources - available?

Projects

  • Browse projects -> /projects
  • Start a project -> /project/new

Publishers

Developers

Content in this section to be confirmed with Kevin and Alex. If not enough time to create content, leave this section out for the time being.

  • Architecture -> page with info about MAGDA architecture
  • API Docs -> page with API doc info

About

Feedback

Make region import robust even with messy polygons

Region polygons often have problems, such as:

  • Unclosed linear rings (the first and last point are not the same, even though GeoJSON says they must be).
  • Duplicate positions in a linear ring, other than the first and last. This indicates a loop in the ring, which is not allowed by the OGC simple feature specification, even though it is allowed and common in Esri shapefiles. Elasticsearch seems to expect GeoJSON polygons to conform to the simple feature specification.
  • Self-intersecting segments in the polygon.

In a perfect world, whoever provided the regions would fix these problems. But in reality our system needs to be robust in the face of imperfect data. To that end, we should automatically clean up these problems in the regions we load. This recent paper, A triangulation-based approach to automatically repair GIS polygons, proposes a very promising approach:
http://dx.doi.org/10.1016/j.cageo.2014.01.009
http://www.sciencedirect.com/science/article/pii/S009830041400020X

There is an implementation of this approach, but it is not suitable for use in Magda because it uses the GPL licence. It should be straightforward to implement it ourselves, though.
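The first two problems in the list above are cheap to detect before attempting any real repair. The sketch below is not the triangulation-based approach from the paper; it only closes unclosed rings and reports duplicate positions, with rings represented as lists of (x, y) tuples and helper names chosen for illustration.

```python
from typing import List, Tuple

Position = Tuple[float, float]


def close_ring(ring: List[Position]) -> List[Position]:
    """Return a copy of the ring whose last position equals its first."""
    if ring and ring[0] != ring[-1]:
        return ring + [ring[0]]
    return list(ring)


def duplicate_positions(ring: List[Position]) -> List[Position]:
    """Return positions that occur more than once, ignoring the closing point."""
    interior = close_ring(ring)[:-1]  # drop the closing point
    seen = set()
    duplicates: List[Position] = []
    for position in interior:
        if position in seen and position not in duplicates:
            duplicates.append(position)
        seen.add(position)
    return duplicates


unclosed = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
looped = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
closed = close_ring(unclosed)
loops = duplicate_positions(looped)
```

Self-intersecting segments need a proper geometric check and are not covered here.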

data.json sources

List of data.json sources already available from data.gov.au

| data.json end point | harvest_source_title |
| --- | --- |
| http://data.logancity.opendata.arcgis.com/data.json | Logan City ArcGIS JSON Harvest |
| http://opendata.launceston.tas.gov.au/data.json | City of Launceston ArcGIS JSON Harvester |
| http://data-1.hobartcc.opendata.arcgis.com/data.json | City of Hobart ArcGIS JSON Harvest |
| https://data.melbourne.vic.gov.au/data.json | Melbourne JSON Harvester |
| http://www.data.act.gov.au/data.json | ACT JSON Harvester |
| http://data.moretonbay.qld.gov.au/data.json | Moreton Bay ArcGIS JSON Harvest |
| http://data.esta000.opendata.arcgis.com/data.json | ESTA Emergency Marker ArcGIS JSON Harvester |

Make location filter map pop-over wider

The location filter (pop-over map) is too thin, making it hard to browse. Please change it so that it is at least as wide as the search results, and a little taller.

Confusing page numbering

Page 1 of the search results displays this:
[image]

A person seeing this might reasonably assume it's telling them they're on page 2, when really it's telling them that the next page is page 2. Can we make this clearer?

Create data.json connector

Registry: Record PATCH explodes when removing an aspect

When doing a PATCH request for a record, an exception is thrown when a Remove patch tries to remove an entire aspect. The error is Cannot delete an empty path.

Also, when doing a PUT of a record, missing aspects should be left alone rather than removed. Currently, this scenario gets turned into the one above.
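The desired PUT behaviour could look something like the sketch below. It assumes a record is a plain dictionary with an `aspects` mapping from aspect ID to aspect data; the helper name and record shape are illustrative, not the registry's actual implementation.

```python
from typing import Any, Dict

Record = Dict[str, Any]


def merge_record_for_put(existing: Record, incoming: Record) -> Record:
    """Apply a PUT: replace aspects that are supplied, keep aspects that are missing."""
    merged = dict(incoming)
    # Start from the stored aspects so that ones absent from the PUT survive.
    aspects = dict(existing.get("aspects", {}))
    aspects.update(incoming.get("aspects", {}))
    merged["aspects"] = aspects
    return merged


stored = {"id": "ds-1", "name": "Old name", "aspects": {"a": {"x": 1}, "b": {"y": 2}}}
put_body = {"id": "ds-1", "name": "New name", "aspects": {"a": {"x": 10}}}
merged_record = merge_record_for_put(stored, put_body)
```

With this approach a PUT never produces a Remove patch for a whole aspect, so removing an aspect would remain an explicit operation.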

consider change from "by" for publisher syntax

I think we should reconsider the use of "by" in the query language for limiting the search to a particular publisher. That is because:
a) "by" isn't an exact fit for the relationship between a dataset and a publisher (the publisher is making the data available - they may not have created it - it may have been created by a different agency). "From" is closer to the meaning we want (there may also be other possibilities); and
b) I think we'll later need to use "by" for influencing display of results (eg for sorting and/or grouping - eg "population by state") and it would be good if people didn't get in the habit of using it.

I think that overloading "from" to allow a trailing publisher or date is fine and easy to disambiguate. I think that it won't be possible to do so between "by" and the likely future use of "by" I mentioned above.

So, I think we should use "from" for publisher instead of "by". What do others think?

Replace GA FIND harvester with Harvest Nodes

Here are the CSW endpoints to replace the GA-FIND harvester with - not a current priority, but a next step.

We should also update the ignoreHarvestSources for data.gov.au as follows:

ignoreHarvestSources = [
    "FIND (http://find.ga.gov.au) CSW Harvester",
    "Brisbane City Council CKAN Harvester",
    "Data NSW CKAN Harvester",
    "Data SA CKAN Harvester",
    "Australian Institute of Marine Science CSW Harvester",
    "Navy Meteorology and Oceanography (METOC) CSW Harvester",
    "Mineral Resources Tasmania CSW Harvester",
    "Tasmania Department of Primary Industries, Parks, Water and Environment CSW Harvester"
]

And here are the replacement CSWs. There will be a fuller list of CSWs coming from GA at some point.
There is also a harvest node (QLD spatial) that isn't CSW. We'll need to work that out when we build new connectors.

Atlas of Living Australia

http://spatial.ala.org.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

Australian Bureau of Meteorology

http://www.bom.gov.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

Australian Institute of Marine Science

http://data.aims.gov.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

Australian Oceans Data Network

http://catalogue.aodn.org.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

Geoscience Australia

http://www.ga.gov.au/geonetwork/srv/en/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

CSIRO Marlin

http://www.marlin.csiro.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

Terrestrial Ecosystem Research Network

http://data.auscover.org.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities
-or-
http://geonetwork.tern.org.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

Lower Priority

Mineral Resources Tasmania

http://www.mrt.tas.gov.au/web-catalogue/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

Tasmania TheList

https://data.thelist.tas.gov.au:443/datagn/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

NSW Land and Property

https://sdi.nsw.gov.au/csw?Request=getCapabilities&Version=2.0.2&Service=CSW
