magda-io / magda
A federated, open-source data catalog for all your big data and small data
Home Page: https://magda.io
License: Apache License 2.0
Since we don't currently have spatial coverage (we'll add it before long).
We'll use these to run the indexer, sleuths, and maybe user notifications eventually.
The project is created successfully, but the user is left on the "create" page.
The location filter (pop-over map) is too thin, making it hard to browse. Please change it so that it is at least as wide as the search results, and a little taller.
Please add the following sample queries + a "Learn about the new search" link to the homepage (see attached drawing). The "Learn more" link should point to the search syntax static page (http://magda-dev.terria.io/page/search-syntax).
Example queries:
It even says "Home / Dataset" in the breadcrumbs. The distinction should be made more clear.
List of data.json sources already available from data.gov.au
| data.json endpoint | harvest_source_title |
|---|---|
| http://data.logancity.opendata.arcgis.com/data.json | Logan City ArcGIS JSON Harvest |
| http://opendata.launceston.tas.gov.au/data.json | City of Launceston ArcGIS JSON Harvester |
| http://data-1.hobartcc.opendata.arcgis.com/data.json | City of Hobart ArcGIS JSON Harvest |
| https://data.melbourne.vic.gov.au/data.json | Melbourne JSON Harvester |
| http://www.data.act.gov.au/data.json | ACT JSON Harvester |
| http://data.moretonbay.qld.gov.au/data.json | Moreton Bay ArcGIS JSON Harvest |
| http://data.esta000.opendata.arcgis.com/data.json | ESTA Emergency Marker ArcGIS JSON Harvester |
In the first search, on selecting a publisher, I got 3 results.
When I select the second publisher, there should be 3 + 31 = 34 results, but it tells me there is no matching result, and the facet hit count has changed:
Is it because the publisher facet is using `and` logic? If we are using `and`, then why don't we just force the user to select only one option at any given time?
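The difference between the two behaviours can be sketched as follows. This is a minimal illustration with made-up dataset records and publisher names, not Magda's actual search code. It assumes each dataset has exactly one publisher, which is why combining two selected publishers with `and` can never match anything, while `or` gives the expected 3 + 31 = 34.

```python
# Hypothetical data: each dataset has exactly one publisher.
datasets = (
    [{"id": f"a{i}", "publisher": "Publisher A"} for i in range(3)]
    + [{"id": f"b{i}", "publisher": "Publisher B"} for i in range(31)]
    + [{"id": f"c{i}", "publisher": "Publisher C"} for i in range(5)]
)


def filter_and(records, publishers):
    """Keep records whose publisher equals every selected publisher (AND)."""
    return [r for r in records if all(r["publisher"] == p for p in publishers)]


def filter_or(records, publishers):
    """Keep records whose publisher is any of the selected publishers (OR)."""
    selected = set(publishers)
    return [r for r in records if r["publisher"] in selected]


if __name__ == "__main__":
    chosen = ["Publisher A", "Publisher B"]
    print("AND:", len(filter_and(datasets, chosen)))
    print("OR:", len(filter_or(datasets, chosen)))
```

With a single-valued field like publisher, `or` within the facet (and `and` between different facets) is the combination that matches the hit counts shown next to each option.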
Revised list of footer nav links:
Search (category header)
Projects
Publishers
Developers
Content in this section to be confirmed with Kevin and Alex. If not enough time to create content, leave this section out for the time being.
About
Feedback
The data.json format (https://github.com/GSA/ckanext-datajson) is used extensively around open data portals - most specifically (in Australia) to connect to Socrata and ArcGIS online portals, but also extensively in the USA. Currently we harvest into data.gov.au (although this is not reliable).
For example:
Socrata:
http://www.data.act.gov.au/data.json
https://data.melbourne.vic.gov.au/data.json
https://data.sunshinecoast.qld.gov.au/data.json
ArcGIS Online:
http://data-1.hobartcc.opendata.arcgis.com/data.json
http://opendata.launceston.tas.gov.au/data.json
http://data.logancity.opendata.arcgis.com/data.json
http://data.moretonbay.qld.gov.au/data.json
http://data.goldcoast.opendata.arcgis.com/data.json
http://vicroadsopendata.vicroadsmaps.opendata.arcgis.com/data.json
If I search http://terria.io/magda-web/build/?q=advisers+by+Australian+securities, the facet shows one result for Commonwealth of Australia (Geoscience Australia). But as soon as I select that facet, its hit count becomes 8125 and the result is `match-part`. This is confusing because it told me there was one result (see screenshot).
Here are the CSW endpoints to replace the GA-FIND harvester with - not a current priority, but a next step.
We should also update the ignoreHarvestSources for data.gov.au as follows:
ignoreHarvestSources = [
  "FIND (http://find.ga.gov.au) CSW Harvester",
  "Brisbane City Council CKAN Harvester",
  "Data NSW CKAN Harvester",
  "Data SA CKAN Harvester",
  "Australian Institute of Marine Science CSW Harvester",
  "Navy Meteorology and Oceanography (METOC) CSW Harvester",
  "Mineral Resources Tasmania CSW Harvester",
  "Tasmania Department of Primary Industries, Parks, Water and Environment CSW Harvester"
]
And here are the replacement CSWs. There will be a fuller list of CSWs coming from GA at some point.
There is also a harvest node (QLD spatial) that isn't CSW - we'll need to work that out when we build new connectors
### Atlas of Living Australia
http://spatial.ala.org.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities
http://www.bom.gov.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities
http://data.aims.gov.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities
http://www.ga.gov.au/geonetwork/srv/en/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities
http://www.marlin.csiro.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities
http://data.auscover.org.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities
-or-
http://geonetwork.tern.org.au/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities
https://sdi.nsw.gov.au/csw?Request=getCapabilities&Version=2.0.2&Service=CSW
Read from a CSW server, write records and aspects to the MAGDA registry.
Currently it considers the source of all records that come from the MAGDA registry as "MAGDA Registry", but the real source should be the CKAN or whatever server that the record originally came from.
Using page tokens instead of offsets will be much faster.
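As a rough sketch of what token-based paging means here: the token is the key of the last record already served, so the next page can seek directly to that key instead of counting past `offset` rows. The in-memory store and the increasing `seq` key below are assumptions for illustration only, not the registry's actual schema.

```python
from bisect import bisect_right

# Hypothetical store: records kept sorted by a unique, increasing key.
RECORDS = [(seq, f"record-{seq}") for seq in range(1, 26)]
KEYS = [seq for seq, _ in RECORDS]


def get_page(page_token=None, limit=10):
    """Return (items, next_token).

    The token is the key of the last item already served, so the next page
    starts right after it. next_token is None when there are no more items.
    """
    start = 0 if page_token is None else bisect_right(KEYS, page_token)
    items = RECORDS[start:start + limit]
    has_more = start + len(items) < len(RECORDS)
    next_token = items[-1][0] if items and has_more else None
    return items, next_token


if __name__ == "__main__":
    token = None
    while True:
        items, token = get_page(token)
        print(len(items), "items, next token:", token)
        if token is None:
            break
```

In a database the same idea is usually a `WHERE key > :token ORDER BY key LIMIT :limit` query, which can use an index rather than scanning and discarding the skipped rows.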
Clicking it leaves you on the current page.
Region polygons often have problems, such as:
In a perfect world, whoever provided the regions would fix these problems. But in reality our system needs to be robust in the face of imperfect data. To that end, we should automatically clean up these problems in the regions we load. This recent paper, A triangulation-based approach to automatically repair GIS polygons, proposes a very promising approach:
http://dx.doi.org/10.1016/j.cageo.2014.01.009
http://www.sciencedirect.com/science/article/pii/S009830041400020X
There is an implementation of this approach, but it is not suitable for use in Magda because it uses the GPL licence. It should be straightforward to implement it ourselves, though.
For magda-dev.terria.io and search.data.gov.au.
It currently goes to the data.gov.au blog RSS feed, and we don't currently have a way to subscribe.
Adding tasks from workshop on 31 May 17:
fix issue with "Business names by ASIC as CSV" example search query - @AlexGilleran
fix issue with "trees in Victoria" example search query - @AlexGilleran
search/query syntax help page needs to be created and linked from search examples box - @cam-grant @chloeleichen
review of filter UI - @cam-grant
publishers sidebar in search results: use list of publishers selected in publishers filter, take top 5 and display details for each - @chloeleichen @cam-grant
publisher filter - clear command should close popover panel - @chloeleichen
Investigate & fix (if broken) date range filter - @AlexGilleran @chloeleichen
Investigate and fix (if broken) location filter - @AlexGilleran @chloeleichen
http://search.data.gov.au/?q=%2A+by+South+Australian+Governments returns all the harvested SA datasets in dga, leading to dupes (http://search.data.gov.au/?q=+FINTCBP01)
I think we should reconsider the use of "by" in the query language for limiting the search to a particular publisher. That is because:
a) "by" isn't an exact fit for the relationship between a dataset and a publisher (the publisher is making the data available - they may not have created it - it may have been created by a different agency). "From" is closer to the meaning we want (there may also be other possibilities); and
b) I think we'll later need to use "by" for influencing display of results (eg for sorting and/or grouping - eg "population by state") and it would be good if people didn't get in the habit of using it.
I think that overloading "from" to allow a trailing publisher or date is fine and easy to disambiguate. I think that it won't be possible to do so between "by" and the likely future use of "by" I mentioned above.
So, I think we should use "from" for publisher instead of "by". What do others think?
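To illustrate why a trailing "from" is easy to disambiguate, here is a minimal sketch: if the text after "from" parses as a date it is treated as a date, otherwise as a publisher. The accepted date formats, the function names and the returned dictionary shape are all assumptions made up for this example; they are not the real Magda query parser.

```python
import re
from datetime import datetime

# Assumed date formats, for illustration only.
DATE_FORMATS = ("%Y", "%Y-%m", "%Y-%m-%d")


def parse_date(text):
    """Return a datetime if text matches one of the assumed formats, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def classify_from(query):
    """Split a query at the first 'from' and decide what its operand means."""
    match = re.search(r"\bfrom\s+(.+)$", query, flags=re.IGNORECASE)
    if match is None:
        return {"text": query.strip()}
    text = query[:match.start()].strip()
    operand = match.group(1).strip()
    date = parse_date(operand)
    if date is not None:
        return {"text": text, "date_from": date}
    return {"text": text, "publisher": operand}


if __name__ == "__main__":
    print(classify_from("trees from 2015"))
    print(classify_from("trees from Geoscience Australia"))
```

A publisher whose name is itself a valid date would still be ambiguous under this rule, but that case seems unlikely in practice.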
The CKAN Connector executes multiple `package_search` queries for a page of datasets at a time. CKAN's default search order is first by relevance (presumably all datasets have equal relevance when there is no query) and second by modified date, descending. This puts the most recently modified datasets at the beginning.
If new datasets are added while the crawler is running, they'll be missed, while other datasets will potentially be examined multiple times. This is acceptable-ish because crawling is inherently a point-in-time activity. The new datasets will be picked up on the next crawl.
More concerning, though, is if an existing dataset is modified during crawling. This will move it to the first page of results after the crawler has already moved to later pages. Therefore, it will be missed. Missing a dataset that existed prior to the start of crawling is not acceptable (-ish or otherwise).
It would be nice if CKAN datasets had a simple auto increment ID type of thing that we could use as the sort order, guaranteeing stable order and that new datasets added during crawling would be added at the end. But since there is nothing like that (AFAICT), it's tricky.
One idea is to sort by ascending modified date, purposely request overlap in the pages, verify that overlap actually exists, and if it doesn't, re-request the page with an earlier start index until we find our overlap.
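The overlap idea can be sketched roughly as below. This is not the connector's code: `fetch_page(start, rows)` is a hypothetical stand-in for a `package_search` call sorted by ascending modified date, datasets are reduced to `(id, modified)` pairs, and `FakeCatalog` is an in-memory fake used only to exercise the logic. The overlap is verified by checking that the first item of each new page is no newer than the last item already processed; if it is newer, earlier items have shifted and the crawler backs up.

```python
def crawl_with_overlap(fetch_page, rows=4, overlap=1):
    """Yield (id, modified) pairs, re-requesting pages when the overlap is missing.

    fetch_page(start, rows) must return items sorted by ascending modified time.
    A dataset modified mid-crawl is yielded again when it reappears at the end.
    """
    if not 0 < overlap < rows:
        raise ValueError("overlap must be between 1 and rows - 1")
    seen = {}          # id -> modified time of the version already yielded
    watermark = None   # modified time of the last item processed so far
    start = 0
    while True:
        page = fetch_page(start, rows)
        if not page:
            return
        if start > 0 and page[0][1] > watermark:
            # No overlap: items shifted towards the front. Back up and retry.
            start = max(0, start - overlap)
            continue
        for dataset_id, modified in page:
            if seen.get(dataset_id) != modified:
                seen[dataset_id] = modified
                yield dataset_id, modified
        watermark = page[-1][1]
        if len(page) < rows:
            return  # a short page means we reached the end
        start += len(page) - overlap


class FakeCatalog:
    """In-memory stand-in for a catalog sorted by ascending modified time."""

    def __init__(self, count):
        self.modified = {f"d{i}": i for i in range(count)}
        self.clock = count

    def touch(self, dataset_id):
        """Simulate an edit: the dataset moves to the end of the ordering."""
        self.modified[dataset_id] = self.clock
        self.clock += 1

    def fetch_page(self, start, rows):
        ordered = sorted(self.modified.items(), key=lambda item: item[1])
        return ordered[start:start + rows]


if __name__ == "__main__":
    catalog = FakeCatalog(10)
    crawler = crawl_with_overlap(catalog.fetch_page)
    first_page = [next(crawler) for _ in range(4)]
    catalog.touch("d1")  # modified while the crawl is in progress
    print(first_page + list(crawler))
```

The cost is some repeated requests and re-examined datasets, and the approach still assumes modified timestamps only ever move forward.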
I expect these will be common spatial filters on searches (esp for state govts) so it would be good to support searching on them.
Search, Projects, Publishers, About top-level items appear twice.
We want lots of feedback!
Thinking down the track - we currently operate a "Search Results Sharing Partnership" - currently with SA and NSW. This is achieved by the CKAN portals running a modified ckanext-harvest (https://github.com/datagovau/ckanext-harvest) plugin to remove the redirect loop it creates.
What if we were able to facilitate a CKAN plugin that performed the same role, but excluded the requestor's portal, thus giving users of the plugin access to the full (new) data.gov.au results?
One is the dev IP and the other is localhost.
The collaboration system will evolve, but at a minimum we need to persist the basic information in the title.
It's just a matter of changing this value in magda-registry-api's application.conf:
authorization {
skip = true
}
But once we do that the connectors won't be able to write to the registry because they don't authenticate themselves. So we should authenticate and authorize the connectors.
We don't currently have a way to add datasets to projects.
When doing a PATCH request for a record, an exception is thrown when a `Remove` patch tries to remove an entire aspect. The error is `Cannot delete an empty path`.
Also, when doing a PUT of a record, missing aspects should be left alone rather than removed. Currently, this scenario gets turned into the one above.
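The desired behaviour might look something like the sketch below. It is an illustration of the semantics described above using plain dictionaries, not the registry's actual implementation; the record shape (`id`, `name`, `aspects`) and the function names are assumptions, and the remove helper only handles object paths, not arrays.

```python
import copy


def put_record(existing, incoming):
    """PUT: replace top-level fields, but leave aspects missing from incoming alone."""
    merged = copy.deepcopy(existing)
    for key, value in incoming.items():
        if key != "aspects":
            merged[key] = copy.deepcopy(value)
    aspects = merged.setdefault("aspects", {})
    aspects.update(copy.deepcopy(incoming.get("aspects", {})))
    return merged


def apply_remove(record, path):
    """Apply a JSON Patch style 'remove', e.g. '/aspects/<aspect-id>'.

    Removing a whole aspect is allowed; only the empty path is rejected.
    """
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in path.split("/")[1:]]
    if not tokens:
        raise ValueError("cannot remove the whole record")
    result = copy.deepcopy(record)
    parent = result
    for token in tokens[:-1]:
        parent = parent[token]
    del parent[tokens[-1]]
    return result


if __name__ == "__main__":
    record = {"id": "r1", "name": "Old", "aspects": {"a": {"x": 1}, "b": {"y": 2}}}
    print(put_record(record, {"id": "r1", "name": "New", "aspects": {"a": {"x": 9}}}))
    print(apply_remove(record, "/aspects/b"))
```

Under these semantics an aspect is only ever deleted by an explicit remove, never as a side effect of a PUT that omits it.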
Only admin users can change the state of the flag.