wikidata-fuzzy-search's People

Contributors

daniellaillouz, debonacache, greatyyx, jiashengwu, kyao, shaindelb, zmbq

wikidata-fuzzy-search's Issues

Create variable

The metadata API is defined here: https://datamart-upload.readthedocs.io/en/latest/api/
Implement the POST metadata/datasets/id/variables method.
The only required property of the JSON content is the name of the variable.

Below is an example body for POST metadata/datasets/UAZ/variables:

{
  "name": "FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]"
}

If the optional variableID property is not provided, generate a variable ID. Suppose the generated variableID is "VUAZ-0".

The corresponding KGTK edge file is:

id node1 label node2 label;label
UAZ-V1 QVUAZ-0 label "FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]"
UAZ-V2 QVUAZ-0 P31 Q50701 instance of
UAZ-V2-1 UAZ-V2 P1932 "VUAZ-0" state as
UAZ-V3 QVUAZ-0 PcorrespondsToProperty "PVUAZ-0"
UAZ-V4 QVUAZ-0 P361 QUAZ part of
UAZ-V5 QUAZ PvariableMeasured QUAZ-101
UAZ-V6 PVUAZ-0 P31 Q18616576 instance of
UAZ-V7 PVUAZ-0 label "FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]"
UAZ-V8 PVUAZ-0 data_type quantity

The JSON returned is:

{
  "datasetID": "UAZ",
  "variableID": "VUAZ-0",
  "name": "FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]"
}
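
The ID handling described above could be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `create_variable`, the `existing_ids` argument, the "first unused number" scheme for generated IDs, and the error raised for a duplicate ID are all assumptions.

```python
from typing import Dict, Optional, Set


def create_variable(dataset_id: str, payload: Dict[str, str],
                    existing_ids: Set[str]) -> Dict[str, str]:
    """Build the JSON response for POST metadata/datasets/<dataset_id>/variables."""
    name = payload.get("name")
    if not name:
        # "name" is the only required property.
        raise ValueError("'name' is required")

    variable_id: Optional[str] = payload.get("variableID")
    if variable_id is None:
        # Assumed scheme: V<datasetID>-<n>, taking the first unused n.
        n = 0
        while f"V{dataset_id}-{n}" in existing_ids:
            n += 1
        variable_id = f"V{dataset_id}-{n}"
    elif variable_id in existing_ids:
        # Assumption: reusing an existing ID is treated as an error.
        raise ValueError(f"variable {variable_id} already exists")

    return {"datasetID": dataset_id, "variableID": variable_id, "name": name}
```

Called with dataset `UAZ`, a payload containing only a name, and no existing variables, this returns the variableID `VUAZ-0` used in the example.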

Delete variable

Implement DELETE /metadata/datasets/id/variables/id2

Delete metadata and data content associated with the variable.

For example: DELETE /metadata/datasets/UAZ/variables/VUAZ-0

Create dataset

The metadata API is defined here: https://datamart-upload.readthedocs.io/en/latest/api/
Implement the POST metadata/datasets method.
The minimal JSON content must include the properties name, description and url.
Below is a sample JSON:

{
  "name": "UAZ Indicators",
  "description": "Indicators collected by UAZ",
  "url": "https://github.com/ml4ai/delphi",
  "shortName": "UAZ"
}

It is an error if the short name already exists.

The shortName becomes the dataset ID.
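
A minimal sketch of these input checks is shown below. The function name `validate_dataset` and the `existing_short_names` argument are hypothetical, and rejecting a payload with no shortName is an assumption of this sketch (the issue does not say what happens in that case).

```python
from typing import Dict, Set

REQUIRED_PROPERTIES = ("name", "description", "url")


def validate_dataset(payload: Dict[str, str], existing_short_names: Set[str]) -> str:
    """Check a POST metadata/datasets payload and return the dataset ID."""
    missing = [key for key in REQUIRED_PROPERTIES if not payload.get(key)]
    if missing:
        raise ValueError("missing required properties: " + ", ".join(missing))

    short_name = payload.get("shortName")
    if not short_name:
        # Assumption: this sketch requires shortName rather than generating one.
        raise ValueError("'shortName' is required in this sketch")
    if short_name in existing_short_names:
        raise ValueError(f"dataset {short_name} already exists")

    # The shortName becomes the dataset ID.
    return short_name
```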

From the input, first generate the following KGTK edge file:

id node1 label node2
QUAZ-P31-0 QUAZ P31 Q1172284
  QUAZ-P31-0 P1932 "UAZ"
  QUAZ label "UAZ Indicators"
  QUAZ P1476 "UAZ Indicators"
  QUAZ description "Indicators collected by UAZ"
  QUAZ P2699 "https://github.com/ml4ai/delphi"
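
The edge rows above could be produced from the input JSON roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the columns are assumed to be tab-separated with an empty id for the rows that show none, and escaping of special characters inside string values is omitted.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str, str]  # (id, node1, label, node2)


def quote(text: str) -> str:
    """Wrap a value in double quotes, as string values appear in the example.

    Escaping of special characters is deliberately left out of this sketch.
    """
    return '"' + text + '"'


def dataset_edges(payload: Dict[str, str]) -> List[Edge]:
    """Build the dataset edges shown in the example from the POST payload."""
    dataset_id = payload["shortName"]
    node = f"Q{dataset_id}"
    type_edge_id = f"{node}-P31-0"
    return [
        (type_edge_id, node, "P31", "Q1172284"),
        ("", type_edge_id, "P1932", quote(dataset_id)),
        ("", node, "label", quote(payload["name"])),
        ("", node, "P1476", quote(payload["name"])),
        ("", node, "description", quote(payload["description"])),
        ("", node, "P2699", quote(payload["url"])),
    ]


def to_tsv(edges: List[Edge]) -> str:
    """Serialize edges with the header used in the example (tabs assumed)."""
    lines = ["id\tnode1\tlabel\tnode2"]
    lines += ["\t".join(edge) for edge in edges]
    return "\n".join(lines) + "\n"
```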

Run kgtk explode to produce a file of the following form (the sample values below come from an FAO dataset, not the UAZ input above):

node1 label node2 id node2;kgtk:data_type node2;kgtk:number node2;kgtk:low_tolerance node2;kgtk:high_tolerance node2;kgtk:units_node node2;kgtk:text node2;kgtk:latitude node2;kgtk:longitude node2;kgtk:date_and_time node2;kgtk:precision node2;kgtk:truth node2;kgtk:symbol
QUAZ P31 Q1172284 QUAZ-P31-0 symbol Q1172284
QUAZ-P31-0 P1932 FAO QUAZ-P31-0-P1932-0 string FAO
QUAZ label FAO Statistics QUAZ-label-0 string FAO Statistics
QUAZ P1476 FAO Statistics QUAZ-P1476-0 string FAO Statistics
QUAZ descriptions FAOSTAT provides free access to food and agriculture data for over 245 countries and territories and covers all FAO regional groupings from 1961 to the most recent year available. QUAZ-descriptions-0 string FAOSTAT provides free access to food and agriculture data for over 245 countries and territories and covers all FAO regional groupings from 1961 to the most recent year available.
QUAZ P2699 http://www.fao.org/faostat/ QUAZ-P2699-0 string http://www.fao.org/faostat/

Load the exploded file to the database.

The JSON returned is:

{
  "name": "UAZ Indicators",
  "description": "Indicators collected by UAZ",
  "url": "https://github.com/ml4ai/delphi",
  "datasetID": "UAZ"
}

Metadata: Search for datasets

Implement the GET /datasets method.

A GET with no parameters returns a list of all dataset IDs.

A GET with name=UAZ,Wikidata returns the descriptions of the specified datasets.

A GET with keyword=maize,production returns a list of relevant datasets. Use the variable keyword search to find relevant variables, then aggregate the variables by dataset. In this case the variableMeasured field of each dataset description should contain only the relevant variables.
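
The aggregation step for the keyword case might look like the sketch below. It assumes the variable keyword search yields dictionaries with `datasetID` and `variableID` keys; that shape and the function name are illustrative, not taken from the API document.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_dataset(matches: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Aggregate matching variables by dataset.

    Each result lists only the variables that matched the keywords, so
    variableMeasured contains just the relevant variables.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for match in matches:
        grouped[match["datasetID"]].append(match["variableID"])
    return [
        {"datasetID": dataset_id, "variableMeasured": variable_ids}
        for dataset_id, variable_ids in grouped.items()
    ]
```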

Data Content: uploading CSV in canonical data format

Implement the PUT /datasets/id/variable/id2 method.
The corresponding datasetID and variableID must have already been defined through the metadata API.

Wikification:

  • If the main_subject_id column is not given, run the wikifier to generate the column based on the content of the main_subject column.
  • If any of the country, admin1, admin2, admin3 columns are given without the corresponding *_id columns, run the wikifier to generate those columns.

Qualifier columns:

  • Use the column header name to generate a property. If a property with this name has already been generated for another variable under this dataset, then reuse this property.
  • Note: Update metadata API to allow users to specify the property, i.e. change qualifier from List[String] to List[Object]

Then:

  • Generate KGTK edge file
  • Generate RDF ttl file from this edge file
  • Upload to Wikidata (or PostgreSQL) as appropriate
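
The wikification rules above amount to checking the CSV header for columns that lack an `*_id` companion. A minimal sketch, assuming the header is available as a list of column names (the function name is hypothetical):

```python
from typing import List

# Columns that may need wikification, per the rules above.
WIKIFIABLE_COLUMNS = ("main_subject", "country", "admin1", "admin2", "admin3")


def columns_to_wikify(header: List[str]) -> List[str]:
    """Return the columns that are present but have no matching *_id column."""
    present = set(header)
    return [
        column
        for column in WIKIFIABLE_COLUMNS
        if column in present and f"{column}_id" not in present
    ]
```

For a header containing main_subject, country and country_id, only main_subject would be passed to the wikifier.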

Metadata: Search for variables

Implement the GET /variable method.
With no parameters, it returns a list of all variables under all datasets.
With &ids=, it returns a list of variable descriptions in JSON.
With &keywords=, it returns a list of variable descriptions in JSON.

Fuzzy search on variables

Currently, the fuzzy keyword search runs on Wikidata properties (label and description).

Modify it to search on variable descriptions. The dataset description should be added to each variable description for searching purposes.

Move the cache and indices directories outside the docker image.

In the backend's docker image, map the cache and indices folders to /external/cache and /external/indices, respectively. People will mount /external to their own host folders before starting the image.

When the container starts, run the script that populates the cache and indices before running gunicorn (you probably need a small script that runs both).

Depends on #8 and #7

de-cache when there's an update in js

Simply add a version number or a random number to the URL after compiling.

e.g., latest_javascript.js?version=4, latest_javascript.js?v=adsfn2ion2f23jfn2nseg
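
One way to get such a suffix without tracking version numbers by hand is to derive it from the compiled file's content, so the URL changes only when the file does. This is a swapped-in variant of the idea above, not what the issue prescribes; the function name and the 12-character digest length are arbitrary choices.

```python
import hashlib


def versioned_url(url: str, content: bytes) -> str:
    """Append a content-derived version parameter to a script URL."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    # Use "&" if the URL already carries a query string.
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}v={digest}"
```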

Write a country wikifier

Amandeep will supply an edge file with all the country labels and aliases.

The input is a column of countries or country abbreviations.

The country can appear in the country column or the main_subject column.
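
A minimal lookup-based sketch of such a wikifier follows. It assumes the edge file can be read as (node1, label, node2) triples in which the labels `label` and `alias` carry the country names; those label names, the case-insensitive matching, and the function names are assumptions, since the edge file has not been supplied yet.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def build_index(edges: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Map each normalized country label or alias to its Q-node."""
    index: Dict[str, str] = {}
    for node1, label, node2 in edges:
        if label in ("label", "alias"):
            index[node2.strip().casefold()] = node1
    return index


def wikify(column: Iterable[str], index: Dict[str, str]) -> List[Optional[str]]:
    """Return the Q-node for each cell, or None when there is no match."""
    return [index.get(value.strip().casefold()) for value in column]
```

With an index built from a label edge for Ethiopia pointing at Q115, the input cell "ethiopia" resolves to Q115 and an unknown name resolves to None.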

Get canonical data from variable

Implement GET /datasets/id/variables/id2

The API will include a query string in which the user specifies the country.

For example, GET /datasets/UAZ/variables/VUAZ-0 without any query parameters should return:

dataset_id,variable_id,variable,main_subject,main_subject_id,value,value_unit,time,time_precision,country,place,coordinate
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.16,%,1971-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.16,%,1972-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.17,%,1973-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.17,%,1974-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.15,%,1975-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.15,%,1976-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.16,%,1977-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.19,%,1978-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.2,%,1979-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.19,%,1980-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.21,%,1981-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.19,%,1982-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.21,%,1983-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.18,%,1984-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.18,%,1985-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.16,%,1986-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.27,%,1987-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.24,%,1988-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.23,%,1989-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.26,%,1990-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
UAZ,VUAZ-0,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",Ethiopia,Q115,0.18,%,1991-01-01T00:00:00Z,year,Ethiopia,,Point(40.0 9.0)
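
The country restriction could be applied to canonical CSV output along these lines. The sketch assumes the filter is a single country name compared case-insensitively against the country column; the query parameter itself is not named in this issue, so the function signature is hypothetical.

```python
import csv
import io
from typing import Optional


def filter_by_country(csv_text: str, country: Optional[str]) -> str:
    """Keep only rows whose country column matches; None keeps every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if country is None or row["country"].casefold() == country.casefold():
            writer.writerow(row)
    return out.getvalue()
```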

Upload canonical data into variable

Implement the PUT /datasets/id/variable/id2 method.

Verification

  • dataset id exists
  • variable id exists
  • If the input file has columns we don't recognize, we assume they are qualifiers. In the first implementation we will ignore such columns.

An unknown column could be:

  1. recognized qualifiers such as source
  2. a new column that has been seen in another variable in this dataset
  3. a completely new column that we have never seen in this dataset.

If it is a previously seen column, we can look up its p-node and other information.

If the column is completely unknown, the edges for the new qualifier have to be inserted.
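
The three cases above can be separated with a small classifier. In this sketch the set of canonical columns is collected from the sample headers in these issues plus the admin2 and admin3 columns mentioned for wikification; the category names, the function name, and the treatment of `*_id` companion columns are assumptions.

```python
from typing import Dict, List, Set

# Canonical columns, collected from the sample CSV headers in these issues.
CANONICAL_COLUMNS = {
    "dataset_id", "variable", "variable_id", "main_subject", "main_subject_id",
    "value", "value_unit", "time", "time_precision", "country", "country_id",
    "admin1", "admin2", "admin3", "place", "place_id", "coordinate",
}


def classify_columns(header: List[str], recognized_qualifiers: Set[str],
                     dataset_qualifiers: Set[str]) -> Dict[str, str]:
    """Label each non-canonical column with one of the three cases."""
    result: Dict[str, str] = {}
    for column in header:
        if column in CANONICAL_COLUMNS:
            continue
        if column.endswith("_id") and column[:-3] in header:
            continue  # companion id column, handled with its base column
        if column in recognized_qualifiers:
            result[column] = "recognized"
        elif column in dataset_qualifiers:
            result[column] = "seen_in_dataset"  # look up the existing p-node
        else:
            result[column] = "new"  # edges for a new qualifier must be inserted
    return result
```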

Below is a sample CSV in canonical data format that has been wikified, for PUT /datasets/UAZ/variables/VUAZ-0:

dataset_id,variable,variable_id,main_subject,main_subject_id,value,value_unit,time,time_precision,country,country_id,admin1,admin_id,place,place_id,source,source_id
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.16,%,1971-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.16,%,1972-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.17,%,1973-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.17,%,1974-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.15,%,1975-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.15,%,1976-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.16,%,1977-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.19,%,1978-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.2,%,1979-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.19,%,1980-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.21,%,1981-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.19,%,1982-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.21,%,1983-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.18,%,1984-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.18,%,1985-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.16,%,1986-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.27,%,1987-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.24,%,1988-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.23,%,1989-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.26,%,1990-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151
UAZ,"FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",VUAZ-0,Ethiopia,Q115,0.18,%,1991-01-01T00:00:00Z,year,Ethiopia,Q115,,,,,FAO,Q82151

The corresponding KGTK edge file is:
See this shared folder: https://drive.google.com/drive/u/2/folders/1m3PkTtIDzxeLN3_EoXsfgV3jkYfJqxrb

Metadata: create/update/get a variable description

The metadata API is defined here: https://datamart-upload.readthedocs.io/en/latest/api/
Implement the POST /datasets/id/variables method.
The minimal JSON content is the name of the variable:

{
  "name": "Access to clean fuels and technologies for cooking"
}

If the optional property variableID is given, use its value as the variable identifier within the dataset. Otherwise, automatically generate a variable identifier.

The method PUT /datasets/id/variables/id2 updates an existing variable from the given JSON description.
The method GET /datasets/id/variables/id2 returns the variable's JSON description.

Metadata: create/update/get a dataset description

The metadata API is defined here: https://datamart-upload.readthedocs.io/en/latest/api/
The minimal JSON content for POST /datasets must include properties: name, description and url. Below is a sample JSON:

{
  "name": "World Bank Development Indicators",
  "description": "World Development Indicators (WDI) is the primary World Bank collection of development indicators, compiled from officially recognized international sources.",
  "url": "https://data.worldbank.org/indicator"
}

If the optional datasetID property is included, then the value of datasetID is used as the dataset identifier. Otherwise, a dataset identifier is automatically generated.

The method PUT /datasets/id updates an existing dataset with the given JSON content.
The method GET /datasets/id returns the JSON description.

Update variable metadata

Implement the PUT metadata/datasets/id/variables/id2 method.
Invoking PUT metadata/datasets/UAZ/variables/VUAZ-0 with this JSON:

{
  "name": "FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",
  "shortName": "Agriculture and forestry energy use",
  "mainSubject": [
    {"name": "Kercha", "identifier": "https://www.wikidata.org/wiki/Q6393737"},
    {"name": "Liben", "identifier": "https://www.wikidata.org/wiki/Q3237714"}
  ],
  "unitOfMeasure": ["%"],
  "country": [
    {"name": "Ethiopia", "identifier": "https://www.wikidata.org/wiki/Q115"}
  ],
  "startTime": "1993",
  "endTime": "2016",
  "endTime_precision": "Year",
  "dataInterval": "Monthly",
  "qualifier": ["time", "source"]
}

should return:

{
  "datasetID": "UAZ",
  "variableID": "VUAZ-0",
  "name": "FAO: Agriculture and forestry energy use as a % of total Energy use, Energy used in agriculture and forestry[%]",
  "shortName": "Agriculture and forestry energy use",
  "mainSubject": [
    {"name": "Kercha", "identifier": "https://www.wikidata.org/wiki/Q6393737"},
    {"name": "Liben", "identifier": "https://www.wikidata.org/wiki/Q3237714"}
  ],
  "unitOfMeasure": ["%"],
  "country": [
    {"name": "Ethiopia", "identifier": "https://www.wikidata.org/wiki/Q115"}
  ],
  "startTime": "1993",
  "endTime": "2016",
  "endTime_precision": "Year",
  "dataInterval": "Monthly",
  "qualifier": ["time", "source"]
}

Get dataset metadata

Implement the GET metadata/datasets/id method.

Invoking GET metadata/datasets/UAZ should return the dataset metadata for the UAZ dataset:

{
  "name": "UAZ Indicators",
  "description": "Indicators collected by UAZ",
  "url": "https://github.com/ml4ai/delphi",
  "datasetID": "UAZ"
}

Search for variable

See API document here: https://datamart-upload.readthedocs.io/en/latest/api/

Implement the GET /metadata/variable endpoint for keyword search.

Given a list of keywords, return a list of variable metadata.

Use PostgreSQL fuzzy search on the dataset name, dataset description, variable name and variable description.

The result should be a JSON object that includes simple information that a GUI may use to display to a user:

  • variable name
  • variable description
  • variable id
  • dataset short name (used as id)
  • dataset name

If we have time, the result should also include:

  • time span (if defined in the database)
  • number of points
  • precision (if defined for the database)

If time allows, the search should have a bit of smarts so that if a country or countries are provided or detected in the keywords, the countries should be wikified and the search should filter out the variable if it is not available for the requested country or countries.
