
linked-data's People

Contributors

baskaufs, cliffordanderson, dashboard-user, eshook2010, jbaskauf

Forkers

jimfhahn jbaskauf

linked-data's Issues

Add support for Commons images to Vanderbot

Testing reveals that the snak JSON for a Commons image looks like this:

"P18": [
  {
    "mainsnak": {
      "snaktype": "value",
      "property": "P18",
      "hash": "c99db2661b192fd22f5dff1a532b1d6eb9433ef9",
      "datavalue": {
        "value": "Ace The Wonder Dog.jpg",
        "type": "string"
      },
      "datatype": "commonsMedia"
    },
    "type": "statement",
    "id": "Q15397819$0d43e90b-4963-b88f-49c4-fc42afe5b606",
    "rank": "normal"
  }
]

When queried via SPARQL, the returned value is:

http://commons.wikimedia.org/wiki/Special:FilePath/Ace%20The%20Wonder%20Dog.jpg
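
For reference, the round trip between the API's plain file name and the percent-encoded FilePath IRI can be done with the standard library. A minimal sketch, not VanderBot's actual code (FILEPATH_NS and the function names are invented for illustration):

from urllib.parse import quote, unquote

FILEPATH_NS = 'http://commons.wikimedia.org/wiki/Special:FilePath/'

def filename_to_iri(filename):
    # Percent-encode the plain file name from the snak's datavalue.
    return FILEPATH_NS + quote(filename)

def iri_to_filename(iri):
    # Recover the plain file name from a SPARQL result IRI.
    return unquote(iri[len(FILEPATH_NS):])

print(filename_to_iri('Ace The Wonder Dog.jpg'))
# http://commons.wikimedia.org/wiki/Special:FilePath/Ace%20The%20Wonder%20Dog.jpg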

Something is wrong with the reference checking on new records

When creating new records, the response JSON was the following, along with an error message:

Processing row: 27  Label: Julia Pim Reis  new record
Write confirmation:  {'entity': {'labels': {'en': {'language': 'en', 'value': 'Julia Pim Reis'}}, 'descriptions': {'en': {'language': 'en', 'value': 'biodiversity software developer'}}, 'aliases': {}, 'sitelinks': {}, 'claims': {'P108': [{'mainsnak': {'snaktype': 'value', 'property': 'P108', 'hash': 'a1db1fbb4ae38348b212f40c4b89752a0aaffacc', 'datavalue': {'value': {'entity-type': 'item', 'numeric-id': 233098, 'id': 'Q233098'}, 'type': 'wikibase-entityid'}, 'datatype': 'wikibase-item'}, 'type': 'statement', 'id': 'Q99580299$2CCF45A6-04DF-48B8-A6F4-248982C93FBC', 'rank': 'normal', 'references': [{'hash': '5c2b16da511ee084aa935c5b011d7e8c21187b41', 'snaks': {'P854': [{'snaktype': 'value', 'property': 'P854', 'hash': 'bb37fc08d6164b52ebb76f807e6c547a227f725d', 'datavalue': {'value': 'https://www.linkedin.com/in/juliapimreis/', 'type': 'string'}, 'datatype': 'url'}], 'P813': [{'snaktype': 'value', 'property': 'P813', 'hash': '5d20b450b1aa6c2cd42bb1d1b137f4d841b595e1', 'datavalue': {'value': {'time': '+2020-09-24T00:00:00Z', 'timezone': 0, 'before': 0, 'after': 0, 'precision': 11, 'calendarmodel': 'http://www.wikidata.org/entity/Q1985727'}, 'type': 'time'}, 'datatype': 'time'}]}, 'snaks-order': ['P854', 'P813']}]}], 'P31': [{'mainsnak': {'snaktype': 'value', 'property': 'P31', 'hash': 'ad7d38a03cdd40cdc373de0dc4e7b7fcbccb31d9', 'datavalue': {'value': {'entity-type': 'item', 'numeric-id': 5, 'id': 'Q5'}, 'type': 'wikibase-entityid'}, 'datatype': 'wikibase-item'}, 'type': 'statement', 'id': 'Q99580299$7E46DB32-9BED-4CDD-8DC6-AFF05ECA69D6', 'rank': 'normal'}], 'P21': [{'mainsnak': {'snaktype': 'value', 'property': 'P21', 'hash': '5760796ff6ebc63aae12cdcbf509b07ebf0bd201', 'datavalue': {'value': {'entity-type': 'item', 'numeric-id': 6581072, 'id': 'Q6581072'}, 'type': 'wikibase-entityid'}, 'datatype': 'wikibase-item'}, 'type': 'statement', 'id': 'Q99580299$5EC83EC8-B4E2-41CD-8EE3-60D38F403A3B', 'rank': 'normal'}]}, 'id': 'Q99580299', 'type': 'item', 'lastrevid': 1281714167}, 'success': 1}

No reference in the response JSON matched with the reference for statement: Q99580299   P108   Q233098
Reference   {'refHashColumn': 'employerReferenceHash', 'refPropList': ['P854', 'P813'], 'refValueColumnList': ['employerReferenceSourceUrl', 'employerReferenceRetrieved'], 'refEntityOrLiteral': ['literal', 'value'], 'refTypeList': ['url', 'time'], 'refValueTypeList': ['string', 'time']}

The error message is generated on line 1315, where it is noted that the condition causing it should never occur. This needs to be debugged by recording which instance of setting referenceMatch = False was the one that triggered the error. It would probably also be good to record the value of responseReference during the loop that sets it to False. I'm thinking that the break in line 1299 isn't really killing the loop and that it's continuing after there is a match to the correct reference. The reason I think so is that the value is getting set correctly in the table.
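
If that hypothesis is right, the failure mode would look something like this minimal sketch (hypothetical names and loop structure, not the actual VanderBot code):

# break exits only the innermost loop, so after the matching reference is
# found, the outer loop keeps going and a later, non-matching reference can
# flip referenceMatch back to False.
def find_reference(response_references, expected_props):
    referenceMatch = False
    for responseReference in response_references:
        for prop in responseReference['snaks']:
            if prop not in expected_props:
                referenceMatch = False  # log here: which reference set this?
                break                   # exits the inner (snak) loop only
            referenceMatch = True
        # BUG: without an "if referenceMatch: break" here, the next
        # responseReference is still examined and may reset the flag.
    return referenceMatch

Logging the loop index and the value of responseReference at the point where referenceMatch is set to False would confirm or rule this out.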

Replace direct values with value nodes for dates

Here is the situation:

Generally, when a property column like employer_startDate has a value type of Date chosen from the dropdown, the csv-metadata.json output file needs to change (shown here with a qualifier example) from

{
    "titles": "employer_startDate",
    "name": "employer_startDate",
    "datatype": "dateTime",
    "aboutUrl": "http://www.wikidata.org/entity/statement/{qid}-{employer_uuid}",
    "propertyUrl": "http://www.wikidata.org/prop/qualifier/P580"
},

to

{
    "titles": "employer_startDate_rand",
    "name": "employer_startDate_rand",
    "datatype": "string",
    "aboutUrl": "http://www.wikidata.org/entity/statement/{qid}-{employer_uuid}",
    "propertyUrl": "http://www.wikidata.org/prop/qualifier/value/P580",
    "valueUrl": "http://example.com/.well-known/genid/{employer_startDate_rand}"
},
{
    "titles": "employer_startDate_val",
    "name": "employer_startDate_val",
    "datatype": "dateTime",
    "aboutUrl": "http://example.com/.well-known/genid/{employer_startDate_rand}",
    "propertyUrl": "http://wikiba.se/ontology#timeValue"
},
{
    "titles": "employer_startDate_prec",
    "name": "employer_startDate_prec",
    "datatype": "integer",
    "aboutUrl": "http://example.com/.well-known/genid/{employer_startDate_rand}",
    "propertyUrl": "http://wikiba.se/ontology#timePrecision"
},

The CSV header needs to change from

...,employer_startDate,...

to

...,employer_startDate_rand,employer_startDate_val,employer_startDate_prec,...

Before, the property linking to the direct value had a namespace like http://www.wikidata.org/prop/x/, where x was statement, reference, or qualifier. Now the property linking to the value node needs to have a namespace like http://www.wikidata.org/prop/x/value/, where x is still statement, reference, or qualifier. The links from the value node to the time value and time precision are the same regardless of the property link. The IRI pattern for the value node identifier is always the same, a blank-node Skolem IRI: http://example.com/.well-known/genid/{propertyName_rand}, where propertyName_rand is the property name from the form with _rand appended.
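
Putting that together, the transformation could be generated along these lines. A rough sketch with a hypothetical helper name, not the actual script:

def expand_date_column(name, prop_namespace, prop_id, about_url):
    # name: e.g. 'employer_startDate'
    # prop_namespace: e.g. 'http://www.wikidata.org/prop/qualifier/'
    # prop_id: e.g. 'P580'
    # about_url: the statement IRI pattern the direct column used
    genid = 'http://example.com/.well-known/genid/{' + name + '_rand}'
    return [
        {   # the random identifier that names the value node
            'titles': name + '_rand',
            'name': name + '_rand',
            'datatype': 'string',
            'aboutUrl': about_url,
            'propertyUrl': prop_namespace + 'value/' + prop_id,
            'valueUrl': genid
        },
        {   # the dateTime value attached to the value node
            'titles': name + '_val',
            'name': name + '_val',
            'datatype': 'dateTime',
            'aboutUrl': genid,
            'propertyUrl': 'http://wikiba.se/ontology#timeValue'
        },
        {   # the integer precision attached to the value node
            'titles': name + '_prec',
            'name': name + '_prec',
            'datatype': 'integer',
            'aboutUrl': genid,
            'propertyUrl': 'http://wikiba.se/ontology#timePrecision'
        }
    ]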

Vanderbot does not find some image names in the data sent back from the API

In these cases, the record was successfully written to the API, and the rest of the metadata (except for the associated references) was written to the CSV file. However, if the CSV is used to write again, duplicate claims will be made, since the UUIDs and hashes aren't recorded in the table. So they have to be manually copied out of the returned JSON and pasted into the CSV.

Q111821575
transformed image URL:
http://commons.wikimedia.org/wiki/Special:FilePath/%22Accusing%20Finger%20of%20Conscience...God%20and%20Conscience%20Witness%20Every%20Action...The%20Authorities%20Ask%20That%20You%20Save%20Fats...Reli%20-%20NARA%20-%20512560.jpg
API value:
Accusing Finger of Conscience...God and Conscience Witness Every Action...The Authorities Ask That You Save Fats...Reli - NARA - 512560.jpg

Q111821677
transformed image URL:
http://commons.wikimedia.org/wiki/Special:FilePath/Henry%20Dunant%20apocalypse%20diagram%20.JPG
API value:
Henry Dunant apocalypse diagram.JPG

In this case it looks like the API stripped off a trailing space before the file extension.

Q111822239
transformed image URL:
http://commons.wikimedia.org/wiki/Special:FilePath/Simon%20Bening%20%28Flemish%20-%20Villagers%20on%20Their%20Way%20to%20Church%20-%20Google%20Art%20Project.jpg
API value:
Simon Bening - Villagers on Their Way to Church - Google Art Project.jpg
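
A hedged sketch (not VanderBot's code) of a looser comparison between the transformed FilePath IRI and the file name the API returns. It handles the Dunant case, where only whitespace differs; it will not handle the other two examples, where the characters or the name itself appear to differ:

from urllib.parse import unquote
import re

FILEPATH_NS = 'http://commons.wikimedia.org/wiki/Special:FilePath/'

def normalize(name):
    # Commons treats underscores as spaces and collapses runs of whitespace.
    name = name.replace('_', ' ')
    name = re.sub(r'\s+', ' ', name).strip()
    # Also drop a stray space before the extension, as in the Dunant example.
    name = re.sub(r'\s+(\.[A-Za-z0-9]+)$', r'\1', name)
    return name

def names_match(iri, api_value):
    return normalize(unquote(iri[len(FILEPATH_NS):])) == normalize(api_value)

print(names_match(
    'http://commons.wikimedia.org/wiki/Special:FilePath/Henry%20Dunant%20apocalypse%20diagram%20.JPG',
    'Henry Dunant apocalypse diagram.JPG'))  # True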

Statement IRIs that are subjects of qualifier triples are missing the item Q ID

@jbaskauf Sorry, the example I gave you to look at for qualifier statements was an old one where I was expressing the statement IRIs incorrectly. They are supposed to have the item Q ID, followed by a dash, in front of the UUID. You can see the correction in this diff: 573b3b8

This change makes the IRI the same as it is in all of the other cases where the statement is the subject of a triple.
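For illustration, using the statement ID from the Commons image example above (Q15397819$0d43e90b-4963-b88f-49c4-fc42afe5b606), the subject IRI of the qualifier triples should be

http://www.wikidata.org/entity/statement/Q15397819-0d43e90b-4963-b88f-49c4-fc42afe5b606

rather than the UUID alone:

http://www.wikidata.org/entity/statement/0d43e90b-4963-b88f-49c4-fc42afe5b606

matching the {qid}-{uuid} pattern used in the aboutUrl values elsewhere in the mapping file.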

I want to create a database so that I can access the data that has been accumulated

It isn't clear to me what the best place is to store the data that we have scraped so that we can clean and disambiguate it. Some options are:

  • relational database on a local computer
  • relational database in the cloud
  • DynamoDB database in AWS
  • triplestore (Blazegraph) in the cloud
  • CSV files on GitHub

I didn't list Wikibase itself because, before data can be put into it, we need to get past the identifier and data model issues. Eventually we would like the data to live in a Wikibase instance, but it's going to have to be cleaned a lot first.
