wkiri / mte
Mars Target Encyclopedia
License: Apache License 2.0
/proj/mte/results/mer-a-jsre-v2-ads-gaz.jsonl
/proj/mte/results/mer-a-unary-v2-ads-gaz.jsonl
The first step is to try parsing the journal documents @stevenlujpl already downloaded.
For some documents, we may need to process them multiple times for each mission whose targets are mentioned (see issue #22).
ingest_sqlite.py
update_sqlite.py
twice: once with LPSC annotations and once with journal paper annotations (after it is available on the AN / delivered to the PDS)
Example use case: Because targets from different missions can appear in the same document, we already have reviewed labels for MER and MSL targets in the MPF and PHX collections. To avoid redoing this work, we should copy these already-reviewed target names from the MPF/PHX .ann files to the new MER and MSL collections prior to human review.
| Collection | MER targets | MSL targets |
|---|---|---|
| MPF | 64 | 87 |
| PHX | 4 | 127 |
Sources
/var/www/brat/data/mpf/all-reviewed+properties-v2/
/var/www/brat/data/phx/all-reviewed+properties-v2/
The script would need to change e.g. Target-MER to Target when copying the entity annotation to the MER collection.
Envisioned solution:
$ propagate_entity_annotations.py <source_dir> <dest_dir> <source_entity_type> <dest_entity_type>
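A minimal sketch of the core step of such a script, assuming brat standoff entity lines of the form `T<id><TAB><type> <start> <end><TAB><text>`. The function name `propagate_entities` and the choice to skip spans already present in the destination are illustrative assumptions, not the project's implementation. Reading and writing the matching .ann files in `<source_dir>` and `<dest_dir>` is left out.

```python
import re

ENTITY_ID = re.compile(r"^T(\d+)$")


def propagate_entities(source_ann, dest_ann, source_type, dest_type):
    """Append source entities of source_type to dest_ann, relabeled as dest_type.

    Both arguments are the text contents of brat .ann files (hypothetical
    helper; relations, attributes, and events in the source are not copied).
    """
    dest_lines = [ln for ln in dest_ann.splitlines() if ln.strip()]

    # Find the next free entity id and remember which spans already exist.
    next_id = 1
    seen = set()
    for ln in dest_lines:
        parts = ln.split("\t")
        match = ENTITY_ID.match(parts[0])
        if match and len(parts) >= 2:
            next_id = max(next_id, int(match.group(1)) + 1)
            seen.add(parts[1])

    for ln in source_ann.splitlines():
        parts = ln.split("\t")
        if len(parts) != 3 or not ENTITY_ID.match(parts[0]):
            continue  # not an entity line
        etype, _, span = parts[1].partition(" ")
        if etype != source_type:
            continue
        relabeled = f"{dest_type} {span}"
        if relabeled in seen:
            continue  # already annotated in the destination
        dest_lines.append(f"T{next_id}\t{relabeled}\t{parts[2]}")
        seen.add(relabeled)
        next_id += 1

    return "\n".join(dest_lines) + ("\n" if dest_lines else "")
```

Because spans already present in the destination are skipped, running the propagation twice on the same pair of files should not create duplicate entities.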
This would allow us to link known aliases of the same target, like Jake_M and Jake_Matijevic.
Tom Stein provided feedback on the MER-A content in the MTE:
It is likely that something was lost when buffalo was decommissioned. Some detective work is needed to check the installation (https://github.com/wkiri/MTE/wiki/MTE-Web-Interface) and get this working on our current machines.
We are working with Tom Stein at the Geosciences node to make MTE target information searchable through the Analyst's Notebook (AN) for each mission.
Since our parser capabilities no longer depend on Solr, it would make sense to migrate the relevant parser scripts (and issues) to the MTE repository (leaving current versions of the parser scripts as-is in the parser-indexer repo).
Then, we can also remove the dependency on the parser-indexer repo in this repository.
We want to increase recall of Target types specifically. One way is to do a string-based matching from a gazette of entity terms. We can provide a gazette file that consists of "Entity_type Entity_name" pairs to inform the string matching.
Steven suggests integrating this capability into corenlp_parser.py. It would happen after applying the trained NER model and before running relation extraction.
-g <gazette_file>, which if specified would activate this capability
The pre_annotate.py script can be an inspiration for this capability (but does not need to be followed exactly).
grobid needs to be applied when new documents are ingested.
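The gazette-based string matching proposed above for corenlp_parser.py could look roughly like the following sketch. It assumes each gazette line holds an entity type, whitespace, then the entity name, and that existing NER output is available as `(start, end, label)` character spans. The function names and the span format are hypothetical and are not taken from corenlp_parser.py or pre_annotate.py.

```python
import re


def load_gazette(lines):
    """Parse 'Entity_type Entity_name' lines into (type, name) pairs."""
    entries = []
    for line in lines:
        fields = line.strip().split(None, 1)
        if len(fields) == 2:
            entries.append((fields[0], fields[1]))
    return entries


def gazette_match(text, entries, existing=()):
    """Return (start, end, type, name) for gazette names found in text.

    Matches that overlap an existing NER span (or an earlier, longer
    gazette match) are skipped, so the trained model's output wins.
    """
    taken = [(start, end) for start, end, _ in existing]
    found = []
    # Try longer names first so "Jake_Matijevic" is preferred over "Jake_M".
    for etype, name in sorted(entries, key=lambda entry: -len(entry[1])):
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            start, end = match.span()
            if any(start < t_end and t_start < end for t_start, t_end in taken):
                continue
            taken.append((start, end))
            found.append((start, end, etype, name))
    return sorted(found)
```

Running this step after NER and before relation extraction means the added Target entities would be visible to the relation classifier, which is where the recall gain is expected.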
The plan is to re-train the model using all available annotations (MSL, MPF, PHX).
This will allow multiple simultaneous runs of the MTE pipeline on the same machine.
When iterating over documents, it would be helpful to distinguish between errors specific to a document (such as an input file not being found) and errors that affect all documents (such as the CoreNLP server being unavailable or the NER model failing to load). In the former case, processing should advance to the next document. In the latter case, processing should stop with a message to the user on standard output indicating that the run was terminated prematurely (details can go in the log file).
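One way to express this distinction is with two exception classes, as in the sketch below. The class names `DocumentError` and `GlobalError`, the `process_all` driver, and the logger name are assumptions made for illustration. The existing parser scripts may structure their loops differently.

```python
import logging

logger = logging.getLogger("mte_pipeline")  # hypothetical logger name


class DocumentError(Exception):
    """Problem confined to one document (e.g. input file not found)."""


class GlobalError(Exception):
    """Problem affecting every document (e.g. CoreNLP server unavailable)."""


def process_all(doc_paths, process_one):
    """Run process_one on each path; return (done, skipped, aborted)."""
    done, skipped = [], []
    for path in doc_paths:
        try:
            process_one(path)
            done.append(path)
        except DocumentError as err:
            # Document-specific: record it and move on to the next document.
            logger.warning("Skipping %s: %s", path, err)
            skipped.append(path)
        except GlobalError as err:
            # Affects all documents: details go to the log, a short notice
            # goes to standard output, and the run stops.
            logger.error("Aborting at %s: %s", path, err)
            print("Run terminated prematurely; see the log file for details.")
            return done, skipped, True
    return done, skipped, False
```

The per-document function is then responsible for raising the right class, for example by wrapping a failed NER model load in `GlobalError`.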
Global problems:
The MTE consists of a separate database per mission. However, some documents mention targets from multiple missions. How should we accommodate these documents?
To help decide the best course of action, we would like to evaluate each option on the same test set. This test set should be composed of some documents from each of the labeled sets (MSL, MPF, PHX).
Let's use this issue to continue planning how to proceed.
jsre_parser.py to apply more than one model (see #34)
Scott said:
readme went to 1.2, collection_document_inventory.csv still had ::1.0.
The new parser-indexer-py populates several new fields with document meta-data using the ADS service. These fields start with "ads:". We want to update ingest_sqlite.py to use these fields, when available. Otherwise, fall back to the "grobid:" fields:
Lines 38 to 41 in c7fd597
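The fallback described above can be sketched as a small helper. The helper name `get_meta` and the example keys `ads:title` and `grobid:title` are placeholders, since the actual field names used by parser-indexer-py and ingest_sqlite.py are not shown here.

```python
def get_meta(doc, ads_key, grobid_key, default=None):
    """Return doc[ads_key] when present and non-empty, else doc[grobid_key].

    `doc` is the parsed JSON record for one document (a dict).
    """
    for key in (ads_key, grobid_key):
        value = doc.get(key)
        if value not in (None, "", []):
            return value
    return default


# Hypothetical record in which ADS returned no title.
doc = {"ads:title": "", "grobid:title": "Example title from grobid"}
title = get_meta(doc, "ads:title", "grobid:title")
```

Treating empty strings and empty lists as missing is a design choice in this sketch; it avoids storing a blank ADS value when grobid has usable content.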
Currently we are using version 1.14.0.0 (1E00).
To support non-LPSC content (like journals), we will need to update these functions:
Lines 93 to 113 in 3147233
Reviewer 1:
I reviewed the readme.txt file in the document collection and looked at the data files in the MER collection.
1) I understand that this dataset is structured like a relational database. However, I feel that a typical user would find it hard to use without some sort of user interface to connect the separate data files.
2) The documentation should add a note that MER2 corresponds to the Spirit rover that is also sometimes referred to as MERA.
3) The documentation should make it clear that the resulting target list is based on what has been published in LPSC abstracts and is not necessarily a complete list of targets defined and observed by the mission.
4) It seems to me that the structure of the documents product would make it hard to expand it to include journal publications. For example, it includes a conference name, which is not applicable to journals and an abstract number that may not be applicable to non-LPI conferences. Maybe the product should be renamed to abstracts or lpsc_abstracts.
Reviewer 2:
The data are presented from a ML / data science viewpoint and not from a planetary scientist (or general science user) viewpoint. The structure makes perfect sense for a programmer, but a scientist has to build a system to link from table to table to get any answers. An average science user who wants to find abstracts that reference a given target has to relink all of the tables using a variety of IDs along the way. Attention has to be paid to the aliases table to find any name variations like misspellings and abbreviations. It’s not that the structure is terrible, it’s just not friendly to the common user.
I think the targets.csv table should have columns something like (target_id, target_name, mte_target_id, mte_target_name, mission) and then include both the target_id and mte_target_id in the other tables. The target_id and _name refer to the canonical name, so the set of columns could be (target_id_canonical, target_name_canonical, target_id_mte, target_name_mte, mission).
Perhaps the MTE should include a “first order” table that quickly and easily gets users 90% of the way to a reference:
Plenty of repetition, but this sort of table is easily scanned by human eyes and can be loaded into Excel (or wherever) for column-based sorting. For a user looking at the archive volume as it is, the filename “target.csv” is glowing red-hot and carries the suggestion that the file is a list of official targets. After opening the file, the user sees lots of names with typos and extraneous-looking appendages.
Reviewer 3:
bundle_mars_target_encyclopedia.xml:
collection_mpf.xml, collection_phx.xml, and collection_document.xml:
data_mer/collection_mer2.xml:
document/collection_document.xml:
document/readme.xml:
MTE-schema.jpg and MTE-schema.xml
All data_mpf and data_phx XML labels:
data_mpf/has_property.xml, data_mpf/mentions.xml, data_mpf/targets.xml, data_phx/contains.xml, data_phx/documents.xml, data_phx/has_property.xml, data_phx/mentions.xml, data_phx/targets.xml.
data/mer/has_property.xml, data_mpf/has_property.xml, and data_phx/has_property.xml
data_mpf/properties.xml, data_mpf/sentences.xml, data_phx/sentences.xml
Scott VanBommel:
I think what is missing from the readme file is discussion of how the target names were determined, and how the canonical names were selected when they appear in the aliases list. (Scott's comment: this ties into the concern, which I share, that a user accepts the information presented as absolute truth - however, targets.csv and aliases.csv are neither authoritative nor comprehensive.)
Change the part in the readme file that says “when a target name is re-used” to “when a target name is used in multiple missions”
This model will supersede the current jSRE contains model, which was trained only on LPSC 2015 documents.
Scott VanBommel used version 2.1.0 of the validate tool and identified some issues that need investigation. The full output is included below.
PDS Validate Tool Report
Configuration:
Version 2.1.0
Date 2022-01-27T13:32:31Z
Parameters:
Targets [file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/]
Rule Type pds4.bundle
Severity Level WARNING
Recurse Directories true
File Filters Used [*.xml, *.XML]
Data Content Validation on
Product Level Validation on
Allow Unlabeled Files false
Max Errors 100000
Registered Contexts File C:\PDS\Tools\Validate\bin..\resources\registered_context_products.json
Product Level Validation Results
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/bundle_mars_target_encyclopedia.xml
1 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/aliases.xml
2 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/collection_mer2_inventory.xml
3 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/components.xml
4 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/contains.xml
5 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/documents.xml
6 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/has_property.xml
7 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/mentions.xml
8 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/properties.xml
9 product validation(s) completed
FAIL: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.xml
Begin Content Validation: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.csv
ERROR [error.validation.invalid_field_value] table 1, record 1884, field 3: The field value 'For example, Rayleigh fractional crystallization of Adirondack magma steadily increases incompatible element concentrations (K; ! D Kbulk " 0) and rapidly decreases compatible element concentrations (Ni; ! D Ni bulk >>1).' that starts with double quote should not contain double quote(s)
End Content Validation: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.csv
10 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/targets.xml
11 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/aliases.xml
12 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/collection_mpf_inventory.xml
13 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/components.xml
14 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/contains.xml
15 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/documents.xml
16 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/has_property.xml
17 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/mentions.xml
18 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/properties.xml
19 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/sentences.xml
20 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/targets.xml
21 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/aliases.xml
22 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/collection_phx_inventory.xml
23 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/components.xml
24 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/contains.xml
25 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/documents.xml
26 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/has_property.xml
27 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/mentions.xml
28 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/properties.xml
29 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/sentences.xml
30 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/targets.xml
31 product validation(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/document/collection_document_inventory.xml
32 product validation(s) completed
FAIL: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/document/readme.xml
ERROR [error.validation.internal_error] Error occurred while processing TEXT file content for readme.txt: String index out of range: -1
33 product validation(s) completed
PDS4 Bundle Level Validation Results
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/collection_mpf_inventory.xml
1 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/collection_phx_inventory.xml
2 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/document/collection_document_inventory.xml
3 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/bundle_mars_target_encyclopedia.xml
WARNING [warning.integrity.missing_context_reference] The context reference 'urn:nasa:pds:context:investigation:mission.mars_science_laboratory' could not be found in this bundle but it was defined in urn:nasa:pds:mars_target_encyclopedia:document::1.3. (Disable with --skip-context-reference-check flag)
WARNING [warning.integrity.missing_context_reference] The context reference 'urn:nasa:pds:context:instrument_host:spacecraft.msl' could not be found in this bundle but it was defined in urn:nasa:pds:mars_target_encyclopedia:document::1.3. (Disable with --skip-context-reference-check flag)
WARNING [warning.integrity.missing_context_reference] The context reference 'urn:nasa:pds:context:instrument:chemcam_libs.msl' could not be found in this bundle but it was defined in urn:nasa:pds:mars_target_encyclopedia:document::1.3. (Disable with --skip-context-reference-check flag)
4 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/components.xml
5 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/sentences.xml
6 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/aliases.xml
WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:aliases::1.0' is not a member of any collection within the given target
7 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/has_property.xml
8 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/components.xml
9 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/contains.xml
10 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/collection_mer2_inventory.xml
11 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/aliases.xml
12 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/contains.xml
13 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.xml
WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:sentences::1.0' is not a member of any collection within the given target
14 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/documents.xml
WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:documents::1.0' is not a member of any collection within the given target
15 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/mentions.xml
16 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/sentences.xml
17 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/components.xml
WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:components::1.0' is not a member of any collection within the given target
18 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/targets.xml
19 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/targets.xml
20 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/properties.xml
21 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/contains.xml
WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:contains::1.0' is not a member of any collection within the given target
22 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/has_property.xml
WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:has_property::1.0' is not a member of any collection within the given target
23 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/mentions.xml
WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:mentions::1.0' is not a member of any collection within the given target
24 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/properties.xml
25 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/mentions.xml
26 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/documents.xml
27 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/properties.xml
WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:properties::1.0' is not a member of any collection within the given target
28 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/documents.xml
29 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/has_property.xml
30 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/document/readme.xml
31 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/targets.xml
WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:targets::1.0' is not a member of any collection within the given target
32 integrity check(s) completed
PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/aliases.xml
33 integrity check(s) completed
Summary:
2 error(s)
12 warning(s)
Product Validation Summary:
31 product(s) passed
2 product(s) failed
0 product(s) skipped
Referential Integrity Check Summary:
33 check(s) passed
0 check(s) failed
0 check(s) skipped
Message Types:
1 error.validation.internal_error
1 error.validation.invalid_field_value
9 warning.integrity.unreferenced_member
3 warning.integrity.missing_context_reference
End of Report
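To investigate the invalid_field_value error in sentences.csv, a quick scan for fields that contain a double-quote character may help locate similar records. This is a minimal sketch that assumes the error is triggered by a quote character inside a quoted field. The function name is hypothetical, and the record numbers it reports count the header row as record 1, which may not match the numbering used by the validate tool.

```python
import csv
import io


def find_embedded_quotes(csv_file):
    """Yield (record_number, field_number, value) for fields containing '"'."""
    for rec_no, row in enumerate(csv.reader(csv_file), start=1):
        for field_no, value in enumerate(row, start=1):
            if '"' in value:
                yield rec_no, field_no, value


# Small in-memory example; for a real file, open it with newline="" first.
sample = io.StringIO('doc_id,sentence\n1,"K "" 0"\n2,plain text\n')
hits = list(find_embedded_quotes(sample))
```

Once the affected records are known, the quote characters can be replaced or removed before the bundle is regenerated and validated again.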
I noticed that "Calcium" was listed as both an "Element" (correct) and a "Mineral" (incorrect) in our MER-A (mer2) database. It turns out that this is not due to an incorrect annotation, but instead to an NER error. "Ca" is listed as a "Mineral" NER in the source .json file (/proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-all397.jsonl) when it is part of "Ca-sulfates" in 2006_1472. This is corrected in the annotations to be of type "Element", but the remove-orphans step in update_sqlite.py does not remove "Calcium" from the components table, for two reasons: "Calcium" still appears in the contains table (the element appears in at least one valid contains relation), and the components table is not refreshed based on the annotations (probably so that we can run update_sqlite.py several times to progressively add/update if desired?).
A possible solution would be for the remove-orphans step to regenerate components and properties at the end of processing so that they accurately reflect content in the documents at that point. However, it is worth more thought to determine if this is the best solution.
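As a rough illustration of the regeneration idea, the sketch below rebuilds a components table from the (name, label) pairs found in the reviewed annotations. The table and column names (`components`, `component_name`, `component_label`) are assumptions for this example and may not match the actual MTE schema. The same approach would apply to properties.

```python
import sqlite3


def regenerate_components(conn, annotated_components):
    """Replace the components table with the annotated (name, label) pairs."""
    pairs = sorted(set(annotated_components))
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM components")
        conn.executemany(
            "INSERT INTO components (component_name, component_label) "
            "VALUES (?, ?)",
            pairs,
        )


# In-memory demonstration with the stale "Mineral" row from the NER output.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE components (component_name TEXT, component_label TEXT)"
)
conn.executemany(
    "INSERT INTO components VALUES (?, ?)",
    [("Calcium", "Element"), ("Calcium", "Mineral")],
)
regenerate_components(conn, [("Calcium", "Element")])
rows = conn.execute(
    "SELECT component_name, component_label FROM components"
).fetchall()
```

A full rebuild like this drops anything added by earlier runs, so it would conflict with running update_sqlite.py several times to add content progressively unless the annotated pairs from all runs are passed in together.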
aliases table
src/
e.g., identify "container" and "containees" prior to applying relation classifier to improve precision; search for a "container" when "containee" is true to improve recall