mte,wkiri

Compare jSRE and unary relation classifier on MER-A docs

jSRE: /proj/mte/results/mer-a-jsre-v2-ads-gaz.jsonl
Unary classifier: /proj/mte/results/mer-a-unary-v2-ads-gaz.jsonl

Process journal papers and add content to MTE

The first step is to try parsing the journal documents @stevenlujpl already downloaded.
For some documents, we may need to process them multiple times, once for each mission whose targets are mentioned (see issue #22).

Generate initial annotations (w/Contains and HasProperty from jSRE) using the MTE pipeline
Perform expert review of Target, Contains, and HasProperty annotations. Guidelines: https://docs.google.com/document/d/1KnkVtxkKb9kcRVKZqPWwuXSi_6zK8Scay1M6qiz1Khc/edit#
For each mission, update aliases table (if any updates are needed)
For each mission, concatenate LPSC + journal .jsonl files and use them for ingest_sqlite.py
For each mission, run update_sqlite.py twice: once with LPSC annotations and once with journal paper annotations
Check contents via MTE mission-specific websites
Generate MTE bundle v3.0 with MER-B and journal content added

Advertise availability of MTE content (PDS, Analyst's Notebook)

(after it is available on the AN / delivered to the PDS)

Planetary Exploration Newsletter (PEN): http://planetarynews.org/
JPL Mars Forum
MER mission scientists / engineers / interested parties (John Callas, Amitabha Ghosh, et al.)

Add script to propagate entity labels from one set of .ann files to another

Example use case: Because targets from different missions can appear in the same document, we have some pre-existing labels for MER and MSL targets already completed in the MPF and PHX collections. To avoid re-doing this work, we should copy over these already reviewed target names from the MPF/PHX .ann files to the new MER and MSL collections prior to human review.

Collection	MER targets	MSL targets
MPF	64	87
PHX	4	127

Sources

MPF: /var/www/brat/data/mpf/all-reviewed+properties-v2/
PHX: /var/www/brat/data/phx/all-reviewed+properties-v2/

The script would need to change e.g. Target-MER to Target when copying the entity annotation to the MER collection.

Envisioned solution:

$ propagate_entity_annotations.py <source_dir> <dest_dir> <source_entity_type> <dest_entity_type>
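A minimal sketch of the core logic, assuming the standard brat standoff format for text-bound annotations (`T<id>`, a tab, `<type> <start> <end>`, a tab, then the text). The function name `propagate_entities` is hypothetical, and the choice to skip any span that is already annotated in the destination (regardless of its type) is an assumption, not a settled design decision:

```python
def propagate_entities(source_ann, dest_ann, source_type, dest_type):
    """Copy entities of source_type from one .ann file's contents to another.

    Both arguments are the text contents of brat .ann files for the same
    document. Returns the updated destination contents.
    """
    dest_lines = [line for line in dest_ann.splitlines() if line.strip()]
    seen_spans = set()
    max_id = 0
    for line in dest_lines:
        if line.startswith('T'):
            ann_id, type_and_span, _ = line.split('\t', 2)
            seen_spans.add(type_and_span.split(' ', 1)[1])
            max_id = max(max_id, int(ann_id[1:]))

    for line in source_ann.splitlines():
        if not line.startswith('T'):
            continue  # only text-bound (entity) annotations are copied
        _, type_and_span, text = line.split('\t', 2)
        ent_type, span = type_and_span.split(' ', 1)
        if ent_type != source_type or span in seen_spans:
            continue  # wrong type, or span already annotated in destination
        max_id += 1
        dest_lines.append('T%d\t%s %s\t%s' % (max_id, dest_type, span, text))
        seen_spans.add(span)
    return '\n'.join(dest_lines) + '\n'


source = "T1\tTarget-MER 10 20\tAdirondack\nT2\tElement 30 32\tFe\n"
dest = "T1\tElement 30 32\tFe\n"
result = propagate_entities(source, dest, 'Target-MER', 'Target')
```

Renumbering the copied entities from the destination's highest existing ID avoids ID collisions, and running the function a second time on its own output adds nothing new.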

Assemble MER-A (Spirit) target list

Submit LPSC 2022 abstract

website: https://www.hou.usra.edu/meetings/lpsc2022/
due: Jan. 11, 2022
Goal: describe and report MER-A and MER-B results and new database

Add aliases table to MTE DB

This would allow us to link known aliases of the same target, like Jake_M and Jake_Matijevic.
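As a rough illustration of the idea, here is a sketch using an in-memory SQLite database. The column names `target_name` and `canonical_name` and the helper `canonical_name()` are assumptions for this example; the actual MTE schema may differ:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Assumed schema: each known alias maps to one canonical target name.
conn.execute('CREATE TABLE aliases ('
             'target_name TEXT PRIMARY KEY, '
             'canonical_name TEXT NOT NULL)')
conn.execute('INSERT INTO aliases VALUES (?, ?)',
             ('Jake_M', 'Jake_Matijevic'))


def canonical_name(conn, target_name):
    """Return the canonical name for a target, or the name itself."""
    row = conn.execute(
        'SELECT canonical_name FROM aliases WHERE target_name = ?',
        (target_name,)).fetchone()
    return row[0] if row else target_name


resolved = canonical_name(conn, 'Jake_M')
unchanged = canonical_name(conn, 'Jake_Matijevic')
```

Targets with no alias entry resolve to themselves, so queries can always go through the lookup.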

Support HasProperty relations in `jsre_parser.py`

Respond to Tom Stein's comments on MER-A content

Tom Stein provided feedback on the MER-A content in the MTE:

1. A canonical table of targets is missing. Of course, one was not provided by the mission. However, including target names with typos in the target table is confusing. A variety of target names exists:
- Defined in the mission planning tool. Includes proposed targets that were named but never used.
- Defined in team reports and planning documents like the Quill reports. Target name use is irregular and names diverge. Often these targets are not added into the mission planning tool.
- Defined in literature.
2. The readme file says the targets table includes all targets named by the mission. What is the origin of this list? Need source reference. As described, the targets.csv comes across as the authoritative list of mission target names. However, some documented targets are not included (e.g., John_Sutter and Pioneer_Dunes are not in the MER2 list). Other entries are clearly typos or abbreviations.
3. Some typo/abbreviations in the aliases table are in the targets table, but not all. These MER2 targets are in the aliases table but not targets table: Gerturdeweise, Haskin’s_Ridge, Pot_O_Gold, Winterhaven.
4. The MER aliases table includes entries where target_name = canonical_name (Winter_Haven, Tyrone_Nodules). Is this expected?
5. What is the metric for adding an entry to the aliases table? For example, how do you know that Wishtone and Wishstone are two different targets for MER2? Or is an entry missing from the aliases table? Another MER2 example is Mason_Dixon and Masondixon.
6. The MER2 table “Prl:pasorobleslight” seems like an invalid name. Automation error?
7. The MER2 mentions table target “Pr2” is not included in the aliases or targets table. A disconnect?
8. A diagram showing the relationships between tables would be super helpful. The text descriptions are sufficient, but a visualization would be a big plus.

Update jsre_parser.py to skip jSRE if no records to classify

Check/update search interfaces at https://mte.jpl.nasa.gov/

It is likely that something was lost when buffalo was decommissioned. Some detective work is needed to check the installation (https://github.com/wkiri/MTE/wiki/MTE-Web-Interface) and get this working on our current machines.

Integrate MTE content into Analyst's Notebooks

We are working with Tom Stein at the Geosciences node to make MTE target information searchable through the Analyst's Notebook (AN) for each mission.

Phoenix Analyst's Notebook: https://an.rsl.wustl.edu/phx/solbrowser/browserFr.aspx?tab=solsumm - select a sol, then "Targets and Features" to browse. MTE content can be linked to each target.
Pathfinder does not yet have an AN, so this will be created from scratch.

Update MTE to include content from LPSC 2021 for MPF and PHX

Download LPSC 2021 PDF files
Extract information (NER and relations) for MPF and PHX
Review/edit annotations
Deliver updated DBs to PDS

Copy parser files from parser-indexer to this repository

Since our parser capabilities no longer depend on Solr, it would make sense to migrate the relevant parser scripts (and issues) to the MTE repository (leaving current versions of the parser scripts as-is in the parser-indexer repo).

Then, we can also remove the dependency on the parser-indexer repo in this repository.

Add ability to auto-annotate entities using a gazette

We want to increase recall of Target types specifically. One way is to do string-based matching against a gazette of entity terms. We can provide a gazette file that consists of "Entity_type Entity_name" pairs to inform the string matching.

Steven suggests integrating this capability into corenlp_parser.py. It would happen after applying the trained NER model and before running relation extraction.

Suggest adding an option like -g <gazette_file> which if specified would activate this capability
We don't want to generate duplicate annotations, so this step should check to see if a given entity is already marked by the NER system and if so, omit adding the new entity annotation.
The pre_annotate.py script can be an inspiration for this capability (but does not need to be followed exactly).
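The matching and duplicate check could look roughly like the sketch below. The function names `load_gazette` and `gazette_annotate` are hypothetical (they are not part of corenlp_parser.py), and entities are reduced to character offsets for simplicity; whole-word matching and longest-name-first ordering are assumptions about the desired behavior:

```python
import re


def load_gazette(lines):
    """Parse "Entity_type Entity_name" pairs, one pair per line."""
    entries = []
    for line in lines:
        line = line.strip()
        if line:
            ent_type, ent_name = line.split(None, 1)
            entries.append((ent_type, ent_name))
    return entries


def gazette_annotate(text, entries, existing_spans):
    """Return gazette matches that do not overlap already-marked entities.

    existing_spans holds (start, end) character offsets from the NER step.
    """
    occupied = list(existing_spans)
    new_entities = []
    # Try longer names first so a short name cannot claim part of a longer one.
    for ent_type, ent_name in sorted(entries, key=lambda e: -len(e[1])):
        pattern = r'\b' + re.escape(ent_name) + r'\b'
        for match in re.finditer(pattern, text):
            start, end = match.span()
            if any(start < e and s < end for s, e in occupied):
                continue  # already marked by NER or an earlier gazette hit
            occupied.append((start, end))
            new_entities.append((ent_type, start, end, match.group()))
    return sorted(new_entities, key=lambda ent: ent[1])


text = 'Adirondack and Humphrey are basaltic rocks.'
entries = load_gazette(['Target Adirondack', 'Target Humphrey'])
existing = [(0, 10)]  # "Adirondack" was already found by the NER model
new_entities = gazette_annotate(text, entries, existing)
```

Here only "Humphrey" is added, because "Adirondack" overlaps an entity the NER model already marked.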

Conduct user study of MTE content

Plan study - what we want to measure
- Correctness of content (for some specified queries; for queries of their choice) - yes/no for e.g. 10 queries
- Findability (of content of personal interest) ("could you find it?" and "how long did it take?")
- Utility ("would you find this resource helpful in your research?") - Likert scale
- Expansion ("what changes/features would you want to make this even more useful?") - text box
Devise survey or other information-gathering form
Recruit participants: Mars surface mission team members, planetary scientists that were not on a Mars surface mission, scientists from other fields
Collect and analyze results

Document ingestion does not populate document title, venue

grobid needs to be applied when new documents are ingested.

Integrate unary parser into MTE pipeline

Train CoreNLP named entity recognizer for use with MER documents

The plan is to re-train the model using all available annotations (MSL, MPF, PHX).

First, do cross-validation to get a sense of generalization capability. We no longer need to divide train/dev/test temporally. Include Property as an entity type.
Re-train over all annotations (including Property).
Augment with MER-A Target gazette for a MER-A classifier
Augment with MER-B Target gazette for a MER-B classifier
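For the cross-validation step, a simple document-level k-fold split might be enough, since we no longer split temporally. The sketch below is only an illustration: `k_fold_splits` and the document IDs are made up, and the actual training and scoring of the CoreNLP model are left out:

```python
def k_fold_splits(doc_ids, k=5):
    """Yield (train, test) lists of document IDs for each of k folds."""
    docs = sorted(doc_ids)
    # Deal documents into k folds round-robin so fold sizes differ by <= 1.
    folds = [docs[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test


doc_ids = ['doc%02d' % i for i in range(10)]  # placeholder document IDs
splits = list(k_fold_splits(doc_ids, k=5))
```

Splitting by document (rather than by sentence) keeps all sentences from one abstract on the same side of the train/test boundary.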

Add process id to jSRE temporary filename

This will allow multiple simultaneous runs of the MTE pipeline on the same machine.
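One way to do this, sketched under the assumption that the temporary file lives in the system temp directory (the prefix, suffix, and function name here are invented for illustration):

```python
import os
import tempfile


def jsre_temp_filename(prefix='jsre_examples', suffix='.txt', tmp_dir=None):
    """Build a temporary filename that is unique per running process."""
    tmp_dir = tmp_dir or tempfile.gettempdir()
    return os.path.join(tmp_dir, '%s_%d%s' % (prefix, os.getpid(), suffix))


tmp_path = jsre_temp_filename()
```

Python's `tempfile.NamedTemporaryFile` would be an alternative if the name does not need to be predictable.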

Parser improvement: terminate run on global problems

When iterating over documents, it would be helpful to distinguish between errors specific to a document (such as when an input file is not found) and errors that affect all documents (such as the CoreNLP server being unavailable or the NER model failing to load). In the former case, processing should advance to the next document. In the latter case, processing should stop with a message to the user on standard output indicating that the run was terminated prematurely (details can be in the log file).

Global problems:

CoreNLP unavailable
NER model cannot be read
grobid unavailable
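A possible structure, as a sketch only: a dedicated exception type marks global problems, and the document loop treats everything else as document-specific. `GlobalProblem`, `process_documents`, and `fake_parse` are hypothetical names, and the real parser would need to raise `GlobalProblem` wherever it detects one of the conditions listed above:

```python
import logging

logger = logging.getLogger('mte_parser')


class GlobalProblem(Exception):
    """Hypothetical marker for errors that affect every document."""


def process_documents(doc_paths, parse_one):
    """Parse each document; skip local failures, stop on global ones."""
    done, failed = [], []
    for path in doc_paths:
        try:
            parse_one(path)
            done.append(path)
        except GlobalProblem as err:
            logger.error('Global problem while parsing %s: %s', path, err)
            print('Run terminated prematurely: %s' % err)
            break
        except Exception as err:
            logger.warning('Skipping %s: %s', path, err)
            failed.append(path)
    return done, failed


def fake_parse(path):
    # Stand-in for the real per-document parser.
    if path == 'missing.pdf':
        raise FileNotFoundError(path)
    if path == 'c.pdf':
        raise GlobalProblem('CoreNLP server unavailable')


done, failed = process_documents(
    ['a.pdf', 'missing.pdf', 'c.pdf', 'd.pdf'], fake_parse)
```

In this example the missing file is skipped, and the run stops at the third document without attempting the fourth.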

Document ingestion does not remove references

MTE processing of documents that mention multiple missions

The MTE consists of a separate database per mission. However, some documents mention targets from multiple missions. How should we accommodate these documents?

Independently apply each mission's NER model + shared jSRE model. However, the review process will be tedious because each document has to be reviewed #missions times.
Train a merged Target NER model across all missions. How then to separate them for each mission's database?
Train a merged model to distinguish between targets per mission (different entity types for Target-MPF, Target-PHX, etc.). I suspect this will not have high accuracy and would require a lot of manual editing to reassign targets to their correct mission. However, it is worth evaluating.

To help decide the best course of action, we would like to evaluate each option on the same test set. This test set should be composed of some documents from each of the labeled sets (MSL, MPF, PHX).

Let's use this issue to continue planning how to proceed.

Train jSRE model for "has_property" relation

Train jSRE model for "has_property" relation using all labeled MPF and PHX docs
It would be good to create a script for training a new jSRE model, instead of referring to the wiki page where the process is documented.
We need a separate jSRE model for each relation used. This means we'll also need to update jsre_parser.py to apply more than one model (see #34)

Check bundle delivery script versioning for README file

Scott said:

readme went to 1.2, collection_document_inventory.csv still had ::1.0.

Update ingest_sqlite.py to use ADS document meta-data fields

The new parser-indexer-py populates several new fields with document meta-data using the ADS service. These fields start with "ads:". We want to update ingest_sqlite.py to use these fields, when available. Otherwise, fall back to the "grobid:" fields:

MTE/src/ingest_sqlite.py

Lines 38 to 41 in c7fd597

    
    'title': rec['metadata'].get('grobid:header_Title', ''),
    'authors': rec['metadata'].get('grobid:header_Authors', ''),
    'primary_author': '',
    'affiliations': rec['metadata'].get('grobid:header_FullAffiliations', ''),
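The fallback could be factored into a small helper, sketched below. The helper name `get_meta` is hypothetical, and the "ads:" key names used here (`ads:title`, `ads:author`) are placeholders; the real field names should be taken from the parser-indexer-py output:

```python
def get_meta(metadata, ads_key, grobid_key, default=''):
    """Prefer the ADS field when present and non-empty, else use grobid."""
    value = metadata.get(ads_key)
    if value:
        return value
    return metadata.get(grobid_key, default)


# Example record; the 'ads:' key names are assumptions for illustration.
rec = {'metadata': {'ads:title': 'Title from ADS',
                    'grobid:header_Title': 'Title from grobid',
                    'grobid:header_Authors': 'Authors from grobid'}}

title = get_meta(rec['metadata'], 'ads:title', 'grobid:header_Title')
authors = get_meta(rec['metadata'], 'ads:author', 'grobid:header_Authors')
```

Treating an empty ADS value the same as a missing one means a blank ADS field does not hide usable grobid content.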

Update MTE PDS bundle format to use latest PDS information model

Currently we are using version 1.14.0.0 (1E00).

As of Jan. 2021, version 1.15.0.0 is available: https://pds.nasa.gov/datastandards/documents/im/current/index_1F00.html It seems it came out on Dec. 23, 2020. The changes are (briefly) described here: https://pds.nasa.gov/datastandards/documents/im/v1/PDS4_PDS_1F00_Release_Notes.pdf
Also, version 1.16.0.0 is planned for June 2021.

Disable default NER models in CoreNLP output

Update ingest_sqlite.py to remove LPSC-specific content

To support non-LPSC content (like journals), we will need to update these functions:

MTE/src/ingest_sqlite.py

Lines 93 to 113 in 3147233

    
    # Document feature update functions by Thamme Gowda (from insert_docs.py)
    def construct_doc_url(rec):
        if rec['year'] < 2000:
            rec['doc_url'] = 'http://www.lpi.usra.edu/meetings/' + \
                             ('LPSC%s/pdf/' % (rec['year'] - 1900)) + \
                             str(rec['abstract']) + '.pdf'
        elif rec['year'] <= 2017: # 2000 through 2017
            rec['doc_url'] = 'http://www.lpi.usra.edu/meetings/' + \
                             ('lpsc%s/pdf/' % rec['year']) + \
                             str(rec['abstract']) + '.pdf'
        else: # 2018 and later
            rec['doc_url'] = 'http://www.hou.usra.edu/meetings/' + \
                             ('lpsc%s/pdf/' % rec['year']) + \
                             str(rec['abstract']) + '.pdf'
        return rec

    def update_doc_venue(rec):
        rec['venue'] = 'Lunar and Planetary Science Conference, ' + \
                       ('Abstract #%d' % rec['abstract'])
        return rec

Address PDS feedback on MER-A bundle

Reviewer 1:

I reviewed the readme.txt file in the document collection and looked at the data files in the MER collection.

1) I understand that this dataset is structured like a relational database. However, I feel that a typical user would find it hard to use without some sort of user interface to connect the separate data files.
2) The documentation should add a note that MER2 corresponds to the Spirit rover that is also sometimes referred to as MERA.
3) The documentation should make it clear that the resulting target list is based on what has been published in LPSC abstracts and is not necessarily a complete list of targets defined and observed by the mission.
4) It seems to me that the structure of the documents product would make it hard to expand it to include journal publications. For example, it includes a conference name, which is not applicable to journals and an abstract number that may not be applicable to non-LPI conferences. Maybe the product should be renamed to abstracts or lpsc_abstracts.

Components.csv:

- Some listed components are neither elements nor minerals, for example Hawaiite and Hyaloclastite.
- Abbreviations should be expanded. For example, Feot to FeO-total, px to pyroxene.
- Singular and plural entries of the same thing should be combined. The reason for having both pyroxene and pyroxenes, for example, is not clear. If it is important, please explain. Another example is Npox and Np-ox.
- Feot and Feotot are listed as minerals. The abstract could be referring to the total iron oxide content in the composition of a sample and not necessarily to an iron-oxide mineral.
- It is not clear why the second component of chemical compounds is shown in lower case and not in the standard chemical notation for elements. This also applies to the contains product.

Reviewer 2:

The data are presented from a ML / data science viewpoint and not from a planetary scientist (or general science user) viewpoint. The structure makes perfect sense for a programmer, but a scientist has to build a system to link from table to table to get any answers. An average science user who wants to find abstracts that reference a given target has to relink all of the tables using a variety of IDs along the way. Attention has to be paid to the aliases table to find any name variations like misspellings and abbreviations. It’s not that the structure is terrible, it’s just not friendly to the common user.

I think the targets.csv table should have columns something like (target_id, target_name, mte_target_id, mte_target_name, mission) and then include both the target_id and mte_target_id in the other tables. The target_id and _name refer to the canonical name, so the set of columns could be (target_id_canonical, target_name_canonical, target_id_mte, target_name_mte, mission).

Perhaps the MTE should include a “first order” table that quickly and easily gets users 90% of the way to a reference:

ID columns (the target names and IDs)
Reference type (literature mention, element/mineral/property)
Reference value (for element/mineral/property: “oxygen”, “carbonate”, “lag_deposit”, etc.)
Document reference (id, title, author, year, URL)

Plenty of repetition, but this sort of table is easily scanned by human eyes and can be loaded into Excel (or wherever) for column-based sorting. For a user looking at the archive volume as it is, the filename “target.csv” is glowing red-hot and carries the suggestion that the file is a list of official targets. After opening the file, the user sees lots of names with typos and extraneous-looking appendages.

In short, my primary recommendation is that the science community probably is better served if the MTE product has two distinct parts: a user-friendly table (described above) and the formal computer science portion that is not the first thing a normal user would see.

Reviewer 3:

bundle_mars_target_encyclopedia.xml:

- I would make the bundle V2.0, since an entire collection was added.
- Modification_Detail description should say that the data_mer2 collection was added.
Maybe something like, "Added data_mer2 collection. Also added aliases table to data_mpf and data_phx collections" if that's accurate.

collection_mpf.xml, collection_phx.xml, and collection_document.xml:

- Version 1.2 of these files online are named collection_mpf_inventory.xml, collection_phx_inventory.xml, and collection_document_inventory.xml. Changing the file names is going to mess things up at the EN. If you insist on changing them, I think they might have to start over at Version 1.0, but we'd need to check with Richard. I'd just stick with the online names, and follow the same naming convention for the new MER2 inventory (collection_mer2_inventory.xml).
Recommendation is to revert to the “inventory” naming convention because the online files have already been registered with EN, and this might cause problems. Also, it’s fine to have “inventory” in the collection label filename. Many of the bundles at GEO use this convention.

data_mer/collection_mer2.xml:

- Rename to collection_mer2_inventory.xml per above.
- Change to DOS line endings.
- <start_date_time>1997-07-04Z</start_date_time>
<stop_date_time>2020-03-16Z</stop_date_time>
This can't be right.
I noticed that all the labels (including MPF and PHX) have these dates (the bundle XML is slightly different).
Where do these dates come from? Shouldn't they be mission-specific?
Scott: I agree with the reviewer and think what they were getting at was to have the bundle span all time, and the individual collections span only from when the mission data starts (i.e., MER starts in 03 or 04, not 97).

document/collection_document.xml:

- Modification_Detail description should say that the MTE-schema.jpg file was added and readme.txt was updated. "Add aliases table" should be removed.

document/readme.xml:

- Modification_Detail description should say something like "Updated to reflect addition of aliases table and mer2 data collection."

MTE-schema.jpg and MTE-schema.xml

- Lowercase the file names. Do same for pointer in MTE-schema.xml.
Not a requirement, but we favour lowercase filenames at GEO. Also, this is the only file in the bundle that is not lowercase. LIDs have to be lowercase, so we try to have filenames match.

All data_mpf and data_phx XML labels:

Why was the <modification_date> for V1.2 changed from 2021-06-07 to 2021-06-01? The labels online have 2021-06-07.

data_mpf/has_property.xml, data_mpf/mentions.xml, data_mpf/targets.xml, data_phx/contains.xml, data_phx/documents.xml, data_phx/has_property.xml, data_phx/mentions.xml, data_phx/targets.xml.

- V1.3 Modification_Detail descriptions say "Add aliases table." These do not reflect the edits that were actually made to the individual files.

data/mer/has_property.xml, data_mpf/has_property.xml, and data_phx/has_property.xml

- Change to DOS line endings.

data_mpf/properties.xml, data_mpf/sentences.xml, data_phx/sentences.xml

Should be V1.3, since changes were made to the files. Remember to update the inventory file also.

Scott VanBommel:

I think what is missing from the readme file is discussion of how the target names were determined, and how the canonical names were selected when they appear in the aliases list. (Scott's comment: this ties into the concern, which I share, that a user accepts the information presented as absolute truth - however, targets.csv and aliases.csv are neither authoritative nor comprehensive.)
Change the part in the readme file that says “when a target name is re-used” to “when a target name is used in multiple missions”

Update jSRE model for "contains" relation using all annotated docs

This model will supersede the current jSRE contains model, which was trained only on LPSC 2015 documents.

Address/correct MTE 1.3.0 validation errors

Scott VanBommel used version 2.1.0 of the validate tool and identified some issues that need investigation. The full output is included below.

PDS Validate Tool Report

Configuration:
Version 2.1.0
Date 2022-01-27T13:32:31Z

Parameters:
Targets [file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/]
Rule Type pds4.bundle
Severity Level WARNING
Recurse Directories true
File Filters Used [*.xml, *.XML]
Data Content Validation on
Product Level Validation on
Allow Unlabeled Files false
Max Errors 100000
Registered Contexts File C:\PDS\Tools\Validate\bin\..\resources\registered_context_products.json

Product Level Validation Results

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/bundle_mars_target_encyclopedia.xml
1 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/aliases.xml
2 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/collection_mer2_inventory.xml
3 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/components.xml
4 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/contains.xml
5 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/documents.xml
6 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/has_property.xml
7 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/mentions.xml
8 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/properties.xml
9 product validation(s) completed

FAIL: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.xml
Begin Content Validation: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.csv

ERROR [error.validation.invalid_field_value] table 1, record 1884, field 3: The field value 'For example, Rayleigh fractional crystallization of Adirondack magma steadily increases incompatible element concentrations (K; ! D Kbulk " 0) and rapidly decreases compatible element concentrations (Ni; ! D Ni bulk >>1).' that starts with double quote should not contain double quote(s)
End Content Validation: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.csv
10 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/targets.xml
11 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/aliases.xml
12 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/collection_mpf_inventory.xml
13 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/components.xml
14 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/contains.xml
15 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/documents.xml
16 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/has_property.xml
17 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/mentions.xml
18 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/properties.xml
19 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/sentences.xml
20 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/targets.xml
21 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/aliases.xml
22 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/collection_phx_inventory.xml
23 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/components.xml
24 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/contains.xml
25 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/documents.xml
26 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/has_property.xml
27 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/mentions.xml
28 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/properties.xml
29 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/sentences.xml
30 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/targets.xml
31 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/document/collection_document_inventory.xml
32 product validation(s) completed

FAIL: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/document/readme.xml
ERROR [error.validation.internal_error] Error occurred while processing TEXT file content for readme.txt: String index out of range: -1
33 product validation(s) completed

PDS4 Bundle Level Validation Results

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/collection_mpf_inventory.xml
1 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/collection_phx_inventory.xml
2 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/document/collection_document_inventory.xml
3 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/bundle_mars_target_encyclopedia.xml

Summary:

2 error(s)
12 warning(s)

Product Validation Summary:
31 product(s) passed
2 product(s) failed
0 product(s) skipped

Referential Integrity Check Summary:
33 check(s) passed
0 check(s) failed
0 check(s) skipped

Message Types:
1 error.validation.internal_error
1 error.validation.invalid_field_value
9 warning.integrity.unreferenced_member
3 warning.integrity.missing_context_reference

End of Report
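For the sentences.csv error, the validator message suggests that a field enclosed in double quotes must not itself contain double quotes. One possible fix, sketched below, is to replace embedded double quotes before writing the CSV. Whether substitution with a single quote is acceptable for the archived sentences is an open question, and `sanitize_field` and `write_rows` are hypothetical names rather than existing MTE functions:

```python
import csv
import io


def sanitize_field(value):
    """Replace embedded double quotes, which the validator rejects."""
    return value.replace('"', "'")


def write_rows(rows, out):
    """Write rows as CSV with DOS line endings and sanitized text fields."""
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator='\r\n')
    for row in rows:
        writer.writerow([sanitize_field(v) if isinstance(v, str) else v
                         for v in row])


out = io.StringIO()
write_rows([[1884, 'K; ! D Kbulk " 0, and more']], out)
csv_text = out.getvalue()
```

The field is still quoted in the output because it contains a comma, but the only double quotes left are the enclosing pair.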

Remove spurious Components/Properties

I noticed that "Calcium" was listed as both an "Element" (correct) and a "Mineral" (incorrect) in our MER-A (mer2) database. It turns out that this is due not to an incorrect annotation but to an NER error: "Ca" is labeled as a "Mineral" entity in the source .jsonl file (/proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-all397.jsonl) where it is part of "Ca-sulfates" in 2006_1472. The annotations correct this to type "Element". However, the remove-orphans step in update_sqlite.py does not remove the spurious "Calcium" entry from the components table, for two reasons. First, "Calcium" still appears in the contains table (because the element appears in at least one valid contains relation). Second, the components table is not refreshed based on the annotations (probably so that we can run update_sqlite.py several times to progressively add or update content, if desired).

A possible solution would be for the remove-orphans step to regenerate components and properties at the end of processing so that they accurately reflect content in the documents at that point. However, it is worth more thought to determine if this is the best solution.
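To make the proposal concrete, here is a sketch of regenerating the components table from the final reviewed annotations. The table and column names (`components`, `component_name`, `component_label`) and the function `regenerate_components` are assumptions for illustration and may not match the actual MTE schema or update_sqlite.py:

```python
import sqlite3


def regenerate_components(conn, reviewed_components):
    """Rebuild the components table from reviewed (name, label) pairs."""
    pairs = sorted(set(reviewed_components))
    with conn:  # commit on success, roll back on error
        conn.execute('DELETE FROM components')
        conn.executemany(
            'INSERT INTO components (component_name, component_label) '
            'VALUES (?, ?)', pairs)


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE components ('
             'component_name TEXT, component_label TEXT)')
# State before the fix: "Calcium" is present with a spurious Mineral label.
conn.executemany('INSERT INTO components VALUES (?, ?)',
                 [('Calcium', 'Element'), ('Calcium', 'Mineral')])

# Pairs that remain after expert review (example values only).
reviewed = [('Calcium', 'Element'), ('Calcium', 'Element'),
            ('Hematite', 'Mineral')]
regenerate_components(conn, reviewed)
rows = conn.execute('SELECT component_name, component_label '
                    'FROM components ORDER BY 1, 2').fetchall()
```

A full rebuild like this would remove the spurious Mineral entry, but it would also discard components added by earlier runs unless their annotations are included, which is the trade-off that needs more thought.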

Generate MER-A PDS4 bundle

Generate initial MER-A annotations (w/Contains from jSRE) using the MTE pipeline
Add HasProperty annotations (train and apply jSRE model) #19
Perform expert review of Target, Contains, and HasProperty annotations. Guidelines: https://docs.google.com/document/d/1KnkVtxkKb9kcRVKZqPWwuXSi_6zK8Scay1M6qiz1Khc/edit#
Create Aliases table? #9
Create MER-A bundle template
Generate MER-A bundle

Assemble MER-B (Opportunity) target list

Generate MTE PDS4 bundle with MER-B content

Generate initial MER-B annotations (w/Contains and HasProperty from jSRE) using the MTE pipeline
Perform expert review of Target, Contains, and HasProperty annotations. Guidelines: https://docs.google.com/document/d/1KnkVtxkKb9kcRVKZqPWwuXSi_6zK8Scay1M6qiz1Khc/edit#
Populate aliases table
Generate SQLite database for MER-A (some items were updated) and MER-B
Check contents via MTE website
Generate MTE bundle with MER-B (mer1) content added
- Note: PDS made us go to version 2.0 when we added mer2, so likely we should advance the bundle to version 3.0 with mer1. (Individual files will advance versions only as needed)

Documentation

Add README to src/
- Steven: describe how to parse a document (full details on wiki page)
- Kiri: describe how to create SQLite database (ingest, review with json2brat, update)
Describe how to deliver PDS bundle - this has many steps and probably should be a wiki page instead; will be copied from github-fn wiki page
Update wiki to remove mentions of Solr and replace with SQLite

Prepare LPSC 2022 poster

Draft poster and share with co-authors
Submit to doc review by Feb. 23
March 2: latest date to "publish" poster online
March 9: poster session presentation

Develop improved relation extraction methods

e.g., identify "container" and "containees" prior to applying relation classifier to improve precision; search for a "container" when "containee" is true to improve recall

Compare to jSRE
Compare to PURE

