
mte's People

Contributors

karanjeets, stevenlujpl, thammegowda, wkiri, yyzhuang1991


mte's Issues

Process journal papers and add content to MTE

The first step is to try parsing the journal documents @stevenlujpl already downloaded.
For some documents, we may need to process them multiple times, once for each mission whose targets are mentioned (see issue #22).

  • Generate initial annotations (w/Contains and HasProperty from jSRE) using the MTE pipeline
  • Perform expert review of Target, Contains, and HasProperty annotations. Guidelines: https://docs.google.com/document/d/1KnkVtxkKb9kcRVKZqPWwuXSi_6zK8Scay1M6qiz1Khc/edit#
  • For each mission, update aliases table (if any updates are needed)
  • For each mission, concatenate LPSC + journal .jsonl files and use them for ingest_sqlite.py
  • For each mission, run update_sqlite.py twice: once with LPSC annotations and once with journal paper annotations
  • Check contents via MTE mission-specific websites
  • Generate MTE bundle v3.0 with MER-B and journal content added

Add script to propagate entity labels from one set of .ann files to another

Example use case: Because targets from different missions can appear in the same document, we already have reviewed labels for MER and MSL targets in the MPF and PHX collections. To avoid re-doing this work, we should copy these already-reviewed target names from the MPF/PHX .ann files into the new MER and MSL collections prior to human review.

Collection    MER targets    MSL targets
MPF           64             87
PHX           4              127

Sources

  • MPF: /var/www/brat/data/mpf/all-reviewed+properties-v2/
  • PHX: /var/www/brat/data/phx/all-reviewed+properties-v2/

The script would need to change e.g. Target-MER to Target when copying the entity annotation to the MER collection.

Envisioned solution:

$ propagate_entity_annotations.py <source_dir> <dest_dir> <source_entity_type> <dest_entity_type>
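
A minimal sketch of what such a script could look like, assuming the standard single-span brat .ann entity format (T<id>, entity type, start/end offsets, text) and that the source and destination collections contain the same document text so offsets carry over; names and structure here are illustrative only, not the final design:

# Hypothetical sketch of propagate_entity_annotations.py, not the final script.
# Copies entity annotations of one type from a source brat collection into the
# matching .ann files of a destination collection, renaming the entity type
# (e.g., Target-MER -> Target).  Assumes standard single-span brat entity lines
# ("T<id>\t<Type> <start> <end>\t<text>") and identical document text in both
# collections, so character offsets carry over unchanged.
import glob
import os
import sys


def read_entities(ann_file):
    entities = []
    with open(ann_file) as f:
        for line in f:
            # Skip non-entity lines and discontinuous spans (which contain ';')
            if line.startswith('T') and ';' not in line.split('\t')[1]:
                tid, type_span, text = line.rstrip('\n').split('\t')
                etype, start, end = type_span.split(' ')
                entities.append((tid, etype, int(start), int(end), text))
    return entities


def propagate(source_dir, dest_dir, source_type, dest_type):
    for src in glob.glob(os.path.join(source_dir, '*.ann')):
        dst = os.path.join(dest_dir, os.path.basename(src))
        if not os.path.exists(dst):
            continue  # document not present in the destination collection
        dst_entities = read_entities(dst)
        existing_spans = {(s, e) for (_, _, s, e, _) in dst_entities}
        next_id = max([int(t[1:]) for (t, _, _, _, _) in dst_entities] or [0]) + 1
        new_lines = []
        for (_, etype, start, end, text) in read_entities(src):
            if etype == source_type and (start, end) not in existing_spans:
                new_lines.append('T%d\t%s %d %d\t%s\n'
                                 % (next_id, dest_type, start, end, text))
                next_id += 1
        if new_lines:
            with open(dst, 'a') as f:
                f.writelines(new_lines)


if __name__ == '__main__':
    propagate(*sys.argv[1:5])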

Respond to Tom Stein's comments on MER-A content

Tom Stein provided feedback on the MER-A content in the MTE:


  • 1. A canonical table of targets is missing. Of course, one was not provided by the mission. However, including target names with typos in the target table is confusing. A variety of target names exists:
    • Defined in the mission planning tool. Includes proposed targets that were named but never used.
    • Defined in team reports and planning documents like the Quill reports. Target name use is irregular and names diverge. Often these targets are not added to the mission planning tool.
    • Defined in literature.
  • 2. The readme file says the targets table includes all targets named by the mission. What is the origin of this list? Need source reference. As described, the targets.csv comes across as the authoritative list of mission target names. However, some documented targets are not included (e.g., John_Sutter and Pioneer_Dunes are not in the MER2 list). Other entries are clearly typos or abbreviations.
  • 3. Some typos/abbreviations in the aliases table also appear in the targets table, but not all of them do. These MER2 targets are in the aliases table but not the targets table: Gerturdeweise, Haskin’s_Ridge, Pot_O_Gold, Winterhaven.
  • 4. The MER aliases table includes entries where target_name = canonical_name (Winter_Haven, Tyrone_Nodules). Is this expected?
  • 5. What is the metric for adding an entry to the aliases table? For example, how do you know that Wishtone and Wishstone are two different targets for MER2? Or is an entry missing from the aliases table? Another MER2 example is Mason_Dixon and Masondixon.
  • 6. The MER2 table “Prl:pasorobleslight” seems like an invalid name. Automation error?
  • 7. The MER2 mentions table target “Pr2” is not included in the aliases or targets table. A disconnect?
  • 8. A diagram showing the relationships between tables would be super helpful. The text descriptions are sufficient, but a visualization would be a big plus.

Copy parser files from parser-indexer to this repository

Since our parser capabilities no longer depend on Solr, it would make sense to migrate the relevant parser scripts (and issues) to the MTE repository (leaving current versions of the parser scripts as-is in the parser-indexer repo).

Then, we can also remove the dependency on the parser-indexer repo in this repository.

Add ability to auto-annotate entities using a gazette

We want to increase recall of Target types specifically. One way is to do string-based matching against a gazette of entity terms. We can provide a gazette file consisting of "Entity_type Entity_name" pairs to inform the string matching.

Steven suggests integrating this capability into corenlp_parser.py. It would run after applying the trained NER model and before relation extraction; a sketch of this step appears after the list below.

  • Suggest adding an option like -g <gazette_file> which if specified would activate this capability
  • We don't want to generate duplicate annotations, so this step should check to see if a given entity is already marked by the NER system and if so, omit adding the new entity annotation.
  • The pre_annotate.py script can be an inspiration for this capability (but does not need to be followed exactly).
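
A rough sketch of the matching step described above (illustrative only; the -g option handling, the exact gazette format, and the integration point in corenlp_parser.py would need to follow the existing code):

# Minimal sketch (not the corenlp_parser.py implementation) of the gazette
# matching step applied after NER: add a mention for every gazette term found
# in the text unless its span overlaps a mention the NER model already produced.
# The gazette format ("Entity_type Entity_name" per line, with underscores for
# spaces in multi-word names) is an assumption for illustration.
import re


def load_gazette(gazette_file):
    entries = []
    with open(gazette_file) as f:
        for line in f:
            parts = line.strip().split(None, 1)
            if len(parts) == 2:
                entries.append((parts[0], parts[1]))
    return entries


def gazette_annotate(text, ner_mentions, gazette):
    # ner_mentions: list of (start, end, label) spans already found by NER
    new_mentions = []
    for label, name in gazette:
        # Whole-word, case-insensitive match; treat spaces and underscores alike
        tokens = [re.escape(tok) for tok in re.split(r'[ _]+', name) if tok]
        pattern = r'\b' + r'[ _]'.join(tokens) + r'\b'
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            span = (m.start(), m.end())
            overlaps = any(s < span[1] and span[0] < e
                           for (s, e, _) in ner_mentions + new_mentions)
            if not overlaps:
                new_mentions.append((span[0], span[1], label))
    return new_mentions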

Conduct user study of MTE content

  • Plan study - what we want to measure
    • Correctness of content (for some specified queries; for queries of their choice) - yes/no for e.g. 10 queries
    • Findability (of content of personal interest) ("could you find it?" and "how long did it take?")
    • Utility ("would you find this resource helpful in your research?") - Likert scale
    • Expansion ("what changes/features would you want to make this even more useful?") - text box
  • Devise survey or other information-gathering form
  • Recruit participants: Mars surface mission team members, planetary scientists that were not on a Mars surface mission, scientists from other fields
  • Collect and analyze results

Train CoreNLP named entity recognizer for use with MER documents

The plan is to re-train the model using all available annotations (MSL, MPF, PHX).

  • First, do cross-validation to get a sense of generalization capability. We no longer need to divide train/dev/test temporally. Include Property as an entity type (see the split sketch after this list).
  • Re-train over all annotations (including Property).
  • Augment with MER-A Target gazette for a MER-A classifier
  • Augment with MER-B Target gazette for a MER-B classifier
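
A minimal sketch of the document-level cross-validation split mentioned in the first item above (directory layout and the value of k are illustrative assumptions; converting each split's annotations into CoreNLP training format is a separate step):

# Illustrative document-level k-fold split over the labeled collections.
# A random split is fine because we no longer need a temporal train/dev/test
# division; each fold holds out whole documents, not individual sentences.
import glob
import random


def kfold_documents(collection_dirs, k=5, seed=42):
    docs = []
    for d in collection_dirs:
        docs.extend(sorted(glob.glob(d + '/*.txt')))
    random.Random(seed).shuffle(docs)
    folds = [docs[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        yield train, test


# Example (directory names are placeholders):
# for train, test in kfold_documents(['corpus-msl', 'corpus-mpf', 'corpus-phx']):
#     ...train on train, evaluate on test...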

Parser improvement: terminate run on global problems

When iterating over documents, it would be helpful to distinguish between errors specific to a document (such as an input file not being found) and errors that affect all documents (such as the CoreNLP server being unavailable or the NER model failing to load). In the former case, processing should advance to the next document. In the latter case, processing should stop with a message to the user on standard output indicating that the run was terminated prematurely (details can be in the log file). A sketch of this control flow follows the list below.

Global problems:

  • CoreNLP unavailable
  • NER model cannot be read
  • grobid unavailable
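
A sketch of the intended control flow (illustrative only; the exception and function names are placeholders, not the parser's actual code):

# Per-document problems are logged and skipped; global problems stop the run
# with a short message on standard output and details in the log file.
import logging
import sys


class GlobalError(Exception):
    """A problem that affects every document, e.g. CoreNLP or grobid being
    unavailable, or the NER model failing to load."""


def process_all(doc_files, process_one):
    for doc in doc_files:
        try:
            process_one(doc)
        except GlobalError as err:
            print('Run terminated prematurely: %s (see log for details)' % err)
            logging.exception('Global error while processing %s', doc)
            sys.exit(1)
        except Exception:
            # Document-specific problem (e.g., input file not found):
            # log it and move on to the next document.
            logging.exception('Skipping %s due to a document-level error', doc)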

MTE processing of documents that mention multiple missions

The MTE consists of a separate database per mission. However, some documents mention targets from multiple missions. How should we accommodate these documents?

  1. Independently apply each mission's NER model + the shared jSRE model. However, the review process will be tedious because each document has to be reviewed once per mission mentioned.
  2. Train a merged Target NER model across all missions. How then to separate them for each mission's database?
  3. Train a merged model to distinguish between targets per mission (different entity types for Target-MPF, Target-PHX, etc.). I suspect this will not have high accuracy and would require a lot of manual editing to reassign targets to their correct mission. However, it is worth evaluating.

To help decide the best course of action, we would like to evaluate each option on the same test set. This test set should be composed of some documents from each of the labeled sets (MSL, MPF, PHX).

Let's use this issue to continue planning how to proceed.

Train jSRE model for "has_property" relation

  • Train jSRE model for "has_property" relation using all labeled MPF and PHX docs
  • It would be good to create a script for training a new jSRE model, instead of referring to the wiki page where the process is documented.
  • We need a separate jSRE model for each relation used. This means we'll also need to update jsre_parser.py to apply more than one model (see #34); a sketch of that change follows below.
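
A hypothetical sketch of the jsre_parser.py change, where apply_jsre_model stands in for the existing single-model prediction step and the model paths are placeholders:

# One trained jSRE model per relation type; the parser loops over them and
# tags each extracted relation with its relation name.  Names and paths below
# are illustrative assumptions, not the repository's actual configuration.
RELATION_MODELS = {
    'contains': 'models/contains.jsre.model',
    'has_property': 'models/has_property.jsre.model',
}


def extract_relations(doc, apply_jsre_model):
    relations = []
    for relation, model_path in RELATION_MODELS.items():
        # Each model was trained separately on examples labeled for its relation
        relations.extend(apply_jsre_model(doc, model_path, relation))
    return relations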

Update ingest_sqlite.py to use ADS document meta-data fields

The new parser-indexer-py populates several new fields with document meta-data using the ADS service. These fields start with "ads:". We want to update ingest_sqlite.py to use these fields when available, and otherwise fall back to the "grobid:" fields:

MTE/src/ingest_sqlite.py

Lines 38 to 41 in c7fd597

'title': rec['metadata'].get('grobid:header_Title', ''),
'authors': rec['metadata'].get('grobid:header_Authors', ''),
'primary_author': '',
'affiliations': rec['metadata'].get('grobid:header_FullAffiliations', ''),
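
One possible shape for the fallback logic (the "ads:" key names below are assumptions for illustration; the actual keys are whatever parser-indexer-py writes):

def get_meta(rec, ads_key, grobid_key, default=''):
    # Prefer the ADS-derived field when it is present and non-empty,
    # otherwise fall back to the grobid-derived field.
    md = rec.get('metadata', {})
    return md.get(ads_key) or md.get(grobid_key, default)


def doc_fields(rec):
    return {
        'title': get_meta(rec, 'ads:title', 'grobid:header_Title'),
        'authors': get_meta(rec, 'ads:authors', 'grobid:header_Authors'),
        'primary_author': get_meta(rec, 'ads:primary_author', ''),
        'affiliations': get_meta(rec, 'ads:affiliations',
                                 'grobid:header_FullAffiliations'),
    }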

Update ingest_sqlite.py to remove LPSC-specific content

To support non-LPSC content (like journals), we will need to update these functions:

MTE/src/ingest_sqlite.py

Lines 93 to 113 in 3147233

# Document feature update functions by Thamme Gowda (from insert_docs.py)
def construct_doc_url(rec):
    if rec['year'] < 2000:
        rec['doc_url'] = 'http://www.lpi.usra.edu/meetings/' + \
            ('LPSC%s/pdf/' % (rec['year'] - 1900)) + \
            str(rec['abstract']) + '.pdf'
    elif rec['year'] <= 2017:  # 2000 through 2017
        rec['doc_url'] = 'http://www.lpi.usra.edu/meetings/' + \
            ('lpsc%s/pdf/' % rec['year']) + \
            str(rec['abstract']) + '.pdf'
    else:  # 2018 and later
        rec['doc_url'] = 'http://www.hou.usra.edu/meetings/' + \
            ('lpsc%s/pdf/' % rec['year']) + \
            str(rec['abstract']) + '.pdf'
    return rec


def update_doc_venue(rec):
    rec['venue'] = 'Lunar and Planetary Science Conference, ' + \
        ('Abstract #%d' % rec['abstract'])
    return rec
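
One way to generalize these functions for non-LPSC content, sketched under the assumption that journal records carry their own venue/URL metadata (the "ads:" keys are placeholders) while LPSC abstracts keep the existing construction:

def construct_doc_url_generic(rec):
    md = rec.get('metadata', {})
    if md.get('ads:url'):                     # journal or other non-LPSC record
        rec['doc_url'] = md['ads:url']
    elif 'abstract' in rec:                   # LPSC abstract: keep existing logic
        rec = construct_doc_url(rec)
    else:
        rec['doc_url'] = ''
    return rec


def update_doc_venue_generic(rec):
    md = rec.get('metadata', {})
    if md.get('ads:venue'):
        rec['venue'] = md['ads:venue']
    elif 'abstract' in rec:
        rec = update_doc_venue(rec)
    else:
        rec['venue'] = ''
    return rec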

Address PDS feedback on MER-A bundle

Reviewer 1:

I reviewed the readme.txt file in the document collection and looked at the data files in the MER collection.

  • 1) I understand that this dataset is structured like a relational database. However, I feel that a typical user would find it hard to use without some sort of user interface to connect the separate data files.

  • 2) The documentation should add a note that MER2 corresponds to the Spirit rover that is also sometimes referred to as MERA.

  • 3) The documentation should make it clear that the resulting target list is based on what has been published in LPSC abstracts and is not necessarily a complete list of targets defined and observed by the mission.

  • 4) It seems to me that the structure of the documents product would make it hard to expand it to include journal publications. For example, it includes a conference name, which is not applicable to journals, and an abstract number that may not be applicable to non-LPI conferences. Maybe the product should be renamed to abstracts or lpsc_abstracts.

  1. Components.csv:
  • - Some listed components are not elements or minerals, for example Hawaiite and Hyaloclastite.
  • - Abbreviations should be expanded. For example, Feot to FeO-total, px to pyroxene.
  • - Singular and plural entries of the same thing should be combined. The reason for having both pyroxene and pyroxenes, for example, is not clear. If it is important, please explain. Another example is Npox and Np-ox.
  • - Feot and Feotot are listed as minerals. The abstract could be referring to the total iron oxide content in the composition of a sample and not necessarily to an iron-oxide mineral.
  • - It is not clear why the second component of chemical compounds is shown in lower case rather than in the standard chemical notation for elements. This also applies to the contains product.

Reviewer 2:

The data are presented from an ML / data science viewpoint and not from a planetary scientist (or general science user) viewpoint. The structure makes perfect sense for a programmer, but a scientist has to build a system to link from table to table to get any answers. An average science user who wants to find abstracts that reference a given target has to relink all of the tables using a variety of IDs along the way. Attention has to be paid to the aliases table to find any name variations like misspellings and abbreviations. It’s not that the structure is terrible, it’s just not friendly to the common user.

I think the targets.csv table should have columns something like (target_id, target_name, mte_target_id, mte_target_name, mission) and then include both the target_id and mte_target_id in the other tables. The target_id and _name refer to the canonical name, so the set of columns could be (target_id_canonical, target_name_canonical, target_id_mte, target_name_mte, mission).

Perhaps the MTE should include a “first order” table that quickly and easily gets users 90% of the way to a reference:

  • ID columns (the target names and IDs)
  • Reference type (literature mention, element/mineral/property)
  • Reference value (for element/mineral/property: “oxygen”, “carbonate”, “lag_deposit”, etc.)
  • Document reference (id, title, author, year, URL)

Plenty of repetition, but this sort of table is easily scanned by human eyes and can be loaded into Excel (or wherever) for column-based sorting. For a user looking at the archive volume as it is, the filename “target.csv” is glowing red-hot and carries the suggestion that the file is a list of official targets. After opening the file, the user sees lots of names with typos and extraneous-looking appendages.

  • In short, my primary recommendation is that the science community probably is better served if the MTE product has two distinct parts: a user-friendly table (described above) and the formal computer science portion that is not the first thing a normal user would see.
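
For discussion, a sketch of how such a "first order" table could be generated from the existing SQLite tables; the table and column names below are assumptions about the MTE schema, shown only to illustrate one flattened, human-scannable row per (target, component, document) reference:

# Export one row per contains reference by joining the normalized tables.
# Table and column names are assumed for illustration only.
import csv
import sqlite3

QUERY = """
SELECT t.target_id, t.target_name,
       'contains'       AS reference_type,
       c.component_name AS reference_value,
       d.doc_id, d.title, d.authors, d.year, d.doc_url
FROM contains c
JOIN targets   t ON c.target_id = t.target_id
JOIN documents d ON c.doc_id = d.doc_id
"""


def export_first_order_table(db_path, out_csv):
    con = sqlite3.connect(db_path)
    cur = con.execute(QUERY)
    with open(out_csv, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])
        writer.writerows(cur)
    con.close()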

Reviewer 3:

bundle_mars_target_encyclopedia.xml:

  • - I would make the bundle V2.0, since an entire collection was added.
  • - Modification_Detail description should say that the data_mer2 collection was added.
    Maybe something like, "Added data_mer2 collection. Also added aliases table to data_mpf and data_phx collections" if that's accurate.

collection_mpf.xml, collection_phx.xml, and collection_document.xml:

  • - The Version 1.2 files online are named collection_mpf_inventory.xml, collection_phx_inventory.xml, and collection_document_inventory.xml. Changing the file names is going to mess things up at the EN. If you insist on changing them, I think they might have to start over at Version 1.0, but we'd need to check with Richard. I'd just stick with the online names, and follow the same naming convention for the new MER2 inventory (collection_mer2_inventory.xml).
    Recommendation is to revert to the “inventory” naming convention because the online files have already been registered with the EN, and this might cause problems. Also, it’s fine to have “inventory” in the collection label filename. Many of the bundles at GEO use this convention.

data_mer/collection_mer2.xml:

  • - Rename to collection_mer2_inventory.xml per above.
  • - Change to DOS line endings.
  • - <start_date_time>1997-07-04Z</start_date_time>
    <stop_date_time>2020-03-16Z</stop_date_time>
    This can't be right.
    I noticed that all the labels (including MPF and PHX) have these dates (the bundle XML is slightly different).
    Where do these dates come from? Shouldn't they be mission-specific?
    Scott: I agree with the reviewer and think what they were getting at was to have the bundle span all time, and the individual collections span only from when the mission data starts (i.e., MER starts in 03 or 04, not 97).

document/collection_document.xml:

  • - Modification_Detail description should say that the MTE-schema.jpg file was added and readme.txt was updated. "Add aliases table" should be removed.

document/readme.xml:

  • - Modification_Detail description should say something like "Updated to reflect addition of aliases table and mer2 data collection."

MTE-schema.jpg and MTE-schema.xml

  • - Lowercase the file names. Do the same for the pointer in MTE-schema.xml.
    Not a requirement, but we favour lowercase filenames at GEO. Also, this is the only file in the bundle that is not lowercase. LIDs have to be lowercase, so we try to have filenames match.

All data_mpf and data_phx XML labels:

  • Why was the <modification_date> for V1.2 changed from 2021-06-07 to 2021-06-01? The labels online have 2021-06-07.

data_mpf/has_property.xml, data_mpf/mentions.xml, data_mpf/targets.xml, data_phx/contains.xml, data_phx/documents.xml, data_phx/has_property.xml, data_phx/mentions.xml, data_phx/targets.xml.

  • - V1.3 Modification_Detail descriptions say "Add aliases table." These do not reflect the edits that were actually made to the individual files.

data/mer/has_property.xml, data_mpf/has_property.xml, and data_phx/has_property.xml

  • - Change to DOS line endings.

data_mpf/properties.xml, data_mpf/sentences.xml, data_phx/sentences.xml

  • Should be V1.3, since changes were made to the files. Remember to update the inventory file also.

Scott VanBommel:

  • I think what is missing from the readme file is discussion of how the target names were determined, and how the canonical names were selected when they appear in the aliases list. (Scott's comment: this ties into the concern, which I share, that a user accepts the information presented as absolute truth - however, targets.csv and aliases.csv are neither authoritative nor comprehensive.)

  • Change the part in the readme file that says “when a target name is re-used” to “when a target name is used in multiple missions”

Address/correct MTE 1.3.0 validation errors

Scott VanBommel used version 2.1.0 of the validate tool and identified some issues that need investigation. The full output is included below.


PDS Validate Tool Report

Configuration:
Version 2.1.0
Date 2022-01-27T13:32:31Z

Parameters:
Targets [file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/]
Rule Type pds4.bundle
Severity Level WARNING
Recurse Directories true
File Filters Used [*.xml, *.XML]
Data Content Validation on
Product Level Validation on
Allow Unlabeled Files false
Max Errors 100000
Registered Contexts File C:\PDS\Tools\Validate\bin..\resources\registered_context_products.json

Product Level Validation Results

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/bundle_mars_target_encyclopedia.xml
1 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/aliases.xml
2 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/collection_mer2_inventory.xml
3 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/components.xml
4 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/contains.xml
5 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/documents.xml
6 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/has_property.xml
7 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/mentions.xml
8 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/properties.xml
9 product validation(s) completed

FAIL: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.xml
Begin Content Validation: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.csv

  • ERROR [error.validation.invalid_field_value] table 1, record 1884, field 3: The field value 'For example, Rayleigh fractional crystallization of Adirondack magma steadily increases incompatible element concentrations (K; ! D Kbulk " 0) and rapidly decreases compatible element concentrations (Ni; ! D Ni bulk >>1).' that starts with double quote should not contain double quote(s)

End Content Validation: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.csv
10 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/targets.xml
11 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/aliases.xml
12 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/collection_mpf_inventory.xml
13 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/components.xml
14 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/contains.xml
15 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/documents.xml
16 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/has_property.xml
17 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/mentions.xml
18 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/properties.xml
19 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/sentences.xml
20 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/targets.xml
21 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/aliases.xml
22 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/collection_phx_inventory.xml
23 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/components.xml
24 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/contains.xml
25 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/documents.xml
26 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/has_property.xml
27 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/mentions.xml
28 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/properties.xml
29 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/sentences.xml
30 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/targets.xml
31 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/document/collection_document_inventory.xml
32 product validation(s) completed

FAIL: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/document/readme.xml

  • ERROR [error.validation.internal_error] Error occurred while processing TEXT file content for readme.txt: String index out of range: -1

33 product validation(s) completed

PDS4 Bundle Level Validation Results

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/collection_mpf_inventory.xml
1 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/collection_phx_inventory.xml
2 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/document/collection_document_inventory.xml
3 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/bundle_mars_target_encyclopedia.xml

  • WARNING [warning.integrity.missing_context_reference] The context reference 'urn:nasa:pds:context:investigation:mission.mars_science_laboratory' could not be found in this bundle but it was defined in urn:nasa:pds:mars_target_encyclopedia:document::1.3. (Disable with --skip-context-reference-check flag)

  • WARNING [warning.integrity.missing_context_reference] The context reference 'urn:nasa:pds:context:instrument_host:spacecraft.msl' could not be found in this bundle but it was defined in urn:nasa:pds:mars_target_encyclopedia:document::1.3. (Disable with --skip-context-reference-check flag)

  • WARNING [warning.integrity.missing_context_reference] The context reference 'urn:nasa:pds:context:instrument:chemcam_libs.msl' could not be found in this bundle but it was defined in urn:nasa:pds:mars_target_encyclopedia:document::1.3. (Disable with --skip-context-reference-check flag)

4 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/components.xml
5 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/sentences.xml
6 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/aliases.xml

  • WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:aliases::1.0' is not a member of any collection within the given target

7 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/has_property.xml
8 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/components.xml
9 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/contains.xml
10 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/collection_mer2_inventory.xml
11 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/aliases.xml
12 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/contains.xml
13 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.xml

  • WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:sentences::1.0' is not a member of any collection within the given target

14 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/documents.xml

  • WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:documents::1.0' is not a member of any collection within the given target

15 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/mentions.xml
16 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/sentences.xml
17 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/components.xml

  • WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:components::1.0' is not a member of any collection within the given target

18 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/targets.xml
19 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/targets.xml
20 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/properties.xml
21 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/contains.xml

  • WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:contains::1.0' is not a member of any collection within the given target

22 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/has_property.xml

  • WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:has_property::1.0' is not a member of any collection within the given target

23 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/mentions.xml

  • WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:mentions::1.0' is not a member of any collection within the given target

24 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/properties.xml
25 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/mentions.xml
26 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/documents.xml
27 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/properties.xml

  • WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:properties::1.0' is not a member of any collection within the given target

28 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/documents.xml
29 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/has_property.xml
30 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/document/readme.xml
31 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/targets.xml

  • WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:targets::1.0' is not a member of any collection within the given target

32 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/aliases.xml
33 integrity check(s) completed

Summary:

2 error(s)
12 warning(s)

Product Validation Summary:
31 product(s) passed
2 product(s) failed
0 product(s) skipped

Referential Integrity Check Summary:
33 check(s) passed
0 check(s) failed
0 check(s) skipped

Message Types:
1 error.validation.internal_error
1 error.validation.invalid_field_value
9 warning.integrity.unreferenced_member
3 warning.integrity.missing_context_reference

End of Report
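
A quick diagnostic sketch for the sentences.csv error above: after csv parsing strips the enclosing quotes, any double quote remaining inside a field value was embedded in the quoted field, which is what the validate tool flags. Indices below are 1-based and may be offset from the tool's if the file has a header row.

# Scan CSV files for fields that contain embedded double quotes (the condition
# behind the invalid_field_value error reported for sentences.csv).
import csv
import sys


def find_suspect_fields(csv_path):
    with open(csv_path, newline='') as f:
        for rec_num, row in enumerate(csv.reader(f), start=1):
            for field_num, value in enumerate(row, start=1):
                if '"' in value:
                    print('%s: record %d, field %d: %r'
                          % (csv_path, rec_num, field_num, value[:80]))


if __name__ == '__main__':
    for path in sys.argv[1:]:
        find_suspect_fields(path)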

Remove spurious Components/Properties

I noticed that "Calcium" was listed as both an "Element" (correct) and a "Mineral" (incorrect) in our MER-A (mer2) database. It turns out that this is not due to an incorrect annotation, but instead an NER error: "Ca" is listed as a "Mineral" NER in the source .json file (/proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-all397.jsonl) when it is part of "Ca-sulfates" in 2006_1472. This is corrected in the annotations to type "Element", but the remove-orphans step in update_sqlite.py does not remove "Calcium" from the components table because "Calcium" still appears in the contains table (the element appears in at least one valid contains relation), and the components table is not refreshed based on the annotations (probably so that update_sqlite.py can be run several times to progressively add/update content if desired).

A possible solution would be for the remove-orphans step to regenerate components and properties at the end of processing so that they accurately reflect content in the documents at that point. However, it is worth more thought to determine if this is the best solution.
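
A minimal sketch of the regeneration idea, assuming the components table has (component_name, component_type) columns and that the final reviewed annotations are available as name/type pairs; the schema details here are assumptions:

# Rebuild the components table from the final reviewed annotations so that
# stale NER-assigned types (e.g., Calcium -> Mineral) do not linger after
# several update_sqlite.py runs.  Column names are assumed for illustration.
import sqlite3


def rebuild_components(db_path, reviewed_components):
    # reviewed_components: iterable of (component_name, component_type) pairs
    # collected from the annotations at the end of processing.
    con = sqlite3.connect(db_path)
    with con:
        con.execute('DELETE FROM components')
        con.executemany(
            'INSERT INTO components (component_name, component_type) '
            'VALUES (?, ?)',
            sorted(set(reviewed_components)))
    con.close()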

Generate MTE PDS4 bundle with MER-B content

  • Generate initial MER-B annotations (w/Contains and HasProperty from jSRE) using the MTE pipeline
  • Perform expert review of Target, Contains, and HasProperty annotations. Guidelines: https://docs.google.com/document/d/1KnkVtxkKb9kcRVKZqPWwuXSi_6zK8Scay1M6qiz1Khc/edit#
  • Populate aliases table
  • Generate SQLite database for MER-A (some items were updated) and MER-B
  • Check contents via MTE website
  • Generate MTE bundle with MER-B (mer1) content added
    • Note: PDS made us go to version 2.0 when we added mer2, so likely we should advance the bundle to version 3.0 with mer1. (Individual files will advance versions only as needed)

Documentation

  • Add README to src/
    • Steven: describe how to parse a document (full details on wiki page)
    • Kiri: describe how to create SQLite database (ingest, review with json2brat, update)
  • Describe how to deliver PDS bundle - this has many steps and probably should be a wiki page instead; will be copied from github-fn wiki page
  • Update wiki to remove mentions of Solr and replace with SQLite

Prepare LPSC 2022 poster

  • Draft poster and share with co-authors
  • Submit to doc review by Feb. 23
  • March 2: latest date to "publish" poster online
  • March 9: poster session presentation

Develop improved relation extraction methods

e.g., identify "container" and "containee" entities prior to applying the relation classifier to improve precision; search for a "container" whenever a "containee" is detected to improve recall. A sketch of this idea follows the list below.

  • Compare to jSRE
  • Compare to PURE
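
A sketch of the candidate-filtering idea (the entity type names are those used elsewhere in the MTE; classify_pair is a placeholder for whichever relation classifier is used, e.g. jSRE or PURE):

# Restrict candidate pairs to plausible (container, containee) combinations
# before scoring them, which should improve precision; generating a candidate
# for every containee also ensures a container is searched for whenever a
# containee is present, which should help recall.
CONTAINER_TYPES = {'Target'}
CONTAINEE_TYPES = {'Element', 'Mineral'}


def candidate_pairs(entities):
    # entities: list of (entity_id, entity_type) mentions within one sentence
    containers = [e for e in entities if e[1] in CONTAINER_TYPES]
    containees = [e for e in entities if e[1] in CONTAINEE_TYPES]
    return [(c, x) for c in containers for x in containees]


def extract_contains(entities, classify_pair):
    return [(c, x) for (c, x) in candidate_pairs(entities) if classify_pair(c, x)]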
