
STREUSLE Dataset

Example: STREUSLE annotations visualized with streusvis.py

STREUSLE stands for Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions. The text is from the web reviews portion of the English Web Treebank [9]. STREUSLE incorporates comprehensive annotations of multiword expressions (MWEs) [1] and semantic supersenses for lexical expressions. The supersense labels apply to single- and multiword noun and verb expressions, as described in [2], and to prepositional/possessive expressions, as described in [3, 4, 5, 6, 7, 8]. Each lexical expression also carries a lexical category label indicating its holistic grammatical status; for verbal multiword expressions, these labels incorporate categories from the PARSEME 1.1 guidelines [15]. For each token, these pieces of information are concatenated into a lextag: a sentence's words and their lextags are sufficient to recover lexical categories, supersenses, and multiword expressions [8].
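For illustration, here is a minimal sketch of how a lextag decomposes (the example tags and the simple split-on-hyphen parsing are assumptions for illustration; tagging.py and supersenses.py in this repository provide the real utilities):

    # Hypothetical lextags of the form MWEPOSITION-LEXCAT-SUPERSENSE;
    # continuation tags like I_ carry no lexcat or supersense.
    def parse_lextag(lextag):
        parts = lextag.split('-', 2)                    # at most three fields
        position = parts[0]                             # e.g. O, B, I_, I~
        lexcat = parts[1] if len(parts) > 1 else None   # e.g. N, V, P
        ss = parts[2] if len(parts) > 2 else None       # e.g. v.motion, p.Goal|p.Goal
        return position, lexcat, ss

    print(parse_lextag('O-V-v.motion'))   # ('O', 'V', 'v.motion')
    print(parse_lextag('I_'))             # ('I_', None, None)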

🧮 Corpus Stats: >55k words, >3k multiword expression instances, >22k supersense-tagged expressions

👩‍💻 Using the data: The canonical file with source annotations is streusle.conllulex. For scripting, the JSON format will likely be preferred. See Formats below.
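As a quick orientation, a sketch of reading the JSON form in Python (the file name and the 'toks'/'word'/'lextag' field names follow the output of conllulex2json.py as I understand it; consult CONLLULEX.md for the authoritative schema):

    import json

    # Load the converted corpus: a list of sentence objects (assumed schema).
    with open('streusle.json') as f:
        sentences = json.load(f)

    # Print each token's word form alongside its lextag.
    for sent in sentences[:3]:
        for tok in sent['toks']:
            print(tok['word'], tok['lextag'])
        print()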

🤖 Tagger: Code for a lexical semantic recognition tagger [8] trained on STREUSLE can be downloaded at: https://github.com/nelson-liu/lexical-semantic-recognition/

Release URL: https://github.com/nert-nlp/streusle
Additional information: http://www.cs.cmu.edu/~ark/LexSem/
Online corpus search in ANNIS: https://corpling.uis.georgetown.edu/annis/#_c=c3RyZXVzbGVfNC4z (instructions)
Browse semantic annotations of prepositions/possessives on the Xposition website [17]: http://www.xposition.org/en/

The English Web Treebank sentences were also used by the Universal Dependencies (UD) project as the primary reference corpus for English [10]. STREUSLE incorporates the syntactic and morphological parses from UD_English-EWT v2.10 (released May 15, 2022); these follow the UD v2 standard.

This dataset's multiword expression and supersense annotations are licensed under a Creative Commons Attribution-ShareAlike 4.0 International license (see LICENSE). The UD annotations are redistributed under the same license. The source sentences and PTB part-of-speech annotations, which are from the Reviews section of the English Web Treebank (EWTB; [9]), are redistributed with permission of Google and the Linguistic Data Consortium, respectively.

An independent effort to improve the MWE annotations from those in STREUSLE 3.0 resulted in the HAMSTER resource [14]. The HAMSTER revisions have not been merged with the 4.0 revisions, though we intend to do so for a future release.


Files

  • streusle.conllulex: Full dataset.

  • STATS.md, LEXCAT.txt, MWES.txt, SUPERSENSES.txt: Statistics summarizing the full dataset.

  • train/, dev/, test/: Data splits established by the UD project and accompanying statistics.

  • releaseutil/: Scripts for preparing the data for release.

  • ACKNOWLEDGMENTS.md: Contributors and support that made this dataset possible.

  • CONLLULEX.md: Description of data format.

  • EXCEL.md: Instructions for working with the data as a spreadsheet.

  • LICENSE.txt: License.

  • ACL2018.md: Links to resources reported in [7].

  • conllulex2json.py: Script to validate the data and convert it to JSON.

  • json2conllulex.py: Script to convert STREUSLE JSON to .conllulex.

  • conllulex2csv.py: Script to create an Excel-readable CSV file with the data.

  • csv2conllulex.py: Script to convert an Excel-generated CSV file to .conllulex.

  • conllulex2UDlextag.py: Script to remove all STREUSLE fields except lextags.

  • UDlextag2json.py: Script to unpack lextags, populating remaining STREUSLE fields.

  • normalize_mwe_numbering.py: Script to ensure MWEs within each sentence are numbered in a consistent order.

  • govobj.py: Utility for adding heuristic preposition/possessor governor and object links to the JSON.

  • lexcatter.py: Utilities for working with lexical categories.

  • mwerender.py: Utilities for working with MWEs.

  • supersenses.py: Utilities for working with supersense labels.

  • streusvis.py: Utility for browsing MWE and supersense annotations.

  • supdate.py: Utility for applying lexical semantic annotations made by editing the output of streusvis.py.

  • tagging.py: Utilities for working with BIO-style tags.

  • tquery.py: Utility for searching the data for tokens that meet certain criteria.

  • tupdate.py: Utility for applying lexical tag changes made by editing the output of tquery.py.

  • streuseval.py: Unified evaluation script for MWEs and supersenses.

  • psseval.py: Evaluation script for preposition/possessive supersense labeling only.

  • pssid/: Heuristics for identifying SNACS targets.

  • setup.py: Setup script for installing this as a Python package via setuptools.

Formats

  • The canonical data format for STREUSLE 4.0+ is the CONLLULEX tabular format. It extends the CoNLL-U format from the Universal Dependencies project with additional columns for lexical semantic annotations; the added columns are sketched just after this list. (The .sst and .tags formats from STREUSLE 3.0 are not expressive enough and are no longer supported.)

  • Scripts support conversion between .conllulex and a JSON format: conllulex2json.py, json2conllulex.py. A JSON file can be enriched with syntactic details of the preposition/possessive relations via the govobj.py script. JSON files are included in the train, dev, and test subdirectories.

  • Other scripts support conversion between .conllulex and Excel-compatible CSV.

  • Luke Gessler has written a module for the Pepper tool so that STREUSLE data can be converted to other Pepper-supported formats, including PAULA XML and ANNIS. See instructions for converting.
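As a reference for the first bullet above, these are the columns that .conllulex appends to the ten CoNLL-U columns, as I recall them from CONLLULEX.md (treat this listing as an assumption and consult that file for the authoritative definitions):

    # Columns 11-19 of the .conllulex format (assumed; see CONLLULEX.md):
    EXTRA_COLUMNS = [
        'SMWE',      # strong MWE group and position within it, e.g. 3:2
        'LEXCAT',    # lexical category of the strong expression
        'LEXLEMMA',  # lemma of the (possibly multiword) strong expression
        'SS',        # supersense (scene role, for SNACS expressions)
        'SS2',       # second supersense (function, for SNACS expressions)
        'WMWE',      # weak MWE group and position
        'WLEMMA',    # weak MWE lemma
        'WCAT',      # weak MWE category
        'LEXTAG',    # the full lextag for the token
    ]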

References

Citations describing the annotations in this corpus (main STREUSLE papers in bold):

  • [1] Nathan Schneider, Spencer Onuffer, Nora Kazour, Emily Danchik, Michael T. Mordowanec, Henrietta Conrad, and Noah A. Smith. Comprehensive annotation of multiword expressions in a social web corpus. Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavík, Iceland, May 26–31, 2014. http://people.cs.georgetown.edu/nschneid/p/mwecorpus.pdf

  • [2] Nathan Schneider and Noah A. Smith. A corpus and model integrating multiword expressions and supersenses. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, May 31–June 5, 2015. http://people.cs.georgetown.edu/nschneid/p/sst.pdf

  • [3] Nathan Schneider, Jena D. Hwang, Vivek Srikumar, Meredith Green, Abhijit Suresh, Kathryn Conger, Tim O'Gorman, and Martha Palmer. A corpus of preposition supersenses. Proceedings of the 10th Linguistic Annotation Workshop, Berlin, Germany, August 11, 2016. http://people.cs.georgetown.edu/nschneid/p/psstcorpus.pdf

  • [4] Jena D. Hwang, Archna Bhatia, Na-Rae Han, Tim O’Gorman, Vivek Srikumar, and Nathan Schneider. Double trouble: the problem of construal in semantic annotation of adpositions. Proceedings of the Sixth Joint Conference on Lexical and Computational Semantics, Vancouver, British Columbia, Canada, August 3–4, 2017. http://people.cs.georgetown.edu/nschneid/p/prepconstrual2.pdf

  • [5] Nathan Schneider, Jena D. Hwang, Vivek Srikumar, Archna Bhatia, Na-Rae Han, Tim O'Gorman, Sarah R. Moeller, Omri Abend, Adi Shalev, Austin Blodgett, and Jakob Prange (June 15, 2022). Adposition and Case Supersenses v2.6: Guidelines for English. arXiv preprint. https://arxiv.org/abs/1704.02134

  • [6] Austin Blodgett and Nathan Schneider (2018). Semantic supersenses for English possessives. Proceedings of the 11th International Conference on Language Resources and Evaluation, Miyazaki, Japan, May 9–11, 2018. http://people.cs.georgetown.edu/nschneid/p/gensuper.pdf

  • [7] Nathan Schneider, Jena D. Hwang, Vivek Srikumar, Jakob Prange, Austin Blodgett, Sarah R. Moeller, Aviram Stern, Adi Bitan, and Omri Abend. Comprehensive supersense disambiguation of English prepositions and possessives. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, July 15–20, 2018. http://people.cs.georgetown.edu/nschneid/p/pssdisambig.pdf

Related work:

  • [8] Nelson F. Liu, Daniel Hershcovich, Michael Kranzlein, and Nathan Schneider (2021). Lexical semantic recognition. Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021), Online, August 6, 2021. https://people.cs.georgetown.edu/nschneid/p/lsr.pdf (tagger code)

  • [9] Ann Bies, Justin Mott, Colin Warner, and Seth Kulick. English Web Treebank. Linguistic Data Consortium, Philadelphia, Pennsylvania, August 16, 2012. https://catalog.ldc.upenn.edu/LDC2012T13

  • [10] Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel R. Bowman, Miriam Connor, John Bauer, and Christopher D. Manning (2014). A gold standard dependency corpus for English. Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavík, Iceland, May 26–31, 2014. http://www.lrec-conf.org/proceedings/lrec2014/pdf/1089_Paper.pdf

  • [11] Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith. Discriminative lexical semantic segmentation with gaps: running the MWE gamut. Transactions of the Association for Computational Linguistics, 2(April):193−206, 2014. https://people.cs.georgetown.edu/nschneid/p/mwe.pdf

  • [12] Nathan Schneider, Jena D. Hwang, Vivek Srikumar, and Martha Palmer. A hierarchy with, of, and for preposition supersenses. Proceedings of the 9th Linguistic Annotation Workshop, Denver, Colorado, June 5, 2015. https://people.cs.georgetown.edu/nschneid/p/pssts.pdf

  • [13] Nathan Schneider, Dirk Hovy, Anders Johannsen, and Marine Carpuat. SemEval-2016 Task 10: Detecting Minimal Semantic Units and their Meanings (DiMSUM). Proceedings of the 10th International Workshop on Semantic Evaluation, San Diego, California, June 16–17, 2016. http://people.cs.georgetown.edu/nschneid/p/dimsum.pdf

  • [14] King Chan, Julian Brooke, and Timothy Baldwin. Semi-automated resolution of inconsistency for a harmonized multiword expression and dependency parse annotation. Proceedings of the 13th Workshop on Multiword Expressions, Valencia, Spain, April 4, 2017. http://www.aclweb.org/anthology/W17-1726

  • [15] PARSEME Shared Task 1.1 - Annotation guidelines. 2018. http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=home

  • [16] Daniel Hershcovich, Nathan Schneider, Dotan Dvir, Jakob Prange, Miryam de Lhoneux, and Omri Abend. Comparison by conversion: reverse-engineering UCCA from syntax and lexical semantics. Proceedings of the Second International Workshop on Designing Meaning Representations, Online, December 13, 2020. https://arxiv.org/abs/2011.00834 (rule-based system, statistical system)

  • [17] Luke Gessler, Austin Blodgett, Joseph Ledford, and Nathan Schneider (2022). Xposition: An online multilingual database of adpositional semantics. Proceedings of the 13th Language Resources and Evaluation Conference, Marseille, France, June 20–25, 2022. https://people.cs.georgetown.edu/nschneid/p/xposition.pdf (website)

Contact

Questions should be directed to:

Nathan Schneider
[email protected]
http://nathan.cl

History

Synopsis of changes in 4.0 and later

The 4.0 release [7] updates the inventory and application of preposition supersenses, applies those supersenses to possessives (detailed in [6]), incorporates the syntactic annotations from the Universal Dependencies project, and adds lexical category labels to indicate the holistic grammatical status of strong multiword expressions. The 4.1 release adds subtypes for verbal MWEs (VID, VPC.{full,semi}, LVC.{full,cause}, IAV) according to PARSEME 1.1 guidelines [15]. The 4.2 and 4.3 releases revise some of the semantic annotations. The 4.4 release updates only UD annotations. The 4.5 release updates UD annotations and renames a couple of semantic labels.

Detailed changes

  • STREUSLE 4.5: 2022-06-15.
    • Update SNACS annotations to the v2.6 standard (automatically rename p.Causer -> p.Force and p.RateUnit -> p.SetIteration).
    • Update UD to v2.10. This affects many UPOS tags and lemmas, especially for proper names. The UD update also introduces lines encoding multiword tokens (not to be confused with multiword expressions) for clitics.
  • STREUSLE 4.4: 2020-11-04.
    • Update govobj.py to recognize a different style of annotation for preposition stranding.
    • Update UD to v2.6.
    • Link from README to [16], a new paper on converting STREUSLE annotations to UCCA (Universal Conceptual Cognitive Annotation), which uses this version of the data in experiments.
  • STREUSLE 4.3: 2020-05-01.
    • Updated preposition/possessive annotations to SNACS v2.5 guidelines ([5], specifically https://arxiv.org/abs/1704.02134v6), which includes changes in the set of labels.
    • Added a sentence that had been omitted from a document in the training set.
    • Updated UD parses to the latest dev version (post-v2.5). This improves lemmas for misspelled words and adds paragraph boundaries.
    • Link from README to new Pepper converter module.
    • Link from README to online search tool using ANNIS.
  • STREUSLE 4.2: 2020-01-01.
    • Added streuseval.py, a unified evaluation script for MWEs + supersenses (issue #31).
    • Added streusvis.py, for viewing sentences with their MWE and supersense annotations.
    • Added supdate.py (sentence-wise) and tupdate.py (token-wise) for editing lexical semantic annotations (issue #54).
    • Added format conversion scripts conllulex2json.py, conllulex2UDlextag.py, and UDlextag2json.py.
    • Normalized the way MWEs within a sentence are numbered in markup (normalize_mwe_numbering.py, issue #42).
    • Several improvements to govobj.py (most notably issue #35, affecting 184 tokens, and a small fix in 58db569 which affected 53 tokens).
    • Subdirectories for splits (train/, dev/, test/) now include .json and .govobj.json files alongside the source .conllulex.
    • Added release preparation scripts under releaseutil/.
    • Added setup.py.
    • Fixed a very small bug in tquery.py affecting the display of sentence-final matches, and made minor changes in functionality involving null values and negative constraints; token-level attributes of multiword expressions; and a new option to filter by sentence length.
    • Manually corrected all tokens with the placeholder lexcat symbol !!@ (introduced in v4.0) to have a real lexcat and, if appropriate, a supersense (issue #15).
    • A number of revisions to SNACS (preposition/possessive supersense) annotations coordinated with updated guidelines ([5], specifically SNACS v2.4, https://arxiv.org/abs/1704.02134v5; this incorporates updates for SNACS v2.3 as well).
    • Minor corrections in the data and validation improvements.
    • Updated UD parses to the latest dev version (post-v2.5). Among other things, this improves lemmas for words with nonstandard spellings.
  • STREUSLE 4.1: 2018-07-02. Added subtypes to verbal MWEs (871 tokens) per PARSEME Shared Task 1.1 guidelines [15]; some MWE groupings revised in the process. Minor improvements to SNACS (preposition/possessive supersense) annotations coordinated with updated guidelines ([5], specifically https://arxiv.org/abs/1704.02134v3). Implementation of SNACS (preposition/possessive supersense) target identification heuristics from [7]. New utility scripts for listing/filtering tokens (tquery.py) and converting to and from an Excel-compatible CSV format.
  • STREUSLE 4.0: 2018-02-10. Updated preposition supersenses to new annotation scheme (4398 tokens). Annotated possessives (1117 tokens) using preposition supersenses. Revised a considerable number of MWEs involving prepositions. Added lexical category for every single-word or strong multiword expression. New data format (.conllulex) integrates gold syntactic annotations from the Universal Dependencies project.
  • STREUSLE 3.0: 2016-08-23. Added preposition supersenses.
  • STREUSLE 2.1: 2015-09-25. Various improvements, chiefly to auxiliaries and prepositional verbs; added the `p class label as a stand-in for preposition supersenses to be added in a future release, and `i for infinitival 'to' where it should not receive a supersense. Changes from 2.0 (not counting `p and `i):
    • Annotations have changed for 877 sentences (609 involving changes to labels, 474 involving changes to MWEs).
    • 877 class labels have been changed/added/removed, usually involving a non-supersense label or triggered by an MWE change. Most frequently (118 cases) this was to replace stative with the auxiliary label `a. In only 21 cases was a supersense label replaced with a different supersense label.
  • STREUSLE 2.0: 2015-03-29. Added noun and verb supersenses.
  • CMWE 1.0: 2014-03-26. Multiword expressions for 55k words of English web reviews.


Issues

streusvis.py: consider flagging errors, adding space to align tokens across sentences

Currently, each sentence is rendered separately with MWEs and supersenses, and color is added post hoc to annotations based on a regex.

Given the gold and predicted sentences below:

No more having_to drive|v.motion to|p.Goal San_Francisco|n.LOCATION for|p.Purpose a great mani_pedi|n.ACT .
No more having to drive to|p.Goal San Francisco for|p.Purpose a great mani pedi .

It would be nice to

  • highlight where the prediction was incorrect (maybe with a red background and white text for a missing or extra label and red text for an incorrect label, or maybe just by making the word red if either the MWE analysis or the supersense was incorrect)

  • align the tokens, i.e.

      No more having_to drive|v.motion to|p.Goal San_Francisco|n.LOCATION for|p.Purpose a great mani_pedi|n.ACT .
      No more having to drive          to|p.Goal San_Francisco            for|p.Purpose a great mani pedi       .
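
A rough sketch of the requested padding (a hypothetical helper, not part of streusvis.py; it assumes both renderings have already been split into the same number of column units, which in practice requires splitting gold MWEs joined by underscores):

    def align(gold_units, pred_units):
        # Pad each column to the width of its wider member so the two
        # renderings line up when printed on consecutive lines.
        widths = [max(len(g), len(p)) for g, p in zip(gold_units, pred_units)]
        gold = ' '.join(g.ljust(w) for g, w in zip(gold_units, widths))
        pred = ' '.join(p.ljust(w) for p, w in zip(pred_units, widths))
        return gold, pred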
    

Add tsv files to prepare-4.0 branch

@ablodge Thanks for uploading the files. I've moved them to the prepare-4.0 branch. Can you also add the spreadsheets that the script uses (since you changed the column names)?

BTW I've changed your script so that the input is streusle_v3.sst and the output is streusle_v4.sst. (v1 of preposition supersenses corresponds to v3 of the STREUSLE corpus.)

Lexcat heuristics: PP false positives

Ken Litkowski noticed that PP is erroneously the lexcat for in hope to, just about, and nothing but, which should be P. This is because the UPOS of the last word in the MWE is PART, ADV, or CCONJ. Under the current heuristics in lexcatter.py, an MWE is treated as P only if the last word is tagged as ADP or SCONJ.
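For concreteness, a toy rendering of the rule in question (a simplification of lexcatter.py's actual heuristics, written here as an assumption):

    def mwe_prep_lexcat(last_word_upos):
        # Current rule: only MWEs whose last word is ADP or SCONJ get lexcat P,
        # so MWEs ending in PART, ADV, or CCONJ ("in hope to", "just about",
        # "nothing but") incorrectly fall through to PP.
        return 'P' if last_word_upos in {'ADP', 'SCONJ'} else 'PP'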

Address all !!@ and !@ tokens

These are stopgap lexcats for non-prepositional tokens that need to be revised, typically ones that need an N or V supersense.

Possession-related data review

  • In general, p.Possessor is only used as function if the scene role is also p.Possessor. But there are a few exceptions which may be inconsistencies.
  • Review p.Originator~>p.Gestalt. Some look like they should be p.Possessor because the governor is the entity, not the transfer event.

MWE numbering within sentence is inconsistent

In some sentences all strong MWEs are numbered before weak ones; in others the numbering is by token offset.

This does not matter for the semantics, but it means that equivalent files will be superficially different. So perhaps we should enforce a normal form for numbering MWEs.

In the script for #41:

# Note that numbering of strong+weak MWEs doesn't follow a consistent order in the data!
# Ordering by first token offset (tiebreaker to strong MWE):
#xgroups = [(min(sg),'s',sg) for sg in sgroups] + [(min(wg),'w',wg) for wg in wgroups]
# Putting all strong expressions before any weak expressions:
xgroups = [(None,'s',sg) for sg in sgroups] + [(None,'w',wg) for wg in wgroups]
# This means that the MWE columns are not *completely* determined by
# the lextag in a way that matches the original data, but different MWE
# orders does not matter semantically.
# See also check in _postproc_sent(), which ensures that the MWE numbers
# count from 1, but does not mandate an order.

From streusle/UDlextag2json.py, lines 124–129 at commit 09014b4:

# (this excerpt assumes `from itertools import chain` at the top of the file)
# check that MWEs are numbered from 1
# fix_mwe_numbering.py was written to correct this
# However, this does NOT require a particular sort order of the MWEs in the sentence.
# It just requires that they have unique numbers 1, ..., N if there are N MWEs.
for i,(k,mwe) in enumerate(sorted(chain(sent['smwes'].items(), sent['wmwes'].items()), key=lambda x: int(x[0])), 1):
    assert int(k)==i,(sent['sent_id'],i,k,mwe)

Double-check s-genitives

In our annotation for the LREC paper @ablodge and I disagreed on some tokens. The guidelines have since been revised. We should go through the disagreements and adjudicate them.

govobj of "rather than"

"they seemed more interested~in helping me find the right car rather_then just make_ a _sale" (245160.4)

"rather then" has gov "make" and obj null, but in fact "make" should be the obj (and maybe "helping" should be the gov).
In UD, "rather" is a cc of "make", so I can see where this comes from, but it would be nice to have govobj.py handle this.

Attested but undocumented construals

The following 30 construals are each attested between 1 and 3 times in the data, but not documented in the current guidelines. (Each line gives the attestation count, the scene role, and the function.) Let's look at them to see which are worth documenting, which are borderline but worth keeping in the data, and which should be reannotated.

   1 p.Agent	p.Locus
   3 p.Beneficiary	p.Gestalt
   1 p.Characteristic	p.Manner
   1 p.ComparisonRef	p.Beneficiary
   1 p.Cost	p.Extent
   1 p.Direction	p.Goal
   1 p.Experiencer	p.Agent
   1 p.Explanation	p.Manner
   1 p.Extent	p.Whole
   2 p.Gestalt	p.Purpose
   1 p.Gestalt	p.Source
   2 p.Gestalt	p.Topic
   1 p.Goal	p.Whole
   3 p.Instrument	p.Manner
   1 p.Instrument	p.Theme
   3 p.Manner	p.Source
   1 p.Manner	p.Topic
   1 p.Means	p.Path
   1 p.Originator	p.Instrument
   1 p.Possession	p.PartPortion
   3 p.Possession	p.Theme
   2 p.Purpose	p.Goal
   2 p.Purpose	p.Locus
   1 p.Purpose	p.Theme
   1 p.SocialRel	p.Source
   3 p.Stimulus	p.Source
   1 p.Stimulus	p.Theme
   1 p.Theme	p.Accompanier
   1 p.Theme	p.Characteristic
   2 p.Time	p.Extent

Causative get/have: supersense

E.g. "get/have something fixed", "get my hair done". For "get my hair done", should "get" be v.change and "done" be v.body?

76 instances of VBN.*xcomp, most of which are this construction. (This doesn't count resultative PPs: "I got her on the phone".)

These might qualify as LVC.cause under the PARSEME 1.1 guidelines, though it's such a productive construction that I'd be reluctant to call these MWEs.

Revisit predicative/copular MWEs

egrep -v '^$' streusle.conllulex | egrep -v '^#' | cut -f13 | egrep '^be ' | sort | uniq -c
   1 be a big baby
   3 be a joke
   1 be a nice touch
   1 be a no brainer
   1 be a pain
   1 be a plus
   1 be first call
   1 be happy camper
   1 be in
   2 be in for a treat
   2 be in hand
   1 be inclined
   1 be make to
   1 be no more
   1 be out of this world
   1 be rude
   1 be say and do
   8 be suppose to
   1 be sure
   1 be sure to
   1 be there
   1 be there / do that
   1 be through
  16 be to
   1 be up

Evaluating lextag tagging performance

Hi!

I'd like to build a system to predict each token's lextag. I think the evaluation script for this is streuseval.py?

If so, it doesn't seem to be part of the latest release. Also, is the data the same between 4.1 and the master ref? Not sure what the release cycle looks like for STREUSLE, but it could be nice to have a minor release with all the improvements since last July :)
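(For what it's worth, a toy per-token accuracy over lextags is easy to compute; the official streuseval.py scores MWE and supersense structure more carefully than this:)

    # Illustrative per-token lextag accuracy (not the official evaluation).
    def lextag_accuracy(gold_tags, pred_tags):
        assert len(gold_tags) == len(pred_tags)
        return sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)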

Constructions in STREUSLE

Use this thread to make note of interesting constructions where a words-with-spaces MWE analysis is unsatisfying because there is constrained productivity in certain parts of the expression.

`$ makes it easy to find idioms with an opaque possessive slot (e.g. "quick on X's feet").

so that

Should probably be so_that when used as SCONJ

govobj of "to" in "3 to 4"

"I've been here [3 to 4] times" ( 325292.2)

"to" currently has gov "4" and obj null, but should have gov "3" and obj "4".

related to #38

User-friendly concordance format and token update script

For revising certain classes of annotations (e.g., P supersenses where the scene role is Manner) it would be useful to have a concordance view. This would put a token's context on the same line for easy sorting and batch editing, giving a more human-readable view of the lexical annotation.

Does tquery.py already do this? Should it be run when building the release to produce a row for every supersense-annotated strong lexical expression, within the train/dev/test subdirectories? This would make diffs in commit history easier to read. (Not having this in the root directory would make it clear that .conllulex is the canonical data file.)

There would need to be a script to apply supersense edits made in the concordance view back to the original. untquery.py? tupdate.py?

Is there a natural way to specify MWE edits in the concordance view, also? Currently, adding an MWE or changing the strength of an existing MWE is painful to implement by hand in .conllulex.
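A minimal sketch of the concordance row format imagined above (hypothetical; tquery.py may already produce something similar):

    def concordance_row(sent_id, words, i, width=30):
        # One tab-separated line per annotated token: sentence ID,
        # right-aligned left context, the token itself, right context.
        left = ' '.join(words[:i])[-width:]
        right = ' '.join(words[i+1:])[:width]
        return f"{sent_id}\t{left:>{width}}\t{words[i]}\t{right:<{width}}"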

Govobj extraction: edge cases involving coordination, approximators, directional particle + PP combinations, etc.

govobj.py can be improved to deal with various syntactic edge cases, some of which result in the undesirable property of listing a SNACS-tagged unit as governed by another SNACS-tagged unit:

  1. Currently, governors of coordinated Ps or PPs are misleading. Better to use Enhanced Dependencies to get the more semantic governor.
  2. Approximators ("ABOUT/AROUND/LESS_THAN/OVER 5 minutes", "BETWEEN 5 and 10 minutes") are treated as having a governor but no object. Though these constructions are syntactically weird, because these prepositions are rarely intransitive in other contexts, better to treat these as having an object but no governor.
  3. Comparative AS-AS: In UD, the second AS-phrase is treated as a dependent of the first AS. Thus "pay twice AS1 much AS2 they tell you" is currently analyzed as AS1(gov=much, obj=null), AS2(gov=as, obj=tell). Instead, do AS1:p.Cost~>p.Extent(gov=pay, obj=much), AS2:p.ComparisonRef(gov=much, obj=tell).
  4. In some cases, a directional adverb/particle (AWAY/BACK/DOWN/HOME/OUT/OVER) is treated in UD as having a PP complement even though the adverb can be omitted (UniversalDependencies/docs#570): "got BACK FROM france", "OVER BY 16th and 15th". In extracting governors, better to treat these as sisters.
  5. In some cases, the governor is not being retrieved correctly for SNACS expressions that are syntactically analyzed as adverb-modifying-adverb, e.g. "out there" advmod(there, out), "back home".
  6. AGO is analyzed as a postposition in SNACS, but an adverb in UD, hence there needs to be a special rule to extract the object.

Note that a rare legitimate case of a SNACS expression governing another SNACS expression is in a predicate complement: "they were OUT(g) FROM surgery", "I was IN(g) two weeks AGO".
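To illustrate item 2 above, a hypothetical post-processing pass (not govobj.py's actual code; the 'heuristic_relation' field name and structure are assumptions based on the JSON format):

    APPROXIMATORS = {'about', 'around', 'less than', 'over', 'between'}

    def fix_approximator(expr):
        # If an approximator was analyzed with a governor but no object,
        # reinterpret the governor as its object instead.
        rel = expr.get('heuristic_relation', {})
        if expr.get('lexlemma') in APPROXIMATORS and rel.get('obj') is None:
            rel['obj'], rel['gov'] = rel.get('gov'), None
        return expr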

govobj for AS-AS SMWEs

"as soon as" currently has "soon" as its obj, but since it is a strong MWE, its obj should be the obj of the second "as".

MWE lemma heuristics: use goeswith

We annotate MWEs where there are superfluous authorial spaces ("miss informed", "mean time"), but the lemma retains the space. The UD relation goeswith should be exploited to delete the space. We should also consider enforcing consistency between the MWE annotation and goeswith; right now there are goeswith annotations without corresponding MWE annotations.
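A sketch of the proposed heuristic (a hypothetical helper; the 'deprel' field name follows UD/CoNLL-U conventions):

    def fix_mwe_lemma(mwe_toks, lexlemma):
        # If any token in the MWE attaches via the UD `goeswith` relation
        # (a superfluous authorial space), drop the space from the MWE
        # lemma: "mean time" -> "meantime".
        if any(t.get('deprel') == 'goeswith' for t in mwe_toks):
            return lexlemma.replace(' ', '')
        return lexlemma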

Unannotated Tokens

These are tokens with a null-valued function or role. We should figure out what to do with these on the Xposition site.

reviews-008635-0002 with
reviews-010820-0003 in
reviews-010820-0011 on
reviews-017235-0005 for
reviews-021370-0001 for
reviews-026641-0003 with
reviews-029870-0002 for
reviews-030430-0002 from
reviews-034320-0003 to
reviews-035726-0002 in
reviews-039383-0004 on
reviews-042416-0005 to
reviews-045753-0001 in
reviews-053248-0017 my
reviews-081796-0005 for
reviews-081934-0003 since
reviews-081934-0003 to
reviews-081934-0004 as
reviews-081934-0006 to
reviews-088954-0001 of
reviews-093655-0007 as
reviews-121651-0003 on
reviews-158740-0002 of
reviews-160073-0001 on
reviews-163250-0004 for
reviews-192713-0008 like
reviews-193257-0003 to
reviews-207629-0003 for
reviews-211797-0004 like
reviews-217359-0006 for
reviews-225632-0005 to
reviews-228731-0001 with
reviews-228944-0003 on
reviews-311138-0003 to
reviews-323449-0007 to
reviews-326649-0006 like
reviews-329692-0011 on
reviews-332068-0002 for
reviews-333672-0006 as
reviews-336049-0004 out
reviews-339176-0006 with
reviews-348247-0006 for
reviews-372665-0004 with
reviews-376503-0004 with
reviews-377347-0005 like
reviews-382257-0002 in
reviews-391012-0006 through

Format extension: incorporating annotator notes?

The version of STREUSLE in Xposition contains some annotator notes on P tokens that are not included in the official release. The notes can help clarify the interpretation of the text, provide the annotator's rationale, or help cluster different usages at a finer level of granularity than the supersenses.

Should the .conllulex format have a place for these? An extra column? Or maybe a sentence header row, as they are rare?

Should there also be a standard for releasing rich annotation history metadata (such as who annotated which token, original vs. adjudicated annotations, timestamps, ...)?

Script to sync .conllu portion of .conllulex with UD_English dev branch

I think the UD_English repository may contain some recent syntactic fixes (lemmas, tags, trees) that have not been incorporated into streusle.conllulex. Need a script to take the not-to-release/sources/reviews/*.conllu files and streusle.conllulex and simply replace the first 10 columns of the latter with the former.

After running the script, be sure to examine the diff to ensure there weren't local fixes that got clobbered. Some may be due to outstanding pull requests: https://github.com/UniversalDependencies/UD_English/pulls

even if, even though, not even

These are sometimes analyzed as MWEs, but the annotations are inconsistent.

Note that "not even if" occurs, which would be problematic if not_even and even_if are both treated as MWEs.
