STIFF - Sense Tagged Instances For Finnish

This repository contains code to automatically create a tagged sense corpus from OpenSubtitles2018. It also contains a substantial amount of corpus-wrangling code, most notably code to convert the (CC-NC-licensed) EuroSense into a format usable by finn-wsd-eval.

Set up

You will need HFST and OMorFi installed globally before beginning, since neither is currently installable from PyPI. You will also need Poetry. You can then run

$ ./install.sh

(Only partially tested) Conversion pipelines and evaluation using the Makefile

There is a Makefile; reading its source is a recommended next step after this README. It defines variables, with defaults, for most file paths. Overriding them is convenient when you supply intermediate steps or upstream corpora yourself, when you want outputs in a particular place, or when running under Docker, where you may want bind mounts to make these files appear on the host.
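For example, an invocation supplying a premade raw corpus and placing outputs in a scratch directory might look like the following (the variable names here are hypothetical; check the Makefile for the real ones):

make wsd-eval STIFF_RAW_XML=/data/stiff.raw.xml.zst WORK_DIR=/scratch/stiff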

Make STIFF or EuroSense into data for finn-wsd-eval

You can make the data needed for finn-wsd-eval by running:

make wsd-eval

which will make the STIFF and EuroSense WSD evaluation corpora, including trying to fetch all dependencies. However,

  1. It will take a long time. The longest step is building STIFF from scratch, which can take around two weeks. To speed things up, you can supply a premade stiff.raw.xml.zst downloaded from here (TODO).
  2. It will not fetch one dependency with restrictions upon it: BABELWNMAP.

Obtaining BABELWNMAP

You will next need to set the environment variable BABELWNMAP to the path of a TSV file mapping BabelNet synsets to WordNet synsets (a sketch of reading such a file follows the list below). You can do one of the following:

  1. Obtain the BabelNet indices by following these instructions and dump out the TSV by following the instructions at https://github.com/frankier/babelnet-lookup
  2. If you are affiliated with a research institution, I have permission to send you the TSV file, but you must send me a direct communication from your institutional email address. (Please briefly state your position/affiliation and your non-commercial research use in the email so there is a record.)
  3. Alternatively (subject to the same conditions), if you prefer, I can just send you eurosense.unified.sample.xml and eurosense.unified.sample.key.
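For illustration, here is a minimal sketch of loading such a mapping into a dict (the exact column layout is an assumption; check your dump, which may carry extra columns):

import csv

def load_babel2wn(path):
    # Assumption: one BabelNet synset ID and one WordNet synset per row,
    # tab-separated. Adjust if your dump's layout differs.
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0]: row[1] for row in csv.reader(f, delimiter="\t")}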

Make STIFF and EuroSense P/R plot

Run:

make corpus-eval

(OLD) Example conversion pipelines and evaluation

Both of the following pipelines first create a corpus tagged in the unified format, which consists of an XML file and a key file, and then create a directory containing the files needed by finn-wsd-eval.
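For orientation, the key file maps each instance ID to its sense keys, one instance per line. A minimal sketch of reading one (the exact field layout is an assumption, based on typical WSD .key files):

def read_key(path):
    # Assumption: each line is an instance ID followed by one or more
    # whitespace-separated sense keys.
    with open(path, encoding="utf-8") as keyf:
        return {
            parts[0]: parts[1:]
            for parts in (line.split() for line in keyf)
            if parts
        }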

STIFF Pipeline

Fetch OpenSubtitles2018

poetry run python scripts/fetch_opensubtitles2018.py cmn-fin

Make raw STIFF

poetry run python scripts/pipeline.py mk-stiff cmn-fin stiff.raw.xml.zst

Make recommended STIFF variant + convert ➡️ Unified

poetry run python scripts/variants.py proc bilingual-precision-4 stiff.raw.xml.zst stiff.bp4.xml.zst
./stiff2unified.sh stiff.bp4.xml.zst stiff.unified.bp4.xml stiff.unified.bp4.key

EuroSense Pipeline

EuroSense ➡️ Unified

You will first need to obtain EuroSense. Since there are some language-tagging issues with the original, I currently recommend you use a version I have attempted to fix.

You will next need to set the environment variable BABEL2WN_MAP to the path of a TSV file mapping BabelNet synsets to WordNet synsets. You can do one of the following:

  1. Obtain the BabelNet indices by following these instructions and dump out the TSV by following the instructions at https://github.com/frankier/babelnet-lookup
  2. If you are affiliated with a research institution, I have permission to send you the TSV file, but you must send me a direct communication from your institutional email address. (Please briefly state your position/affiliation and your non-commercial research use in the email so there is a record.)
  3. Alternatively (subject to the same conditions), if you prefer, I can just send you eurosense.unified.sample.xml and eurosense.unified.sample.key.

Then run:

poetry run python scripts/pipeline.py eurosense2unified \
  /path/to/eurosense.v1.0.high-precision.xml eurosense.unified.sample.xml \
  eurosense.unified.sample.key

Process finn-man-ann

First obtain finn-man-ann.

Then run:

poetry run python scripts/munge.py man-ann-select --source=europarl \
  ../finn-man-ann/ann.xml - \
  | poetry run python scripts/munge.py lemma-to-synset - man-ann-europarl.xml
poetry run python scripts/munge.py man-ann-select --source=OpenSubtitles2018 \
  ../finn-man-ann/ann.xml man-ann-opensubs18.xml

Make STIFF or EuroSense into data for finn-wsd-eval

This makes a directory usable by finn-wsd-eval.

Old

poetry run python scripts/pipeline.py unified-to-eval \
  /path/to/stiff-or-eurosense.unified.xml /path/to/stiff-or-eurosense.unified.key \
  stiff-or-eurosense.eval/

New

TODO: STIFF

poetry run python scripts/filter.py tok-span-dom man-ann-europarl.xml \
  man-ann-europarl.filtered.xml
poetry run python scripts/pipeline.py stiff2unified --eurosense \
  man-ann-europarl.filtered.xml man-ann-europarl.uni.xml man-ann-europarl.uni.key
poetry run python scripts/pipeline.py stiff2unified man-ann-opensubs18.xml \
  man-ann-opensubs18.uni.xml man-ann-opensubs18.uni.key
poetry run python scripts/pipeline.py unified-auto-man-to-evals \
  eurosense.unified.sample.xml man-ann-europarl.uni.xml \
  eurosense.unified.sample.key man-ann-europarl.uni.key eurosense.eval

Make STIFF and EuroSense P/R plot

First process finn-man-ann.

Gather STIFF eval data

poetry run python scripts/variants.py eval /path/to/stiff.raw.zst stiff-eval-out
poetry run python scripts/eval.py pr-eval --score=tok \
  <(poetry run python scripts/munge.py man-ann-select \
    --source=OpenSubtitles2018 /path/to/finn-man-ann/ann.xml -) \
  stiff-eval-out stiff-eval.csv

Gather EuroSense eval data

poetry run python scripts/munge.py man-ann-select --source=europarl \
  /path/to/finn-man-ann/ann.xml - \
  | poetry run python scripts/munge.py lemma-to-synset - man-ann-europarl.xml
mkdir eurosense-pr
mv /path/to/eurosense/high-precision.xml eurosense-pr/EP.xml
mv /path/to/eurosense/high-coverage.xml eurosense-pr/EC.xml
poetry run python scripts/eval.py pr-eval --score=tok man-ann-europarl.xml eurosense-pr europarl.csv

Plot on common axis

Warning: the plot may be misleading...

poetry run python scripts/eval.py pr-plot stiff-eval.csv europarl.csv

Organisation & usage

For help using the tools, try running them with --help. The main entry points are in scripts/.

Innards

  • scripts/tag.py: Produce an unfiltered STIFF
  • scripts/filter.py: Filter STIFF according to various criteria
  • scripts/munge.py: Convert between different corpus/stream formats

Wrappers

  • scripts/stiff2unified.sh: Convert from STIFF format to the unified format
  • scripts/pipeline.py: Various pipelines composing multiple layers of filtering/conversion

Top level

The Makefile and Makefile.manann.

Issues

There should be one TaggedLemma per multiword form

Currently a TaggedLemma can contain, for example, both Lemma('sincerely.r.01.真诚地') and Lemma('sincerely.r.01.真诚+地') -- but these should probably be considered different lemmatisations (note that differences due only to Chinese normalisation should still fall within the same TaggedLemma).
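A rough sketch of the intended invariant, using a plain dict as a stand-in for the real TaggedLemma structure (normalise_zh is a hypothetical helper that collapses only Chinese normalisation differences):

from collections import defaultdict

def group_lemmatisations(lemma_names, normalise_zh):
    # Each distinct segmentation (真诚地 vs 真诚+地) gets its own bucket,
    # while names differing only by Chinese normalisation share one.
    buckets = defaultdict(list)
    for name in lemma_names:
        buckets[normalise_zh(name)].append(name)
    return buckets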

Conversion to unified causes duplication of trailing parts of MWEs

Example of viimeistä luokkaa from bp4, in STIFF format and in unified format (luokkaa is doubled!):

<sentence id="284">
<text id="fi-tok" lang="fi">Sinun ei tarvitsisi kerrata viimeistä luokkaa .</text>
<gram type="finnpos" for="fi-tok">[["sin\u00e4", {"pos": "PRONOUN", "subcat": "PERSONAL", "pers": "SG2", "num": "SG", "case": "GEN"}], ["ei", {"pos": "VERB", "subcat": "NEG", "voice": "ACT", "pers": "SG3"}], ["tarvita", {"pos": "VERB", "voice": "ACT", "mood": "COND", "neg": "CON"}], ["kerrata", {"pos": "VERB", "voice": "ACT", "inf": "A", "case": "LAT"}], ["viimeinen", {"pos": "ADJECTIVE", "num": "SG", "case": "PAR"}], ["luokka", {"pos": "NOUN", "num": "SG", "case": "PAR"}], [".", {"pos": "PUNCTUATION"}]]</gram>
<annotations>
<annotation id="13" type="stiff" support="transfer-type=unaligned&amp;transform-chain=%5B%5D&amp;transfer-from-wordnets=qwc&amp;transfer-from-source=zh-untok&amp;transfer-from-lemma-path=whole&amp;transfer-from-anchor-positions=from-id%3Dzh-untok%26char%3D0&amp;transfer-from-anchor-char-length=1" rank="1" freq="50921640" lang="fi" anchor="ei" anchor-positions="from-id=fi-tok&amp;char=6&amp;token=1&amp;token-length=1" lemma="ei" wnlemma="l=ei&amp;wn=fin,qwf,qf2" wordnets="fin qwf qf2" lemma-path="whole,omor,recurs,finnpos">not.r.01 ei.r.02</annotation>
<annotation id="51" type="stiff" rank="1" freq="27720" lang="fi" anchor="viimeistä luokkaa" anchor-positions="from-id=fi-tok&amp;char=28&amp;token=4&amp;token-length=2" lemma="viimeinen luokka" wnlemma="l=viimeinen_luokka&amp;wn=fin,qf2" wordnets="fin qf2" lemma-path="omor,recurs,finnpos omor,recurs,finnpos">graduating_class.n.01 valmistuva_luokka.n.01</annotation>
<annotation id="64" type="stiff" support="transfer-type=unaligned&amp;transform-chain=%5B%5D&amp;transfer-from-wordnets=qcn&amp;transfer-from-source=zh-untok&amp;transfer-from-lemma-path=whole&amp;transfer-from-anchor-positions=from-id%3Dzh-untok%26char%3D4&amp;transfer-from-anchor-char-length=1" rank="6" freq="55440" lang="fi" anchor="luokkaa" anchor-positions="from-id=fi-tok&amp;char=38&amp;token=5&amp;token-length=1" lemma="luokka" wnlemma="l=luokka&amp;wn=fin,qf2" wordnets="fin qf2" lemma-path="omor,recurs,finnpos">class.n.05 luokka.n.04</annotation>
</annotations>
</sentence>

</sentence><sentence id="stiff.0000133751.000.00000284">
<wf>Sinun</wf>
<instance lemma="ei" pos="ADV" id="stiff.0000133751.000.00000284.00000000">ei</instance>
<wf>tarvitsisi</wf>
<wf>kerrata</wf>
<instance lemma="viimeinen_luokka" pos="NOUN" id="stiff.0000133751.000.00000284.00000001">viimeistä luokkaa</instance>
<wf>luokkaa</wf>
<wf>.</wf>
</sentence><sentence id="stiff.0000133751.000.00000285">

Smart lemma/POS tournament

@filter.command("finnpos-smart-lemma-pos-dom")
@click.argument("inf", type=click.File("rb"))
@click.argument("outf", type=click.File("wb"))
def finnpos_smart_lemma_pos_dom(inf, outf):
    """
    FinnPOS dominance filter: use FinnPOS annotations to support certain
    annotations over others, in terms of POS or lemma.

    Smart lemma + POS filter. Two things to consider:

     * Either the lemma and POS agree with the FinnPOS analysis (the lemma
       can be generated from the part of the segmentation matching the
       FinnPOS analysis, and the POS either matches the WordNet lemma or is
       an idiomatic word-form POS),
     * or agreement between the OMorFi analysis and the WordNet POS is taken
       into account.

    Whether the FinnWN entry has an idiomatic POS could also be taken into
    account.
    """
    pass

P/R calculations are incorrect

True positives are currently exclusive with the other counts. They should instead be calculated per span, removing the matched gold annotations from consideration, so that guessing another annotation not in this set adds to the other counts.
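A sketch of the intended counting, under assumed data structures (gold and guess map each span to a set of annotations; this is not the real eval.py code):

def pr_counts(gold, guess):
    tp = fp = fn = 0
    for span in gold.keys() | guess.keys():
        gold_anns = gold.get(span, set())
        guess_anns = guess.get(span, set())
        # Matched gold annotations are removed from consideration, so a
        # further guess outside the gold set counts against precision.
        tp += len(gold_anns & guess_anns)
        fp += len(guess_anns - gold_anns)
        fn += len(gold_anns - guess_anns)
    return tp, fp, fn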

Only unaligned supports of Finnish are currently produced

e.g.

<annotation id="7" type="stiff" support="transfer-type=unaligned&amp;transfer-from=365&amp;transform-chain=%5B%5D transfer-type=unaligned&amp;transfer-from=237&amp;transform-chain=%5B%27deriv%27%5D" rank="22" freq="0" lang="fi" anchor="tehtiin" anchor-positions="from-id=fi-tok&amp;char=16&amp;token=1&amp;token-length=1" lemma="tehdä" wnlemma="tehdä" wordnets="fin qf2" lemma-path="omor,recurs,finnpos">carry_through.v.01 toteuttaa.v.07</annotation>

stiff-to-unified strips leading/trailing punctuation from instances

Probably needs some thought.

  • Should the text be retokenised to include this punctuation in a separate token?
    • If so, should it be marked that there was no space in the text? Probably not, since this isn't the case for the previous tokenisation step.
    • Should this retokenisation take place in an ad-hoc way (only for instances), or should a FinnPOS retokenisation be done instead?
  • How does this interact with instances which are part of a compound word?
  • What about when/if segmentation is applied? Segmentation is marked as happening, but does this mean it should be marked for punctuation too? Probably not.

# XXX: This approach just deletes the leading punctuation.

# XXX: This approach just deletes the trailing punctuation.

May need to take #16 into consideration.

Possibility of doing span tournament on the source language

An annotation which dominates on the source side, either by simply being longer or by actually outspanning the other, should be a better support and therefore a better annotation on the target side.

class SourceAnchorLengthDom(SpanKeyMixin, Tournament):
    @staticmethod
    def rank(ann):
        # Proposed: rank annotations by the length of their source-side
        # anchor, so longer source anchors dominate.
        pass


class SourceAnchorSpanDom(SpanKeyMixin, Tournament):
    @staticmethod
    def rank(ann):
        # Proposed: rank annotations by whether their source-side anchor
        # spans (contains) the anchors of competing annotations.
        pass

WordNetFin should normalise different comma escapings

Currently we can get both:

vastaajan_oikeus_saada_tuomioistuin_haastamaan_oikeuteen_todistaja,_joka_antaa_tapahtuneeseen_perustuvan_lausunnon,n

and, if the "\" is kept:

vastaajan_oikeus_saada_tuomioistuin_haastamaan_oikeuteen_todistaja\,_joka_antaa_tapahtuneeseen_perustuvan_lausunnon\,n

Clearly the OMW reader is doing one thing and the normal WordNet reader is doing another.
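A minimal sketch of the kind of normalisation that would make the two forms compare equal (assuming the only divergence is the backslash escaping of commas):

def normalise_lemma(lemma):
    # Collapse the escaped form ("\,") kept by one reader to the
    # unescaped form (",") produced by the other.
    return lemma.replace("\\,", ",")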

How well does retokenised input work with the POS tournament?

Example:

" Laula. lhan sama mitä . "

[[""", {"pos": "PUNCTUATION", "subcat": "QUOTATION", "position": "INITIAL"}], ["laulaa", {"pos": "VERB", "voice": "ACT", "mood": "IMPV", "pers": "SG2"}], [".", {"pos": "PUNCTUATION"}], ["l", {"pos": "NOUN", "subcat": "ABBREVIATION", "num": "SG", "case": "GEN"}], ["sama", {"pos": "PRONOUN", "subcat": "QUANTOR", "num": "SG", "case": "NOM"}], ["mik\u00e4", {"pos": "PRONOUN", "subcat": "RELATIVE", "case": "PAR"}], [".", {"pos": "PUNCTUATION"}], [""", {"pos": "PUNCTUATION", "subcat": "QUOTATION", "position": "FINAL"}]]

When the POS tournament is applied, are things zipped up incorrectly?

Doubts about lemma/idiom extraction of adpositions

This works:

def test_humalassapa():
    extractor = get_extractor("FinExtractor")
    tagging1 = extractor.extract("humala")
    print(tagging1.tokens)
    tagging2 = extractor.extract("humalassa")
    print(tagging2.tokens)
    tagging3 = extractor.extract("humalassapa")
    print(tagging3.tokens)

But what if humalassa wasn't in OMorFi?

Stop blocking on FinnPOS

Reading from FinnPOS accounts for around 20% of the time spent by tag.py; if data were shovelled in through a buffer, this could possibly be eliminated entirely.
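One possible shape for this, feeding FinnPOS from a background thread so that writes never block reads (a sketch; the command and framing are assumptions, not the actual tag.py integration):

import subprocess
import threading

def finnpos_lines(input_lines, cmd=("ftb-label",)):
    # Launch the tagger and shovel input in from a separate thread so the
    # main thread can consume output without ever blocking on writes.
    proc = subprocess.Popen(
        cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )

    def feed():
        for line in input_lines:
            proc.stdin.write(line + "\n")
        proc.stdin.close()

    threading.Thread(target=feed, daemon=True).start()
    yield from proc.stdout
    proc.wait()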

Is it possible to get more detailed information from FinnPOS to improve filtering?

Currently some words will be tagged with a POS based on only part of the word, e.g.:

$ ftb-label
FinnTreeBank tagger (v0.1-alpha-150-g82bce74) using OMorFi and FinnPos

/usr/local/bin/hfst-optimized-lookup: Reading from STDIN. Writing to STDOUT.
/usr/local/bin/finnpos-label: Loading tagger.
/usr/local/bin/finnpos-ratna-feats.py: Reading from STDIN. Writing to STDOUT.
/usr/local/bin/omorfi2finnpos.py: Reading from STDIN. Writing to STDOUT
1
1
/usr/local/bin/finnpos-label: Reading from STDIN. Writing to STDOUT.
Minä
olin
humalassapa
.
Minä	_	minä	[POS=PRONOUN]|[SUBCAT=PERSONAL]|[PERS=SG1]|[NUM=SG]|[CASE=NOM]	_
olin	_	olla	[POS=VERB]|[VOICE=ACT]|[MOOD=INDV]|[TENSE=PAST]|[PERS=SG1]	_
humalassapa	_	humalassa	[POS=PARTICLE]|[CLIT=PA]	_
.	_	.	[POS=PUNCTUATION]	_

Noting that humalassapa has a particle is all very well, but what we actually want to know is that humalassa is adverb-like and humala is a noun, since these are the headwords which exist in Wiktionary.
