
crosslingual-coreference's Introduction

Hi there 👋

From failing to study medicine ➡️ BSc industrial engineer ➡️ MSc computer scientist.
Life can be strange, so better enjoy it.
I'm sure I do by: 👨🏽‍🍳 Cooking, 👨🏽‍💻 Coding, 🏆 Committing.

Conference slides 📖

employers 👨🏽‍💻

  • Hugging Face 🤗 (2024-current) - The AI community building the future
  • Argilla (2022-current) - data annotation and monitoring for enterprise NLP
  • Pandora Intelligence (2020-2022) - an independent intelligence company, specialized in security risks

open source ⭐️

maintainer 🤓

contributions 🫱🏾‍🫲🏼

volunteering 🌍

  • Bonfari - small to medium-scale sustainable projects in Gambia 🇬🇲
  • 510 red-cross - occasional projects to improve humanitarian aid with data

Contacts

Gmail LinkedIn Twitter

crosslingual-coreference's People

Contributors

davidberenstein1957, davidfrompandora, dvsrepo, martin-kirilov


crosslingual-coreference's Issues

Retrieving cluster heads without replacing corefs

I am interested in being able to extract the cluster heads, for example via something like doc._.coref_cluster_heads, without producing the reconstituted (resolved) text. It could also be a separate function whose output feeds into replace_corefs.
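
For reference, a minimal sketch of how this could look today, assuming doc._.coref_clusters holds inclusive [start, end] token-index pairs per cluster (as in the predictor output quoted further down) and using each cluster's first mention as its head. The helper name and the first-mention heuristic are illustrative assumptions, not the library's own head-selection logic:

import spacy
import crosslingual_coreference  # registers the "xx_coref" pipeline factory

def cluster_heads(doc):
    """Hypothetical helper: map each cluster's first mention to its token range."""
    heads = {}
    for cluster in doc._.coref_clusters:
        start, end = cluster[0]  # first mention, inclusive token indices (assumed)
        heads[doc[start:end + 1].text] = [start, end]
    return heads

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": -1})
doc = nlp("Do not forget about Momofuku Ando! He created instant noodles in Osaka.")
print(cluster_heads(doc))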

feat: look into ONNX enhanced transformer embeddings

Creating embeddings takes roughly 50% of the inference time. allennlp/modules/token_embedders/pretrained_transformer_embedder.py holds the logic for creating these embeddings. Make sure we can call it in a faster way.
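
A rough starting point, not the repo's implementation: export the underlying Hugging Face transformer to ONNX and benchmark it against the AllenNLP embedder. The checkpoint name and opset below are assumptions.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "microsoft/Multilingual-MiniLM-L12-H384"  # assumed embedder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

inputs = tokenizer("Momofuku Ando created instant noodles.", return_tensors="pt")

# Export the embedding forward pass; batch and sequence dimensions stay dynamic.
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "embedder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "last_hidden_state": {0: "batch", 1: "seq"},
    },
    opset_version=14,
)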

installation failed: 'notebook' has no attribute 'nbextensions'

I tried to install it in a venv:
(.spaCy) PS C:\Users\joajo\Documents> pip --version
pip 23.2.1 from C:\Users\joajo\Documents\.spaCy\Lib\site-packages\pip (python 3.11)

 AttributeError: module 'notebook' has no attribute 'nbextensions'
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Support for spaCy 3.4.0

Hi, I would like to use this nice package with Dutch language models that only work with spaCy 3.4.0+. How difficult would it be to support spaCy 3.4.0?

Issue installing `crosslingual-coreference`

Chip: Apple M1
MacOS Sonoma, Version 14.3
Python version: 3.11
Rust Compiler installed already.

Building wheels for collected packages: tokenizers
Building wheel for tokenizers (pyproject.toml) ... error
error: subprocess-exited-with-error

× Building wheel for tokenizers (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [586 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build/lib.macosx-14.3-arm64-cpython-310
creating build/lib.macosx-14.3-arm64-cpython-310/tokenizers
copying py_src/tokenizers/__init__.py -> build/lib.macosx-14.3-arm64-cpython-310/tokenizers
creating build/lib.macosx-14.3-arm64-cpython-310/tokenizers/models
copying py_src/tokenizers/models/__init__.py -> build/lib.macosx-14.3-arm64-cpython-310/tokenizers/models
creating build/lib.macosx-14.3-arm64-cpython-310/tokenizers/decoders
copying py_src/tokenizers/decoders/__init__.py -> build/lib.macosx-14.3-arm64-cpython-310/tokenizers/decoders
creating build/lib.macosx-14.3-arm64-cpython-310/tokenizers/normalizers
copying py_src/tokenizers/normalizers/__init__.py -> build/lib.macosx-14.3-arm64-cpython-310/tokenizers/normalizers
creating build/lib.macosx-14.3-arm64-cpython-310/tokenizers/pre_tokenizers
...
  warning: variable does not need to be mutable
     --> tokenizers-lib/src/models/unigram/model.rs:265:21
      |
  265 |                 let mut target_node = &mut best_path_ends_at[key_pos];
      |                     ----^^^^^^^^^^^
      |                     |
      |                     help: remove this `mut`
      |
      = note: `#[warn(unused_mut)]` on by default

  warning: variable does not need to be mutable
     --> tokenizers-lib/src/models/unigram/model.rs:282:21
      |
  282 |                 let mut target_node = &mut best_path_ends_at[starts_at + mblen];
      |                     ----^^^^^^^^^^^
      |                     |
      |                     help: remove this `mut`

  warning: variable does not need to be mutable
     --> tokenizers-lib/src/pre_tokenizers/byte_level.rs:200:59
      |
  200 |     encoding.process_tokens_with_offsets_mut(|(i, (token, mut offsets))| {
      |                                                           ----^^^^^^^
      |                                                           |
      |                                                           help: remove this `mut`

  error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell`
     --> tokenizers-lib/src/models/bpe/trainer.rs:526:47
      |
  522 |                     let w = &words[*i] as *const _ as *mut _;
      |                             -------------------------------- casting happend here
  ...
  526 |                         let word: &mut Word = &mut (*w);
      |                                               ^^^^^^^^^
      |
      = note: for more information, visit <https://doc.rust-lang.org/book/ch15-05-interior-mutability.html>
      = note: `#[deny(invalid_reference_casting)]` on by default

  warning: `tokenizers` (lib) generated 3 warnings
  error: could not compile `tokenizers` (lib) due to previous error; 3 warnings emitted

  Caused by:
    process didn't exit successfully: `rustc --crate-name tokenizers --edition=2018 tokenizers-lib/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no --cfg 'feature="cached-path"' --cfg 'feature="clap"' --cfg 'feature="cli"' --cfg 'feature="default"' --cfg 'feature="http"' --cfg 'feature="indicatif"' --cfg 'feature="progressbar"' --cfg 'feature="reqwest"' -C metadata=02b35ef3d3a318c6 -C extra-filename=-02b35ef3d3a318c6 --out-dir /private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps -L dependency=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps --extern aho_corasick=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libaho_corasick-1f16d7cbc1e140f2.rmeta --extern cached_path=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libcached_path-4aca8b2bc71340df.rmeta --extern clap=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libclap-6026cc4aa25a0aba.rmeta --extern derive_builder=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libderive_builder-3ec426cf16b9ba1a.dylib --extern dirs=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libdirs-1da7806f8cc3f346.rmeta --extern esaxx_rs=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libesaxx_rs-5aaf1f019751a9f2.rmeta --extern indicatif=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libindicatif-75757f2d3df8bc84.rmeta --extern itertools=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libitertools-e26f459727927c7c.rmeta --extern lazy_static=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/liblazy_static-a035a15073af7f16.rmeta --extern log=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/liblog-67209b2521d51704.rmeta --extern macro_rules_attribute=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libmacro_rules_attribute-efcdd1618729d88c.rmeta --extern onig=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libonig-55e025d9cd10ef1d.rmeta --extern paste=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libpaste-cbb23ac2fa72fb9c.dylib --extern 
rand=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/librand-0b09f74aae2d6e40.rmeta --extern rayon=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/librayon-30990235a038b514.rmeta --extern rayon_cond=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/librayon_cond-2483fbb1a8df1d3e.rmeta --extern regex=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libregex-7fcb56abc3e47088.rmeta --extern regex_syntax=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libregex_syntax-66179a6cbc8e8d90.rmeta --extern reqwest=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libreqwest-beeb0f83a3b16ee5.rmeta --extern serde=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libserde-289f63c404915bfb.rmeta --extern serde_json=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libserde_json-ca3e386be1e1db16.rmeta --extern spm_precompiled=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libspm_precompiled-f9acdce561af88ff.rmeta --extern thiserror=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libthiserror-a91213e24d296260.rmeta --extern unicode_normalization_alignments=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libunicode_normalization_alignments-1aef014c0aec3a81.rmeta --extern unicode_segmentation=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libunicode_segmentation-0866dc6a2ac87bf1.rmeta --extern unicode_categories=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/deps/libunicode_categories-f44cb1f9440beb5c.rmeta -L native=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/build/bzip2-sys-0c46cf013c67825b/out/lib -L native=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/build/zstd-sys-9a4d4f9b48c0d595/out -L native=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/build/esaxx-rs-9d2de4cc92300f46/out -L native=/private/var/folders/fm/gtjtp79s6zd19lfm06zv20v80000gn/T/pip-install-8rzv94n7/tokenizers_b1953320fc9f4f9c82f975191f952762/target/release/build/onig_sys-433c7599de979028/out` (exit status: 1)
  error: `cargo rustc --lib --message-format=json-render-diagnostics --manifest-path Cargo.toml --release -v --features pyo3/extension-module --crate-type cdylib -- -C 'link-args=-undefined dynamic_lookup -Wl,-install_name,@rpath/tokenizers.cpython-310-darwin.so'` failed with code 101
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

@davidberenstein1957 could you provide further guidance on this?

HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

Python 3.8.13
Spacy - 3.1.0
en_core_web_sm-3.1.0
crosslingual_coreference - 0.2.8

requests.exceptions.SSLError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))
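
A possible workaround, not a confirmed fix: point the download at the certifi CA bundle before constructing the predictor. The import path and the minilm arguments below simply mirror the snippets elsewhere on this page.

import os
import certifi

# Make requests/urllib use certifi's CA bundle for the model download.
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()
os.environ["SSL_CERT_FILE"] = certifi.where()

from crosslingual_coreference import Predictor

predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")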

Local run

Is there a way to run this fully locally?
For example: first download all the data locally, and then run it locally via Docker.
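
A minimal sketch of the first half (pre-downloading the model archive so a Docker container does not fetch it at runtime), using the model URL quoted in the minilm issue below; where exactly the library expects the cached file is an internal detail this sketch does not cover.

import urllib.request

MODEL_URL = (
    "https://storage.googleapis.com/pandora-intelligence/models/"
    "crosslingual-coreference/minilm/model.tar.gz"
)
urllib.request.urlretrieve(MODEL_URL, "model.tar.gz")  # fetch once, bake into the image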

Error when using coref as a spaCy pipeline

Hi all,
while trying to run a spaCy test

import spacy
import crosslingual_coreference

text = """
    Do not forget about Momofuku Ando!
    He created instant noodles in Osaka.
    At that location, Nissin was founded.
    Many students survived by eating these noodles, but they don't even know him."""

# use any model that has internal spacy embeddings
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0})

doc = nlp(text)

print(doc._.coref_clusters)
print(doc._.resolved_text)

I encountered the following issue:

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Traceback (most recent call last):
  File "/home/user/test_coref/test.py", line 12, in <module>
    nlp.add_pipe(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 792, in add_pipe
    pipe_component = self.create_pipe(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 674, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 746, in resolve
    resolved, _ = cls._make(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 795, in _make
    filled, _, resolved = cls._fill(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 867, in _fill
    getter_result = getter(*args, **kwargs)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/__init__.py", line 33, in make_crosslingual_coreference
    return SpacyPredictor(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictorSpacy.py", line 18, in __init__
    super().__init__(language, device, model_name, chunk_size, chunk_overlap)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 55, in __init__
    self.set_coref_model()
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 85, in set_coref_model
    self.predictor = Predictor.from_path(self.filename, language=self.language, cuda_device=self.device)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/predictors/predictor.py", line 366, in from_path
    load_archive(archive_path, cuda_device=cuda_device, overrides=overrides),
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 232, in load_archive
    dataset_reader, validation_dataset_reader = _load_dataset_readers(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 268, in _load_dataset_readers
    dataset_reader = DatasetReader.from_params(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
    return retyped_subclass.from_params(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 636, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 206, in create_kwargs
    constructed_arg = pop_and_construct_arg(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 314, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 394, in construct_arg
    value_dict[key] = construct_arg(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 348, in construct_arg
    result = annotation.from_params(params=popped_params, **subextras)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
    return retyped_subclass.from_params(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 638, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_mismatched_indexer.py", line 58, in __init__
    self._matched_indexer = PretrainedTransformerIndexer(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 56, in __init__
    self._allennlp_tokenizer = PretrainedTransformerTokenizer(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 72, in __init__
    self.tokenizer = cached_transformers.get_tokenizer(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/cached_transformers.py", line 204, in get_tokenizer
    tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 546, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
    return cls._from_pretrained(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1923, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 140, in __init__
    super().__init__(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 110, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: EOF while parsing a list at line 1 column 4920583

Here's what I have installed (pulled by poetry add crosslingual-coreference or pip install crosslingual-coreference):

(.venv) user@host$ pip freeze
aiohttp==3.8.1
aiosignal==1.2.0
allennlp==2.9.3
allennlp-models==2.9.3
async-timeout==4.0.2
attrs==21.4.0
base58==2.1.1
blis==0.7.7
boto3==1.23.5
botocore==1.26.5
cached-path==1.1.2
cachetools==5.1.0
catalogue==2.0.7
certifi==2022.5.18.1
charset-normalizer==2.0.12
click==8.0.4
conllu==4.4.1
crosslingual-coreference==0.2.4
cymem==2.0.6
datasets==2.2.1
dill==0.3.5.1
docker-pycreds==0.4.0
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl
en-core-web-trf @ https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl
fairscale==0.4.6
filelock==3.6.0
frozenlist==1.3.0
fsspec==2022.5.0
ftfy==6.1.1
gitdb==4.0.9
GitPython==3.1.27
google-api-core==2.8.0
google-auth==2.6.6
google-cloud-core==2.3.0
google-cloud-storage==2.3.0
google-crc32c==1.3.0
google-resumable-media==2.3.3
googleapis-common-protos==1.56.1
h5py==3.6.0
huggingface-hub==0.5.1
idna==3.3
iniconfig==1.1.1
Jinja2==3.1.2
jmespath==1.0.0
joblib==1.1.0
jsonnet==0.18.0
langcodes==3.3.0
lmdb==1.3.0
MarkupSafe==2.1.1
more-itertools==8.13.0
multidict==6.0.2
multiprocess==0.70.12.2
murmurhash==1.0.7
nltk==3.7
numpy==1.22.4
packaging==21.3
pandas==1.4.2
pathtools==0.1.2
pathy==0.6.1
Pillow==9.1.1
pluggy==1.0.0
preshed==3.0.6
promise==2.3
protobuf==3.20.1
psutil==5.9.1
py==1.11.0
py-rouge==1.1
pyarrow==8.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pydantic==1.8.2
pyparsing==3.0.9
pytest==7.1.2
python-dateutil==2.8.2
pytz==2022.1
PyYAML==6.0
regex==2022.4.24
requests==2.27.1
responses==0.18.0
rsa==4.8
s3transfer==0.5.2
sacremoses==0.0.53
scikit-learn==1.1.1
scipy==1.6.1
sentence-transformers==2.2.0
sentencepiece==0.1.96
sentry-sdk==1.5.12
setproctitle==1.2.3
shortuuid==1.0.9
six==1.16.0
smart-open==5.2.1
smmap==5.0.0
spacy==3.2.4
spacy-alignments==0.8.5
spacy-legacy==3.0.9
spacy-loggers==1.0.2
spacy-sentence-bert==0.1.2
spacy-transformers==1.1.5
srsly==2.4.3
tensorboardX==2.5
termcolor==1.1.0
thinc==8.0.16
threadpoolctl==3.1.0
tokenizers==0.12.1
tomli==2.0.1
torch==1.10.2
torchaudio==0.10.2
torchvision==0.11.3
tqdm==4.64.0
transformers==4.17.0
typer==0.4.1
typing-extensions==4.2.0
urllib3==1.26.9
wandb==0.12.16
wasabi==0.9.1
wcwidth==0.2.5
word2number==1.1
xxhash==3.0.0
yarl==1.7.2

Do you have any recommendations?
Is there an installation step missing?

Thanks in advance!

How do we get character ranges of the clusters

Right now, when we call predictor.predict() we get the clusters as a list of lists, along with the cluster heads and their token indices. Is it possible to:

  • Get the cluster heads as character ranges? That is, instead of the token/word indices like [4, 5] and [12, 12] in 'cluster_heads': {'Momofuku Ando': [4, 5], 'Osaka': [12, 12], 'instant noodles':
    [9, 10], 'Many students': [22, 23], 'Nissin': [18, 18]}, can we get the character ranges?
  • Alternatively, can we get a separate variable that maps the token indices to tokens? Something like ['Do', 'not', 'forget', 'about', ...].
    I tried looking at how the text is tokenized but couldn't quite work it out from the code. Basically, for my application I need to check whether a coreference appears in a particular character range, and I would like to do that accurately (the best way being to use character ranges). See the sketch below.
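
A hedged sketch of how this can be derived today with the spaCy component, assuming the [start, end] pairs in doc._.coref_clusters are inclusive token indices into the same spaCy tokenization:

import spacy
import crosslingual_coreference  # registers "xx_coref"

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": -1})

text = "Do not forget about Momofuku Ando! He created instant noodles in Osaka."
doc = nlp(text)

# The token list the indices refer to, e.g. ['Do', 'not', 'forget', 'about', ...]
print([token.text for token in doc])

# Convert each mention's token range to a character range via a spaCy Span.
for cluster in doc._.coref_clusters:
    for start, end in cluster:
        span = doc[start:end + 1]
        print(span.text, (span.start_char, span.end_char))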

Comparatively high initial prediction time for first predict() hit

I am using the minilm model with language 'en_core_web_sm'.
When comparing prediction times, i.e. predictor.predict(text), the first call is always a bit slower than the following calls.
Suppose that after creating a predictor object, I call predict as follows:

predictor.predict(text) ---> first call
predictor.predict(text) ---> second call
predictor.predict(text) ---> third call

The first call takes comparatively longer (~0.2 s) than the subsequent calls (~0.05 s).
Could you please help me understand why this initial call takes more prediction time?
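
For what it's worth, a small measurement sketch (assuming the minilm setup above) that separates the first call from steady-state calls; one-off costs such as lazy weight loading and tokenizer caching typically show up only in that first call.

import time
from crosslingual_coreference import Predictor

predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")
text = "Momofuku Ando created instant noodles in Osaka. He is celebrated for it."

# Time each call separately; only the first should pay the warm-up cost.
for i in range(3):
    start = time.perf_counter()
    predictor.predict(text)
    print(f"call {i + 1}: {time.perf_counter() - start:.3f}s")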

Which language model is used for minilm

I am using the following code snippet for coreference resolution

predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")

While checking the source code below,

"minilm": {
        "url": (
            "https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz"
        ),
        "f1_score_ontonotes": 74,
        "file_extension": ".tar.gz",
    },

it seems that the model used here is https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

Is this the same as one of the models on https://huggingface.co/models, e.g.
https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384/tree/main,
or some other Hugging Face model?
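
One way to check, sketched under the assumption that the archive follows the usual AllenNLP layout, where a bundled config.json records the transformer model_name used by the token embedder:

import json
import tarfile
import urllib.request

MODEL_URL = (
    "https://storage.googleapis.com/pandora-intelligence/models/"
    "crosslingual-coreference/minilm/model.tar.gz"
)
urllib.request.urlretrieve(MODEL_URL, "minilm.tar.gz")

with tarfile.open("minilm.tar.gz") as archive:
    member = next(m for m in archive.getmembers() if m.name.endswith("config.json"))
    config = json.load(archive.extractfile(member))

def find_model_names(node):
    """Recursively print every 'model_name' entry in the archived config."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "model_name":
                print(value)
            find_model_names(value)
    elif isinstance(node, list):
        for item in node:
            find_model_names(item)

find_model_names(config)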

[Errno 101] Network is unreachable

Hello, when I try to run the code below

predictor = Predictor(
    language="en_core_web_sm", device=1, model_name="info_xlm"
)

I get the following error:

ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/infoxlm-base/cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff90cba1a00>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

Is this URL still valid, and if not, what should I use instead?

Why does this package need to install google cloud auth, storage, api etc?

Hi,

after installing the library I saw that google-api-core-2.10.1, google-auth-2.12.0, google-cloud-core-2.3.2 and google-cloud-storage-1.44.0 were installed as well. In fact, these packages can be found in the poetry.lock file.

Is there a reason (that I'm not seeing) why this library needs these packages?

Thanks
