bistring's Introduction

bistring

The bistring library provides non-destructive versions of common string processing operations like normalization, case folding, and find/replace. Each bistring remembers the original string, and how its substrings map to substrings of the modified version.

For example:
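A toy sketch of the idea, assuming nothing about the library's real implementation: the `TrackedString` class and its methods are hypothetical names invented here, and the alignment is modelled as a plain list of (original index, modified index) pairs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrackedString:
    """Toy model: original text, modified text, and how positions line up."""
    original: str
    modified: str
    alignment: List[Tuple[int, int]]  # (original index, modified index) pairs

    @classmethod
    def from_text(cls, text: str) -> "TrackedString":
        # Before any edits, every position aligns with itself.
        return cls(text, text, [(i, i) for i in range(len(text) + 1)])

    def replace_first(self, old: str, new: str) -> "TrackedString":
        # Replace the first occurrence of a non-empty substring.
        start = self.modified.find(old)
        if start < 0:
            return self
        end = start + len(old)
        shift = len(new) - len(old)
        modified = self.modified[:start] + new + self.modified[end:]
        # Keep boundaries outside the replaced span; drop the ones inside it.
        alignment = [(i, j) for i, j in self.alignment if j <= start]
        alignment += [(i, j + shift) for i, j in self.alignment if j >= end]
        return TrackedString(self.original, modified, alignment)

    def original_span(self, start: int, end: int) -> str:
        # Widen outwards to the nearest aligned boundaries.
        lo = max(i for i, j in self.alignment if j <= start)
        hi = min(i for i, j in self.alignment if j >= end)
        return self.original[lo:hi]


s = TrackedString.from_text("quick brown fox").replace_first("brown", "red")
pos = s.modified.index("red")
print(s.modified)                     # quick red fox
print(s.original_span(pos, pos + 3))  # brown
```

The modified text can be searched or sliced freely, and any slice of it can still be traced back to the part of the original it came from.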

Languages

bistring is available in multiple languages, currently Python and JavaScript/TypeScript. Ports to other languages are planned for the near future.

The code is structured similarly in each language to make it easy to share algorithms, tests, and fixes between them. The main differences come from trying to mirror the language's built-in string API. If you want to contribute a bug fix or a new feature, feel free to implement it in any one of the supported languages, and we'll try to port it to the rest of them.

Demo

A live demo of the bistring library is available to try in your browser.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

bistring's People

Contributors

dependabot[bot], microsoft-github-policy-service[bot], mmdixon, msftgits, tavianator

bistring's Issues

Whitespace at start of string mishandled by SentenceTokenizer?

I'm not sure if this is expected behaviour or a bug but the following code illustrates my uncertainty:

import bistring

pre_split = bistring.bistr(" \tFoo. \t\n \tBar. \t") \
    .sub(r"^\s+", "") \
    .sub(r"\s*\n\s*", "\n") \
    .sub(r"\s+$", "\n")

post_split = bistring.bistr.join(
    [s.text for s in bistring.SentenceTokenizer("en_GB").tokenize(pre_split)]
)

# These should print True but actually print False
print(pre_split == post_split)
print(pre_split.original == post_split.original)

# This should print True and does print True
print(pre_split.modified == post_split.modified)

# This should print False but actually prints True
print(pre_split.original[2:] == post_split.original)

In summary, I was expecting that re-joining the tokens produced by SentenceTokenizer would yield a bistr identical to the one that existed before splitting. This appears to hold, with one (known) exception: whitespace at the start of the first sentence is lost. Whitespace at the end of the string, and whitespace between sentences within the string, is retained as expected.

Is this expected behaviour?

Reproduced using Python 3.7, bistring 0.4.0, pyicu 2.6, and icu 68.1 (all installed via conda-forge).

Composition of no-op replacements produces incorrect (or confusing?) alignment

>>> from bistring import bistr
>>> b = bistr("abc")
>>> b1 = b.replace("bc", "bc")
>>> b2 = b1.replace("ab", "ab")
>>> b2
bistr('abc', 'abc', Alignment([(0, 0), (1, 2), (3, 3)]))
>>> b2[:2].original
'a'

Neither replacement changes the contents of the string, so I think b2[:2].original "should" return 'abc' here. This could probably be achieved with a coarser composed alignment, Alignment([(0, 0), (3, 3)]).
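To make the difference between the two alignments concrete, here is a small stand-alone sketch. It assumes a simplified model in which a slice of the modified string maps back to the original by widening to the nearest aligned boundaries; `original_slice` is a hypothetical helper written for this illustration, not the library's code.

```python
def original_slice(original, alignment, start, end):
    """Map modified[start:end] back to a slice of the original string."""
    # Nearest aligned boundary at or before the start of the slice.
    lo = max(i for i, j in alignment if j <= start)
    # Nearest aligned boundary at or after the end of the slice.
    hi = min(i for i, j in alignment if j >= end)
    return original[lo:hi]


reported = [(0, 0), (1, 2), (3, 3)]  # alignment shown in the session above
coarse = [(0, 0), (3, 3)]            # coarser alignment proposed above

print(original_slice("abc", reported, 0, 2))  # a
print(original_slice("abc", coarse, 0, 2))    # abc
```

Under the same model, the coarser alignment also maps a one-character slice such as [:1] back to the whole of 'abc', so it trades precision for never returning too little.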

Transliterate

I was hoping you might advise me on how to incorporate transliteration into a text transformation pipeline.

Let's say I want to use a third-party library such as unidecode (from unidecode import unidecode).
I could create a bistring with new_bistr = bistr(text.modified, unidecode(text.modified)),
but I would lose all the previous operations.

Is there a way to fold in a modified string that is calculated outside bistring's capabilities?
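For illustration only, here is one way an alignment between a string and its externally computed transliteration could be recovered after the fact, using just the standard library. Everything here is an assumption made for the sketch: `infer_alignment` is a hypothetical helper based on difflib, and `strip_accents` is a stdlib stand-in for a real transliterator such as unidecode. How such pairs would be fed back into a bistr is left open.

```python
import difflib
import unicodedata


def strip_accents(text):
    # Stand-in for a third-party transliterator such as unidecode.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def infer_alignment(before, after):
    """Return (i, j) pairs where before[:i] lines up with after[:j]."""
    pairs = [(0, 0)]
    matcher = difflib.SequenceMatcher(None, before, after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged runs align character by character.
            pairs.extend((i1 + k, j1 + k) for k in range(1, i2 - i1 + 1))
        else:
            # Changed runs align only at their end boundary.
            pairs.append((i2, j2))
    return pairs


before = "\ufb01anc\u00e9"       # "fiancé" spelled with an "fi" ligature
after = strip_accents(before)    # "fiance"
alignment = infer_alignment(before, after)
print(after)
print(alignment)
```

The ligature becomes two characters and the accented letter becomes one, so the pairs diverge at the start and stay offset by one until the end.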

Empty List not supported with new bistring join() in 0.5.0

>>> "".join([])
''
>>> bistring.bistr("").join([])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "{PYTHONPATH}/bistring/_bistr.py", line 393, in join
    return bistr("".join(original), "".join(modified), Alignment(alignment))
  File "{PYTHONPATH}/bistring/_alignment.py", line 102, in __init__
    raise ValueError('No sequence positions to align')
ValueError: No sequence positions to align

The new version of join() raises an error when given an empty list as an argument.

Seeing as the change was made to match how str.join() operates, should this not return bistr("")?

bistring.bistr("").join([]) also returns bistr("")
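The traceback shows the alignment being built from an empty list of positions. A toy sketch of a join that stays well-defined for empty input, assuming pieces are modelled as plain (original, modified) tuples: `join_tracked` is a hypothetical function for illustration, not the library's join().

```python
def join_tracked(sep, pieces):
    """Join (original, modified) tuples; sep is itself such a tuple."""
    original, modified = [], []
    alignment = [(0, 0)]  # seed, so an empty input still has one position
    o = m = 0
    for index, piece in enumerate(pieces):
        # Every piece after the first is preceded by the separator.
        parts = [piece] if index == 0 else [sep, piece]
        for part_original, part_modified in parts:
            original.append(part_original)
            modified.append(part_modified)
            o += len(part_original)
            m += len(part_modified)
            if (o, m) != alignment[-1]:
                alignment.append((o, m))
    return "".join(original), "".join(modified), alignment


print(join_tracked(("", ""), []))
print(join_tracked((", ", ","), [("A", "a"), ("B", "b")]))
```

With the seed position in place, joining an empty list gives empty strings and a single (0, 0) boundary, mirroring how "".join([]) returns ''.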

PyPI Release?

Thank you very much for this library! It's really useful in many text processing use cases for NLP.

The last release on PyPI, 0.4, is from September 2019, and master has had at least one fix for bistr.join() since then (#20). Would you consider cutting a new release?

Problems with PyICU when installing bistring 0.4.0 with pip

Summary

I ran into what's apparently a known issue with installing PyICU via pip while trying to pip install bistring==0.4.0. Contrary to the error message and the recommendations on that thread, installing pkg-config and libicu-dev didn't fix the issue for me. Only installing python3-icu (as recommended in the official PyICU docs) finally fixed it.

This is obviously not an issue with bistring itself, but it makes it difficult to install bistring because the ICU dependency can't be automatically installed by pip. If there is nothing else that can be done about it, maybe a note about this could at least be added to the Readme file, so people can avoid the frustration of running into the pip error?

More Details

Here is the output I got from pip install bistring==0.4.0:

Collecting bistring==0.4.0
  Downloading bistring-0.4.0-py3-none-any.whl (22 kB)
Collecting pyicu
  Downloading PyICU-2.8.tar.gz (299 kB)
     |████████████████████████████████| 299 kB 2.1 MB/s
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3 /tmp/tmprkr9c6uy get_requires_for_build_wheel /tmp/tmp5tnhtmha
       cwd: /tmp/pip-install-x_54yndb/pyicu
  Complete output (64 lines):
  (running 'icu-config --version')
  (running 'pkg-config --modversion icu-i18n')
  Traceback (most recent call last):
    File "setup.py", line 63, in <module>
      ICU_VERSION = os.environ['ICU_VERSION']
    File "/usr/lib/python3.8/os.py", line 675, in __getitem__
      raise KeyError(key) from None
  KeyError: 'ICU_VERSION'

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "setup.py", line 66, in <module>
      ICU_VERSION = check_output(('icu-config', '--version')).strip()
    File "setup.py", line 19, in check_output
      return subprocess_check_output(popenargs)
    File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
      return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
    File "/usr/lib/python3.8/subprocess.py", line 489, in run
      with Popen(*popenargs, **kwargs) as process:
    File "/usr/lib/python3.8/subprocess.py", line 854, in __init__
      self._execute_child(args, executable, preexec_fn, close_fds,
    File "/usr/lib/python3.8/subprocess.py", line 1702, in _execute_child
      raise child_exception_type(errno_num, err_msg, err_filename)
  FileNotFoundError: [Errno 2] No such file or directory: 'icu-config'

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "setup.py", line 69, in <module>
      ICU_VERSION = check_output(('pkg-config', '--modversion', 'icu-i18n')).strip()
    File "setup.py", line 19, in check_output
      return subprocess_check_output(popenargs)
    File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
      return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
    File "/usr/lib/python3.8/subprocess.py", line 489, in run
      with Popen(*popenargs, **kwargs) as process:
    File "/usr/lib/python3.8/subprocess.py", line 854, in __init__
      self._execute_child(args, executable, preexec_fn, close_fds,
    File "/usr/lib/python3.8/subprocess.py", line 1702, in _execute_child
      raise child_exception_type(errno_num, err_msg, err_filename)
  FileNotFoundError: [Errno 2] No such file or directory: 'pkg-config'

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/tmp/tmprkr9c6uy", line 280, in <module>
      main()
    File "/tmp/tmprkr9c6uy", line 263, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/tmp/tmprkr9c6uy", line 114, in get_requires_for_build_wheel
      return hook(config_settings)
    File "/tmp/pip-build-env-p3nxngyx/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 162, in get_requires_for_build_wheel
      return self._get_build_requires(
    File "/tmp/pip-build-env-p3nxngyx/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 143, in _get_build_requires
      self.run_setup()
    File "/tmp/pip-build-env-p3nxngyx/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 158, in run_setup
      exec(compile(code, __file__, 'exec'), locals())
    File "setup.py", line 71, in <module>
      raise RuntimeError('''
  RuntimeError:
  Please install pkg-config on your system or set the ICU_VERSION environment
  variable to the version of ICU you have installed.

  ----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 /tmp/tmprkr9c6uy get_requires_for_build_wheel /tmp/tmp5tnhtmha Check the logs for full command output
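The traceback above shows the build trying three sources for the ICU version in turn: the ICU_VERSION environment variable, icu-config --version, and pkg-config --modversion icu-i18n. A small pre-install check along the same lines might look like the sketch below. `detect_icu_version` is a hypothetical helper, not part of PyICU or bistring, and finding a version this way does not guarantee that the build itself will succeed.

```python
import os
import shutil
import subprocess


def detect_icu_version(environ=None):
    """Return (source, version), or (None, None) if nothing was found."""
    environ = os.environ if environ is None else environ
    if environ.get("ICU_VERSION"):
        return "ICU_VERSION", environ["ICU_VERSION"]
    commands = (
        ("icu-config", "--version"),
        ("pkg-config", "--modversion", "icu-i18n"),
    )
    for command in commands:
        if shutil.which(command[0]) is None:
            continue  # tool not installed, as in the log above
        try:
            output = subprocess.check_output(command, stderr=subprocess.DEVNULL)
        except (subprocess.CalledProcessError, OSError):
            continue  # tool present, but it does not know about ICU
        return command[0], output.decode().strip()
    return None, None


if __name__ == "__main__":
    source, version = detect_icu_version()
    if source is None:
        print("No ICU version found; pip install pyicu is likely to fail.")
    else:
        print(f"ICU {version} found via {source}")
```

Running this before pip install shows which of the three lookups, if any, would succeed on the current machine.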
