quantulum3

Python library for information extraction of quantities, measurements and their units from unstructured text. It is able to disambiguate between similar-looking units based on their k nearest neighbours in their GloVe vector representation and their Wikipedia page.

This is the Python 3 compatible fork of recastrodiaz's fork of grhawk's fork of the original by Marco Lagi. Compatibility with the newest version of sklearn is based on the fork by sohrabtowfighi.

User Guide

Installation

pip install quantulum3

To install dependencies for using or training the disambiguation classifier, use

pip install quantulum3[classifier]

The disambiguation classifier is used when the parser finds two or more units that match the text.

Usage

>>> from quantulum3 import parser
>>> quants = parser.parse('I want 2 liters of wine')
>>> quants
[Quantity(2, 'litre')]

The Quantity class stores the surface of the original text it was extracted from, as well as the (start, end) positions of the match:

>>> quants[0].surface
'2 liters'
>>> quants[0].span
(7, 15)

The value attribute provides the parsed numeric value and the unit.name attribute provides the name of the parsed unit:

>>> quants[0].value
2.0
>>> quants[0].unit.name
'litre'

An inline parser that embeds the parsed quantities in the text is also available (especially useful for debugging):

>>> print(parser.inline_parse('I want 2 liters of wine'))
I want 2 liters {Quantity(2, "litre")} of wine

As the parser is also able to parse dimensionless numbers, this library can also be used for simple number extraction.

>>> print(parser.parse('I want two'))
[Quantity(2, 'dimensionless')]

Units and entities

All units (e.g. litre) and the entities they are associated with (e.g. volume) are reconciled against Wikipedia:

>>> quants[0].unit
Unit(name="litre", entity=Entity("volume"), uri=https://en.wikipedia.org/wiki/Litre)

>>> quants[0].unit.entity
Entity(name="volume", uri=https://en.wikipedia.org/wiki/Volume)

This library includes more than 290 units and 75 entities. It also parses spelled-out numbers, ranges and uncertainties:

>>> parser.parse('I want a gallon of beer')
[Quantity(1, 'gallon')]

>>> parser.parse('The LHC smashes proton beams at 12.8–13.0 TeV')
[Quantity(12.8, "teraelectronvolt"), Quantity(13, "teraelectronvolt")]

>>> quant = parser.parse('The LHC smashes proton beams at 12.9±0.1 TeV')
>>> quant[0].uncertainty
0.1

Non-standard units usually don't have a Wikipedia page. The parser will still try to guess their underlying entity based on their dimensionality:

>>> parser.parse('Sound travels at 0.34 km/s')[0].unit
Unit(name="kilometre per second", entity=Entity("speed"), uri=None)

Export/Import

Entities, Units and Quantities can be exported to dictionaries and JSON strings:

>>> quant = parser.parse('I want 2 liters of wine')
>>> quant[0].to_dict()
{'value': 2.0, 'unit': 'litre', "entity": "volume", 'surface': '2 liters', 'span': (7, 15), 'uncertainty': None, 'lang': 'en_US'}
>>> quant[0].to_json()
'{"value": 2.0, "unit": "litre", "entity": "volume", "surface": "2 liters", "span": [7, 15], "uncertainty": null, "lang": "en_US"}'
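The dictionary and JSON forms differ only where standard JSON serialisation forces a change: the span tuple becomes an array and None becomes null. The following is a minimal sketch using only Python's json module on a dictionary copied by hand from the output above; it does not call into quantulum3, and the variable names are illustrative.

```python
import json

# Dictionary copied by hand from the to_dict() output shown above.
quant_dict = {
    "value": 2.0,
    "unit": "litre",
    "entity": "volume",
    "surface": "2 liters",
    "span": (7, 15),
    "uncertainty": None,
    "lang": "en_US",
}

# Serialising turns the tuple into a JSON array and None into null.
quant_json = json.dumps(quant_dict)

# Parsing the JSON back yields a list for "span", not a tuple,
# so it has to be converted explicitly if a tuple is needed.
restored = json.loads(quant_json)
restored_span = tuple(restored["span"])
```

Code that compares spans across both export formats should therefore normalise them to one type first.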

By default, only the unit/entity name is included in the exported dictionary, but the full unit and entity dictionaries can be included as well:

>>> quant = parser.parse('I want 2 liters of wine')
>>> quant[0].to_dict(include_unit_dict=True, include_entity_dict=True)  # same args apply to .to_json()
{'value': 2.0, 'unit': {'name': 'litre', 'surfaces': ['cubic decimetre', 'cubic decimeter', 'litre', 'liter'], 'entity': {'name': 'volume', 'dimensions': [{'base': 'length', 'power': 3}], 'uri': 'Volume'}, 'uri': 'Litre', 'symbols': ['l', 'L', 'ltr', 'ℓ'], 'dimensions': [{'base': 'decimetre', 'power': 3}], 'original_dimensions': [{'base': 'litre', 'power': 1, 'surface': 'liters'}], 'currency_code': None, 'lang': 'en_US'}, 'entity': 'volume', 'surface': '2 liters', 'span': (7, 15), 'uncertainty': None, 'lang': 'en_US'}

Similar export syntax applies to exporting Unit and Entity objects.

You can import Entity, Unit and Quantity objects from dictionaries and JSON. This requires that the object was exported with include_unit_dict=True and include_entity_dict=True (as appropriate):

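>>> from quantulum3.classes import Entity, Quantity
>>> quant_dict = quant[0].to_dict(include_unit_dict=True, include_entity_dict=True)
>>> quant = Quantity.from_dict(quant_dict)
>>> # JSON requires double-quoted strings, so the Python literal uses single quotes outside
>>> ent_json = '{"name": "volume", "dimensions": [{"base": "length", "power": 3}], "uri": "Volume"}'
>>> ent = Entity.from_json(ent_json)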

Disambiguation

If the parser detects an ambiguity, a classifier based on the Wikipedia pages of the ambiguous units or entities tries to guess the right one:

>>> parser.parse('I spent 20 pounds on this!')
[Quantity(20, "pound sterling")]

>>> parser.parse('It weighs no more than 20 pounds')
[Quantity(20, "pound-mass")]

or:

>>> text = 'The average density of the Earth is about 5.5x10-3 kg/cm³'
>>> parser.parse(text)[0].unit.entity
Entity(name="density", uri=https://en.wikipedia.org/wiki/Density)

>>> text = 'The amount of O₂ is 2.98e-4 kg per liter of atmosphere'
>>> parser.parse(text)[0].unit.entity
Entity(name="concentration", uri=https://en.wikipedia.org/wiki/Concentration)

In addition, the classifier is trained on the words most similar to each of the units' surfaces, according to their distance in the GloVe vector representation.

Spoken version

Quantulum classes include methods to convert them to a spoken form.

>>> parser.parse("Gimme 10e9 GW now!")[0].to_spoken()
ten billion gigawatts
>>> parser.inline_parse_and_expand("Gimme $1e10 now and also 1 TW and 0.5 J!")
Gimme ten billion dollars now and also one terawatt and zero point five joules!

Manipulation

While quantities cannot be manipulated within this library, there are many great options out there:

Extension

Training the classifier

If you want to train the classifier yourself, you will need the dependencies for the classifier (see installation).

Use quantulum3-training on the command line, the script quantulum3/scripts/train.py or the method train_classifier in quantulum3.classifier to train the classifier.

quantulum3-training --lang <language> --data <path/to/training/file.json> --output <path/to/output/file.joblib>

You can pass multiple training files to the training command. The output is in joblib format.

To use your custom model, pass the path to the trained model file to the parser:

quants = parser.parse(<text>, classifier_path="path/to/model.joblib")

Example training files can be found in quantulum3/_lang/<language>/train.

If you want to create a new or different similars.json, install pymagnitude.

For the extraction of nearest neighbours from a vector word representation file, use scripts/extract_vere.py. It automatically extracts the k nearest neighbours in vector space of the vector representation for each of the possible surfaces of the ambiguous units. The resulting neighbours are stored in quantulum3/similars.json and automatically included for training.

The file provided should be in .magnitude format, as other formats are first converted to a .magnitude file on the fly. Check out the pre-formatted Magnitude word embeddings and Magnitude for more information.

Additional units

It is possible to add additional entities and units to be parsed by quantulum. These will be added to the default units and entities. The following code shows an example invocation:

>>> from quantulum3.load import add_custom_unit, remove_custom_unit
>>> add_custom_unit(name="schlurp", surfaces=["slp"], entity="dimensionless")
>>> parser.parse("This extremely sharp tool is precise up to 0.5 slp")
[Quantity(0.5, "Unit(name="schlurp", entity=Entity("dimensionless"), uri=None)")]

The keyword arguments to the function add_custom_unit are directly translated to the properties of the unit to be created.

Custom Units and Entities

It is possible to load a completely custom set of units and entities. This can be done by passing a list of file paths to the load_custom_units and load_custom_entities functions. Loading custom units and entities will replace the default units and entities that are normally loaded.

The recommended way to load quantities is via a context manager:

>>> from quantulum3 import load, parser
>>> with load.CustomQuantities(["path/to/units.json"], ["path/to/entities.json"]):
...     parser.parse("This extremely sharp tool is precise up to 0.5 slp")
...
[Quantity(0.5, "Unit(name="schlurp", entity=Entity("dimensionless"), uri=None)")]

>>> # default units and entities are loaded again

But it is also possible to load custom units and entities manually:

>>> from quantulum3 import load, parser

>>> load.load_custom_units(["path/to/units.json"])
>>> load.load_custom_entities(["path/to/entities.json"])
>>> parser.parse("This extremely sharp tool is precise up to 0.5 slp")

[Quantity(0.5, "Unit(name="schlurp", entity=Entity("dimensionless"), uri=None)")]

>>> # remove custom units and entities and load default units and entities
>>> load.reset_quantities()

See the Developer Guide below for more information about the format of units and entities files.

Developer Guide

Adding Units and Entities

See units.json for the complete list of units and entities.json for the complete list of entities. The criteria for adding units have been:

It's easy to extend these two files to the units/entities of interest. Here is an example of an entry in entities.json:

"speed": {
    "dimensions": [{"base": "length", "power": 1}, {"base": "time", "power": -1}],
    "URI": "Speed"
}
  • The name of an entity is its key. Names are required to be unique.
  • URI is the name of the Wikipedia page of the entity (e.g. https://en.wikipedia.org/wiki/Speed => Speed).
  • dimensions is the dimensionality, a list of dictionaries each having a base (the name of another entity) and a power (an integer, can be negative).

Here is an example of an entry in units.json:

"metre per second": {
    "surfaces": ["metre per second", "meter per second"],
    "entity": "speed",
    "URI": "Metre_per_second",
    "dimensions": [{"base": "metre", "power": 1}, {"base": "second", "power": -1}],
    "symbols": ["mps"]
},
"year": {
    "surfaces": [ "year", "annum" ],
    "entity": "time",
    "URI": "Year",
    "dimensions": [],
    "symbols": [ "a", "y", "yr" ],
    "prefixes": [ "k", "M", "G", "T", "P", "E" ]
}
  • The name of a unit is its key. Names are required to be unique.
  • URI follows the same scheme as in the entities.json
  • surfaces is a list of strings that refer to that unit. The library takes care of plurals, no need to specify them.
  • entity is the name of an entity in entities.json
  • dimensions follows the same schema as in entities.json, but the base is the name of another unit, not of another entity.
  • symbols is a list of possible symbols and abbreviations for that unit.
  • prefixes is an optional list. It can contain metric and binary prefixes and automatically generates the corresponding units. If you want to add specifics (like different surfaces), you need to create a separate entry for that prefixed version.

All fields are case sensitive.
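To illustrate how surfaces and symbols relate to unit names, and what case sensitivity means in practice, here is a minimal sketch that builds a lookup table from the two example entries above. build_lookup is a hypothetical helper written for this illustration; it is not how quantulum3 loads units internally.

```python
# The two example entries from above, written as a Python dictionary.
units = {
    "metre per second": {
        "surfaces": ["metre per second", "meter per second"],
        "entity": "speed",
        "URI": "Metre_per_second",
        "dimensions": [{"base": "metre", "power": 1}, {"base": "second", "power": -1}],
        "symbols": ["mps"],
    },
    "year": {
        "surfaces": ["year", "annum"],
        "entity": "time",
        "URI": "Year",
        "dimensions": [],
        "symbols": ["a", "y", "yr"],
        "prefixes": ["k", "M", "G", "T", "P", "E"],
    },
}


def build_lookup(units):
    """Map every surface and symbol to the names of the units that use it."""
    lookup = {}
    for name, entry in units.items():
        for key in entry.get("surfaces", []) + entry.get("symbols", []):
            lookup.setdefault(key, []).append(name)
    return lookup


lookup = build_lookup(units)
print(lookup["meter per second"])  # ['metre per second']
print(lookup.get("Year"))          # None, because keys are case sensitive
```

A key that maps to more than one unit name is exactly the situation in which disambiguation is needed.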

Contributing

If you'd like to contribute, follow these steps:

  1. Clone a fork of this project into your workspace
  2. Run pip install -e . at the root of your development folder.
  3. Run pip install pipenv and then pipenv shell
  4. Inside the project folder run pipenv install --dev
  5. Make your changes
  6. Run scripts/format.sh and scripts/build.py from the package root directory.
  7. Test your changes with python3 setup.py test (Optional, will be done automatically after pushing)
  8. Create a Pull Request once you have committed and pushed your changes

Language support

There is a branch for language support, namely language_support. By inspecting the README file in the _lang subdirectory and the functions and values given in the new _lang.en_US submodule, one should be able to create one's own language submodules. New language modules should automatically be invoked and be available, both through the lang= keyword argument in the parser functions and in the automatic unit tests.

No changes outside your own language submodule folder (e.g. _lang.de_DE) should be necessary. If there are problems implementing a new language, don't hesitate to open an issue.

quantulum3's People

Contributors

adhardy, ageitgey, bajisci, grhawk, hwalinga, lvikasz, marcolagi, nielstron, recastrodiaz, skatsaounis, sohrabtowfighi, yoavg

quantulum3's Issues

Support for negative numbers in parentheses

Is your feature request related to a problem? Please describe.
To display negative currencies or other units, some users prefer the format ($99.99) instead of -$99.99 especially in accounting.

Describe the solution you'd like
Treating numbers inside parentheses as negative would be very helpful for all units of measurement.
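As a rough illustration of the requested behaviour, the sketch below recognises an amount wrapped in parentheses and returns it with a negative sign. The pattern, the set of currency symbols and the function name parse_accounting_amount are all assumptions made for this example; nothing here reflects existing quantulum3 code.

```python
import re

# An amount wrapped in parentheses, e.g. "($99.99)" or "(1,250)".
# The optional currency symbol and the comma grouping are illustrative choices.
PAREN_AMOUNT = re.compile(
    r"\((?P<symbol>[$€£]?)(?P<number>\d[\d,]*(?:\.\d+)?)\)"
)


def parse_accounting_amount(text):
    """Return the negated value of a parenthesised amount, or None if text is not one."""
    match = PAREN_AMOUNT.fullmatch(text.strip())
    if match is None:
        return None
    # Drop thousands separators before converting, then flip the sign.
    return -float(match.group("number").replace(",", ""))


print(parse_accounting_amount("($99.99)"))
print(parse_accounting_amount("-$99.99"))  # not in parentheses, so no match
```

A real implementation would also have to decide how to treat parentheses that merely enclose a remark, such as "(5 km)", where a sign flip would be wrong.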

AttributeError - NoneType object has no attribute start

Describe the bug
When calling parser.parse on certain strings this error is thrown

To Reproduce
Steps to reproduce the behavior:

  1. parser.parse('approximately one and one-half miles east')
  2. See error: AttributeError("'NoneType' object has no attribute 'start'",)

Expected behavior
No error

Additional information:

  • Python Version: 3.6.6
  • OS: Windows 10
  • Version

Additional context

span = (span[0], span[0] + match.start())

Compatibility with Python 2.7

Essentially, a new Travis build for Python 2 is needed, as well as a lot of from __future__ import unicode_literals statements.

This should be possible to implement without additional libraries.

List Index Out Of Range

Describe the bug
parser.parse() on some texts generates this error

To Reproduce
Steps to reproduce the behavior:

  1. parser.parse('Acme Inc. re- commenced production at Goondicum in April 2015 but production was paused in August 2015.')
  2. See error

Additional information:

  • Python Version: 3.6.6
  • Classifier activated/ sklearn installed: [yes/no]
  • OS: Windows 10
  • Version

Consider using hard coded general knowledge to resolve ambiguity

Even though a classifier is nice and cool and stuff, it may be worthwhile to include some general knowledge (for example, that no two units of the same dimension appear in the same compound unit). This may improve runtime and also perform better when a classifier is not available (which should not be unusual).

Fix invocation of sklearn.train

The following warning is currently thrown when building the classifier. This should be fixed before changes are made by sklearn.

/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/numpy/matrixlib/defmatrix.py:68: PendingDeprecationWarning: the matrix subclass is not the recommended way to represent matrices or deal with linear algebra (see https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html). Please adjust your code to use regular ndarray.
  return matrix(data, dtype=dtype, copy=False)
/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/sklearn/linear_model/stochastic_gradient.py:183: FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If max_iter is set but tol is left unset, the default value for tol in 0.19 and 0.20 will be None (which is equivalent to -infinity, so it has no effect) but will change in 0.21 to 1e-3. Specify tol to silence this warning.
  FutureWarning)

Parsing dates

Maybe dates are a thing to be parsed in the future.

Automatically generate SI-prefixed versions of units

In general, the tool still lacks the ability to automatically infer SI-prefixed (kilo, mega, etc.) versions of all units.
As not all units are prefixable (e.g. there are no kiloinches, and there are megabytes but no millibytes), prefixes should not be parsed for every unit.

All applicable SI prefixes for a unit could be defined by a separate list inside units.json. Special or additional entities can either be assigned inside that list or by creating a separate unit.

The values "all", "positive" (kilo and upwards) and "negative" (milli and downwards), or something similar, should be available to reduce redundancy.
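One possible reading of this proposal is sketched below: a per-unit prefix list may mix the keywords with explicit prefix symbols, and a small helper expands it. The keyword names come from the text above; the helper expand_prefixes, the exact prefix lists and the treatment of hecto, deca, deci and centi as belonging only to "all" are assumptions of this sketch.

```python
# Prefix symbols from kilo upwards and from milli downwards.
POSITIVE = ["k", "M", "G", "T", "P", "E", "Z", "Y"]
NEGATIVE = ["m", "µ", "n", "p", "f", "a", "z", "y"]
# Assumption: the prefixes between milli and kilo belong to "all" only.
ALL = POSITIVE + ["h", "da", "d", "c"] + NEGATIVE

KEYWORDS = {"all": ALL, "positive": POSITIVE, "negative": NEGATIVE}


def expand_prefixes(spec):
    """Expand a list mixing keywords and explicit prefix symbols, keeping order."""
    expanded = []
    for item in spec:
        # A keyword expands to its list; anything else is taken literally.
        for prefix in KEYWORDS.get(item, [item]):
            if prefix not in expanded:
                expanded.append(prefix)
    return expanded


# Bytes: megabytes exist, millibytes do not.
print(expand_prefixes(["positive"]))
# A unit with kilo plus all sub-unit prefixes.
print(expand_prefixes(["k", "negative"]))
```

Units such as inch would simply carry no prefix list at all.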

Filter results before sorting (or use max straight away)

Is your feature request related to a problem? Please describe.
In disambiguation, the results of the classifier are first sorted, then filtered, and then the first element is chosen. This is a possible runtime problem and unnecessarily complicated.
Describe the solution you'd like
Just use max for choosing the best result and filter first to reduce the list length.

Describe alternatives you've considered
Sorting may be kept
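The suggested change can be sketched as follows. The function name best_candidate, the shape of the scores mapping and the sample values are hypothetical; the classifier's real output format is not shown in this issue.

```python
def best_candidate(scores, allowed):
    """Pick the allowed label with the highest score without sorting."""
    # Filter first so that max only looks at relevant labels.
    filtered = [label for label in scores if label in allowed]
    # default=None covers the case where nothing survives the filter.
    return max(filtered, key=scores.get, default=None)


# Illustrative scores, not real classifier output.
scores = {"pound sterling": 0.61, "pound-mass": 0.35, "litre": 0.04}
print(best_candidate(scores, {"pound sterling", "pound-mass"}))
```

max runs in linear time over the filtered list, whereas sorting the full list costs O(n log n) only to read its first element.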

The "Usage" fail

I tried to use the library following the Usage section (https://github.com/nielstron/quantulum3#usage) and got an error: ModuleNotFoundError: No module named 'quantulum3._lang.en_US.parser'

Trace:

>>> parser.parse('I want 2 liters of wine')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/quantulum3/parser.py", line 433, in parse
    text = clean_text(text, lang)
  File "/usr/local/lib/python3.7/site-packages/quantulum3/parser.py", line 408, in clean_text
    text = _get_parser(lang).clean_text(text)
  File "/usr/local/lib/python3.7/site-packages/quantulum3/parser.py", line 28, in _get_parser
    return language.get('parser', lang)
  File "/usr/local/lib/python3.7/site-packages/quantulum3/language.py", line 52, in get
    '{}._lang.{}.{}'.format(__package__, subdir(lang), module)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'quantulum3._lang.en_US.parser'

Parse proper nouns or objects as units

50 cars results in 50 dimensionless.

Describe the solution you'd like
The result should be 50 car

This could be implemented by hand (with "car" as a unit) or by a list of objects, as they don't have symbols and can be automatically generated. Several words in one line could represent words describing the same object (e.g. people, human).

FileNotFoundError when importing quantulum3

Describe the bug
Could not load library in Python

To Reproduce
Steps to reproduce the behavior:

  1. install quantulum3 via pip command: pip install quantulum3 --user
  2. create python script with code:
    from quantulum3 import parser
  3. start python script
  4. See error: FileNotFoundError: [Errno 2] No such file or directory: '{USER}\AppData\Roaming\Python\Python36\site-packages\quantulum3\common-4-letter-words.txt'

Expected behavior
library should be imported correctly

Additional information:

  • Python Version: 3.6.6
  • OS: Windows 10
  • Version

a ... creates errors

Describe the bug
Parsing "a 500" creates an error: when spelled-out values are substituted, "a" turns into 1.0, and the space after it is then removed as a grouping operator. The result is 1.0500, which is of course not parsed as 500.

Use tox for linting issues

Tox is a useful tool to check beforehand that there are no packaging issues (which have occurred a lot of times)

Pure exponential values are not parsed

Describe the bug
"in physics, a unit of surface area equal to 10^-12 barns" => 10, -12 barns

Expected behavior
in physics, a unit of surface area equal to 10^-12 barns => 10^-12 barns
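A minimal sketch of how such base^exponent expressions could be recognised with a regular expression is shown below. The pattern and the function name parse_pure_exponentials are assumptions for illustration and are not taken from the quantulum3 parser.

```python
import re

# A base and an exponent joined by "^", e.g. "10^-12" or "2^10".
# The lookbehind keeps the match from starting in the middle of a number.
POWER = re.compile(r"(?<![\d.])(\d+)\^([+-]?\d+)")


def parse_pure_exponentials(text):
    """Return the numeric value of every base^exponent expression in text."""
    return [float(base) ** int(exponent) for base, exponent in POWER.findall(text)]


text = "in physics, a unit of surface area equal to 10^-12 barns"
print(parse_pure_exponentials(text))
```

With this approach the whole expression is consumed as one number, so the exponent is no longer split off as a second quantity.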

Fails with idioms

Describe the bug
Incorrectly detects measurements from statements with idioms

To Reproduce

>>> from quantulum3 import parser
>>> parser.parse("There's even a ton of comments saying the Shield is now heel?")
[Quantity(1, "short ton")]

Expected behavior
Will not detect anything

Desktop (please complete the following information):

  • OS: Windows
  • Version: 10

Implement correct entity resolving

Is your feature request related to a problem? Please describe.
If, for example, the unit "speed of light day" is entered, its entity is unknown even though it is actually known to be "length". As quantulum3 does not expand all units to their very basic entities and shortens results (by resolving positive and negative powers), the result remains unknown.

Describe the solution you'd like
In get_entity_from_dimensions the dimension information should be brought down to the very last basic entities, then unit shortening should happen (i.e. as above, time^-1 and time^1 should be removed)
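The proposed two-step procedure (expand to base entities, then cancel opposing powers) can be sketched as below. The entity table and the function to_base_dimensions are hypothetical and simplified; this is not the existing get_entity_from_dimensions implementation.

```python
from collections import Counter

# Hypothetical, simplified entity table: derived entities list their
# dimensions, base entities have an empty list.
ENTITIES = {
    "length": [],
    "time": [],
    "speed": [{"base": "length", "power": 1}, {"base": "time", "power": -1}],
}


def to_base_dimensions(dimensions, entities=ENTITIES):
    """Expand dimensions down to base entities and cancel opposing powers."""
    totals = Counter()
    for dim in dimensions:
        sub_dimensions = entities[dim["base"]]
        if not sub_dimensions:
            # Already a base entity: add its power directly.
            totals[dim["base"]] += dim["power"]
        else:
            # Derived entity: recurse and scale the nested powers.
            nested = to_base_dimensions(sub_dimensions, entities)
            for base, power in nested.items():
                totals[base] += power * dim["power"]
    # Shortening happens last: drop everything that cancelled to zero.
    return {base: power for base, power in totals.items() if power != 0}


# "speed of light day" is a speed multiplied by a time.
speed_of_light_day = [{"base": "speed", "power": 1}, {"base": "time", "power": 1}]
print(to_base_dimensions(speed_of_light_day))  # {'length': 1}
```

The remaining dictionary can then be compared against the dimensions of known entities, which in this example identifies "length".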

Describe alternatives you've considered

Additional context

Does not support numeric abbreviation

Is your feature request related to a problem? Please describe.
parser.parse cannot properly parse numeric abbreviations

>>> parser.parse("1k miles")
[Quantity(1, "unk mile")]
>>> parser.parse("1K miles")
[Quantity(1, "kelvin mile")]
>>> parser.parse("1M miles")
[Quantity(1, "metre mile")]

Describe the solution you'd like
Should be able to parse numeric abbreviations properly:

In English numeric abbreviations and currency:
K for thousand (from kilo)
M for million
B for billion
T for trillion

In metric prefixes for SI measurements:
k for thousand (note lowercase)
M for million (from mega)
G for billion (from giga)
T for trillion (from tera)

Reference: https://english.stackexchange.com/a/112250
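A minimal sketch of the English-abbreviation half of this request follows. It rewrites a number that is directly followed by K, M, B or T into its full value before any unit parsing would happen. The helper expand_abbreviations and its case-insensitive matching are assumptions of this example; in particular it deliberately ignores the SI distinction between k and K and cannot tell "5m" (five million) from five metres, which a real implementation would have to resolve from context.

```python
import re

# Multipliers for the English abbreviations listed above.
MULTIPLIERS = {"k": 10**3, "m": 10**6, "b": 10**9, "t": 10**12}

# A number immediately followed by one abbreviation letter and a word boundary.
ABBREVIATED = re.compile(r"(\d+(?:\.\d+)?)([kmbt])\b", re.IGNORECASE)


def expand_abbreviations(text):
    """Replace forms such as '1k' or '2.5M' by the written-out number."""

    def replace(match):
        value = float(match.group(1)) * MULTIPLIERS[match.group(2).lower()]
        # Print whole numbers without a trailing ".0".
        return str(int(value)) if value.is_integer() else str(value)

    return ABBREVIATED.sub(replace, text)


print(expand_abbreviations("1k miles"))
print(expand_abbreviations("2.5M miles"))
```

After this preprocessing step the remaining text contains a plain number followed by the unit.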

Getting FileNotFoundError while running parser.parse('5.8 mass') in Python 3.6 and python3.5

Describe the bug
Getting FileNotFoundError while running parser.parse('5.8 mass') in Python 3.6 and python3.5

To Reproduce
from quantulum3 import parser
quants = parser.parse('5.8 mass')

Error

FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/dist-packages/quantulum3/_lang/en_US/train'

Additional information:

  • Python Version: 3.5,3.7
  • OS: Ubuntu 16.04

one and One-half provides nonsense

Describe the bug

>>> from quantulum3 import parser
>>> parser.parse("one and one half miles east")
[Quantity(2, "dimensionless")]

Additional information:

  • Python Version: 3.6
  • Classifier activated/ sklearn installed: no

Likely linked to #1

Add currency codes

Is your feature request related to a problem? Please describe.
Cents won't be parsed.
When creating a spoken version, the currency code is needed by num2words.

Describe the solution you'd like
Create units for cents etc
Implement Currency codes directly in units.json or as symbols for the units

Test failing

        "req": "I want a hundred and two of those",
        "res": [{"value": 102, "unit": "dimensionless", "surface": "a hundred and two"}]

results in value:2 rather than 102

Fix erroneous results

The examples don't produce errors on my device (no scipy installed). Yet there are some erroneous outputs:

>>> from quantulum3 import parser
>>> parser.parse("exports decreased from 816 000 t valued at $1943 million in 2012–13 to 934 000 t valued at $1964 million in 2013–14 and 885 000 t valued at $1854 million in 2014–15.")
[Quantity(816, "tonne"), Quantity(1.943e+09, "dollar"), Quantity(1012.5, "dimensionless"), Quantity(934, "tonne"), Quantity(1.964e+09, "dollar"), Quantity(1456, "tonne"), Quantity(1.854e+09, "dollar"), Quantity(1014.5, "dimensionless")]
>>> parser.parse("Acme Inc. re- commenced production at Goondicum in April 2015 but production was paused in August 2015.")
[Quantity(2015, "dimensionless"), Quantity(2015, "dimensionless")]
>>> parser.parse('in 2012–13 to 279 t valued at $13 009 million in 2013–14 and 278 t valued at $13 049 million in 2014–15.')
[Quantity(1012.5, "dimensionless"), Quantity(279, "tonne"), Quantity(2.2e+07, "dollar"), Quantity(1152.5, "tonne"), Quantity(6.2e+07, "dollar"), Quantity(1014.5, "dimensionless")]
>>>

Fixed:

  • 2012-13 is interpreted as a range. When the second number is less it should be automatically prefixed with other numbers or ignored right away

  • 940 000 t the three zeroes are simply ignored, resulting in a wrong magnitude

  • >>> parser.parse("2013-14 and 278 t") [Quantity(1152.5, "tonne")]

  • 4,500 tpd -> "tonne pint day" instead of tonne per day

  • The Great Australia mine open cut (south-east of Cloncurry) with a fault visible above the stopes of historical underground workings. -> "degree fahrenheit astronomical unit litre tonne" instead of nothing/ fault

  • Cannington—South 32 Ltd -> "litre tonne day" instead of dimensionless

Based on issue #41 and #40

Decimals won't be parsed

Describe the bug

>>> parser.parse("zero point five")
[Quantity(0, "dimensionless"), Quantity(5, "dimensionless")]

Expected behavior
Result should be 0.5

Additional information:

  • Version: 3.0

Package doesn't work when installed in custom folder (pip's --target)

Describe the bug
When the package quantulum3 is installed into a custom folder (e.g. "lib/"), it tries to load the units.json file from the wrong directory, which causes an error.

To Reproduce
Steps to reproduce the behavior:

  1. Install package into custom folder (e.g. "lib/")
pip3 install --upgrade --target lib quantulum3
  2. From main folder (the folder where lib folder is) run this code:
import sys
sys.path.insert(0, 'lib')
from quantulum3 import parser

parser.parse('test string')
  3. See error
Traceback (most recent call last):
  File "lib\quantulum3\load.py", line 34, in cached_function
    return _CACHE_DICT[id(funct)][lang]
KeyError: 1609576810016

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "lib\quantulum3\load.py", line 34, in cached_function
    return _CACHE_DICT[id(funct)][lang]
KeyError: 1609576777792

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 5, in <module>
    parser.parse('test string')
  File "lib\quantulum3\parser.py", line 444, in parse
    for item in reg.units_regex(lang).finditer(text):
  File "lib\quantulum3\load.py", line 36, in cached_function
    result = funct(lang)
  File "lib\quantulum3\regex.py", line 324, in units_regex
    list(load.units(lang).surfaces.keys()) + list(
  File "lib\quantulum3\load.py", line 36, in cached_function
    result = funct(lang)
  File "lib\quantulum3\load.py", line 328, in units
    return Units(lang)
  File "lib\quantulum3\load.py", line 229, in __init__
    with path.open(encoding='utf-8') as file:
  File "C:\Users\me\AppData\Local\Programs\Python\Python35\lib\pathlib.py", line 1151, in open
    opener=self._opener)
  File "C:\Users\me\AppData\Local\Programs\Python\Python35\lib\pathlib.py", line 1005, in _opener
    return self._accessor.open(self, flags, mode)
  File "C:\Users\me\AppData\Local\Programs\Python\Python35\lib\pathlib.py", line 371, in wrapped
    return strfunc(str(pathobj), *args)
FileNotFoundError: [Errno 2] No such file or directory: 'lib\\quantulum3\\units.json\\lib\\quantulum3\\_lang\\en_US\\units.json'

Expected behavior
Package shouldn't have troubles with loading.

Additional information:

  • Python Version: 3.5
  • Classifier activated/ sklearn installed: no
  • OS: probably all, but tested on Windows 10 1803 and Ubuntu 18.04.1 LTS
  • Version: 0.6.5

Parsing of "a 1/4 inch" fails

Entering "I want a gallon of beer", nothing is successfully parsed.

>>> parser.parse('I want a gallon of beer')
[]

Support precision in quantities.

Is your feature request related to a problem? Please describe.
Currently numbers are stored as floats with float precision. If they happen to be integers, they will be rounded. At no point is the precision stored, even though it is scientifically important.

Describe the solution you'd like
The quantity class should have a precision member

"2.00 Watt" => precision 2
"2 Watt" => precision 0
"200 Watts" => precision -2

Describe alternatives you've considered
"2.00 Watt" => precision 0.01
"2 Watt" => precision 1
"200 Watts" => precision 100

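Both conventions discussed in this request can be derived from the number as it was written in the text. The sketch below is a hypothetical illustration for plain decimal notation only (no exponents or thousands separators); precision and resolution are names chosen for this example and are not part of the Quantity class.

```python
from decimal import Decimal


def precision(number_text):
    """Number of decimal places as written; negative for trailing zeros."""
    text = number_text.strip().lstrip("+-")
    if "." in text:
        # "2.00" has two digits after the decimal point.
        return len(text.split(".", 1)[1])
    # "200" has two trailing zeros, which counts as precision -2.
    stripped = text.rstrip("0")
    return len(stripped) - len(text) if stripped else 0


def resolution(number_text):
    """The alternative convention: the size of the last significant step."""
    return Decimal(10) ** -precision(number_text)


for sample in ["2.00", "2", "200"]:
    print(sample, precision(sample), resolution(sample))
```

Note that the precision has to be computed from the matched surface text, since the information is already lost once the value has been converted to a float.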
Wikipedia should not be mandatory

wikipedia is currently required to be installed in order to use this package. It should not be, though, as it's only necessary when training the classifier yourself.

It should be checked whether it can be removed from the install requirements in setup.py without problems. Also, a notice should be added to the README and the import should be optional.

Module not found error when scipy is not installed

Describe the bug
A line in parser.py causes a ModuleNotFoundError if the user didn't install scipy. scipy should not be necessary, though.

The problem lies in line 15. It can easily be removed as the import is not used.

Find a way to produce large amounts of nice sentence examples

Currently the training set for the classifier is quite tiny. For meaningful disambiguation a very large dataset is needed and parsing Wikipedia pages may not be the best solution.

There are websites providing example sentences in the usage of words. Maybe something similar can be found for the usage of quantity symbols or quantities in general.
