
mathics-scanner's Issues

Wrong behaviour in parsing escaped backslash

Consider the following input:

In[1]:= "\\[Integral]" == "\\" <> "[Integral]"

In WMA, the result is

Out[1]= True

On the other hand, in Mathics the result is

Out[1]= False

The reason is that mathics-scanner parses the LHS as "\u222b", which differs from the RHS "\[Integral]".
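
A minimal sketch of the expected scanning order (this is not the actual mathics-scanner code, and lookup_named_character is a hypothetical stand-in for the real named-character table): an escaped backslash must be consumed before the scanner tries to match a \[Name] sequence.

    import re

    NAMED = re.compile(r"\\\[(\w+)\]")

    def lookup_named_character(name):
        # Hypothetical stand-in for the real named-character table.
        return {"Integral": "\u222b"}[name]

    def scan_string(s):
        out = []
        i = 0
        while i < len(s):
            if s.startswith("\\\\", i):   # an escaped backslash wins first
                out.append("\\")
                i += 2
            else:
                m = NAMED.match(s, i)     # only then try \[Name]
                if m:
                    out.append(lookup_named_character(m.group(1)))
                    i = m.end()
                else:
                    out.append(s[i])
                    i += 1
        return "".join(out)

    assert scan_string(r"\\[Integral]") == "\\" + "[Integral]"
    assert scan_string(r"\[Integral]") == "\u222b"

With this ordering, the source text "\\[Integral]" scans to a literal backslash followed by "[Integral]", matching the WMA result above.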

I found it when I tried to write tests for Mathics3/mathics-core#541

Document the tokeniser

We should add some documentation on the tokeniser before releasing the package. Simple docstrings will do for now.

requirements-extra.txt not included in released source code and missing description

The file requirements-extra.txt is not included in the released source code, so building directly from that is not possible.

Since I'm packaging this for AUR, I would like to have a brief description of what each extra requirement brings. In this case, what would a user gain by installing ujson? I know it used to say # Optional Used in mathics_scanner.characters, but that doesn't say much about why one would (or wouldn't) install it.

Make the tokeniser more useful for other people

While working on #11 it became clear to me that the tokeniser is still very much tied to the internals of Mathics core. For instance, the whole messaging mechanism is completely useless to anyone other than us (the developers of Mathics) and could likely be replaced entirely by raising exceptions. There are also several improvements that would make the public interface cleaner and more intuitive.

I propose the following changes:

  • Entirely remove the messaging mechanism from Tokeniser and LineFeed (this will require some refactoring in core)
  • Implement __next__ for Tokeniser by simply calling the next method, so the tokeniser becomes a standard Python iterator (see the sketch after this list)
  • Add functionality to control whether comments are skipped or not (this is useful for syntax-highlighting-related use cases)
  • Remove the tag parameter of Token(tag, text, pos) and mark the type of the token by using subclasses of Token (i.e. Token("Number", "3", 4) becomes NumberToken("3", 4))
  • Rethink the usage of the incomplete method: I'd like to remove it (since it's more of an implementation detail than anything else), but it's used in core so we'd have to deal with that too. Even if we don't remove it, we should rename it to something more descriptive
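
A minimal sketch of the second and fourth points, with illustrative names (EndToken in particular is an assumption, not part of the current API, and the real tokeniser scans text rather than taking a list of tokens):

    class Token:
        """Base token: just the text and its position in the input."""
        def __init__(self, text, pos):
            self.text = text
            self.pos = pos

        def __repr__(self):
            return f"{type(self).__name__}({self.text!r}, {self.pos})"

    class NumberToken(Token):   # Token("Number", "3", 4) -> NumberToken("3", 4)
        pass

    class SymbolToken(Token):
        pass

    class EndToken(Token):      # hypothetical end-of-input sentinel
        pass

    class Tokeniser:
        def __init__(self, tokens):
            # Stand-in: the real tokeniser would scan source text here.
            self._tokens = iter(tokens)

        def next(self):
            return next(self._tokens, EndToken("", -1))

        def __next__(self):
            # Simply delegates to next(), as proposed above.
            token = self.next()
            if isinstance(token, EndToken):
                raise StopIteration
            return token

        def __iter__(self):
            return self

    for token in Tokeniser([NumberToken("3", 0), SymbolToken("x", 2)]):
        print(token)   # NumberToken('3', 0), then SymbolToken('x', 2)

With __iter__ and __next__ in place, callers can write a plain for loop over the tokeniser instead of calling next() by hand.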

Ideally, I'd like to take care of this before the release, since these are breaking changes and would therefore require a major version bump if we merged them after the first release. However, I understand that the required refactoring will take some time, so I'm OK with doing this after the first release.

I also take full responsibility for this. I can do most of this work on my own if the rest of the contributors aren't interested. @rocky @mmatera Thoughts?

Properly delineate our public API

We should decide what's going to be part of our public API before releasing. Things that aren't in the public API should be marked with a _ at the start of their names.

As a start, I propose that the public API consist only of replace_*_with_*, named_characters, esc_aliases, generate.rl_inputrc, and whatever is used by Mathics core; everything else stays private. @rocky @mmatera Thoughts?
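
One possible way to mark that boundary, as a sketch only (the names listed are examples matching the patterns above, not a final decision):

    # Public names are listed explicitly; everything else gets a leading
    # underscore so it is clearly internal.
    __all__ = [
        "named_characters",
        "esc_aliases",
        "replace_wl_with_plain_text",   # one of the replace_*_with_* family
    ]

    def _load_tables():
        # The leading underscore marks this helper as internal.
        ...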

Translation tests!

For tables this size, consistency tests on the generated tables are useful:

  • Most symbols round-trip. Those that don't would be noted in a list in the test, and all others would be checked for round-tripping.
  • The number of keys in each table agrees.
  • The set of symbols that maps back to a WL long name is exactly the set of WL symbols.
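
A sketch of what such tests could look like under pytest, using one-entry stand-in tables (the real tables are generated from the YAML data, and the table names here are illustrative):

    # Illustrative stand-ins; the real tables are generated from the YAML data.
    wl_to_unicode_dict = {"\uf74c": "\U0001d451"}   # e.g. DifferentialD -> 𝑑
    unicode_to_wl_dict = {u: w for w, u in wl_to_unicode_dict.items()}
    KNOWN_NON_ROUNDTRIP = set()   # symbols noted in the test as not round-tripping

    def test_table_sizes_agree():
        assert len(unicode_to_wl_dict) == len(wl_to_unicode_dict)

    def test_round_trip():
        for wl_char, uni_char in wl_to_unicode_dict.items():
            if wl_char in KNOWN_NON_ROUNDTRIP:
                continue
            assert unicode_to_wl_dict[uni_char] == wl_char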

Test errors with Python 3.9.16 on FreeBSD

All tests passed with 1.2.4, but five AssertionErrors are raised with 1.3.0:

FAILED test/test_tokeniser.py::test_apply - assert [Token(Symbol...Symbol, x, 5)] == [Token(Symbol...Symbol, x, 5)]
FAILED test/test_tokeniser.py::test_association - assert [Token(RawLef...ation, |>, 8)] == [Token(RawLef...ation, |>, 8)]
FAILED test/test_tokeniser.py::test_integeral - assert [Token(Integr...Symbol, y, 6)] == [Token(Integr...Symbol, y, 6)]
FAILED test/test_tokeniser.py::test_set - assert [Token(Symbol...Symbol, y, 4)] == [Token(Symbol...Symbol, y, 4)]
FAILED test/test_unicode_equivalent.py::test_has_unicode_equivalent - AssertionError: In Alternative - remove add unicode equivalent

Full log attached.
mathics-scanner_log.txt

Support for parsing Unicode characters from their codes

In WMA, a not-very-well-documented feature allows inputting Unicode characters that do not have a name. For example, for a 32-bit character, we can write \|xxxxxx, where xxxxxx are the hexadecimal digits of the character code:

In[1]:= "\|01D451"

produces in WMA

Out[1]= 𝑑

(notice that this character does not have a name in WL).
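
A minimal sketch of how such \|xxxxxx escapes could be expanded, assuming exactly six hexadecimal digits as in the example (this is an illustration, not the actual scanner code):

    import re

    # Six hex digits after \| name the code point directly.
    HEX6 = re.compile(r"\\\|([0-9A-Fa-f]{6})")

    def expand_hex_escapes(s):
        return HEX6.sub(lambda m: chr(int(m.group(1), 16)), s)

    print(expand_hex_escapes(r"\|01D451"))   # prints 𝑑 (U+1D451)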

The parser should convert NamedCharacters into wl-code as before (1.2.4), not into unicode-equivalent

After the last release, the behavior of MathicsScanner changed: named characters in strings are now mapped to their unicode-equivalent instead of to wl-code as before. After fighting with the formatter code in mathics-core, I think this behavior is wrong.
The reason is that the goal of having a unicode-equivalent is to provide readable output, not an efficient way to store characters.

The example came up with "\[DifferentialD]". In 1.2.4, this string was parsed as "\uf74c", a WL-specific character with a specific meaning. If a string has a form like "\[Integral]F[x]\[DifferentialD] x", it can afterwards be parsed as the expression
Integrate[F[x], x]. On the other hand, if we want to produce a printable version, \[DifferentialD]
could be converted into d, into "\U0001d451" (𝑑), or into the LaTeX \,d, depending on where it is needed.

With the current behavior in master, the test/format/test_format.py tests in mathics-core fail.
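
A sketch of the distinction being argued for, using one-entry tables as illustration (U+F74C is the WL private-use code for \[DifferentialD] mentioned above; the function names are hypothetical):

    NAMED_TO_WL = {"DifferentialD": "\uf74c"}   # what the parser should emit
    WL_TO_UNICODE = {"\uf74c": "\U0001d451"}    # what the formatter may emit

    def parse_named_character(name):
        # Parsing keeps the WL-specific character, so later stages can still
        # recognize \[DifferentialD] inside, e.g., an integrand.
        return NAMED_TO_WL[name]

    def format_readable(s):
        # Only at output time is the readable unicode-equivalent substituted.
        return "".join(WL_TO_UNICODE.get(ch, ch) for ch in s)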
