
mathics-scanner's Issues

Wrong behaviour in parsing escaped backslash

Consider the following input:

In[1]:= "\\[Integral]" == "\\" <> "[Integral]"

In WMA, the result is

Out[1]= True

On the other hand, in Mathics the result is

Out[1]= False

The reason is that mathics-scanner parses the LHS as "\u222b", which differs from the RHS "\[Integral]".
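
A minimal sketch of the expected scanning order (this is not the actual mathics-scanner code, and lookup_named_character is a hypothetical stand-in for the real named-character table): an escaped backslash must be consumed before the scanner tries to match a \[Name] sequence.

    import re

    NAMED = re.compile(r"\\\[(\w+)\]")

    def lookup_named_character(name):
        # Hypothetical stand-in for the real named-character table.
        return {"Integral": "\u222b"}[name]

    def scan_string(s):
        out = []
        i = 0
        while i < len(s):
            if s.startswith("\\\\", i):   # an escaped backslash wins first
                out.append("\\")
                i += 2
            else:
                m = NAMED.match(s, i)     # only then try \[Name]
                if m:
                    out.append(lookup_named_character(m.group(1)))
                    i = m.end()
                else:
                    out.append(s[i])
                    i += 1
        return "".join(out)

    assert scan_string(r"\\[Integral]") == "\\" + "[Integral]"
    assert scan_string(r"\[Integral]") == "\u222b"

With this ordering, the source text "\\[Integral]" scans to a literal backslash followed by "[Integral]", matching the WMA result above.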

I found it when I tried to write tests for Mathics3/mathics-core#541

Document the tokeniser

We should add some documentation on the tokeniser before releasing the package. Simple docstrings will do for now.

requirements-extra.txt not included in released source code and missing description

The file requirements-extra.txt is not included in the released source code, so building directly from that is not possible.

Since I'm packaging this for AUR, I would like to have a brief description of what each extra requirement brings. In this case, what would a user gain by installing ujson? I know it used to say # Optional Used in mathics_scanner.characters, but that doesn't say much about why one would (or wouldn't) install it.

Make the tokeniser more useful for other people

While working on #11 it became clear to me that the tokeniser is still very much tied to the internals of Mathics core. For instance, the whole messaging mechanism is completely useless to anyone other than us (the developers of Mathics) and could likely be replaced entirely by raising exceptions. There are also several improvements that would make the public interface cleaner and more intuitive.

I propose the following changes:

  • Entirely remove the messaging mechanism from Tokeniser and LineFeed (this will require some refactoring in core)
  • Implement __next__ for Tokeniser by simply calling the next method, so the tokeniser becomes a standard Python iterator (see the sketch after this list)
  • Add functionality to control whether comments are skipped or not (this is useful for syntax-highlighting-related use cases)
  • Remove the tag parameter of Token(tag, text, pos) and mark the type of the token by using subclasses of Token (i.e. Token("Number", "3", 4) becomes NumberToken("3", 4))
  • Rethink the usage of the incomplete method: I'd like to remove it (since it's more of an implementation detail than anything else), but it's used in core so we'd have to deal with that too. Even if we don't remove it, we should rename it to something more descriptive
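
A minimal sketch of the second and fourth points, with illustrative names (EndToken in particular is an assumption, not part of the current API, and the real tokeniser scans text rather than taking a list of tokens):

    class Token:
        """Base token: just the text and its position in the input."""
        def __init__(self, text, pos):
            self.text = text
            self.pos = pos

        def __repr__(self):
            return f"{type(self).__name__}({self.text!r}, {self.pos})"

    class NumberToken(Token):   # Token("Number", "3", 4) -> NumberToken("3", 4)
        pass

    class SymbolToken(Token):
        pass

    class EndToken(Token):      # hypothetical end-of-input sentinel
        pass

    class Tokeniser:
        def __init__(self, tokens):
            # Stand-in: the real tokeniser would scan source text here.
            self._tokens = iter(tokens)

        def next(self):
            return next(self._tokens, EndToken("", -1))

        def __next__(self):
            # Simply delegates to next(), as proposed above.
            token = self.next()
            if isinstance(token, EndToken):
                raise StopIteration
            return token

        def __iter__(self):
            return self

    for token in Tokeniser([NumberToken("3", 0), SymbolToken("x", 2)]):
        print(token)   # NumberToken('3', 0), then SymbolToken('x', 2)

With __iter__ and __next__ in place, callers can write a plain for loop over the tokeniser instead of calling next() by hand.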

Ideally, I'd like to take care of this before the release, since these are breaking changes and would therefore require a major version bump if we merged them after the first release. However, I understand that the required refactoring will take some time, so I'm OK with doing this after the first release.

I also take full responsibility for this. I can do most of this work on my own if the rest of the contributors aren't interested. @rocky @mmatera Thoughts?

Properly delineate our public API

We should decide what's going to be part of our public API before releasing. Things that aren't in the public API should be marked with a _ at the start of their names.

As a start, I propose that the public API consist only of replace_*_with_*, named_characters, esc_aliases, generate.rl_inputrc, and whatever is used by Mathics core; everything else stays private. @rocky @mmatera Thoughts?
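
One possible way to mark that boundary, as a sketch only (the names listed are examples matching the patterns above, not a final decision):

    # Public names are listed explicitly; everything else gets a leading
    # underscore so it is clearly internal.
    __all__ = [
        "named_characters",
        "esc_aliases",
        "replace_wl_with_plain_text",   # one of the replace_*_with_* family
    ]

    def _load_tables():
        # The leading underscore marks this helper as internal.
        ...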

Translation tests!

For tables this size, consistency tests on the generated tables are useful:

  • Most symbols round-trip. Those that don't would be noted in a list in the test, and all others would be checked for round-tripping.
  • The number of keys in each table agrees.
  • The set of symbols that maps back to a WL long name is exactly the set of WL symbols.
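
A sketch of what such tests could look like under pytest, using one-entry stand-in tables (the real tables are generated from the YAML data, and the table names here are illustrative):

    # Illustrative stand-ins; the real tables are generated from the YAML data.
    wl_to_unicode_dict = {"\uf74c": "\U0001d451"}   # e.g. DifferentialD -> 𝑑
    unicode_to_wl_dict = {u: w for w, u in wl_to_unicode_dict.items()}
    KNOWN_NON_ROUNDTRIP = set()   # symbols noted in the test as not round-tripping

    def test_table_sizes_agree():
        assert len(unicode_to_wl_dict) == len(wl_to_unicode_dict)

    def test_round_trip():
        for wl_char, uni_char in wl_to_unicode_dict.items():
            if wl_char in KNOWN_NON_ROUNDTRIP:
                continue
            assert unicode_to_wl_dict[uni_char] == wl_char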

Test errors with Python 3.9.16 on FreeBSD

All tests passed with 1.2.4, but five AssertionErrors are raised with 1.3.0:

FAILED test/test_tokeniser.py::test_apply - assert [Token(Symbol...Symbol, x, 5)] == [Token(Symbol...Symbol, x, 5)]
FAILED test/test_tokeniser.py::test_association - assert [Token(RawLef...ation, |>, 8)] == [Token(RawLef...ation, |>, 8)]
FAILED test/test_tokeniser.py::test_integeral - assert [Token(Integr...Symbol, y, 6)] == [Token(Integr...Symbol, y, 6)]
FAILED test/test_tokeniser.py::test_set - assert [Token(Symbol...Symbol, y, 4)] == [Token(Symbol...Symbol, y, 4)]
FAILED test/test_unicode_equivalent.py::test_has_unicode_equivalent - AssertionError: In Alternative - remove add unicode equivalent

Full log attached.
mathics-scanner_log.txt

Support for parsing Unicode characters from their codes

In WMA, a not-very-well-documented feature allows inputting Unicode characters that do not have a name. For example, for a 32-bit character, we can write \|xxxxxx, where xxxxxx are the hexadecimal digits of the character code:

In[1]:= "\|01D451"

produces in WMA

Out[1]= 𝑑

(notice that this character does not have a name in WL).
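
A minimal sketch of how such \|xxxxxx escapes could be expanded, assuming exactly six hexadecimal digits as in the example (this is an illustration, not the actual scanner code):

    import re

    # Six hex digits after \| name the code point directly.
    HEX6 = re.compile(r"\\\|([0-9A-Fa-f]{6})")

    def expand_hex_escapes(s):
        return HEX6.sub(lambda m: chr(int(m.group(1), 16)), s)

    print(expand_hex_escapes(r"\|01D451"))   # prints 𝑑 (U+1D451)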

The parser should convert NamedCharacters into wl-code as before (1.2.4), not into unicode-equivalent

After the last release, the behavior of MathicsScanner changed: named characters in strings are now mapped to their unicode-equivalent instead of to wl-code as before. After fighting with the formatter code in mathics-core, I think this behavior is wrong.
The reason is that the goal of having a unicode-equivalent is to provide readable output, not an efficient way to store characters.

The example came up with "\[DifferentialD]". In 1.2.4, this string was parsed as "\uf74c", a WL-specific character with a specific meaning. If a string has a form like "\[Integral]F[x]\[DifferentialD] x", it can afterwards be parsed as the expression
Integrate[F[x], x]. On the other hand, if we want to produce a printable version, \[DifferentialD]
could be converted into d, into "\U0001d451" (𝑑), or into the LaTeX \,d, depending on where it is needed.

With the current behavior in master, the test/format/test_format.py tests in mathics-core fail.
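
A sketch of the distinction being argued for, using one-entry tables as illustration (U+F74C is the WL private-use code for \[DifferentialD] mentioned above; the function names are hypothetical):

    NAMED_TO_WL = {"DifferentialD": "\uf74c"}   # what the parser should emit
    WL_TO_UNICODE = {"\uf74c": "\U0001d451"}    # what the formatter may emit

    def parse_named_character(name):
        # Parsing keeps the WL-specific character, so later stages can still
        # recognize \[DifferentialD] inside, e.g., an integrand.
        return NAMED_TO_WL[name]

    def format_readable(s):
        # Only at output time is the readable unicode-equivalent substituted.
        return "".join(WL_TO_UNICODE.get(ch, ch) for ch in s)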
