Code Monkey home page Code Monkey logo

scikit-sequitur's Introduction

Python SciKit Sequitur

SciKit Sequitur is an Apache2 licensed Python module for inferring compositional hierarchies from sequences.

Sequitur detects repetition and factors it out by forming rules in a grammar. The rules can be composed of non-terminals, giving rise to a hierarchy. It is useful for recognizing lexical structure in strings, and excels at very long sequences. The Sequitur algorithm was originally developed by Craig Nevill-Manning and Ian Witten.

>>> from sksequitur import parse
>>> grammar = parse('hello hello')
>>> print(grammar)
0 -> 1 _ 1
1 -> h e l l o                                    hello

SciKit Sequitur works on strings, lines, or any sequence of Python objects.

Features

  • Pure-Python
  • Developed on Python 3.10
  • Tested on CPython 3.6, 3.7, 3.8, 3.9, 3.10
  • Tested using GitHub Actions on Linux, Mac, and Windows

Quickstart

Installing scikit-sequitur is simple with pip:

$ pip install scikit-sequitur

You can access documentation in the interpreter with Python's built-in help function:

>>> import sksequitur
>>> help(sksequitur)                    # doctest: +SKIP

Tutorial

The scikit-sequitur module provides utilities for parsing sequences and understanding grammars.

>>> from sksequitur import parse
>>> print(parse('abcabc'))
0 -> 1 1
1 -> a b c                                        abc

The parse function is a shortcut for Parser and Grammar objects.

>>> from sksequitur import Parser
>>> parser = Parser()

Feed works incrementally.

>>> parser.feed('ab')
>>> parser.feed('cab')
>>> parser.feed('c')

Parsers can be converted to Grammars.

>>> from sksequitur import Grammar
>>> grammar = Grammar(parser.tree)
>>> print(grammar)
0 -> 1 1
1 -> a b c                                        abc

Grammars are keyed by Productions.

>>> from sksequitur import Production
>>> grammar[Production(0)]
[Production(1), Production(1)]

Mark symbols can be used to store metadata about a sequence. The mark symbol is printed as a pipe character "|".

>>> from sksequitur import Mark
>>> mark = Mark()
>>> mark
Mark()
>>> print(mark)
|

Attributes can be added to mark symbols using keyword arguments.

>>> mark = Mark(kind='start', name='foo.py')
>>> mark
Mark(kind='start', name='foo.py')
>>> mark.kind
'start'

Mark symbols can not be made part of a rule.

>>> parser = Parser()
>>> parser.feed('ab')
>>> parser.feed([Mark()])
>>> parser.feed('cab')
>>> parser.feed([Mark()])
>>> parser.feed('c')
>>> grammar = Grammar(parser.tree)
>>> print(grammar)
0 -> 1 | c 1 | c
1 -> a b                                          ab

Reference

License

Copyright 2021 Grant Jenks

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

scikit-sequitur's People

Contributors

grantjenks avatar

Stargazers

Johannes Hentschel avatar Alcides Fonseca avatar James McDermott avatar  avatar Carlos Sevilla Barceló avatar Emery Berger avatar Alonso Astroza Tagle avatar Oleg Dats avatar Daniel Mahler avatar

Watchers

 avatar James Cloos avatar  avatar

Forkers

zeitworks gsam

scikit-sequitur's Issues

Additional statistic: min depth

Calculate "min depth" per production. The depth of symbols at "g[0]" is 0. When visiting a production, the depth is increased by 1.

Also calculate "max depth" per production.

Forward grammar to the first object

I want to have the string that comes from a grammar. In fact, by having a grammar that we don't know what belongs to, reach the elements of the string (in the forward process of what the sequitur algorithm does). Now I wonder if there is any function for that in the scikit-sequitur package. If not, could you please tell me how can I have access the right part of the grammar parse export? As an example, in the following photo in the second line we have Production(5) which we have access to 7 c but I want to have access to gcc

Untitleddd

I didn't find any related code to the right export part in what you put here.

Need to support BFS and DFS traversal

def traverse_dfs(grammar, tokens):
    for token in tokens:
        yield token
        if isinstance(token, Production):
            yield from traverse_dfs(grammar[token])

def traverse_bfs(grammar, tokens):
    tokens = deque(tokens)
    for token in tokens:
        yield token
        if isinstance(token, Production):
            deque.extend(grammar[token])

Cannot import Parser

When I write from sksequitur import Parser I get:

ImportError Traceback (most recent call last)
in
----> 1 from sksequitur import Parser

ImportError: cannot import name 'Parser' from 'sksequitur'

Idea: Plotly UI for Exploration

Display table of statistics:

  • Production count
  • Production length
  • Production expansion size
  • Production minimum depth

Display plot:

  • Count vs. Length

Other ideas:

  • Symbol run length before Mark: "a b . 1 . 1 b c ." -- [2, 1, 3]
    • Whenever the "run length" is 1, the marked sequence is a duplicate.
    • Support for Mark-based statistics

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.