
grasp's Introduction

Grasp

Grasp.py – Explainable AI

Grasp is a lightweight AI toolkit for Python, with tools for data mining, natural language processing (NLP), machine learning (ML) and network analysis. It has 300+ fast and essential algorithms, with ~25 lines of code per function, self-explanatory function names, no dependencies, bundled into one well-documented file: grasp.py (250KB). Or install with pip, including language models (25MB):

$ pip install git+https://github.com/textgain/grasp

Tools for Data Mining

Download stuff with download(url) (or dl), with built-in caching and logging:

src = dl('https://www.textgain.com', cached=True)

Parse HTML with dom(html) into an Element tree and search it with CSS Selectors:

for e in dom(src)('a[href^="http"]'): # external links
    print(e.href)

Strip HTML with plain(Element) to get a plain text string:

for word, count in wc(plain(dom(src))).items():
    print(word, count)

Find articles with wikipedia(str), in HTML:

for e in dom(wikipedia('cat', language='en'))('p'):
    print(plain(e))

Find opinions with twitter.search(str):

for tweet in first(10, twitter.search('from:textgain')): # latest 10
    print(tweet.id, tweet.text, tweet.date)

Deploy APIs with App. Works with WSGI and Nginx:

app = App()
@app.route('/')
def index(*path, **query):
    return 'Hi! %s %s' % (path, query)
app.run('127.0.0.1', 8080, debug=True)

Once the app is running, visit http://127.0.0.1:8080/app?q=cat.
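
The same check can be scripted from a second process with the standard library (a minimal sketch; the expected response body is inferred from index() above, not documented):

from urllib.request import urlopen

# Query the running app; the URL path and query string land in *path and **query.
print(urlopen('http://127.0.0.1:8080/app?q=cat').read().decode('utf-8'))
# Expected: something like "Hi! ('app',) {'q': 'cat'}"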

Tools for Natural Language Processing

Get language with lang(str) for 40+ languages and ~92.5% accuracy:

print(lang('The cat sat on the mat.')) # {'en': 0.99}

Get locations with loc(str) for 25K+ EU cities:

print(loc('The cat lives in Catena.')) # {('Catena', 'IT', 43.8, 11.0): 1}

Get words & sentences with tok(str) (tokenize) at ~125K words/sec:

print(tok("Mr. etc. aren't sentence breaks! ;) This is:.", language='en'))

Get word polarity with pov(str) (point-of-view). Is it a positive or negative opinion?

print(pov(tok('Nice!', language='en'))) # +0.6
print(pov(tok('Dumb.', language='en'))) # -0.4
  • For de, en, es, fr, nl, with ~75% accuracy.
  • You'll need the language models in grasp/lm.
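
Since pov() returns a signed score, a label can be derived by thresholding it (a minimal sketch; the zero cutoff is our assumption, not part of the API):

def polarity(s):
    # Positive score = positive opinion; 0.0 is an arbitrary cutoff.
    return '+' if pov(tok(s, language='en')) >= 0 else '-'

print(polarity('Nice!')) # '+'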

Tag word types with tag(str) in 10+ languages using robust ML models from UD:

for word, pos in tag(tok('The cat sat on the mat.'), language='en'):
    print(word, pos)
  • Parts-of-speech include NOUN, VERB, ADJ, ADV, DET, PRON, PREP, ...
  • For ar, da, de, en, es, fr, it, nl, no, pl, pt, ru, sv, tr, with ~95% accuracy.
  • You'll need the language models in grasp/lm.
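
Because tag() yields (word, pos) pairs, filtering by part-of-speech is a one-line list comprehension (a sketch using only the calls shown above):

words = tag(tok('The cat sat on the mat.'), language='en')
nouns = [w for w, pos in words if pos == 'NOUN']
print(nouns) # ['cat', 'mat']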

Tag keywords with trie, a compiled dict that scans ~250K words/sec:

t = trie({'cat*': 1, 'mat': 2})
for i, j, k, v in t.search('Cats love catnip.', etc='*'):
    print(i, j, k, v)
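
Presumably i and j are the start and end offsets of the matched string k, with v its value in the dict (an assumption from the example above, not documented). If so, matches can be sliced straight out of the text:

s = 'Cats love catnip.'
for i, j, k, v in t.search(s, etc='*'):
    print(s[i:j], '->', v) # e.g., 'Cats' -> 1, 'catnip' -> 1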

Get answers with gpt(). You'll need an OpenAI API key.

print(gpt("Why do cats sit on mats? (you're a psychologist)", key='...'))

Tools for Machine Learning

Machine Learning (ML) algorithms learn by example. If you show them 10K spam and 10K real emails (i.e., train a model), they can predict whether other emails are also spam or not.

Each training example is a {feature: weight} dict with a label. For text, the features could be words, the weights could be word count, and the label might be real or spam.
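
For instance, a short spam message could be encoded by hand like this (an illustration of the format, not grasp output):

# One training example: word features, count weights, 'spam' label.
v = {'win': 1, 'cash': 1, 'now': 2}
example = (v, 'spam')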

Quantify text with vec(str) (vectorize) into a {feature: weight} dict:

v1 = vec('I love cats! 😀', features=('c3', 'w1'))
v2 = vec('I hate cats! 😡', features=('c3', 'w1'))
  • c1, c2, c3 count consecutive characters. For c2, cats → 1x ca, 1x at, 1x ts.
  • w1, w2, w3 count consecutive words.
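
The character n-gram idea behind c2 and c3 fits in a few lines of plain Python (an illustration of the concept, not grasp's internals):

from collections import Counter

def char_ngrams(s, n=2):
    # 'cats', n=2 -> {'ca': 1, 'at': 1, 'ts': 1}
    return Counter(s[i:i+n] for i in range(len(s) - n + 1))

print(char_ngrams('cats'))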

Train models with fit(examples), save as JSON, predict labels:

m = fit([(v1, '+'), (v2, '-')], model=Perceptron) # DecisionTree, KNN, ...
m.save('opinion.json')
m = fit(open('opinion.json'))
print(m.predict(vec('She hates dogs.'))) # {'+': 0.4, '-': 0.6}

Once trained, Model.predict(vector) returns a dict with label probabilities (0.0โ€“1.0).
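
Because that return value is an ordinary dict, the most likely label is simply its argmax:

p = m.predict(vec('She hates dogs.'))
print(max(p, key=p.get)) # '-'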

Tools for Network Analysis

Map networks with Graph, a {node1: {node2: weight}} dict subclass:

g = Graph(directed=True)
g.add('a', 'b') # a → b
g.add('b', 'c') # b → c
g.add('b', 'd') # b → d
g.add('c', 'd') # c → d
print(g.sp('a', 'd')) # shortest path: a → b → d
print(top(pagerank(g))) # strongest node: d, 0.8
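
Since Graph subclasses dict, edges should also be readable directly (a sketch assuming a default edge weight of 1, which is not documented above):

print(g['b']) # e.g., {'c': 1.0, 'd': 1.0}
print('d' in g['b']) # True: the edge b → d exists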

See networks with viz(graph):

with open('g.html', 'w') as f:
    f.write(viz(g, src='graph.js'))

You'll need to set src to the grasp/graph.js lib.

Tools for Comfort

Easy date handling with date(v), where v is an int, a str, or another date:

print(date('Mon Jan 31 10:00:00 +0000 2000', format='%Y-%m-%d'))
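
If int values are Unix timestamps (an assumption based on the signature above, not a documented guarantee), the same call would format them too:

print(date(0, format='%Y-%m-%d')) # presumably 1970-01-01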

Easy path handling with cd(...), which always points to the script's folder:

print(cd('kb', 'en-loc.csv'))

Easy CSV handling with csv([path]), a list of lists of values:

for code, country, _, _, _, _, _ in csv(cd('kb', 'en-loc.csv')):
    print(code, country)
data = csv()
data.append(('cat', 'Kitty'))
data.append(('cat', 'Simba'))
data.save(cd('cats.csv'))
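
Reading the saved file back uses the same csv(path) call as above:

for species, name in csv(cd('cats.csv')):
    print(species, name) # cat Kitty, cat Simba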

Tools for Good

A challenge in AI is bias introduced by human trainers. Remember the Model trained earlier? Grasp has tools to explain how & why it makes decisions:

print(explain(vec('She hates dogs.'), m)) # why so negative?

In the returned dict, the model's explanation is: "you wrote hat + ate (hate)".
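
Since the explanation apparently comes back as a {feature: contribution} dict, the strongest cues can be ranked (a sketch, assuming that return type):

e = explain(vec('She hates dogs.'), m)
for f, w in sorted(e.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f, w) # character trigrams like 'hat' and 'ate' should rank high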


grasp's Issues

Installation bug in setup.py?

Hi,
executing !python setup.py install in the directory where I cloned your https://github.com/vicru/grasp returns:

warning: install_lib: 'build\lib' does not exist -- no Python modules to install zip_safe flag not set; analyzing archive contents...

Even though the above-mentioned execution also returns:

running install
running bdist_egg
running egg_info
writing Grasp.egg-info\PKG-INFO
writing dependency_links to Grasp.egg-info\dependency_links.txt
writing requirements to Grasp.egg-info\requires.txt
writing top-level names to Grasp.egg-info\top_level.txt
reading manifest file 'Grasp.egg-info\SOURCES.txt'
writing manifest file 'Grasp.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
creating build\bdist.win-amd64\egg
creating build\bdist.win-amd64\egg\EGG-INFO
copying Grasp.egg-info\PKG-INFO -> build\bdist.win-amd64\egg\EGG-INFO
copying Grasp.egg-info\SOURCES.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying Grasp.egg-info\dependency_links.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying Grasp.egg-info\requires.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying Grasp.egg-info\top_level.txt -> build\bdist.win-amd64\egg\EGG-INFO
creating 'dist\Grasp-2.0-py3.8.egg' and adding 'build\bdist.win-amd64\egg' to it
removing 'build\bdist.win-amd64\egg' (and everything under it)
Processing Grasp-2.0-py3.8.egg
Removing c:\users\drcrac\anaconda3\lib\site-packages\Grasp-2.0-py3.8.egg
Copying Grasp-2.0-py3.8.egg to c:\users\drcrac\anaconda3\lib\site-packages
Grasp 2.0 is already the active version in easy-install.pth
Installed c:\users\drcrac\anaconda3\lib\site-packages\grasp-2.0-py3.8.egg
Processing dependencies for Grasp==2.0
Finished processing dependencies for Grasp==2.0

When I execute from grasp import download, the following error shows up:

ImportError: cannot import name 'download' from 'grasp' (C:\Users\drcrac\anaconda3\lib\site-packages\grasp\__init__.py)

Which makes me believe that I am doing something wrong with my installation method. I would highly appreciate some feedback in this respect. I couldn't find your library on pip, which is why I tried the above-mentioned installation. By the way, my goal is to execute your proof of concept: https://gist.github.com/tom-de-smedt/9c9d9b9168ba703e0c336ee0128ebae5

VERB being tagged as NOUN

In running a local test with some boilerplate text, I'm getting a result that isn't tagging the verb properly:

>>> parsed = list(parse("The quick brown fox jumps over the lazy dog."))
>>> parsed[0]
Sentence(u'The/DET quick/ADJ brown/ADJ fox/NOUN jumps/NOUN over/PREP the/DET lazy/ADJ dog/NOUN ./PUNC')

Note that if I change the verb to "jumped" it tags it correctly:

>>> parsed = list(parse("The quick brown fox jumped over the lazy dog."))
>>> parsed[0]
Sentence(u'The/DET quick/ADJ brown/ADJ fox/NOUN jumped/VERB over/PREP the/DET lazy/ADJ dog/NOUN ./PUNC')

Cloned the latest master, running on the following interpreter:

Python 2.7.12 (default, Oct 11 2016, 05:24:00)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)] on darwin

Let me know if you need anything else!

-R

Error fid = open(os_fspath(file), "rb") on Python 3.5.0

Hello, thank you for visiting my issue.
I'm studying TensorFlow by following "pygta5" on YouTube.
I tried to run the balance_data.py file (https://github.com/Sentdex/pygta5/tree/master/Tutorial%20Codes/Part%208-13%20code) in both the command prompt and IDLE.

Error on command prompt :

Traceback (most recent call last):
File "C:\Windows\System32\pygta5-master\Tutorial Codes\Part 8-13 code\balance_data.py", line 8, in
train_data = np.load('training_data.npy')
File "C:\Users\decax64\AppData\Local\Programs\Python\Python35\lib\site-packages\numpy\lib\npyio.py", line 415, in load
fid = open(os_fspath(file), "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'training_data.npy'

Error on IDLE (only the last 3 lines) :

File "C:\Users~\AppData\Local\Programs\Python\Python35\lib\shutil.py", line 1062, in get_terminal_size
size = os.get_terminal_size(sys.stdout.fileno())
AttributeError: 'NoneType' object has no attribute 'fileno'

I guess it's a directory error, but it doesn't work even when I copy the training_data.npy file to the /lib/ folder.

Windows 10
Python 3.5.0
CPU: Xeon W3530 (similar to Core i7 900), with neither AVX nor AVX2

The troublemaker file is balance_data.py.
My current step is:
https://www.youtube.com/watch?v=wIxUp-37jVY&list=PLQVvvaa0QuDeETZEOy4VdocT7TOjfSA8a&index=10

Regards...

URL wrapped in parentheses is not being tokenized properly

Hi there! In some use case testing, I discovered the following behavior:

>>> tokenize("(http://google.com)")
'( http://google.com)'

It looks like the tokens() rule that fires for closing parens occurs after the URL token rule, so the closing paren doesn't get split out as a separate token:

        if w.startswith('('):                                        # (http://
            return '( ' + tokens(w[1:])
        if re.search(r'^https?://', w):                              # http://
            return w
        if re.search(r'[^:;-][),:]$', w):                            # U.S.,
            return tokens(w[:-1]) + ' ' + w[-1]
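
One possible fix (a sketch, not a committed patch) is to let the URL rule peel a trailing paren off before returning the token whole:

        if re.search(r'^https?://', w):                          # http://
            if w.endswith(')'):                                  # ...com)
                return tokens(w[:-1]) + ' ' + w[-1]              # split ')' off
            return w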
