dipanjans / text-analytics-with-python

Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, "Text Analytics with Python" published by Apress/Springer.

License: Apache License 2.0

Python 2.31% Jupyter Notebook 97.69%

text-analytics text-summarization text-classification python natural-language natural-language-processing clustering sentiment semantic sentiment-analysis

text-analytics-with-python's Introduction

Text Analytics with Python - 2nd Edition

A Practitioner's Guide to Natural Language Processing

Text analytics can be a bit overwhelming and frustrating at times with the unstructured and noisy nature of textual data and the vast amount of information available. "Text Analytics with Python" is a book packed with 674 pages of useful information based on techniques, algorithms, experiences and various lessons learnt over time in analyzing text data. This repository contains datasets and code used in this book. I will also be adding various notebooks and bonus content here from time to time. Keep watching this space!

About the book

Leverage Natural Language Processing (NLP) in Python and learn how to set up your own robust environment for performing text analytics. This second edition has gone through a major revamp and introduces several significant changes and new topics based on the recent trends in NLP.

You’ll see how to use the latest state-of-the-art frameworks in NLP, coupled with machine learning and deep learning models for supervised sentiment analysis powered by Python to solve actual case studies. Start by reviewing Python for NLP fundamentals on strings and text data and move on to engineering representation methods for text data, including both traditional statistical models and newer deep learning-based embedding models. Improved techniques and new methods around parsing and processing text are discussed as well.
Text summarization and topic models have been overhauled so the book showcases how to build, tune, and interpret topic models in the context of an interesting dataset on NIPS conference papers. Additionally, the book covers text similarity techniques with a real-world example of movie recommenders, along with sentiment analysis using supervised and unsupervised techniques. There is also a chapter dedicated to semantic analysis where you’ll see how to build your own named entity recognition (NER) system from scratch. While the overall structure of the book remains the same, the entire code base, modules, and chapters have been updated to the latest Python 3.x release.

Edition: 2nd

Pages: 674

Language: English

Book Title: Text Analytics with Python

Book Subtitle: A Practitioner's Guide to Natural Language Processing

Publisher: Apress (a part of Springer)

Print ISBN: 978-1-4842-4353-4

Online ISBN: 978-1-4842-4354-1

DOI: 10.1007/978-1-4842-4354-1

Copyright: Dipanjan Sarkar

With this book you will:

Understand NLP and text syntax, semantics and structure
Discover text cleaning and feature engineering strategies
Learn and implement text classification and text clustering
Understand and build text summarization and topic models
Learn about the promise of deep learning and transfer learning for NLP
Implement hands-on examples based on Python and several popular open source libraries in NLP and text analytics, such as the natural language toolkit (nltk), gensim, scikit-learn, spaCy, keras and tensorflow

text-analytics-with-python's Issues

Is this ready?

Just wondering, README not clear.

ModuleNotFoundError: No module named 'normalization'

Can anyone advise how to fix this urgent problem when using Python 3?
While trying the code in Chapter 4, I get:

----> from normalization import normalize_corpus
      import nltk
      from operator import itemgetter

ModuleNotFoundError: No module named 'normalization'
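
One common cause of this error, offered here as an assumption rather than a confirmed diagnosis, is that normalization.py is a file shipped with the book's code rather than an installable package, so Python only finds it when its folder is on the import path. A minimal sketch; the helper name add_module_dir and the example path are hypothetical:

```python
import os
import sys


def add_module_dir(path):
    """Put a directory at the front of sys.path so its .py files can be imported."""
    abs_path = os.path.abspath(path)
    if abs_path in sys.path:
        sys.path.remove(abs_path)
    sys.path.insert(0, abs_path)
    return abs_path


# Hypothetical location of the chapter's code; adjust to your own checkout.
# add_module_dir("path/to/chapter_code")
# from normalization import normalize_corpus
```

Running the script or notebook from the folder that contains normalization.py should have the same effect, since the current directory is normally searched first.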

Variable "re" ? Where?

I was not able to locate where the variable "re" is defined.

Which part of the source code is missing?
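
If the code in question calls functions such as re.sub or re.compile, then re is most likely Python's standard regular-expression module rather than a variable, and the missing piece would be an import statement at the top of the script. A minimal sketch under that assumption; the pattern, the function name and the sample text are illustrative and not taken from the book:

```python
import re  # standard-library regular expression module

# Illustrative pattern: anything that is not a letter, digit, or whitespace.
SPECIAL_CHARS = re.compile(r"[^a-zA-Z0-9\s]")


def remove_special_characters(text):
    """Strip punctuation-like characters from a string."""
    return SPECIAL_CHARS.sub("", text)
```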

Jupyter Notebooks for 2nd Edition?

Hello Dipanjan,

I was wondering whether the notebooks mentioned in the Safari/O'Reilly edition of the book are available. The link led me here and I don't see them in the repo.

Thanks!

from pattern.en import tag raises BadZipFile in Chapter 6

When I run the code in Chapter 6, I get the following error:
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1267, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

I tried using pattern3, but it doesn't work either, and searching Google turned up little. I can't solve it.
It would be great if someone who has run into this problem could tell me how to solve it. Thanks a lot!

Keras and spaCy New Versions

Keras and spaCy have published updated versions, which break some of the follow-along book code.

I get an error

My nltk download is incomplete because one of the modules is out of date. So when I try your code I get an error, because one of the modules does not work.

In [47]: from contractions import CONTRACTION_MAP

ImportError Traceback (most recent call last)
in ()
----> 1 from contractions import CONTRACTION_MAP

ImportError: No module named contractions

Uploading Code from new edition?

Hi,
Would you be able to share content from the new edition?

CSV files cannot be downloaded

Hi,
Cloning your repository does not download the data files (CSV files).

How can I download the CSV files from your repo?

Bug in feature_extractors() (Chapter 4)

Going through feature_extraction_demo.py, the line:

avg_word_vec_features = averaged_word_vectorizer(corpus=TOKENIZED_CORPUS,
                                                 model=model,
                                                 num_features=10)

raises an AttributeError:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-17-cdd908e72f5c> in <module>()
      2 TOKENIZED_CORPUS
      3 avg_word_vec_features = averaged_word_vectorizer(corpus=TOKENIZED_CORPUS,
----> 4                                                  model=model, num_features=10)

/Users/athair/researchdone/text_analytics_with_python/codes/feature_extractors.pyc in averaged_word_vectorizer(corpus, model, num_features)
     58 
     59 def averaged_word_vectorizer(corpus, model, num_features):
---> 60     vocabulary = set(model.index2word)
     61     features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
     62                     for tokenized_sentence in corpus]

AttributeError: 'Word2Vec' object has no attribute 'index2word'

This happens with the latest gensim, v3.0.1.
I tried both pip install gensim and installing directly from the gensim GitHub repository: https://github.com/RaRe-Technologies/gensim/

The accepted answer here suggests a fix: https://stackoverflow.com/questions/43146077/index2word-in-gensims-doc2vec-raises-an-attribute-error
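
For context, gensim 1.0 moved the vocabulary list from the model object to model.wv, and gensim 4.0 later renamed index2word to index_to_key. The sketch below is one way to read the vocabulary regardless of the installed version. It is not the book's code, the helper name get_vocabulary is hypothetical, and it only inspects attributes, so it runs without gensim installed:

```python
def get_vocabulary(model):
    """Return the model's vocabulary as a set, across gensim API changes."""
    # gensim 1.0 moved the word vectors onto `model.wv`;
    # older models keep the vocabulary on the model itself.
    keyed_vectors = getattr(model, "wv", model)
    # gensim 4.0 renamed `index2word` to `index_to_key`, so try the new name first.
    for attr in ("index_to_key", "index2word"):
        words = getattr(keyed_vectors, attr, None)
        if words is not None:
            return set(words)
    raise AttributeError("no vocabulary attribute found on model")
```

In averaged_word_vectorizer, the failing line would then read vocabulary = get_vocabulary(model). Note that the word lookups inside average_word_vectors may need the same treatment, for example model.wv[word] instead of model[word].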

Convert code base for Python 3.x

Python 3 is the future. Even though a lot of legacy code and systems run on Python 2 (including our applications, which is why I wrote this book in Python 2 in the first place), we need to start migrating and building our code, apps and systems on Python 3.

I am looking for experts in Python 3.x as well as NLP and text analytics who could help migrate each chapter's codebase to Python 3.x, since I am occupied with other projects for a major part of this year. I do have some parts of it ready for Python 3.x and can offer help and support whenever needed.

If your codebase migration is successful, you will be mentioned as a contributor in the acknowledgements and contributor list of this repository and project. You will also get a mention in future versions of the book, whenever one is in the pipeline.

Non-functioning code in chapter 7: sentiwordnet example

This is also on page 356.

from nltk.corpus import sentiwordnet as swn

good = swn.senti_synsets('good', 'n')[0]
Traceback (most recent call last):
File "", line 1, in
TypeError: 'filter' object is not subscriptable
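
The TypeError indicates that under Python 3 senti_synsets returns a lazy filter object, which cannot be indexed. Materializing it with list(...) first, or taking the first element with next(...), avoids the error. The sketch below shows the pattern on a plain filter object so that it runs without NLTK data; the helper name first is hypothetical, and the commented line shows how the same idea would apply to the example above:

```python
def first(iterable):
    """Return the first item of any iterable, including lazy filter objects."""
    return next(iter(iterable))


# A filter object stands in for what senti_synsets returns on Python 3.
lazy = filter(lambda word: word.startswith("g"), ["bad", "good", "great"])
first_match = first(lazy)

# Applied to the issue's example (requires nltk and the sentiwordnet corpus):
# good = list(swn.senti_synsets('good', 'n'))[0]
```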

Computing BM25 Similarity for 30 Queries and 85000 Documents

Hello,

The code from the book for BM25 is not working for large datasets.

File "C:\Users\xxx\Anaconda2\lib\site-packages\scipy\sparse\base.py", line 1039, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)

MemoryError

It would be great if someone could change the code so that it works in my case. I'm currently trying this myself, but no success so far :(.

Thanks
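
The traceback shows the MemoryError being raised while a sparse matrix is converted to a dense array, which for 85000 documents times the vocabulary size can easily exceed available RAM. One way around this is to score each query only against the documents that contain at least one of its terms. The following is a minimal standard-library sketch of Okapi BM25 over an inverted index, not the book's implementation; the function names, the k1 and b defaults, and the non-negative idf variant are assumptions chosen for illustration:

```python
import math
from collections import Counter, defaultdict


def build_index(tokenized_docs):
    """Build an inverted index (term -> {doc_id: term frequency}) and document lengths."""
    postings = defaultdict(dict)
    doc_lengths = []
    for doc_id, tokens in enumerate(tokenized_docs):
        doc_lengths.append(len(tokens))
        for term, freq in Counter(tokens).items():
            postings[term][doc_id] = freq
    return postings, doc_lengths


def bm25_scores(query_tokens, postings, doc_lengths, k1=1.5, b=0.75):
    """Return {doc_id: score} for documents sharing at least one query term."""
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths) / n_docs
    scores = defaultdict(float)
    # Each distinct query term is counted once; query term frequency is ignored.
    for term in set(query_tokens):
        term_postings = postings.get(term)
        if not term_postings:
            continue
        df = len(term_postings)
        # Non-negative idf variant, so very common terms never subtract from a score.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in term_postings.items():
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return dict(scores)
```

Memory use here grows with the number of (term, document) pairs actually present in the corpus rather than with documents times vocabulary size, and documents that share no term with the query are never touched.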

Error in: text-analytics-with-python/New-Second-Edition/Ch05 - Text Classification/Ch05b - Text Classification - I.ipynb

Running this line:

# normalize our corpus
norm_corpus = tn.normalize_corpus(corpus=data_df['Article'], html_stripping=True, contraction_expansion=True,
                                  accented_char_removal=True, text_lower_case=True, text_lemmatization=True,
                                  text_stemming=False, special_char_removal=True, remove_digits=True,
                                  stopword_removal=True, stopwords=stopword_list)

Returns error:
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_17860/2616894830.py in
6
7 # normalize our corpus
----> 8 norm_corpus = tn.normalize_corpus(corpus=data_df['Article'], html_stripping=True, contraction_expansion=True,
9 accented_char_removal=True, text_lower_case=True, text_lemmatization=True,
10 text_stemming=False, special_char_removal=True, remove_digits=True,

AttributeError: module 'text_normalizer' has no attribute 'normalize_corpus'

I can't find a reference to normalize_corpus in the text_normalizer documentation. Thanks
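
A possible cause, stated here as an assumption, is that Python is importing a different text_normalizer module (for example a same-named package installed into the environment) instead of the text_normalizer.py file shipped next to the notebook. A small diagnostic sketch; the helper name describe_module is hypothetical:

```python
import importlib


def describe_module(module_name, attribute):
    """Report where a module was loaded from and whether it defines an attribute."""
    module = importlib.import_module(module_name)
    location = getattr(module, "__file__", "<built-in>")
    return location, hasattr(module, attribute)


# In the notebook one would call, for example:
# print(describe_module("text_normalizer", "normalize_corpus"))
```

If the reported path is not the notebook's own folder, the local file is being shadowed by another module of the same name.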