dipanjans / text-analytics-with-python

Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, "Text Analytics with Python" published by Apress/Springer.

License: Apache License 2.0

Python 2.31% Jupyter Notebook 97.69%
text-analytics text-summarization text-classification python natural-language natural-language-processing clustering sentiment semantic sentiment-analysis

text-analytics-with-python's Introduction

Text Analytics with Python - 2nd Edition

A Practitioner's Guide to Natural Language Processing

Text analytics can be a bit overwhelming and frustrating at times with the unstructured and noisy nature of textual data and the vast amount of information available. "Text Analytics with Python" is a book packed with 674 pages of useful information based on techniques, algorithms, experiences and various lessons learnt over time in analyzing text data. This repository contains datasets and code used in this book. I will also be adding various notebooks and bonus content here from time to time. Keep watching this space!

About the book

Leverage Natural Language Processing (NLP) in Python and learn how to set up your own robust environment for performing text analytics. This second edition has gone through a major revamp and introduces several significant changes and new topics based on the recent trends in NLP.

You’ll see how to use the latest state-of-the-art frameworks in NLP, coupled with machine learning and deep learning models for supervised sentiment analysis powered by Python to solve actual case studies. Start by reviewing Python for NLP fundamentals on strings and text data and move on to engineering representation methods for text data, including both traditional statistical models and newer deep learning-based embedding models. Improved techniques and new methods around parsing and processing text are discussed as well.
Text summarization and topic models have been overhauled so the book showcases how to build, tune, and interpret topic models in the context of an interesting dataset on NIPS conference papers. Additionally, the book covers text similarity techniques with a real-world example of movie recommenders, along with sentiment analysis using supervised and unsupervised techniques. There is also a chapter dedicated to semantic analysis where you’ll see how to build your own named entity recognition (NER) system from scratch. While the overall structure of the book remains the same, the entire code base, modules, and chapters have been updated to the latest Python 3.x release.

Edition: 2nd
Pages: 674
Language: English
Book Title: Text Analytics with Python
Book Subtitle: A Practitioner's Guide to Natural Language Processing
Publisher: Apress (a part of Springer)
Print ISBN: 978-1-4842-4353-4
Online ISBN: 978-1-4842-4354-1
DOI: 10.1007/978-1-4842-4354-1
Copyright: Dipanjan Sarkar

With this book you will:

  • Understand NLP and text syntax, semantics and structure
  • Discover text cleaning and feature engineering strategies
  • Learn and implement text classification and text clustering
  • Understand and build text summarization and topic models
  • Learn about the promise of deep learning and transfer learning for NLP
  • Implement hands-on examples based on Python and several popular open source libraries in NLP and text analytics, such as the natural language toolkit (nltk), gensim, scikit-learn, spaCy, keras and tensorflow

text-analytics-with-python's People

Contributors

ambientlight, dipanjans, martijnvanbeers

text-analytics-with-python's Issues

ModuleNotFoundError: No module named 'normalization'

Can anyone advise how to fix this urgent problem while using Python 3?
It occurs when I try the code in Chapter 4:

----> from normalization import normalize_corpus
      import nltk
      from operator import itemgetter

ModuleNotFoundError: No module named 'normalization'
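This error usually means the interpreter cannot find a file named normalization.py on its import path. A minimal sketch of one workaround follows, assuming the book's normalization.py lives in some chapter folder of this repository; the helper name add_module_dir and the stand-in module demo_normalization are hypothetical and only serve to make the example runnable.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def add_module_dir(directory):
    """Put `directory` at the front of sys.path so modules in it can be imported."""
    directory = str(Path(directory).resolve())
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory


# Demo with a throwaway stand-in module. In practice, pass the folder that
# actually contains the book's normalization.py (its location is an assumption).
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "demo_normalization.py").write_text(
    "def normalize_corpus(corpus):\n"
    "    return [doc.lower().strip() for doc in corpus]\n"
)

add_module_dir(demo_dir)
importlib.invalidate_caches()
demo = importlib.import_module("demo_normalization")
result = demo.normalize_corpus(["  Hello World  "])
print(result)
```

Starting the notebook or script from the folder that holds normalization.py should have the same effect, since that folder is then searched first.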

Variable "re" ? Where?

I was not able to locate the definition of the variable "re".

What part of the source code is missing?
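For context, re is not a variable defined in the book's code: it is Python's standard-library regular expression module, so a NameError on re typically points to a missing import re line. A small sketch follows; the function name remove_special_characters is illustrative and not necessarily identical to the book's helper.

```python
import re  # standard-library regular expressions; nothing to install


def remove_special_characters(text):
    """Drop everything except letters, digits and whitespace, then tidy spacing."""
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return re.sub(r"\s+", " ", cleaned).strip()


cleaned = remove_special_characters("Hello,   world!! It's 2020.")
print(cleaned)
```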

Jupyter Notebooks for 2nd Edition?

Hello Dipanjan,

I was wondering whether the notebooks mentioned in the Safari/O'Reilly book are available? The link led me here, and I don't see them in the repo.

Thanks!

from pattern.en import tag raises BadZipFile in Chapter 6

When I run the code in Chapter 6, I get the following error:

File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1267, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

I tried to use pattern3, but it doesn't work. Google has little on this, and I can't solve it.
It would be great if someone who has gone through this problem could tell me how to solve it. Thanks a lot!
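One possible cause, stated here as an assumption rather than a confirmed diagnosis, is a partially downloaded or otherwise corrupted .zip archive in a data directory that the library reads at import time. The sketch below only shows how to locate such archives; the helper name find_corrupt_zips and the demo folder are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def find_corrupt_zips(root):
    """Return paths of *.zip files under `root` that are not valid zip archives."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*.zip")
        if not zipfile.is_zipfile(str(path))
    )


# Demo on a temporary folder holding one valid and one broken archive.
demo_root = Path(tempfile.mkdtemp())
with zipfile.ZipFile(str(demo_root / "good.zip"), "w") as archive:
    archive.writestr("readme.txt", "ok")
(demo_root / "broken.zip").write_bytes(b"this is not a zip file")

corrupt = find_corrupt_zips(demo_root)
print(corrupt)
```

Pointing the function at each directory listed in nltk.data.path, then deleting and re-downloading whatever it reports, may help if the assumption holds.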

I get an error

My nltk download is incomplete because one of the modules is out of date. So when I try your code I get an error: one of the modules does not work.

In [47]: from contractions import CONTRACTION_MAP

ImportError Traceback (most recent call last)
in ()
----> 1 from contractions import CONTRACTION_MAP

ImportError: No module named contractions

Bug in feature_extractors() (Chapter 4)

Going through feature_extraction_demo.py, the line:

avg_word_vec_features = averaged_word_vectorizer(corpus=TOKENIZED_CORPUS,
                                                 model=model,
                                                 num_features=10)

raises an AttributeError:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-17-cdd908e72f5c> in <module>()
      2 TOKENIZED_CORPUS
      3 avg_word_vec_features = averaged_word_vectorizer(corpus=TOKENIZED_CORPUS,
----> 4                                                  model=model, num_features=10)

/Users/athair/researchdone/text_analytics_with_python/codes/feature_extractors.pyc in averaged_word_vectorizer(corpus, model, num_features)
     58 
     59 def averaged_word_vectorizer(corpus, model, num_features):
---> 60     vocabulary = set(model.index2word)
     61     features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
     62                     for tokenized_sentence in corpus]

AttributeError: 'Word2Vec' object has no attribute 'index2word'

This happens with the latest gensim, v3.0.1.
I tried both pip install gensim and pulling directly from the gensim GitHub repository: https://github.com/RaRe-Technologies/gensim/

The accepted answer here suggests a fix: https://stackoverflow.com/questions/43146077/index2word-in-gensims-doc2vec-raises-an-attribute-error
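A hedged sketch of a version-tolerant lookup follows. It relies on the vocabulary list having moved over gensim releases: model.index2word in very old versions, model.wv.index2word in the 1.x to 3.x line, and model.wv.index_to_key from 4.0 onward. The helper name get_vocabulary is hypothetical, and the demo uses plain stand-in objects instead of a trained Word2Vec model so that it runs without gensim installed.

```python
from types import SimpleNamespace


def get_vocabulary(model):
    """Return the model's vocabulary as a set, trying the known attribute locations."""
    # Newer models keep word vectors on `model.wv`; very old ones on the model itself.
    keyed_vectors = getattr(model, "wv", model)
    for name in ("index_to_key", "index2word"):
        words = getattr(keyed_vectors, name, None)
        if words is not None:
            return set(words)
    raise AttributeError("could not find a vocabulary list on the model")


# Stand-ins that mimic the attribute layout of different gensim versions.
gensim4_like = SimpleNamespace(wv=SimpleNamespace(index_to_key=["sky", "blue"]))
gensim3_like = SimpleNamespace(wv=SimpleNamespace(index2word=["sky", "blue"]))
legacy_like = SimpleNamespace(index2word=["sky", "blue"])

vocabularies = [get_vocabulary(m) for m in (gensim4_like, gensim3_like, legacy_like)]
print(vocabularies)
```

In averaged_word_vectorizer, the line vocabulary = set(model.index2word) could then become vocabulary = get_vocabulary(model).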

Convert code base for Python 3.x

Python 3 is the future, even though a lot of legacy code and systems still run on Python 2 (including our applications, which is why I wrote this book in Python 2 in the first place). We need to slowly start migrating and building our code, apps and systems on Python 3.

Looking for experts in Python 3.x as well as NLP and text analytics who could help out in migrating each chapter's codebase to Python 3.x, since I am occupied for a major part of this year on other projects. I do have some parts of it ready for Python 3.x and can offer help and support whenever needed.

If your codebase migration is successful, you will be mentioned as a contributor in the acknowledgements and contributor list of this repository and project. You will also get a mention in future versions of the book whenever that is in the pipeline.

Computing BM25 Similarity for 30 Queries and 85000 Documents

Hello,

The code from the book for BM25 does not work for large datasets.

File "C:\Users\xxx\Anaconda2\lib\site-packages\scipy\sparse\base.py", line 1039, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)

MemoryError

It would be great if someone could change the code so that it works in my case. I'm currently trying this myself, but with no success so far :(.

Thanks
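The traceback shows the failure happening when a sparse matrix is converted to a dense array (np.zeros(self.shape, ...)), which for 85000 documents and a large vocabulary can exceed available memory. One way around this is never to build the dense matrix. The sketch below is not the book's implementation: it is a plain-Python Okapi BM25 over an inverted index, with k1=1.5, b=0.75 and the "+1 inside the log" IDF variant chosen as assumptions, and all function names are hypothetical.

```python
import heapq
import math
from collections import Counter, defaultdict


def build_index(tokenized_docs):
    """Build an inverted index: term -> list of (doc_id, term_frequency)."""
    postings = defaultdict(list)
    doc_lengths = []
    for doc_id, tokens in enumerate(tokenized_docs):
        doc_lengths.append(len(tokens))
        for term, freq in Counter(tokens).items():
            postings[term].append((doc_id, freq))
    return postings, doc_lengths


def bm25_top_k(query_tokens, postings, doc_lengths, k=10, k1=1.5, b=0.75):
    """Score only documents that share a term with the query; return top-k pairs."""
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths) / max(n_docs, 1)
    scores = defaultdict(float)
    for term in set(query_tokens):
        plist = postings.get(term, [])
        if not plist:
            continue
        # IDF variant that stays positive even for very common terms.
        idf = math.log(1 + (n_docs - len(plist) + 0.5) / (len(plist) + 0.5))
        for doc_id, freq in plist:
            norm = freq + k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] += idf * freq * (k1 + 1) / norm
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


docs = [
    ["the", "sky", "is", "blue"],
    ["the", "grass", "is", "green"],
    ["blue", "sky", "blue", "sea"],
]
postings, doc_lengths = build_index(docs)
top = bm25_top_k(["blue", "sky"], postings, doc_lengths, k=2)
print(top)
```

Memory use here grows with the number of (term, document) pairs and the documents matched by one query, not with documents times vocabulary size, so 30 queries can be scored one after another.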

Error in: text-analytics-with-python/New-Second-Edition/Ch05 - Text Classification/Ch05b - Text Classification - I.ipynb

Running this line:

# normalize our corpus

norm_corpus = tn.normalize_corpus(corpus=data_df['Article'], html_stripping=True, contraction_expansion=True,
accented_char_removal=True, text_lower_case=True, text_lemmatization=True,
text_stemming=False, special_char_removal=True, remove_digits=True,
stopword_removal=True, stopwords=stopword_list)

Returns error:
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_17860/2616894830.py in
6
7 # normalize our corpus
----> 8 norm_corpus = tn.normalize_corpus(corpus=data_df['Article'], html_stripping=True, contraction_expansion=True,
9 accented_char_removal=True, text_lower_case=True, text_lemmatization=True,
10 text_stemming=False, special_char_removal=True, remove_digits=True,

AttributeError: module 'text_normalizer' has no attribute 'normalize_corpus'

I can't find a reference to normalize_corpus in the text_normalizer documentation. Thanks
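One possible explanation, offered as an assumption, is that Python is importing a different text_normalizer than the one shipped in the notebook's folder, for example an unrelated installed package with the same name or a stale copy. A small diagnostic sketch follows; the helper name inspect_module is hypothetical, and the demo runs against a standard-library module so that it works anywhere.

```python
import importlib


def inspect_module(module_name, attribute):
    """Report where a module was loaded from and whether it exposes `attribute`."""
    module = importlib.import_module(module_name)
    return {
        "file": getattr(module, "__file__", None),
        "has_attribute": hasattr(module, attribute),
    }


# Demo with a standard-library module. For the issue above one would call
# inspect_module("text_normalizer", "normalize_corpus") inside the notebook.
report = inspect_module("json", "dumps")
missing = inspect_module("json", "no_such_function")
print(report, missing)
```

If the reported file is not the text_normalizer.py next to the notebook, the import is resolving to another module.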
