
Textify - A Text Preprocessing Web Application

A text preprocessing web application that gives the user a summary of an article, plus a text generator that produces text based on user input.

Technologies used


Working of this application

The application is launched from the command line in a Python environment and uses Flask, a popular Python micro web framework. It consists of two main pages served on localhost: the user is first shown a form, and on submit the entered content is summarised with the help of a few Python packages. The summarised data is then exported to the next, connected page.
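
As a rough illustration, here is a minimal sketch of what such a two-page Flask flow can look like (the route names, template names, and the summarize helper below are hypothetical placeholders, not the project's actual code):

    from flask import Flask, render_template, request
    from nltk.tokenize import sent_tokenize

    app = Flask(__name__)

    def summarize(text, n=2):
        # Placeholder summariser: a real one would score sentences with nltk;
        # here we simply keep the first n sentences.
        return ' '.join(sent_tokenize(text)[:n])

    # Landing page: shows the form where the user pastes an article.
    @app.route('/')
    def index():
        return render_template('index.html')

    # On submit, the posted content is summarised and passed to the next page.
    @app.route('/summary', methods=['POST'])
    def summary():
        article = request.form['content']
        return render_template('summary.html', summary=summarize(article))

    if __name__ == '__main__':
        app.run(debug=True)   # serves the app on http://127.0.0.1:5000/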

For text generation the project uses Hugging Face, an NLP-focused startup with a large open-source community that provides an open-source library for Natural Language Processing. Their core approach to natural language processing revolves around Transformers. This Python-based library exposes an API for many well-known architectures that achieve state-of-the-art results on NLP tasks such as text classification, information extraction, question answering, and text generation. All of the architectures come with pre-trained deep-learning weights, which makes such tasks easy to get started with. These transformer models come in different shapes and sizes, and each has its own way of tokenizing input data: a tokenizer takes an input word and encodes it as a number, which allows faster processing.
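
As a quick illustration of this encoding step, here is a minimal sketch using the GPT-2 tokenizer from transformers (the sample sentence is arbitrary):

    from transformers import GPT2Tokenizer

    # Load the pre-trained GPT-2 tokenizer (vocabulary files are downloaded on first use).
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    # Encode a sample sentence into token ids, the numbers the model actually processes.
    ids = tokenizer.encode("Textify turns words into numbers")
    print(ids)                    # a list of integer token ids
    print(tokenizer.decode(ids))  # decodes back to the original text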

Tokenizer in Python

In both text-processing parts the tokenizer plays a vital role. In Python, tokenization basically refers to splitting a larger body of text into smaller lines or words (or even creating words for a non-English language). The various tokenization functions are built into the nltk module itself and can be used in programs as shown below.
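
For example, a minimal sketch of the built-in nltk tokenizers (the sample text is arbitrary, and it assumes the 'punkt' tokenizer data has been downloaded as described in the setup steps below):

    from nltk.tokenize import sent_tokenize, word_tokenize

    text = "Textify summarises articles. It can also generate new text."

    # Split the text into sentences, then into individual words.
    print(sent_tokenize(text))   # ['Textify summarises articles.', 'It can also generate new text.']
    print(word_tokenize(text))   # ['Textify', 'summarises', 'articles', '.', 'It', ...]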

This project comprises three modules, namely:

  • Landing Page
  • Summarized Content as Output
  • Text Generation as Output

Run these commands in the Windows terminal:

Note: before running the text summarisation, run this command first:

pip install nlp nltk

For exporting and processing the data, run the following script in a new .py file before running the application:

   import nltk
   nltk.download('stopwords')   # stop-word list used during summarisation
   nltk.download('punkt')       # tokenizer data needed by word_tokenize and sent_tokenize

To run and initialize the application, there are two alternative methods:

  • Method 1 : run from the editor in a venv and view the localhost application in any browser at http://127.0.0.1:5000/
  • Method 2 : run from the command prompt, at the project's path location, using the following command:
 python __init__.py

Landing Page


Summarisation (Before Summarisation)


Output (Summarised content of Article)



For the Text Generation Part

Run these commands before running the text generation:

 pip install tensorflow
 pip install transformers
 pip3 install torch torchvision torchaudio

For Conda

conda install pytorch torchvision torchaudio cpuonly -c pytorch

Note : while running the text generator part, the required files (i.e. the GPT-2 model) are downloaded automatically.

Some terms and their meaning in the project (see the sketch after this list):

  • max_length : the maximum number of tokens to generate, i.e. how much text you want to see in the output.
  • input_ids : indices of the input sequence tokens in the vocabulary.
  • pad_token_id : the token id used for padding; GPT-2 has no padding token by default, so it is usually set to the end-of-sequence token id.
  • num_beams : the beam search width used to find the next appropriate words in the sequence.
  • no_repeat_ngram_size : stops the model repeating certain sequences over and over again (basically it prevents repeated words or phrases).
  • early_stopping : stops beam search early once enough complete candidate sequences are found, instead of continuing to generate.
  • skip_special_tokens : should always be True because we want the returned sentences to contain only words, not end-of-sentence or other special tokens.
  • return_tensors : 'pt' refers to PyTorch tensors.
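
A minimal sketch of how these parameters fit together in a GPT-2 generation call (the prompt and the exact parameter values are illustrative, not necessarily the ones used in the project):

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')   # downloads the GPT-2 weights on first run

    # Encode the user's prompt into input_ids as PyTorch tensors.
    input_ids = tokenizer.encode("The future of text preprocessing", return_tensors='pt')

    # Beam search generation using the parameters described above.
    output = model.generate(
        input_ids,
        max_length=50,                        # generate at most 50 tokens
        num_beams=5,                          # beam search width
        no_repeat_ngram_size=2,               # never repeat the same 2-gram
        early_stopping=True,                  # stop once the beams are finished
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
    )

    # Decode the best sequence, dropping special tokens so only words are returned.
    print(tokenizer.decode(output[0], skip_special_tokens=True))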

Text-Generator (landing page)


Text-Generator (Output)



This concludes the project: the output is generated by matching the keywords the user has entered.

