
Textify - A Text Preprocessing Web Application

A text preprocessing web application that gives the user a summary of an article, plus a text generator that produces text based on user input.

Technologies used


Working of this application

The application is launched from the command line in a Python environment and uses Flask, a popular Python micro web framework. It consists of two main pages served on localhost: the user is first shown a form, and on submit the entered content is summarised with the help of a few Python packages. The summarised data is then exported to the next, connected page.
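
As a rough illustration, here is a minimal sketch of what such a two-page Flask flow can look like (the route names, template names, and the summarize helper below are hypothetical placeholders, not the project's actual code):

    from flask import Flask, render_template, request
    from nltk.tokenize import sent_tokenize

    app = Flask(__name__)

    def summarize(text, n=2):
        # Placeholder summariser: a real one would score sentences with nltk;
        # here we simply keep the first n sentences.
        return ' '.join(sent_tokenize(text)[:n])

    # Landing page: shows the form where the user pastes an article.
    @app.route('/')
    def index():
        return render_template('index.html')

    # On submit, the posted content is summarised and passed to the next page.
    @app.route('/summary', methods=['POST'])
    def summary():
        article = request.form['content']
        return render_template('summary.html', summary=summarize(article))

    if __name__ == '__main__':
        app.run(debug=True)   # serves the app on http://127.0.0.1:5000/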

For text generation the project uses Hugging Face, an NLP-focused startup with a large open-source community that provides an open-source library for Natural Language Processing. Their core approach to natural language processing revolves around Transformers. This Python-based library exposes an API for many well-known architectures that achieve state-of-the-art results on NLP tasks such as text classification, information extraction, question answering, and text generation. All of the architectures come with pre-trained deep-learning weights, which makes such tasks easy to get started with. These transformer models come in different shapes and sizes, and each has its own way of tokenizing input data: a tokenizer takes an input word and encodes it as a number, which allows faster processing.
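
As a quick illustration of this encoding step, here is a minimal sketch using the GPT-2 tokenizer from transformers (the sample sentence is arbitrary):

    from transformers import GPT2Tokenizer

    # Load the pre-trained GPT-2 tokenizer (vocabulary files are downloaded on first use).
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    # Encode a sample sentence into token ids, the numbers the model actually processes.
    ids = tokenizer.encode("Textify turns words into numbers")
    print(ids)                    # a list of integer token ids
    print(tokenizer.decode(ids))  # decodes back to the original text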

Tokenizer in Python

In both text-processing parts the tokenizer plays a vital role. In Python, tokenization basically refers to splitting a larger body of text into smaller lines or words (or even creating words for a non-English language). The various tokenization functions are built into the nltk module itself and can be used in programs as shown below.
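
For example, a minimal sketch of the built-in nltk tokenizers (the sample text is arbitrary, and it assumes the 'punkt' tokenizer data has been downloaded as described in the setup steps below):

    from nltk.tokenize import sent_tokenize, word_tokenize

    text = "Textify summarises articles. It can also generate new text."

    # Split the text into sentences, then into individual words.
    print(sent_tokenize(text))   # ['Textify summarises articles.', 'It can also generate new text.']
    print(word_tokenize(text))   # ['Textify', 'summarises', 'articles', '.', 'It', ...]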

This project comprises three modules, namely:

  • Landing Page
  • Summarized Content as Output
  • Text Generation as Output

Run these commands in the Windows terminal:

Note: before running the text summarisation, run this command first:

pip install nlp nltk

For exporting and processing the data, run the following script in a new .py file before running the application:

   import nltk
   nltk.download('stopwords')   # stop-word list used during summarisation
   nltk.download('punkt')       # tokenizer data needed by word_tokenize and sent_tokenize

To run and initialize the application, there are two alternative methods:

  • Method 1 : run from the editor in a venv and view the localhost application in any browser at http://127.0.0.1:5000/
  • Method 2 : run from the command prompt, at the project's path location, using the following command:
 python __init__.py

Landing Page


Summarisation (Before Summarisation)


Output (Summarised content of Article)



For the Text Generation Part

Run these commands before running the text generation:

 pip install tensorflow
 pip install transformers
 pip3 install torch torchvision torchaudio

For Conda

conda install pytorch torchvision torchaudio cpuonly -c pytorch

Note : while running the text generator part, the required files (i.e. the GPT-2 model) are downloaded automatically.

Some terms and their meaning in the project (see the sketch after this list):

  • max_length : the maximum number of tokens to generate, i.e. how much text you want to see in the output.
  • input_ids : indices of the input sequence tokens in the vocabulary.
  • pad_token_id : the token id used for padding; GPT-2 has no padding token by default, so it is usually set to the end-of-sequence token id.
  • num_beams : the beam search width used to find the next appropriate words in the sequence.
  • no_repeat_ngram_size : stops the model repeating certain sequences over and over again (basically it prevents repeated words or phrases).
  • early_stopping : stops beam search early once enough complete candidate sequences are found, instead of continuing to generate.
  • skip_special_tokens : should always be True because we want the returned sentences to contain only words, not end-of-sentence or other special tokens.
  • return_tensors : 'pt' refers to PyTorch tensors.
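
A minimal sketch of how these parameters fit together in a GPT-2 generation call (the prompt and the exact parameter values are illustrative, not necessarily the ones used in the project):

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')   # downloads the GPT-2 weights on first run

    # Encode the user's prompt into input_ids as PyTorch tensors.
    input_ids = tokenizer.encode("The future of text preprocessing", return_tensors='pt')

    # Beam search generation using the parameters described above.
    output = model.generate(
        input_ids,
        max_length=50,                        # generate at most 50 tokens
        num_beams=5,                          # beam search width
        no_repeat_ngram_size=2,               # never repeat the same 2-gram
        early_stopping=True,                  # stop once the beams are finished
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
    )

    # Decode the best sequence, dropping special tokens so only words are returned.
    print(tokenizer.decode(output[0], skip_special_tokens=True))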

Text-Generator (landing page)


Text-Generator (Output)



This concludes the project: the output is generated by matching the keywords the user has entered.

