
Journey of Machine Learning and Deep Learning


Books & Resources
1. Natural Language Processing with Python
2. Practical Natural Language Processing
3. Fast AI NLP Course
Projects and Notebooks
1. Named Entity Recognition using spaCy

Day1 of MachineLearningDeepLearning

  • Natural Language Processing: It is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. In my journey of MachineLearningDeepLearning, I am brushing up on my NLP skills. Today I got an overview of different text preprocessing steps such as Tokenization, Stopword Removal, Stemming, and Lemmatization, and I implemented a few of them. I hope you will gain some insights and hope you will also spend time learning the topics. Excited about the days ahead!
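
A minimal sketch of these steps with NLTK, assuming the required NLTK data packages (punkt, stopwords, wordnet) have been downloaded:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Requires: nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")
text = "The cats were running faster than the dogs."
tokens = nltk.word_tokenize(text)                        # tokenization
stops = set(stopwords.words("english"))
content = [t for t in tokens if t.lower() not in stops]  # stopword removal
print([PorterStemmer().stem(t) for t in content])        # stemming: "running" -> "run"
print([WordNetLemmatizer().lemmatize(t) for t in content])  # lemmatization: "cats" -> "cat"
```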

Day2 of MachineLearningDeepLearning

  • Text Preprocessing: It is a crucial step that involves cleaning and transforming raw text data into a format that can be easily analyzed and understood by machine learning models. Today I learned more about preprocessing and representation steps like One Hot Encoding (OHE), Bag of Words, and N-grams, and implemented them in code. Here, I have shared notes about text representation techniques in the snapshot and hope you will also spend time learning the topics. Excited about the days ahead!
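
A minimal sketch of Bag of Words and bigram counts with scikit-learn; the toy documents are my own illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog bites the man", "the man bites the dog"]

bow = CountVectorizer()                        # unigram Bag of Words
print(bow.fit_transform(docs).toarray())       # term-count matrix
print(bow.get_feature_names_out())

bigrams = CountVectorizer(ngram_range=(2, 2))  # bigrams keep some word order
print(bigrams.fit_transform(docs).toarray())
print(bigrams.get_feature_names_out())
```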


Day3 of MachineLearningDeepLearning

  • Text Representation: It is the process of converting raw text data into a structured format that can be analyzed and processed by machine learning algorithms; the goal is to capture the meaning and structure of the text in a way that enables the ML algorithm to make accurate predictions or classifications. Today I learned about One Hot Encoding (OHE), TF-IDF, Word Embeddings, Word2Vec, CBOW, and Skip-gram, implemented them in code, and explored the Gensim library. Here, I have shared the notes about text representation techniques in the snapshot and hope you will also spend time learning the topics. Excited about the days ahead!
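
A minimal sketch of TF-IDF with scikit-learn and Word2Vec with Gensim; the toy corpus and hyperparameters are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

docs = ["dog bites man", "man bites dog", "dog eats meat"]

tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray())  # each document as a TF-IDF weighted vector

sentences = [d.split() for d in docs]
# sg=1 trains Skip-gram; sg=0 (the default) trains CBOW
w2v = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(w2v.wv["dog"][:5])                    # first values of the dense embedding for "dog"
```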


Day4 of MachineLearningDeepLearning

  • Part of Speech Tagging: POS Tagging is the process of assigning grammatical information to words in a sentence, such as nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections. Its purpose is to analyze the text to understand the meaning of words and the relationships between them in given sentences or texts. Today I learned about POS Tagging, Emission Probability, Transition Probability, Hidden Markov Models, and the Viterbi Algorithm, and revised the previous topics I had covered. I hope you will gain some insights and hope you will also spend time learning the topics. Excited about the days ahead!
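
A minimal sketch of POS tagging with NLTK's off-the-shelf tagger (the HMM/Viterbi machinery discussed above sits behind this one call), assuming the punkt and averaged_perceptron_tagger data are downloaded:

```python
import nltk

# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))  # e.g., [('The', 'DT'), ('quick', 'JJ'), ...]
```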

Day5 of MachineLearningDeepLearning

  • Recurrent Neural Network: A recurrent neural network (RNN) is a type of artificial neural network designed to process sequential data, where the output at the current time step depends not only on the current input but also on previous inputs. RNNs can be used for a variety of tasks, including language modeling, speech recognition, and image captioning. Today I learned about types of RNNs, forward and backward propagation in RNNs, the LSTM RNN and its architecture, and a few more topics. I hope you will gain some insights and hope you will also spend time learning the topics. Excited about the days ahead!
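
A minimal sketch of an LSTM-based classifier in Keras; the vocabulary size and the binary output layer are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 10_000  # hypothetical vocabulary size
model = tf.keras.Sequential([
    layers.Embedding(vocab_size, 64),       # map token ids to dense vectors
    layers.LSTM(64),                        # recurrent layer carries state across time steps
    layers.Dense(1, activation="sigmoid"),  # e.g., a binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```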

Day6 of MachineLearningDeepLearning

  • Sequence to Sequence Learning: A Seq2Seq model is a type of neural network architecture used for tasks involving sequential data, like machine translation, text summarization, etc. The major components of a Seq2Seq model are the encoder and the decoder. Today I learned about Seq2Seq Learning, the encoder, the decoder, and the problems with the encoder-decoder setup, as well as their solutions. I also read two research papers today; a minimal sketch of the architecture follows below. I hope you will gain some insights and hope you will also spend time learning the topics. Excited about the days ahead!
  • Paper:
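
A minimal sketch of an encoder-decoder (Seq2Seq) model in Keras; the vocabulary sizes and dimensions are illustrative assumptions for a translation-style task:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

src_vocab, tgt_vocab, dim = 5000, 5000, 128  # hypothetical sizes

# Encoder: embeds the source sequence and compresses it into a final state.
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, dim)(enc_in)
_, state_h, state_c = layers.LSTM(dim, return_state=True)(enc_emb)

# Decoder: generates the target sequence, initialized with the encoder's state.
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, dim)(dec_in)
dec_out, _, _ = layers.LSTM(dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
out = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```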


Day7 of MachineLearningDeepLearning

  • "Attention Is All You Need" is a newer approach that has shown promising results in machine translation and other natural language processing tasks. It relies solely on self-attention mechanisms and does not use any recurrent or convolutional neural networks. This approach has several advantages, such as improved parallelism and reduced computation time, and has achieved state-of-the-art results on several benchmarks. Today, I acquired knowledge of the encoder, the decoder, Seq2Seq Learning, self-attention, the embedding layer, and positional encoding. Additionally, I read the paper "Attention Is All You Need", and although I struggled to comprehend most of it, I gained a theoretical understanding of its content. I am planning to apply this acquired knowledge in code tomorrow; a minimal sketch of the core attention computation follows below. Excited about the days ahead!
  • Reference:
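
A minimal sketch of scaled dot-product attention, the building block of the paper, written in PyTorch; the tensor shapes are illustrative:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # query-key similarities, scaled
    weights = torch.softmax(scores, dim=-1)            # attention distribution over positions
    return weights @ v                                 # weighted sum of the values

q = k = v = torch.randn(2, 5, 64)  # (batch, seq_len, d_k); q = k = v is the self-attention case
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 5, 64])
```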


Day8 of MachineLearningDeepLearning

  • The Transformer model consists of an encoder and a decoder, both of which use multi-head self-attention layers and feedforward neural networks. The Transformer architecture relies exclusively on self-attention mechanisms to process input sequences and produce output sequences, without any recurrent or convolutional layers. Today I revised all the previous topics I had covered. Here, I have presented an implementation of Transformers from scratch in PyTorch; a minimal sketch using PyTorch's built-in module follows below. PS: I usually work with TensorFlow, so I'm not very familiar with PyTorch. However, today I enjoyed writing code with PyTorch. Excited about the days ahead!
  • Reference:
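
For comparison with the from-scratch version, a minimal sketch using PyTorch's built-in nn.Transformer; the dimensions and dummy tensors are assumptions for illustration:

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=2, num_decoder_layers=2)
src = torch.randn(10, 32, 512)  # (source_len, batch, d_model)
tgt = torch.randn(20, 32, 512)  # (target_len, batch, d_model)
out = model(src, tgt)           # full encoder-decoder forward pass
print(out.shape)                # torch.Size([20, 32, 512])
```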


Day9 of MachineLearningDeepLearning

  • Natural Language Processing typically uses large bodies of linguistic data, or corpora. A text corpus is a large body of text. In my MachineLearningDeepLearning journey, today I started reading the book Natural Language Processing with Python, where I learned the basics of NLP and explored the Gutenberg Corpus, Brown Corpus, Reuters Corpus, Inaugural Address Corpus, and corpora in other languages, as well as the NLTK library. Here, I have shown how to access corpora using NLTK in a simple way. I hope you will gain some insights and hope you will also spend time learning the topics. Excited about the days ahead!
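
A minimal sketch of accessing two of these corpora through NLTK, assuming the corpus data has been downloaded:

```python
import nltk

# Requires: nltk.download("gutenberg"); nltk.download("brown")
from nltk.corpus import gutenberg, brown

print(gutenberg.fileids()[:3])                  # e.g., ['austen-emma.txt', ...]
print(len(gutenberg.words("austen-emma.txt")))  # word tokens in Austen's Emma

print(brown.categories()[:5])                   # the Brown Corpus is organized by genre
print(brown.words(categories="news")[:10])
```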


Day10 of MachineLearningDeepLearning

  • The most important source of texts is undoubtedly the Web. It's convenient to have existing text collections to explore. On my journey of MachineLearningDeepLearning, today I learned about WordNet, the WordNet hierarchy, lexical relations, semantic similarity, and ways to process raw text (dealing with HTML, processing search engine results, processing RSS feeds). Here, I have presented the ways to process raw text from the web. I hope you will gain some insights and hope you will also spend time learning the topics. Excited about the days ahead!
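
A minimal sketch of a WordNet lookup and of pulling raw text out of a web page; the URL is a stand-in:

```python
from urllib.request import urlopen

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import wordnet as wn

# Requires: nltk.download("wordnet")
print(wn.synsets("motorcar"))             # [Synset('car.n.01')]
print(wn.synset("car.n.01").hypernyms())  # climb one level up the WordNet hierarchy

html = urlopen("https://www.example.com").read().decode("utf-8")
text = BeautifulSoup(html, "html.parser").get_text()  # strip the HTML markup
print(text[:80])
```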


Day11 of MachineLearningDeepLearning

  • Text in files will be in a particular encoding, so we need some mechanism for translating it into Unicode; translation into Unicode is called decoding. Conversely, to write out Unicode to a file or a terminal, we need to translate it into a suitable encoding; this is called encoding. On my journey of MachineLearningDeepLearning, today I learned about text processing with Unicode and Regular Expressions for detecting word patterns (basic metacharacters, finding word stems). Here, I have presented how to use regular expressions to identify patterns in the snapshot. I hope you will gain some insights and hope you will spend time learning the topics. Excited about the days ahead!
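
A minimal sketch of detecting word patterns with regular expressions, in the style of the book's word-list examples (assumes the NLTK words corpus is downloaded):

```python
import re

import nltk

# Requires: nltk.download("words")
wordlist = [w for w in nltk.corpus.words.words("en") if w.islower()]

print([w for w in wordlist if re.search(r"ed$", w)][:5])         # words ending in "ed"
print([w for w in wordlist if re.search(r"^..j..t..$", w)][:5])  # wildcard "." matches any character
print(re.findall(r"^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$", "processing"))  # crude stem/suffix split
```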


Day12 of MachineLearningDeepLearning

  • Tokenization is the segmentation of a text into basic units or tokens such as words and punctuation. Tokenization based on whitespace is inadequate for many applications because it bundles punctuation together with words. NLTK provides an off-the-shelf tokenizer nltk.word_tokenize().
  • Lemmatization is a process that maps the various forms of a word (such as appeared, appears) to the canonical or citation form of the word, also known as the lexeme or lemma (e.g., appear).
  • Regular expressions are a powerful and flexible method of specifying patterns. On my journey of MachineLearningDeepLearning, today I continued exploring regular expressions: extracting word pieces, finding word stems, searching tokenized text, and normalizing text (stemming and lemmatization). I also read about tokenization, which is an instance of the more general problem of segmentation. Here, I have presented the implementation of tokenization, stemming, and more in the below snapshot. I hope you will gain some insights and hope you will spend time learning the topics. Excited about the days ahead!
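
A minimal sketch of regexp-based tokenization and word-piece extraction, following the style of the book's examples:

```python
import re

import nltk

raw = "That U.S.A. poster-print costs $12.40..."
pattern = r"""(?x)           # verbose regex: whitespace and comments are ignored
    (?:[A-Z]\.)+             # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*             # words, with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?       # currency amounts and percentages
  | \.\.\.                   # ellipsis
"""
print(nltk.regexp_tokenize(raw, pattern))        # ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
print(re.findall(r"[aeiou]{2,}", "sequential"))  # extract word pieces: vowel sequences
```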


Day13 of MachineLearningDeepLearning

  • Part of Speech Tagging: The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tag set. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word. On my journey of MachineLearningDeepLearning, today I learned about ways of automatic tagging such as the Default Tagger, Regular Expression Tagger, and Lookup Tagger, and about N-Gram Tagging (Unigram and Bigram Tagging), and also explored tagged corpora. Here, I have presented the implementation of the Default Tagger, Regular Expression Tagger, and Lookup Tagger, as well as the Unigram and Bigram Taggers, in the below snapshot. I hope you will gain some insights and hope you will spend time learning the topics. Excited about the days ahead!
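
A minimal sketch of a backoff chain of these taggers, trained on the Brown Corpus news category (assumes the corpus is downloaded; .accuracy() is the current name for .evaluate() in recent NLTK versions):

```python
import nltk

# Requires: nltk.download("brown")
from nltk.corpus import brown

tagged = brown.tagged_sents(categories="news")
train, test = tagged[:4000], tagged[4000:]

default = nltk.DefaultTagger("NN")                    # fallback: tag everything as a noun
unigram = nltk.UnigramTagger(train, backoff=default)  # most likely tag for each word
bigram = nltk.BigramTagger(train, backoff=unigram)    # condition on the previous word too

print(bigram.accuracy(test))                # the backoff chain improves coverage
print(bigram.tag("the jury said".split()))
```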


Day14 of MachineLearningDeepLearning

  • Supervised Classification: Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. A classifier is called supervised if it is built based on training corpora containing the correct label for each input. On my journey of MachineLearningDeepLearning, today I learned about classification, ways of choosing the right features, overfitting and underfitting, document classification, Naive Bayes classification, and more. Here, I have presented the implementation of document classification and sentence segmentation in the below snapshot. I hope you will gain some insights and hope you will spend time learning the topics. Excited about the days ahead!
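
A minimal sketch of supervised document classification with NLTK's Naive Bayes classifier, following the book's movie-review example (assumes the movie_reviews corpus is downloaded):

```python
import random

import nltk

# Requires: nltk.download("movie_reviews")
from nltk.corpus import movie_reviews

docs = [(list(movie_reviews.words(fid)), cat)
        for cat in movie_reviews.categories()
        for fid in movie_reviews.fileids(cat)]
random.shuffle(docs)

# Use the 2,000 most frequent words as binary "contains(word)" features.
top_words = [w for w, _ in nltk.FreqDist(w.lower() for w in movie_reviews.words()).most_common(2000)]

def doc_features(doc):
    words = set(doc)
    return {f"contains({w})": (w in words) for w in top_words}

featuresets = [(doc_features(d), c) for d, c in docs]
train, test = featuresets[100:], featuresets[:100]

classifier = nltk.NaiveBayesClassifier.train(train)
print(nltk.classify.accuracy(classifier, test))
classifier.show_most_informative_features(5)
```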


Day15 of MachineLearningDeepLearning

  • In Natural Language Processing, a pipeline refers to a sequence of processing steps that transform raw text input into a form that is useful for a specific task, such as sentiment analysis, text classification, or named entity recognition. The NLP pipeline typically involves several stages, including tokenization, part-of-speech (POS) tagging, parsing, semantic analysis, and machine learning or deep learning algorithms. On my journey of MachineLearningDeepLearning, today I started reading the book Practical Natural Language Processing, where I learned the generic pipeline for data-driven NLP system development. I explored the ways of data acquisition, text extraction and cleanup, HTML parsing and cleanup, Unicode normalization, spelling correction, and system-specific error correction. Here, I have presented the implementation of HTML parsing and cleanup, Unicode normalization, and a few more in the below snapshot. I hope you will gain some insights and hope you will spend time learning the topics. Excited about the days ahead!
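
A minimal sketch of the HTML cleanup and Unicode normalization steps; the HTML snippet is an assumption for illustration:

```python
import unicodedata

from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p>Some &amp; more caf\u00e9 text</p></body></html>"
text = BeautifulSoup(html, "html.parser").get_text(separator=" ")  # strip the markup

# NFKD decomposes accented characters so they can be folded to plain ASCII.
ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
print(text)        # Title Some & more café text
print(ascii_text)  # Title Some & more cafe text
```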


Day16 of MachineLearningDeepLearning

  • Feature Engineering: Feature engineering, also known as feature extraction, encompasses a variety of techniques for transforming textual attributes into a numerical vector that can be comprehended by machine learning algorithms. Two different approaches are taken in practice: feature engineering for a) a classical NLP and traditional ML pipeline, and b) a DL pipeline. On my journey of MachineLearningDeepLearning, today I completed reading chapter 2 of the book Practical Natural Language Processing, where I learned about word tokenization, stemming and lemmatization, code mixing, and transliteration. I also explored feature engineering for classical NLP and deep learning based NLP. Here, I have presented the implementation of word tokenization, the removal of stop words and digits, and a few more. I hope you will gain some insights and hope you will spend time learning the topics. Excited about the days ahead!
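
A minimal sketch of the stopword and digit removal step; the example sentence is illustrative:

```python
import nltk
from nltk.corpus import stopwords

# Requires: nltk.download("punkt"); nltk.download("stopwords")
text = "In 2023, the model achieved 95 percent accuracy on 3 benchmarks."
stops = set(stopwords.words("english"))
tokens = nltk.word_tokenize(text.lower())
cleaned = [t for t in tokens if t.isalpha() and t not in stops]  # drops digits, punctuation, stopwords
print(cleaned)  # ['model', 'achieved', 'percent', 'accuracy', 'benchmarks']
```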


Day17 of MachineLearningDeepLearning

  • Text Classification: Text classification is the task of assigning one or more categories to a given piece of text from a larger set of possible categories. It can be used to organize, structure, and categorize text data from various sources, such as emails, documents, and social media. Some common applications of text classification are sentiment analysis, topic labeling, spam detection, and intent detection. On my journey of MachineLearningDeepLearning, today I learned about the pipeline for building text classification systems. Here, I have presented the implementation of text classification on the Economic News dataset: I applied some pre-processing techniques, applied CountVectorizer to transform the text documents into a matrix of token counts, and implemented the Naive Bayes, Logistic Regression, and Support Vector Machine algorithms in the given snapshot. I hope you will gain some insights and hope you will spend time learning the topics. Excited about the days ahead!

    One typically follows these steps when building a text classification system (a minimal sketch follows the steps):
    1. Collect or create a labeled dataset suitable for the task.
    2. Split the dataset into two (training and test) or three parts: training, validation (i.e., development), and test sets, then decide on evaluation metric(s).
    3. Transform raw text into feature vectors.
    4. Train a classifier using the feature vectors and the corresponding labels from the training set.
    5. Using the evaluation metric(s) from Step 2, benchmark the model performance on the test set.
    6. Deploy the model to serve the real-world use case and monitor its performance.

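A minimal sketch of steps 3-5 with scikit-learn, using CountVectorizer features and a Naive Bayes classifier; the toy texts and labels stand in for the Economic News dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

texts = ["stocks rally on strong earnings", "team wins the championship",
         "markets fall on rate fears", "player signs a record contract"]
labels = [1, 0, 1, 0]  # 1 = economic news (hypothetical labels)

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.5, random_state=0)

vec = CountVectorizer()                                         # Step 3: text -> token-count vectors
clf = MultinomialNB().fit(vec.fit_transform(X_train), y_train)  # Step 4: train the classifier
preds = clf.predict(vec.transform(X_test))                      # Step 5: benchmark on the test set
print(accuracy_score(y_test, preds))
```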

Day18 of MachineLearningDeepLearning

  • Information Extraction: Information Extraction refers to the NLP task of extracting relevant information from text documents. It is used in a wide range of real-world applications, from news articles to social media, and even receipts. The overarching goal of IE is to extract 'knowledge' from the text, and each of the IE tasks provides different information to do that. IE tasks require deeper NLP pre-processing followed by models developed for those specific tasks, and they are typically evaluated in terms of precision, recall, and F1 scores using standard evaluation sets. Today, as I continue my journey of MachineLearningDeepLearning, I explored the topic of Information Extraction.
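
A minimal sketch of how such outputs are scored with precision, recall, and F1; the gold and predicted entity labels here are hypothetical:

```python
from sklearn.metrics import precision_recall_fscore_support

gold = ["PER", "ORG", "O", "LOC", "O", "ORG"]  # reference labels
pred = ["PER", "O",   "O", "LOC", "O", "ORG"]  # a system's predictions
p, r, f1, _ = precision_recall_fscore_support(gold, pred, average="micro")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```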


Day19 of MachineLearningDeepLearning

  • Named Entity Recognition: Named Entity Recognition is a sub-task of information extraction. It deals with finding and classifying named entities mentioned in unstructured text. These entities are classified into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages. NER is used in a variety of applications, such as information retrieval, question answering, and text summarization, among others. On my journey of Machine Learning and Deep Learning, today I learned about Named Entity Recognition with spaCy using Python and Language Processing Pipelines using spaCy. I have presented the implementation of Named Entity Recognition in a document using the spaCy large model in the below snapshot. I hope you will gain some insights and hope you will spend time learning the topics. Excited about the days ahead!
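
A minimal sketch of NER with spaCy's large English model, assuming en_core_web_lg has been downloaded (python -m spacy download en_core_web_lg):

```python
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g., Apple ORG / U.K. GPE / $1 billion MONEY
```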


Day20 of MachineLearningDeepLearning

  • Topic Modeling: Topic modeling is a technique used to address the problem of finding latent topics in a large collection of documents. It involves identifying the underlying themes or concepts that pervade a collection of texts and grouping them into categories. This can be useful for tasks like document classification, information retrieval, and recommendation systems. Today, I learned about Topic Modeling, Singular Value Decomposition, Non-negative Matrix Factorization, Stemming, and Lemmatization. I have presented the implementation of Topic Modeling below in the snapshot. I hope you will gain some insights and hope you will spend time learning the topics. Excited about the days ahead.
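
A minimal sketch of topic modeling with TF-IDF features and NMF in scikit-learn; the toy corpus and the choice of two topics are assumptions:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stocks rose as markets rallied", "investors bought shares and bonds"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

nmf = NMF(n_components=2, random_state=0).fit(X)    # factorize into 2 latent topics
terms = tfidf.get_feature_names_out()
for i, topic in enumerate(nmf.components_):
    top = [terms[j] for j in topic.argsort()[-4:]]  # highest-weighted words per topic
    print(f"Topic {i}: {top}")
```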


Day21 of MachineLearningDeepLearning

  • Pretraining RoBERTa Model: The initial BERT models brought innovative features to the original Transformer models, whereas RoBERTa increases the performance of transformers on downstream tasks by improving the mechanics of the pretraining process. KantaiBERT is a Robustly Optimized BERT Pretraining Approach (RoBERTa)-like model based on the architecture of BERT. Today, I learned about the RoBERTa model and tried to build the KantaiBERT model from scratch while exploring Transformers. To pretrain the model, I trained a tokenizer on the dataset, saved the tokenizer, created a customized dataset, trained the RoBERTa model, and saved it. I have presented the implementation of pretraining the KantaiBERT model from scratch below in the snapshot. I hope you will gain some insights and hope you will spend time learning the topics. Excited about the days ahead.
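
A minimal sketch of the tokenizer-training step with Hugging Face's tokenizers library; the input file name and hyperparameters are assumptions for illustration:

```python
import os

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["kant.txt"],  # hypothetical training corpus
                vocab_size=52_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

os.makedirs("KantaiBERT", exist_ok=True)
tokenizer.save_model("KantaiBERT")   # writes vocab.json and merges.txt
```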

