NLP and Web Scraping Lab by Omar Nouih

This repository demonstrates web scraping techniques and a Natural Language Processing (NLP) pipeline for Arabic-language text. The pipeline covers text cleaning, tokenization, stop word removal, discretization, normalization, stemming, lemmatization, part-of-speech tagging, and Named Entity Recognition (NER).

Objective
Tasks
Libraries Used
Web Scraping
Storing Raw Data
NLP Pipeline

Objective

The main objective of this project is to demonstrate proficiency in web scraping and NLP techniques for Arabic language text. By scraping data from a website related to census processes in Morocco and applying various NLP tasks, we aim to showcase skills in data acquisition, preprocessing, and analysis specific to the Arabic language.

Tasks

Scraping www.candidature-recensement.ma to extract information about census conditions, steps, compensation, tasks, and frequently asked questions.
Storing the raw data in MongoDB for further processing and retrieval.
Preprocessing the Arabic text through tokenization, stop word removal, normalization, stemming, lemmatization, and other techniques.
Performing part-of-speech tagging and Named Entity Recognition (NER) on the Arabic text using the Farasa library.

Libraries Used

requests: For sending HTTP requests to the website and fetching the HTML content.
Beautiful Soup: For parsing the HTML content and extracting relevant data from the website.
pymongo: For interacting with the MongoDB database to store and retrieve the scraped data.
nltk: For general NLP tasks such as tokenization, stop word removal, and stemming.
qalsadi: For Arabic lemmatization.
farasa: For Arabic language processing tasks including parts-of-speech tagging and Named Entity Recognition (NER).

Web Scraping

We used requests and Beautiful Soup in Python to scrape the www.candidature-recensement.ma website. The scraped data covers census conditions, steps, compensation, tasks, and frequently asked questions.
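
As a rough illustration of this step, the sketch below fetches a page with requests and groups its text by heading with Beautiful Soup. The https scheme, the assumption that each topic sits under an h2 heading, and the function names fetch_html and extract_sections are all hypothetical; the real page layout may differ and the selectors would need adjusting.

```python
from bs4 import BeautifulSoup

# Assumption: the site is served over HTTPS at this host.
URL = "https://www.candidature-recensement.ma"


def fetch_html(url: str = URL, timeout: float = 10.0) -> str:
    """Download a page and return its HTML as text."""
    import requests  # imported here so the parser below also works offline

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


def extract_sections(html: str) -> list[dict[str, str]]:
    """Group page text by <h2> heading (hypothetical page structure)."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for heading in soup.find_all("h2"):
        parts = []
        # Collect the tags that follow this heading, up to the next heading.
        for sibling in heading.find_next_siblings(True):
            if sibling.name == "h2":
                break
            text = sibling.get_text(" ", strip=True)
            if text:
                parts.append(text)
        sections.append(
            {"title": heading.get_text(" ", strip=True), "text": "\n".join(parts)}
        )
    return sections
```

Calling extract_sections(fetch_html()) would return one dictionary per heading, which is a convenient shape for storing the raw text later.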

Storing Raw Data

We stored the raw scraped data in the "atelier" collection of a MongoDB database named "NLP". Keeping the unprocessed text makes it easy to retrieve and reprocess later.
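
A minimal sketch of the storage step with pymongo follows. The database and collection names come from the description above; the local connection URI, the document fields (title, text, source), and the helper names are assumptions made for illustration only.

```python
DB_NAME = "NLP"
COLLECTION_NAME = "atelier"


def get_collection(uri: str = "mongodb://localhost:27017"):
    """Return the target collection. The URI is an assumed local default."""
    from pymongo import MongoClient  # imported here so the helpers work without a server

    client = MongoClient(uri)
    return client[DB_NAME][COLLECTION_NAME]


def to_documents(sections: list[dict], source_url: str) -> list[dict]:
    """Turn scraped sections (dicts with 'title' and 'text') into raw documents."""
    return [
        {"title": s["title"], "text": s["text"], "source": source_url}
        for s in sections
    ]


def store_raw(collection, documents: list[dict]) -> int:
    """Insert the documents and return how many were written."""
    if not documents:
        return 0  # insert_many rejects an empty list, so skip it
    result = collection.insert_many(documents)
    return len(result.inserted_ids)
```

With a MongoDB server running, store_raw(get_collection(), docs) writes the documents, and get_collection().find() reads them back for the preprocessing stage.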

NLP Pipeline

The NLP pipeline applies several preprocessing steps to the scraped Arabic text to prepare it for analysis: tokenization, stop word removal, normalization, stemming, and lemmatization. We then performed part-of-speech tagging and Named Entity Recognition (NER) using the Farasa library, which is designed specifically for Arabic language processing.
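
The sketch below shows one way these stages could be chained. It is not the repository's actual code: the normalization rules, the regex tokenizer, and the five-word stop list are simplified stand-ins chosen for illustration. Stemming uses NLTK's ISRIStemmer, and the tagging helper assumes the farasapy Python wrapper for Farasa, which needs Java available.

```python
import re

# Arabic diacritics (fathatan through sukun) plus the tatweel elongation mark.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Runs of Arabic letters; punctuation and digits are dropped.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")


def normalize(text: str) -> str:
    """Strip diacritics and unify common letter variants (a simplified rule set)."""
    text = DIACRITICS.sub("", text)
    text = re.sub("[أإآ]", "ا", text)  # hamza-carrying alef forms -> bare alef
    return text.replace("ى", "ي")      # alef maqsura -> ya


# Tiny illustrative stop list, normalized the same way as the input text.
STOP_WORDS = {normalize(w) for w in ("في", "من", "على", "إلى", "عن")}


def preprocess(text: str) -> list[str]:
    """Normalize, tokenize, and remove stop words."""
    tokens = ARABIC_WORD.findall(normalize(text))
    return [t for t in tokens if t not in STOP_WORDS]


def stem_tokens(tokens: list[str]) -> list[str]:
    """Reduce tokens to root-like stems with NLTK's ISRI stemmer."""
    from nltk.stem.isri import ISRIStemmer

    stemmer = ISRIStemmer()
    return [stemmer.stem(t) for t in tokens]


def tag_with_farasa(text: str):
    """POS tags and named entities via farasapy (assumed wrapper, requires Java)."""
    from farasa.pos import FarasaPOSTagger
    from farasa.ner import FarasaNamedEntityRecognizer

    return FarasaPOSTagger().tag(text), FarasaNamedEntityRecognizer().recognize(text)
```

Normalizing the stop list with the same function as the input matters here: without it, a word such as "إلى" would no longer match its stop-list entry once its letters have been unified.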
