witpy

witpy is a Python library that contains functions to parse Wikipedia XML. The library mainly focuses on extracting data from Revision pages in JSON format.

USE:

Change to the witpy/dist directory and run the following command to install the library:

  pip install witpy-1.0.1-py3-none-any.whl

To install all the dependencies for the library, run the following command in a terminal:

  pip install -r requirements.txt

Set up a local MongoDB database

For more details, read the MongoDB Setup section below.

Import the library with the following statements.

To import the database-storing functions:

  from witpy.revision_parser import parser

To import the information-fetching functions:

  from witpy.fetch_details import *

Retrieving Functions

get_users

The function takes the XML file name, without the '.xml' extension, and returns a dictionary containing all the users along with their number of comments.

max_user

The function takes the XML file name, without the '.xml' extension, and returns a tuple containing the name of the user with the most comments and that user's number of comments.

min_user

The function takes the XML file name, without the '.xml' extension, and returns a tuple containing the name of the user with the fewest comments and that user's number of comments.

user_comments

The function takes the XML file name, without the '.xml' extension, and returns a dictionary with users as the keys and a list of each user's comments as the corresponding value.
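The relationship between these four functions can be illustrated with plain dictionaries. The following is a minimal sketch, not witpy's implementation: the sample data and the helper names count_comments, most_active and least_active are hypothetical, and only the shapes of the return values follow the descriptions above.

```python
# Hypothetical sample shaped like the user_comments result described above:
# user name -> list of that user's comments.
comments_by_user = {
    "Alice": ["First comment", "Second comment", "Third comment"],
    "Bob": ["Only comment"],
    "Carol": ["One", "Two"],
}


def count_comments(comments_by_user):
    """Return a dict of user -> number of comments (the get_users shape)."""
    return {user: len(comments) for user, comments in comments_by_user.items()}


def most_active(counts):
    """Return a (user, count) tuple for the user with the most comments."""
    return max(counts.items(), key=lambda item: item[1])


def least_active(counts):
    """Return a (user, count) tuple for the user with the fewest comments."""
    return min(counts.items(), key=lambda item: item[1])


counts = count_comments(comments_by_user)
print(counts)
print(most_active(counts))
print(least_active(counts))
```

When several users share the same count, max and min return the first one encountered; how witpy breaks such ties is not stated here.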

user_sentiment_analyzer

The function takes the XML file name, without the '.xml' extension, as the first argument. The second argument is optional: provide a user's name if you want sentiment analysis only for that user. Otherwise the function returns a dictionary containing all the users along with their rating scores.

document_sentiment_analyzer

The function takes the XML file name, without the '.xml' extension, and returns the sentiment analysis score for the whole document.

get_all_sections

The function takes the XML file name, without the '.xml' extension, and returns a dictionary with section names as keys and the list of comments in each section as the corresponding value.

plotSectionComments

The function takes the XML file name, without the '.xml' extension, and plots the number of comments against the section name.

plotSectionCommentsStatistics

The function takes the XML file name, without the '.xml' extension, and outputs box plots showing statistics of the comments under all sections.

plotUserSentiments

The function takes the XML file name, without the '.xml' extension, and plots user sentiment data.

For revision pages:

The revision XML file contains revision sections. Each revision section has a contributor and a publishing date. Under each section there are replies posted by other users, sorted by posting date.

The Revision History Parser takes the XML file and separates the different sections using Python's built-in xml module. By detecting the user tag it finds the contributor for a particular section, and it finds the publishing date, section-id and parent-id in the same way. The parser also handles the replies under that section: it extracts the username, reply text and comment time of each reply.
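The tag-based extraction described above can be sketched with the standard library's xml.etree.ElementTree. This is an illustration under stated assumptions rather than the parser's actual code: the sample XML, its tag names (revision, id, parentid, timestamp, contributor/username, text) and the function name parse_revisions are all hypothetical, and a real Wikipedia export also carries an XML namespace that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified revision XML (no namespace, two sections only).
SAMPLE_XML = """
<page>
  <revision>
    <id>101</id>
    <parentid>100</parentid>
    <timestamp>2020-05-01T10:00:00Z</timestamp>
    <contributor><username>Alice</username></contributor>
    <text>Opening comment</text>
  </revision>
  <revision>
    <id>102</id>
    <parentid>101</parentid>
    <timestamp>2020-05-02T12:30:00Z</timestamp>
    <contributor><username>Bob</username></contributor>
    <text>A reply</text>
  </revision>
</page>
"""


def parse_revisions(xml_text):
    """Return one dict per <revision> element with its key fields."""
    root = ET.fromstring(xml_text.strip())
    sections = []
    for rev in root.iter("revision"):
        sections.append({
            "section_id": rev.findtext("id"),
            "parent_id": rev.findtext("parentid"),
            "timestamp": rev.findtext("timestamp"),
            "user": rev.findtext("contributor/username"),
            "text": rev.findtext("text", default=""),
        })
    return sections


sections = parse_revisions(SAMPLE_XML)
for section in sections:
    print(section["section_id"], section["user"], section["timestamp"])
```

findtext returns None when a tag is missing, so a section without a parent-id would simply carry None in that field.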

The replies sometimes contain links to other content or resources on the internet. These links are enclosed in sequences of square brackets, curly braces and some special characters. Here, the parser uses mwparserfromhell to find the content written between curly braces. Then the parser extracts the useful data between the brackets, arranges it, and merges it into the final JSON.
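To show the kind of data this step pulls out, here is a rough stand-in that uses regular expressions instead of mwparserfromhell. It is a simplification, not the parser's method: it handles only non-nested [[...]] links and {{...}} templates, and the function name extract_references, the output keys and the sample reply are hypothetical.

```python
import re

# [[target]] or [[target|label]], without nested brackets.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")
# {{name|param|...}}, without nested braces.
TEMPLATE = re.compile(r"\{\{([^{}]+)\}\}")


def extract_references(reply_text):
    """Collect wiki links and templates found in a reply as plain dicts."""
    links = [
        {
            "target": m.group(1).strip(),
            # Fall back to the target when no explicit label is given.
            "label": (m.group(2) or m.group(1)).strip(),
        }
        for m in WIKILINK.finditer(reply_text)
    ]
    templates = []
    for m in TEMPLATE.finditer(reply_text):
        name, *params = [part.strip() for part in m.group(1).split("|")]
        templates.append({"name": name, "params": params})
    return {"links": links, "templates": templates}


reply = (
    "See [[Help:Contents|the help page]] and "
    "{{cite web|url=https://example.org|title=Example}}."
)
result = extract_references(reply)
print(result)
```

A dedicated wikitext parser is the better choice for real pages, since templates are often nested and a regular expression like this one skips the outer template in that case.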

For each full revision XML file, there is a dedicated database in the local cluster. After each revision section is parsed, its JSON is stored in that database as a separate collection.
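The one-database-per-file, one-collection-per-section layout could look roughly like the sketch below. It is written against the pymongo interface (indexing a database by name yields a collection, and a collection offers insert_one), but the function name store_sections, the "section_<id>" collection naming and the database name in the usage comment are assumptions, not witpy's actual scheme.

```python
def store_sections(db, sections):
    """Store each parsed section in its own collection of `db`.

    `db` is expected to behave like a pymongo Database: db[name] returns
    a collection object that provides insert_one(document).
    Returns the list of collection names that were written.
    """
    names = []
    for section in sections:
        # Assumed naming scheme: one collection per section id.
        name = f"section_{section['section_id']}"
        # Insert a copy, because pymongo adds an "_id" key to the dict it stores.
        db[name].insert_one(dict(section))
        names.append(name)
    return names


# Usage against a local MongoDB server (requires `pip install pymongo`):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   db = client["my_revision_file"]  # hypothetical name, one database per XML file
#   store_sections(db, [{"section_id": "101", "user": "Alice", "replies": []}])
```

Passing the database in as an argument keeps the function free of connection details, so it can be exercised with a stand-in object when no server is running.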

For Plots:

All the plots are made using Plotly, a Python library for creating interactive web plots.

MongoDB Setup

Installer Download

To install the Community Edition of MongoDB on your local machine, follow this link: [https://www.mongodb.com/try/download/community]

  • In the Version dropdown, choose the version of MongoDB to download. The current version is recommended.

  • In the Platform dropdown, select your OS.

  • In the Package dropdown, select msi.

  • Click on the Download button.

This downloads the MongoDB installer to your machine. Once it has downloaded, run it.

Installation

In the installation wizard:

  • Go through the End User Licence Agreement. To continue, agree to the terms and press Next.

  • For Setup type, if you are new to this, just proceed with Complete.

  • In Service Configuration, proceed with the default settings and make a note of the Data and Log directories.

  • In the Install MongoDB menu, there is a checkbox asking permission to download MongoDB Compass, the graphical UI for MongoDB. Check it if you want to download Compass too, then proceed.

This installs MongoDB on your machine.

  • To run the interactive shell, open a terminal and execute the following command (replace 4.4 with your installed version if it differs):
    "C:\Program Files\MongoDB\Server\4.4\bin\mongo.exe"

Contributors

descentis, aryan465, riteksaxena
