
mwparserfromhtml's Introduction

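from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Appledora:
    # Mutable defaults must go through default_factory; a bare list default raises ValueError.
    research: list[str] = field(default_factory=lambda: ["CL", "CV", "XAI"])
    interest: list[str] = field(default_factory=lambda: ["Astrophysics", "LifeScience", "Literature", "Popculture"])
    welcome: list[str] = field(default_factory=lambda: ["Collaboration", "Competitions"])
    funfact: Optional[str] = None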

My stacks

PyTorch, OpenCV, PySpark
Python, SPARQL, Java, Shell, JavaScript, C++
Android, React, Django
MySQL, Firebase

My stats

From: 08 December 2023 - To: 15 December 2023

Python   47 mins   100.00 %

My recent projects 🌼

  • mwtokenizer : a language-agnostic sentence-word tokenizer.
  • mwparserfromhtml : a Python library to parse and process Wikipedia HTML dumps.

This readme was made because I am bored and obsessed. If you liked my profile, please 🌟 the repo. To use this template, you can fork it and modify it ✨

mwparserfromhtml's People

Contributors

martingerlach

mwparserfromhtml's Issues

Determine the markers used for transcluded elements

Some transcluded links have no marker that lets us identify them; they have the same format as a standard WikiLink: <a href="./Dictionary_of_National_Biography" rel="mw:WikiLink" title="Dictionary of National Biography"> Dictionary of National Biography </a>

Corresponding article : William Clark

In the same article, <a class="mw-disambig" href="./William_Clark_(disambiguation)" rel="mw:WikiLink" title="William Clark (disambiguation)"> William Clark (disambiguation) </a> is both a disambiguation link and a transclusion. The class attribute mw-disambig helps us identify the disambiguation, but not the transclusion.

On closer inspection, it seems we need to look at the context in which the link is placed, for example:

<div about="#mwt1" class="hatnote navigation-not-searchable" id="mwAw" role="note">
    For other people named William Clark, see
    <a class="mw-disambig" href="./William_Clark_(disambiguation)" rel="mw:WikiLink" title="William Clark (disambiguation)">
     William Clark (disambiguation)
    </a>
    .
</div>

For the same element, its parent div tag has role="note" and class="hatnote". The div is preceded by a style tag that shares the same about="#mwt1" id and has typeof "mw:Transclusion"; this tag likely performs the actual transclusion of the item.

<style about="#mwt1" data-mw='{"parts":[{"template":{"target":{"wt":"Other people","href":"./Template:Other_people"},"params":{"1":{"wt":"William Clark"}},"i":0}}]}' data-mw-deduplicate="TemplateStyles:r1033289096" id="mwAg" typeof="mw:Extension/templatestyles mw:Transclusion">
    .mw-parser-output .hatnote{font-style:italic}.mw-parser-output div.hatnote{padding-left:1.6em;margin-bottom:0.5em}.mw-parser-output .hatnote i{font-style:normal}.mw-parser-output .hatnote+link+.hatnote{margin-top:-0.5em}
   </style> 

For reference, see the thread: https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/8#note_9522
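The context-based check described in this issue could be sketched roughly as follows. This is only an illustration under stated assumptions, not the library's implementation: it uses the standard-library `html.parser` rather than bs4, the class name `TranscludedLinkFinder` is hypothetical, and it assumes that an `about` attribute starting with `#mwt` on the link or any of its ancestors signals transcluded content.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TranscludedLinkFinder(HTMLParser):
    """Hypothetical helper: flags wikilinks that sit inside a '#mwt' context."""

    def __init__(self):
        super().__init__()
        self._stack = []  # one bool per open element: does it carry about="#mwt..."?
        self.links = []   # (href, is_transcluded) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        marked = (attrs.get("about") or "").startswith("#mwt")
        if tag == "a" and "mw:WikiLink" in (attrs.get("rel") or "").split():
            # Assumption: a marker on the link itself or on any ancestor counts.
            self.links.append((attrs.get("href"), marked or any(self._stack)))
        if tag not in VOID_TAGS:
            self._stack.append(marked)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()


html = (
    '<div about="#mwt1" class="hatnote" role="note">For other people, see '
    '<a class="mw-disambig" href="./William_Clark_(disambiguation)" '
    'rel="mw:WikiLink">William Clark (disambiguation)</a>.</div>'
    '<p><a href="./Plain_page" rel="mw:WikiLink">Plain page</a></p>'
)
finder = TranscludedLinkFinder()
finder.feed(html)
print(finder.links)
```

The `./Plain_page` link is made up for contrast. Links such as the Dictionary of National Biography example, which carry no marker in their ancestry either, would still be missed by this heuristic.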

add function to extract categories to library

In GitLab by @martingerlach on Jun 23, 2022, 19:50

From preliminary observations, categories appear inside the link tag and have the relation attribute "rel": "mw:PageProp/Category". There are also some subtypes of Category elements (e.g. normal or transcluded).
We need to write a library function to extract such categories and handle the subtypes as well.
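A minimal sketch of such an extraction function, written with the standard-library `html.parser` so that it is self-contained. The class name `CategoryFinder` is hypothetical, and treating the presence of an `about` attribute as the sign of a transcluded category is an assumption that should be checked against real dumps.

```python
from html.parser import HTMLParser


class CategoryFinder(HTMLParser):
    """Hypothetical helper: collects <link rel="mw:PageProp/Category"> elements."""

    def __init__(self):
        super().__init__()
        self.categories = []  # (name, is_transcluded) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "link" or attrs.get("rel") != "mw:PageProp/Category":
            return
        href = attrs.get("href") or ""
        # Drop the (localized) namespace prefix and any "#sortkey" fragment.
        name = href.split(":", 1)[-1].split("#", 1)[0].replace("_", " ")
        # Assumption: transcluded categories carry an "about" attribute.
        self.categories.append((name, "about" in attrs))


html = (
    '<link rel="mw:PageProp/Category" href="./Category:1770_births"/>'
    '<link rel="mw:PageProp/Category" href="./Category:Living_people#Clark" about="#mwt5"/>'
    '<link rel="stylesheet" href="x.css"/>'
)
finder = CategoryFinder()
finder.feed(html)
print(finder.categories)
```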

add function to extract templates to library

Likely the most complex element to extract. It appears mainly in two forms:

  1. As the href of wikilinks: this mirrors the case where wikilink hrefs point to Category pages, which we do not treat as categories. We need to decide how to handle such Template links.
  2. Nested inside data-mw: this is complicated to process because of the asymmetric structure of the data-mw dictionary and because of JSON decoding errors caused by escape characters, malformed strings, and inconsistent use of single and double quotation marks.
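For the second form, one defensive approach is to unescape the attribute value, attempt a JSON decode, and fall back to an empty result on failure. The sketch below assumes the `parts` / `template` / `target` layout seen in the data-mw example from the transclusion issue; the function name `template_targets` is hypothetical.

```python
import json
from html import unescape


def template_targets(data_mw_raw):
    """Return template hrefs (or wikitext names) found in a raw data-mw string."""
    try:
        data = json.loads(unescape(data_mw_raw))
    except (json.JSONDecodeError, TypeError):
        # Malformed or missing attribute: give up quietly instead of crashing.
        return []
    if not isinstance(data, dict):
        return []
    targets = []
    for part in data.get("parts", []):
        # Parts may also be plain strings (text between templates); skip those.
        if isinstance(part, dict) and isinstance(part.get("template"), dict):
            target = part["template"].get("target", {})
            targets.append(target.get("href") or target.get("wt"))
    return targets


raw = ('{"parts":[{"template":{"target":{"wt":"Other people",'
       '"href":"./Template:Other_people"},"params":{"1":{"wt":"William Clark"}},"i":0}}]}')
print(template_targets(raw))
print(template_targets("{'parts': [}"))
```

Single-quoted or otherwise malformed strings are not repaired here; they simply yield an empty list, which keeps iteration over a dump from stopping on one bad attribute.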

Feature/structure - [merged]

Merges feature/structure -> main

  • Created initial structure of the library that looks like the following :
.
├── docs
├── README.md
├── src
│   ├── dump
│   │   ├── dump.py
│   │   └── __init__.py
│   ├── __init__.py
│   ├── parse
│   │   ├── data.py
│   │   ├── __init__.py
│   │   └── utils.py
│   └── temp_test.ipynb
└── tests
    ├── __init__.py
    └── test_dump.py
  • Created separate class files for HTML Dumps and Articles
  • Included the current requirements.txt
  • Implemented methods :
    - get_html()
    - get_comments()
    - get_headers()
    - get_sections()

add function to extract media to library

According to this manual, media files appear inside either figure or span wrapper nodes, and they all have the attribute typeof = "mw:File/*". Additionally, the information specific to each media type can be found inside specialized nodes:

  • Images -> <img>
  • Video -> <video>
  • Audio -> <audio>

However, up to HTML version 2.4.0 there were separate mw attributes for each type of file, i.e. mw:image, mw:audio, mw:video. See this issue for reference.
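The wrapper-plus-inner-node structure described above could be walked roughly like this. It is a sketch using the standard-library `html.parser`; the class name `MediaFinder` is hypothetical, the `src` attribute is assumed to hold the media location, and the older per-type typeof values are not handled.

```python
from html.parser import HTMLParser

MEDIA_TAGS = {"img": "image", "video": "video", "audio": "audio"}


class MediaFinder(HTMLParser):
    """Hypothetical helper: finds media nodes inside mw:File wrappers."""

    def __init__(self):
        super().__init__()
        self._wrapper = None  # tag name of the current wrapper (figure or span)
        self._depth = 0       # nesting depth of that tag while inside the wrapper
        self.media = []       # (kind, src) pairs

    @staticmethod
    def _is_file(attrs):
        types = (attrs.get("typeof") or "").split()
        return any(t.startswith("mw:File") for t in types)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            if tag in MEDIA_TAGS:
                self.media.append((MEDIA_TAGS[tag], attrs.get("src")))
            if tag == self._wrapper:
                self._depth += 1  # nested span/figure inside the wrapper
        elif tag in ("figure", "span") and self._is_file(attrs):
            self._wrapper, self._depth = tag, 1

    def handle_endtag(self, tag):
        if self._depth and tag == self._wrapper:
            self._depth -= 1


html = (
    '<figure typeof="mw:File/Thumb"><a href="./File:X.jpg">'
    '<img src="//example.org/x.jpg"/></a><figcaption>cap</figcaption></figure>'
    '<img src="ignored.png">'
    '<span typeof="mw:File"><span><video src="v.webm"></video></span></span>'
)
finder = MediaFinder()
finder.feed(html)
print(finder.media)
```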

Consider removing specific versions from requirements.txt file

In GitLab by @geohci on Jul 5, 2022, 23:51

Specifying versions unnecessarily in the requirements.txt file can create confusion when folks want to use this package as part of a larger environment that may already have libraries installed. It also risks not ever being updated as dependencies improve. While leaving the version open can lead to breaking changes, a robust test environment can help us detect those quickly. Open to discussion.

Add tests to CI pipeline

In GitLab by @geohci on Jul 1, 2022, 04:32

Auto-run pytest tests for merge requests. This will make it easier to check that the code continues to operate as expected and will incentivize us to expand our test suite.

add function to extract external links to library

From preliminary observations, external links appear inside the a tag and have the relation attribute {"rel":"mw:ExtLink"}. There may be some subtypes according to this documentation.

We need to write a library function to extract the external links and handle the subtypes, as well.
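A possible shape for this function, again sketched with the standard-library `html.parser`. The class name `ExternalLinkFinder` is hypothetical, and the subtype names (`text`, `free`, `autonumber`, read from the `class` attribute) are an assumption, since the referenced documentation is not reproduced here.

```python
from html.parser import HTMLParser

# Assumed subtype class names; verify against the documentation before relying on them.
SUBTYPES = ("text", "free", "autonumber")


class ExternalLinkFinder(HTMLParser):
    """Hypothetical helper: collects <a rel="mw:ExtLink"> elements."""

    def __init__(self):
        super().__init__()
        self.links = []  # (href, subtype or None) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "a" or "mw:ExtLink" not in (attrs.get("rel") or "").split():
            return
        classes = (attrs.get("class") or "").split()
        subtype = next((c for c in classes if c in SUBTYPES), None)
        self.links.append((attrs.get("href"), subtype))


html = (
    '<a rel="mw:ExtLink" class="external text" href="https://example.org">Example</a>'
    '<a rel="mw:ExtLink nofollow" class="external free" href="https://example.com">'
    'https://example.com</a>'
    '<a rel="mw:WikiLink" href="./Foo">Foo</a>'
)
finder = ExternalLinkFinder()
finder.feed(html)
print(finder.links)
```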

Add namespace attribute to Wikilink objects

In GitLab by @geohci on Jul 8, 2022, 24:46

Now that we have a full mapping of the different namespace prefixes for each language, we'll want to use that to identify which namespace each wikilink falls in (based on its href).

For example:

  • <a href="./Michael_Potts_(actor)" id="mwBw" rel="mw:WikiLink" title="Michael Potts (actor)">Michael Potts (actor)</a> -> Wikilink(ns=0)
  • <a href="./Help:Disambiguation" rel="mw:WikiLink" title="Help:Disambiguation">disambiguation</a> -> Wikilink(ns=12)
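The href-to-namespace lookup could work along these lines. The `NAMESPACES` table below is a small hand-written English subset used only for illustration (the real mapping is per language and much larger), and the function name `namespace_of` is hypothetical.

```python
from urllib.parse import unquote

# Illustrative subset of English namespace prefixes; not the library's full mapping.
NAMESPACES = {
    "talk": 1, "user": 2, "wikipedia": 4, "file": 6,
    "template": 10, "help": 12, "category": 14,
}


def namespace_of(href):
    """Return the namespace id implied by a wikilink href, defaulting to 0."""
    title = unquote(href)
    if title.startswith("./"):
        title = title[2:]
    prefix, sep, _ = title.partition(":")
    if not sep:
        return 0  # no colon at all: main namespace
    # A colon that does not follow a known prefix is part of the title itself.
    return NAMESPACES.get(prefix.replace("_", " ").strip().lower(), 0)


print(namespace_of("./Michael_Potts_(actor)"))
print(namespace_of("./Help:Disambiguation"))
print(namespace_of("./Star_Wars:_Episode_I"))
```

With a lookup like this, a wikilink whose href points at `./Category:...` resolves to namespace 14, which gives one way to tell category pages apart from ordinary article links.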

Decide and Handle Categories that appear under WikiLink relations

Many categories seem to appear as wikilinks, in the following format:
<a href="./Category:United_States_geography_stubs" rel="mw:WikiLink" title="Category:United States geography stubs">place or feature in the United States</a>
We need to decide whether to treat them as categories or as wikilinks.

Determine appropriate level of processing on instantiation of Article object

In GitLab by @geohci on Jun 23, 2022, 22:52

When an Article object is instantiated, there are a few levels of processing that can go on and we should choose the one that feels like the best balance of functionality without introducing too much overhead:

  • Minimal: just create an Article object with the HTML raw string (unparsed) and a few attributes possibly based on the dump metadata. No parsing of HTML.
  • Middle: creating the Article object leads to the basic bs4 processing of the HTML from string to DOM. This is the current behavior. Greater overhead in terms of time and memory usage but might be worth it if this doesn't slow down iteration too much and gives access to some basic metadata from the HTML that is useful.
  • High: do full processing of HTML into DOM and also extract wiki-specific features. Likely too much overhead to be default but an option.
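One way to get the low instantiation cost of the minimal option while keeping parsed data available is to parse lazily on first access. The sketch below is only an illustration of that idea: it uses `functools.cached_property` and a standard-library parser as a stand-in for the bs4 step, and the `tags` attribute and `_TagCollector` class are hypothetical.

```python
from functools import cached_property
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Stand-in for the real DOM parsing step: records start-tag names."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


class Article:
    def __init__(self, html, title=None):
        # Minimal level: keep only the raw string and cheap metadata.
        self.html = html
        self.title = title

    @cached_property
    def tags(self):
        # Middle level, deferred: parsing happens once, on first access.
        collector = _TagCollector()
        collector.feed(self.html)
        return collector.tags


article = Article("<p>Hi <b>there</b></p>", title="Example")
print("tags" in article.__dict__)  # parsing has not happened yet
print(article.tags)
print("tags" in article.__dict__)  # result is now cached on the instance
```

Iterating over a dump then costs little per article unless a consumer actually touches the parsed attributes.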
