appledora / mwparserfromhtml

An unofficial mirror of our repo of the `mwparserfromhtml` package. It is a python library for working with the HTML dumps. Since this is only a mirror, DO NOT PR.

Home Page: https://pypi.org/project/mwparserfromhtml/

License: MIT License

Python 100.00%

html python wikimedia

mwparserfromhtml's Introduction

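from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Appledora:
    # Mutable defaults must go through default_factory, otherwise dataclass raises ValueError.
    research: list[str] = field(default_factory=lambda: ["CL", "CV", "XAI"])
    interest: list[str] = field(default_factory=lambda: ["Astrophysics", "LifeScience", "Literature", "Popculture"])
    welcome: list[str] = field(default_factory=lambda: ["Collaboration", "Competitions"])
    funfact: Optional[str] = None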

My stats 🐥

From: 08 December 2023 - To: 15 December 2023

Python   47 mins         ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿   100.00 %

My recent projects 🌼

mwtokenizer : a language-agnostic sentence-word tokenizer.
mwparserfromhtml : a python library to parse and process wikipedia html-dumps.

This readme exists because I am bored and obsessed. If you liked my profile, please 🌟 the repo. To use this template, fork it and modify it ✨

mwparserfromhtml's Issues

add function to extract references to library

Note that this method is for auto-generated References, not the Bibliography.

References can appear in different ways:

Section: https://paste.ubuntu.com/p/y5mTmW7vgr/
- article: https://en.wikipedia.org/wiki/Hans_Rehmann
Section: https://paste.ubuntu.com/p/GpxXTFQg85/
- article: https://en.wikipedia.org/wiki/Guignen
Section: https://paste.ubuntu.com/p/zNmfWSKJsS/
- article: https://en.wikipedia.org/wiki/Shuntaro_Hida

Determine the markers used for transcluded elements

Some transcluded links have no marker that lets us identify them, and they have the same format as a standard WikiLink: <a href="./Dictionary_of_National_Biography" rel="mw:WikiLink" title="Dictionary of National Biography"> Dictionary of National Biography </a>

Corresponding article : William Clark

In the same article, <a class="mw-disambig" href="./William_Clark_(disambiguation)" rel="mw:WikiLink" title="William Clark (disambiguation)"> William Clark (disambiguation) </a> - is both a disambiguation and a transclusion. The class attribute mw-disambig helps us identify the disambiguation, but not the transclusion.

On closer inspection, it seems we need to look at the context in which the link is placed, for example:

<div about="#mwt1" class="hatnote navigation-not-searchable" id="mwAw" role="note">
    For other people named William Clark, see
    <a class="mw-disambig" href="./William_Clark_(disambiguation)" rel="mw:WikiLink" title="William Clark (disambiguation)">
     William Clark (disambiguation)
    </a>
    .
</div>

For the same element, if we consider its parent div tag, we see that it has role="note" and a class that includes hatnote. The div is preceded by a style tag that shares the same about="#mwt1" value and has typeof="mw:Transclusion", so it likely marks the actual transclusion of the item.

<style about="#mwt1" data-mw='{"parts":[{"template":{"target":{"wt":"Other people","href":"./Template:Other_people"},"params":{"1":{"wt":"William Clark"}},"i":0}}]}' data-mw-deduplicate="TemplateStyles:r1033289096" id="mwAg" typeof="mw:Extension/templatestyles mw:Transclusion">
    .mw-parser-output .hatnote{font-style:italic}.mw-parser-output div.hatnote{padding-left:1.6em;margin-bottom:0.5em}.mw-parser-output .hatnote i{font-style:normal}.mw-parser-output .hatnote+link+.hatnote{margin-top:-0.5em}
   </style>

For reference, check thread : https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/8#note_9522
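The context-based check described above can be sketched as follows. This is a minimal illustration built on the standard library's `html.parser` rather than the library's own code; the class name `TransclusionLinkFinder` is hypothetical, and it rests on the assumption that any element carrying an `about` attribute (such as `about="#mwt1"` above) belongs to a transclusion.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TransclusionLinkFinder(HTMLParser):
    """Collect wikilinks and flag those nested inside an element with `about`."""

    def __init__(self):
        super().__init__()
        self._stack = []  # one entry per open element: its `about` value or None
        self.links = []   # (href, is_transcluded) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        about = attrs.get("about")
        if tag == "a" and "mw:WikiLink" in (attrs.get("rel") or "").split():
            inherited = any(self._stack)  # does any open ancestor carry `about`?
            self.links.append((attrs.get("href"), bool(about) or inherited))
        if tag not in VOID_TAGS:
            self._stack.append(about)

    def handle_endtag(self, tag):
        # Assumes well-nested input; a sturdier version would match tag names.
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()


finder = TransclusionLinkFinder()
finder.feed(
    '<div about="#mwt1" class="hatnote" role="note">'
    '<a class="mw-disambig" href="./William_Clark_(disambiguation)" rel="mw:WikiLink">x</a>'
    '</div>'
    '<p><a href="./Dictionary_of_National_Biography" rel="mw:WikiLink">y</a></p>'
)
print(finder.links)
```

Under this assumption the hatnote link is flagged because its parent div has an `about` id, while a link in an ordinary paragraph is not. Whether every transcluded link really has such an ancestor is exactly the open question of this issue.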

utils function to identify the element type of html string

tied in with the plaintext extraction issue #32

Resolve "add functions to extract plaintexts to library" - [closed]

Merges 32-add-functions-to-extract-plaintexts-to-library -> main

write test for existing extraction method - [merged]

Merges 21-section-test -> main

This MR is essentially a patch collection of several issues from #17 to #25 (except for #22).

Tests were implemented using pytest and verified by running pytest -v.

Closes #21

add function to extract categories to library

In GitLab by @martingerlach on Jun 23, 2022, 19:50

From initial observations, Categories appear inside the link tag and have the relation attribute rel="mw:PageProp/Category". There are also some subtypes of Category elements (e.g. normal or transcluded).
We need to write a library function that extracts such categories and handles the subtypes as well.
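A minimal sketch of such a function, using only the standard library. The name `CategoryFinder` is hypothetical and not the library's API. Treating the presence of an `about` attribute as the "transcluded" subtype, and dropping anything after `#` in the href, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class CategoryFinder(HTMLParser):
    """Collect <link rel="mw:PageProp/Category"> elements from article HTML."""

    def __init__(self):
        super().__init__()
        self.categories = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and "mw:PageProp/Category" in (attrs.get("rel") or "").split():
            href = attrs.get("href") or ""
            # "./Category:Foo_bar#fragment" -> "Category:Foo bar"
            title = unquote(href.split("#", 1)[0]).removeprefix("./").replace("_", " ")
            self.categories.append({
                "title": title,
                # Assumption: an `about` attribute marks a transcluded category.
                "transcluded": "about" in attrs,
            })


finder = CategoryFinder()
finder.feed(
    '<link rel="mw:PageProp/Category" href="./Category:United_States_geography_stubs"/>'
    '<link rel="mw:PageProp/Category" href="./Category:Living_people" about="#mwt3"/>'
)
print(finder.categories)
```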

Feedback for gitlab issue experiment

In GitLab by @fab on Jun 16, 2022, 22:29

At the end of this project, please provide some feedback on the gitlab usage in this phab task

add functions to extract parents to library

In GitLab by @appledora on Jul 12, 2022, 15:44

pretty print article information

In GitLab by @appledora on Jul 12, 2022, 15:02

feature: extract external links - [merged]

Merges 13-extlink-extraction -> main

Closes #13

add function to extract templates to library

Likely the most complex element to extract. Templates appear mainly in two forms:

- As the href of Wikilinks: a similar case exists for Categories, where WikiLink hrefs point to Category pages and we do not treat those elements as Categories. We have to decide how to handle such Template links.
- Nested inside data-mw: this is complicated to process because of the asymmetric structure of the data-mw dictionary, JSON decoding errors caused by escape characters, malformed strings, inconsistent use of single and double quotation marks, etc.
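For the data-mw form, a defensive decoding step might look like the sketch below. The function name `template_targets` is hypothetical. It assumes the `parts` / `template` / `target` layout seen in the data-mw examples in these issues, and it returns an empty list instead of raising when the JSON cannot be decoded.

```python
import json


def template_targets(data_mw):
    """Return template targets found in a data-mw attribute value."""
    try:
        data = json.loads(data_mw)
    except (json.JSONDecodeError, TypeError):
        # Malformed or missing attribute: report nothing rather than crash.
        return []
    parts = data.get("parts", []) if isinstance(data, dict) else []
    targets = []
    for part in parts:
        # Guard the type, since a part is not guaranteed to be a dictionary.
        if isinstance(part, dict) and isinstance(part.get("template"), dict):
            target = part["template"].get("target") or {}
            name = target.get("href") or (target.get("wt") or "").strip()
            if name:
                targets.append(name)
    return targets


sample = ('{"parts":[{"template":{"target":{"wt":"Other people",'
          '"href":"./Template:Other_people"},"params":{"1":{"wt":"William Clark"}},"i":0}}]}')
print(template_targets(sample))
print(template_targets("{'parts': broken"))
```

Preferring `href` over `wt` is a design choice for the sketch: the href is already a normalized page link, while `wt` holds the raw wikitext name.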

Write test for dump module

In GitLab by @appledora on Jul 12, 2022, 15:46

Feature/structure - [merged]

Merges feature/structure -> main

Created initial structure of the library that looks like the following :

.
├── docs
├── README.md
├── src
│   ├── dump
│   │   ├── dump.py
│   │   ├── __init__.py
│   ├── __init__.py
│   ├── parse
│   │   ├── data.py
│   │   ├── __init__.py
│   │   └── utils.py
│   └── temp_test.ipynb
└── tests
    ├── __init__.py
    └── test_dump.py

Created separate class files for HTML Dumps and Articles
Included the current requirements.txt
Implemented methods :
- get_html()
- get_comments()
- get_headers()
- get_sections()

write test for section extraction method

In GitLab by @appledora on Jul 6, 2022, 14:28

add function to extract media to library

According to this manual, media files appear inside either figure or span wrapper nodes, and they all have the attribute typeof="mw:File/*". The details for each media type are found inside the corresponding nested node:

Images -> <img>
Video -> <video>
Audio -> <audio>

However, up to HTML version 2.4.0 there was a separate typeof value for each file type, i.e. mw:Image, mw:Audio and mw:Video. See this issue for reference.
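The wrapper-plus-inner-node pattern above can be sketched with the standard library as follows. `MediaFinder` is a hypothetical name, not the library's class. Accepting both the current `mw:File` prefix and the older per-type prefixes, and reading `resource` before `src`, are assumptions for illustration.

```python
from html.parser import HTMLParser

MEDIA_TAGS = {"img": "image", "video": "video", "audio": "audio"}
# Current prefix plus the older per-type values mentioned above.
WRAPPER_PREFIXES = ("mw:File", "mw:Image", "mw:Video", "mw:Audio")


class MediaFinder(HTMLParser):
    """Collect img/video/audio nodes that sit inside a media wrapper."""

    def __init__(self):
        super().__init__()
        self.media = []
        self._wrapper = None  # tag name of the open wrapper, if any
        self._depth = 0       # nesting depth of that tag name

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._wrapper is None:
            typeof = (attrs.get("typeof") or "").split()
            if tag in ("figure", "span") and any(t.startswith(WRAPPER_PREFIXES) for t in typeof):
                self._wrapper, self._depth = tag, 1
        elif tag == self._wrapper:
            self._depth += 1
        elif tag in MEDIA_TAGS:
            self.media.append({
                "kind": MEDIA_TAGS[tag],
                "source": attrs.get("resource") or attrs.get("src"),
            })

    def handle_endtag(self, tag):
        if tag == self._wrapper:
            self._depth -= 1
            if self._depth == 0:
                self._wrapper = None


finder = MediaFinder()
finder.feed(
    '<figure typeof="mw:File/Thumb"><a href="./File:A.jpg">'
    '<img resource="./File:A.jpg" src="//example.org/A.jpg"/></a>'
    '<figcaption>caption</figcaption></figure>'
    '<img src="plain.png"/>'
)
print(finder.media)
```

Images outside a wrapper, such as the trailing `plain.png`, are ignored on purpose, since only wrapped nodes are described as media files here.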

Consider removing specific versions from requirements.txt file

In GitLab by @geohci on Jul 5, 2022, 23:51

Specifying versions unnecessarily in the requirements.txt file can create confusion when folks want to use this package as part of a larger environment that may already have libraries installed. It also risks not ever being updated as dependencies improve. While leaving the version open can lead to breaking changes, a robust test environment can help us detect those quickly. Open to discussion.

Add tests to CI pipeline

In GitLab by @geohci on Jul 1, 2022, 04:32

Auto-run pytest tests for merge requests. This will make it easier to check that the code continues to operate as expected and will incentivize us to expand our test suite.

reduce redundancy in testing module

In GitLab by @appledora on Jul 14, 2022, 19:52

Resolve "add function to extract references to library" - [merged]

Merges 28-reference-extraction -> main

References are identified by looking for {"class": "mw-reference-text"} attributes inside <span> tags. We also store the id of references that can help track the position where the reference was used.

testing functions

Closes #28
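The identification rule described in this merge request (a span with class mw-reference-text, plus its id) can be illustrated with a short standard-library sketch. `ReferenceFinder` is a hypothetical name and this is not the merged implementation, which is not shown here.

```python
from html.parser import HTMLParser


class ReferenceFinder(HTMLParser):
    """Collect the id and plain text of <span class="mw-reference-text"> nodes."""

    def __init__(self):
        super().__init__()
        self.references = []
        self._current = None  # reference being collected, if any
        self._depth = 0       # span nesting depth inside that reference

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is not None:
            if tag == "span":
                self._depth += 1
        elif tag == "span" and "mw-reference-text" in (attrs.get("class") or "").split():
            self._current = {"id": attrs.get("id"), "text": ""}
            self._depth = 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace before storing the text.
                self._current["text"] = " ".join(self._current["text"].split())
                self.references.append(self._current)
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


finder = ReferenceFinder()
finder.feed(
    '<li id="cite_note-1"><span class="mw-reference-text" id="mw-reference-text-cite_note-1">'
    'Smith, <i>A Book</i> <span class="year">(2001)</span>.</span></li>'
)
print(finder.references)
```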

add functions to extract tables to library

In GitLab by @appledora on Jul 12, 2022, 15:44

Determine all the different wiki-elements

In GitLab by @martingerlach on Jun 23, 2022, 19:52

write function to create a hierarchy tree of the HTML tags

In GitLab by @appledora on Jun 29, 2022, 16:55

test

In GitLab by @appledora on Jun 22, 2022, 23:22

Resolve "add functions to extract plaintexts to library" - [merged]

Merges 32-plaintext-extraction -> main

See this notebook for the outputs : https://public.paws.wmcloud.org/User:Isaac_(WMF)/Outreachy%20Summer%202022/plaintext_exp5.ipynb

No test written for this yet

Closes #32

feature: added namespace attribute to Wikilink instances, language attribute... - [merged]

Merges 22-wikilink-namespace -> main

feature: added namespace attribute to Wikilink instances, language attribute to Article class, link attribute to Category, Wikilink, ExternalLink and Template

Closes #22

add function to extract external links to library

From initial observations, external links appear inside the a tag and have the relation attribute rel="mw:ExtLink". There may be some subtypes, according to this documentation.

We need to write a library function to extract the external links and handle the subtypes, as well.
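A minimal sketch of such an extraction, using the standard library only. `ExternalLinkFinder` is a hypothetical name. Keeping the link's class list as a stand-in for the subtype is an assumption, since the subtypes themselves are not spelled out in this issue.

```python
from html.parser import HTMLParser


class ExternalLinkFinder(HTMLParser):
    """Collect <a rel="mw:ExtLink"> elements from article HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # rel may hold several space-separated tokens, so test membership.
        if tag == "a" and "mw:ExtLink" in (attrs.get("rel") or "").split():
            self.links.append({
                "href": attrs.get("href"),
                # Assumption: the class list is what distinguishes subtypes.
                "classes": (attrs.get("class") or "").split(),
            })


finder = ExternalLinkFinder()
finder.feed(
    '<a rel="mw:ExtLink" class="external text" href="https://example.org">Example</a>'
    '<a rel="mw:WikiLink" href="./Foo">Foo</a>'
)
print(finder.links)
```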

Write a Link Normalization method with better coverage of different use cases

In GitLab by @appledora on Jun 24, 2022, 24:16

Choose a license

In GitLab by @geohci on Jul 14, 2022, 23:12

I'm a personal fan of MIT but open to discussion.

Create Documentation

In GitLab by @appledora on Aug 4, 2022, 18:55

determine how to identify hidden categories

Refer to this thread for details : https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/8#note_9524

Add namespace attribute to Wikilink objects

In GitLab by @geohci on Jul 8, 2022, 24:46

Now that we have a full mapping of the different namespace prefixes for each language, we'll want to use that to identify which namespace each wikilink falls in (based on its href).

For example:

<a href="./Michael_Potts_(actor)" id="mwBw" rel="mw:WikiLink" title="Michael Potts (actor)">Michael Potts (actor)</a> -> Wikilink(ns=0)
<a href="./Help:Disambiguation" rel="mw:WikiLink" title="Help:Disambiguation">disambiguation</a> -> Wikilink(ns=12)
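The mapping from href to namespace could be sketched like this. The function `namespace_of` and the small `EN_NAMESPACES` table are hypothetical stand-ins for the full per-language mapping mentioned above; the ids in the table are the standard MediaWiki namespace numbers.

```python
from urllib.parse import unquote

# Illustrative subset for English; the real mapping covers every language.
EN_NAMESPACES = {"Talk": 1, "User": 2, "File": 6, "Template": 10, "Help": 12, "Category": 14}


def namespace_of(href, prefixes=EN_NAMESPACES):
    """Return the namespace id implied by a wikilink href (0 = main namespace)."""
    lookup = {name.lower(): ns for name, ns in prefixes.items()}
    # Drop any fragment, decode percent-escapes, strip the leading "./".
    title = unquote(href.split("#", 1)[0]).removeprefix("./")
    prefix, sep, _rest = title.partition(":")
    if not sep:
        return 0
    # An unknown prefix means the colon is part of an ordinary title.
    return lookup.get(prefix.replace("_", " ").strip().lower(), 0)


print(namespace_of("./Michael_Potts_(actor)"))
print(namespace_of("./Help:Disambiguation"))
print(namespace_of("./Star_Trek:_Voyager"))
```

The same lookup also flags wikilinks that point at Category pages (namespace 14), which is useful when deciding how to treat categories that appear under WikiLink relations.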

Decide and Handle Categories that appear under WikiLink relations

Many categories seem to appear as Wikilinks, in the following format:
<a href="./Category:United_States_geography_stubs" rel="mw:WikiLink" title="Category:United States geography stubs">place or feature in the United States</a>
We need to decide whether to treat them as categories or as wikilinks.

feature: extract categories and normalize category links - [merged]

Merges 7-category-extraction -> main

Closes #7

add functions to extract ancestors to library

In GitLab by @appledora on Jul 12, 2022, 15:44

feature: extract image audio and video media - [merged]

Merges 27-media-extraction -> main

Closes #27

Write test for category extraction method

In GitLab by @appledora on Jul 1, 2022, 20:26

add functions to extract plaintexts to library

In GitLab by @appledora on Jul 12, 2022, 15:45

write test for template extraction method

In GitLab by @appledora on Jul 12, 2022, 14:33

write test for wikilinks extraction method

In GitLab by @appledora on Jul 6, 2022, 14:28

feature: template extraction method - [merged]

Merges 14-template-extraction -> main

Closes #14

add static namespace list and utility for generating it to help with namespace... - [merged]

In GitLab by @geohci on Jul 5, 2022, 23:45

Merges 6-namespaces -> main

add static namespace list and utility for generating it to help with namespace detection for wikilinks

Closes #6

Create initial structure for tests with headings as example. - [merged]

In GitLab by @geohci on Jun 23, 2022, 21:23

Merges 4-test-files -> main

Create initial structure for tests with headings as example. Run via pytest from top-level of repo.

Closes #4

Feature/wikilink : issue 5 - [merged]

Merges feature/wikilinks -> main

Created a Base Element class
Extended it for WikiLinks
Categorized Wikilinks into subclasses

Solves issue #5

Minimal: just create an Article object with the HTML raw string (unparsed) and a few attributes possibly based on the dump metadata. No parsing of HTML.
Middle: creating the Article object leads to the basic bs4 processing of the HTML from string to DOM. This is the current behavior. Greater overhead in terms of time and memory usage but might be worth it if this doesn't slow down iteration too much and gives access to some basic metadata from the HTML that is useful.
High: do full processing of HTML into DOM and also extract wiki-specific features. Likely too much overhead to be default but an option.
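One way to get the low cost of the Minimal option while keeping parsed data available is to defer parsing until it is first needed. The sketch below is an illustration of that idea only, not the library's design: `LazyArticle` and `_TagCollector` are hypothetical names, and the standard library's `html.parser` stands in for the bs4 processing mentioned above.

```python
from functools import cached_property
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Stand-in for real DOM construction: records start tags in order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


class LazyArticle:
    def __init__(self, title, html):
        self.title = title
        self.html = html  # raw string only; nothing is parsed at creation time

    @cached_property
    def tags(self):
        # Parsing cost is paid on first access, then the result is cached.
        collector = _TagCollector()
        collector.feed(self.html)
        return collector.tags


article = LazyArticle("Example", "<section><h2>Title</h2><p>Text</p></section>")
print("tags" in article.__dict__)  # parsing has not happened yet
print(article.tags)
```

Iterating over a dump then stays cheap for callers that only need the raw HTML or metadata, while callers that touch `tags` pay for parsing once per article.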
