The ural from medialab

It would be nice to have a function able to extract and iterate over urls in the given piece of text.
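
A minimal sketch of what such a function could look like. The pattern and the `TRAILING_PUNCTUATION` constant are assumptions made for illustration, not the library's actual implementation. The pattern is kept free of nested quantifiers on purpose, so that a malformed url cannot trigger heavy backtracking.

```python
import re

# Deliberately simple pattern: a protocol followed by non-space characters.
# Illustrative assumption, not the pattern used by the library.
URL_IN_TEXT_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Punctuation that usually belongs to the sentence rather than to the url.
TRAILING_PUNCTUATION = ".,;:!?)]}"


def urls_from_text(text):
    """Lazily yield the urls found in the given piece of text."""
    for match in URL_IN_TEXT_RE.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url:
            yield url
```

Being a generator, it can be consumed lazily with a `for` loop or materialized with `list(urls_from_text(text))`.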

Invalid url makes urls_from_text hang

https://www.bfmtvregain-de-popularite-pour-emmanuel-macron-et-edouard-phi...

l.facebook.com

https://l.facebook.com/l.php?u=http%3A%2F%2Fwww.chaos-controle.com%2Farchives%2F2013%2F10%2F14%2F28176300.html&h=AT0iUqJpUTMzHAH8HAXwZ11p8P3Z-SrY90wIXZhcjMnxBTHMiau8Fv1hvz00ZezRegqmF86SczyUXx3Gzdt_MdFH-I4CwHIXKKU9L6w522xwOqkOvLAylxojGEwrp341uC-GlVyGE2N7XwTPK9cpP0mQ8PIrWh8Qj2gHIIR08Js0mUr7G8Qe9fx66uYcfnNfTTF1xi0Us8gTo4fOZxAgidGWXsdgtU_OdvQqyEm97oHzKbWfXjkhsrzbtb8ZNMDwCP5099IMcKRD8Hi5H7W3vwh9hd_JlRgm5Z074epD_mGAeoEATE_QUVNTxO0SHO4XNn2Z7LgBamvevu1ENBcuyuSOYA0BsY2cx8mPWJ9t44tQcnmyQhBlYm_YmszDaQx9IfVP26PRqhsTLz-kZzv0DGMiJFU78LVWVPc9QSw2f9mA5JYWr29w12xJJ5XGQ6DhJxDMWRnLdG8Tnd7gZKCaRdqDER1jkO72u75-o4YuV3CLh4j-_4u0fnHSzHdVD8mxr9pNEgu8rvJF1E2H3-XbzA6F2wMQtFCejH8MBakzYtTGNvHSexSiKphE04Ci1Z23nBjCZFsgNXwL3wbIXWfHjh2LCKyihQauYsnvxp6fyioStJSGgyA9GGEswizHa20lucQF0S0F8H9-

Add facebook urls related heuristics

(?un)handle trailing slashes ?

Currently normalize_url drops trailing slashes, although it is not that uncommon for a website to serve two different things at http://www.domain.tld/path/ and http://www.domain.tld/path

So I'm not sure this normalisation should remain, or at least it could be made optional, depending on what the normalized url is used for afterwards (for instance requesting share values on facebook ;) )
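
To make the proposal concrete, here is a rough sketch of an opt-in behaviour built on the standard library. The function name `normalize_trailing_slash` and the default value of its flag are hypothetical, and this is not how normalize_url is written.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_trailing_slash(url, strip_trailing_slash=False):
    """Return the url, optionally without the trailing slash of its path."""
    parts = urlsplit(url)
    path = parts.path
    if strip_trailing_slash and path.endswith("/"):
        path = path.rstrip("/")
    # Query and fragment are passed through untouched.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

With the flag off the two variants stay distinct, which is what a caller needs when the exact url matters.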

Additional AMP shenanigans

http://bc-marfeel-com.cdn.ampproject.org/c/s/bc.marfeel.com/www.capital.fr/a-la-une/actualites/francois-fillon-face-aux-geants-du-numerique-pas-de-faiblesse-1199748?marfeeltn=amp

http://www-irishtimes-com.cdn.ampproject.org/c/s/www.irishtimes.com/business/transport-and-tourism/talks-stall-between-ryanair-and-irish-pilots-union-1.3466248?mode=amp

http://cdn.ampproject.org/c/www.journaldunet.com/solutions/dsi/1165409-qu-est-ce-que-le-datalake-le-nouveau-concept-big-data-en-vogue/?output=amp

?wpamp
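
A possible heuristic for the cdn.ampproject.org urls above, as a sketch only. It assumes the cache path is shaped like `/c/[s/]<publisher host>/<path>`, where the optional `s` segment marks an https origin. The function name is hypothetical.

```python
from urllib.parse import urlsplit

AMP_CACHE_HOST = "cdn.ampproject.org"


def unwrap_amp_cache_url(url):
    """Return the publisher url wrapped in an AMP cache url, else the url unchanged."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host != AMP_CACHE_HOST and not host.endswith("." + AMP_CACHE_HOST):
        return url

    segments = parts.path.split("/")  # e.g. ['', 'c', 's', 'www.site.com', 'page']
    if len(segments) < 3 or segments[1] != "c":
        return url

    rest = segments[2:]
    scheme = "http"
    if rest[0] == "s":  # the 's' segment signals an https origin
        scheme = "https"
        rest = rest[1:]
    if not rest or not rest[0]:
        return url

    unwrapped = scheme + "://" + "/".join(rest)
    if parts.query:
        unwrapped += "?" + parts.query
    return unwrapped
```

This leaves query markers such as `mode=amp` or `output=amp` in place, and the marfeel example would still point at bc.marfeel.com after unwrapping, so both would need additional rules.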

Notes related to normalize_url enhancements

punycode [done]
sorting query params [done]
protocol option
authentication [done]
option for trailing / [done]
case? [done]
https://en.wikipedia.org/wiki/URL_normalization
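
For reference, a rough standard-library sketch of four of the items above (punycode decoding, sorted query parameters, dropped authentication, lowercased host). `normalize_sketch` is a hypothetical name and this is not the code of normalize_url. IPv6 hosts and invalid ports are not handled here.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_sketch(url):
    """Apply a few normalization steps using only the standard library."""
    parts = urlsplit(url)

    # `hostname` is already lowercased and stripped of credentials and port.
    host = parts.hostname or ""

    # Decode punycode labels (xn--...) back to unicode when possible.
    try:
        host = host.encode("ascii").decode("idna")
    except UnicodeError:
        pass

    if parts.port is not None:
        host = "%s:%d" % (host, parts.port)

    # Sort the raw key=value pairs so that parameter order does not matter.
    query = "&".join(sorted(pair for pair in parts.query.split("&") if pair))

    return urlunsplit((parts.scheme, host, parts.path, query, parts.fragment))
```

Sorting the raw pairs rather than re-encoding them avoids changing how values are percent-encoded.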

Slight naming change

@boogheta @farjasju:

should we change extract_urls and extract_urls_from_html to urls_from_text and urls_from_html? And later add links_from_html with the link's title & url in a tuple?

Switch strip_trailing_slash to default to True

normalize_url crashes on some urls

Add `is_url`

Having a function able to tell whether the given mixed value is a url would be nice. Should there also be an option to accept urls without a protocol?
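
One way this could look, as a sketch under loose assumptions: the `require_protocol` flag is a hypothetical name, only http(s) is accepted, and the last host label must be alphabetic, so IP hosts and `localhost` are rejected by design.

```python
import re
from urllib.parse import urlsplit

PROTOCOL_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def is_url(value, require_protocol=True):
    """Tell whether the given mixed value looks like a http(s) url."""
    if not isinstance(value, str):
        return False

    value = value.strip()
    if not value or any(char.isspace() for char in value):
        return False

    if not PROTOCOL_RE.match(value):
        if require_protocol:
            return False
        value = "http://" + value

    try:
        parts = urlsplit(value)
        host = parts.hostname or ""
    except ValueError:  # e.g. unbalanced IPv6 brackets
        return False

    if parts.scheme not in ("http", "https"):
        return False

    # Require at least `domain.tld`, with an alphabetic last label.
    labels = host.split(".")
    return len(labels) >= 2 and all(labels) and labels[-1].isalpha()
```

A stricter version would check the last label against a real list of tlds instead of trusting any alphabetic suffix.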

Check that urls_from_html does not yield `a` tags found within script tags

url matching functions

Roadmap

What URL-related heuristics do we need to centralize here?

Ping @boogheta.

Urlsplit is not bijective regarding empty query and empty fragment
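
A short reproduction with the standard library, whose `urlsplit` represents both an absent and an empty query (or fragment) as an empty string. The two helper names are only there for the demonstration.

```python
from urllib.parse import urlsplit, urlunsplit


def roundtrip(url):
    """Split the url, then join it back."""
    return urlunsplit(urlsplit(url))


def loses_information(url):
    """Tell whether the split/unsplit round trip alters the url."""
    return roundtrip(url) != url
```

Both `http://example.com/path?` and `http://example.com/path#` come back as `http://example.com/path`, so the lone `?` or `#` cannot be recovered from the split result.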

Parse Facebook post urls

https://www.facebook.com/astucerie/posts/428202057564823
https://www.facebook.com/permalink.php?story_fbid=1354978971282622&id=598338556946671
https://www.facebook.com/groups/175634843342347/permalink/235340200705144
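
A sketch of a parser covering the three shapes listed above. The function name and the `(parent, post_id)` return shape are assumptions made for illustration, and other shapes surely exist.

```python
import re
from urllib.parse import parse_qs, urlsplit

POSTS_RE = re.compile(r"^/([^/]+)/posts/(\d+)/?$")
GROUP_PERMALINK_RE = re.compile(r"^/groups/([^/]+)/permalink/(\d+)/?$")


def parse_facebook_post_url(url):
    """Return a (parent id or handle, post id) tuple, or None if unrecognized."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host != "facebook.com" and not host.endswith(".facebook.com"):
        return None

    # /<handle>/posts/<id> and /groups/<group id>/permalink/<id>
    match = POSTS_RE.match(parts.path) or GROUP_PERMALINK_RE.match(parts.path)
    if match:
        return match.group(1), match.group(2)

    # /permalink.php?story_fbid=<id>&id=<parent id>
    if parts.path == "/permalink.php":
        query = parse_qs(parts.query)
        story = query.get("story_fbid")
        parent = query.get("id")
        if story and parent:
            return parent[0], story[0]

    return None
```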

urls_from_text: option to detect urls without protocol

We also need to check the tld to avoid false positives as much as possible.

Refine fragment stripping heuristic

Mine youtube url dump to formulate parsing heuristics

Add `hide_url_overflow` helper

It would be nice to have a helper able to display a url in a squeezed space (such as a terminal window for instance).
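
A naive sketch of such a helper, which simply cuts the end of the url. The signature (`max_length`, `ellipsis`) is an assumption, and a nicer version would probably try to keep the domain and the last path segment visible.

```python
def hide_url_overflow(url, max_length=40, ellipsis="..."):
    """Return the url truncated so that it fits within max_length characters."""
    if max_length <= len(ellipsis):
        raise ValueError("max_length must be larger than the ellipsis")

    if len(url) <= max_length:
        return url

    # Keep the start of the url and signal the cut with the ellipsis.
    return url[: max_length - len(ellipsis)] + ellipsis
```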

Add `extract_urls_from_html`

It would be nice to have a function able to extract and iterate over the urls pointed to by html `a` tags only.
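
A possible starting point using the standard library's `html.parser` rather than a regular expression. This is an assumption about the approach, not the library's code. A side benefit is that `HTMLParser` treats the content of `script` tags as raw text, so `a` tags written inside scripts are not reported.

```python
from html.parser import HTMLParser


class _AnchorCollector(HTMLParser):
    """Collect the href attribute of every `a` start tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # `value` is None for a bare `href` attribute without a value.
            if name == "href" and value:
                self.urls.append(value.strip())


def extract_urls_from_html(html):
    """Return the urls pointed to by the `a` tags of the given html string."""
    collector = _AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.urls
```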

LRUTrie.match crashes on IP urls

Example:
http://127.0.0.1/economie/2019/01/08/un-journaliste-poursuit-richard-ferrand-pour-lavoir-bloque-sur-twitter/

Traceback (most recent call last):
  File "bin/export_users_urls_lrumedia.py", line 56, in <module>
    media = trie.match(l) or ""
  File "/home/boo/.virtualenvs/gazouilloire-polarisation/lib/python2.7/site-packages/ural/lru/trie.py", line 33, in match
    stems = self.__lru_stems(url)
  File "/home/boo/.virtualenvs/gazouilloire-polarisation/lib/python2.7/site-packages/ural/lru/stems.py", line 93, in lru_stems
    return lru_stems_from_parsed_url(urlsplit(full_url), tld_aware=tld_aware)
  File "/home/boo/.virtualenvs/gazouilloire-polarisation/lib/python2.7/site-packages/ural/lru/stems.py", line 49, in lru_stems_from_parsed_url
    tld = '.'.join(domain_parts[non_zero_i:])
TypeError: 'NoneType' object has no attribute '__getitem__'
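
A guard along these lines might avoid the crash. This is a simplified sketch: `host_stems` is a hypothetical helper that only deals with the host part and does not reproduce the actual stem format of `lru_stems`.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ip_host(host):
    """Tell whether the host is an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def host_stems(url):
    """Return host stems ordered from the most generic to the most specific."""
    host = urlsplit(url).hostname or ""
    if not host:
        return []
    if is_ip_host(host):
        # An IP address has no tld: keep it whole instead of reversing labels.
        return [host]
    return host.split(".")[::-1]
```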

Add "fbclid" in the GET blacklist (added by Facebook on links to external pages)
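
For illustration, a sketch of how such a blacklist could be applied with the standard library. `QUERY_BLACKLIST` and `drop_blacklisted_params` are hypothetical names, and `fref` is included because it is also slated for removal in this tracker.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical blacklist of tracking parameters.
QUERY_BLACKLIST = frozenset(["fbclid", "fref"])


def drop_blacklisted_params(url, blacklist=QUERY_BLACKLIST):
    """Return the url without the query parameters named in the blacklist."""
    parts = urlsplit(url)
    kept = [
        pair
        for pair in parts.query.split("&")
        if pair and pair.split("=", 1)[0].lower() not in blacklist
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "&".join(kept), parts.fragment))
```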

Handling IPv6, authenticated & punycode urls in 'is_url'

fref query to drop

canonicalize/normalize/fingerprint

canonicalize = browser will see those variants as the same
normalize = those variants are the same permalink
key = those keys are not urls anymore but are useful to compute statistical aggregates

Add a variant of is_url tolerating spaces in the path

Add google search url normalization

The Google AMP Conundrum

Add feature: ensure_protocol
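
A sketch of what this could do. The signature, including the `protocol` default, is an assumption.

```python
import re

PROTOCOL_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def ensure_protocol(url, protocol="http"):
    """Return the url with a protocol, adding the given one if missing."""
    url = url.strip()
    if PROTOCOL_RE.match(url):
        return url
    if url.startswith("//"):  # protocol-relative url
        return protocol + ":" + url
    return protocol + "://" + url
```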

lru_stems should handle bad tlds silently

Handle gl/hl?

https://www.youtube.com/?gl=FR&hl=fr

Add `strip_tld` option

The CLI

It would be nice for this library to expose a CLI tool able to read lists of urls from data (such as CSV files, typically) and able to perform web/url related tasks such as:

normalize the urls
ensure or strip the protocols
extract domain/subdomain/tld/...
fetch html
fetch json (maybe with some way to query inside)
scrape?
get FB like button data
use dragnet to extract content
etc.
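
To give an idea of the shape of such a tool, here is a very small sketch covering only the domain extraction task, built on `argparse` and `csv`. The argument names and the `domain` output column are made up for the example.

```python
import argparse
import csv
import sys
from urllib.parse import urlsplit


def rows_with_domain(rows, column):
    """Yield the given CSV rows with an extra `domain` column."""
    rows = iter(rows)
    header = next(rows, None)
    if header is None:
        return
    index = header.index(column)
    yield header + ["domain"]
    for row in rows:
        # Urls without a protocol have no parsed hostname: emit an empty cell.
        yield row + [urlsplit(row[index]).hostname or ""]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Append a domain column to a CSV file of urls.")
    parser.add_argument("column", help="name of the column containing the urls")
    parser.add_argument("file", nargs="?", type=argparse.FileType("r"), default=sys.stdin)
    args = parser.parse_args(argv)

    writer = csv.writer(sys.stdout)
    writer.writerows(rows_with_domain(csv.reader(args.file), args.column))
```

An entry point would just call `main()`. Keeping the row logic in a generator separate from the argument parsing makes each task easy to test and to stream over large files.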

cc @farjasju @boogheta

Should we also strip? What about spaces?

...\ural\urls_from_html.py in urls_from_html(string)
     29     for a_tag in re.finditer(HTML_URL_RE, string):
     30         url = a_tag.group(3)
---> 31         yield clean_link(url)

...\ural\urls_from_html.py in clean_link(link)
     13 def clean_link(link):
     14     """Removes leading and trailing whitespace and punctuation"""
---> 15     return link.strip("\t\r\n '\"\x0c")
     16 
     17 

AttributeError: 'NoneType' object has no attribute 'strip'
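
The traceback suggests that `a_tag.group(3)` can be None, which happens when the group did not take part in the match. A possible guard is sketched below. `HREF_RE` is a hypothetical stand-in pattern written for this example, NOT the library's `HTML_URL_RE`.

```python
import re

# Hypothetical pattern: the href value lands in one of three groups depending
# on how it is quoted, so the two other groups are None for any given match.
HREF_RE = re.compile(
    r"""<a\s[^>]*?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def clean_link(link):
    """Removes leading and trailing whitespace and punctuation"""
    return link.strip("\t\r\n '\"\x0c")


def urls_from_html(string):
    for a_tag in HREF_RE.finditer(string):
        # Pick whichever group matched instead of assuming a fixed one.
        url = next((group for group in a_tag.groups() if group is not None), None)
        if not url:
            continue
        yield clean_link(url)
```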

List of websites where the error happened

Add a strip_protocol option to the normalize_url function:

>>> normalize_url("http://www.google.com/page.html", strip_protocol=False)
'http://google.com/page.html'
