medialab / ural
A helper library full of URL-related heuristics.
License: GNU General Public License v3.0
tmtc @boogheta
It would be nice to have a function able to extract and iterate over the URLs found in a given piece of text.
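As a starting point for discussion, here is a minimal sketch of such an iterator. The name `iter_urls_in_text`, the regular expression and the list of trailing punctuation are all assumptions made for illustration, not the library's API, and the pattern is deliberately much simpler than a full URL grammar.

```python
import re

# Deliberately simple pattern (an assumption, not an exhaustive URL grammar):
# an http(s) scheme followed by any run of characters that are not
# whitespace, angle brackets or quotes.
URL_IN_TEXT_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Punctuation that usually belongs to the surrounding sentence, not the URL.
TRAILING_PUNCTUATION = ".,;:!?)]}"


def iter_urls_in_text(text):
    """Yield each URL-looking substring found in `text`, in order."""
    for match in URL_IN_TEXT_RE.finditer(text):
        yield match.group(0).rstrip(TRAILING_PUNCTUATION)
```

Being a generator, it can be consumed lazily on large texts, for instance `for url in iter_urls_in_text(tweet): ...`.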
Currently normalize_url drops trailing slashes, although it is not that uncommon for a website to serve two different things at http://www.domain.tld/path/ and http://www.domain.tld/path
So I'm not sure this normalisation should remain, or at least it could be made optional, depending on what the normalized URL is used for afterwards (for instance requesting share values on facebook ;) )
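To make the proposal concrete, here is a small sketch of what an opt-out could look like. The helper name `normalize_path_slash` and the `strip_trailing_slash` flag are hypothetical and only illustrate the idea; this is not how normalize_url is implemented.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_path_slash(url, strip_trailing_slash=True):
    """Return `url` with trailing slashes of the path removed, unless disabled.

    `strip_trailing_slash` is a hypothetical flag illustrating the opt-out.
    """
    parts = urlsplit(url)
    path = parts.path

    if strip_trailing_slash and path.endswith("/"):
        path = path.rstrip("/")

    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

With the flag set to False, http://www.domain.tld/path/ and http://www.domain.tld/path stay distinct, which is what a caller querying share counts would want.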
?wpamp
Having a function able to tell whether a given value of any type is a URL would be nice. Should there also be an option to accept URLs without a protocol?
What URL-related heuristics do we need to centralize here?
Ping @boogheta.
We also need to check the TLD, to avoid false positives as much as possible.
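A rough sketch of such a predicate, covering both the optional protocol and the TLD check, could look like this. The name `looks_like_url`, the `require_protocol` flag and the tiny `KNOWN_TLDS` sample are assumptions for illustration; a real check would rely on the full public suffix list rather than a hardcoded set.

```python
from urllib.parse import urlsplit

# Tiny illustrative sample (an assumption): a real implementation would use
# the complete public suffix list instead.
KNOWN_TLDS = {"com", "org", "net", "fr", "io"}


def looks_like_url(value, require_protocol=True):
    """Tell whether `value` (of any type) plausibly is an http(s) URL."""
    if not isinstance(value, str):
        return False

    value = value.strip()
    if not value or any(char.isspace() for char in value):
        return False

    if "://" not in value:
        if require_protocol:
            return False
        # Pretend there is a protocol so the host is parsed as such.
        value = "http://" + value

    try:
        parts = urlsplit(value)
    except ValueError:
        return False

    if parts.scheme not in ("http", "https"):
        return False

    labels = (parts.hostname or "").split(".")
    if len(labels) < 2 or not all(labels):
        return False

    # The TLD check is what rules out things like "file.notatld".
    return labels[-1] in KNOWN_TLDS
```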
It would be nice to have a helper able to display a URL in a constrained space (such as a terminal window, for instance).
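One possible approach, sketched below, is to keep the beginning and the end of the URL and replace the middle with an ellipsis. The name `squeeze_url` and its parameters are hypothetical; a smarter version might prefer to keep the domain intact and only shorten the path.

```python
def squeeze_url(url, max_length=40, ellipsis="..."):
    """Shorten `url` to at most `max_length` characters by eliding its middle."""
    if max_length <= len(ellipsis):
        raise ValueError("max_length must be larger than the ellipsis")

    if len(url) <= max_length:
        return url

    keep = max_length - len(ellipsis)
    head = (keep + 1) // 2  # the head gets the extra character when keep is odd
    tail = keep // 2

    return url[:head] + ellipsis + (url[-tail:] if tail else "")
```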
It would be nice to have a function able to extract and iterate over only the URLs pointed to by HTML a tags.
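For illustration, here is a sketch based on the standard library's HTMLParser rather than on a regular expression. The names `iter_urls_in_html` and `_AnchorCollector` are hypothetical, and this swaps in a parser-based technique that is not necessarily what the library uses.

```python
from html.parser import HTMLParser


class _AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # `value` is None for attributes without a value, e.g. <a href>.
            if name == "href" and value:
                self.hrefs.append(value.strip())


def iter_urls_in_html(html):
    """Yield the targets of <a href="..."> tags found in `html`, in order."""
    parser = _AnchorCollector()
    parser.feed(html)
    parser.close()
    yield from parser.hrefs
```

Tags such as img or script are ignored on purpose, since only links carried by a tags are wanted here.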
Traceback (most recent call last):
  File "bin/export_users_urls_lrumedia.py", line 56, in <module>
    media = trie.match(l) or ""
  File "/home/boo/.virtualenvs/gazouilloire-polarisation/lib/python2.7/site-packages/ural/lru/trie.py", line 33, in match
    stems = self.__lru_stems(url)
  File "/home/boo/.virtualenvs/gazouilloire-polarisation/lib/python2.7/site-packages/ural/lru/stems.py", line 93, in lru_stems
    return lru_stems_from_parsed_url(urlsplit(full_url), tld_aware=tld_aware)
  File "/home/boo/.virtualenvs/gazouilloire-polarisation/lib/python2.7/site-packages/ural/lru/stems.py", line 49, in lru_stems_from_parsed_url
    tld = '.'.join(domain_parts[non_zero_i:])
TypeError: 'NoneType' object has no attribute '__getitem__'
https://www.youtube.com/?gl=FR&hl=fr
It would be nice for this library to expose a CLI tool able to read lists of urls from data (such as CSV files, typically) and able to perform web/url related tasks such as:
What to do?
http://enqueteur.cgdd.developpement-durable.gouv.fr/index.php/561714?lang=fr%0D
Should we also strip? What about spaces?
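As one possible answer, the sketch below strips literal whitespace on both ends and then drops trailing percent-encoded line breaks such as the %0D above. The helper name `strip_url_whitespace` is hypothetical, and whether encoded spaces (%20) should also be dropped is left open on purpose.

```python
import re

# Trailing literal whitespace or percent-encoded CR/LF (%0D, %0A, any case).
TRAILING_JUNK_RE = re.compile(r"(?:\s|%0[AaDd])+\Z")


def strip_url_whitespace(url):
    """Remove surrounding whitespace and trailing encoded line breaks."""
    return TRAILING_JUNK_RE.sub("", url.strip())
```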
Relying on phylactery's TrieDict implementation, and with the addition of a simplified lru method, we should add a simple LRUTrie.
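To illustrate the idea without depending on phylactery, here is a self-contained sketch using nested dicts. Everything in it is an assumption made for the example: the names `simple_lru_stems` and `SimpleLRUTrie`, the "s:", "h:" and "p:" stem prefixes, and the choice to ignore query strings and fragments. The trie returns the value attached to the longest matching prefix of stems, or None.

```python
from urllib.parse import urlsplit


def simple_lru_stems(url):
    """Split `url` into reversed-URL stems: scheme, host labels from the
    TLD inward, then path segments. Query and fragment are ignored here."""
    parts = urlsplit(url)
    stems = []
    if parts.scheme:
        stems.append("s:" + parts.scheme)
    host = parts.hostname or ""
    stems.extend("h:" + label for label in reversed(host.split(".")) if label)
    stems.extend("p:" + segment for segment in parts.path.split("/") if segment)
    return stems


class SimpleLRUTrie:
    """Hypothetical prefix tree keyed by LRU stems, built on nested dicts."""

    _VALUE = object()  # sentinel key under which a node stores its value

    def __init__(self):
        self.root = {}

    def set(self, url, value):
        node = self.root
        for stem in simple_lru_stems(url):
            node = node.setdefault(stem, {})
        node[self._VALUE] = value

    def match(self, url):
        """Return the value of the longest stored prefix of `url`, or None."""
        node = self.root
        best = None
        for stem in simple_lru_stems(url):
            node = node.get(stem)
            if node is None:
                break
            if self._VALUE in node:
                best = node[self._VALUE]
        return best
```

Usage would look like `trie.set("http://www.lemonde.fr", "Le Monde")` followed by `trie.match("http://www.lemonde.fr/sport")`, which falls back to the shorter stored prefix.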
Error message
...\ural\urls_from_html.py in urls_from_html(string)
29 for a_tag in re.finditer(HTML_URL_RE, string):
30 url = a_tag.group(3)
---> 31 yield clean_link(url)
...\ural\urls_from_html.py in clean_link(link)
13 def clean_link(link):
14 """Removes leading and trailing whitespace and punctuation"""
---> 15 return link.strip("\t\r\n '\"\x0c")
16
17
AttributeError: 'NoneType' object has no attribute 'strip'
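The traceback suggests that the regex group can be None for some a tags. A defensive sketch, assuming that skipping such matches is acceptable, could look like the following; `iter_clean_links` is a hypothetical helper, and `clean_link` is reproduced from the traceback with a None guard added.

```python
def clean_link(link):
    """Removes leading and trailing whitespace and punctuation"""
    if link is None:
        # Guard added for this sketch: nothing to clean.
        return None
    return link.strip("\t\r\n '\"\x0c")


def iter_clean_links(raw_links):
    """Yield cleaned links, skipping None and links that end up empty."""
    for raw in raw_links:
        cleaned = clean_link(raw)
        if cleaned:
            yield cleaned
```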
List of websites where the error happened
Such as ], <br>
etc.
normalize_url("http://www.google.com/page.html", strip_protocol=False)
>>> "http://google.com"