ural's People

Contributors

16arpi, 2lama, ameliepelle, annacharles, bmaz, boogheta, d3scmps, elanhermi, farjasju, kat-kel, miguellaura, oubine, paubre, yomguithereal

ural's Issues

Add `extract_urls`

It would be nice to have a function able to extract and iterate over urls in the given piece of text.
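
A minimal sketch of what such a function could look like, assuming only http(s) urls matter and that a simple regex is good enough. The names `URL_IN_TEXT_RE`, `TRAILING_PUNCTUATION` and `iter_urls_in_text` are hypothetical and not part of the library:

```python
import re

# Hypothetical, deliberately simple pattern: an http(s) url runs until the
# next whitespace, angle bracket or quote character.
URL_IN_TEXT_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Punctuation that often trails a url in prose but is rarely part of it.
TRAILING_PUNCTUATION = ".,;:!?)]}"


def iter_urls_in_text(text):
    """Yield each http(s) url found in `text`, in order of appearance."""
    for match in URL_IN_TEXT_RE.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url:
            yield url
```

Being a generator, it can be iterated lazily over large texts. Urls without a protocol (such as www.domain.tld) are not detected by this sketch.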

l.facebook.com

(?un)handle trailing slashes ?

Currently normalize_url drops trailing slashes, although it is not uncommon for a website to serve two different things at http://www.domain.tld/path/ and http://www.domain.tld/path

So I'm not sure this normalization should remain. At the very least it could be made optional, depending on what the normalized url is used for afterwards (for instance requesting share values on facebook ;) )
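
To make the idea of an opt-out concrete, here is a sketch under stated assumptions. It only handles the trailing slash and ignores everything else normalize_url does. The function name `normalize_path_slash` and the `strip_trailing_slash` flag are hypothetical, and the code targets Python 3's `urllib.parse`:

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_path_slash(url, strip_trailing_slash=True):
    """Return `url` with trailing slashes removed from its path, unless disabled."""
    parts = urlsplit(url)
    path = parts.path
    # Hypothetical flag: callers that need the exact path can switch this off.
    if strip_trailing_slash and path.endswith("/"):
        path = path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

Keeping the flag on by default would preserve the current behavior, while a caller querying share counts could pass `strip_trailing_slash=False`.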

Additional AMP shenanigans

Slight naming change

@boogheta @farjasju:

Should we rename extract_urls and extract_urls_from_html to urls_from_text and urls_from_html? And later add links_from_html, yielding each link's title & url as a tuple?
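
For discussion, a rough sketch of what the proposed links_from_html could return, built on the standard library's `html.parser` rather than on the library's own regex. It assumes "title" means the visible text of the anchor (not its `title` attribute), and the `_LinkCollector` class is hypothetical:

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects (title, url) tuples for every <a> tag carrying an href."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # href is None when the attribute is missing or has no value.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            title = " ".join("".join(self._text).split())
            self.links.append((title, self._href))
            self._href = None


def links_from_html(html):
    """Return a list of (title, url) tuples found in `html`."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```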

Add `is_url`

It would be nice to have a function able to tell whether a given value of any type is a url. Should there also be an option to accept urls without a protocol?
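
One possible shape for this, as a heuristic sketch only. The name `looks_like_url` and the `require_protocol` option are hypothetical. It assumes that only http(s) urls count and that a plausible host contains at least one dot:

```python
from urllib.parse import urlsplit


def looks_like_url(value, require_protocol=True):
    """Heuristically tell whether `value` is an http(s) url."""
    if not isinstance(value, str):
        return False
    candidate = value.strip()
    if not candidate or any(char.isspace() for char in candidate):
        return False
    if "://" not in candidate:
        if require_protocol:
            return False
        # Assumption: a url without protocol is treated as http.
        candidate = "http://" + candidate
    try:
        parts = urlsplit(candidate)
    except ValueError:
        return False
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname or ""
    # Assumption: a plausible host has at least one dot (domain.tld, 127.0.0.1).
    return "." in host
```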

Roadmap

What URL-related heuristics do we need to centralize here?

Ping @boogheta.

Add `hide_url_overflow` helper

It would be nice to have a helper able to display a url in a squeezed space (such as a terminal window for instance).
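
A possible starting point, assuming that cutting the middle of the url is acceptable since both the domain and the end of the path usually matter. The name `squeeze_url` and its parameters are hypothetical:

```python
def squeeze_url(url, max_length=60, ellipsis="..."):
    """Shorten `url` to at most `max_length` characters by eliding its middle."""
    if max_length <= len(ellipsis):
        raise ValueError("max_length must be greater than the ellipsis length")
    if len(url) <= max_length:
        return url
    keep = max_length - len(ellipsis)
    head = (keep + 1) // 2  # the start of the url gets the extra character
    tail = keep // 2
    # Guard against url[-0:], which would return the whole string.
    return url[:head] + ellipsis + (url[-tail:] if tail else "")
```

For a terminal, `max_length` could be derived from `shutil.get_terminal_size().columns`.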

LRUTrie.match crashes on IP urls

Example:
http://127.0.0.1/economie/2019/01/08/un-journaliste-poursuit-richard-ferrand-pour-lavoir-bloque-sur-twitter/

Traceback (most recent call last):
  File "bin/export_users_urls_lrumedia.py", line 56, in <module>
    media = trie.match(l) or ""
  File "/home/boo/.virtualenvs/gazouilloire-polarisation/lib/python2.7/site-packages/ural/lru/trie.py", line 33, in match
    stems = self.__lru_stems(url)
  File "/home/boo/.virtualenvs/gazouilloire-polarisation/lib/python2.7/site-packages/ural/lru/stems.py", line 93, in lru_stems
    return lru_stems_from_parsed_url(urlsplit(full_url), tld_aware=tld_aware)
  File "/home/boo/.virtualenvs/gazouilloire-polarisation/lib/python2.7/site-packages/ural/lru/stems.py", line 49, in lru_stems_from_parsed_url
    tld = '.'.join(domain_parts[non_zero_i:])
TypeError: 'NoneType' object has no attribute '__getitem__'
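
The TypeError suggests that the domain splitting yields None when the host is an IP address. One possible guard, sketched here with Python 3's standard library and not taken from the library's code, is to detect IP hosts before attempting any tld-aware splitting. The helper name `host_is_ip_address` is hypothetical:

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip_address(url):
    """Return True when the host part of `url` is an IPv4 or IPv6 address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True
```

A caller could then treat the whole IP as a single host stem instead of looking for a tld.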

canonicalize/normalize/fingerprint

  • canonicalize = browser will see those variants as the same
  • normalize = those variants are the same permalink
  • key = those keys are not urls anymore but are useful to compute statistical aggregates

The CLI

It would be nice for this library to expose a CLI tool able to read lists of urls from data files (typically CSV) and to perform web/url-related tasks such as:

  • normalize the urls
  • ensure or strip the protocols
  • extract domain/subdomain/tld/...
  • fetch html
  • fetch json (maybe with some way to query inside)
  • scrape?
  • get FB like button data
  • use dragnet to extract content
  • etc.

cc @farjasju @boogheta
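
To give an idea of the overall shape, here is a sketch covering a single task from the list above (extracting the hostname), using only `argparse` and `csv`. The command layout, the `hostname` output column and the function names are assumptions, not an existing interface:

```python
import argparse
import csv
import sys
from urllib.parse import urlsplit


def hostname_of(url):
    """Return the lowercased hostname of `url`, or "" when none can be found."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # assumption: a missing protocol means http
    try:
        return urlsplit(url).hostname or ""
    except ValueError:
        return ""


def process(infile, outfile, column):
    """Copy the CSV from `infile` to `outfile`, appending a `hostname` column."""
    reader = csv.DictReader(infile)
    fieldnames = list(reader.fieldnames or []) + ["hostname"]
    writer = csv.DictWriter(
        outfile, fieldnames=fieldnames, lineterminator="\n", extrasaction="ignore"
    )
    writer.writeheader()
    for row in reader:
        row["hostname"] = hostname_of(row.get(column) or "")
        writer.writerow(row)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Append the hostname of each url.")
    parser.add_argument("column", help="name of the CSV column containing urls")
    parser.add_argument(
        "file", nargs="?", type=argparse.FileType("r"), default=sys.stdin
    )
    args = parser.parse_args(argv)
    process(args.file, sys.stdout, args.column)
```

Calling `main()` from a script's entry point would let it read a file or stdin and write to stdout, so it can be piped with other tools. The other tasks could become subcommands following the same pattern.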

The LRUTrie

Relying on phylactery's TrieDict implementation, and with the addition of a simplified lru method, we should add a simple LRUTrie.
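
As a rough illustration of the idea, here is a sketch that uses plain nested dicts instead of phylactery's TrieDict (whose API is not assumed here). The stems function is a simplification: reversed host labels followed by path segments, ignoring protocol, port, query and fragment. `simple_lru_stems` and `SimpleLRUTrie` are hypothetical names:

```python
from urllib.parse import urlsplit


def simple_lru_stems(url):
    """Return reversed host labels followed by the non-empty path segments."""
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    stems = list(reversed(host.split("."))) if host else []
    stems.extend(segment for segment in parts.path.split("/") if segment)
    return stems


class SimpleLRUTrie:
    """Maps url prefixes to values and finds the longest matching prefix."""

    _VALUE = object()  # sentinel key marking a node that holds a value

    def __init__(self):
        self._root = {}

    def set(self, url, value):
        node = self._root
        for stem in simple_lru_stems(url):
            node = node.setdefault(stem, {})
        node[self._VALUE] = value

    def match(self, url):
        """Return the value of the longest registered prefix of `url`, or None."""
        node = self._root
        best = node.get(self._VALUE)
        for stem in simple_lru_stems(url):
            node = node.get(stem)
            if node is None:
                break
            if self._VALUE in node:
                best = node[self._VALUE]
        return best
```

Note that in this sketch a `www` label is a stem like any other, so http://www.domain.tld/blog only matches a prefix registered for domain.tld, not one registered for domain.tld/blog.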

Error with urls_from_html() on some websites

Error message

...\ural\urls_from_html.py in urls_from_html(string)
     29     for a_tag in re.finditer(HTML_URL_RE, string):
     30         url = a_tag.group(3)
---> 31         yield clean_link(url)

...\ural\urls_from_html.py in clean_link(link)
     13 def clean_link(link):
     14     """Removes leading and trailing whitespace and punctuation"""
---> 15     return link.strip("\t\r\n '\"\x0c")
     16 
     17 

AttributeError: 'NoneType' object has no attribute 'strip'
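
A likely cause, though this is an assumption since HTML_URL_RE is not shown here, is that the pattern has several alternative capture groups and group 3 did not take part in the match, in which case `group(3)` returns None. The sketch below uses a hypothetical stand-in pattern (`HREF_RE`) to show one way of picking whichever group matched and skipping empty results:

```python
import re

# Hypothetical stand-in for HTML_URL_RE: the href value lands in group 1
# (double-quoted), group 2 (single-quoted) or group 3 (unquoted).
HREF_RE = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def clean_link(link):
    """Removes leading and trailing whitespace and quote characters."""
    return link.strip("\t\r\n '\"\x0c")


def iter_hrefs(html):
    """Yield the cleaned href of each <a> tag, whatever its quoting style."""
    for match in HREF_RE.finditer(html):
        # Only one alternative participates in a match; the others are None.
        url = next((group for group in match.groups() if group is not None), None)
        if url is None:
            continue
        url = clean_link(url)
        if url:
            yield url
```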

List of websites where the error happened
