ko-zu / psl
publicsuffixlist for python
License: Mozilla Public License 2.0
When running publicsuffixlist-download --help on Windows, I get the following error:
Error:
Traceback (most recent call last):
  File "C:\bld\publicsuffixlist_1675107463020\_test_env\Scripts\publicsuffixlist-download-script.py", line 9, in <module>
    sys.exit(updatePSL())
             ^^^^^^^^^^^
  File "C:\bld\publicsuffixlist_1675107463020\_test_env\Lib\site-packages\publicsuffixlist\update.py", line 41, in updatePSL
    os.rename(psl_file + ".swp", psl_file)
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\bld\\publicsuffixlist_1675107463020\\_test_env\\Lib\\site-packages\\publicsuffixlist\\public_suffix_list.dat.swp' -> 'C:\\bld\\publicsuffixlist_1675107463020\\_test_env\\Lib\\site-packages\\publicsuffixlist\\public_suffix_list.dat'
Logs: https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=650026&view=logs&j=3ff94dba-189a-527c-65e3-ce8503824159&t=35acf2bd-66a8-5b9f-4368-b52d351bfcc2
Context: conda-forge/staged-recipes#21906
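The traceback points at os.rename, which raises FileExistsError on Windows when the destination already exists. Below is a minimal sketch of one possible fix, assuming the updater first writes the download to a .swp file: os.replace overwrites an existing destination on both POSIX and Windows. The helper name install_downloaded_list and the file contents are hypothetical and are not taken from update.py.

```python
import os
import tempfile


def install_downloaded_list(psl_file: str, data: bytes) -> None:
    """Write data to a temporary .swp file, then move it over psl_file."""
    swap_file = psl_file + ".swp"
    with open(swap_file, "wb") as f:
        f.write(data)
    # os.replace overwrites an existing destination on every platform,
    # whereas os.rename fails on Windows if the destination exists.
    os.replace(swap_file, psl_file)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "public_suffix_list.dat")
        install_downloaded_list(target, b"old list\n")
        # The destination exists now; this second call is the failing case.
        install_downloaded_list(target, b"new list\n")
        with open(target, "rb") as f:
            print(f.read())
```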
If there is a rule like:
*.abc.com
I would expect that, given substuf.def.abc.com, the public suffix should be def.abc.com.
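For reference, here is a minimal sketch of wildcard matching as I read the algorithm published on publicsuffix.org: a `*` stands for exactly one label, the longest matching rule wins, and the private suffix is the public suffix plus one more label. The function name split_suffix and the two-rule set are illustrative assumptions, exception rules are left out, and this is not the library's implementation.

```python
from typing import Optional, Set, Tuple


def split_suffix(domain: str, rules: Set[str]) -> Tuple[str, Optional[str]]:
    """Return (public_suffix, private_suffix) for domain under rules.

    Only plain and wildcard rules are handled; exception rules ("!...")
    are left out of this sketch.
    """
    labels = domain.lower().split(".")
    # Default rule "*": the last label alone is a public suffix.
    suffix_len = 1
    for n in range(1, len(labels) + 1):
        candidate = labels[-n:]
        exact = ".".join(candidate)
        # A wildcard replaces exactly one label, the leftmost of candidate.
        wildcard = ".".join(["*"] + candidate[1:])
        if exact in rules or (n > 1 and wildcard in rules):
            suffix_len = max(suffix_len, n)
    public = ".".join(labels[-suffix_len:])
    if len(labels) == suffix_len:
        # The whole name is a public suffix, so there is no private suffix.
        return public, None
    private = ".".join(labels[-(suffix_len + 1):])
    return public, private


rules = {"com", "*.abc.com"}  # hypothetical rule set
print(split_suffix("substuf.def.abc.com", rules))
print(split_suffix("def.abc.com", rules))
```

Under this reading, substuf.def.abc.com has the public suffix def.abc.com, and a name that matches the wildcard exactly (def.abc.com) has no private suffix at all.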
from publicsuffixlist import PublicSuffixList
# RULES TESTED:
# *.awdev.ca
# *.advisor.ws
#
# *.compute.amazonaws.com
# *.compute-1.amazonaws.com
# *.compute.amazonaws.com.cn
#
# *.elb.amazonaws.com
# *.elb.amazonaws.com.cn
psl = PublicSuffixList()
input = [
    'test.awdev.ca',
    'test.advisor.ws',
    'test.compute.amazonaws.com',
    'test.compute-1.amazonaws.com',
    'test.compute.amazonaws.com.cn',
    'test.elb.amazonaws.com',
    'test.amazonaws.com.cn',
    # add another level and it gets weird
    'sub.test.awdev.ca',
    'sub.test.advisor.ws',
    'sub.test.compute.amazonaws.com',
    'sub.test.compute-1.amazonaws.com',
    'sub.test.compute.amazonaws.com.cn',
    'sub.test.elb.amazonaws.com',
    'sub.test.amazonaws.com.cn',
]
output = [(i, psl.privatesuffix(i)) for i in input]
for t in output:
    print(f'{t[0]} -> {t[1]}')
Output from the run:
test.awdev.ca -> None
test.advisor.ws -> None
test.compute.amazonaws.com -> None
test.compute-1.amazonaws.com -> None
test.compute.amazonaws.com.cn -> None
test.elb.amazonaws.com -> None
test.amazonaws.com.cn -> amazonaws.com.cn
sub.test.awdev.ca -> sub.test.awdev.ca
sub.test.advisor.ws -> sub.test.advisor.ws
sub.test.compute.amazonaws.com -> sub.test.compute.amazonaws.com
sub.test.compute-1.amazonaws.com -> sub.test.compute-1.amazonaws.com
sub.test.compute.amazonaws.com.cn -> sub.test.compute.amazonaws.com.cn
sub.test.elb.amazonaws.com -> sub.test.elb.amazonaws.com
sub.test.amazonaws.com.cn -> amazonaws.com.cn
I would have expected the first set to return the domains unchanged and the second set to return the name without the leading sub. label.
In either case, the behavior is inconsistent for 2 reasons:
Hello,
Thank you for this project. If I understand the purpose of these methods correctly, I believe the parser is pulling the incorrect information. It's my understanding that eTLDs (effective top-level domains), under which an organization can register a private domain, such as ".com" and "com.uk", are "public suffixes". The domains that someone registers there, such as "google.com" and "google.com.uk", would both be "private suffixes". However, that isn't what the tool produces.
>>> PublicSuffixList().is_private("com.uk")  # <- should be False
True
>>> PublicSuffixList().is_private("com")
False
Furthermore if I try to retrieve the public and private suffixes I get incorrect data as well.
>>> PublicSuffixList().publicsuffix("google.com.uk")  # <- should be com.uk
'uk'
>>> PublicSuffixList().privatesuffix("google.com.uk")  # <- should be google.com.uk
'com.uk'
>>> PublicSuffixList().privatesuffix("google.com")  # <- should be google.com and is correct
'google.com'
psl.is_public() is broken for upper case input with 2 or more labels.
psl.is_public("Jp") # => True
psl.is_public("Co.jp") # => False
TLD-only domains unintentionally return the right value. Related to #20.
I recently pulled the new release of the library into my project and had some difficulty with API changes, to the privatesuffix() method in particular. My project's existing code, which used the old version, passed in a str hostname and relied on getting an Optional[str] back, handling the result like this (as an isolated example):
r = self.psl.privatesuffix(hostname)
return r if r else hostname
The updated version of privatesuffix() now takes in a RelaxDomain (i.e., Union[str, BytesTuple, Iterable]) and returns an Optional[Domain] (i.e., Optional[Union[str, BytesTuple]]). Looking over the code, my understanding is that this extends the method with a new specialization that additionally takes in a BytesTuple or Iterable and in turn outputs a BytesTuple, while the "original" str version still exists (that is, if you pass in a str, you effectively get out an Optional[str]). This means that our calling code now does:
r = self.psl.privatesuffix(hostname)
return r if isinstance(r, str) else hostname
This isn't too bad, but it does mean we're making some assumptions about the internal implementation of this method, where the API/types contract is a bit opaque (there's no type-system guarantee that str in means str out).
If this was instead implemented as method overloads using @typing.overload, the type system would know that the str in type was explicitly connected to the str out type, like this:
@overload
def privatesuffix(self, domain: str, ...) -> Optional[str]: ...
@overload
def privatesuffix(self, domain: BytesTuple, ...) -> Optional[BytesTuple]: ...
def privatesuffix(self, domain, ...) -> Optional[str] | Optional[BytesTuple]:
    # actual implementation here
If this was something that seemed desirable for this library, I'm happy to try working on a PR, but wanted to discuss before doing that (and if it was desirable I'd want to discuss how far to extend this pattern throughout the library). I also understand if the additional complexity doesn't seem warranted. Ultimately the downstream burden we have for this is pretty minimal :-)
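To make the proposal concrete, here is a runnable sketch of the overload pattern. SuffixFinder is a hypothetical stand-in for PublicSuffixList with a toy rule (every single-label TLD is a public suffix), and the BytesTuple alias is an assumption that mirrors the name used above; none of this is the library's actual code.

```python
from typing import Optional, Tuple, Union, overload

BytesTuple = Tuple[bytes, ...]  # assumed alias, mirroring the name used above


class SuffixFinder:
    """Hypothetical stand-in for PublicSuffixList with a toy rule set."""

    @overload
    def privatesuffix(self, domain: str) -> Optional[str]: ...
    @overload
    def privatesuffix(self, domain: BytesTuple) -> Optional[BytesTuple]: ...

    def privatesuffix(
        self, domain: Union[str, BytesTuple]
    ) -> Union[str, BytesTuple, None]:
        # Toy rule: every single-label TLD is a public suffix, so the
        # private suffix is simply the last two labels.
        if isinstance(domain, str):
            labels = domain.split(".")
            return ".".join(labels[-2:]) if len(labels) >= 2 else None
        return tuple(domain[-2:]) if len(domain) >= 2 else None


finder = SuffixFinder()
hostname = "www.example.com"
r = finder.privatesuffix(hostname)  # a type checker sees Optional[str] here
print(r if r else hostname)
```

With the overloads in place, the caller can go back to `r if r else hostname` without an isinstance check, because the checker ties the str input to the str result.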
In version 1.0.0, the tuple of bytes input matches the list if the bytes are valid UTF-8.
# custom psl rule to demo
psl = PublicSuffixList("例.example")
psl.publicsuffix("例.example") # "例.example"
psl.publicsuffix("xn--fsq.example") # "xn--fsq.example"
psl.publicsuffix((b"xn--fsq", b"example")) # (b"xn--fsq", b"example")
# UTF-8 binary of "例" does match, but it should not.
psl.publicsuffix((b"\xe4\xbe\x8b", b"example")) # (b"\xe4\xbe\x8b", b"example")
Expected behavior should be:
# b"\xe4\xbe\x8b" should not match b"xn--fsq". Only its level 1 tld should match.
psl.publicsuffix((b"\xe4\xbe\x8b", b"example")) # (b"example",)
The last case should not match in its entirety since the bytes object does not contain its encoding information. We should evaluate the binary input as-is, except for the ASCII case conversion defined in the evaluation rule.
This can be problematic if the encoding of arbitrary input cannot be enforced and/or the input must be decoded from bytes to str using punycode. Assuming UTF-8 is incorrect in this context.
In cases where evaluating binary as UTF-8 is required, the callers should encode the input to punycoded bytes tuples, as pspacesk commented in #29.
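A caller that knows its input is UTF-8 could do that conversion before calling the library. Here is a minimal sketch using Python's built-in "idna" codec (which implements IDNA 2003; other IDNA variants may encode some names differently). The helper names are hypothetical.

```python
from typing import Tuple


def to_punycode_labels(domain: str) -> Tuple[bytes, ...]:
    """Split a Unicode domain into ASCII-compatible (punycoded) byte labels."""
    # ASCII labels pass through unchanged; non-ASCII labels are encoded
    # and get the "xn--" prefix.
    return tuple(label.encode("idna") for label in domain.rstrip(".").split("."))


def utf8_labels_to_punycode(labels: Tuple[bytes, ...]) -> Tuple[bytes, ...]:
    """Re-encode byte labels that the caller knows to be UTF-8."""
    return tuple(label.decode("utf-8").encode("idna") for label in labels)


print(to_punycode_labels("例.example"))
print(utf8_labels_to_punycode((b"\xe4\xbe\x8b", b"example")))
```

Both calls yield (b"xn--fsq", b"example"), which can then be passed to publicsuffix() as a bytes tuple.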
0.10.0.20240525
DNS encoding for "weird" names is not handled.
from publicsuffixlist import PublicSuffixList
psl = PublicSuffixList()
print(psl.privatesuffix("www.exa\\.mple.com"))
mple.com
The result "mple.com" is wrong because the \. sequence is not a label separator. It is an escaped dot that is part of the exa\.mple label. The correct return value should thus be exa\.mple.com.
Currently this library handles DNS names as strings. This does not match the DNS definition of names: DNS names are defined as a sequence of labels, and individual labels can contain arbitrary binary data on the wire. "Unusual" bytes are then encoded with \ escape sequences when presented in text format.
I am processing real traffic from traffic captures. It contains many weird names that require escaping, and the current string-based processing produces incorrect results for them.
Extend the current API to accept a tuple of labels instead of a string. In that case it is the caller's responsibility to do the right thing, and if software is reading data from PCAP files it is actually easier to pass the labels than to construct an escaped string from them and then have it decoded once again in the publicsuffixlist library.
An alternative would be to implement full decoding of DNS names, but I think that is more work and would perform worse for my use case.
\X where X is any character other than a digit (0-9), is
used to quote that character so that its special meaning
does not apply. For example, "\." can be used to place
a dot character in a label.
\DDD where each D is a digit is the octet corresponding to
the decimal number described by DDD. The resulting
octet is assumed to be text and is not checked for
special meaning.
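The two escaping rules quoted above can be sketched as a small parser that turns a presentation-format name into a tuple of raw byte labels. The function name split_dns_name is hypothetical, and encoding unescaped non-ASCII characters as UTF-8 is an assumption of this sketch rather than something the quoted text specifies.

```python
from typing import List, Tuple

DIGITS = "0123456789"


def split_dns_name(name: str) -> Tuple[bytes, ...]:
    """Split a presentation-format DNS name into raw byte labels."""
    labels: List[bytes] = []
    current = bytearray()
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == "\\":
            if i + 1 >= len(name):
                raise ValueError("dangling backslash")
            nxt = name[i + 1]
            if nxt in DIGITS:
                # \DDD: three decimal digits giving one octet.
                ddd = name[i + 1:i + 4]
                if len(ddd) != 3 or any(c not in DIGITS for c in ddd):
                    raise ValueError("bad \\DDD escape")
                value = int(ddd)
                if value > 255:
                    raise ValueError("\\DDD escape out of range")
                current.append(value)
                i += 4
            else:
                # \X: the character is taken literally, even a dot.
                current.extend(nxt.encode("utf-8"))
                i += 2
        elif ch == ".":
            # An unescaped dot ends the current label.
            labels.append(bytes(current))
            current = bytearray()
            i += 1
        else:
            current.extend(ch.encode("utf-8"))
            i += 1
    if current:
        labels.append(bytes(current))
    return tuple(labels)


print(split_dns_name("www.exa\\.mple.com"))
```

For the name from the report this gives three labels, the middle one being b"exa.mple", so a label-based lookup would no longer split it at the escaped dot.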
This problem was encountered by other people and there was a proposal to integrate PSL matching into a DNS-aware library dnspython:
rthalley/dnspython#1082
I think it would be better if we could get this improved in publicsuffixlist itself. What do you think?
It'd be nice to not have to rely on a person to get the latest updates.
Thank you for writing publicsuffixlist!
For those of us using buildout or other non-wheel-aware installers (or at least for me) it would be convenient to have an sdist available on PyPI. Could I bother you to upload one?
I reside in UTC+03. When I use the update.py script:
>>> import time
>>> from email.utils import parsedate
>>> lastmod = "Thu, 28 May 2020 16:40:36 GMT"
>>> parsedate(lastmod)
(2020, 5, 28, 16, 40, 36, 0, 1, -1) # <-- ok
>>> time.mktime(parsedate(lastmod))
1590673236.0 # <-- not ok! 3 hours offset (caused by my TZ)
# 1590673236 is "Thursday, May 28, 2020 1:40:36 PM GMT" (notice the 3 hours difference, caused by my TZ)
# should be 1590684036
A resolution for this is to replace time.mktime() with calendar.timegm().
Reference: https://docs.python.org/3/library/time.html#index-4
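A minimal sketch of the proposed fix, using only the standard library. The helper name http_date_to_epoch is hypothetical; the date string is the one from the session above.

```python
import calendar
from email.utils import parsedate


def http_date_to_epoch(value: str) -> int:
    """Convert an HTTP date header (always GMT) to a Unix timestamp."""
    parsed = parsedate(value)
    if parsed is None:
        raise ValueError("unparseable date: %r" % value)
    # calendar.timegm treats the time tuple as UTC; time.mktime would
    # interpret it in the local timezone and shift the result.
    return calendar.timegm(parsed)


print(http_date_to_epoch("Thu, 28 May 2020 16:40:36 GMT"))  # 1590684036
```

The result no longer depends on the timezone of the machine running the update.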
publicsuffix() in 0.7.14 returns a non-lowercased suffix for TLDs.
psl = publicsuffixlist.PublicSuffixList()
psl.publicsuffix("example.COM") # => "com"
psl.publicsuffix("COM") # => "COM"
The shortcut code path for TLD-only domains should return the lowercased value for consistency.
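One way to keep both code paths consistent is to lowercase once at the entry point, before any shortcut is taken. The following is a toy sketch of that idea with a hypothetical rule set; it is not the library's lookup code and ignores wildcard and exception rules.

```python
from typing import Set


def publicsuffix(domain: str, rules: Set[str]) -> str:
    """Toy lookup that lowercases once, before any shortcut is taken."""
    normalized = domain.lower()
    labels = normalized.split(".")
    if len(labels) == 1:
        # TLD-only shortcut: return the normalized value, not the raw input.
        return normalized
    # Try the longest candidate first so the longest rule wins.
    for n in range(len(labels), 0, -1):
        candidate = ".".join(labels[-n:])
        if candidate in rules:
            return candidate
    return labels[-1]


rules = {"com", "jp", "co.jp"}  # hypothetical rule set
print(publicsuffix("COM", rules))
print(publicsuffix("example.COM", rules))
print(publicsuffix("Co.jp", rules))
```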
Try this example, as I believe there is a bug with a dash. Note that all I did was change "compute-1" to "compute1" and then it works as expected.
>>> from publicsuffixlist import PublicSuffixList
>>> psl = PublicSuffixList()
>>> psl.privatesuffix('ec2-107-21-74-29.compute-1.amazonaws.com')
>>> psl.publicsuffix('ec2-107-21-74-29.compute-1.amazonaws.com')
'ec2-107-21-74-29.compute-1.amazonaws.com'
>>> psl.publicsuffix('ec2-107-21-74-29.compute1.amazonaws.com')
'com'
>>> psl.privatesuffix('ec2-107-21-74-29.compute1.amazonaws.com')
'amazonaws.com'
This is an important notice for users.
In the upcoming version 1.0.0, support for Python 2.7 and 3.4 will be discontinued. Version 0.10.x (or auto-released versions with the .yyyymmdd suffix) will be the last to support Python 2.7.
The minimum requirement for new versions will be Python 3.5 or later.
The new version will include type hinting to enhance API stability. The updated code is currently available in the devel branch.
https://github.com/ko-zu/psl/tree/devel
If you know of any users still relying on this module with Python 2.7, please comment here.
On PyPI, version 0.6.1 has been published, but the repo on GitHub is still at version 0.6.0.
Hi,
I ran a test on the hostname "ec2-100-24-188-149.compute-1.amazonaws.com" and was expecting it to return amazonaws.com.
But I get None in return.
def test_amazonaws(self):
    self.assertEqual(self.psl.privatesuffix("ec2-100-24-188-149.compute-1.amazonaws.com"), "amazonaws.com")
'amazonaws.com' != None
Expected :None
Actual :'amazonaws.com'
If I remove the leading ec2-... label, I get the correct result:
def test_amazonaws(self):
    self.assertEqual(self.psl.privatesuffix("compute-1.amazonaws.com"), "amazonaws.com")
PASSED [100%]
Process finished with exit code 0
In https://publicsuffix.org/list/public_suffix_list.dat I can see *.compute-1.amazonaws.com.
Should the first one not match?
Could you please tag the source? This allows distributions to get the complete source from GitHub if they want.
Thanks
Calling privatesuffix("something.com.mx") returns "com.mx".
cloudfront.net is a public suffix and belongs to Amazon.
But before that suffix was registered, Amazon already owned the domain cloudfront under the TLD .net.
So it is confusing to determine the root domain of *.cloudfront.net.
Examples:
In [164]: ps.privatesuffix('d2os3n5ieuk9g5.cloudfront.net')
Out[164]: 'd2os3n5ieuk9g5.cloudfront.net'
In [165]: ps.privatesuffix('a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net')
Out[165]: 'tlv50-c1.cloudfront.net'
We know every root domain has an NS record, so let's check:
dig d2os3n5ieuk9g5.cloudfront.net NS
; <<>> DiG 9.10.6 <<>> d2os3n5ieuk9g5.cloudfront.net NS
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1735
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 3
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;d2os3n5ieuk9g5.cloudfront.net. IN NS
;; ANSWER SECTION:
d2os3n5ieuk9g5.cloudfront.net. 830 IN NS ns-1961.awsdns-53.co.uk.
d2os3n5ieuk9g5.cloudfront.net. 830 IN NS ns-1525.awsdns-62.org.
d2os3n5ieuk9g5.cloudfront.net. 830 IN NS ns-765.awsdns-31.net.
d2os3n5ieuk9g5.cloudfront.net. 830 IN NS ns-224.awsdns-28.com.
;; ADDITIONAL SECTION:
ns-1961.awsdns-53.co.uk. 2488 IN A 205.251.199.169
ns-1525.awsdns-62.org. 8341 IN A 205.251.197.245
;; Query time: 36 msec
;; SERVER: 10.95.44.53#53(10.95.44.53)
;; WHEN: Wed Sep 09 13:12:46 CST 2020
;; MSG SIZE rcvd: 227
dig tlv50-c1.cloudfront.net NS
; <<>> DiG 9.10.6 <<>> tlv50-c1.cloudfront.net NS
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 868
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;tlv50-c1.cloudfront.net. IN NS
;; AUTHORITY SECTION:
cloudfront.net. 59 IN SOA ns-418.awsdns-52.com. hostmaster.cloudfront.net. 1377556270 16384 2048 1048576 60
;; Query time: 1018 msec
;; SERVER: 10.95.44.53#53(10.95.44.53)
;; WHEN: Wed Sep 09 13:13:56 CST 2020
;; MSG SIZE rcvd: 119
nslookup a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net 8.8.8.8
Server: 8.8.8.8
Address: 8.8.8.8#53
Non-authoritative answer:
Name: a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net
Address: 13.226.6.197
Name: a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net
Address: 13.226.6.231
Name: a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net
Address: 13.226.6.22
Name: a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net
Address: 13.226.6.45
tlv50-c1.cloudfront.net has no NS record, but a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net has an A record,
so the root domain of a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net is cloudfront.net.
Hi,
in the readme you link to https://mxr.mozilla.org/mozilla-central/source/netwerk/test/unit/data/test_psl.txt?raw=1 but that no longer exists.
The domain does not resolve anymore.