Comments (29)
Hi @lewisje
I've thought about this.
I occasionally find myself eyeballing various regions of the hosts file, for various reasons.
It seems much easier to scan a single column.
If we go to multiple hosts per line, I think I would keep it to 80-100 columns wide, or thereabouts, which would certainly impose a constraint of fewer than nine hosts per line.
Know what interests me greatly? Metrics for the performance of hosts files as a function of orthogonal factors such as:
- 0.0.0.0 vs 127.0.0.1
- How file length (number of lines) affects load and parse performance.
- The degree that multi-hosts per line helps, as seems reasonable to presume.
So far I've anecdotally seen few benefits, one way or another. The hosts file lookup appears to sit high enough in the latency stack that it's maybe not worth fretting about?
Either way, I'm curious to know.
from hosts.
I think I should figure out how to measure this precisely, but I know that when I run `ipconfig /displaydns` on my Windows machine (to force-map the hostnames in the local DNS cache), it takes less time with multiple hostnames per line than with one, even if I suppress output (just printing the output often takes a lot of time with long-running commands).
I'm thinking this suggestion is more akin to delivering a minified JS file for wide-scale Web deployment while retaining a properly spaced-out JS file for development.
@StevenBlack wrote:
Know what interests me greatly? Metrics for the performance of host files as a function of orthogonal factors such as...
- 0.0.0.0 vs 127.0.0.1
- How file length (number of lines) affects load and parse performance.
- The degree that multi-hosts per line helps, as seems reasonable to presume.
That will be extremely valuable information if anyone performs the testing. I'm amazed that detailed tests have not already been publicly documented. Cross-platform testing is essential, and will enhance the value of the data even further.
Hey guys, I ran some short tests. First of all, it's important to mention that I did NOT do any statistically evaluable stuff here. Just one try for every test case. No repetition - just a "let's see where this could possibly lead" thingy.
System
- Router: Archer C7 v1
- Router OS: OpenWrt BB
- Router DNS: dnsmasq
- The router contains the used hosts file.
- Client: Windows 7 desktop
- Software: Cygwin for Linux tools on Windows
- Connection: wired gigabit Ethernet
Test Case
- Router: flush DNS cache
- Windows: run `time nslookup $WEBSITE` to get the response time (uncached)
- Windows: run `time nslookup $WEBSITE` again to get the response time (cached)
Remote DNS-Server is 85.214.20.141 (https://digitalcourage.de/support/zensurfreier-dns-server)
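As a rough sketch of that measurement loop (a hypothetical harness, not the exact commands used above): `getent hosts` stands in for `nslookup $WEBSITE` so the snippet runs without the router setup, and GNU date's `%N` provides sub-second resolution.

```shell
#!/bin/sh
# Hypothetical sketch of the timing step above. On the real setup you would
# flush the router's DNS cache first and call `nslookup $WEBSITE` instead of
# the local `getent hosts` stand-in used here.
WEBSITE=localhost
start=$(date +%s%N)                      # nanoseconds (GNU date)
getent hosts "$WEBSITE" > /dev/null      # the lookup being timed
end=$(date +%s%N)
echo "lookup took $(( (end - start) / 1000000 )) ms"
```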
Results
I used a hosts file with 355,981 entries. This is a 0.0.0.0-only file - no ::1 entries.
S = single entry (one host per line) - size 11 MB
N = 9 hosts per line - size 8.4 MB
Unblocked Sites
Site | S uncached (s) | S cached (s) | N uncached (s) | N cached (s) |
---|---|---|---|---|
github.com | 0.102 | 0.051 | 0.099 | 0.074 |
openwrt.org | 0.095 | 0.054 | 0.105 | 0.055 |
imgur.com | 0.094 | 0.054 | 0.083 | 0.054 |
Blocked Sites
Site | S uncached (s) | S cached (s) | N uncached (s) | N cached (s) |
---|---|---|---|---|
google-analytics.com | 0.059 | 0.059 | 0.060 | 0.057 |
zzzha.com | 0.057 | 0.051 | 0.056 | 0.054 |
Note: For this case I added the ::1 entries for google-analytics.com and zzzha.com, so the AAAA request doesn't get forwarded.
Single entry to nine entries per line conversion - Bash script
I wrote a short script so you can try it yourself. It takes the input hosts file as its argument and writes the file hosts_nine.

```bash
#!/bin/bash
echo "127.0.0.1 localhost" > hosts_nine
grep "^0" "$1" | sed 's/0\.0\.0\.0//g' | tr -d "\n" \
  | egrep -o '\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+' \
  | sed 's/^/0\.0\.0\.0 /g' >> hosts_nine
```

NOTE: There will be 0-8 entries missing in the generated file. With a base file of 300,000+ entries this is "okay" for testing purposes, I hope. This behaviour is a result of "let's not put too much time into this and live with the bias". The problem is the egrep expression: if the number of entries at the end of the file is not an exact multiple of nine, the remainder is dismissed.
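For what it's worth, a remainder-safe variant of the same conversion can be sketched with awk: buffer hostnames, flush every nine, and flush whatever is left at end of file, so the trailing 0-8 entries are no longer dropped. The 11-entry sample input (`hosts_in`, made-up hostnames) is only for demonstration.

```shell
#!/bin/sh
# Sketch: same nine-per-line conversion, but remainder-safe.
# Fabricated 11-entry input file for demonstration:
printf '0.0.0.0 h%d.example\n' 1 2 3 4 5 6 7 8 9 10 11 > hosts_in
echo "127.0.0.1 localhost" > hosts_nine
awk '$1 == "0.0.0.0" {
    buf = buf " " $2; n++
    if (n == 9) { print "0.0.0.0" buf; buf = ""; n = 0 }
}
END { if (n > 0) print "0.0.0.0" buf }' hosts_in >> hosts_nine
cat hosts_nine
```

With 11 input entries this yields the localhost line, one full nine-host line, and one two-host remainder line.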
Thank you @hd074, that's vastly interesting.
This seems to confirm what I've seen through informal observation: not much, if any, measurable benefit.
Next Thing: (127.0.0.1 + ::1) vs (0.0.0.0 + ::) and Filesize
Again: I did NOT do any statistically evaluable stuff here.
Same setup as above.
Test Case 1: 127.0.0.1 vs 0.0.0.0
- Router: flush DNS cache
- Windows: run `time nslookup $WEBSITE` to get the response time (pure DNS)
- Router: flush DNS cache
- Windows: run `time wget $WEBSITE` to get the response time (request for the website or content)
Since the last test had shown that there's no real difference between cached and uncached entries when using blocked host names, I did not test this separately this time.
Results
I used a hosts file with 712,131 entries.
L = localhost version (127.0.0.1 and ::1)
N = non-routable meta-addresses (0.0.0.0 and ::)
Site | L dns (s) | L wget (s) | N dns (s) | N wget (s) |
---|---|---|---|---|
google-analytics.com | 0.069 | 2.034 | 0.072 | 0.032 |
zzzha.com | 0.073 | 2.033 | 0.074 | 0.029 |
Surprise, surprise: the DNS request itself does not differ. That's what we expected.
But if we later work with the returned addresses to request content and whatnot, the difference is pretty huge. We expected that too.
Test Case 2: Filesize
I just compared the results from both tests (355,981 vs 712,131 entries)
NOTE: What I compared here is the following:
File | 0.0.0.0 entries | ::1 entries |
---|---|---|
355,981 | 355,979 | 2 |
712,131 | 356,066 | 356,065 |
The fact that the second file doesn't contain new "unique" entries (it's just all the 0.0.0.0 entries duplicated and moved to ::1) MAY have an impact on the results. The point is that I can't (and don't want to) look into dnsmasq.
Nonetheless, the results show the same behaviour as when I moved from a pure 0.0.0.0 hosts file with 25,000 entries to a pure 0.0.0.0 hosts file with 355,000+ entries some time ago.
Results
Site | 355,981 entries (s) | 712,131 entries (s) |
---|---|---|
google-analytics.com | 0.059 | 0.072 |
zzzha.com | 0.057 | 0.074 |
The file size doubled, but the response time did not.
When I moved from a small file to an approximately ten times larger file some time ago, the response time increased from 0.032 to 0.050 (if I remember correctly). So the file size itself does not seem to have a very big impact on response time... if using dnsmasq.
This is great!
@hd074 This is _fantastic_ data you are generating.
For completeness, is this 32-bit or 64-bit Win7? Is it Win7 or Win7 SP1? Also, which edition of Windows are you testing?
@StevenBlack thank you very much.
@Gitoffthelawn thanks to you, too.
It's Windows 7 Professional 64-bit, Service Pack 1.
Further relevant details:
- ASUS P7P55D PRO motherboard
- Intel Core i7 860 @ 2.8 GHz
- no additional network adapter
I think that in your script, where you have `0.0.0.0` as a pattern, you should escape the periods and use `0\.0\.0\.0` instead.
@lewisje you're right, thank you. Corrected it.
I forgot another tiny thing: you could also match for the start of the line and for a space after `0.0.0.0`, to be sure you don't strip out, say, subdomains like 0.0.0.0.example.net
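A quick illustration (with a made-up hostname) of why both the escaping and the anchoring matter:

```shell
#!/bin/sh
# Unanchored: the pattern strips the address everywhere it appears,
# including inside a hostname that happens to start with 0.0.0.0:
printf '0.0.0.0 0.0.0.0.example.net\n' | sed 's/0\.0\.0\.0//g'
# ->  .example.net

# Anchored to the line start with a trailing space, only the address goes:
printf '0.0.0.0 0.0.0.0.example.net\n' | sed 's/^0\.0\.0\.0 //'
# -> 0.0.0.0.example.net
```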
So is there a best methodology that can be adopted based on this dataset?
See also #47 for more related discussion.
Regarding OS X, see also the Open Radar bug "Long /etc/hosts entries lead to unbearably slow resolution" (rdar://24237290) and the response of an Apple engineer.
I guess that means that nine hostnames per line is a best practice for both Windows and Mac.
It means that a 9-hosts-per-line file performs better than a >9-hosts-per-line file (on a Mac).
I don't really see the advantage of the nine-hosts-per-line method (vs a single entry per line).
The only thing that comes to mind is the lower memory usage,
but I think nowadays memory isn't a thing to worry about (edit: regarding this project).
My concerns regarding this method are readability and maintainability.
This is why I'm personally skeptical that it really is best practice.
The way I understood it, Windows doesn't read hostnames after the ninth on a line, so the maximum for that platform is nine per line. I had remembered that OS X could read 24 per line (I never tested higher) but bogged down; I wasn't aware that 10 was the tipping point (and 9 is still within the safe zone for a Mac).
memory isn't a thing to worry about
never true.
With that said, it definitely is easier to maintain a list of hostnames with one per line and then output a nine-per-line version for deployment.
So is there a best methodology that can be adopted based on this dataset?
What | Why/When | But... |
---|---|---|
0.0.0.0 | always (avoids the 127.0.0.1 timeout) | compatibility |
large # of entries | no (big) influence | (possibly) system-dependent |
9 entries per line | smaller file size | readability/maintainability |
1 entry per line | readability/maintainability | file size |
caching enabled | faster lookup of non-blocked sites | no influence on speed for blocked sites |
@lewisje Maybe I got you wrong.
If we choose to use multiple entries per line then 9 hosts is the way to go. I agree with that.
I thought "9 entries is best practice" was referring to the whole "1 entry vs 9 entries vs X entries"-problem. In this case I did and do not agree.
Given that the only benefit of this proposed readability decrease is file-size reduction, it seems not to be worth it. Even on mobile devices this file-size change is not significant.
So closing this now.
Are there any settings for dnsmasq which would load the full hosts file into memory and thereby make everything quicker? Or is that the default?
@RoelVdP dnsmasq by default caches the hosts file(s) in memory, and it's by far the fastest DNS resolver. If there are any slowdowns on your end, you need to look for the problem elsewhere.
@dnmTX thanks mate. Any way to check it is effectively loaded in memory when the file is rather large? Also, any way to make the cache larger? Thank you, very appreciated.
Any way to check it is effectively loaded in memory when the file is rather large?

@RoelVdP there is not really an easy way to check this, as everything cached in memory is in some hidden files, but I can assure you that this is the case. Dnsmasq is designed to work from memory, and that is why it is so fast. Along with the given hosts file(s) it caches every response as well, so to check how effective it is, simply do `time nslookup domain.com` and you'll see. Here, I made an example from my router:
Also, any way to make any cach(ing) larger?

Now, you need to clarify how you are blocking those domains. There are two options: one is through the .config file, for example `server=/domain.com/0.0.0.0` and so on, and one is through hosts file(s), with an added entry in the .config file to point to it: `addn-hosts=/dir/to/your/file/hosts`.
The first option has some limitations on how many entries dnsmasq can cache and whatnot, so it's not really recommended, even though many repos here that offer hosts files have that option present.
The second option is the one to go with. The developer noted that dnsmasq was tested successfully with one million entries, but for such a big file at least a 1 GHz CPU is required.
So to answer your question, caching is plenty, unless you tell me that your hosts file(s) contain more than a million entries. And no, there is no way to expand that, as it is in the kernel.
@dnmTX Thank you very much for the detailed reply. Excellent idea on the nslookup. Tried that, and results are about 0.5 seconds for first lookups. So, I am not using any special config in dnsmasq, but rather a large /etc/hosts file (with 722k entries) which dnsmasq then uses 'indirectly'. (See https://github.com/RoelVdP/MoralDNS). I wonder now if addn-hosts in .config can be pointed to the /etc/hosts file and if this would cache it (perhaps it was not caching and the OS was the limiting factor). I am starting to understand why pages are loading slowly - if there are many lookups, then many * 0.5 seconds = long delay. Thank you again. Let me know if you have any other thoughts.
I wonder now if addn-hosts in .config can be pointed to the /etc/hosts file and if this would cache it...

@RoelVdP I'm really not sure what you mean by that. As long as you point dnsmasq to the file, it will read it and cache it. The easiest way to check is from the system log (syslogd). If it's disabled on your end, enable it and restart dnsmasq (or your system) and check the logs. Here, another example for you:
I would not recommend overriding or appending to /etc/hosts, as in some instances after a restart that same hosts file will revert to its previous state and all those blocked domains will be gone. It's always better to add it as an addition, stored where it can't be deleted by a restart or some sudden shutdown.
Let me know if you have any other thoughts.

Yeah, like a bunch. I went briefly through your script, and you can do some improvements to lower the size (entries) and make it more responsive:
First: Get rid of this one: `wget -Oc http://sysctl.org/cameleon/hosts`
It's been abandoned by its maintainer since 2017. If you weed out the duplicates and all the dead domains, you'll end up with probably 5,000+ out of what... 23,000+ (not worth it).
Second: Check for empty lines, comment leftovers, etc., especially in StevenBlack's lists.
Use `sed '/^#/d; s/ #.*//g; s/ #.*//g; /#/d; /^\s*$/d' a > tmp` in that order.
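To see what that cleanup expression does, here is a tiny fabricated sample (the file name `a` is from the command above; the entries are made up):

```shell
#!/bin/sh
# Demonstration of the cleanup sed on a small fabricated list: the header
# comment, the inline comment, and the blank line are removed; plain
# "0.0.0.0 hostname" entries pass through untouched.
printf '# header\n0.0.0.0 ads.example.com # inline comment\n\n0.0.0.0 tracker.example.net\n' > a
sed '/^#/d; s/ #.*//g; s/ #.*//g; /#/d; /^\s*$/d' a > tmp
cat tmp
```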
Third, duplicates: There are a lot. If you manage to get rid of them, you'll probably shrink your file by half. `sed` will not cut it there; use `awk`, or even better `gawk`, for that task, as it is blazing fast. Compare each file to StevenBlack's before you merge it.
This is your command:
gawk 'NR==FNR{a[$0];next}!($0 in a)' stevenblack the-other-file > no-duplicates-file
mv no-duplicates-file the-other-file <- this step is optional
Do this on each one, then merge them all together. But first do the cleaning (comments and whatnot) and add the zeroes - IMPORTANT!!!
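A tiny offline illustration of that dedup step (fabricated entries; plain `awk` behaves the same as `gawk` for this two-file idiom): lines of the second file that already appear in the first are dropped.

```shell
#!/bin/sh
# NR==FNR is true only while reading the first file, so its lines are stored
# as keys in array a; for the second file, print only lines not in a.
printf '0.0.0.0 a.example\n0.0.0.0 b.example\n' > stevenblack
printf '0.0.0.0 b.example\n0.0.0.0 c.example\n' > the-other-file
awk 'NR==FNR{a[$0];next}!($0 in a)' stevenblack the-other-file
# -> 0.0.0.0 c.example
```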
Still, you are loading too many lists. Some of them are really not needed, as they're based on others that you are already using (especially EasyList and EasyPrivacy, in my opinion), so some delay is to be expected.
I am starting to understand why pages are loading slow - if there are many lookups then many * 0.5 seconds = long delay.

You do realize what 0.05s out of 1 (one) second is, right? You got that completely wrong. Can't go any faster than that, bud. There are no lookups there; the file is cached in memory = memory is fast = there is one lookup, or let's say ten (when opening some page) = and there is a comparison against all the entries in the cached file, which equals 0.05s each, or 0.50s combined. How is that not fast?
# With thanks, MalwareDomains list
wget -Ob https://mirror1.malwaredomains.com/files/justdomains
grep -vE "^#|^$" b | sed "s|^|0.0.0.0 |" > tmp

I just looked at it, and it's wrong. This list does not come with any comments or empty lines, and when I tried the command it was soooo slow. So for this one (only), just use `sed 's/^/0.0.0.0 /g' b > tmp`.
Also, `grep` is not your friend here; `sed` can do all those tasks on its own (research the commands). For `sed`, double quotes are not needed (use single quotes instead), and neither are straight brackets (use `/` instead).
You'd better inspect each file again and reconfigure your commands.
Another TIP: Some lists come with a bunch of comments at the top and that's it; the rest is only domain entries. In that case (after confirmation, aka visual inspection), use:
sed '1,8d' b > tmp
(adjust those numbers to your needs)
This will delete from line one to line eight, and that's it. It's ten times faster than:
sed '/^#/d' b > tmp
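For illustration, with a fabricated file `b` whose header is three comment lines, the range delete looks like this (adjust `1,3` to the real list's header length):

```shell
#!/bin/sh
# Delete a known-length comment header by line range instead of by pattern.
# Sample file "b" is generated here; entries are made up.
printf '# c1\n# c2\n# c3\n0.0.0.0 x1.example\n0.0.0.0 x2.example\n' > b
sed '1,3d' b > tmp
cat tmp
```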
@RoelVdP this will be my last post here, as we really went OFF TOPIC on this one, and I know... some... are not happy about it. So good luck, and I hope whatever I posted above helps make your project better. 👍