
Comments (29)

StevenBlack commented on May 19, 2024

Hi @lewisje

I've thought about this.

I occasionally find myself eyeballing various regions of the hosts file, for various reasons.

It seems much easier to scan a single column.

If we go to multiple hosts per line, I think I would keep it to 80-100 columns wide, or thereabouts, which would certainly constrain us to fewer than nine hosts per line.

Know what interests me greatly? Metrics for the performance of host files as a function of orthogonal factors such as

  • 0.0.0.0 vs 127.0.0.1
  • How file length (number of lines) affects load and parse performance.
  • The degree to which multiple hosts per line helps, as it seems reasonable to presume it does.

So far I've anecdotally seen little benefit, one way or the other. The hosts-file lookup appears to be sufficiently high in the latency stack that it's maybe not worth fretting about?

Either way, I'm curious to know.


lewisje commented on May 19, 2024

I think I should figure out how to precisely measure this, but I know that when I run ipconfig /displaydns on my Windows machine (to force-load the hostnames into the local DNS cache), it takes less time with multiple hostnames per line than with one, even when I suppress the output (just printing the output often takes a lot of the time with long-running commands).
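A rough way to time it could look like this sketch, assuming a Cygwin shell where both bash's time builtin and Windows' ipconfig are available:

# Discard stdout so console printing doesn't dominate the measurement.
time ipconfig /displaydns > /dev/null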

I'm thinking this suggestion is more akin to delivering a minified JS file for wide-scale Web deployment while retaining a properly spaced-out JS file for development.


Gitoffthelawn commented on May 19, 2024

@StevenBlack wrote:

Know what interests me greatly? Metrics for the performance of host files as a function of orthogonal factors such as...

  • 0.0.0.0 vs 127.0.0.1
  • How file length (number of lines) affects load and parse performance.
  • The degree to which multiple hosts per line helps, as it seems reasonable to presume it does.

That will be extremely valuable information if anyone performs the testing. I'm amazed that detailed tests have not already been publicly documented. Cross-platform testing is essential, and will enhance the value of the data even further.


HansiHase commented on May 19, 2024

Hey guys, I ran some short tests. First of all, it's important to mention that I did NOT do anything statistically rigorous here. Just one run for every test case, no repetition - just a "let's see where this could possibly lead" thing.


System
Router: TP-Link Archer C7 v1
Router OS: OpenWrt Barrier Breaker
Router DNS: dnsmasq
The hosts file under test lives on the router.

Client: Windows 7 desktop
Software: Cygwin, for Linux tools on Windows

Connection: wired gigabit Ethernet


Test Case

  1. Router: flush the DNS cache
  2. Windows: time nslookup $WEBSITE to get the response time (uncached)
  3. Windows: time nslookup $WEBSITE to get the response time (cached)

The remote DNS server is 85.214.20.141 (https://digitalcourage.de/support/zensurfreier-dns-server). A sketch of this loop follows below.
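A minimal sketch of the procedure - assuming SSH access to the router at a placeholder address, and relying on the fact that dnsmasq clears its cache and re-reads its hosts files on SIGHUP; the site list is illustrative:

#!/bin/bash
for site in github.com openwrt.org imgur.com; do
  ssh root@192.168.1.1 'kill -HUP $(pidof dnsmasq)'   # 1. flush the router's DNS cache
  time nslookup "$site" > /dev/null                   # 2. uncached response time
  time nslookup "$site" > /dev/null                   # 3. cached response time
done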


Results

I used a hosts file with 355,981 entries. It is a 0.0.0.0-only file - no ::1 entries.

S = single entry (one host per line) - size 11 MB
N = 9 hosts per line - size 8.4 MB

Unblocked Sites

Site        | S uncached (s) | S cached (s) | N uncached (s) | N cached (s)
------------|----------------|--------------|----------------|-------------
github.com  | 0.102          | 0.051        | 0.099          | 0.074
openwrt.org | 0.095          | 0.054        | 0.105          | 0.055
imgur.com   | 0.094          | 0.054        | 0.083          | 0.054

Blocked Sites

Site                 | S uncached (s) | S cached (s) | N uncached (s) | N cached (s)
---------------------|----------------|--------------|----------------|-------------
google-analytics.com | 0.059          | 0.059        | 0.060          | 0.057
zzzha.com            | 0.057          | 0.051        | 0.056          | 0.054

Note: For this case I added ::1 entries for google-analytics.com and zzzha.com, so the AAAA requests don't get forwarded either.
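For illustration, the resulting entries look like this (a sketch; trailing # comments are legal in hosts files):

0.0.0.0 google-analytics.com   # answers the A query locally
::1     google-analytics.com   # answers the AAAA query locally instead of forwarding it
0.0.0.0 zzzha.com
::1     zzzha.com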


Single-entry to nine-entries-per-line conversion - Bash script

I wrote a short script so you can try it yourself. It takes the input hosts file as its argument and writes the file hosts_nine.

#!/bin/bash

echo "127.0.0.1 localhost" > hosts_nine
grep "^0" "$1" | sed "s/0\.0\.0\.0//g" | tr -d "\n" \
  | egrep -o '\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+' \
  | sed 's/^/0.0.0.0 /' >> hosts_nine

NOTE: There will be 0-8 entries missing from the generated file. With a base file of 300,000+ entries, this is "okay" for testing purposes, I hope. This behaviour is a result of "let's not put too much time into this and live with the bias". The problem is the egrep expression: it only matches complete groups of nine hostnames, so if the file does not end on a full group of nine, the trailing hostnames are dismissed.
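If the trailing-group loss matters, here is an alternative sketch that keeps the final partial group, assuming GNU xargs (xargs -n 9 echoes nine arguments per line and emits the remainder as a final, shorter line):

#!/bin/bash
echo "127.0.0.1 localhost" > hosts_nine
grep '^0\.0\.0\.0[[:space:]]' "$1" \
  | awk '{print $2}' \
  | xargs -n 9 \
  | sed 's/^/0.0.0.0 /' >> hosts_nine   # trailing group of <9 hosts is kept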


StevenBlack commented on May 19, 2024

Thank you @hd074, that's vastly interesting.

This seems to confirm what I've seen through informal observation: not much, if any, measurable benefit.


HansiHase commented on May 19, 2024

Next Thing: (127.0.0.1 + ::1) vs (0.0.0.0 + ::) and Filesize


Again: I did NOT do anything statistically rigorous here.
Same setup as above.


Test Case 1: 127.0.0.1 vs 0.0.0.0

  1. Router: flush the DNS cache
  2. Windows: time nslookup $WEBSITE to get the response time (pure DNS)
  3. Router: flush the DNS cache
  4. Windows: time wget $WEBSITE to get the response time (request for the website or its content)

Since the last test showed no real difference between cached and uncached lookups for blocked host names, I did not test that separately this time. (Again, a sketch of the loop follows below.)
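As before, a sketch under the same assumptions (placeholder router address, SIGHUP cache flush):

#!/bin/bash
for site in google-analytics.com zzzha.com; do
  ssh root@192.168.1.1 'kill -HUP $(pidof dnsmasq)'   # 1. flush the DNS cache
  time nslookup "$site" > /dev/null                   # 2. pure DNS response time
  ssh root@192.168.1.1 'kill -HUP $(pidof dnsmasq)'   # 3. flush again
  time wget -q -O /dev/null "http://$site/"           # 4. request the site's content
done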


Results
I used a hosts file with 712,131 entries.

L = localhost version (127.0.0.1 and ::1)
N = non-routable meta-addresses (0.0.0.0 and ::)

Site                 | L dns (s) | L wget (s) | N dns (s) | N wget (s)
---------------------|-----------|------------|-----------|-----------
google-analytics.com | 0.069     | 2.034      | 0.072     | 0.032
zzzha.com            | 0.073     | 2.033      | 0.074     | 0.029

Surprise, surprise: the DNS request itself does not differ. That's what we expected.
But when we later use the returned address to request content and whatnot, the difference is pretty huge. We expected that too: 0.0.0.0 is never routable, so the connection attempt fails immediately, while 127.0.0.1 sends the request to the local machine, where it has to be refused or time out first.


Test Case 2: File size

I just compared the results from the two tests (355,981 vs 712,131 entries).

NOTE: What I compared here is the following:

File (entries) | 0.0.0.0 entries | ::1 entries
---------------|-----------------|------------
355,981        | 355,979         | 2
712,131        | 356,066         | 356,065

The fact that the second file doesn't contain new "unique" entries (it's just all the 0.0.0.0 entries duplicated and moved to ::1) MAY have an impact on the results. The point is that I can't (and don't want to) look into dnsmasq's internals.

Nonetheless, the results show the same behaviour as when I moved from a pure 0.0.0.0 hosts file with 25,000 entries to a pure 0.0.0.0 hosts file with 355,000+ entries some time ago.


Results

Site                 | 355,981 entries (s) | 712,131 entries (s)
---------------------|---------------------|--------------------
google-analytics.com | 0.059               | 0.072
zzzha.com            | 0.057               | 0.074

The file size doubled, but the response time did not.

When I moved from a small file to an approximately ten-times-larger file some time ago, the response time increased from 0.032 to 0.050 (if I remember correctly). So the file size itself does not seem to have a very big impact on response time... at least when using dnsmasq.


StevenBlack commented on May 19, 2024

This is great!


Gitoffthelawn commented on May 19, 2024

@hd074 This is _fantastic_ data you are generating.

For completeness, is this 32-bit or 64-bit Win7? Is it Win7 or Win7 SP1? Also, which edition of Windows are you testing?


HansiHase commented on May 19, 2024

@StevenBlack thank you very much.

@Gitoffthelawn thanks to you, too.
It's Windows 7 Professional 64-Bit, Service Pack 1.

Further relevant details:
ASUS P7P55D PRO motherboard
Intel Core i7 860 @ 2.8 GHz
No additional network adapter.


lewisje commented on May 19, 2024

I think that in your script, where you have /0.0.0.0/, you should escape the periods and have /0\.0\.0\.0/


HansiHase commented on May 19, 2024

@lewisje you're right, thank you. Corrected it.


lewisje commented on May 19, 2024

I forgot another tiny thing: you could also match the start of the line and a space after 0.0.0.0, to be sure you don't strip out, say, subdomains like 0.0.0.0.example.net.
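Something like this sketch (hosts_single is a hypothetical input file name):

sed 's/^0\.0\.0\.0 //' hosts_single   # leaves hostnames like 0.0.0.0.example.net intact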


Gitoffthelawn commented on May 19, 2024

So is there a best methodology that can be adopted based on this dataset?


Gitoffthelawn commented on May 19, 2024

See also #47 for more related discussion.


sierkb commented on May 19, 2024

Regarding OS X, see also the Open Radar bug "Long /etc/hosts entries lead to unbearably slow resolution" (rdar://24237290) and the response of an Apple engineer.


lewisje commented on May 19, 2024

I guess that means that nine hostnames per line is a best practice for both Windows and Mac.


HansiHase commented on May 19, 2024

It means that a nine-hosts-per-line file performs better than a more-than-nine-hosts-per-line file (on a Mac).

I don't really see the advantage of the nine-hosts-per-line method (vs one entry per line).
The only thing that comes to my mind is the lower memory usage,
but I think nowadays memory isn't a thing to worry about (edit: regarding this project).

My concerns regarding this method are readability and maintainability.
This is why I'm personally skeptical that it really is best practice.


lewisje commented on May 19, 2024

The way I understood it, Windows doesn't read hostnames after the ninth on a line, so the maximum for that platform is nine per line. I had remembered that OS X could read 24 per line (I never tested higher) but bogged down; I just wasn't aware that 10 was the tipping point (and 9 is still within the safe zone for a Mac).

memory isn't a thing to worry about

never true.

With that said, it definitely is easier to maintain a list of hostnames with one per line and then output a nine-per-line version for deployment.


HansiHase commented on May 19, 2024

@Gitoffthelawn

So is there a best methodology that can be adopted based on this dataset?

What               | Why/When                                  | But...
-------------------|-------------------------------------------|----------------------------------------
0.0.0.0            | always (because of the 127.0.0.1 timeout) | compatibility
large # of entries | no (big) influence                        | (possibly) system-dependent
9 entries per line | smaller file size                         | readability/maintainability
1 entry per line   | readability/maintainability               | file size
caching            | yes? faster lookup of non-blocked sites   | no influence on speed for blocked sites


HansiHase commented on May 19, 2024

@lewisje Maybe I got you wrong.
If we choose to use multiple entries per line, then nine hosts per line is the way to go. I agree with that.

I thought "9 entries is best practice" was referring to the whole "1 entry vs 9 entries vs X entries" question. On that point I did, and do, not agree.


matkoniecz commented on May 19, 2024

Given that the only benefit of this proposed readability decrease is a file-size reduction, it does not seem worth it. Even on mobile devices this file-size change is not significant.


StevenBlack commented on May 19, 2024

So closing this now.


RoelVdP commented on May 19, 2024

Are there any dnsmasq settings that would load the full hosts file into memory and thereby make everything quicker? Or is that the default?


dnmTX commented on May 19, 2024

@RoelVdP dnsmasq caches the hosts file(s) in memory by default, and it is by far the fastest DNS resolver. If there are any slowdowns on your end, you need to look for the problem elsewhere.


RoelVdP commented on May 19, 2024

@dnmTX thanks mate. Any way to check that it is effectively loaded in memory when the file is rather large? Also, any way to make any caching larger? Thank you, very appreciated.


dnmTX commented on May 19, 2024

Any way to check that it is effectively loaded in memory when the file is rather large?

@RoelVdP there is not really an easy way to check this, as everything cached in memory sits in some hidden files, but I can assure you that this is the case. dnsmasq is designed to work from memory, and that is why it is so fast. Along with the given hosts file(s), it caches every response as well, so to check how effective it is, simply do time nslookup domain.com and you'll see. Here, I made an example from my router:
[screenshot of time nslookup output from the router omitted]

Also, any way to make any caching larger?

Now, you need to clarify how you are blocking those domains. There are two options: one is through the .config file, for example server=/domain.com/0.0.0.0 and so on, and the other is through hosts file(s), with an entry added in the .config file to point to them: addn-hosts=/dir/to/your/file/hosts.
The first option has some limitations on how many entries dnsmasq can cache and whatnot, so it is not really recommended, even though many repos here that offer hosts files present that option.
The second option is the one to go with. The developer noted that dnsmasq was tested successfully with one million entries; for such a big file, at least a 1 GHz CPU or faster is required.
So to answer your question: caching is plenty, unless you tell me that your hosts file(s) contain more than a million entries. And no, there is no way to expand that, as it is in the kernel.
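For reference, a minimal sketch of the second option (the blocklist path is hypothetical; addn-hosts and cache-size are standard dnsmasq options, and cache-size only affects cached upstream answers, not hosts-file entries):

# /etc/dnsmasq.conf (sketch)
addn-hosts=/etc/blocklist.hosts   # extra hosts file, read in addition to /etc/hosts
cache-size=10000                  # cache for upstream DNS answers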


RoelVdP commented on May 19, 2024

@dnmTX Thank you very much for the detailed reply. Excellent idea on the nslookup. I tried that, and the results are about 0.5 seconds for first lookups. So, I am not using any special config in dnsmasq, but rather a large /etc/hosts file (with 722k entries) which dnsmasq then uses 'indirectly' (see https://github.com/RoelVdP/MoralDNS). I wonder now if addn-hosts in the .config can be pointed to the /etc/hosts file, and whether this would cache it (perhaps it was not caching, and the OS was the limiting factor). I am starting to understand why pages are loading slowly - if there are many lookups, then many × 0.5 seconds = a long delay. Thank you again. Let me know if you have any other thoughts.


dnmTX commented on May 19, 2024

I wonder now if addn-hosts in the .config can be pointed to the /etc/hosts file, and whether this would cache it...

@RoelVdP I'm really not sure what you mean by that. As long as you point dnsmasq to the file, it will read it and cache it. The easiest way to check is from the system log (syslogd). If it's disabled on your end, enable it, restart dnsmasq (or your system), and check the logs. Here, another example for you:
[screenshot of dnsmasq syslog entries omitted]
I would not recommend overriding or appending to /etc/hosts, as in some instances that hosts file will revert to its previous state after a restart, and all those blocked domains will be gone. It's always better to add the entries as a separate file, stored where it can't be deleted by a restart or a sudden shutdown.

Let me know if you have any other thoughts.

Yeah, a bunch. I went briefly through your script, and you can make some improvements to lower the size (entries) and make it more responsive:
First: Get rid of this one: wget -Oc http://sysctl.org/cameleon/hosts
It has been abandoned by its maintainer since 2017; if you weed out the duplicates and all the dead domains, you'll end up with probably 5,000+ out of... 23,000+ (not worth it).
Second: Check for empty lines, comment leftovers, etc., especially in StevenBlack's lists.
Use sed '/^#/d; s/ #.*//g; /#/d; /^\s*$/d' a > tmp in that order.
Third, duplicates: there are a lot of them. If you manage to get rid of them, you'll probably shrink your file to half. sed will not cut it there; use awk, or even better gawk, for that task, as it is blazing fast. Compare each file to StevenBlack's before you merge it.
This is your command:
gawk 'NR==FNR{a[$0];next} !($0 in a)' stevenblack the-other-file > no-duplicates-file
mv no-duplicates-file the-other-file <- this is optional
Do this on each one, then merge them all together. But first do the cleanup (comments and whatnot) and add the zeroes - IMPORTANT !!! (A consolidated sketch follows below.)
Still, you are loading too many lists; some of them are really not needed, as they are based on others you are already using (especially EasyList and EasyPrivacy, in my opinion), so some delay is to be expected.
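Putting those steps together, roughly (a sketch; the file names raw-list, cleaned, zeroed, deduped, and merged-hosts are placeholders):

#!/bin/bash
# Consolidated per-list cleanup as described above.
sed '/^#/d; s/ #.*//; /^\s*$/d' raw-list > cleaned   # strip comments and blank lines
sed 's/^/0.0.0.0 /' cleaned > zeroed                 # add the 0.0.0.0 prefix first
gawk 'NR==FNR{a[$0];next} !($0 in a)' stevenblack zeroed > deduped   # drop entries already in StevenBlack's file
cat deduped >> merged-hosts                          # then merge into the combined file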

I am starting to understand why pages are loading slowly - if there are many lookups, then many × 0.5 seconds = a long delay.

You do realize what 0.05s out of 1 (one) second is, right? You got that completely wrong. It can't go any faster than that, bud. There are no upstream lookups there: the file is cached in memory = memory is fast = there is one lookup, or let's say ten (when opening some page) = and there is a comparison against all the entries in the cached file, which equals 0.05s each, or 0.50s combined. How is that not fast?

# With thanks, MalwareDomains list
wget -Ob https://mirror1.malwaredomains.com/files/justdomains
grep -vE "^#|^$" b | sed "s|^|0.0.0.0 |" > tmp

I just looked at it, and it's wrong. This list does not come with any comments or empty lines, and when I tried the command it was soooo slow. So for this one (only), just use sed 's/^/0.0.0.0 /g' b > tmp.
Also, grep is not your friend here; sed can do all those tasks on its own (research the commands).
For sed, double quotes are not needed (use single quotes instead), and neither are the straight brackets (use / as the delimiter instead).
You had better inspect each file again and reconfigure your commands.

Another TIP: Some lists come with a bunch of comments at the top and that's it; the rest is only domain entries. In that case (after confirmation, aka visual inspection), use:
sed '1,8d' b > tmp (adjust those numbers to your needs)
This deletes lines one through eight, and it's ten times faster than:
sed '/^#/d' b > tmp


dnmTX commented on May 19, 2024

@RoelVdP this will be my last post here, as we have really gone OFF TOPIC on this one, and I know... some... are not happy about it. So good luck, and I hope what I posted above helps make your project better. 👍

