Comments (10)

Lithopsian commented on July 23, 2024

That looks better. The subgroup makes the totals more or less tally in each category.

from qdirstat.

shundhammer commented on July 23, 2024

That's probably limited to libraries because of the much more complex rules there. I don't think that applies to other categories as well. There might also be some overlap in the rules between different categories that may have crept in.

A check of whether this is any different from previous versions (stable 1.8.1, 1.8, 1.7) is probably also worthwhile. A first check here showed that it behaves no differently in those older versions, but I might be wrong.

We now also have the "Find Files" function, which also shows a count at the bottom of the results list (albeit limited to a maximum of 1000), and there is the plain find command-line tool.

shundhammer commented on July 23, 2024

AFAICS it only uses the suffix rules for that table. I don't exactly recall if that was intentional, but it might very well be; it's been a long while.

suffix-only

That whole file type statistics was something that more or less a single user wanted back then (and he kept nagging, and I finally gave in; not sure if that was a good idea). If you go back in the issues history you will see that I had always said that it's pretty pointless since on Linux the rules are much more complex than on Windows, where that whole idea originally comes from (WinDirStat), and I always said that it comes with a lot of caveats.

Some of that could be papered over with more complex regexps or checking permissions, but other aspects are and will always be a bit inconsistent. This isn't Windows where an .exe suffix clearly indicates an executable. There are tons of files with really creative suffixes, even more without any suffix whatsoever, and sometimes even contradictory ones.

This is also one reason why in some areas I kept the MIME categories quite broad; just "Libraries" (in the broadest sense), not subdividing them even further like you obviously did on your system. It's just too easy to get contradicting rules if you don't pay very close attention.

This is not an exact science, it's more rules of thumb.

Lithopsian commented on July 23, 2024

Older versions look the same. I can go all the way back to 1.6 at the weekend if you think that would be helpful.

The current categorizer doesn't report regexp matches like "lib*.a" as having a suffix, even if they have one. Matches like ".so." are, even more obviously, also not reported as having a suffix. The dialog does derive a suffix for all its matches even when the categorizer doesn't report one, but then it starts lumping them all together, and I think that's where it goes wrong. For example, every file that it finds with a suffix of .3 gets forced into the Other category even if the categorizer found an actual category for it. I think it should only lump suffixes together within each category, not into a single bucket; then everything would add up. Each category might end up with an Other grouping (or a separate grouping for no category?), either for junk suffixes or for suffixes not reported by the categorizer, and the percentage reported for each suffix group within a category would match the category total.
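To illustrate the per-category grouping proposed above, here is a minimal sketch in Python (not QDirStat's actual C++ code). The function names, the "Uncategorized" label, and the input tuples are all assumptions made up for this example. The point is only that when suffix groups are keyed by category first, each category's suffix groups sum to that category's total.

```python
from collections import defaultdict


def derive_suffix(name):
    """Return the text from the last period on, e.g. '.3' for 'printf.3'.

    A name with no period, or only a leading one, is treated as having no suffix.
    """
    dot = name.rfind(".")
    return name[dot:] if dot > 0 else ""


def group_by_category(files):
    """Group sizes by category first and by derived suffix second.

    files: iterable of (name, category, size); category is None when the
    (hypothetical) categorizer found no match for the file.
    """
    totals = defaultdict(int)
    groups = defaultdict(lambda: defaultdict(int))
    for name, category, size in files:
        cat = category or "Uncategorized"  # hypothetical label
        suffix = derive_suffix(name) or "(no suffix)"
        totals[cat] += size
        groups[cat][suffix] += size
    return totals, groups


files = [
    ("printf.3", "Documentation", 100),
    ("libfoo.so.3", "Libraries", 400),
    ("libbar.a", "Libraries", 200),
    ("notes.3", None, 50),
]
totals, groups = group_by_category(files)
```

With this layout the three files ending in .3 stay in three different categories instead of being merged into one bucket.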

The categorizer itself could also be more intelligent about matching complex wildcards that include a suffix. For example, "moc_*.cpp" should be matched before "*.cpp". Currently it never gets matched at all, and all the Qt-generated files end up in the source category instead of the generated-files category. Regexp patterns that include a suffix could be included in a multi-hash, which would dramatically reduce the number of full regular expression tests that have to be done, because they are so much slower than the map lookups.
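The multi-hash idea could look roughly like the following Python sketch. It is not the QDirStat implementation; the class, its method names, and the category labels are hypothetical. Wildcard patterns that end in a literal suffix are stored under that suffix, so a lookup only runs the few regexps registered for the file's own suffix and then falls back to the plain suffix map.

```python
import fnmatch
import re
from collections import defaultdict


class SuffixIndexedCategorizer:
    """Hypothetical categorizer: wildcard rules indexed by their literal suffix."""

    def __init__(self):
        self.suffix_map = {}                          # ".cpp" -> category
        self.wildcards_by_suffix = defaultdict(list)  # ".cpp" -> [(regex, category)]

    def add_suffix(self, suffix, category):
        self.suffix_map[suffix] = category

    def add_wildcard(self, pattern, category):
        dot = pattern.rfind(".")
        suffix = pattern[dot:] if dot >= 0 else ""
        # Only patterns that end in a literal suffix can be indexed this way.
        if not suffix or any(ch in suffix for ch in "*?["):
            raise ValueError(f"pattern has no literal suffix: {pattern!r}")
        regex = re.compile(fnmatch.translate(pattern))
        self.wildcards_by_suffix[suffix].append((regex, category))

    def category(self, name):
        dot = name.rfind(".")
        suffix = name[dot:] if dot >= 0 else ""
        # More specific wildcard rules are tried first, but only those
        # registered under this file's suffix.
        for regex, cat in self.wildcards_by_suffix.get(suffix, ()):
            if regex.match(name):
                return cat
        # Fall back to the cheap plain-suffix lookup.
        return self.suffix_map.get(suffix)


categorizer = SuffixIndexedCategorizer()
categorizer.add_suffix(".cpp", "Source")
categorizer.add_wildcard("moc_*.cpp", "Generated")
```

In this sketch "moc_mainwindow.cpp" is tested against one regexp and lands in the generated-files category, while "main.cpp" fails that single test and falls through to the suffix map.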

shundhammer commented on July 23, 2024

The more complex patterns get precedence (IIRC), so it's perfectly normal that anything that matches any of those is no longer put into any of the suffix categories. That is expected and intentional.

There is also the question if the "Locate Files" window could still reliably locate them all; if not, that would only make the problem move to a different place.

Also, see the extensive discussions about "cruft" somewhere in the issues discussing this file type statistics thing. IIRC there is also some debug logging in the code that is just commented or #ifdefed out that can show all the stuff that is also found, but that are not real suffixes; just weird filenames that some developer thought up, and hey, why not use a period or half a dozen in a filename if it strikes my fancy?

Let's not overdo this whole thing. Its usefulness is very limited to begin with, and it already created more problems than it's worth. The MIME categories are useful for colorizing the treemap, to get a visual impression about dominating file types; but the numeric file type statistics are useful only in certain cases, and as soon as more complex expressions are involved, it pretty much falls apart.

I also don't want to add an "other" section (that could not be used for "Locate Files") in each category. The suffixes we can see are the amount of detail that is viable and useful. Yes, there may be more stuff that is not shown. That's life.

If anybody really wants to get more detailed matches, there is now the "Find Files" function.

shundhammer commented on July 23, 2024

And BTW no, I don't think moving backwards in time beyond, say, V1.7 would be useful; there weren't many changes in that whole area for a long time.

Which also shows that this is very likely not used very much to begin with.

shundhammer commented on July 23, 2024

See also #45 and #48.

shundhammer commented on July 23, 2024

An even more drastic example is the file type statistics of my ~/src directory:

git-packs

Since I recently added pack-*.pack to the compressed archives MIME category, they are now accounted for in that category, but they don't show up there because it's not a simple suffix; only the one .tar.bz2 file in that subtree is shown there. Even worse, they now appear in the top 20 of the other category as *.pack because they are so large. You can spot them in the treemap as green tiles.

The file type statistics is much too simplistic. It works for just suffix rules, but it falls apart as soon as more complex expressions are used.

shundhammer commented on July 23, 2024

That last commit adds an other file type to collect all the files matching a non-suffix rule so hopefully this discussion will not appear again.
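As a rough illustration of what such a collecting entry does (a Python sketch under assumptions, not the code from that commit): files whose category came from a suffix rule are counted under that suffix, and files whose category came from a non-suffix rule are counted under one extra entry per category. The function name, the "<other>" label, and the input format are hypothetical.

```python
from collections import Counter

NON_SUFFIX = "<other>"  # hypothetical label for the collecting entry


def file_type_stats(files):
    """Sum sizes per (category, suffix rule).

    files: iterable of (name, size, category, matched_suffix), where
    matched_suffix is the suffix rule that assigned the category, or None
    when the category was assigned by a non-suffix (wildcard) rule.
    Files without a category are skipped in this sketch.
    """
    stats = Counter()
    for name, size, category, matched_suffix in files:
        if category is None:
            continue
        key = matched_suffix if matched_suffix is not None else NON_SUFFIX
        stats[(category, key)] += size
    return stats


files = [
    ("pack-1a2b.pack", 900, "Compressed Archives", None),
    ("pack-3c4d.pack", 600, "Compressed Archives", None),
    ("backup.tar.bz2", 100, "Compressed Archives", ".tar.bz2"),
]
stats = file_type_stats(files)
```

Here the two pack files no longer disappear from their category: they are accounted for under the collecting entry, next to the one .tar.bz2 file.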

It's ugly, and that code is now in desperate need of a cleanup.

And to clarify for anybody reading this in the future: No, it is not and will not be possible to locate those files that match other from the GUI.

git-packs-fixed

libs-fixed

shundhammer commented on July 23, 2024

Cleaned up that code at least a little bit. It will never be pretty, but at least now it improved from fugly to just ugly.

Please test.
