decuser_python_playground's Introduction

decuser_python_playground

A repository for python scripts of interest

News

20210804.0617 dircmp.py version 0.7.1 ready bugfix 14 included

20210803.0927 dircmp.py version 0.7.0 ready added single dir support, fixed all known counting issues

20191218.1639 added signing key, so commits are verified going forward

20191218.1448 dircmp.py version 0.6.1 ready bugfixes 7 and 8 included

20191218.1115 dircmp.py version 0.6.0 ready refactored

20191216.1813 dircmp.py version 0.5.1 ready fast mode and some fixes

20191212.1302 dircmp.py version 0.5.0 ready with recursion, support for hidden files, and crude tests

First up - dircmp.py, a *nix utility to compare two directories

What it does is:

  • Calculates sha1 checksums for all non-hidden files in a src and dst directory
  • Supports recursion
  • Supports hidden directories and files
  • Supports fast digests (not terribly accurate, but sufficient for quick scanning)
  • Supports single directory analysis
  • Gets a list and count of files that:
    • Only exist in src
    • Only exist in dst
    • Exist in both
    • Are duplicates in src
    • Are duplicates in dst
    • Have the same name in both, but different checksums
    • Have the same checksums, but different names
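
The duplicate detection in the list above can be sketched roughly as follows. This is a minimal, non-recursive illustration and not the code in dircmp.py: the names `sha1_file` and `find_duplicates` and the `include_hidden` parameter are assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def sha1_file(path, chunk_size=65536):
    """Return the sha1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(directory, include_hidden=False):
    """Group the regular files in one directory by digest.

    Returns {digest: [names]} for digests shared by two or more files.
    Hypothetical helper: no recursion, names starting with '.' are
    treated as hidden.
    """
    by_digest = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        if name.startswith(".") and not include_hidden:
            continue
        by_digest[sha1_file(path)].append(name)
    return {d: names for d, names in by_digest.items() if len(names) > 1}
```

Grouping by digest first means every other report (only in src, only in dst, same name with different content) can be derived from two such indexes without rereading any file.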

Systems Tested

  • Mac OS X 10.14.6 Mojave with Python 3.9.4
  • Mac OS X 10.15.1 Catalina with Python 3.7.3
  • Linux Mint 19.2 Tina with Python 3.7.5

Notes

I was tired of trying to understand other compare utilities that didn't seem to do quite what I wanted them to. This utility lets me see exactly what the state of the two directories is relative to each other. I use it to detect duplicates and to bring two directory trees into synchronization.

The utility isn't optimized, but it's good for most work. One of these days, I'll have to do some optimization. That said, it's very accurate.

Test Run

git clone https://github.com/decuser/decuser_python_playground.git
cd decuser_python_playground/dircmp
python dircmp.py tests/default/src tests/default/dst

	+----------------------------------+
	| Welcome to dircmp version 0.7.0  |
	| Created by Will Senn on 20191210 |
	| Last updated 20210803	 	   |
	+----------------------------------+
	Digest: sha1
	Source (src): tests/default/src/

	Destination (dst): tests/default/dst/
	Single directory mode: False
	Show all_flag files: False
	Recurse subdirectories: False
	Calculate shallow digests: False

	Scanning src ... 9 files found (0.01s).
	Calculating sha1 digests in src ...... done (0.0s).
	Scanning dst ... 7 files found (0.0s).
	Calculating sha1 digests in dst ... done (0.0s).
	Analyzing src directory ...done (0.0s).
	Analyzing dst directory ...done (0.0s).
	Comparing src to dst ...done (0.0s).
	Comparing dst to src ...done (0.0s).
	Checking for different names, same digest ...done (0.0s).

	Duplicates found in tests/default/src/: 6 files found.
	0026a27ffa78a4a4963175c35fbee11c332049ed same_in_both
	0026a27ffa78a4a4963175c35fbee11c332049ed same_in_both_copy
	c62a323c301dfb0f3cc8e27609c7f507d1965b64 only_in_src
	c62a323c301dfb0f3cc8e27609c7f507d1965b64 only_in_src_copy
	da39a3ee5e6b4b0d3255bfef95601890afd80709 empty
	da39a3ee5e6b4b0d3255bfef95601890afd80709 empty_in_both

	Duplicates found in tests/default/dst/: 2 files found.
	0026a27ffa78a4a4963175c35fbee11c332049ed same_in_both
	0026a27ffa78a4a4963175c35fbee11c332049ed same_in_both_copy

	Exact matches: 4 files found.
	75093aa729169179c9dbbca6aa2d95a97865ca03 b_same_in_both
	da39a3ee5e6b4b0d3255bfef95601890afd80709 empty_in_both
	0026a27ffa78a4a4963175c35fbee11c332049ed same_in_both
	0026a27ffa78a4a4963175c35fbee11c332049ed same_in_both_copy

	Only in tests/default/src/: 2 files found.
	c62a323c301dfb0f3cc8e27609c7f507d1965b64 only_in_src
	c62a323c301dfb0f3cc8e27609c7f507d1965b64 only_in_src_copy

	Only in tests/default/dst/: 1 files found.
	36969b074153d1e76fbd43fb3d3c59802b5f730d only_in_dst

	Same names but different digests: 2 files found.
	in_both_diff_content src:e3bbf99ae9bb23804155b25a82a943e8757fc07a
	in_both_diff_content dst:2690814b054f2ddf3435a30a65506ce4bedba1d2

	Different names but same digests: 8 files found.
	0026a27ffa78a4a4963175c35fbee11c332049ed src:same_in_both
	0026a27ffa78a4a4963175c35fbee11c332049ed src:same_in_both_copy
	0026a27ffa78a4a4963175c35fbee11c332049ed dst:same_in_both
	0026a27ffa78a4a4963175c35fbee11c332049ed dst:same_in_both_copy
	6476df3aac780622368173fe6e768a2edc3932c8 src:in_src_same_content_diff_name
	6476df3aac780622368173fe6e768a2edc3932c8 dst:in_dst_same_content_diff_name
	da39a3ee5e6b4b0d3255bfef95601890afd80709 src:empty
	da39a3ee5e6b4b0d3255bfef95601890afd80709 dst:empty_in_both

	Summary
	-------
	Started at 2021-08-03 09:38:40.373714
	0 dirs, 16 files analyzed.
	0 dirs, 9 files found in tests/default/src/.
	0 dirs, 7 files found in tests/default/dst/.
	6 duplicate files found in tests/default/src/.
	2 duplicate files found in tests/default/dst/.
	4 exact matches found.
	2 files only exist in tests/default/src/.
	1 files only exist in tests/default/dst/.
	2 files have same names but different digests.
	8 files have different names but same digest.
	Finished at 2021-08-03 09:38:40.392788

	Total running time: 0.02s.

Known Issues

  • The comparison effectively ignores empty directories. Git ignores them too, which makes git-hosted tests for this sort of thing problematic.

Quirks

  • 20210803 "Only in" refers to file content, not filename. A filename might exist in only one of the trees being compared, but if its contents match a file in the other tree, it will not be listed in "Only in". It will be noted in "Different names but same digests" instead.

For example: In src, there's a file named only_in_src that contains the letter 'a'. In dst, there's a file named only_in_dst that contains the letter 'a'. The comparison would show 0 files Only in src, 0 files Only in dst and 2 files Different names but same digests. To be clear, the program privileges content over names. An enhancement would be to support Names only in and Content only in...
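
The example above can be reproduced with a small in-memory sketch. It is a simplification that only illustrates the content-over-names rule; it does not reproduce dircmp.py's exact listings, and `compare_by_content` and its dictionary inputs are hypothetical.

```python
import hashlib


def sha1_of(data):
    """sha1 hex digest of a bytes object."""
    return hashlib.sha1(data).hexdigest()


def compare_by_content(src, dst):
    """Compare two {filename: bytes} mappings by digest, not by name.

    Returns (only_in_src, only_in_dst, diff_names_same_digest).
    """
    src_by_digest, dst_by_digest = {}, {}
    for name, data in src.items():
        src_by_digest.setdefault(sha1_of(data), set()).add(name)
    for name, data in dst.items():
        dst_by_digest.setdefault(sha1_of(data), set()).add(name)

    # "Only in" means the content is missing from the other tree.
    only_in_src = sorted(n for d, names in src_by_digest.items()
                         if d not in dst_by_digest for n in names)
    only_in_dst = sorted(n for d, names in dst_by_digest.items()
                         if d not in src_by_digest for n in names)

    # Content present in both trees under more than one name.
    diff_names = []
    for d in sorted(src_by_digest.keys() & dst_by_digest.keys()):
        s, t = src_by_digest[d], dst_by_digest[d]
        if len(s | t) > 1:
            diff_names += [(d, "src:" + n) for n in sorted(s)]
            diff_names += [(d, "dst:" + n) for n in sorted(t)]
    return only_in_src, only_in_dst, diff_names


only_src, only_dst, renamed = compare_by_content(
    {"only_in_src": b"a"}, {"only_in_dst": b"a"})
print(len(only_src), len(only_dst), len(renamed))  # prints: 0 0 2
```

A "Names only in" report, as suggested above, would be a plain set difference on the dictionary keys.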

decuser_python_playground's Issues

refactor phase 1

Without breaking anything, let's clean up the code in anticipation of automating some tests. Use a branch for this so there aren't so many piddly versions. Prior to merging, do a regression test for all combinations of arguments and directory scenarios.

make a better progress bar

The current progress bar looks like it puts out a dot for every file processed. This is annoying when you've got more than 100 files, and it's obnoxious with 400k.

Consider printing brackets 50 chars apart, then filling with dots. On the first pass use a regular dot, then back up and overwrite with a bold dot, etc.

[.................................................]
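
One way to read the suggestion above is a bar whose width stays fixed however many files are processed. The sketch below covers only the first pass (regular dots) and assumes a hypothetical `progress_bar` helper; it is not the function used in dircmp.py.

```python
def progress_bar(done, total, width=50):
    """Render a fixed-width bar such as '[.....     ]'.

    The number of dots scales with done/total, so 400k files still
    produce at most `width` dots. A total of zero renders a full bar.
    """
    if total <= 0:
        filled = width
    else:
        filled = min(width, done * width // total)
    return "[" + "." * filled + " " * (width - filled) + "]"


# Redraw in place by returning to the start of the line each time.
for done in (0, 100000, 200000, 400000):
    print("\r" + progress_bar(done, 400000), end="", flush=True)
print()
```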

automate tests

The way times are displayed prevents a simple comparison of logs to determine if the system is working. Either write standard unit tests or ditch the timings.
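
A third option, short of ditching the timings, would be to mask them before comparing logs. The sketch below assumes the output formats shown in the test run (`(0.01s)`, `Started at 2021-08-03 09:38:40.373714`, `Total running time: 0.02s.`); `normalize_log` is a hypothetical helper and the placeholder strings are arbitrary.

```python
import re

# Patterns for the run-dependent parts of the output.
_PATTERNS = [
    (re.compile(r"\(\d+(?:\.\d+)?s\)"), "(<t>s)"),
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"), "<timestamp>"),
    (re.compile(r"(Total running time: )\d+(?:\.\d+)?s"), r"\1<t>s"),
]


def normalize_log(text):
    """Replace elapsed times and timestamps with fixed placeholders."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


run_a = "Scanning src ... 9 files found (0.01s)."
run_b = "Scanning src ... 9 files found (0.0s)."
print(normalize_log(run_a) == normalize_log(run_b))  # prints: True
```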

Total files calculation is wrong

It appears to be counting subdirectories as well as regular files. Something needs to be done - either keep separate counts or reconcile the count elsewise.

broken in windows - bad escape in re.sub call

Haven't a clue, but windows doesn't appear to be happy with:
skey = re.sub(r'^' + re.escape(srcpath), dstpath, key)

could be a problem with srcpath, dstpath, or key

python dircmp.py tests/default/src tests/default/dst

+------------------------------------+
|   Welcome to dircmp version 0.5.0  |
|  Created by Will Senn on 20191210  |
|       Last updated 20191212        |
+------------------------------------+
Digest: sha1
Source (src): tests/default/src\
Destination (dst): tests/default/dst\
Show all files: False
Recurse subdirectories: False

Scanning src ... 9 files found (0.0s).
Calculating sha1 digests in src ... done (0.0s).
Scanning dst ... 7 files found (0.0s).
Calculating sha1 digests in dst... done (0.0s).
Analyzing src directory ...done (0.0s).
Analyzing dst directory ...done (0.0s).
Comparing src to dst ...Traceback (most recent call last):
  File "dircmp.py", line 264, in <module>
	skey = re.sub(r'^' + re.escape(srcpath), dstpath, key)
  File "C:\py38\lib\re.py", line 208, in sub
	return _compile(pattern, flags).sub(repl, string, count)
  File "C:\py38\lib\re.py", line 325, in _subx
	template = _compile_repl(template, pattern)
  File "C:\py38\lib\re.py", line 316, in _compile_repl
	return sre_parse.parse_template(repl, pattern)
  File "C:\py38\lib\sre_parse.py", line 988, in parse_template
	this = sget()
  File "C:\py38\lib\sre_parse.py", line 256, in get
	self.__next()
  File "C:\py38\lib\sre_parse.py", line 245, in __next
	raise error("bad escape (end of pattern)",
re.error: bad escape (end of pattern) at position 17
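
A likely cause, judging from the traceback: `re.sub` parses its replacement string for escapes, and on Windows `dstpath` ends in a backslash (`tests/default/dst\`), which sits at index 17 of that string, matching the reported position. Two possible fixes are sketched below; `swap_prefix` and `swap_prefix_re` are hypothetical names, not functions from dircmp.py.

```python
import re


def swap_prefix(key, srcpath, dstpath):
    """Replace a leading srcpath with dstpath using plain string ops."""
    if key.startswith(srcpath):
        return dstpath + key[len(srcpath):]
    return key


def swap_prefix_re(key, srcpath, dstpath):
    """Same result with re.sub; a callable replacement is used verbatim,
    so backslashes in dstpath are not parsed as escapes."""
    return re.sub(r"^" + re.escape(srcpath), lambda m: dstpath, key)


src = "tests/default/src\\"
dst = "tests/default/dst\\"
key = src + "only_in_src"
print(swap_prefix(key, src, dst))     # tests/default/dst\only_in_src
print(swap_prefix_re(key, src, dst))  # tests/default/dst\only_in_src
```

The plain string version avoids regular expressions entirely, which seems the simpler choice for a fixed prefix.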

brief mode with single dir mode isn't working - version 0.7.0

When -b and -s are specified with a src directory, nothing is displayed about duplicate files, whereas in plain single dir mode there is:

+----------------------------------+
| Welcome to dircmp version 0.7.0 |
| Created by Will Senn on 20191210 |
| Last updated 20210803 |
+----------------------------------+
......Started at 2021-08-04 06:09:06.029075
1 dirs, 9 files analyzed.
1 dirs, 9 files found in src/.
Finished at 2021-08-04 06:09:06.045889

vs
...
Duplicates found in src/: 6 files found.
0026a27ffa78a4a4963175c35fbee11c332049ed same_in_both
0026a27ffa78a4a4963175c35fbee11c332049ed same_in_both_copy
c62a323c301dfb0f3cc8e27609c7f507d1965b64 only_in_src
c62a323c301dfb0f3cc8e27609c7f507d1965b64 only_in_src_copy
da39a3ee5e6b4b0d3255bfef95601890afd80709 empty
da39a3ee5e6b4b0d3255bfef95601890afd80709 empty_in_both

Empty source dir (no non-hidden files present) causes div by zero error

Issue with effectively empty src:

Calculating sha1 digests in src Traceback (most recent call last):
  File "/Users/wsenn/bin/dircmp.py", line 394, in <module>
    [src_files_dict, revidx_src_files] = calculate_sha1s(args['srcdir'], "src", src_files)
  File "/Users/wsenn/bin/dircmp.py", line 70, in calculate_sha1s
    display_progress(current_progress, src_files_bytes, 50)
  File "/Users/wsenn/bin/dircmp.py", line 165, in display_progress
    per_progress = int((total / curr) * 100)
ZeroDivisionError: division by zero

$ tree dst
dst

0 directories, 0 files
$ tree src
src
└── a

1 directory, 0 files
$ ls -a dst
.		..		.DS_Store	.DS_Store.o
$ ls -a src
.	..	a
$ ls -a src/a
.		..		.empty		.emptytoo
$ 
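
The traceback shows the percentage calculation dividing by zero, which fits a source tree with no non-hidden files and therefore zero bytes to hash. A guard along these lines might be enough; `percent_done` and its parameter names are hypothetical and do not mirror the signature of `display_progress`.

```python
def percent_done(done_bytes, total_bytes):
    """Percentage complete as an int in 0..100.

    When there is nothing to process (total is zero or negative), the
    work is treated as already complete instead of dividing by zero.
    """
    if total_bytes <= 0:
        return 100
    return min(100, int(done_bytes / total_bytes * 100))


print(percent_done(0, 0))   # prints: 100
print(percent_done(5, 10))  # prints: 50
```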

add shallow digest support (sample a predictable 100 megs of larger files)

I'm not sure this is a 100% great idea, but it's worth looking into. Instead of processing files larger than 100 megs completely, grab a 100 meg sample from them and use that to calculate a digest. Seed the random number generator with a magic number to get consistent tables for seek. It should be fairly simple to do and if performance sucks, easy to remove. The hope is that this will make multi-gig files faster to get a digest while maintaining a degree of confidence that the files are the same, if they have the same shallow digest. The thinking being that the shallow digest version can be run frequently on directories that aren't expected to be changing much, and the deep version can be run as needed for increased certainty. May play with the 100meg minimum, and with the amount of sample.

Some considerations

  • buffered, or unbuffered i/o? seems like unbuffered is called for, but we'll see.
  • this is not for tamper detection, but for ad hoc change detection (most of the time, if you aren't trying to tamper with the files, say by bit fiddling, but are systematically editing or replacing the files in chunks or wholesale, this should detect the changes)
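
A rough sketch of the idea, under several assumptions: the function name, the parameters, the seed value, and the choice to mix the file size into the digest are all made up for illustration, and ordinary buffered reads are used. Files at or below the threshold are hashed completely; larger files are sampled at offsets drawn from a seeded generator, so two files of the same size are always sampled at the same places.

```python
import hashlib
import os
import random


def shallow_sha1(path, threshold=100 * 1024 * 1024,
                 sample_bytes=100 * 1024 * 1024,
                 chunk=1024 * 1024, seed=20191210):
    """sha1 of the whole file if small, else of a repeatable sample."""
    size = os.path.getsize(path)
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        if size <= threshold:
            # Small file: hash everything, same as a deep digest.
            for block in iter(lambda: f.read(chunk), b""):
                digest.update(block)
            return digest.hexdigest()
        # Large file: pick chunk-sized slots with a seeded generator so
        # the same size always yields the same offsets.
        n_slots = max(1, size // chunk)
        n_samples = min(max(1, sample_bytes // chunk), n_slots)
        rng = random.Random(seed)
        slots = sorted(rng.sample(range(n_slots), n_samples))
        digest.update(str(size).encode())  # assumption: size is mixed in
        for slot in slots:
            f.seek(slot * chunk)
            digest.update(f.read(chunk))
    return digest.hexdigest()
```

Changes that fall entirely between sampled chunks go undetected, which is the trade-off described above: frequent shallow runs for quick scanning, and a deep run when certainty matters.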

issue with ./dir argument

dircmp -crf ./notes ~/sandboxes/notes

+----------------------------------+
| Welcome to dircmp version 0.7.3  |
| Created by Will Senn on 20191210 |
| Last updated 20210805	 		   |
+----------------------------------+
Arguments: -crf ./notes /Users/wsenn/sandboxes/notes 
Digest: sha1
Source (src): ./notes/
Destination (dst): /Users/wsenn/sandboxes/notes/
Compact mode: True
Single directory mode: False
Show all files: False
Recurse subdirectories: True
Calculate shallow digests: True

Traceback (most recent call last):
  File "/Users/wsenn/bin/dircmp", line 482, in <module>
    [src_files, src_files_bytes, num_src_dirs, num_src_files] = get_files(args['srcdir'], "src")
  File "/Users/wsenn/bin/dircmp", line 345, in get_files
    [num_dirs, num_files, files] = recurse_subdir(dir_to_analyze, args['recurse'], args['all'])
  File "/Users/wsenn/bin/dircmp", line 379, in recurse_subdir
    if tfiles[0] == ".":
IndexError: list index out of range
