akamhy / videohash
Near Duplicate Video Detection (Perceptual Video Hashing) - Get a 64-bit comparable hash-value for any video.
Home Page: https://pypi.org/project/videohash
License: MIT License
Translation of the Hindi language written in the Latin (Roman) alphabet into English is:
The third wave of covid-19 is here
Again* I will pass the tests without studying**.
*
(exams were canceled in the past year)
**
(because of online tests and cheating)
If you are good at Python please write a script that would download the latest FFmpeg from https://www.gyan.dev/ffmpeg/builds/ffmpeg-git-full.7z
Uncompress the archive.
Copy the bin directory from the decompressed folder, and paste inside C:\Program Files\ffmpeg.
Add C:\Program Files\ffmpeg\bin\ to the Environment Variables.
See https://github.com/akamhy/videohash/wiki/Install-FFmpeg,-but-how%3F#install-ffmpeg-on-windows.
I've written a script for testing on windows, you may find it useful. Link : https://github.com/akamhy/videohash/blob/main/assets/windows_ffmpeg_downloader_at_cwd.py
Python Imaging Library (Fork)
Library home page: https://files.pythonhosted.org/packages/12/ad/61f8dfba88c4e56196bf6d056cdbba64dc9c5dfdfbc97d02e6472feed913/Pillow-6.2.2-cp27-cp27mu-manylinux1_x86_64.whl
Path to dependency file: videohash/requirements.txt
Path to vulnerable library: videohash/requirements.txt
Dependency Hierarchy:
Found in HEAD commit: 775d08735341d5bb435fcc1501160e1d150f31e0
Found in base branch: main
In libImaging/PcxDecode.c in Pillow before 7.1.0, an out-of-bounds read can occur when reading PCX files where state->shuffle is instructed to read beyond state->buffer.
Publish Date: 2020-06-25
URL: CVE-2020-10378
Base Score Metrics:
Type: Upgrade version
Origin: python-pillow/Pillow@41b554b
Release Date: 2020-06-25
Fix Resolution: 7.1.0
Step up your Open Source Security Game with WhiteSource here
Hello, everyone. Thank you for an excellent library!
Would it be possible to add an official command-line interface? Something like:
videohash_cmdline.bash video1 video0
99%
?
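A minimal sketch of what such a command-line wrapper could look like. It assumes that VideoHash accepts a `path` keyword and that subtracting two VideoHash objects yields the number of differing bits (as in the `videohash1 - videohash2` session further down this page). The names `similarity_percent` and `main` are hypothetical and not part of the library.

```python
import argparse


def similarity_percent(differing_bits: int, total_bits: int = 64) -> float:
    """Convert a bit distance between two hashes into a similarity percentage."""
    return 100.0 * (total_bits - differing_bits) / total_bits


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Compare two videos by perceptual hash.")
    parser.add_argument("video_a")
    parser.add_argument("video_b")
    args = parser.parse_args(argv)

    # Imported lazily so the helper above works without videohash installed.
    from videohash import VideoHash

    # Assumption: `path` selects a local file and `-` returns the bit distance.
    distance = VideoHash(path=args.video_a) - VideoHash(path=args.video_b)
    print(f"{similarity_percent(distance):.0f}%")
    return 0


# A distance of 4 bits out of 64 corresponds to 93.75% similarity.
print(similarity_percent(4))
```

Calling `main(["video1.mp4", "video0.mp4"])` would then print a percentage in the spirit of the `99%` output requested above.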
To Reproduce
Generate videohash of a video directly from its path
url1 = "C:\Users\PCNAME\Documents\myapp\static_media\vid1.mp4"
videohash1 = VideoHash(url=url1)
url2 = "C:\Users\PCNAME\Documents\myapp\static_media\vid2.mp4"
videohash2 = VideoHash(url=url2)
Expected behavior
I want to generate the videohash of these videos directly from their paths.
videohash1.is_similar(videohash2)
False
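Two things in the snippet above may be worth checking. In Python 3 a normal string literal containing `\U` is rejected as a malformed Unicode escape, so a raw string (or forward slashes) is needed for Windows paths. The sketch below also assumes that VideoHash takes local files through a `path` keyword rather than `url`; `hash_local_video` is a hypothetical helper.

```python
from pathlib import PureWindowsPath

# A raw string keeps the backslashes literal; without the r prefix, "\U"
# would start a Unicode escape and the literal would not compile.
video_path = r"C:\Users\PCNAME\Documents\myapp\static_media\vid1.mp4"


def hash_local_video(path: str):
    """Hash a local file. Assumes VideoHash accepts a `path` keyword."""
    # Imported lazily: requires videohash and FFmpeg to be installed.
    from videohash import VideoHash

    return VideoHash(path=path)


# The path parses as expected, so the file name can be read back.
print(PureWindowsPath(video_path).name)
```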
When running videohash as part of a program that has also used subprocess, it seems to inherit the stdin, and that can result in various failures for ffmpeg.
I have been documenting it here:
digitalmethodsinitiative/4cat#303 (comment)
Essentially, I can use videohash alone, but not with additional subprocesses unless I edit it and provide it with stdin=subprocess.DEVNULL, since the default stdin is in use.
Sending a PR shortly with needed edit.
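The general idea of the edit can be sketched as follows. This is not the actual patch: `run_detached_from_stdin` is a hypothetical helper, and a small Python child process stands in for ffmpeg so the example runs anywhere.

```python
import subprocess
import sys


def run_detached_from_stdin(command):
    """Run a command whose stdin is the null device, not the parent's stdin."""
    return subprocess.run(
        command,
        stdin=subprocess.DEVNULL,   # the child cannot consume the parent's input
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        check=False,
    )


# Stand-in child: reports how many characters it could read from stdin.
child = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
result = run_detached_from_stdin(child)
print(result.stdout.decode().strip())
```

With stdin redirected to the null device, the child sees an immediate end of input instead of competing with other subprocesses for the parent's stdin.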
Describe the bug
Hash collision for some videos of same length
To Reproduce
v1 = VideoHash(url="https://canvaz.scdn.co/upload/artist/3PhoLpVuITZKcymswpck5b/video/5e966e9c01f147cdae93a02c61a4bf7c.cnvs.mp4")
v2 = VideoHash(url="https://canvaz.scdn.co/upload/licensor/7JGwF0zhX9oItt9901OvB5/video/dc047df48f774d1590b61fd38bc082e4.cnvs.mp4")
print(v1 == v2)
Expected behavior
The hash should be different but is same.
Screenshots
NA
Please complete the following information:
Additional context
The issue can probably be resolved by extracting more features, such as brightness levels or the most dominant colors of frames extracted at a specific FPS, and increasing the number of hash bits to accommodate the extra data.
OR
Why not use colorhash + whash and change the bit size to 128 (twice the current size of 64)? They generate hashes in very different ways, so collisions should be highly unlikely.
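A rough sketch of the second idea, using plain integers in place of real colorhash and whash values. `combine_hashes` and `hamming_distance` are hypothetical helpers, and the 64-bit width of each component is an assumption made for illustration.

```python
def combine_hashes(first: int, second: int, bits_each: int = 64) -> int:
    """Concatenate two fixed-width hashes into one value of twice the width."""
    mask = (1 << bits_each) - 1
    return ((first & mask) << bits_each) | (second & mask)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


# Two videos that collide on the first component but differ on the second
# are still told apart by the combined 128-bit value.
video_a = combine_hashes(0xFFFF, 0x0F)
video_b = combine_hashes(0xFFFF, 0xF0)
print(hamming_distance(video_a, video_b))
```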
Call a method on the VideoHash object to delete the temp files created by the instance.
The temp folder (or cache folder on Mac) gets increasingly bigger when processing a lot of videos at the same time.
It SHOULD clean up after itself, and not leave the working/temp files there.
Currently I have to do it manually when processing ~20k files.
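Until the library cleans up on its own, one workaround is to give each hash its own scratch directory and remove it afterwards. The sketch below assumes that VideoHash accepts a `storage_path` keyword for its working files and exposes the result as `hash_hex`; both should be checked against the installed version. `scratch_storage` and `hash_hex_with_cleanup` are hypothetical helpers.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_storage(prefix="videohash_"):
    """Yield a private directory and remove it, with its contents, on exit."""
    with tempfile.TemporaryDirectory(prefix=prefix) as directory:
        yield directory


def hash_hex_with_cleanup(video_path: str) -> str:
    """Hash one video and leave no working files behind."""
    # Imported lazily: requires videohash and FFmpeg to be installed.
    from videohash import VideoHash

    with scratch_storage() as storage:
        # Assumption: storage_path controls where working files are written.
        return VideoHash(path=video_path, storage_path=storage).hash_hex


# The scratch directory exists only inside the with-block.
with scratch_storage() as demo_dir:
    existed_inside = os.path.isdir(demo_dir)
print(existed_inside, os.path.exists(demo_dir))
```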
Describe the bug
Hash collision occurs with videos of the same length and with similar colour schemes.
To Reproduce
v1 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008752-da1f09c7-a177-4a46-9c64-230744e998c1.mp4')
v2 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008748-b8922142-37cc-48a0-bad9-1385ba016587.mov')
print (v1 == v2)
Expected behavior
The hashes of the videos should be different.
Screenshots
NA
Please complete the following information:
Additional context
I am a non-native speaker and I suck at formal grammar. If you are a native speaker or just good at writing doc strings/comments/copy editing please open a pull request. Language must be formal, don't add any jokes or slang.
Thank you!
Problem:
When I try to get the hash of a long video (about 90 min), errors are raised while making the tile:
"encoder error -2 when writing image file" and
"Maximum supported image dimension is 65500 pixels".
My solution now:
Rather than just changing frame_interval to shorten the width, I looked for the cause and found that it might be a limit of the JPEG format.
So the solution I am using is to change the default output of the function "make_tile" in "tilemaker.py" to PNG format:
save_tiles(tiles, prefix="tile", directory=tiles_dir, file_format="png")
Suggestion:
Now there is no error. However, I am not sure whether this affects the hash value (compared with the one generated using the JPEG format) or conflicts with any other part of this library.
Since the default value of the "file_format" parameter in the function "save_tiles" is already png, I am confused about why the jpeg format is explicitly given in the function "make_tile".
If there is no other problem, maybe using png as the default in "make_tile" would be better for long videos?
Thank you!
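A middle ground could be to keep JPEG for ordinary videos and switch to PNG only when the image would exceed the limit quoted in the error message. The sketch below is an illustration only: `pick_tile_format` is a hypothetical helper, and the frame width of 144 pixels and the single-row layout are made-up numbers, not the library's actual tile geometry.

```python
# Limit quoted in the error message above.
JPEG_MAX_DIMENSION = 65500


def pick_tile_format(width: int, height: int) -> str:
    """Choose an image format that can hold an image of the given size."""
    if max(width, height) > JPEG_MAX_DIMENSION:
        return "png"
    return "jpeg"


# Hypothetical example: 90 minutes at one frame per second, frames 144 px
# wide and laid out in a single row.
frames = 90 * 60
print(pick_tile_format(frames * 144, 144))
```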
Python Imaging Library (Fork)
Library home page: https://files.pythonhosted.org/packages/12/ad/61f8dfba88c4e56196bf6d056cdbba64dc9c5dfdfbc97d02e6472feed913/Pillow-6.2.2-cp27-cp27mu-manylinux1_x86_64.whl
Path to dependency file: videohash/requirements.txt
Path to vulnerable library: videohash/requirements.txt
Dependency Hierarchy:
Found in HEAD commit: 775d08735341d5bb435fcc1501160e1d150f31e0
Found in base branch: main
In Pillow before 8.1.0, PcxDecode has a buffer over-read when decoding a crafted PCX file because the user-supplied stride value is trusted for buffer calculations.
Publish Date: 2021-01-12
URL: CVE-2020-35653
Base Score Metrics:
Type: Upgrade version
Origin: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-35653
Release Date: 2021-01-12
Fix Resolution: 8.1.0
Python Imaging Library (Fork)
Library home page: https://files.pythonhosted.org/packages/12/ad/61f8dfba88c4e56196bf6d056cdbba64dc9c5dfdfbc97d02e6472feed913/Pillow-6.2.2-cp27-cp27mu-manylinux1_x86_64.whl
Path to dependency file: videohash/requirements.txt
Path to vulnerable library: videohash/requirements.txt
Dependency Hierarchy:
Found in HEAD commit: 775d08735341d5bb435fcc1501160e1d150f31e0
Found in base branch: main
Pillow before 7.1.0 has multiple out-of-bounds reads in libImaging/FliDecode.c.
Publish Date: 2020-06-25
URL: CVE-2020-10177
Base Score Metrics:
Type: Upgrade version
Origin: python-pillow/Pillow@41b554b
Release Date: 2020-06-25
Fix Resolution: 7.1.0
Python Imaging Library (Fork)
Library home page: https://files.pythonhosted.org/packages/12/ad/61f8dfba88c4e56196bf6d056cdbba64dc9c5dfdfbc97d02e6472feed913/Pillow-6.2.2-cp27-cp27mu-manylinux1_x86_64.whl
Path to dependency file: videohash/requirements.txt
Path to vulnerable library: videohash/requirements.txt
Dependency Hierarchy:
Found in HEAD commit: 775d08735341d5bb435fcc1501160e1d150f31e0
Found in base branch: main
In libImaging/Jpeg2KDecode.c in Pillow before 7.1.0, there are multiple out-of-bounds reads via a crafted JP2 file.
Publish Date: 2020-06-25
URL: CVE-2020-10994
Base Score Metrics:
Type: Upgrade version
Origin: python-pillow/Pillow@41b554b
Release Date: 2020-06-25
Fix Resolution: 7.1.0
Python Imaging Library (Fork)
Library home page: https://files.pythonhosted.org/packages/12/ad/61f8dfba88c4e56196bf6d056cdbba64dc9c5dfdfbc97d02e6472feed913/Pillow-6.2.2-cp27-cp27mu-manylinux1_x86_64.whl
Path to dependency file: videohash/requirements.txt
Path to vulnerable library: videohash/requirements.txt
Dependency Hierarchy:
Found in HEAD commit: 775d08735341d5bb435fcc1501160e1d150f31e0
Found in base branch: main
In Pillow before 8.1.0, TiffDecode has a heap-based buffer overflow when decoding crafted YCbCr files because of certain interpretation conflicts with LibTIFF in RGBA mode.
Publish Date: 2021-01-12
URL: CVE-2020-35654
Base Score Metrics:
Type: Upgrade version
Origin: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-35654
Release Date: 2021-01-12
Fix Resolution: 8.1.0
Describe the bug
VideoHash results in:
AttributeError: module 'PIL.Image' has no attribute 'ANTIALIAS'
This is due to Pillow v10+ having removed the deprecated ANTIALIAS constant; the fix appears to be using LANCZOS or pinning Pillow < 10:
https://pillow.readthedocs.io/en/stable/releasenotes/10.0.0.html#constants
To Reproduce
Install via pip with PIL and pillow from brew (currently 10.0.0)
VideoHash("path/to.mp4")
Expected behavior
Object with .hash and no AttributeError
Screenshots
N/A
Please complete the following information:
Additional context
I think this would be a good first issue for another contributor. Should I attempt a PR?
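One way such a fix could look is a small lookup that works on both old and new Pillow releases. This is a sketch, not the change that was merged: `lanczos_filter` is a hypothetical helper, and stand-in objects are used instead of the real `PIL.Image` module so that the example runs without Pillow installed.

```python
from types import SimpleNamespace


def lanczos_filter(image_module):
    """Return the LANCZOS resampling constant from a PIL.Image-like module."""
    # Newer Pillow releases group the filters under Image.Resampling.
    resampling = getattr(image_module, "Resampling", None)
    if resampling is not None:
        return resampling.LANCZOS
    # Older releases only expose the module-level constant.
    return image_module.LANCZOS


# Stand-ins for the old and the new module layout.
old_style = SimpleNamespace(LANCZOS=1, ANTIALIAS=1)
new_style = SimpleNamespace(LANCZOS=1, Resampling=SimpleNamespace(LANCZOS="enum-member"))

print(lanczos_filter(old_style), lanczos_filter(new_style))
```

With the real library this would be used as `image.resize(size, lanczos_filter(Image))` after `from PIL import Image`, in place of the removed `Image.ANTIALIAS`.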
Currently, we are experiencing a high number of false positives when utilizing this library. In our scenario, approximately 70% of the results are false positives, which significantly impacts the accuracy of our application.
To address this issue, I suggest using the following prechecks before using the library:
Preprocessing based on video length: Consider incorporating a preprocessing step that filters out videos with durations of less than 1 minute. This criterion can help eliminate irrelevant and short-duration videos, which often contribute to false positive matches.
Similarity threshold adjustment: Modify the similarity threshold used by the library to make it more stringent. By increasing the threshold, the library will only consider videos with a higher degree of similarity, reducing the occurrence of false positives. This adjustment can significantly improve the precision of the matching process.
Comparison of video durations: Introduce a comparison mechanism that checks the proximity of video durations when assessing similarity. This step would ensure that two videos are not considered similar if their durations differ significantly. By including this additional criterion, we can reduce the occurrence of false positives caused by videos with vastly different lengths.
Still, thanks to the author for providing this library for low-cost comparison. If you are using it in a very serious scenario, I would suggest using it like a Bloom filter: run a more intensive algorithm after a positive result.
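The three prechecks above can be sketched as a single gate in front of the hash comparison. Everything here is illustrative: `durations_close` and `likely_duplicates` are hypothetical helpers, and the 10% tolerance, 60 second minimum and 5 bit distance are placeholder values, not recommendations from the library.

```python
def durations_close(seconds_a: float, seconds_b: float, tolerance: float = 0.1) -> bool:
    """True when the two durations differ by at most `tolerance` of the longer one."""
    longer = max(seconds_a, seconds_b)
    if longer <= 0:
        return False
    return abs(seconds_a - seconds_b) / longer <= tolerance


def likely_duplicates(
    seconds_a: float,
    seconds_b: float,
    bit_distance: int,
    min_seconds: float = 60.0,
    max_distance: int = 5,
) -> bool:
    """Combine the duration prechecks with a stricter hash-distance threshold."""
    # Precheck 1: skip videos shorter than the minimum length.
    if min(seconds_a, seconds_b) < min_seconds:
        return False
    # Precheck 3: durations must be close to each other.
    if not durations_close(seconds_a, seconds_b):
        return False
    # Precheck 2: only a small number of differing bits counts as a match.
    return bit_distance <= max_distance


print(likely_duplicates(120, 125, 3))   # similar length, close hashes
print(likely_duplicates(120, 300, 0))   # identical hashes, very different length
```

A positive answer from this gate would still be treated as a candidate only, to be confirmed by a more expensive comparison.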
Would modifying similar_percentage help? If so, which direction should I go?
We want to support comparing images and videos against the hash. We can store the hashes of all the frames in memory and check for the lowest score or a user-defined score.
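A minimal sketch of that lookup, assuming the image and every frame have already been reduced to integer hashes of the same width. `hamming_distance` and `best_frame_match` are hypothetical helpers, and the 4-bit values are toy data.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def best_frame_match(image_hash, frame_hashes, max_distance=None):
    """Return (frame index, distance) of the closest frame, or None.

    None is returned when there are no frames, or when the closest frame is
    further away than the user-defined `max_distance`.
    """
    if not frame_hashes:
        return None
    index, distance = min(
        ((i, hamming_distance(image_hash, h)) for i, h in enumerate(frame_hashes)),
        key=lambda pair: pair[1],
    )
    if max_distance is not None and distance > max_distance:
        return None
    return index, distance


frame_hashes = [0b1111, 0b1010, 0b0000]   # toy per-frame hashes kept in memory
print(best_frame_match(0b1011, frame_hashes))
```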
Python Imaging Library (Fork)
Library home page: https://files.pythonhosted.org/packages/12/ad/61f8dfba88c4e56196bf6d056cdbba64dc9c5dfdfbc97d02e6472feed913/Pillow-6.2.2-cp27-cp27mu-manylinux1_x86_64.whl
Path to dependency file: videohash/requirements.txt
Path to vulnerable library: videohash/requirements.txt
Dependency Hierarchy:
Found in HEAD commit: 775d08735341d5bb435fcc1501160e1d150f31e0
Found in base branch: main
In Pillow before 8.1.0, SGIRleDecode has a 4-byte buffer over-read when decoding crafted SGI RLE image files because offsets and length tables are mishandled.
Publish Date: 2021-01-12
URL: CVE-2020-35655
Base Score Metrics:
Type: Upgrade version
Origin: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-35655
Release Date: 2021-01-12
Fix Resolution: 8.1.0
Whether to set the worst flag or not (youtube-dl/yt-dlp feature).
Please design a vector logo for this project.
The logo must contain the project name 'videohash'.
You must release the logo under the MIT License (same license as the project).
I want the output in SVG and PNG formats. The logo background must be transparent and the logo should look professional.
Examples of logos of some real projects that I like. The new logo can be of a similar design but MUST NOT be exactly copied.
Both the PNG and SVG formats should be inside the assets directory in the pull request.
Thank you!
This issue is not gonna get assigned before you open a pull request but the one I like the most is gonna be selected. Please open a pull request only if you are good at making vector logos.
Describe the bug
The download fails on reddit.
To Reproduce
less than or equal to v2.1.7
Python 3.9.0 (default, Oct 21 2021, 15:27:22)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> url1 = "https://www.reddit.com/r/IndianDankMemes/comments/rn2yxa/ha_bhai_normi_hu_mai/"
>>> from videohash import VideoHash
>>> url1 = "https://www.reddit.com/r/IndianDankMemes/comments/rn2yxa/ha_bhai_normi_hu_mai/"
>>> url2 = "https://www.reddit.com/r/IndianDankMemes/comments/rmw1o9/i_am_happy_i_am_happy_i_am_happi_today/"
>>> videohash1 = VideoHash(url=url1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/akamhy/projects/benchmark_videohash/venv/lib/python3.9/site-packages/videohash/videohash.py", line 85, in __init__
    self._copy_video_to_video_dir()
  File "/home/akamhy/projects/benchmark_videohash/venv/lib/python3.9/site-packages/videohash/videohash.py", line 288, in _copy_video_to_video_dir
    Download(
  File "/home/akamhy/projects/benchmark_videohash/venv/lib/python3.9/site-packages/videohash/downloader.py", line 51, in __init__
    self.download_video()
  File "/home/akamhy/projects/benchmark_videohash/venv/lib/python3.9/site-packages/videohash/downloader.py", line 85, in download_video
    raise DownloadFailed(
videohash.exceptions.DownloadFailed: '/home/akamhy/projects/benchmark_videohash/venv/bin/yt-dlp' failed to download the video at 'https://www.reddit.com/r/IndianDankMemes/comments/rn2yxa/ha_bhai_normi_hu_mai/'.
[Reddit] rn2yxa: Downloading JSON metadata
[Reddit] rn2yxa: Downloading m3u8 information
[Reddit] rn2yxa: Downloading MPD manifest
ERROR: [Reddit] k4nqp99cdc781: Requested format is not available
>>> videohash1 = VideoHash(url=url1, download_worst=False)
>>> videohash2 = VideoHash(url=url2, download_worst=False)
>>> videohash1 - videohash2
4
>>>
Expected behavior
Download the video without any extra arguments.
Please complete the following information:
Additional context
I don't use Reddit but a friend of mine was using videohash to search posts by templates.
Both the URLs use the same template.
Should probably use [ffmpeg.git] / libavfilter /vf_cropdetect.c
or just use python to crop the frames post extraction.
The black bars aren't an issue if they occupy less than 15% of the area but they are quite problematic if the area occupied is more than 15%. The issue seems fixable.
When using the from_path method of hashing a video, if the path to the video contains any number of spaces, it will break the ffmpeg commands given to subprocess.Popen. This is because:
The command given to subprocess.Popen is split on spaces before being interpreted, forcing the system to interpret different parts of the path as new arguments.
This is easily fixed by inserting escaped quotation marks around any paths in the ffmpeg commands, dropping the .split() operation on the command, and setting shell=True in Popen.
I've taken the liberty of including the updated functions here:
from os.path import join
from subprocess import DEVNULL, STDOUT, Popen


def frames(input_file, output_prefix):
    """Extract the frames of the video.

    Export frames as images at output_prefix as a 7 digit padded jpeg file.
    """
    # Both paths are quoted so that spaces survive shell parsing.
    command = 'ffmpeg -i "{input_file}" -r 1 "{output_prefix}_%07d.jpeg"'.format(
        input_file=input_file, output_prefix=output_prefix
    )
    process = Popen(command, shell=True, stdout=DEVNULL, stderr=STDOUT)
    output, error = process.communicate()


def compressor(input_file, task_dir, task_uid):
    # APPLY : ffmpeg -i input.webm -s 64x64 -r 30 output.mp4
    output_file = join(task_dir, task_uid + "compressed.mp4")
    # Both paths are quoted so that spaces survive shell parsing.
    command = 'ffmpeg -i "{input_file}" -s 64x64 -r 30 "{output_file}"'.format(
        input_file=input_file, output_file=output_file
    )
    process = Popen(command, shell=True, stdout=DEVNULL, stderr=STDOUT)
    output, error = process.communicate()
    return output_file
Hope you find this useful! Thanks for the great module!
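An alternative to quoting plus shell=True is to pass the command to Popen as a list, so that each path stays one argument regardless of spaces or quote characters in it. This is a sketch of that variant and not the code that was merged; `frames_command` and `extract_frames` are hypothetical names.

```python
from subprocess import DEVNULL, STDOUT, Popen


def frames_command(input_file: str, output_prefix: str) -> list:
    """Build the ffmpeg frame-extraction command as an argument list."""
    return ["ffmpeg", "-i", input_file, "-r", "1", output_prefix + "_%07d.jpeg"]


def extract_frames(input_file: str, output_prefix: str) -> int:
    """Run ffmpeg without a shell and return its exit code (needs ffmpeg on PATH)."""
    process = Popen(frames_command(input_file, output_prefix), stdout=DEVNULL, stderr=STDOUT)
    process.communicate()
    return process.returncode


# A path with spaces remains a single element of the argument list.
command = frames_command("/tmp/my video.mp4", "/tmp/out dir/frame")
print(command[2])
```

Because no shell is involved, nothing in a file name can be interpreted as shell syntax.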
Python Imaging Library (Fork)
Library home page: https://files.pythonhosted.org/packages/12/ad/61f8dfba88c4e56196bf6d056cdbba64dc9c5dfdfbc97d02e6472feed913/Pillow-6.2.2-cp27-cp27mu-manylinux1_x86_64.whl
Path to dependency file: videohash/requirements.txt
Path to vulnerable library: videohash/requirements.txt
Dependency Hierarchy:
Found in HEAD commit: 775d08735341d5bb435fcc1501160e1d150f31e0
Found in base branch: main
In libImaging/SgiRleDecode.c in Pillow through 7.0.0, a number of out-of-bounds reads exist in the parsing of SGI image files, a different issue than CVE-2020-5311.
Publish Date: 2020-06-25
URL: CVE-2020-11538
Base Score Metrics:
Type: Upgrade version
Origin: python-pillow/Pillow@41b554b
Release Date: 2020-06-25
Fix Resolution: 7.1.0
On shared OS with limited access, we might not be able to access the CLI if it's in a venv.
My use case for this software would only be in needing to compare the hash of the first few seconds of video for hundreds of files of varying lengths. The reason for this is part of a classification task, i.e. I have a lot of files and want to classify them based on the contents of the first few seconds.
I could create a script which trims the videos all to 2-3 seconds long, then use videohash on those clips, then relate those results back to their original clips, but it would be great if videohash could handle all of this for me.
What I imagine would be something like having a max_frames parameter added to the VideoHash function,
e.g. videohash.VideoHash(..., frame_interval=0.2, max_frames=10) would provide me a hash based on 10 frames from the first ~2 seconds of video.
I could also see setting a time range being handy instead, e.g. start_time: '2:00', end_time: '2:30' would hash only that 30 second clip from the video. This would solve my use case but also be a more general solution for other use cases, though I think it may be a little more nuanced to implement than the first proposal.
Interested to hear the maintainers' thoughts on this, as I might be able to tackle a solution if there's interest.
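The trimming workaround mentioned above could be sketched like this: build an ffmpeg command that cuts the requested time range, then hash the resulting clip. `parse_timestamp` and `trim_command` are hypothetical helpers, not part of videohash; only the standard ffmpeg options `-ss` (start position) and `-t` (duration) are used.

```python
def parse_timestamp(text: str) -> int:
    """Convert "SS", "MM:SS" or "HH:MM:SS" into whole seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def trim_command(input_file: str, output_file: str, start_time: str, end_time: str) -> list:
    """Build an ffmpeg argument list that keeps only [start_time, end_time)."""
    start = parse_timestamp(start_time)
    duration = parse_timestamp(end_time) - start
    if duration <= 0:
        raise ValueError("end_time must be later than start_time")
    return ["ffmpeg", "-ss", str(start), "-i", input_file, "-t", str(duration), output_file]


# The 30 second range from the example above.
print(trim_command("input.mp4", "clip.mp4", "2:00", "2:30"))
```

The list can be passed to `subprocess.run`, and the trimmed clip can then be hashed like any other local file.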
for same length video it's great, but we should probably make it length agnostic
This just works, maybe ffmpeg ain't consistent in frame extraction for different containers and codecs.
It would be great to expose access to the frame interval. I quite often work with very long or very short videos, and it would be great to specify the frame interval to check. Alternatively, it would help to be able to select a total number of frames and have that number of frames randomly selected from across the video.
I'm trying to hash some videos and save the hashes to a DB, then check a new video against them.
I hope there can be a Hash object representing the video hash that can be serialized to and deserialized from a string or bytes, instead of the current VideoHash, which can only be calculated from a video source.
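In the meantime, a 64-bit hash can be stored as a plain integer, a hexadecimal string or 8 bytes and compared later without the original video. The helpers below are a hypothetical sketch and not part of videohash; how the integer is obtained from a VideoHash object is left out on purpose.

```python
def hash_to_hex(value: int, bits: int = 64) -> str:
    """Serialize a hash as a zero-padded hexadecimal string."""
    return format(value, "0{}x".format(bits // 4))


def hash_from_hex(text: str) -> int:
    """Restore a hash from its hexadecimal string."""
    return int(text, 16)


def hash_to_bytes(value: int, bits: int = 64) -> bytes:
    """Serialize a hash as big-endian bytes."""
    return value.to_bytes(bits // 8, "big")


def hash_from_bytes(data: bytes) -> int:
    """Restore a hash from big-endian bytes."""
    return int.from_bytes(data, "big")


def bit_distance(a: int, b: int) -> int:
    """Number of differing bits between two restored hashes."""
    return bin(a ^ b).count("1")


stored = hash_to_hex(255)          # value written to the database
print(stored, bit_distance(hash_from_hex(stored), 254))
```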
Describe the bug
It takes quite a while to hash a video.
To Reproduce
from videohash import VideoHash
import time
start = time.time()
url = 'https://user-images.githubusercontent.com/47534140/185008752-da1f09c7-a177-4a46-9c64-230744e998c1.mp4'
v1 = VideoHash(url=url, frame_interval=12)
print(f"Finished in {time.time() - start} secs")
Expected behavior
It should realistically be doable in under a second
Please complete the following information:
Additional context
Currently takes about 3 to 4 seconds
Python Imaging Library (Fork)
Library home page: https://files.pythonhosted.org/packages/12/ad/61f8dfba88c4e56196bf6d056cdbba64dc9c5dfdfbc97d02e6472feed913/Pillow-6.2.2-cp27-cp27mu-manylinux1_x86_64.whl
Path to dependency file: videohash/requirements.txt
Path to vulnerable library: videohash/requirements.txt
Dependency Hierarchy:
Found in HEAD commit: 775d08735341d5bb435fcc1501160e1d150f31e0
Found in base branch: main
In Pillow before 7.1.0, there are two Buffer Overflows in libImaging/TiffDecode.c.
Publish Date: 2020-06-25
URL: CVE-2020-10379
Base Score Metrics:
Type: Upgrade version
Origin: python-pillow/Pillow@41b554b
Release Date: 2020-06-25
Fix Resolution: 7.1.0
BUG REPORT
The videohash function is_similar returns True even when the videos are different.
NOTES:
One thing I noticed is that the is_similar function does not seem to be correctly implemented. It sometimes returns True even though the videos are completely different. I expect is_similar to return True only for videos that share at least some common characteristics, such as video length.
How are two videos supposed to be equal if their lengths are not even similar? I would add a check prior to generating the hash value.
So the videohash code should also take video length into consideration when comparing two videos for similarity.