Comments (145)

78Alpha commented on May 23, 2024

Updated my project at last! Testing is much easier and also harder now... Everything is done over a GUI, but that can easily be altered. It has some limits: it is meant to deal with single, large files, and it acts unpredictably with multiple data inputs. While going through that process, I found that whenever I optimized for one, the other would cease to function, so it was either many files or big archives... I chose the one I used most...

Also, I added file hashing, so now I know the input and output are actually the same without having to checksum from the terminal. It uses SHA-256, but it will not produce the same checksum as any other SHA-256 utility; I kept this in for a very unlikely security reason. Time to archive whatever random junk I can find!

Sadly the previous 1.1.0 I was using made corrupted images...

78Alpha commented on May 23, 2024

With the image approach, wouldn't you be able to input random garbled data into a PNG-wrapped file and just upload it to Google Photos? A multi-gigabyte photo may be odd, but it could work.

DavidBerdik commented on May 23, 2024

I am not so sure that trusting OCR with this is a good idea. One thing that might be worth looking into regarding images, though, would be trying to pack data in a JPG.

https://twitter.com/David3141593/status/1057042085029822464?s=19

https://www.google.com/amp/s/www.theverge.com/platform/amp/2018/11/1/18051514/twitter-image-steganography-shakespeare-unzip-me

Update: Apparently the source code is available now. I might play with it this weekend if time permits. - https://twitter.com/David3141593/status/1057609354403287040?s=19

Zibri commented on May 23, 2024

You've got that error for only one reason... an RGB24 image is 3 bytes × WIDTH × HEIGHT.
So your file must be a multiple of that frame size, OR you can change the width and height so that filesize mod (W × H × 3) = 0.
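
A minimal sketch of that padding step in Python, assuming zero-byte padding at the tail is acceptable (file name and dimensions are hypothetical):

```python
import os

WIDTH, HEIGHT = 1024, 1024       # hypothetical frame dimensions
FRAME_SIZE = WIDTH * HEIGHT * 3  # rgb24: 3 bytes per pixel

# Append zero bytes until the file size is a multiple of the frame size.
size = os.path.getsize("test.raw")
padding = (-size) % FRAME_SIZE
if padding:
    with open("test.raw", "ab") as f:
        f.write(b"\x00" * padding)
```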

78Alpha commented on May 23, 2024

The statements below are about offline testing; Docs is not involved until DOCX gets mentioned...

Did a little testing, and @Zibri 's method worked without flaw; in fact, it produced a slightly more compressed file.

Started with a WAV file (the test file), changed it to RAW, then did some math to get a frame size close to the file size, as the size that should have equated to the file size wasn't working... Appended a bunch of zero bytes, then did the ffmpeg process of converting to PNG...

The PNG was slightly smaller than the original data, so it was compressed somewhat, BUT it didn't lose the data, it seems.

After using the reverse process, the file came out the same size and played the same song. To double-verify, I extended the script to remove the junk data I had added, ran sha256sum on both files, and they matched up nicely.

I uploaded the image to Google Docs and downloaded it as a DOCX file; the image in the DOCX was only 70 bytes... so, substantial loss... However, I have a theory as to why: the unlimited image storage applies to these images too, and since mine had a width of several thousand pixels, it was likely heavily compressed, as that didn't happen in the other test, which used a 1024x1024 image size.

So it would have a limit of 4920×3264 = 16,058,880 pixels; with the 3-bytes-per-pixel thing, it could store maybe 45 MB per image, but that needs confirmation.
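
A quick check of that arithmetic, assuming 3 bytes per pixel at the 16 MP cap:

```python
# 16 MP cap at 3 bytes per pixel (rgb24)
pixels = 4920 * 3264           # 16,058,880 pixels
raw_bytes = pixels * 3         # 48,176,640 bytes
print(raw_bytes / 1024 ** 2)   # ~45.9 MB of raw data per image
```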

Has anyone downloaded a doc with an image through the API to see if it works? Probably a silly question, but a 3 am thought...

https://send.firefox.com/download/734144b66b656e55/#y2fH9f_Ca76BsICt_dDFmQ

The offline test, if someone wants to look at the mess of a test.

DavidBerdik commented on May 23, 2024

You were testing using Google Photos, right? Did you try putting altered images in a Word document, uploading it to Drive, and then converting? Perhaps that is doable?

Also, this is off topic, but I want to point out that I am not the same David who wrote the "zips in jpg" thing. I wish I was though. 😊

DavidBerdik commented on May 23, 2024

> Yes, I will make it at least 10 FPS, or probably more for bigger files.
> Higher FPS generates smaller video files as well, so it helps a lot.

Be careful not to go for too many FPS. If you go too high, you could risk losing frames depending on where you choose to store the video. I am sure you already know this, but I figured I would mention it for anyone else who is interested in trying your approach and hasn't given that any consideration.

> I only use a generic 54-byte header; I don't see much of a need to generate a unique header for each file.
> Yes, use Python instead.
> It makes things so much easier and quicker.
> Without their libraries, I would have to use ImageMagick in my approach, which is much, much slower.

@digicannon, the author of the bitmap generator we were trying to use, gave me permission to share it. Here it is: https://gist.github.com/DavidBerdik/290684facd7fb25a9775d30ff0cbdf52

I think he was attempting to use a generic header as well. Could you verify that? Or identify what is wrong otherwise?

> No, I have tried that and it doesn't help a thing; in fact, it makes things worse.
> You can't make the text too large or too small, otherwise the scan will generate more mistakes.
> Anyway, this OCR approach is very slow and inefficient.
> Not only can you not put much text on a single image, because it will only make the scanning slower and generate more mistakes, but you also can't store much data in a single image.
> At best you can only store 400 bytes of data in an image, compared to my approach, which can store more than 50KB of data, and more if you make bigger images.

I assume you have tried testing with different fonts?

> For this MP3 file, it took less than 3 minutes to make all the images and put them into a DOCX.
> I have tested it on Windows 7 running on an X250.
> The process should be way faster on a much more powerful desktop machine.
> So far it is a decent working approach that acts as my last backup when others fail.
> One thing is sure: running it on Win10 is a lot slower, and I don't know why.

Not bad! How's the upload speed? Also, are you generating BMPs or PNGs for insertion? As for the speed difference between 7 and 10, how different are the systems you are using for testing?

> Look at what I found: someone wrote a utility in C to convert binary data to PNG 6 years ago, and it is very fast, much faster than mine and any other methods I have tested.
> https://github.com/leeroybrun/Bin2PNG
>
> Let's run some tests on it.

This is actually what we wanted to do as well, but we agreed that it would be easier to create bitmaps and use a library to convert them. This is of course a better solution for that.

78Alpha commented on May 23, 2024

I gave it a very VERY rough test and didn't get a good result... Tried it on one of the BMPMAN binaries I had around; it was 13 MB to start and came back as 12 MB, with a lot of corrupt data... I think the images came back out of order. But I did learn some things: ffmpeg wanted the buffer and frame sizes to line up, and I am guessing the missing data was because I either had too much on the tail or not enough to fill a frame, so that can be fixed by appending 0's. Not sure about the image order part; it's hard to tell which was actually first, since it was static.

It went up, downloaded, unzipped, and rebuilt a file. If I had measured it right, it could have worked, so I will try appending zeros next go. It looks really promising; don't let my bad run get anyone down.

Specific error was:

Invalid buffer size, packet size 647408 < expected frame_size 3145728
Error while decoding stream #0:0: Invalid argument

It did come back with the right header data, so image1 was the actual image1; I checked the bytes in Okteta. Nice work figuring this out!

DavidBerdik commented on May 23, 2024

@nerkkolner Our project generates PNGs using Pillow. We simply use the frombytes() function (https://pillow.readthedocs.io/en/3.3.x/reference/Image.html#PIL.Image.frombytes) to generate an image from a chunk of bytes that we read from the file we want to upload. To get the original data back when downloading, we read the pixels in the image. At the moment, we do not want to share our project because it is not completely functional yet. Once it is, though, the repository will be made public and I will share a link.
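
Roughly, the idea is something like the following sketch (dimensions and file names are hypothetical; the real code also has to record the original length so padding can be stripped):

```python
from PIL import Image

WIDTH, HEIGHT = 1024, 1024  # hypothetical chunk dimensions
CHUNK = WIDTH * HEIGHT * 3  # one RGB byte triple per pixel

# Pack one chunk of file bytes into a PNG.
with open("input.bin", "rb") as f:  # hypothetical file name
    data = f.read(CHUNK)
data += b"\x00" * (CHUNK - len(data))  # zero-pad the final chunk
Image.frombytes("RGB", (WIDTH, HEIGHT), data).save("chunk000.png")

# Recover the bytes by reading the pixels back.
restored = Image.open("chunk000.png").tobytes()
assert restored == data
```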

As for our ever-growing list of projects that try to tackle the same problem, @digicannon sent me a link to a recent one that he found that generates videos. - https://github.com/MarkMichon1/BitGlitter

MarkMichon1 commented on May 23, 2024

You beat me by a couple hours, David. I posted my library on reddit and someone suggested I bring it up here.

78Alpha commented on May 23, 2024

After attempting a few things, I found what works and what might not. I made a file with text and turned it into a PNG file, not just changing the extension but hex-editing it to have the header... This did not work well... it requires some other header manipulation, changing data chunks and adding a CRC to each chunk... manually writing it did not give good results...

However, the method I did find is a long one, but it did work. I converted a txt file to PNG; by that I mean I made a picture that showed the words "I am Text!". The edited file doesn't contain those words as bytes in any way. Getting it back into text used OCR... so... that method works, but you have to account for the OCR being able to read small characters, and of course for turning it from text back into whatever file it is... I guess this is covered by base64? It turns even the headers into plain text, so it should be as easy as adding the decoded data to a file with the right extension afterwards. I'll have to test this out more; I haven't found a way to automate it, as I don't know how the sites I'm using do what they do. I suspect AI, and that is a large scope...
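
A minimal sketch of that base64 idea (file names are hypothetical):

```python
import base64

# Any binary file, headers included, becomes plain ASCII text...
with open("input.bin", "rb") as f:
    text = base64.b64encode(f.read()).decode("ascii")

# ...and decoding restores the original bytes exactly.
with open("restored.bin", "wb") as f:
    f.write(base64.b64decode(text))
```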

78Alpha commented on May 23, 2024

My apologies for the misunderstanding. I'll be trying that next. Currently fiddling with Steghide to store things, but it needs a JPG large enough to hide the data, and good lord, I'm trying to make an 80,000 x 80,000 JPG on a little laptop... 4K images only offer 1.6 MB of space.

I'll edit this once I have tested the Word document.

78Alpha commented on May 23, 2024

I tested it out and it did not go well... Hiding a 2 MB text file required a nearly 20 MB image. Attempting to hide a larger amount of data required a bigger image, but then I ran into a completely different problem. It consumes a ton of memory just to add the 2 MB to the image; I have an average 8 GB, and it requires 7 GB plus 1 GB of swap per image, and that is just JPEG... I tried doing it with a PNG, but the available software requires even more memory for PNG. Even though a PNG can hold more data, it demands significantly more memory: where JPEG took 8 GB, PNG was demanding 10 to 12 GB, freezing the system and crashing. It requires the same memory to extract, too, so even though I had a test file, it was not happy about taking the file back out.

I also tested the Word document. Google converts all images to PNG format, destroying the data injected into them... But it did create a zero-space Docs file. To do it, you would need to have a PNG in the first place... However, the requirements for making the PNG are way too high to be useful... one image needs 10 GB of RAM but can only hold around 500 KB of data, and the image created would also be larger than a JPEG... that is a ratio of 1:30; for JPEG it's 1:10, and for Stewart's UDS method it's 2:3. If you account for upload speed, images can upload at full speed but UDS is limited to 1/5 of total network bandwidth, so a 5x factor can be gained by each image format: PNG, 5:30 = 1:6; JPEG, 5:10 = 1:2; UDS (still 1/5), 2:3 = 1:1.5. His method is smaller and the fastest of the image methods. The image methods just allow for cloud syncing in such a way that you don't have to deal with an ID and can easily resume uploads.

The only methods I can see would be having multiple drives and uploading to each via multiprocessing, so it doesn't close the connection due to too many access attempts at the same time, or making offline Word docs, uploading those, and converting them (however, you would have to delete the original Word doc because it still takes up space).

I used Steghide and OpenStego, if anyone is curious. Steghide is a command-line tool while OpenStego is a GUI tool, and also the only one of the two that can work with PNG files.

About uploading an image file with garbled data: I attempted that a while ago, giving the file all the headers necessary to appear to be an image, but Google, Twitter, etc. require the file to be thumbnailable in order to prove it is an image and not, of course, what we are trying to do. That's why a cover image is used. Google Photos refused to upload any image that it could not make a real thumbnail out of, but did work with ones that had real image data.

So... still at square one? The UDS method is still the fastest available...

I have also found that steg tools have serious problems handling part files (a zip with multiple parts). The part file can be 1 KB, but if the data inside is supposed to be larger than 2 MB, it just will not put the zip into the image, because it thinks the file is larger than it actually is...

DavidBerdik commented on May 23, 2024

Regarding the BMP thing I suggested, why was that limit present? Can't you edit all the bytes outside of the header without corrupting the image?

Regarding the Doc size limit, I tested that.

Using an online PNG generator (https://onlinepngtools.com/generate-random-png), I generated a bunch of PNG images and placed them in a Word document ("Word PNG Image Test.docx") that was about 48MB in size. I uploaded the document to Google Drive and converted it. The conversion was successful. I then downloaded the converted file and checked its size ("Word to Google Doc Conversion Test.docx"). It was 28.1 MB.

Using the .zip extension trick, I unzipped the two files to compare the images, and although both sets were in PNG format, the images were not technically the same. The ones in the Google Drive version were more compressed. I then tried creating a new Google Doc via the web UI and inserting all of the images from the original Word document, as well as the converted document, into the new Google Doc. This worked, but it took a while for saving to complete. After this, I downloaded the Doc-created file ("Google Doc Manual Image Insertion Test.docx"), which totaled 76.1 MB (note that this size is the sum of the previous two sizes). I then extracted this file using the zip trick, compared the hashes of the images to the hashes of the images from the documents they were sourced from, and they all matched.

So it looks like the best way to do this would be to insert the images directly in a Google Doc. Unfortunately, I cannot find official documentation on what the maximum size is for a native Google Doc, but according to this obscure Google Products forum post, the limit is 250MB. The three documents I created during this test are attached in RAR archive fragments.
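
For anyone who wants to reproduce the .zip extension trick in a script, a minimal sketch (the output directory is hypothetical):

```python
import zipfile

# A .docx file is an ordinary zip archive; embedded images live under word/media/.
with zipfile.ZipFile("Word PNG Image Test.docx") as doc:
    for name in doc.namelist():
        if name.startswith("word/media/"):
            doc.extract(name, "extracted_images")  # hypothetical output dir
```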

GitHub does not allow RAR archives to be uploaded so I had to change the extension to zip. To extract these, change all of the extensions back to .rar and use WinRAR to extract.

Sample Documents.part1.zip
Sample Documents.part2.zip
Sample Documents.part3.zip
Sample Documents.part4.zip
Sample Documents.part5.zip
Sample Documents.part6.zip
Sample Documents.part7.zip

DavidBerdik commented on May 23, 2024

Cool!

Over the weekend, I participated in a local hackathon with two friends (@SMyrick98 and @digicannon) and we tried to implement a prototype of the "bitmaps in Word documents" thing that I mentioned earlier. Unfortunately, we did not get everything working due to apparent inconsistencies in the bitmap standard, but time permitting, I believe we have plans to attempt to finish it. If that happens, I will share the work with their permission.

78Alpha commented on May 23, 2024

I retested my method, and apparently the old 1.1.0 version had a weird bug. It created data and stored it in the images... but only for files larger than the 48 MB buffer.

Made 1.4.0, changed everything; it works with large files now. No corruption, twice as fast at writing, and my first GUI. If anyone is trying to make a GUI for UDS, I suggest PySimpleGUI; it's what I used (well, the tkinter version) and it was very straightforward.

Just a side note: if you use the 1.4.0 version, make sure the directories it uses are empty beforehand; on a stop event it deletes the output folder...

I don't think my source is readable anymore.

With the API for uploading images, it is not the greatest... The workaround I see is using a setting Google offers in Drive, where you click it and it converts everything to high quality. I have no clue whether any of the APIs can interact with that or if it's just a user-facing thing. It would only need to be pressed every 14 GB or so.

I also saw what you meant about the bitmap standard; I made a second one and the header was completely different, shorter too.

nerkkolner commented on May 23, 2024

> Have you tried uploading PNGs, then downloading them and converting back to bitmaps? Since PNG is lossless compression, you should be able to go both ways and still retain the original data as long as the image is not resized. That is what we are trying for Google Drive. I still have not had a chance to glue in @78Alpha 's bitmap generator, though.

No, because the maximum file size for a 16MP PNG, which I created in Photoshop, is something like 30MB.
Like @78Alpha said, hacking PNGs is too complex, and there are formats like BMP and TIFF that are easier to work with and also make bigger files.
Not to mention anything bigger than 16MP will be resized regardless of the file size, which in turn corrupts the original data.

The thing I learned through playing with BMPs and Google Photos is that you can certainly upload bigger files than just 50MB.
In my testing, it seems the biggest BMP file you can upload with "high quality" is 75MB.
I did it with a 54-byte BMP header and a 74MB dummy file of all zeroes.
It uploaded OK, but not with a 75MB dummy file.

Next I will play with the ODT format to see if I can hack it to insert multiple pictures and keep it untouched by Google after upload.
Your tips are helpful; let's see where we go from there.

nerkkolner commented on May 23, 2024

Looks like ODT is just like DOCX: a zip-like structure, and compressed.

So instead of DOCX and ODT, can we just encode the large images into an RTF file instead?
RTF converts to Docs just fine; I just tested it.
Any clue how to get started on this?
https://stackoverflow.com/questions/1490734/programmatically-adding-images-to-rtf-document

nerkkolner commented on May 23, 2024

Hi all,
So I have just created this video using PowerShell and Python scripts after some intense research over the past week:
https://www.youtube.com/watch?v=kkmvJ1WK_os

Has anyone else thought of this idea before?
Using the hex values of the file to generate images, putting all the images together into a video, and uploading it to YouTube to act as data storage.
Since YouTube has no limit on the number of videos uploaded, you can upload as many as you want.
It is also nice to look at and easy to share with others.
The video quality looks fine after being uploaded to YouTube and should be good enough to restore the original file.
If not, you can always download the original video, which is an uncompressed AVI, with Google Takeout.

The only downside is that the process generates a temporary file with all the hex codes that is 20x as big as the original file, and it takes some time to prepare for use with numpy (because I have drawn each pixel 6x, e.g. #FFFFFF,#FFFFFF,#FFFFFF,#FFFFFF,#FFFFFF,#FFFFFF for each dot).

^ This is improved now with the use of scaling in Python:
https://stackoverflow.com/questions/7525214/how-to-scale-a-numpy-array
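
A minimal sketch of that repeat-based scaling (the 6x factor matches the description above; the tiny input array is hypothetical):

```python
import numpy as np

# A tiny hypothetical "data image": each entry is one RGB pixel of file data.
data = np.array([[[255, 0, 0], [0, 255, 0]],
                 [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

# Repeat every data pixel into a 6x6 block, so the value survives
# lossy video compression and can be sampled back from the block.
scaled = np.repeat(np.repeat(data, 6, axis=0), 6, axis=1)
print(scaled.shape)  # (12, 12, 3)
```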

I am not a software developer myself, but I think it would be great if this inspires others to build an application based on this approach that is fast enough and has a smaller temp-file footprint.
Is this even practical at all?

For making the images, Pillow + numpy is used:
https://graphicdesign.stackexchange.com/questions/49691/how-to-convert-an-array-of-html-colours-into-into-a-picture

On a side note, I also tried the OCR approach mentioned earlier. The images are generated fine, but the issue is that I still have a hard time finding a good font, or even OCR software, to scan the text off the images properly.
The biggest issue I have now is that barely any software can tell O and 0 apart properly, so I have kind of given up.
If anyone can recommend a good font for OCR, I will test it again.

Zibri commented on May 23, 2024

Just my 2 cents:

Example:

create a 300MB file with random data:
dd if=/dev/urandom of=test.raw bs=300k count=1k
(that is exactly 100 frames of 1024x1024x3 bytes, so no padding is needed)

convert it to PNG frames:
ffmpeg -f rawvideo -pixel_format rgb24 -video_size 1024x1024 -i test.raw test%03d.png

[attach the PNGs to a Google Doc document]

then... unpack the Google Doc document and convert back:
ffmpeg -i test%03d.png -f rawvideo -pix_fmt rgb24 test2.raw

test.raw and test2.raw will be the same file.

nerkkolner commented on May 23, 2024

> I assume you have tried testing with different fonts?

Yes, I have tried all the default installed fonts. Consolas works the best but still generates some mistakes.
If anyone else wants to try the OCR approach, it is best to use a monospaced font.

> Not bad! How's the upload speed? Also, are you generating BMPs or PNGs for insertion? As for the speed difference between 7 and 10, how different are the systems you are using for testing?

The upload speed is much better as one large doc file with the images inside.
I use PNG, as the size of the generated PNGs is smaller, so I can put more data in a single doc file.
I haven't measured the time accurately, but Win10 probably takes at least 1/4 longer to finish the task.

So for now, we can just use ffmpeg to encode the data to PNGs, as shared by @Zibri.
Can't believe that I never thought of using ffmpeg, lol.
At least I have learned something about Python scripting.

Zibri commented on May 23, 2024

@78Alpha glad I helped.

78Alpha commented on May 23, 2024

Testing using Docs, and it compresses the PNG...

Started with a 10 MB PNG and ended up with 5 MB, using DOCX, HTML, PDF, etc...

nerkkolner commented on May 23, 2024

In case you want to encode the data to WAV, here you go:
https://medium.com/@__Tux/using-bandcamp-as-a-backup-solution-3b6549d24579

DavidBerdik commented on May 23, 2024

I finally had a chance to play around with the hackathon project I worked on with @SMyrick98 and @digicannon. I was able to get bitmap generation working, but unfortunately, when the Word document containing the bitmap gets uploaded, Google does some major compression on the image stored in the document and the data gets lost. Specifically, the PNGs in each Word document were 1MB before uploading, but after downloading, they were all approximately 60KB. I am not sure what exactly @nerkkolner did differently than we did, but his images are much smaller than what our program creates.

We were hoping to jam more data into fewer images, but it seems that is not possible.

DavidBerdik commented on May 23, 2024

@MarkMichon1 Hey this issue thread is becoming famous! 😆

I haven't had a chance to play with your implementation yet, but it looks pretty cool. I noticed that the last frame of the demo video you linked to uses white for padding.

Does your project allow for the same file to be split across videos (i.e., split data into several videos that are 10 minutes long)?

DavidBerdik commented on May 23, 2024

@MarkMichon1 You're welcome! As for the white padding, I figured that was the case. I just found it to be an interesting choice, because in our project we are padding images with null bytes, which produces black. Or at least it did before we started using PNGs with an alpha channel, which means we now produce transparent pixels.

DavidBerdik commented on May 23, 2024

The only suggestion I have that has not been implemented, as far as I can tell, is batching requests to the Drive API. Perhaps a certain number of chunks (100?) could be encoded and then sent to Drive for processing on a separate thread while encoding of chunks continues on the main thread. Time permitting, I will play with this idea. I'm not sure that time will permit, though.

78Alpha commented on May 23, 2024

Not sure how great an idea it would be but...

Google has mentioned you can convert text documents to the Google Drive format. Not sure if that would set it as a "0 space used" doc, but it would allow files of up to 50 MB to be uploaded and converted.

With the right threading, or multiprocessing, have one set encoding, one set uploading, and one converting. However, they would have to be synced up neatly, as I found that calling the API from multiple instances terminated the upload.

I attempted to multiprocess the upload, but when more than one "user" accesses anything it cuts the connection, so it was playing duck duck blackout with itself until stopped. Since every drive has a minimum of 15 GB, it could be set to upload up to 7.5 GB and then convert.

Uploading a solid file would at least be faster, but again, not sure if it converts neatly.

DavidBerdik commented on May 23, 2024

@78Alpha From what I can tell from my admittedly brief research, converting an uploaded file to a Google Doc does produce a "0 space used doc."

stewartmcgown commented on May 23, 2024

I have been unable to convert even 8MB text files to Google Docs format. Have you had any verifiable experience with this?

DavidBerdik commented on May 23, 2024

I have experience doing it with Word documents years ago, but never with txt files. From what I've read, though, it is supposed to be possible.

I suspect you all have seen this already, but I will post it anyway: https://support.google.com/drive/answer/37603?hl=en

stewartmcgown commented on May 23, 2024

No, of course you can convert documents; that is exactly what my program does. But there is a technical limitation: Docs can have only 10 million characters. Trying to convert large text files to Google Docs format fails every time.

I'm still open to other speed improvement suggestions, but I don't think this is the way forward.

DavidBerdik commented on May 23, 2024

Unless I am misinterpreting this, conversions do not have that limit: https://support.google.com/drive/answer/37603?hl=en

stewartmcgown commented on May 23, 2024

I imagine that is to allow for word docs with images in them.

You can test the behaviour I'm talking about by creating a fake base64 txt file:

base64 /dev/urandom | head -c 40000000 > file.txt

and attempting to upload and convert it in your Google Drive. The error in the console is 'deadlineExceeded', which I assume means there is an internal time limit on how long a conversion can take on Google's servers.

DavidBerdik commented on May 23, 2024

Yeah I see what you mean. I can't even get conversion to work properly through the web interface.

I have not had a chance to test converting documents that have images in them, but assuming that it works, it may be worth looking into modifying the project to do the following:

  1. Generate "images" that contain slightly less than 50MB worth of data.
  2. Add those "images" to a Word document.
  3. Upload the Word document to Drive.
  4. Convert to the native Docs format.
  5. Delete the original.

What are your thoughts on this?

78Alpha commented on May 23, 2024

There's also the method of making several Google accounts and uploading a part of a zip to each. The limit would be 5, as UDS seems to only ever use 1/5 of the total bandwidth on any network. Having each file with its own account wouldn't drop the connection like multiprocess uploading did.

I've made my own script that auto-updates the ID whenever a file is deleted or the like, but I have to add in the details manually unless I want to make a self-evolving script. Instead of just saying "pull id" for one drive, it goes "pull my_picture" and pulls from each drive, or deletes from each, or pushes and loads the ID into a shared JSON...

However, seeing as David got a really nice setup for JPG zips, it seems promising. I will test to see if it works on Drive, but Drive is very picky about "altered images". Best of luck, great concept, and awesome execution.

Edit:

After testing it out, I managed to upload one of those "zips in jpg" files to the unlimited storage. However, it is limited to about 64 kilobytes per JPG...

DavidBerdik commented on May 23, 2024

Good luck to you! Unfortunately, I have not really had any time to play with any of this; I've only been able to theorize about what might be possible. Schoolwork has kept me busy even over the weekend.

Another way to handle this could be to create a 50MB (probably slightly less) bitmap file and use that for storing data. If you want to hide data in a bitmap while retaining the image, you can use least-significant-bit steganography, but since there is really no incentive to retain the appearance of the unaltered image, there's really no reason why we can't just overwrite the entire image with our bits and put the garbled-looking image in a document. Using MS Paint, I was able to generate a 256-color bitmap of 48.7MB by setting its dimensions to 7150px by 7150px. The question here is: does Google do anything to bitmaps in Word documents that are converted to the native Docs format?

In regards to generating Word documents with Python, here is the answer to that: https://python-docx.readthedocs.io/en/latest/
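
A minimal sketch with that library (image and file names are hypothetical):

```python
from docx import Document  # pip install python-docx

# Build a Word document and embed one generated image per data chunk.
doc = Document()
doc.add_picture("chunk000.png")  # hypothetical image names
doc.add_picture("chunk001.png")
doc.save("payload.docx")
```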

And no worries. I just want to make it clear that I am not trying to claim someone else's work as my own. I know what it feels like when someone does that and do not want to perpetuate it. 😊

Update: Here is the bitmap I created. Apparently GitHub does not take kindly to uploading 50MB bitmaps so I had to zip it.
Demo Image.zip

DavidBerdik commented on May 23, 2024

I am not sure that this adds much, but after a little investigation, I believe that the bitmap issue with the Word documents is actually not Google's fault, but rather Microsoft's. Here's how I found that out:

  1. Created a new Word document and added my bitmap to it.
  2. Saved the document, closed it, and changed the extension to zip. (docx files are nothing more than glorified zip archives)
  3. Unzipped the file and found the image. It was in PNG form.

I do have one more crazy idea. I am pretty sure that it is too impractical to be useful, but I will share it in the hope that someone finds it valuable.

  1. Create an empty Word document.
  2. Create a new image file (the one that I am playing with is 7150px by 1px, and the format should not matter).
  3. Set each pixel in the image to white or black to indicate a bit of the file that is being uploaded. (0 = white, 1 = black, or the other way around if you prefer)
  4. Add the image to the Word document, save the Word document, and check the size.
  5. Repeat steps 2-4 while the Word document size is under a certain threshold.
  6. Once the size threshold is reached, upload the document, convert it to the native Docs format, and delete the original Word file.

I am of course aware of the drawbacks of treating each pixel as a bit instead of a byte, but using this method, I am not sure that each pixel could be reliably used as a byte. Since Word/Docs seems to like the PNG format, perhaps using bytes would be acceptable, since we would not have to worry about what happens during conversion.
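
A minimal sketch of the pixel-per-bit encoding with Pillow (file names are hypothetical):

```python
from PIL import Image

STRIP_WIDTH = 7150  # matches the 7150px-by-1px strip described above

with open("input.bin", "rb") as f:   # hypothetical file name
    data = f.read(STRIP_WIDTH // 8)  # one strip holds 893 whole bytes

# Unpack each byte into bits: 0 -> white (255), 1 -> black (0).
bits = [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]
pixels = [0 if b else 255 for b in bits]
pixels += [255] * (STRIP_WIDTH - len(pixels))  # pad the strip with white

strip = Image.new("L", (STRIP_WIDTH, 1))
strip.putdata(pixels)
strip.save("strip000.png")
```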

Does anyone have any thoughts on this? (Besides of course thinking that I am crazy.)

78Alpha commented on May 23, 2024

Recently tested multiprocessing in a range of ways: defined processes, self-mutating processes, OS processes, and starmap... all of which got caught up on a certain part, "Requires x docs to make", specifically. It seems to stop the system from spawning any new processes, or it automatically calls join() on a process it doesn't know the name of. Just running multiple instances in different terminals worked fine, at least for a small set of data. I once tried with very big files and it got caught up and just stopped both uploads. Using "os.system" to call UDS and whatever command I need also seems to cause a problem: it makes the group name "None", and for some reason that stops the whole thing, even when grouping does nothing at all... Trying to do it from an outside script led to... VERY weird results. UDS started treating letters like files; it would fail to upload anything but encoded non-existent data, and it uploaded that with the name "."

I have run out of external ideas to speed it up... the only ways left would be to change parts of the core UDS, and that is way over my head.

And apparently there is a rate-limit quota, so a single file can only be edited so fast. I found this out when messing around with the concurrent executor code that was commented out. And applying for extra quota requires asking Google for more... So... my old idea of making multiple Google accounts to access a file might be valid for multiplying speed; maybe I'll test that next...

DavidBerdik commented on May 23, 2024

@78Alpha Have you bothered playing around with my latest crazy suggestion at all? If not, I will have a look at it myself when time permits.

78Alpha commented on May 23, 2024

I am studying it at the moment. I haven't gotten to it, as I personally wouldn't know how to execute it. From what I see, it would still be bound to the 700 KB doc size limit, but you would be able to group files; however, they couldn't be part files at the moment...

It allows for more organization but reduces the amount of storable data per picture. I'll have to work with BMP a little bit to see how it handles data.

I tried the BMP you uploaded; it was apparently too short to hide a 5 MB file, but it again managed to hide a 2 MB file. It is starting to appear that 2 MB is the limit for a single file.

I found that files can continuously be pumped into images. I added a zip into an image and made an image, then pumped data into that image... however, the time it takes to do so grows steeply: the first pass took 10 minutes, while the second is at 5.7% and has taken 2 hours already.

78Alpha commented on May 23, 2024

I myself couldn't edit past the header bytes; it was far outside my field of expertise. I used one of the Kali Linux tools, Steghide, and it attempts to inject data in such a way that it will still work on sites that try to generate a preview. Since it pushes the data into a single block in the image, I assume that's the limit; if I could input data per block, then the limit would be the number of blocks instead of the size (and of course RAM, when trying to open the image itself as a text document). That 250MB limit seems very generous; I wonder who made a doc so big that it was placed that high. I'll have to learn more about all this, but as long as the data is in big chunks, it could boost upload to full potential. I'll take a few days to learn more; if I can't learn what is needed, I might have to pass on trying an implementation myself.

78Alpha commented on May 23, 2024

So, I looked into the whole thing a bit and learned a lot. When I first started out, I was using PNG images, and that's where I went wrong. PNG files are the hardest to work with, as they have checksums for each block, making it nearly impossible to inject data into them; knowing that is helpful, though. PNG files have the largest potential size (I downloaded a 500 MB one from the NASA site), but working with them is slow, tedious, and not very efficient...

I worked extensively with your BMP files too, but with Google Photos-related tests. After doing some reading, unlimited storage is for files less than 16 MP (4920 x 3264). So I made a BMP of size 4920 x 3264 with a simple gradient. It is ~50 MB in size, much better than JPG but not as good as PNG; however, it works. The BMP uploaded to Google Photos, takes no storage space, and could be downloaded and unzipped.

https://photos.app.goo.gl/RvRR7H4bhcwQcCRu5 (contains a 7zip file)

That is the picture. You can tell how full it is by the amount of random static in it, from bottom to top, so you can also add more stuff if you want; it's a pain to find the end bytes, but it is possible. (Also, the data in there is a game of my own design, in case that raises any concern.)

I attempted to copy the BMP bytes and create arbitrary images with Python, as Python has binascii for stuff like that. However, when writing it up as a script it threw a nonsensical error; I say that simply because I ran the same code from an interactive prompt and it worked flawlessly, so automation will be problematic...

I also tested your DOC idea. I added a very small JPG to a Word document, converted it to a zero-space file with Docs, and downloaded the doc with its images... And, well... it destroyed the data again. The image was only a 30 KB JPG, so it wasn't turned into a PNG; however, Google still tampered with the data such that it couldn't be extracted (or be seen as an archive).

Part files are also not working across multiple images, so... I'll be working with hex for a while...

78Alpha commented on May 23, 2024

A rough snippet of code to help out...

```python
import binascii
import random

def generic():
    sequence = '1234567890ABCDEF'
    # Header bytes ripped from a real 16 MP BMP file
    base = binascii.unhexlify('424DD2E4DE02000000007A0000006C00000037130000BF0C0000010018000000000058E4DE02130B0000130B0000000000000000000042475273000000000000')
    with open("generic.bmp", 'wb') as byter:
        byter.write(base)
        for x in range(10):
            # 12,000,000 random hex digits per pass = 6 MB of pixel data
            temp = ''.join(random.choice(sequence) for _ in range(12000000))
            byter.write(binascii.unhexlify(temp))
```

The code generates a BMP of 60 MB, and yes, it is based solely on size. I used the header bytes from a BMP I had on hand, so it always has the same Width x Height and appears as a BMP. Although it is 60 MB, that's because Google Photos was not happy with the 240 MB generated one, or the 120 MB one... but different services should have different limits. In theory, you should be able to make a multi-gigabyte BMP file that always has the resolution of 16 MP.
BM����zl7�� ��X���� � BGRs

is what is made from the bytes...

424DD2E4DE02000000007A0000006C00000037130000BF0C0000010018000000000058E4DE02130B0000130B0000000000000000000042475273000000000000

So... it could be modified to have part files in each image and then consolidated into a single BMP file. Not sure how clean that would be, but it means each DOC could have a full zipped file even if images are limited in raw data size. However, from my testing, taking part files out of images generates noise of a weird kind; it added data that never existed, corrupting the archives... I guess a cleaner way would be to add the part files to an image and close the file there, without extra noise, such that you can just ignore the headers and stitch the files into one big file.

Hopefully my blunders lead to discoveries for others.

jhdscript commented on May 23, 2024

I ran various tests:

  • Plain text
  • docx + convert
  • Google Sheets container

Speeds are always bad (250 kbps max), so I think the only way to boost it is threading the process.

I looked at rclone, and with small files it works faster than all my tests :'/

Moreover, Google limits file creation to 3 per second :-(

78Alpha commented on May 23, 2024

Here is a BMP tool that might make the process of putting them into docs easier; it makes the standard more uniform. Sadly it's limited to Python 2.7 right now; 3.7 was having a cow about reading hex and bytes.

https://github.com/78Alpha/BMPMan

The only advantage Google Photos has is that it can make albums and continuously sync (at least from mobile). Still very manual, just makes pictures... I added a license just for reasons, and read it over, so I guess I have to state this:

DavidBerdik, under the LGPL v3, you have free rein over the nightmarish code I have created in the link above, if you like.

jhdscript commented on May 23, 2024

And you are sure Google doesn't compress BMP?

78Alpha commented on May 23, 2024

@jhdscript

I made sure to test on Google Photos; it didn't alter the file. Photos looks to see if the file is greater than 16 MP, so the header states the file is (16 MP - 1). I could have made it 1x1 pixel; however, I like being able to tell how full the image file is by looking at it (less static means it is an end file). Google Docs, however, always compresses images, destroying steganographic images or "bogus" images. I'm sure there's a way, and that's what David is going after.

I'm after Google Photos, Stewart did Docs, David is doing Docs x Photos.

If I can find a video format that isn't as picky about headers, I can make a large bogus video file (since Google Photos allows 10 GB videos at 1080p, I could make the video appear to be 1x1 if necessary).

Just to test all the Photos stuff, I compressed a GOG game I had, packed it into BMP files (35 total), uploaded it, downloaded it, unpacked it, then checksummed it, extracted it, ran it, etc... It worked perfectly.

Depending on where the data is put (Docs vs. Photos vs. whatever else gets made next), it has different rules.

jhdscript commented on May 23, 2024

Mmm, I generated a few BMP files using a C# app, then uploaded them to gdrive and redownloaded them. It seems the original files have been altered.

I haven't read your code yet, but do you use a tricky BMP header, or is your header a valid BMP header format?

78Alpha commented on May 23, 2024

I made a BMP using GIMP and just ripped the header from that. I think it still has the comment "Made with GIMP" in it, but I'll have to double check.

Edit:

Sadly, it does not have the "Made with GIMP" comment.

jhdscript commented on May 23, 2024

BMPs are modified by the gdrive upload.

78Alpha commented on May 23, 2024

How about Google Photos?

Strange that it isn't working for you; it worked when I tested it.

jhdscript commented on May 23, 2024

What API do you use for uploading to Google Photos?

The big problem with Google Photos is that API queries are limited to 10,000 per day.

78Alpha commented on May 23, 2024

I haven't implemented any API use yet; I just copy the pictures to my phone and sync, at least for now.

At 10,000 queries per day, that is very good: ~470 GB per day, and if I find the cutoff image data size, maybe it could get closer to 700 GB per day.

jhdscript commented on May 23, 2024

I developed it and uploaded 300 GB in 6 hours, but with chunks less than 20 MB. The BMPs were not compressed or retouched, but the Google Photos API is not very responsive, and if Google implements a compression algorithm on files, we lose everything.

Atm gdrive seems better, but the chunk size (700k) is the issue :-( it takes more time to establish connections than to upload.

jhdscript commented on May 23, 2024

After a lot of tests using an external tool, I determined that 10 threads is the optimum for posting without any HTTP errors on gdrive.

Now I touch 500 kbps, so still slow...

78Alpha commented on May 23, 2024

Working with my own thing, it manages to go at full speed, 1.5 MB/s. After looking into the API documentation, it looks like a trainwreck to get it working at full capacity without errors...

So, my idea is out. However, I am still using it manually for backups, since it works. I can create albums, change the cover photo to match the data, and it's all in a nice package. I keep a UDS copy, a BMPMan copy, and a local copy.

Just for fun, I put the images into GIMP and made a GIF out of them. It's a 500 MB GIF that shows a representation of all the data in the file, through random colored static.

DavidBerdik commented on May 23, 2024

Wow. It looks like I missed out on quite a lot here.

@78Alpha Unfortunately, we have not made any additional progress on our hackathon project over the last week. At this point I will probably be taking over the project on my own and trying to finish it. Since the main problem we were having was generating bitmaps, I will probably try to add your program to ours, since it presumably works as intended.

@jhdscript Regarding your question about whether Google does anything to bitmaps: what I found from testing my Google Drive idea was that when putting bitmaps either in Word documents directly or via Google Drive, the images become PNGs. Since PNG compression is lossless, I figure that the best way to handle this would be to generate a BMP, convert it to a PNG using Pillow, then pass it off to Word and Drive, since apparently neither Word nor Google does anything to them if you give them PNGs from the start. And then, when downloading, use Pillow again to convert back to BMP.
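
A minimal sketch of that round trip (file names are hypothetical):

```python
from PIL import Image

# Upload path: losslessly re-wrap the generated BMP as a PNG.
Image.open("payload.bmp").save("payload.png")   # hypothetical file names

# Download path: convert the PNG from the document back to a BMP.
Image.open("payload.png").save("restored.bmp")
```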

nerkkolner commented on May 23, 2024

If anyone is interested, this dude @xhighway999 and his group of friends created this utility to store data as videos on YouTube:
https://github.com/xhighway999/PygaMasher

Maybe you guys can share some ideas together.

DavidBerdik commented on May 23, 2024

That's pretty cool. I was just playing with the release build. It seems that the release build can generate the frames and store them as images properly, but, at least on my machine, no video is being generated. This is probably the most sensible way to go about it, though. Splitting data would still be necessary in cases where either the resulting video is more than 12 hours long or the resulting video file is greater than 128GB (https://support.google.com/youtube/answer/71673?hl=en).

nerkkolner commented on May 23, 2024

> Here is a BMP tool that might make the process of putting them into docs easier; it makes the standard more uniform. Sadly it's limited to Python 2.7 right now; 3.7 was having a cow about reading hex and bytes.
>
> https://github.com/78Alpha/BMPMan
>
> The only advantage Google Photos has is that it can make albums and continuously sync (at least from mobile). Still very manual, just makes pictures... I added a license just for reasons, and read it over, so I guess I have to state this:
>
> DavidBerdik, under the LGPL v3, you have free rein over the nightmarish code I have created in the link above, if you like.

Hi, I saw your code and see there is padding of 0*32 added to the generated BMP; may I know what it is for?
Also, I tried your tool, and it seems that it doesn't convert the data to a real image, but it can still be uploaded to Google Photos. Am I correct? Thanks.

nerkkolner commented on May 23, 2024

> That's pretty cool. I was just playing with the release build. It seems that the release build can generate the frames and store them as images properly, but, at least on my machine, no video is being generated. This is probably the most sensible way to go about it, though. Splitting data would still be necessary in cases where either the resulting video is more than 12 hours long or the resulting video file is greater than 128GB (https://support.google.com/youtube/answer/71673?hl=en).

128GB is a rather large file size.
I would split it up too, even if it's not the limit.

nerkkolner commented on May 23, 2024

> Working with my own thing, it manages to go at full speed, 1.5 MB/s. After looking into the API documentation, it looks like a trainwreck to get it working at full capacity without errors...
>
> So, my idea is out. However, I am still using it manually for backups, since it works. I can create albums, change the cover photo to match the data, and it's all in a nice package. I keep a UDS copy, a BMPMan copy, and a local copy.
>
> Just for fun, I put the images into GIMP and made a GIF out of them. It's a 500 MB GIF that shows a representation of all the data in the file, through random colored static.

Hi, it is me again.
I have finally made it work on a real picture, with some ideas from your utility.
But one thing I got right is that the BMP header is just 54 bytes, from what I have read.
My end result is exactly like yours.
Now we just need a tool that can upload 16MP BMPs in "high quality" and can be used in a scripting/automation environment.
All the tools using the Photos API can only upload photos in original quality, which counts against the drive space.
DavidBerdik commented on May 23, 2024

> 128GB is a rather large file size.
> I would split it up too, even if it's not the limit.

"The maximum file size you can upload is 128GB or 12 hours, whichever is less."

I would try to go for 11:59:59 in less than 128GB. Even using HEVC, that's probably not doable though.

> Hi, it is me again.
> I have finally made it work on a real picture, with some ideas from your utility.
> But one thing I got right is that the BMP header is just 54 bytes, from what I have read.
> My end result is exactly like yours.
> Now we just need a tool that can upload 16MP BMPs in "high quality" and can be used in a scripting/automation environment.
> All the tools using the Photos API can only upload photos in original quality, which counts against the drive space.

Have you tried uploading PNGs, then downloading them and converting back to bitmaps? Since PNG is lossless compression, you should be able to go both ways and still retain the original data as long as the image is not resized. That is what we are trying for Google Drive. I still have not had a chance to glue in @78Alpha 's bitmap generator, though.

DavidBerdik commented on May 23, 2024

> Looks like ODT is just like DOCX: a zip-like structure, and compressed.
>
> So instead of DOCX and ODT, can we just encode the large images into an RTF file instead?
> RTF converts to Docs just fine; I just tested it.
> Any clue how to get started on this?
> https://stackoverflow.com/questions/1490734/programmatically-adding-images-to-rtf-document

I have not played around with anything outside of DOCX, but it should be possible. There may not be any value to it, though.

The biggest challenge when working with DOCX is that any BMPs added to the document get converted to PNGs. This may not be true for the RTF format, but even so, I would expect that uploading the file to Google Drive would result in a conversion. At least, I know the conversion still happened whenever I tried to insert BMPs directly into Google Docs.

nerkkolner commented on May 23, 2024

> I have not played around with anything outside of DOCX, but it should be possible. There may not be any value to it, though.
>
> The biggest challenge when working with DOCX is that any BMPs added to the document get converted to PNGs. This may not be true for the RTF format, but even so, I would expect that uploading the file to Google Drive would result in a conversion. At least, I know the conversion still happened whenever I tried to insert BMPs directly into Google Docs.

lol, just ignore my post; the images don't get carried over when converting to Docs after upload, so RTF can be ruled out for now.
I will take another look at PNG, but right now BMP is a much better choice.
It just cannot be added to Docs, but is that really necessary, though?
Once you find a way to upload the pictures in high quality with automation, Docs can be used as an alternative way to store unlimited data.
Thanks to Google's 2-files-per-second limit, it will never be an effective backup storage, but as a secondary one it is good enough.
We can also separate the data into 2 different places, one set in Docs and one in Photos, so if we lose one set of data, we can restore from the other.

However, I found the limit for Photos uploads:

> Q. When uploading large / many files, uploading failed.
> A. It may be limitations of Google Photos. Limitations are below. (FYI: issues#246, issues#256 (comments))
> 75 MB or 100 megapixels / 1 photo
> 10 GB / 1 video
> Total bandwidth maybe 10 GB / 1 day

from https://github.com/3846masa/upload-gphotos

Is 10GB good enough?
It looks like the tool can support uploading to Photos with high quality; will any of you try it?

DavidBerdik commented on May 23, 2024

I understand why you would prefer to work with BMP instead of PNG, and I agree with you, but since using the Docs approach requires the use of PNG, we might as well take advantage of the compression that PNG has to offer (although it's not like we really have a choice in that). Our hackathon project takes the lazy approach of generating a BMP and then using the Pillow library to convert the BMPs we generate to PNGs, which we then insert into Word documents. When downloading, we just go the other direction. As far as I can tell, converting between the two does not cause problems. I have seen the header change, but the body of the restored BMP has always been identical to the original BMP that I fed in.

DavidBerdik commented on May 23, 2024

@nerkkolner That's pretty cool! I do have one question about it, though: I noticed that you give about 1 second to each block set. Why did you choose to do this instead of putting more frames in that time? Of course, doing 26FPS is risky, but 1FPS seems kind of unnecessary.

As for our hackathon project, we're still stuck on getting a working bitmap generator, but we're all tied up with assignments and haven't really touched it.

nerkkolner commented on May 23, 2024

> @nerkkolner That's pretty cool! I do have one question about it, though: I noticed that you give about 1 second to each block set. Why did you choose to do this instead of putting more frames in that time? Of course, doing 26FPS is risky, but 1FPS seems kind of unnecessary.
>
> As for our hackathon project, we're still stuck on getting a working bitmap generator, but we're all tied up with assignments and haven't really touched it.

It is not necessary; you can put more frames in a second.
I only did it to show the images longer so that I could check the quality.
Also, my video is short, with 3 frames anyway, so making a 1-second video seems silly to me.

Another update here:
The generation process is much quicker and the temp file is also much smaller now, with the use of the scaling method in the Python script. I use the repeat method here:
https://stackoverflow.com/questions/7525214/how-to-scale-a-numpy-array

Where are you stuck on the bitmap generator?
Bitmaps are fairly easy to hack; do you mean you are trying to put bitmaps in Docs?
78Alpha commented on May 23, 2024

Bitmap is an easy hack itself, but David was converting them to PNG, if I remember correctly, because Google Docs always converts images to PNG from docx (or microsoft format).

For the actual making of a BMP I provided a method but it was very unstable at the start... I have my own BMP generator in a stable place right now. It has a predictable header that is overly generic. It just states " I AM BMP OF BIG SIZE" and everything treats it like an image. In my working with BMP images and Google Photos, I learned that Google Photos technically doesn't support BMP files... as in the upload dialog, BMP is not listed in the supported file format bar. By using the "Any File" option they upload just fine and don't count against storage limit. As I am typing this out I have s few hundred images uploading to Google Photos and it has devoured my RAM, so not the best option for low memory systems (8 GB RAM + 10 GB SWAP). My generator is very IO dependent, however, adding more threads or even processes actually slowed it down. It is also limited to 1 file input that isn't BMP, so 7z is a very good friend. Aside from those quirks, filesystem apparently matters, as ext4 is having a hard time with how it works, in it's effort to keep defragmented it runs 3 times slower (10 seconds as opposed to 3 seconds on something like FAT32 when working with a 1 GB test file).

A different approach came to mind though, like how you were going at it before, using the colors to define characters, I believe. Would it be functional to convert each byte (or hex value, or whatever character actually defines the real data) into a unique color and generate a PNG using those colors as pixels? It would be limited by resolution, but that could be remedied by splitting it into multiple files (might make Docs work harder though). Then those images could be read through sequentially to remake the original file. I suppose the speed is based on pure processing power in that case. I have a bit of interest in the topic, and so I am testing every DOC format I can, from Rich Text to ODT, to see if one of them is friendlier about the image process.

OCR was one of the first ideas I had, but it immediately got riddled with flaws, since it is reliant on AI or intensive algorithms. The best OCR tool I found was actually the Google Assistant... and it isn't 100% accurate. I do, however, remember getting text from a picture with Google Docs. It was an odd feature I looked up. A game I was installing had no English translation on the installer, so I screenshotted it and ran it through Docs, etc., and it managed to read the text decently enough. For the OCR font, you may need to make a custom one if there isn't an easy answer. When looking for one, pay attention to how they treat O and 0, and I (upper i) and l (lower L).

from uds.

nerkkolner avatar nerkkolner commented on May 23, 2024

In my testing, the data-to-images approach works perfectly fine for storing the data as pixels in PNGs (which were later made into a video) and for recovering the data from them
I did it all in PowerShell and Python scripting; it is fairly easy to figure out with the links I provided earlier
But one thing I was wrong about: the video uploaded to YouTube got compressed hard enough that the pixels representing the actual data in the extracted frames changed, e.g. #00FF00 -> #00FF01
So you can't just recover the data by working on the video downloaded directly from YouTube
The good thing is Google still lets you download the original videos you have uploaded by using Google Takeout
I wish they let you choose which videos you want to download, not all the videos you have uploaded
Or you can just upload the PNGs to Google Photos, but your BMP generator is definitely faster, so I am only interested in uploading them to YouTube as videos as a proof of concept, since no one has tried that before
I am still working on how to get it processed faster though; is there a fast way to encode the data into HEX, like how you can open a file in a hex editor, copy the HEX values and save them into a new file, but in a scriptable way?

I have tried RTF and it doesn't keep the images after being uploaded and converted to the Docs format

On the OCR approach, I managed to get some good results with a combination of Tesseract-OCR and Consolas
There are still a few mistakes like 0/O, 2/Z, 5/S, but you can fix them manually
However, it is quite slow to create images from the text
On my Lenovo X250 machine running Windows 7, it took 15 minutes to print the HEX values on images, and they used 140MB across all the images for a 4MB MP3 file
It isn't really practical, I think; I will stick with my data to pixels for now
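One tweak that might help with the 0/O, 2/Z, 5/S mixups: Tesseract can be restricted to a character whitelist, so pure-hex text can't be misread as letters outside 0-9/A-F (a sketch, assuming the pytesseract wrapper; the frame name is a placeholder):

import pytesseract        # pip install pytesseract (needs Tesseract-OCR installed)
from PIL import Image

# whitelisting the sixteen hex digits sidesteps the 0/O, 2/Z, 5/S confusions
text = pytesseract.image_to_string(
    Image.open('frame.png'),
    config='--psm 6 -c tessedit_char_whitelist=0123456789ABCDEF')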

from uds.

nerkkolner avatar nerkkolner commented on May 23, 2024

Uploaded an OCR example here:
https://www.youtube.com/watch?v=vXkhfHK52co&feature=youtu.be

It is a picture of a Working Joe from the game Alien: Isolation

from uds.

78Alpha avatar 78Alpha commented on May 23, 2024

With the reading data and writing it, such as copy paste, that's how mine actually works: it copies 48 MB of data (I chose that at random, but it's kind of stuck now, so it works with "legacy" images) and dumps it with an added header. For burst work, such as 20 or so images, it works fast, at 3 seconds per GB; however, the longer it runs, the slower it goes, reaching somewhere around 10 seconds per image in some cases. I attempted to use the Memory Map feature to load some of the data into RAM so it would work faster, but MMAP seems to... not work... I copy-pasted the documentation example and it crashed, implementing it into a program crashed, etc. And since it's IO-bound, I added threading in the latest version (my dev version) so people can dynamically pick threads; however, using more than 1 thread actually slows it down... I'll be looking into more ways to alter files in RAM, but the best thing I can think of right now is a RAMDISK. However, it has limits and can't work with large archives... (30 GB or more). Worst case scenario, I have to find a way to safely split between processes without flooding memory.
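For what it's worth, a read-only map along these lines should work without crashing (a minimal sketch, assuming Python 3; the archive name is a placeholder):

import mmap

with open('big.7z', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        chunk = mm[0:48 * 1024 * 1024]   # slice 48 MB without reading the whole file into RAM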

For the actual encoding, Python offers an easy way (I started with hex, so I use decode, but it can be applied in reverse too)...

import codecs
data = open('input.bin', 'rb').read()  # must be bytes in Python 3, not str
hexed = codecs.encode(data, 'hex')     # e.g. b'\xde\xad' -> b'dead'
data = codecs.decode(hexed, 'hex')     # and back to the raw bytes

YouTube is also very aggressive with their compression: blacks get turned into solids, and varying pixels cause massive bitrate drops (GTA custom maps, tubes with the holes). Google Photos is limited in a way too. You can upload a video, but it has to be 1080p or below and can't exceed 10 GB if using free storage.

After looking over some documents, I might give Cython a try; I'm not very familiar with C, but it seems to offer multiple times more performance... And Numba.

from uds.

nerkkolner avatar nerkkolner commented on May 23, 2024

Can't really help on the memory issue, but my approach to saving data to BMPs is to split the file into smaller parts on disk first and write the header to each part
The images still display properly, though the "noise" is all over the place
It uses disk space, but I don't have to worry about the memory issue, which would only slow down the process for a lot of files

Thanks for your suggestion, but I use binascii.hexlify in my script instead and it is very fast (sketched below)
Right now my data-to-pixels-on-PNG approach is running at a decent enough speed and can be used in actual production
I plan to upload tons of junk data to YouTube in the following weeks to see if Google notices anything
Or to Google Photos, since all the videos will be under 1080P and 10GB
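The hexlify route in full, as a sketch (the input name is a placeholder):

import binascii

data = open('song.mp3', 'rb').read()
hexed = binascii.hexlify(data)      # e.g. b'ID3\x04' -> b'49443304'
data = binascii.unhexlify(hexed)    # exact inverse, byte for byte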

Next I will take another look at dumping the data into PNGs and saving them to MS Docx
If anyone is interested, someone already made a utility to hide a payload in PNG
You can look at his code and see if anything is interesting there
https://github.com/sherlly/PCRT

I can paste the modified PNGs into Google Docs directly and it will keep the payload in PNGs after saving
But as David tested previously, Google removes all the non-standard chunk data from PNGs if you upload the Docx via the web frontend and convert to GDocs
I haven't played with making an MS Docx with Python, but I hope it helps keep the payload

from uds.

nerkkolner avatar nerkkolner commented on May 23, 2024

Update:
Created a docx with all the images from my data-to-pixels approach using the Docx library and uploaded it to GDocs via the web; the images are compressed but the quality stays the same, and not a single byte of data is lost!
So now I can create the Docx files and upload them with rclone at much better speed👍

https://docs.google.com/document/d/1U6QJ6uV9zDQ6GcpBpKF1Wn9eSw08GiASkjWuUxLVUfM/edit?usp=sharing
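For anyone curious, the Docx-building step is only a few lines with the python-docx library (a sketch; the chunk names are placeholders):

from docx import Document           # pip install python-docx
from docx.shared import Inches

doc = Document()
for name in ('part0000.png', 'part0001.png'):
    doc.add_picture(name, width=Inches(6))   # display size only; pixel data is embedded as-is
doc.save('payload.docx')                     # ready for rclone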

from uds.

DavidBerdik avatar DavidBerdik commented on May 23, 2024

It is not necessary, you can put more frames in a second
I only did it to show the images longer so that I can check the quality
Also my video is short with 3 frames anyway, so making a 1 second video seems silly to me

For this particular case, I can understand why you perhaps did not want to make the frame rate higher, but if this approach were to be used with a larger file, doing 1 FPS would be rather inefficient.

Where are you stuck on a bitmap generator?
Bitmap is fairly easy to hack, do you mean you are trying to put bitmap on Docs?

The issue that we are having with our bitmap generator is something related to the header but we are not sure what. We attempted to follow documentation we found online for generating the bitmap header, but for some reason the bitmaps that we generate can only be interpreted by some programs. Some programs load them just fine while others cannot read them at all. Further up in this discussion, @78Alpha linked to a generator that he wrote. I have not had a chance to try it out (nor has anyone else working with me), but I suspect that we are going to abandon our implementation. For what it's worth, our implementation is written in C anyway and I would prefer to go for an all-Python solution.

Bitmap is an easy hack itself, but David was converting them to PNG, if I remember correctly, because Google Docs always converts images to PNG from docx (or microsoft format).

That is correct. I am converting to PNG. The funny thing though is that even though the conversion sounds like it would be the harder part, it's actually the easier part of my code, because I just run the bitmap we made through Pillow to convert it to a PNG.

On the OCR approach, I managed to get some good results with a combination of Tesseract-OCR and Consolas
There are still a few mistakes like 0/O, 2/Z, 5/S, but you can fix them manually

Uploaded an OCR example here:
https://www.youtube.com/watch?v=vXkhfHK52co&feature=youtu.be

Perhaps you could improve accuracy by putting fewer characters on each frame and making them larger? I imagine that this would bring a performance hit with it, but it could still be worth a shot.

Update:
Created a docx with all the images from my data-to-pixels approach using the Docx library and uploaded it to GDocs via the web; the images are compressed but the quality stays the same, and not a single byte of data is lost!
So now I can create the Docx files and upload them with rclone at much better speed👍

https://docs.google.com/document/d/1U6QJ6uV9zDQ6GcpBpKF1Wn9eSw08GiASkjWuUxLVUfM/edit?usp=sharing

Nice work! It looks like you actually achieved what I have been hoping to get working. How long does it take to process?

from uds.

nerkkolner avatar nerkkolner commented on May 23, 2024

For this particular case, I can understand why you perhaps did not want to make the frame rate higher, but if this approach were to be used with a larger file, doing 1 FPS would be rather inefficient.

Yes, I will make it at least 10FPS, or probably more for bigger files
A higher FPS also produces smaller video files, so it helps a lot

The issue that we are having with our bitmap generator is something related to the header but we are not sure what. We attempted to follow documentation we found online for generating the bitmap header, but for some reason the bitmaps that we generate can only be interpreted by some programs. Some programs load them just fine while others cannot read them at all. Further up in this discussion, @78Alpha linked to a generator that he wrote. I have not had a chance to try it out (nor has anyone else working with me), but I suspect that we are going to abandon our implementation. For what it's worth, our implementation is written in C anyway and I would prefer to go for an all-Python solution.

I only use a generic 54-byte header; I don't see much need to generate a unique header for each file (a sketch follows below)
Yes, use Python instead
It makes things so much easier and quicker
Without its libraries, I would have to use ImageMagick for my approach, which is much, much slower
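What such a generic 54-byte header looks like, as a sketch (assuming 24-bit uncompressed BI_RGB pixels and Python's struct module):

import struct

def bmp_header(width, height):
    # 14-byte BITMAPFILEHEADER + 40-byte BITMAPINFOHEADER = the generic 54 bytes;
    # keep width a multiple of 4 so the 24-bit rows need no padding
    size = width * height * 3
    file_hdr = struct.pack('<2sIHHI', b'BM', 54 + size, 0, 0, 54)
    info_hdr = struct.pack('<IiiHHIIiiII', 40, width, height, 1, 24,
                           0, size, 2835, 2835, 0, 0)
    return file_hdr + info_hdr

Prepend that to width*height*3 bytes of data and the result opens as an image.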

Perhaps you could improve accuracy by putting fewer characters on each frame and making them larger? I imagine that this would bring a performance hit with it, but it could still be worth a shot.

No, I have tried that and it doesn't help at all; in fact it makes things worse
You can't make the text too large or too small, otherwise the scan will generate more mistakes
Anyway, this OCR approach is very slow and inefficient
Not only can you not put much text on a single image (it only makes the scanning slower and generates more mistakes), you also can't store much data in a single image
At best you can store about 400 bytes of data in an image, compared to my approach, which stores more than 50KB, and more if you make bigger images

Nice work! It looks like you actually achieved what I have been hoping to get working. How long does it take to process?

For this MP3 file, less than 3 minutes to make all the images and put them into a Docx
I tested it on Windows 7 running on an X250
The process should be way faster on a much more powerful desktop machine
So far it is a decent working approach that acts as my last backup when others fail
One thing is for sure: running it on Win10 is a lot slower and I don't know why

from uds.

nerkkolner avatar nerkkolner commented on May 23, 2024

Look at what I found: someone wrote a utility in C six years ago to convert binary data to PNG, and it is very fast, much faster than mine and any other method I have tested
https://github.com/leeroybrun/Bin2PNG

Let's run some tests on it

from uds.

78Alpha avatar 78Alpha commented on May 23, 2024

I am not very familiar with C, but after looking over the code, I have a general idea of what the header would look like as output... In terms of human reading, it should match up perfectly to a generic header; however, it looks like it is missing the "space" characters, the ones that aren't shown and appear as dots in a hex editor. Some applications could see this and read it as compressed data or passively repair it, but the average app wouldn't go that far and might give up early. The one I use is from an actual BMP, copy-pasted; it still has trouble showing up in some apps (appears as just black instead of the random colored noise). Not that it should be viewed anyway; it devours RAM.

Given how my application works, it should be easy enough to copy and paste the header I use in place of that "struct" thing. Mine is glorified lego blocks of coding, snap the header piece on top of the data piece, following the guide book. I think if I remove all the code for the GUI, remove the hashing, etc... it is around 100 lines of code?

With the audio files, Google also has their music service; not unlimited, but it offers storage for 50,000 individual songs (300MB per song). I haven't compared that to the competition. Around 14 TB (50,000 × 300 MB ≈ 15 TB, or about 14 TiB) seems amazing though.

from uds.

DavidBerdik avatar DavidBerdik commented on May 23, 2024

I am not very familiar with C, but after looking over the code, I have a general idea of what the header would look like as output... In terms of human reading, it should match up perfectly to a generic header; however, it looks like it is missing the "space" characters, the ones that aren't shown and appear as dots in a hex editor. Some applications could see this and read it as compressed data or passively repair it, but the average app wouldn't go that far and might give up early. The one I use is from an actual BMP, copy-pasted; it still has trouble showing up in some apps (appears as just black instead of the random colored noise). Not that it should be viewed anyway; it devours RAM.

I think @digicannon already tried that but I will ask him.

Given how my application works, it should be easy enough to copy and paste the header I use in place of that "struct" thing. Mine is glorified lego blocks of coding, snap the header piece on top of the data piece, following the guide book. I think if I remove all the code for the GUI, remove the hashing, etc... it is around 100 lines of code?

Or I could do that instead. No use struggling with something that has already been done successfully.

With the audio files, Google also has their music service; not unlimited, but it offers storage for 50,000 individual songs (300MB per song). I haven't compared that to the competition. Around 14 TB (50,000 × 300 MB ≈ 15 TB, or about 14 TiB) seems amazing though.

Interesting! I was not aware that this service existed. That makes me wonder about another possible approach for storing files. Could we generate audio files out of data and upload them? As long as Google doesn't do anything to alter the audio tracks I would think it should be doable. Plus, it would be cool to listen to the sound of data.

from uds.

78Alpha avatar 78Alpha commented on May 23, 2024

I've wanted to listen to data files. Not sure which container handles the random noise best; it would also have to be uncompressed... WAV? FLAC? I'll have to research audio file structure.

from uds.

DavidBerdik avatar DavidBerdik commented on May 23, 2024

@78Alpha WAV and FLAC are both lossless, but FLAC is probably the better option because it compresses the data and WAV doesn't. At least according to Wikipedia.

from uds.

78Alpha avatar 78Alpha commented on May 23, 2024

Right, I forgot that players don't do the compressing themselves; they assume the media is already compressed. My mistake. Now to find a format with an easy header, hopefully one exists 😅.

https://en.m.wikipedia.org/wiki/Raw_audio_format

Something like that will be my first test format.
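If it helps, wrapping arbitrary bytes in a WAV container only takes the standard-library wave module (a sketch; the file names are placeholders):

import wave

# PCM frames are just raw samples, so the payload survives
# as long as nothing re-encodes the audio
data = open('archive.7z', 'rb').read()
data += b'\x00' * (len(data) % 2)        # pad to whole 16-bit frames

with wave.open('payload.wav', 'wb') as w:
    w.setnchannels(1)                    # mono
    w.setsampwidth(2)                    # 16-bit samples
    w.setframerate(44100)
    w.writeframes(data)                  # the sound of data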

Edit:

Forgot Google only supports certain formats; then again, they technically don't support bitmap either, so... maybe I'll get lucky?

Supported file formats for upload include: MP3, AAC, WMA, FLAC, Ogg, or ALAC. Non-MP3 uploads will be converted to MP3. Files can be up to 300 MB after conversion.

Although that's a stumper

from uds.

DavidBerdik avatar DavidBerdik commented on May 23, 2024

Non-MP3 uploads will be converted to MP3.

Okay that's bad. MP3 is lossy.

from uds.

DavidBerdik avatar DavidBerdik commented on May 23, 2024

@Zibri That's an interesting approach. I was not aware that ffmpeg could generate independent images as well.
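For reference, the ffmpeg round trip looks roughly like this (a sketch; the 1024x1024 frame size is an assumption, and the input must be padded to a whole number of frames, i.e. a multiple of 1024*1024*3 bytes):

import subprocess

# raw bytes -> one lossless PNG per frame
subprocess.run(['ffmpeg', '-f', 'rawvideo', '-pix_fmt', 'rgb24',
                '-s', '1024x1024', '-i', 'data.raw', 'frame%04d.png'], check=True)

# and back: frames -> raw bytes (trim the padding afterwards)
subprocess.run(['ffmpeg', '-i', 'frame%04d.png', '-f', 'rawvideo',
                '-pix_fmt', 'rgb24', 'data_out.raw'], check=True)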

from uds.

DavidBerdik avatar DavidBerdik commented on May 23, 2024

I still haven't had much time to do anything outside of theorizing about this.

from uds.

petercyr avatar petercyr commented on May 23, 2024

Doubt OCR would be worth it but there's a better way of putting data into images.

Here's 2000 bytes of Lorem Ipsum, in a QR code, stored in a GIF, which takes 3,821 bytes

[image: QR code containing the 2,000-byte Lorem Ipsum payload]

Click it, zoom in until it's pretty big, open your phone camera, and point it at it. iOS has no problem reading the code.

edit: Also, the 3,821-byte version stays crystal clear when zoomed in OSX Preview. When opened in Chrome, zooming in is blurry, but it scans fine anyway.
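For anyone who wants to reproduce this, the Python qrcode package can generate codes like that; a version-40 code at the lowest error-correction level tops out near 3 KB of binary per image (a sketch; the chunk name is a placeholder):

import qrcode                      # pip install qrcode[pil]

qr = qrcode.QRCode(version=40,
                   error_correction=qrcode.constants.ERROR_CORRECT_L)
qr.add_data(open('chunk.bin', 'rb').read())   # up to ~2953 bytes at this setting
qr.make(fit=False)
qr.make_image().save('chunk.png')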

from uds.

stewartmcgown avatar stewartmcgown commented on May 23, 2024

from uds.

DavidBerdik avatar DavidBerdik commented on May 23, 2024

Yes. I tried playing with QR codes as well and came to the same conclusion as @stewartmcgown. It's a cool proof of concept but it's much less efficient.

from uds.

nerkkolner avatar nerkkolner commented on May 23, 2024

^ Hi David
How did you make the PNGs exactly?
Is it simply appending the arbitrary data after a predefined PNG header?
Google Docs will always compress the PNGs if you pack them into a Word file and later upload it to Google Drive
I tried both inserting a "critical" chunk containing the data using https://github.com/sherlly/PCRT and just appending data right after the entire header, and Google Drive will remove them once uploaded
You can never make it work using this approach
However, if you just paste the PNGs into a Google Doc directly, the images are left untouched

My approach works simply because I convert the arbitrary data to the raw pixels displayed in the image, using a Python script or ffmpeg like Zibri suggested
Google Drive does compression at the byte level; it just takes away the useless or invalid random data from the image (anything considered non-standard in a PNG image)
From what I have found, the compression will not do anything to the pixels displayed in the image
What is shown as a dot (e.g. #000000) in the image is always plain black, and it will not change regardless of what the compression does to the image

I hope this helps
I am also interested in your hackathon project, any demo to try?

from uds.

nerkkolner avatar nerkkolner commented on May 23, 2024

Here is another C# project that converts arbitrary data to raw pixels, similar to my approach
https://github.com/YDKK/File2BMP
This one works the fastest of all the similar tools I have tested (other than ffmpeg)

There is also another data-to-video project using Mathematica (I hadn't heard of it at all)
https://github.com/dzhang314/YouTubeDrive

I haven't tried it, but the example shown in the readme doesn't look good at all and can't be trusted to convert back to the original data

from uds.

MarkMichon1 avatar MarkMichon1 commented on May 23, 2024

Thanks! The last frame looks the way it does simply because there isn't enough data to fill a whole frame. The header contains information on how many blocks the reader must scan (specifically for cases like this), so it stops when the data terminates. After that last block, whatever is on the frame doesn't matter; I simply chose white as the default background color that the blocks get overlaid on.

As for splitting the videos, this isn't possible currently, but it could be implemented with a little modification. This does currently work for image outputs though: 1-pixel blocks with 24-bit color have a nearly 1:1 ratio for storage efficiency.

from uds.

nerkkolner avatar nerkkolner commented on May 23, 2024

@nerkkolner Our project generates PNGs using Pillow. We simply use the frombytes() (https://pillow.readthedocs.io/en/3.3.x/reference/Image.html#PIL.Image.frombytes) function to generate an image from a chunk of bytes that we read from the file that we want to upload. To get the original data back when downloading, we read the pixels in the image. At the moment, we do not want to share our project because it is not completely functional yet. Once it is though, the repository will be made public and I will share a link.

As for our ever-growing list of projects that try to tackle the same problem, @digicannon sent me a link to a recent one that he found that generates videos. - https://github.com/MarkMichon1/BitGlitter

Normally, it should work, though I use a different way to generate the image with Python
As for my script, the 1920x1080 PNG images it generates are ~6MB, slightly bigger than the raw data fed into the script
What is the resolution of that 1MB image you generated?
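For reference, the frombytes() round trip described above is roughly this (a sketch; 852x480 is just an assumed size here, and the payload must be padded to exactly width*height*3 bytes):

from PIL import Image

W, H = 852, 480                                   # assumed frame size
payload = open('part.bin', 'rb').read()           # at most W*H*3 bytes
payload += b'\x00' * (W * H * 3 - len(payload))   # pad to exactly W*H*3

Image.frombytes('RGB', (W, H), payload).save('part.png')   # data -> pixels
restored = Image.open('part.png').tobytes()                # pixels -> data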

from uds.

nerkkolner avatar nerkkolner commented on May 23, 2024

Someone made a similar UDS tool for Google Sheets here
https://github.com/vinay-deshmukh/Sheet-Disk

from uds.

DavidBerdik avatar DavidBerdik commented on May 23, 2024

What is the resolution of that 1MB image you generated?

Unfortunately I do not remember and I can't seem to find it in our repository either. The issue that I mentioned above with Google excessively compressing the image goes away as long as you keep the dimensions below a certain threshold though.

I was hoping to get a fully working version of the project done yesterday and make the repository public, but...
https://www.engadget.com/2019/06/02/google-cloud-outage/?guccounter=1
...I couldn't get the Drive API to work.

Someone made a similar UDS tool for Google Sheets here
https://github.com/vinay-deshmukh/Sheet-Disk

I thought about trying out this approach as well but I decided against it because I am not sure that it buys you anything either in terms of speed or storage efficiency.

from uds.

nerkkolner avatar nerkkolner commented on May 23, 2024

Unfortunately I do not remember and I can't seem to find it in our repository either. The issue that I mentioned above with Google excessively compressing the image goes away as long as you keep the dimensions below a certain threshold though.

lol sorry, I forgot that I never tried 1080P images with Google Docs
When I made it work with GDocs, my images were exactly 852x480
So that might be the reason why your PNGs got compressed heavily but mine didn't
Thanks for clearing that up!

I was hoping to get a fully working version of the project done yesterday and make the repository public, but...
https://www.engadget.com/2019/06/02/google-cloud-outage/?guccounter=1
...I couldn't get the Drive API to work.

We can wait a little longer

I thought about trying out this approach as well but I decided against it because I am not sure that it buys you anything either in terms of speed or storage efficiency.

It really doesn't, honestly, but a single Sheet file can store up to almost 50MB per the readme, so doesn't that mean it should work much faster in terms of upload speed?

from uds.
