
jrsync's Issues

Add support for resuming an incomplete sync

Because synchronizing large files can take considerable time, expecting the entire process to complete in one attempt is often unrealistic. To allow incremental progress in these scenarios, support should be added for resuming an aborted sync where it left off. This is especially valuable on Android, where Sync Adapters can be halted if they run too long.

Add the ability for clients to track progress

As an application developer, I want the ability to get status updates so that I can give users more feedback during longer sync operations.

There are effectively two operations that clients will be interested in:

  • Finding matching blocks - reported as % of file searched
  • Building result using matches - reported as % of target file written
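One possible shape for such a client-facing API, sketched below with hypothetical names (jrsync's actual interface may differ); the two stages mirror the operations listed above:

```java
// Hypothetical sketch of a progress callback for the two sync stages.
// Names (ProgressSketch, ProgressListener, Stage) are illustrative only.
public class ProgressSketch {

    public enum Stage { SEARCH, BUILD }

    public interface ProgressListener {
        void onProgress(Stage stage, int percent); // percent in 0-100
    }

    // Helper converting processed/total byte counts into a whole percentage.
    public static int percent(long processed, long total) {
        if (total <= 0) return 100;
        return (int) Math.min(100, (processed * 100) / total);
    }
}
```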

Add support for canceling a sync

As an application developer, I want the ability to cancel a sync in progress so that I can give the application and users finer control over the process.

Rolling checksum recurrence doesn't hold on certain files

This was a particularly nasty defect to find.

I began to suspect that something was off when attempting to sync two large but highly similar SQLite database files. Everything seemed normal because the result was constructed successfully, but it was taking much longer than expected.

After writing some tests against the larger files, I determined that the problem was that the checksum recurrence relationship didn't hold. However, this implied the failure was content-specific, because the recurrence did hold for the test files. I applied the recurrence test to another binary test file in the project and it failed too. After minimizing the test and looking at it in the debugger, I saw that the byte being added was negative.

The checksum algorithm described in the paper assumes unsigned bytes; however, Java's byte type is signed. Apparently, I had already considered this, because the update method converts its argument to a positive value of a wider type (int) by bit masking. However, the bytes read from the internal buffer were being used directly.
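A minimal sketch of the issue and fix (class and method names here are illustrative, not jrsync's actual code): masking each byte with 0xFF widens it to a non-negative int before it enters the two running sums of the rsync-style weak checksum, which keeps the rolling recurrence consistent with a from-scratch computation.

```java
// Illustrative rsync-style rolling checksum with the signed-byte fix.
public class RollingChecksum {
    private int a, b; // the two running sums, kept mod 2^16

    // Compute the checksum of a block from scratch.
    public void init(byte[] block, int off, int len) {
        a = 0;
        b = 0;
        for (int i = 0; i < len; i++) {
            int u = block[off + i] & 0xFF; // unsigned value of the byte
            a = (a + u) & 0xFFFF;
            b = (b + (len - i) * u) & 0xFFFF;
        }
    }

    // Roll the window forward one byte. Without the 0xFF masks, negative
    // byte values corrupt the sums and the recurrence no longer holds.
    public void roll(byte out, byte in, int blockSize) {
        int outU = out & 0xFF;
        int inU = in & 0xFF;
        a = (a - outU + inU) & 0xFFFF;
        b = (b - blockSize * outU + a) & 0xFFFF;
    }

    public int value() {
        return (b << 16) | a;
    }
}
```

The recurrence test described above amounts to checking that rolling the window one byte forward yields the same value as recomputing the checksum from scratch at the new offset, including for bytes above 0x7F.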

Enable alternative zsync download strategies

When syncing sparsely matching content, it can actually be more expensive to use incremental sync than it would be to just download the whole file directly. This is particularly true when using HTTP compression, since multi-range requests aren't compressed.

Provide an effective means of handling content like this. It could be as simple as aborting the sync with a dedicated exception so the client can handle the situation directly, or the library could handle it itself, either by allowing users to supply alternative download strategies or by building the fallback in.
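A sketch of what the pluggable-strategy option might look like (interface name and the example policy are hypothetical, not jrsync's API): the sync estimates how much of the target matched and can fall back to a full download when incremental transfer would be more expensive.

```java
// Hypothetical pluggable download strategy for sparse matches.
public class StrategySketch {

    public interface DownloadStrategy {
        // Return true if a plain full download should be used instead
        // of an incremental (ranged) sync.
        boolean preferFullDownload(long matchedBytes, long totalBytes);
    }

    // Example policy: fall back when less than 20% of the target matched.
    public static final DownloadStrategy SPARSE_FALLBACK =
            (matched, total) -> total > 0 && matched * 5 < total;
}
```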

Zsync http range requests fail with 400 due to long byte range headers

While testing synchronization of two large, significantly different files, I found that the second phase of zsync consistently failed with HTTP 400. After sniffing the HTTP request, the cause appears to be the size of the HTTP headers when there are large numbers of non-contiguous block matches. In this case, Apache Tomcat's default configuration limits header size to 4KB, but the headers were over 40KB.
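One way to stay under a server's header limit is to split the range list across several requests, capping each Range header's length. A sketch of that idea (names and the splitting policy are illustrative, not jrsync's code):

```java
import java.util.ArrayList;
import java.util.List;

// Split a long list of byte ranges into multiple Range header values,
// each kept under a maximum length (e.g. well below Tomcat's 4KB default).
public class RangeSplitter {

    // ranges: array of {start, end} pairs, inclusive byte offsets.
    public static List<String> buildRangeHeaders(long[][] ranges, int maxHeaderLen) {
        List<String> headers = new ArrayList<>();
        StringBuilder current = new StringBuilder("bytes=");
        for (long[] r : ranges) {
            String spec = r[0] + "-" + r[1];
            // Flush the current header if adding this range would exceed the cap.
            if (current.length() > 6 && current.length() + spec.length() + 1 > maxHeaderLen) {
                headers.add(current.toString());
                current = new StringBuilder("bytes=");
            }
            if (current.length() > 6) current.append(',');
            current.append(spec);
        }
        if (current.length() > 6) headers.add(current.toString());
        return headers;
    }
}
```

Each resulting header value can then be sent as its own request, at the cost of extra round trips.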

Sync progress works unexpectedly during build phase for large sync ranges

The progress tracker doesn't always behave properly with the way file building currently works. Because progress is only reported when seeking, and not during the actual copy, status updates are not frequent enough to give proper feedback to the user.

For example, if it is determined that a single range is required that consists of the entire file minus the leading bytes, the progress stays around 0-1% and then jumps to 100% after a long wait for the actual copy operation.
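A sketch of the likely fix, copying matched ranges in fixed-size chunks and reporting after each chunk rather than only at seek time (names here are illustrative, not jrsync's implementation):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Copy a matched range in fixed-size chunks, firing a progress callback
// per chunk so large single-range copies still produce frequent updates.
public class ChunkedCopy {

    public interface Progress { void report(long copied, long total); }

    public static long copy(InputStream in, OutputStream out, long total,
                            Progress progress) throws IOException {
        byte[] buf = new byte[8192];
        long copied = 0;
        int n;
        while (copied < total &&
               (n = in.read(buf, 0, (int) Math.min(buf.length, total - copied))) != -1) {
            out.write(buf, 0, n);
            copied += n;
            progress.report(copied, total); // fires per chunk, not per range
        }
        return copied;
    }
}
```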

Use buffered I/O to reduce kernel call overhead

While investigating the possible causes of on-device performance issues, I noticed that a good portion of CPU time was spent in kernel calls reading the individual bytes required for the checksum at the next block offset. The reason is that the code uses RandomAccessFile, which is not buffered. Changing the code to use buffered I/O should improve performance significantly.
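To illustrate the difference (this is a generic sketch, not jrsync's code): wrapping the file stream in a BufferedInputStream turns most single-byte read() calls into array copies from an in-memory buffer, so only one kernel read is issued per buffer fill.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Byte-at-a-time processing over a buffered stream: read() usually hits
// the in-memory buffer, so kernel reads happen once per 64KB fill.
public class BufferedRead {
    public static long sumBytes(String path) throws IOException {
        long sum = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path), 64 * 1024)) {
            int b;
            while ((b = in.read()) != -1) {
                sum += b; // b is the unsigned byte value, 0-255
            }
        }
        return sum;
    }
}
```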

Metadata.generate fails when destination is not on same disk as java temp dir

While support for specifying temp directories was added in #1, the Metadata.generate method still used the default, which is the directory given by the java.io.tmpdir system property. When this method is called with a destination that does not share a disk with the temp directory, it fails to rename the temp file to the destination file. This means the metadata is left in some unknown location, rather than at the specified one.

The implementation of this method should be changed to work as intended regardless of the destination directory's location relative to the temp directory.
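One way to make the move robust across file systems (a generic sketch, not jrsync's implementation): attempt an atomic rename first, and fall back to copy-and-delete when the destination is on a different disk.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Move a file, tolerating source and destination on different file systems.
public class SafeMove {
    public static void move(Path source, Path dest) throws IOException {
        try {
            Files.move(source, dest, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Atomic rename failed (likely a cross-device move); copy instead.
            Files.copy(source, dest, StandardCopyOption.REPLACE_EXISTING);
            Files.delete(source);
        }
    }
}
```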

Avoid memory allocations within the main block search loop

While investigating possible causes of slow sync performance on-device, a good portion of the time in method traces was unaccounted for even after converting to buffered IO. Memory allocation traces showed a large number of allocations of java.lang.Long for checksum comparisons in the main block search loop. Avoiding allocations in the main loop should help performance quite a bit, especially since the complexity of the loop is O(b) where b is the number of bytes. For large files, this can be millions of allocations.
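One allocation-free alternative (a sketch under the assumption the checksums are kept boxed today; names are illustrative): store the block checksums in a sorted primitive long[] and probe it with Arrays.binarySearch, so the per-byte hot loop never autoboxes.

```java
import java.util.Arrays;

// Primitive checksum index: lookups allocate nothing, unlike comparing
// boxed java.lang.Long values on every byte offset of the search loop.
public class ChecksumIndex {
    private final long[] sorted;

    public ChecksumIndex(long[] checksums) {
        sorted = checksums.clone();
        Arrays.sort(sorted);
    }

    // Called once per byte offset; no autoboxing, no garbage.
    public boolean contains(long checksum) {
        return Arrays.binarySearch(sorted, checksum) >= 0;
    }
}
```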

Multipart boundaries not being parsed properly from headers

Reported as part of the cims-tablet project as cims-bioko/cims-tablet#236.

Due to the way that boundary is parsed out of the header, it includes trailing content. This causes the boundary delimiter to be wrong and as a result, misinterpretation of the entire payload.

Rather than simply taking the non-whitespace following boundary=, parsing should conform to the pattern specified in the RFC for multipart content: https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html
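A sketch of RFC-conformant extraction (illustrative, not jrsync's code): the boundary parameter may be quoted, and an unquoted value is limited to the RFC 1341 boundary character set, so it stops at the next parameter delimiter instead of swallowing trailing content.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extract the multipart boundary from a Content-Type header value,
// handling both quoted and unquoted forms per the RFC 1341 character set.
public class BoundaryParser {

    private static final Pattern BOUNDARY = Pattern.compile(
            "boundary=(?:\"([^\"]+)\"|([0-9A-Za-z'()+_,\\-./:=?]{1,70}))");

    public static String parse(String contentType) {
        Matcher m = BOUNDARY.matcher(contentType);
        if (!m.find()) return null;
        String value = m.group(1) != null ? m.group(1) : m.group(2);
        return value.trim(); // a boundary must not end with whitespace
    }
}
```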

Mockito dependency should be test scope

The project metadata should be updated so that test dependencies aren't leaked through to dependent artifacts. Currently, in 1.0, mockito appears at compile scope because no scope was specified.
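The standard fix is to declare the dependency with test scope in the pom (artifact coordinates below are illustrative; the version is left as a placeholder):

```xml
<!-- Test-scoped dependencies are excluded from the compile classpath
     of artifacts that depend on this project. -->
<dependency>
  <groupId>org.mockito</groupId>
  <artifactId>mockito-core</artifactId>
  <version>${mockito.version}</version>
  <scope>test</scope>
</dependency>
```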

Progress tracker does not report 0% during build stage

Due to the way progress caching is implemented, CopyTracker does not report the initial zero-percent status. For clients using these progress events to track sync status, it appears as though the sync stays at 100% in the block search phase for longer than expected.

Allow specifying location for temp files

Not sure which way to go with this, but file moves can fail, for example when source and destination are on different file systems. Wherever the library does this, it should allow the user to specify the temp location. For example, the unit tests fail when building the library on platforms where the default temp location is not on the same file system as the project.
