In distributed file backup or file-sharing systems, different source nodes may hold different versions of the same file, differing by a small number of edits, including deletions and insertions. How can we efficiently send a file to a remote node that has a different version of it? What is the fundamental limit on the number of bits that need to be sent? Our result:
The minimum number of bits = “content” + “location” – “nature’s secret”. “Nature’s secret” is approximately 1.29 bits per burst of deletions.
What is “nature’s secret”?
Given the original and edited versions, sometimes one can never determine where nature (or the editor) made the edit. For example, if the original sequence was “000” and the edited sequence is “00”, one can never know which of the three zeros was deleted. That is nature’s secret. Interestingly, this phenomenon yields a discount on the number of bits that the encoder sends, because the encoder never needs to describe nature’s secret.
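A three-line check makes the ambiguity concrete: every one of the three possible single-bit deletions from “000” produces the identical string “00”, so no decoder could ever recover which position was deleted.

```python
# Delete each of the three positions in turn and collect the distinct results.
original = "000"
edited_versions = {original[:i] + original[i + 1:] for i in range(len(original))}
print(edited_versions)  # {'00'} -- one outcome for three possible deletion positions
```

Three deletion choices collapse into one observable outcome; the choice itself is unrecoverable, and describing it would be wasted bits.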
How much is the discount? In a binary file of 1GB, if 1% of the bits are deleted in a bursty fashion with an average burst length of 2, a naive encoder needs 10MB to describe the deleted content and 55MB to describe the locations of the deletions. But the optimal encoder gets a discount of 6.5MB because of “nature’s secret”. That is a discount of 10%!
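A back-of-the-envelope check of these numbers, assuming 1GB = 8×10⁹ bits and taking the 55MB location cost and the 1.29 bits per burst from the text above:

```python
# Verify the 10 MB / 6.5 MB / 10% figures quoted in the text.
file_bits = 8 * 10**9             # 1 GB file, in bits
deleted_bits = 0.01 * file_bits   # 1% of bits deleted
avg_burst_len = 2
num_bursts = deleted_bits / avg_burst_len        # 4e7 bursts of deletions

content_mb = deleted_bits / 8 / 10**6            # deleted content: 10 MB
discount_mb = num_bursts * 1.29 / 8 / 10**6      # ~1.29 bits per burst: ~6.45 MB
naive_total_mb = content_mb + 55                 # content + location (55 MB per text)

print(content_mb)                                # 10.0
print(round(discount_mb, 2))                     # 6.45, i.e. roughly 6.5 MB
print(round(discount_mb / naive_total_mb, 2))    # 0.1, i.e. a 10% discount
```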
The resources required to maintain video databases like YouTube are increasing rapidly, both in terms of upload/download bandwidth and actual storage. However, such databases often contain significant amounts of similar or identical video. Although files are compressed, compression typically occurs at the level of individual videos and does not exploit similarities across the database. We aim to reduce both the bandwidth and storage necessary to manage a video database by harnessing these cross-database similarities to reduce the information uploaded and stored. This is accomplished in part by treating the upload process as a video synchronization problem, in which we identify and store only the differences between a query video and previously uploaded videos. There are two main components to the proposed system. The first involves finding similar videos in the database; this is essentially a content-based video search, as we do not rely on metadata. The second is video synchronization, which compares the query video with the set of database videos deemed similar to it; the differences between these videos are then uploaded to the database along with instructions for reconstructing the query video.
Video Search: The video search component uses the query video’s content to find other videos in the database with similar visual content. That is, some subset of frames in two similar videos should have identical or almost identical composition. To determine which stored videos best match the query video, we use a hierarchical comparison system. The first layer of comparison is based entirely on audio data: we compare the audio fingerprints of the query and database files [1]. This is founded on the assumption that identical or similar audio generally corresponds to related video segments. Even if the audio matches only partially, we can use audio similarity as a filter to reduce the number of frame image comparisons necessary. Given a set of potential matches whose audio is similar, we wish to determine whether the visual content is similar. This cannot be determined purely from bit streams, because differences in encoding or video quality will completely alter the bit stream. Instead, we use feature-space video signatures [2] to decide whether two frames are identical or almost identical. If a file passes this more stringent check, then it is likely a close match.
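The two-stage filter can be sketched as follows. This is a minimal illustration with hypothetical stand-in similarity measures (Jaccard overlap of fingerprint hashes and fraction of matching frame signatures); the real fingerprints and signatures are those of [1] and [2]:

```python
def audio_similarity(a, b):
    # Stand-in: Jaccard similarity between sets of audio-fingerprint hashes.
    return len(a["audio"] & b["audio"]) / len(a["audio"] | b["audio"])

def frame_similarity(a, b):
    # Stand-in: fraction of matching feature-space frame signatures.
    matches = sum(x == y for x, y in zip(a["frames"], b["frames"]))
    return matches / max(len(a["frames"]), len(b["frames"]))

def find_similar(query, database, audio_thr=0.5, frame_thr=0.8):
    # Stage 1: cheap audio filter prunes most of the database.
    candidates = [v for v in database if audio_similarity(query, v) >= audio_thr]
    # Stage 2: stricter frame-signature check confirms visual similarity.
    return [v for v in candidates if frame_similarity(query, v) >= frame_thr]

query = {"audio": {1, 2, 3, 4}, "frames": ["a", "b", "c"]}
db = [
    {"audio": {1, 2, 3, 5}, "frames": ["a", "b", "c"]},  # near-duplicate
    {"audio": {9, 10, 11}, "frames": ["x", "y", "z"]},   # unrelated
]
print(find_similar(query, db))  # only the near-duplicate survives both stages
```

The design point is that the audio stage is cheap enough to run against the whole database, so the expensive frame-level comparison only touches a short candidate list.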
Synchronization: For the synchronization portion of the problem, we use a pixel-level synchronization tool called VSYNC [3]. VSYNC compares a query video with a reference video and determines the content differences (demo). These differences are encoded in a summary file, which can then be uploaded to the database and used for reconstruction when the file is requested at a later point in time. This reduces the storage demands on the database as well as the bandwidth required to upload the video in the first place.
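VSYNC itself works at the pixel level, but the store-a-diff-and-reconstruct idea can be illustrated with a generic sequence differ (Python’s difflib, used here purely as a stand-in, with frames represented as strings):

```python
import difflib

reference = ["frame0", "frame1", "frame2", "frame3"]
query     = ["frame0", "frame1x", "frame2", "frame3", "frame4"]

# Summary file: only the opcodes where query differs from reference, plus the
# replacement content -- analogous to VSYNC's encoded difference summary.
summary = [(tag, i1, i2, query[j1:j2])
           for tag, i1, i2, j1, j2 in
           difflib.SequenceMatcher(a=reference, b=query).get_opcodes()
           if tag != "equal"]

def reconstruct(reference, summary):
    # Replay the summary against the stored reference to rebuild the query.
    out, pos = [], 0
    for tag, i1, i2, content in summary:
        out.extend(reference[pos:i1])  # copy the unchanged run
        out.extend(content)            # apply the stored difference
        pos = i2
    out.extend(reference[pos:])
    return out

assert reconstruct(reference, summary) == query
```

Only the summary travels over the network and into storage; the unchanged runs are reconstructed from the reference already held in the database.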