Watermarking and Fingerprinting of Multimedia Content

Jaap Haitsma
Philips Research

Imagine the following situation: you're listening to the radio and all of a sudden you hear the greatest song you have heard in a long time, but you don't know what it is. Wouldn't it be nice if you could push a button on, for example, your mobile phone and, after a few seconds, it would tell you the artist and title of the song? Audio fingerprinting or audio watermarking can make this possible.

The first part of this talk will give a general overview of the watermarking and fingerprinting technologies (both for audio and video) that have been developed by Philips Research over the past six years.

Watermarking is a technology in which an extra signal, containing an identifier, is added to a multimedia signal in such a way that it is inaudible or invisible but can still be reliably detected by electronic means.

As opposed to watermarking, fingerprinting does not add anything to the multimedia signal. It only extracts essential features (the fingerprint) from the signal and stores these in a database. An unknown multimedia signal can then be identified by extracting its essential features and subsequently looking these up in the database.

The second part will give a more detailed technical explanation of audio fingerprinting, on which the speaker has been working for the past two years.

Philips: Audio Fingerprinting

Summary of the talk

Gemma Pons

At Philips Research, approximately 10 researchers work on audio/video watermarking and fingerprinting. The group is named ICEMan. The main projects they have worked on, or are still working on, are the following:

-         DVD video copy protection.

-         Broadcast Monitoring.

·       VIVA European Project.

·       Commercial product: WaterCast.

§         Worldwide monitoring service performed by TeleTrax.

-         Forensic tracking for Digital Cinema.

·       European Project: ITEA Digital Cinema.

-         Electronic Music Distribution

·       Every downloaded file gets a unique watermark to enable tracking.

-         Toys

·       By equipping dolls with watermark detectors they can react to videos.

-         Currently a “startup company” in research.

-         Still more in a research phase.

-         Reversible watermarking

-         Semi Fragile watermarking

 

We now concentrate on the audio fingerprinting system created by Philips. An audio fingerprint is a set of features that uniquely identifies a segment of audio. Conceptually it is the same as a human fingerprint. A fingerprint system generally consists of two components: database generation, which extracts fingerprints from known audio and stores them, and audio identification, which extracts the fingerprint of an unknown clip and compares it with the fingerprints in the database.

 

There are many applications we can consider for audio fingerprinting:

-         Broadcast Monitoring: monitoring which songs are broadcast on the radio and how often.

-         Audio Recognition on Mobile Phone / CE Devices / PCs: the user has the option to buy the CD, concert tickets, T-shirts, etc.

-         File Sharing: filtering out copyrighted material for legal Napster services.

-         Transmission Verification

-         Organizing Digital Music Libraries: restoring Metadata.

 

How do watermarking and fingerprinting differ? The main advantage of fingerprinting over watermarking is that the content is not modified in any way and that it works for existing content. However, a connection to a database is required and the database is more complex.

 

The main parameters of an audio fingerprint system are:

-         Robustness: “Can a song still be identified after (severe) degradation?” In order to achieve high robustness the fingerprint should be based on perceptual features that are invariant (at least to a certain degree) with respect to signal degradations.

-         Reliability: “How often is a song falsely identified?”

-         Fingerprint Size: “How much storage is needed for a fingerprint?” The fingerprint size determines the memory needed for a fingerprint database server.

-         Granularity: “How many seconds of audio are needed to identify a song?” This parameter can depend on the application.

-         Search Speed/Scalability: “How long does it take to find a fingerprint in the database?” “What if the database has to contain millions of songs?”

 

The Fingerprint Extraction Algorithm is based on the following approach:

-         Calculate Fourier Transform of every frame

-         Subdivide spectrum into bands

-         Calculate a robust property of every band (E.g. energy)

-         Apply threshold

 

First the audio signal is segmented into overlapping frames. The frames have a length of 0.4 seconds and are weighted by a Hanning window, with an overlap factor of 31/32. As a result, a 32-bit sub-fingerprint is extracted every 12 ms. A fingerprint block consists of 256 consecutive sub-fingerprints, corresponding to a granularity of only 3 seconds. The fingerprint of a song is formed by the list of its sub-fingerprints.
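The framing figures can be checked with a little arithmetic. The sketch below assumes that the hop between frames is the frame length times (1 - overlap); the constant and function names are illustrative, and the 4-minute song length is borrowed from the capacity figures given later in this summary.

```python
FRAME_LENGTH_S = 0.4      # frame length quoted in the talk
OVERLAP = 31 / 32         # overlap factor quoted in the talk
BITS_PER_SUB_FP = 32
SUB_FPS_PER_BLOCK = 256

# Each new frame starts 1/32 of a frame length after the previous one.
hop_s = FRAME_LENGTH_S * (1 - OVERLAP)            # 0.0125 s, quoted as "12 ms"
block_span_s = SUB_FPS_PER_BLOCK * hop_s           # 3.2 s, quoted as "3 seconds"
block_bits = SUB_FPS_PER_BLOCK * BITS_PER_SUB_FP   # 8192 bits per block


def sub_fingerprints_per_song(duration_s):
    """Approximate number of sub-fingerprints for a song of the given length."""
    return round(duration_s / hop_s)


song_sub_fps = sub_fingerprints_per_song(4 * 60)   # a 4-minute song
song_bytes = song_sub_fps * BITS_PER_SUB_FP // 8   # raw fingerprint size in bytes
```

Under these assumptions a 4-minute song yields 19,200 sub-fingerprints, or roughly 75 KB of raw fingerprint data.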

 

The most important perceptual audio features live in the frequency domain. Therefore a spectral representation is computed by performing a Fourier transform on every frame. Because the Human Auditory System (HAS) is relatively insensitive to phase, only the absolute value of the spectrum is retained.
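As a small illustration of why keeping only the magnitude discards phase, the sketch below windows a short frame and takes the absolute value of its discrete Fourier transform. It is not the Philips implementation: the 64-sample frame, the naive DFT (a real system would use an FFT) and all function names are assumptions chosen to keep the example self-contained.

```python
import cmath
import math


def hann_window(n):
    """Periodic Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def magnitude_spectrum(frame):
    """Window the frame and return |DFT| for bins 0..n/2 (naive O(n^2) DFT)."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hann_window(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc))
    return spectrum


def tone(n, cycles, phase):
    """A sine with a whole number of cycles in the frame and a given phase."""
    return [math.sin(2 * math.pi * cycles * i / n + phase) for i in range(n)]


n = 64
spec_a = magnitude_spectrum(tone(n, 8, 0.0))
spec_b = magnitude_spectrum(tone(n, 8, 1.0))   # same tone, different phase
max_diff = max(abs(a - b) for a, b in zip(spec_a, spec_b))
```

The two tones differ only in phase, so their magnitude spectra agree up to rounding error and both peak at bin 8.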

 

In order to extract a 32-bit sub-fingerprint value for every frame, 33 logarithmically spaced bands ranging from 300 to 2000 Hz are selected. The energy of every band is then computed; experiments show that energy is a very robust property.
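The band step can be sketched as follows. The summary does not say how 33 band energies become 32 bits; the sketch assumes the rule from the published description of the Philips scheme, in which each bit is the sign of the energy difference between two neighbouring bands, compared with the same difference in the previous frame. The function names, the sample-rate parameter and the bin-to-band assignment are illustrative assumptions.

```python
def log_band_edges(f_lo=300.0, f_hi=2000.0, n_bands=33):
    """Return n_bands + 1 logarithmically spaced band edges in Hz."""
    ratio = (f_hi / f_lo) ** (1.0 / n_bands)
    return [f_lo * ratio ** i for i in range(n_bands + 1)]


def band_energies(magnitudes, sample_rate, edges):
    """Sum the squared magnitudes of the DFT bins that fall in each band.

    `magnitudes` holds |X[k]| for bins 0..n/2 of one frame.
    """
    n_fft = 2 * (len(magnitudes) - 1)
    energies = [0.0] * (len(edges) - 1)
    for k, mag in enumerate(magnitudes):
        freq = k * sample_rate / n_fft
        for b in range(len(energies)):
            if edges[b] <= freq < edges[b + 1]:
                energies[b] += mag * mag
                break
    return energies


def sub_fingerprint(prev_energies, cur_energies):
    """Pack 32 bits from two consecutive frames of 33 band energies.

    Assumed rule: bit m is 1 when the energy difference between bands m and
    m + 1 is larger in the current frame than in the previous frame.
    """
    value = 0
    for m in range(len(cur_energies) - 1):
        diff = ((cur_energies[m] - cur_energies[m + 1])
                - (prev_energies[m] - prev_energies[m + 1]))
        value = (value << 1) | (1 if diff > 0 else 0)
    return value


edges = log_band_edges()
flat = [1.0] * 33
falling = [float(33 - i) for i in range(33)]   # energy decreases with band index
bits = sub_fingerprint(flat, falling)
```

Because only the sign of a difference is kept, a uniform gain change leaves every bit unchanged, which is one reason a thresholded energy feature can survive degradation.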

 

Ideally, the fingerprint block of an original song and that of a compressed version of the same excerpt would be identical. In practice, compression causes some of the bits to be retrieved incorrectly. The number of bit errors is used as the similarity measure: two fingerprint blocks are declared similar if the number of bit errors between them is below a certain threshold.
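A minimal sketch of this comparison, assuming each fingerprint block is stored as a list of 256 unsigned 32-bit integers. The function names are illustrative, and the default threshold of 0.35 is the bit error rate quoted in the talk.

```python
def bit_errors(block_a, block_b):
    """Count differing bits between two equally long lists of 32-bit values."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of sub-fingerprints")
    return sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))


def bit_error_rate(block_a, block_b, bits_per_sub_fp=32):
    """Fraction of bits that differ between the two blocks."""
    return bit_errors(block_a, block_b) / (len(block_a) * bits_per_sub_fp)


def is_match(block_a, block_b, threshold=0.35):
    """Declare the blocks similar when the bit error rate is below threshold."""
    return bit_error_rate(block_a, block_b) < threshold


original = [0x00000000] * 256
degraded = [0x0000000F] * 256          # 4 of every 32 bits flipped
rate = bit_error_rate(original, degraded)
```

Here 4 of every 32 bits differ, so the bit error rate is 0.125 and the blocks are declared similar.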

To analyze the choice of this threshold, the fingerprint bits are assumed to be random and i.i.d. (independent and identically distributed). The number of erroneous bits then has a binomial distribution, which can be approximated by a normal distribution. In practice, experiments show that the standard deviation is 3 times higher than the theoretical standard deviation. With a detection threshold of BER (Bit Error Rate) = 0.35, this gives a very low false alarm probability of 3.7 × 10^-20.
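The reasoning can be reproduced numerically. The sketch below assumes a block of 256 × 32 = 8192 bits, a mean BER of 0.5 between unrelated blocks, and a theoretical standard deviation of 1/(2√n) inflated by the experimental factor of 3; the function name and parameter names are illustrative. With these inputs the result should land in the same order of magnitude as the figure quoted above; the exact value depends on details of the approximation that this summary does not spell out.

```python
import math


def false_alarm_probability(threshold_ber, n_bits=8192, std_inflation=3.0):
    """Probability that two unrelated blocks have a BER below the threshold.

    Normal approximation: BER ~ N(0.5, sigma^2) with
    sigma = std_inflation / (2 * sqrt(n_bits)).
    """
    sigma = std_inflation / (2.0 * math.sqrt(n_bits))
    z = (0.5 - threshold_ber) / sigma
    # Lower tail of the normal distribution, expressed with erfc.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


p_false = false_alarm_probability(0.35)
```

Lowering the threshold reduces the false alarm probability further, at the cost of rejecting more heavily degraded but genuine matches.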

 

To show the robustness of the proposed audio fingerprinting scheme, they selected four short audio excerpts. All of the excerpts were subjected to various signal degradations. The BERs between the fingerprint blocks of the original version and of all the degraded versions were then determined for each audio clip. The results showed that almost all the resulting bit error rates are below the threshold of 0.35.

The system is inherently robust against linear speed changes from –2% to +2%. For larger speed changes there are two remedies: enlarging the database with the fingerprints of every song at multiple speeds, and extracting fingerprints at multiple speeds. A new extraction algorithm that is inherently robust against larger speed changes (approx. –5% to +5%) will be presented at ICASSP 2003.

 

A naive approach to identifying a fingerprint block that originates from an unknown audio clip is to find the most similar fingerprint block in the database. In other words, the task is to find the position among the 250 million sub-fingerprints where the bit error rate is minimal, which takes 250 million fingerprint block comparisons. This approach is therefore impractical for large databases.

For “mild” degradations the following strategy is used, based on the assumption that sub-fingerprints frequently contain no errors. If this assumption is valid, it is very likely that at least one sub-fingerprint in the query block has an exact match at the optimal position in the database. The database is therefore indexed on the 32-bit sub-fingerprints: a lookup table (LUT) maps each sub-fingerprint value to its entries in the fingerprint database (song1, song2, …).
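A minimal sketch of such a lookup-table search, assuming the database is a dictionary from song identifier to a list of 32-bit sub-fingerprints. The function names, the data layout and the random test data are illustrative assumptions, not the Philips implementation.

```python
import random
from collections import defaultdict


def build_lut(database):
    """Map each sub-fingerprint value to every (song_id, position) where it occurs."""
    lut = defaultdict(list)
    for song_id, sub_fps in database.items():
        for pos, value in enumerate(sub_fps):
            lut[value].append((song_id, pos))
    return lut


def ber(block_a, block_b):
    """Bit error rate between two equally long lists of 32-bit values."""
    errors = sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))
    return errors / (32 * len(block_a))


def identify(query, database, lut, threshold=0.35):
    """Return (song_id, start_position, ber) of the best candidate, or None."""
    best = None
    checked = set()
    for offset, value in enumerate(query):
        # Every exact hit proposes one alignment of the query block.
        for song_id, pos in lut.get(value, ()):
            start = pos - offset
            if start < 0 or (song_id, start) in checked:
                continue
            checked.add((song_id, start))
            candidate = database[song_id][start:start + len(query)]
            if len(candidate) < len(query):
                continue
            rate = ber(query, candidate)
            if rate < threshold and (best is None or rate < best[2]):
                best = (song_id, start, rate)
    return best


rng = random.Random(0)
database = {name: [rng.getrandbits(32) for _ in range(1000)]
            for name in ("song1", "song2")}
lut = build_lut(database)

# Query: 256 sub-fingerprints from song2 with one bit flipped in each ...
query = [v ^ 1 for v in database["song2"][400:656]]
query[10] ^= 1   # ... except one, which is restored so that it matches exactly
result = identify(query, database, lut)
```

Only the alignments proposed by exact sub-fingerprint hits are compared in full, so the number of block comparisons depends on the number of hits rather than on the size of the database.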

For “severe” degradations another strategy is applied, in which the reliability of the bits is estimated. They propose to estimate and use the probability that a fingerprint bit is received correctly. This makes it possible to generate, for every sub-fingerprint, a list of probable sub-fingerprints.
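One way to turn bit reliabilities into a list of probable sub-fingerprints is to toggle the least reliable bits in every combination. The sketch below assumes a reliability score per bit is already available (the summary does not say how it is estimated), and the choice of three weak bits is arbitrary; the function and parameter names are illustrative.

```python
from itertools import combinations


def probable_sub_fingerprints(value, reliabilities, n_weak=3):
    """List candidates obtained by toggling subsets of the least reliable bits.

    `reliabilities[i]` scores bit i (bit 0 is the least significant bit);
    a lower score means the bit is more likely to be wrong.
    The received value itself is always the first candidate.
    """
    weak = sorted(range(32), key=lambda i: reliabilities[i])[:n_weak]
    candidates = []
    for r in range(n_weak + 1):
        for subset in combinations(weak, r):
            flipped = value
            for bit in subset:
                flipped ^= 1 << bit
            candidates.append(flipped)
    return candidates


scores = [1.0] * 32
scores[0], scores[5], scores[31] = 0.1, 0.2, 0.3   # three doubtful bits
candidates = probable_sub_fingerprints(0x00000000, scores)
```

With three weak bits the list holds 2^3 = 8 candidates, each of which can then be looked up in the database in the same way as an error-free sub-fingerprint.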

 

When the query fingerprint block is matched against the fingerprint database, the sub-fingerprints are subsampled by a factor of two, which saves storage. When the subsampling factor is increased, the relative standard deviation stays the same. This is because consecutive sub-fingerprints overlap heavily: they are very similar and vary slowly in time.

 

The fingerprint server has a master/slave architecture. Its advantages are that it is cost-effective, extendible and easy to upgrade.

-         Master’s functions:

·       Handles communication with the outside world: the Fingerprint API (FAPI) and the control PC.

·       Distributes the fingerprint database over its slaves.

·       Distributes identification requests over its slaves, collects the results and returns the respective song IDs to the FAPI.

-         Slave’s functions:

·       Contains a subset of the fingerprint database.

·       Granularity

-         3 seconds: 25,000 (4 minute) songs. 200 stations in parallel

-         6 seconds: 50,000 (4 minute) songs. 400 stations in parallel
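The division of labour between master and slaves can be sketched as below. This is a toy in-process model under several assumptions: the class and method names are hypothetical, songs are assigned round-robin, each slave does a brute-force search of its own subset, and the network communication with the FAPI and the control PC is left out.

```python
def ber(block_a, block_b):
    """Bit error rate between two equally long lists of 32-bit values."""
    errors = sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))
    return errors / (32 * len(block_a))


class Slave:
    """Holds a subset of the fingerprint database and searches only that subset."""

    def __init__(self):
        self.songs = {}

    def add_song(self, song_id, sub_fps):
        self.songs[song_id] = sub_fps

    def identify(self, query):
        """Return (ber, song_id) of the best local alignment, or None if empty."""
        best = None
        for song_id, sub_fps in self.songs.items():
            for start in range(len(sub_fps) - len(query) + 1):
                rate = ber(query, sub_fps[start:start + len(query)])
                if best is None or rate < best[0]:
                    best = (rate, song_id)
        return best


class Master:
    """Distributes songs and identification requests over the slaves."""

    def __init__(self, slaves):
        self.slaves = slaves
        self._next = 0

    def add_song(self, song_id, sub_fps):
        # Round-robin assignment (an assumption; any balancing rule would do).
        self.slaves[self._next % len(self.slaves)].add_song(song_id, sub_fps)
        self._next += 1

    def identify(self, query, threshold=0.35):
        """Ask every slave, keep the best answer, return its song ID or None."""
        results = [r for r in (s.identify(query) for s in self.slaves) if r]
        if not results:
            return None
        rate, song_id = min(results, key=lambda r: r[0])
        return song_id if rate < threshold else None


master = Master([Slave(), Slave()])
master.add_song("a", [0x00000000] * 20)
master.add_song("b", [0xFFFFFFFF] * 20)
master.add_song("c", [0x0F0F0F0F] * 20)
found = master.identify([0xFFFFFFFF] * 8)
unknown = master.identify([0x0000FFFF] * 8)
```

Because each slave only sees its own subset, capacity can be extended by adding slaves without changing the interface the master offers to the outside world.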

 

From this talk we can conclude that the fingerprint extraction system is very robust and, moreover, that a granularity of only 3 seconds is needed. In addition, the database is searched with a low-complexity search strategy.