Watermarking and Fingerprinting of Multimedia Content
Jaap Haitsma
Philips Research
Imagine the following situation: you're listening to the radio and all
of a sudden you hear the greatest song you have heard in a long time, but you don't
know the song. Wouldn't it be nice if you could push a button on, for example,
your mobile phone and after a few seconds it would tell you the artist and
title of the song? Audio fingerprinting or audio watermarking can make this
possible.
The first part of this talk will give a general overview of the
watermarking and fingerprinting technologies (both for audio and video) that
have been developed by Philips Research over the past six years.
Watermarking is a technology where an extra signal, containing an
identifier, is added to a multimedia signal in such a way that it is
inaudible/invisible, but can still be reliably detected by electronic means.
As opposed to watermarking, fingerprinting does not add anything to the
multimedia signal. It only extracts essential features (the fingerprint) from
the signal and stores these in a database. An unknown multimedia signal can
then be identified by extracting its essential features and subsequently
looking these up in the database.
The second part will give a more detailed technical explanation of
audio fingerprinting, on which the speaker has been working for the past two
years.
Summary of the talk
Gemma Pons
At Philips Research, approximately 10 researchers work on audio/video
watermarking and fingerprinting. The group is named ICEMan. The main projects
they have worked on, or are still working on, are the following:
- DVD video copy protection.
- Broadcast Monitoring.
  · VIVA European Project.
  · Commercial product: WaterCast.
    § Worldwide monitoring service performed by TeleTrax.
- Forensic tracking for Digital Cinema.
  · European Project: ITEA Digital Cinema.
- Electronic Music Distribution
  · Every downloaded file gets a unique watermark to enable tracking.
- Toys
  · By equipping dolls with watermark detectors they can react to videos.
- Currently a “startup company” in research.
- Still more in a research phase.
- Reversible watermarking
- Semi Fragile watermarking
We now concentrate on the Audio Fingerprinting System created by Philips.
An audio fingerprint is a set of features that uniquely identifies a segment
of audio. Conceptually it is the same as a human fingerprint. A fingerprint
system generally consists of two components: database generation, which
extracts fingerprints from known audio and stores them, and audio
identification, which extracts the fingerprint of an unknown clip and compares
it with the fingerprints in the database.
There are many applications we can consider for audio fingerprinting:
- Broadcast Monitoring: monitoring how often which song is broadcast on the radio.
- Audio Recognition on Mobile Phone / CE Devices / PCs: the user has the option to buy the CD, concert tickets, T-shirts, etc.
- File Sharing: filtering out copyrighted material for legal Napster services.
- Transmission Verification
- Organizing Digital Music Libraries: restoring metadata.
How does fingerprinting differ from watermarking? The main advantages of fingerprinting over watermarking are that the content itself is not modified and that it works for existing content. However, a connection to a database is required and the database is more complex.
The main parameters of an audio fingerprint system are:
- Robustness: “Can a song still be identified after (severe) degradation?” In order to achieve high robustness the fingerprint should be based on perceptual features that are invariant (at least to a certain degree) with respect to signal degradations.
- Reliability: “How often is a song falsely identified?”
- Fingerprint Size: “How much storage is needed for a fingerprint?” The fingerprint size determines the memory needed for a fingerprint database server.
- Granularity: “How many seconds of audio are needed to identify a song?” This parameter can depend on the application.
- Search Speed/Scalability: “How long does it take to find a fingerprint in the database?” “What if the database has to contain millions of songs?”
The Fingerprint Extraction Algorithm is based on the following approach:
- Calculate the Fourier transform of every frame
- Subdivide the spectrum into bands
- Calculate a robust property of every band (e.g. energy)
- Apply threshold
First the audio signal is segmented into overlapping frames. The frames have a length of 0.4 seconds, are weighted by a Hanning window and overlap by a factor of 31/32. As a result, a 32-bit sub-fingerprint is extracted every 12 ms. A fingerprint block consists of 256 subsequent sub-fingerprints, corresponding to a granularity of only 3 seconds. The fingerprint of a song is the list of all its sub-fingerprints.
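The framing numbers above can be checked with a little arithmetic. The following is a minimal sketch, not the Philips implementation: it uses the rounded values from this summary, and the function names and the sample rate in the usage note are hypothetical choices made for illustration.

```python
# Values as rounded in this summary (assumptions, not exact system settings).
FRAME_LENGTH_S = 0.4      # frame length in seconds
OVERLAP_DEN = 32          # overlap factor 31/32, so the hop is 1/32 of a frame
SUBFP_PER_BLOCK = 256     # sub-fingerprints per fingerprint block
BITS_PER_SUBFP = 32       # bits per sub-fingerprint


def hop_seconds():
    """Time between the starts of two consecutive frames."""
    return FRAME_LENGTH_S / OVERLAP_DEN


def block_seconds():
    """Time spanned by the 256 frame hops of one fingerprint block."""
    return SUBFP_PER_BLOCK * hop_seconds()


def block_bits():
    """Total number of bits in one fingerprint block."""
    return SUBFP_PER_BLOCK * BITS_PER_SUBFP


def frame_starts(n_samples, sample_rate):
    """Start index (in samples) of every complete frame in the signal."""
    frame_len = int(round(FRAME_LENGTH_S * sample_rate))
    hop = max(1, frame_len // OVERLAP_DEN)
    return list(range(0, n_samples - frame_len + 1, hop))
```

With these rounded inputs the hop is 12.5 ms and a block spans 3.2 seconds, which agrees with the "12 ms" and "3 seconds" quoted above once rounding is taken into account. A block holds 256 x 32 = 8192 bits. Calling `frame_starts(2186, 5000)` with a hypothetical 5 kHz sample rate yields four frames starting 62 samples apart.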
The most important perceptual audio features live in the frequency domain. Therefore a spectral representation is computed by performing a Fourier transform on every frame. Because the Human Auditory System (HAS) is relatively insensitive to phase, only the absolute value of the spectrum is retained.
In order to extract a 32-bit sub-fingerprint value for every frame, 33 logarithmically spaced bands covering the range from 300 to 2000 Hz are selected. The energy of every band is then computed; experiments show that band energy is a very robust property.
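The summary does not spell out how 33 band energies become 32 bits, or what the "apply threshold" step compares against. The sketch below assumes one plausible rule: each bit is the sign of the energy difference between two neighbouring bands, differenced once more against the previous frame, so that 33 bands give 32 bits. This rule, the function names and the exact placement of the band edges are assumptions made for illustration.

```python
def band_edges(n_bands=33, f_lo=300.0, f_hi=2000.0):
    """Return n_bands + 1 logarithmically spaced band edges in Hz."""
    ratio = (f_hi / f_lo) ** (1.0 / n_bands)
    return [f_lo * ratio ** i for i in range(n_bands + 1)]


def sub_fingerprint(prev_energies, energies):
    """Derive one sub-fingerprint from the band energies of two frames.

    Assumed rule: bit m is 1 when the energy difference between bands
    m and m + 1 is larger in the current frame than in the previous one.
    With 33 energies per frame this yields a 32-bit integer, with
    band pair 0 stored in the most significant bit.
    """
    if len(prev_energies) != len(energies):
        raise ValueError("both frames need the same number of bands")
    value = 0
    for m in range(len(energies) - 1):
        current = energies[m] - energies[m + 1]
        previous = prev_energies[m] - prev_energies[m + 1]
        value = (value << 1) | (1 if current - previous > 0 else 0)
    return value
```

Because only the sign of a difference is kept, a uniform gain change applied to both frames leaves every bit unchanged, which is one reason a sign-based threshold is attractive for robustness.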
Ideally, the fingerprint block of an original song and that of a compressed version of the same excerpt would be identical. In practice, the compression causes some of the bits to be retrieved incorrectly. The number of bit errors is used as the similarity measure: two fingerprint blocks are declared similar if the number of bit errors between them is below a certain threshold.
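The comparison described above amounts to a Hamming distance between two blocks, normalised by the block size. A minimal sketch follows, assuming each sub-fingerprint is held as a 32-bit Python integer; the function names are hypothetical, and the default threshold of 0.35 is the value quoted in this summary.

```python
def bit_errors(a, b):
    """Number of differing bits between two sub-fingerprints."""
    return bin(a ^ b).count("1")


def bit_error_rate(block_a, block_b, bits_per_subfp=32):
    """Fraction of differing bits between two equally long blocks."""
    if len(block_a) != len(block_b) or not block_a:
        raise ValueError("blocks must be non-empty and of equal length")
    errors = sum(bit_errors(x, y) for x, y in zip(block_a, block_b))
    return errors / (len(block_a) * bits_per_subfp)


def is_match(block_a, block_b, threshold=0.35):
    """Declare two blocks similar when their BER is below the threshold."""
    return bit_error_rate(block_a, block_b) < threshold
```

For two unrelated blocks of random bits the BER is expected to lie close to 0.5, so a threshold well below 0.5 separates degraded copies from unrelated audio.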
To analyze the choice of this threshold, the fingerprint bits are assumed to be random and i.i.d. (independent and identically distributed). The number of bit errors between two unrelated fingerprint blocks then has a binomial distribution, which can be approximated by a normal distribution. In practice, the experimentally determined standard deviation is 3 times higher than the theoretical one. With a detection threshold of BER (Bit Error Rate) = 0.35, this gives a very low false alarm probability of 3.7 x 10^-20.
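The reasoning above can be turned into a short calculation. Under the i.i.d. model, the BER between two unrelated blocks of n bits has mean 0.5 and standard deviation 1/(2*sqrt(n)); the sketch below multiplies that standard deviation by the factor of 3 and evaluates the normal tail below the threshold. The block size of 8192 bits (256 sub-fingerprints of 32 bits) and the function name are assumptions made for illustration.

```python
import math


def false_alarm_probability(n_bits=8192, threshold=0.35, std_factor=3.0):
    """Normal-approximation probability that two unrelated blocks match.

    n_bits:     bits per fingerprint block (assumed 256 * 32)
    threshold:  BER below which two blocks are declared similar
    std_factor: measured inflation of the theoretical standard deviation
    """
    sigma = std_factor / (2.0 * math.sqrt(n_bits))
    # Distance of the threshold from the mean BER of 0.5, in units of sigma.
    z = (0.5 - threshold) / sigma
    # Lower tail of the standard normal distribution at -z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

With these rounded inputs the result is on the order of 10^-20, the same order of magnitude as the figure quoted above; the exact value is sensitive to the measured standard deviation factor and to rounding of the threshold.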
To show the robustness of the proposed audio fingerprinting scheme, four short audio excerpts were selected. Each excerpt was subjected to a range of signal degradations. The BERs between the fingerprint blocks of the original version and of all the degraded versions were then determined for each audio clip. The results showed that almost all the resulting bit error rates are below the threshold of 0.35.
The system is inherently robust against linear speed changes from -2% to +2%. For larger speed changes there are two possible remedies: storing every song in the database at multiple speeds, and extracting fingerprints from the query at multiple speeds. A new extraction algorithm that is inherently robust against large speed changes (approx. -5% to +5%) will be presented at ICASSP 2003.
A naive approach to identifying a fingerprint block that originates from an
unknown audio clip is to find the most similar fingerprint block in the
database. In other words, to find the position in a database of 250 million
sub-fingerprints where the bit error rate is minimal, which takes 250 million
fingerprint block comparisons. This approach is therefore impractical for large
databases.
For “mild” degradations a more efficient strategy is used, based on the
assumption that sub-fingerprints frequently contain no errors. If this
assumption is valid, it is very likely that at least one sub-fingerprint of the
query block has an exact match at the optimal position in the database. The
database is therefore indexed by 32-bit sub-fingerprint value: a lookup table
(LUT) maps each value to the positions in the fingerprint database (song1,
song2, …) where it occurs.
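The lookup-table strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual server code: the database is held as a Python dict from a song identifier to its list of 32-bit sub-fingerprints, the LUT is a dict from sub-fingerprint value to (song, position) pairs, and all names are hypothetical.

```python
from collections import defaultdict


def build_lut(database):
    """Map every sub-fingerprint value to the (song_id, position) pairs
    where it occurs. `database` maps song_id -> list of sub-fingerprints."""
    lut = defaultdict(list)
    for song_id, subfps in database.items():
        for pos, value in enumerate(subfps):
            lut[value].append((song_id, pos))
    return lut


def bit_error_rate(block_a, block_b, bits_per_subfp=32):
    """Fraction of differing bits between two equally long blocks."""
    errors = sum(bin(x ^ y).count("1") for x, y in zip(block_a, block_b))
    return errors / (len(block_a) * bits_per_subfp)


def identify(query, database, lut, threshold=0.35):
    """Return (song_id, start, ber) for the best candidate, or None.

    Only positions where some query sub-fingerprint matches exactly are
    compared, instead of every position in the database.
    """
    best = None
    checked = set()
    for i, value in enumerate(query):
        for song_id, pos in lut.get(value, ()):
            start = pos - i  # where the query block would begin in the song
            if start < 0 or (song_id, start) in checked:
                continue
            checked.add((song_id, start))
            candidate = database[song_id][start:start + len(query)]
            if len(candidate) < len(query):
                continue
            ber = bit_error_rate(query, candidate)
            if ber < threshold and (best is None or ber < best[2]):
                best = (song_id, start, ber)
    return best
```

The number of full block comparisons now depends on how many exact sub-fingerprint hits the query produces rather than on the size of the database, which is what makes the search practical. If no sub-fingerprint of the query survives without errors, this sketch finds nothing.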
If the degradation is “severe” rather than mild, another strategy is applied,
in which the reliability of the bits is estimated. The proposal is to estimate
and use the probability that a fingerprint bit is received correctly. This
makes it possible to generate, for every sub-fingerprint, a list of probable
sub-fingerprints.
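One way to build such a list of probable sub-fingerprints is sketched below. It assumes that a signed "soft" value is available for every bit before thresholding and that a small magnitude means an unreliable bit; the summary does not say how the reliability is actually estimated, so this measure, the function name and the parameter `n_unreliable` are all assumptions.

```python
from itertools import combinations


def probable_sub_fingerprints(soft_values, n_unreliable=3):
    """List candidate sub-fingerprints for one frame.

    soft_values:  one signed value per bit (bit is 1 when the value > 0),
                  first value stored in the most significant bit
    n_unreliable: how many of the least reliable bits may be flipped

    The hard decision comes first, followed by the variants with one,
    two, ... of the least reliable bits flipped.
    """
    n = len(soft_values)
    hard = 0
    for v in soft_values:
        hard = (hard << 1) | (1 if v > 0 else 0)
    # Bits whose soft value is closest to zero are assumed least reliable.
    weakest = sorted(range(n), key=lambda m: abs(soft_values[m]))[:n_unreliable]
    candidates = []
    for k in range(len(weakest) + 1):
        for subset in combinations(weakest, k):
            value = hard
            for m in subset:
                value ^= 1 << (n - 1 - m)
            candidates.append(value)
    return candidates
```

Flipping the k weakest bits produces 2^k candidates per sub-fingerprint, each of which can be looked up in the database index, so k trades search time against tolerance to bit errors.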
When the match between the query fingerprint block and the fingerprint
database is computed, the sub-fingerprints can be sub-sampled by a factor of
two, which saves storage. Increasing the sub-sampling factor leaves the
relative standard deviation unchanged. This is because, owing to the large
overlap between subsequent frames, subsequent sub-fingerprints are very similar
and vary slowly in time.
The fingerprint server has a Master/Slave architecture. Its advantages are that it is cost-effective, extendible and easy to upgrade.
- Master’s functions:
  · Handles communication with the outside world: the Fingerprint API (FAPI) and the control PC.
  · Distributes the fingerprint database over its slaves.
  · Distributes identification requests over its slaves, collects the results and returns the respective song IDs to the FAPI.
- Slave’s functions:
  · Contains a subset of the fingerprint database.
  · Granularity
    - 3 seconds: 25,000 (4 minute) songs. 200 stations in parallel
    - 6 seconds: 50,000 (4 minute) songs. 400 stations in parallel
From this talk we can conclude that the fingerprint extraction system is
very robust and, moreover, that a granularity of only 3 seconds is needed. In
addition, the database is searched with a low-complexity search strategy.