Google engineers have trained a machine learning algorithm to write
picture captions using the same techniques the company developed for
language translation.
Translating one language into another has always been a difficult
task. But in recent years, Google has transformed this process by
developing machine translation algorithms that have changed the nature
of cross-cultural communication through Google Translate.
Now the company is using the same machine learning technique to
translate pictures into words. The result is a system that automatically
generates picture captions that accurately describe the content of
images. That’s something that will be useful for search engines, for
automated publishing and for helping the visually impaired navigate the
web and, indeed, the wider world.
The conventional approach to language translation is an iterative
process that starts by translating words individually and then
reordering the words and phrases to improve the translation. But in
recent years, Google has worked out how to use its massive search
database to translate text in an entirely different way.
The approach is essentially to count how often words appear next to,
or close to, other words and then define them in an abstract vector
space in relation to each other. This allows every word to be
represented by a vector in this space and sentences to be represented by
combinations of vectors.
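To get a feel for what that means, here is a toy sketch of the general idea (not Google's actual pipeline): count which words occur near which others in a tiny corpus, then factorise the count matrix so that every word ends up with a short vector.

```python
# Toy illustration of the co-occurrence idea described above -- not Google's
# actual pipeline, just a sketch on a four-sentence corpus.
import numpy as np

corpus = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
    "the man walks the dog",
    "the woman walks the dog",
]

tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sent in tokens for word in sent})
index = {word: i for i, word in enumerate(vocab)}

# Count how often each word appears within two positions of another word.
counts = np.zeros((len(vocab), len(vocab)))
window = 2
for sent in tokens:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[index[word], index[sent[j]]] += 1

# Factorise the count matrix; each row of the result is a short vector for a
# word, and words that appear in similar contexts end up close together.
U, S, _ = np.linalg.svd(counts, full_matrices=False)
word_vectors = U[:, :3] * S[:3]

for word in ("king", "queen", "man", "woman"):
    print(word, np.round(word_vectors[index[word]], 2))
```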
Google goes on to make an important assumption. This is that specific
words have the same relationship to each other regardless of the
language. For example, the vector “king - man + woman = queen” should
hold true in all languages.
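That relation can be checked with publicly available word vectors. The snippet below uses the gensim library and pretrained GloVe embeddings purely as an illustration; Google's own embeddings are not public.

```python
# Checking the analogy with publicly available GloVe vectors via gensim.
# Illustration only -- these are not the embeddings Google uses internally.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # downloads ~65 MB on first use

# most_similar() adds the "positive" vectors, subtracts the "negative" ones,
# and returns the words nearest to the resulting point in the vector space.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# With these vectors, "queen" typically appears at or near the top of the list.
```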
That makes language translation a problem of vector space mathematics.
Google Translate approaches it by turning a sentence into a vector and
then using that vector to generate the equivalent sentence in another
language.
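In machine learning terms this is a sequence-to-sequence model: one recurrent network squeezes the source sentence into a vector, and another expands that vector into the target sentence. The minimal, untrained sketch below uses PyTorch, chosen here only for illustration; the production systems the article describes are far larger and trained on enormous corpora.

```python
# Minimal, untrained sketch of the sentence -> vector -> sentence idea.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1000, 64, 128

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)

    def forward(self, src):                        # src: (batch, src_len)
        _, (h, c) = self.rnn(self.embed(src))
        return h, c                                # the "sentence vector"

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, tgt, state):                 # tgt: (batch, tgt_len)
        output, state = self.rnn(self.embed(tgt), state)
        return self.out(output), state             # scores over target words

# One forward pass with random token ids, just to show how the pieces connect.
src = torch.randint(0, SRC_VOCAB, (2, 7))          # two source sentences
tgt = torch.randint(0, TGT_VOCAB, (2, 5))          # two target sentences
state = Encoder()(src)
scores, _ = Decoder()(tgt, state)
print(scores.shape)                                # torch.Size([2, 5, 1000])
```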
Now Oriol Vinyals and pals at Google are using a similar approach to
translate images into words. Their technique is to use a neural network
to study a dataset of 100,000 images and their captions and so learn how
to classify the content of images.
But instead of producing a set of words that describe the image,
their algorithm produces a vector that represents the relationship
between the words. This vector can then be plugged into Google’s
existing translation algorithm to produce a caption in English, or
indeed in any other language. In effect, Google’s machine learning
approach has learnt to “translate” images into words.
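Structurally, that amounts to swapping the sentence encoder for an image encoder: a convolutional network produces the vector and a recurrent network decodes it into words. The sketch below wires those pieces together, with an off-the-shelf ResNet standing in for the network used in the paper; it is untrained and only shows how the parts fit.

```python
# Sketch of the image -> vector -> caption idea: a CNN plays the role of the
# encoder and an LSTM generates the caption word by word. resnet18 stands in
# for the CNN used in the paper; the model is untrained and illustrative only.
import torch
import torch.nn as nn
from torchvision import models

VOCAB, EMB, HID = 10000, 256, 512

class CaptionSketch(nn.Module):
    def __init__(self):
        super().__init__()
        cnn = models.resnet18(weights=None)              # image encoder
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])
        self.img_to_emb = nn.Linear(512, EMB)            # image feature -> "word slot"
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, images, captions):
        feats = self.cnn(images).flatten(1)              # (batch, 512)
        img_token = self.img_to_emb(feats).unsqueeze(1)  # treated like a first "word"
        words = self.embed(captions)                     # (batch, len, EMB)
        inputs = torch.cat([img_token, words], dim=1)    # image feature starts the sequence
        hidden, _ = self.rnn(inputs)
        return self.out(hidden)                          # word scores at each step

model = CaptionSketch()
images = torch.randn(2, 3, 224, 224)                     # two fake RGB images
captions = torch.randint(0, VOCAB, (2, 8))               # two fake 8-word captions
print(model(images, captions).shape)                     # torch.Size([2, 9, 10000])
```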
To test the efficacy of this approach, they used human evaluators
recruited from Amazon’s Mechanical Turk to rate captions generated
automatically in this way along with those generated by other automated
approaches and by humans.
The results show that the new system, which Google calls Neural Image
Caption, fares well. Using a well-known dataset of images called
PASCAL, Neural Image Caption clearly outperformed other automated
approaches. “NIC yielded a BLEU score of 59, to be compared to the
current state-of-the-art of 25, while human performance reaches 69,”
say Vinyals and co.
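For context, BLEU measures how much the n-grams of a generated caption overlap with those of human reference captions. The toy snippet below shows the mechanics using NLTK; the figures quoted above come from the paper's own evaluation, not from anything like this.

```python
# How a BLEU score is computed in principle: n-gram overlap between a
# candidate caption and human reference captions. NLTK is used purely
# for illustration of the metric.
from nltk.translate.bleu_score import sentence_bleu

references = [
    "a dog runs across the grass".split(),
    "a dog is running on the lawn".split(),
]
candidate = "a dog runs through the grass".split()

# weights=(1, 0, 0, 0) scores single-word (unigram) overlap only, i.e. BLEU-1.
score = sentence_bleu(references, candidate, weights=(1, 0, 0, 0))
print(round(100 * score, 1))
```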
That’s not bad and the approach looks set to get better as the size
of the training datasets increases. “It is clear from these experiments
that, as the size of the available datasets for image description
increases, so will the performance of approaches like NIC,” say the
Google team.
Clearly, this is yet another task for which the days of human supremacy over machines are numbered.
Ref: arxiv.org/abs/1411.4555 Show and Tell: A Neural Image Caption Generator