
How to Automatically Generate Textual Descriptions for Photographs with Deep Learning

Captioning an image involves generating a human-readable textual description of a given image, such as a photograph.

It is an easy problem for a human, but very challenging for a machine as it involves both understanding the content of an image and how to translate this understanding into natural language.

Recently, deep learning methods have displaced classical methods and are achieving state-of-the-art results for the problem of automatically generating descriptions, called “captions,” for images.

In this post, you will discover how deep neural network models can be used to automatically generate descriptions for images, such as photographs.

After completing this post, you will know:

  • About the challenge of generating textual descriptions for images and the need to combine breakthroughs from computer vision and natural language processing.
  • About the elements that comprise a neural captioning model, namely the feature extractor and language model.
  • How the elements of the model can be arranged into an Encoder-Decoder, possibly with the use of an attention mechanism.

Let’s get started.

Overview

This post is divided into 3 parts; they are:

  1. Describing an Image with Text
  2. Neural Captioning Model
  3. Encoder-Decoder Architecture


Describing an Image with Text

Describing an image is the problem of generating a human-readable textual description of an image, such as a photograph of an object or scene.

The problem is sometimes called “automatic image annotation” or “image tagging.”

It is an easy problem for a human, but very challenging for a machine.

A quick glance at an image is sufficient for a human to point out and describe an immense amount of details about the visual scene. However, this remarkable ability has proven to be an elusive task for our visual recognition models.

A solution requires both that the content of the image be understood and translated into words, and that the words be strung together in a comprehensible way. It combines both computer vision and natural language processing and marks a truly challenging problem in broader artificial intelligence.

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing.

Further, the problems can range in difficulty; let’s look at three different variations on the problem with examples.

1. Classify Image

Assign an image a class label from one of hundreds or thousands of known classes.

Example of classifying images into known classes.
Taken from “Detecting avocados to zucchinis: what have we done, and where are we going?”, 2013.

2. Describe Image

Generate a textual description of the contents of the image.

Example of captions generated for photographs.
Taken from “Long-term recurrent convolutional networks for visual recognition and description”, 2015.

3. Annotate Image

Generate textual descriptions for specific regions on the image.

Example of annotating regions of an image with descriptions.
Taken from “Deep Visual-Semantic Alignments for Generating Image Descriptions”, 2015.

The general problem can also be extended to describe images over time in video.

In this post, we will focus our attention on describing images, which we will refer to as ‘image captioning.’

Neural Captioning Model

Neural network models have come to dominate the field of automatic caption generation; this is primarily because the methods are demonstrating state-of-the-art results.

Prior to end-to-end neural network models, the two dominant methods for generating image captions were template-based methods and nearest-neighbor-based methods that retrieve and modify existing captions.

Prior to the use of neural networks for generating captions, two main approaches were dominant. The first involved generating caption templates which were filled in based on the results of object detections and attribute discovery. The second approach was based on first retrieving similar captioned images from a large database then modifying these retrieved captions to fit the query. […] Both of these approaches have since fallen out of favour to the now dominant neural network methods.

Neural network models for captioning involve two main elements:

  1. Feature Extraction.
  2. Language Model.

Feature Extraction Model

The feature extraction model is a neural network that, given an image, is able to extract the salient features, often in the form of a fixed-length vector.

The extracted features are an internal representation of the image, not something directly intelligible.

A deep convolutional neural network, or CNN, is used as the feature extraction submodel. This network can be trained directly on the images in the image captioning dataset.

Alternatively, a pre-trained model, such as a state-of-the-art model used for image classification, can be used, or some hybrid where a pre-trained model is used and fine-tuned on the problem.

It is popular to use top-performing models trained on the ImageNet dataset for the ILSVRC challenge, such as the Oxford Visual Geometry Group model, called VGG for short.

[…] we explored several techniques to deal with overfitting. The most obvious way to not overfit is to initialize the weights of the CNN component of our system to a pretrained model (e.g., on ImageNet)
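To make this concrete, below is a minimal sketch of a feature extractor built on the pre-trained VGG16 model in Keras. The file name ‘example.jpg’ is a placeholder, and depending on your Keras version the imports may live under tensorflow.keras instead.

```python
# Minimal sketch: use a pre-trained VGG16 model as the feature extraction submodel.
# The final classification layer is dropped so the model outputs a fixed-length
# 4,096-element feature vector for each photograph.
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Model

# load VGG16 with ImageNet weights and keep the second-to-last (fc2) layer as output
base = VGG16(weights='imagenet')
extractor = Model(inputs=base.input, outputs=base.layers[-2].output)

# load and prepare a photograph (VGG16 expects 224x224 pixel input)
image = load_img('example.jpg', target_size=(224, 224))
image = img_to_array(image)
image = image.reshape((1,) + image.shape)
image = preprocess_input(image)

# extract the fixed-length feature vector, shape (1, 4096)
features = extractor.predict(image)
print(features.shape)
```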

Feature Extractor

Language Model

Generally, a language model predicts the probability of the next word in the sequence given the words already present in the sequence.

For image captioning, the language model is a neural network that, given the features extracted from the image, can predict the sequence of words in the description, building up the description conditioned on the words that have already been generated.

It is popular to use a recurrent neural network, such as a Long Short-Term Memory network, or LSTM, as the language model. Each output time step generates a new word in the sequence.

Each word that is generated is then encoded using a word embedding (such as word2vec) and passed as input to the decoder for generating the subsequent word.
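As a rough sketch of the language-model side in Keras (not the exact model from any particular paper), an embedding layer maps word indices to vectors, an LSTM summarizes the words generated so far, and a softmax layer predicts the next word. The vocabulary size and maximum caption length are illustrative and would come from the prepared caption data; the image features are omitted here and are wired in under the Encoder-Decoder architecture below.

```python
# Minimal sketch of a word-level language model for captioning in Keras.
# vocab_size and max_length are illustrative; in practice they are derived
# from the caption dataset. Image features are not used in this fragment.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 7579   # number of distinct words in the caption vocabulary
max_length = 34     # maximum caption length in words

language_model = Sequential()
# map each word index to a learned 256-dimensional embedding vector
language_model.add(Embedding(vocab_size, 256, input_length=max_length, mask_zero=True))
# the LSTM reads the words generated so far and summarizes them in its state
language_model.add(LSTM(256))
# predict a probability distribution over the vocabulary for the next word
language_model.add(Dense(vocab_size, activation='softmax'))
language_model.summary()
```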

Rather than greedily taking the most likely word at each step, an improvement involves keeping the probability distribution over the vocabulary at each step and searching it to generate multiple candidate descriptions. These candidates can then be scored and ranked by likelihood. It is common to use a beam search for this, as sketched below.
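The sketch shows the bookkeeping behind a simple beam search in plain Python; predict_next is a hypothetical function returning (word, probability) pairs for the next word given the sequence so far, standing in for the trained language model.

```python
# Illustrative beam search over a hypothetical next-word predictor.
from math import log

def beam_search(predict_next, start_token, end_token, beam_width=3, max_length=20):
    # each candidate is a (sequence, cumulative log-probability) pair
    beams = [([start_token], 0.0)]
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            # finished captions are carried forward unchanged
            if seq[-1] == end_token:
                candidates.append((seq, score))
                continue
            # expand the candidate with every possible next word
            for word, prob in predict_next(seq):
                candidates.append((seq + [word], score + log(prob)))
        # keep only the beam_width most likely candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # the most likely complete sequence
```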

The language model can be trained standalone using pre-computed features extracted from the image dataset; it can be trained jointly with the feature extraction network, or some combination.

Language Model

Encoder-Decoder Architecture

A popular way to structure the sub-models is to use an Encoder-Decoder architecture where both models are trained jointly.

[the model] is based on a convolution neural network that encodes an image into a compact representation, followed by a recurrent neural network that generates a corresponding sentence. The model is trained to maximize the likelihood of the sentence given the image.

This is an architecture developed for machine translation where an input sequence, say in French, is encoded as a fixed-length vector by an encoder network. A separate decoder network then reads the encoding and generates an output sequence in the new language, say English.

A benefit of this approach, in addition to its impressive skill, is that a single end-to-end model can be trained on the problem.

When adapted for image captioning, the encoder network is a deep convolutional neural network, and the decoder network is a stack of LSTM layers.

[in machine translation] An “encoder” RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a “decoder” RNN that generates the target sentence. Here, we propose to follow this elegant recipe, replacing the encoder RNN by a deep convolution neural network (CNN).
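One common way to wire the two sub-models together in Keras is the “merge” formulation sketched below, where a pre-computed image feature vector and the partial caption are projected into the same space, combined, and used to predict the next word. The layer sizes, vocab_size, and max_length are illustrative, and other formulations (such as feeding the image encoding in as the initial state of the LSTM, as in the quoted recipe) are also used.

```python
# Minimal sketch of an encoder-decoder captioning model in Keras ("merge" style).
# The encoder output is a pre-computed 4,096-element VGG feature vector; the
# decoder reads the caption generated so far. Sizes are illustrative.
from keras.models import Model
from keras.layers import Input, Dense, Embedding, LSTM, add

vocab_size = 7579
max_length = 34

# image branch: project the CNN feature vector down to 256 dimensions
image_input = Input(shape=(4096,))
image_dense = Dense(256, activation='relu')(image_input)

# caption branch: embed and summarize the words generated so far
caption_input = Input(shape=(max_length,))
caption_embed = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_lstm = LSTM(256)(caption_embed)

# merge the two representations and predict the next word in the caption
merged = add([image_dense, caption_lstm])
hidden = Dense(256, activation='relu')(merged)
output = Dense(vocab_size, activation='softmax')(hidden)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
```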

Example of the CNN and LSTM Architecture.
Taken from “Show and Tell: A Neural Image Caption Generator”, 2015.

Captioning Model with Attention

A limitation of the Encoder-Decoder architecture is that a single fixed-length representation is used to hold the extracted features.

This was addressed in machine translation through the development of attention across a richer encoding, allowing the decoder to learn where to place attention as each word in the translation is generated.

The approach of attention has also been used to improve the performance of the Encoder-Decoder architecture for image captioning by allowing the decoder to learn where to put attention in the image when generating each word in the description.

Encouraged by recent advances in caption generation and inspired by recent success in employing attention in machine translation and object recognition we investigate models that can attend to salient part of an image while generating its caption.
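Below is a minimal numpy sketch of the soft-attention step. It assumes the CNN provides one feature vector per image region (for example, a 14x14 grid of 512-dimensional vectors from a convolutional layer) and that the learned projection parameters W_feat, W_state, and v already exist; it follows the general additive-attention recipe rather than the exact equations of any one paper.

```python
# Illustrative soft attention over regional image features (numpy only).
import numpy as np

def soft_attention(region_features, decoder_state, W_feat, W_state, v):
    # region_features: (num_regions, feature_dim), one vector per image region
    # decoder_state:   (state_dim,), the decoder's previous hidden state
    # W_feat, W_state, v: learned projection parameters (plain arrays here)

    # score each region against the current decoder state (additive attention)
    scores = np.tanh(region_features @ W_feat + decoder_state @ W_state) @ v
    # normalize the scores into attention weights that sum to one (softmax)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # the context vector is the attention-weighted sum of the regional features
    context = weights @ region_features
    return context, weights
```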

A benefit of this approach is that it is possible to visualize exactly where attention is placed while generating each word in a description.

We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence.

This is easiest to understand with an example; see below.

Example of image captioning with attention.
Taken from “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, 2015.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

  • “Detecting avocados to zucchinis: what have we done, and where are we going?”, 2013.
  • “Long-term recurrent convolutional networks for visual recognition and description”, 2015.
  • “Deep Visual-Semantic Alignments for Generating Image Descriptions”, 2015.
  • “Show and Tell: A Neural Image Caption Generator”, 2015.
  • “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, 2015.

Summary

In this post, you discovered how deep neural network models can be used to automatically generate descriptions for images, such as photographs.

Specifically, you learned:

  • About the challenge of generating textual descriptions for images and the need to combine breakthroughs from computer vision and natural language processing.
  • About the elements that comprise a neural captioning model, namely the feature extractor and language model.
  • How the elements of the model can be arranged into an Encoder-Decoder, possibly with the use of an attention mechanism.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



