How to Automatically Generate Textual Descriptions for Photographs with Deep Learning
Captioning an image involves generating a human-readable textual description given an image, such as a photograph.
It is an easy problem for a human, but very challenging for a machine as it involves both understanding the content of an image and how to translate this understanding into natural language.
Recently, deep learning methods have displaced classical methods and are achieving state-of-the-art results for the problem of automatically generating descriptions, called “captions,” for images.
In this post, you will discover how deep neural network models can be used to automatically generate descriptions for images, such as photographs.
After completing this post, you will know:
- About the challenge of generating textual descriptions for images and the need to combine breakthroughs from computer vision and natural language processing.
- About the elements that comprise a neural captioning model, namely the feature extractor and the language model.
- How the elements of the model can be arranged into an Encoder-Decoder, possibly with the use of an attention mechanism.
Let’s get started.
Overview
This post is divided into 3 parts; they are:
- Describing an Image with Text
- Neural Captioning Model
- Encoder-Decoder Architecture
Describing an Image with Text
Describing an image is the problem of generating a human-readable textual description of an image, such as a photograph of an object or scene.
The problem is sometimes called “automatic image annotation” or “image tagging.”
It is an easy problem for a human, but very challenging for a machine.
A quick glance at an image is sufficient for a human to point out and describe an immense amount of details about the visual scene. However, this remarkable ability has proven to be an elusive task for our visual recognition models
A solution requires both that the content of the image be understood and translated into words, and that those words be strung together into a comprehensible description. It combines computer vision and natural language processing and marks a truly challenging problem in broader artificial intelligence.
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing.
Further, the problems can range in difficulty; let’s look at three different variations on the problem with examples.
1. Classify Image
Assign an image a class label from one of hundreds or thousands of known classes.
2. Describe Image
Generate a textual description of the contents of the image.
3. Annotate Image
Generate textual descriptions for specific regions on the image.
The general problem can also be extended to describe images over time in video.
In this post, we will focus our attention on describing images, which we will refer to as ‘image captioning.’
Neural Captioning Model
Neural network models have come to dominate the field of automatic caption generation; this is primarily because the methods are demonstrating state-of-the-art results.
Prior to end-to-end neural network models, the two dominant methods for generating image captions were template-based methods and nearest-neighbor-based methods that retrieved and modified existing captions.
Prior to the use of neural networks for generating captions, two main approaches were dominant. The first involved generating caption templates which were filled in based on the results of object detections and attribute discovery. The second approach was based on first retrieving similar captioned images from a large database then modifying these retrieved captions to fit the query. […] Both of these approaches have since fallen out of favour to the now dominant neural network methods.
Neural network models for captioning involve two main elements:
- Feature Extraction.
- Language Model.
Feature Extraction Model
The feature extraction model is a neural network that, given an image, is able to extract the salient features, often in the form of a fixed-length vector.
The extracted features are an internal representation of the image, not something directly intelligible.
A deep convolutional neural network, or CNN, is used as the feature extraction submodel. This network can be trained directly on the images in the image captioning dataset.
Alternatively, a pre-trained model, such as a state-of-the-art model used for image classification, can be used, or some hybrid where a pre-trained model is used and fine-tuned on the problem.
It is popular to use top-performing models on the ImageNet dataset developed for the ILSVRC challenge, such as the Oxford Visual Geometry Group model, called VGG for short.
[…] we explored several techniques to deal with overfitting. The most obvious way to not overfit is to initialize the weights of the CNN component of our system to a pretrained model (e.g., on ImageNet)
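The idea of mapping an image of any size to a fixed-length vector can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two filters are hand-written and hypothetical (a real CNN such as VGG learns many filters across many layers), and global average pooling is used here as a simple way to get a fixed-length output, which is not how every captioning model does it.

```python
# Toy feature extractor: convolution + ReLU + global average pooling.
# Pure Python with hand-written filters; a real system would use a trained CNN.

def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1)."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out

def extract_features(image, kernels):
    """Return one number per kernel: the mean of its ReLU-activated response map."""
    features = []
    for kernel in kernels:
        response = conv2d_valid(image, kernel)
        activated = [max(0.0, v) for row in response for v in row]
        features.append(sum(activated) / len(activated))
    return features

# Two hypothetical hand-written 2x2 filters (a trained CNN learns its own).
KERNELS = [
    [[1, -1], [1, -1]],   # responds where intensity drops from left to right
    [[1, 1], [-1, -1]],   # responds where intensity drops from top to bottom
]

small = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]   # 3x3 toy "image"
large = [[1, 1, 0, 0, 0]] * 6               # 6x5 toy "image"

small_vec = extract_features(small, KERNELS)
large_vec = extract_features(large, KERNELS)
print(len(small_vec), len(large_vec))
```

Both toy images produce a vector with one entry per filter, regardless of their size. The numbers themselves are not directly intelligible, which matches the description of extracted features as an internal representation.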
Language Model
Generally, a language model predicts the probability of the next word in the sequence given the words already present in the sequence.
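To make the idea of "probability of the next word" concrete, here is a minimal sketch of a count-based bigram model. It is a deliberate simplification: it conditions only on the single previous word, whereas a neural language model conditions on the whole sequence so far. The corpus and the function names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Turn the follow-up counts for `prev` into probabilities."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}  # word never seen as a predecessor
    return {word: n / total for word, n in counts[prev].items()}

# Hypothetical toy corpus of captions.
CORPUS = [
    "a dog runs on the grass",
    "a dog sits on the sofa",
    "a cat sits on the sofa",
]

bigram_counts = train_bigram_model(CORPUS)
dog_probs = next_word_probs(bigram_counts, "dog")
print(dog_probs)
```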
For image captioning, the language model is a neural network that, given the features extracted by the feature extraction network, predicts the sequence of words in the description, building up the description conditional on the words that have already been generated.
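One common way to train such a model is to expand each image and caption into several samples, each asking the model to predict the next word from the image features plus the words so far. The sketch below shows that expansion only. The `startseq` and `endseq` marker tokens and the feature values are assumptions chosen for illustration, not something prescribed by the text above.

```python
def make_training_samples(image_features, caption,
                          start_token="startseq", end_token="endseq"):
    """Expand one (image, caption) pair into (features, prefix, next_word) samples."""
    words = [start_token] + caption.lower().split() + [end_token]
    samples = []
    for i in range(1, len(words)):
        prefix = words[:i]   # words generated so far
        target = words[i]    # word the model should predict next
        samples.append((image_features, prefix, target))
    return samples

features = [0.12, 0.87, 0.05]  # hypothetical feature vector for one photo
samples = make_training_samples(features, "dog on grass")
for _, prefix, target in samples:
    print(" ".join(prefix), "->", target)
```

A three-word caption yields four samples, the last of which teaches the model when to stop.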
It is popular to use a recurrent neural network, such as a Long Short-Term Memory network, or LSTM, as the language model. Each output time step generates a new word in the sequence.
Each word that is generated is then encoded using a word embedding (such as word2vec) and passed as input to the decoder for generating the subsequent word.
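The feedback loop described above, where each generated word becomes input for the next step, can be sketched as a simple greedy decoder. Here `toy_predict_next` is a hypothetical stand-in for a trained model: it ignores the image and looks only at the last word, and any word embedding lookup is assumed to happen inside the real model rather than in this loop.

```python
def generate_caption(predict_next, image_features, max_length=10,
                     start_token="startseq", end_token="endseq"):
    """Greedy decoding: pick the most probable word at each step and feed it back."""
    words = [start_token]
    for _ in range(max_length):
        probs = predict_next(image_features, words)
        next_word = max(probs, key=probs.get)
        if next_word == end_token:
            break
        words.append(next_word)
    return words[1:]  # drop the start marker

def toy_predict_next(image_features, words):
    """Hypothetical stand-in for a trained model; uses only the last word."""
    table = {
        "startseq": {"dog": 0.7, "cat": 0.3},
        "dog": {"on": 0.6, "endseq": 0.4},
        "cat": {"on": 0.5, "endseq": 0.5},
        "on": {"grass": 0.8, "sofa": 0.2},
        "grass": {"endseq": 1.0},
        "sofa": {"endseq": 1.0},
    }
    return table[words[-1]]

caption = generate_caption(toy_predict_next, image_features=None)
print(" ".join(caption))
```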
An improvement to the model involves gathering the probability distribution of words across the vocabulary for the output sequence and searching it to generate multiple possible descriptions. These descriptions can be scored and ranked by likelihood. It is common to use a Beam Search for this search.
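A minimal beam search over next-word probabilities might look like the sketch below. It keeps the `beam_width` most likely partial captions at each step and scores them by summed log probability. The toy model is hypothetical and is built so that the most probable first word does not lead to the most probable caption. No length normalization is applied, which real systems often add.

```python
import math

def beam_search(predict_next, image_features, beam_width=2, max_length=10,
                start_token="startseq", end_token="endseq"):
    """Return up to `beam_width` sequences as (words, log_prob), best first."""
    beams = [([start_token], 0.0)]
    for _ in range(max_length):
        candidates = []
        for words, score in beams:
            if words[-1] == end_token:
                candidates.append((words, score))  # finished: carry forward unchanged
                continue
            for word, p in predict_next(image_features, words).items():
                if p > 0:
                    candidates.append((words + [word], score + math.log(p)))
        # Keep only the most likely candidates for the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(words[-1] == end_token for words, _ in beams):
            break
    return beams

def toy_predict_next(image_features, words):
    """Hypothetical stand-in for a trained model; uses only the last word."""
    table = {
        "startseq": {"a": 0.6, "the": 0.4},
        "a": {"dog": 0.55, "cat": 0.45},
        "the": {"dog": 0.9, "cat": 0.1},
        "dog": {"endseq": 1.0},
        "cat": {"endseq": 1.0},
    }
    return table[words[-1]]

beams = beam_search(toy_predict_next, image_features=None, beam_width=2)
for words, log_prob in beams:
    print(" ".join(words[1:-1]), round(math.exp(log_prob), 2))
```

In this toy setup, always choosing the single most probable word would start with "a", while the beam also keeps "the" alive and ends up ranking "the dog" first.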
The language model can be trained standalone using pre-computed features extracted from the image dataset, trained jointly with the feature extraction network, or trained using some combination of the two.
Encoder-Decoder Architecture
A popular way to structure the sub-models is to use an Encoder-Decoder architecture where both models are trained jointly.
[the model] is based on a convolution neural network that encodes an image into a compact representation, followed by a recurrent neural network that generates a corresponding sentence. The model is trained to maximize the likelihood of the sentence given the image.
This is an architecture developed for machine translation where an input sequence, say in French, is encoded as a fixed-length vector by an encoder network. A separate decoder network then reads the encoding and generates an output sequence in the new language, say English.
In addition to its impressive skill, a benefit of this approach is that a single end-to-end model can be trained on the problem.
When adapted for image captioning, the encoder network is a deep convolutional neural network, and the decoder network is a stack of LSTM layers.
[in machine translation] An “encoder” RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn in used as the initial hidden state of a “decoder” RNN that generates the target sentence. Here, we propose to follow this elegant recipe, replacing the encoder RNN by a deep convolution neural network (CNN).
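The training objective quoted earlier in this section, maximizing the likelihood of the sentence given the image, can be written as a sum of per-word log probabilities. The sketch below only evaluates that quantity for a fixed toy model. Here `toy_predict_next`, the marker tokens, and the probabilities are all hypothetical, and actual training would adjust the model's weights to increase this value over a dataset.

```python
import math

def caption_log_likelihood(predict_next, image_features, caption_words,
                           start_token="startseq", end_token="endseq"):
    """Sum log p(word_t | image, earlier words) over the whole caption."""
    words = [start_token] + caption_words + [end_token]
    total = 0.0
    for t in range(1, len(words)):
        probs = predict_next(image_features, words[:t])
        total += math.log(probs[words[t]])
    return total

def toy_predict_next(image_features, words):
    """Hypothetical stand-in for a trained decoder; uses only the last word."""
    table = {
        "startseq": {"dog": 0.7, "cat": 0.3},
        "dog": {"runs": 0.5, "endseq": 0.5},
        "cat": {"sits": 0.4, "endseq": 0.6},
        "runs": {"endseq": 1.0},
        "sits": {"endseq": 1.0},
    }
    return table[words[-1]]

dog_ll = caption_log_likelihood(toy_predict_next, None, ["dog", "runs"])
cat_ll = caption_log_likelihood(toy_predict_next, None, ["cat"])
print(round(math.exp(dog_ll), 2), round(math.exp(cat_ll), 2))
```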
Captioning Model with Attention
A limitation of the Encoder-Decoder architecture is that a single fixed-length representation is used to hold the extracted features.
This was addressed in machine translation through the development of attention across a richer encoding, allowing the decoder to learn where to place attention as each word in the translation is generated.
The approach of attention has also been used to improve the performance of the Encoder-Decoder architecture for image captioning by allowing the decoder to learn where to put attention in the image when generating each word in the description.
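At its core, one step of this kind of attention scores each image region against the decoder's current state, normalizes the scores into weights, and builds a weighted average of the region features. The sketch below uses a plain dot product as the scoring function, which is a simplifying assumption. Published attention models typically learn the scoring function, and the region features and decoder state here are made-up numbers.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_features, decoder_state):
    """Weight each image region by its dot-product match with the decoder state."""
    scores = [sum(f * h for f, h in zip(region, decoder_state))
              for region in region_features]
    weights = softmax(scores)
    size = len(region_features[0])
    # Context vector: weighted average of the region feature vectors.
    context = [sum(w * region[i] for w, region in zip(weights, region_features))
               for i in range(size)]
    return weights, context

# Hypothetical features for three image regions.
REGIONS = [
    [1.0, 0.0],  # e.g. a region containing the dog
    [0.0, 1.0],  # e.g. a region containing grass
    [0.0, 0.0],  # e.g. empty background
]

weights, context = attend(REGIONS, decoder_state=[2.0, 0.0])
print([round(w, 3) for w in weights])
```

The weights are what gets visualized when showing where the model "looks" for each word: one weight per region, recomputed at every output step.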
Encouraged by recent advances in caption generation and inspired by recent success in employing attention in machine translation and object recognition we investigate models that can attend to salient part of an image while generating its caption.
A benefit of this approach is that it is possible to visualize exactly where attention is placed while generating each word in a description.
We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence.
This is easiest to understand with an example; see below.
Summary
In this post, you discovered how deep neural network models can be used to automatically generate descriptions for images, such as photographs.
Specifically, you learned:
- About the challenge of generating textual descriptions for images and the need to combine breakthroughs from computer vision and natural language processing.
- About the elements that comprise a neural captioning model, namely the feature extractor and the language model.
- How the elements of the model can be arranged into an Encoder-Decoder, possibly with the use of an attention mechanism.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.