
Hardware decoding of video streams with CUDA and FFMPEG

This article describes how to obtain an H.264 RTSP stream (or a local video file) with FFMPEG and decode it in hardware with CUDA. Tutorials in other blogs either give only pseudocode, which is very vague, or are built around D3D display, which makes their source code so complicated that the core CUDA decoding framework is hard to see. This article strips out those non-core parts; video display is done through OpenCV by showing a cv::Mat.

Of course, this post also draws on the work in other blogs. Original material on CSDN is hard to find and most posts are reposts, so I encourage everyone to contribute actively.

This article is organized in two parts:
1. Core principles and framework of CUDA hardware decoding;
2. Implementation of the core decoding functions.

1 Core principles and framework of CUDA hardware decoding


Anyone who has done FFMPEG decoding is familiar with avcodec_decode_video2(), which decodes an AVPacket taken from the video stream into an AVFrame containing the decoded data. The core idea of CUDA hardware decoding is to replace this function: the CUDA decoding interfaces are driven through callbacks, moving the decode work from the CPU to the GPU.
Another blog post provides a good basic framework, and this article borrows from it. That post replaced the original VideoSource so that the video stream is obtained through ffmpeg, while the CUDA decoding side of the framework is organized as described below.
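As a rough sketch of that idea (not the referenced blog's actual code), the FFmpeg side demuxes H.264 packets and pushes them into the NVCUVID parser with cuvidParseVideoData, which in turn drives the callbacks described in section 1.2. Stream selection, error handling and the h264_mp4toannexb bitstream filter (needed for MP4 input) are omitted here, and a reasonably recent FFmpeg is assumed:

extern "C" {
#include <libavformat/avformat.h>
}
#include <nvcuvid.h>
#include <cstring>

// Demux H.264 packets with FFmpeg and feed them to the NVCUVID parser,
// replacing the usual avcodec_decode_video2() call. "parser" is assumed to be
// a CUvideoparser created elsewhere (see section 1.2).
void FeedPacketsToCuvid(const char *url, CUvideoparser parser, int video_stream_index)
{
    avformat_network_init();                                // needed for RTSP input

    AVFormatContext *fmt_ctx = nullptr;
    avformat_open_input(&fmt_ctx, url, nullptr, nullptr);   // RTSP URL or local file
    avformat_find_stream_info(fmt_ctx, nullptr);

    AVPacket pkt;
    while (av_read_frame(fmt_ctx, &pkt) >= 0)
    {
        if (pkt.stream_index == video_stream_index)
        {
            CUVIDSOURCEDATAPACKET cupkt;
            memset(&cupkt, 0, sizeof(cupkt));
            cupkt.payload_size = pkt.size;
            cupkt.payload      = pkt.data;
            cupkt.flags        = CUVID_PKT_TIMESTAMP;
            cupkt.timestamp    = pkt.pts;
            // The parser fires HandleVideoSequence / HandlePictureDecode /
            // HandlePictureDisplay from inside this call.
            cuvidParseVideoData(parser, &cupkt);
        }
        av_packet_unref(&pkt);
    }

    // Signal end of stream so the decoder can flush its remaining frames.
    CUVIDSOURCEDATAPACKET eos;
    memset(&eos, 0, sizeof(eos));
    eos.flags = CUVID_PKT_ENDOFSTREAM;
    cuvidParseVideoData(parser, &eos);

    avformat_close_input(&fmt_ctx);
}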
1.1 VideoSource
VideoSourceData contains a CUvideoparser and a FrameQueue. As the framework shows, the CUvideoparser is an interface wrapper built on top of VideoDecoder, and VideoSource drives decoding through the CUvideoparser. FrameQueue is the queue that holds decoded frames; note that these frames stay in GPU memory, and VideoDecoder's mapFrame() function maps a decoded frame to a device pointer from which it can then be copied to host memory.
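As a rough illustration (not the sample's actual class, which also tracks end-of-decode state and which picture indices are still in use), a minimal thread-safe queue of CUVIDPARSERDISPINFO records could look like this:

#include <nvcuvid.h>
#include <mutex>
#include <queue>

// Simplified stand-in for the FrameQueue: a FIFO of display-info records
// describing decoded frames that still live in GPU memory.
class SimpleFrameQueue
{
public:
    void enqueue(const CUVIDPARSERDISPINFO *info)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(*info);
    }

    bool dequeue(CUVIDPARSERDISPINFO *info)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty())
            return false;
        *info = queue_.front();
        queue_.pop();
        return true;
    }

    int FrameNumInQueue()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return static_cast<int>(queue_.size());
    }

private:
    std::mutex mutex_;
    std::queue<CUVIDPARSERDISPINFO> queue_;
};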
1.2 VideoParser

The most important parts of VideoParser are its three callbacks: static int CUDAAPI HandleVideoSequence(void *pUserData, CUVIDEOFORMAT *pFormat), HandlePictureDecode(void *pUserData, CUVIDPICPARAMS *pPicParams) and HandlePictureDisplay(void *pUserData, CUVIDPARSERDISPINFO *pPicParams), which handle video format changes, picture decoding, and post-decode display respectively. HandleVideoSequence mainly validates the video format and does little else; HandlePictureDecode calls VideoDecoder's decode function (the CUDA interface); HandlePictureDisplay pushes the decoded GPU frame into the FrameQueue.
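A hedged sketch of how these three callbacks get wired into a CUvideoparser follows. The struct members and entry points come from nvcuvid.h; the callback bodies and the g_decoder / g_frame_queue globals are illustrative stand-ins, not the sample's actual code:

#include <nvcuvid.h>
#include <cstring>

extern CUvideodecoder g_decoder;        // created by VideoDecoder (section 1.3)
extern SimpleFrameQueue g_frame_queue;  // the queue sketched in section 1.1

static int CUDAAPI HandleVideoSequence(void *pUserData, CUVIDEOFORMAT *pFormat)
{
    // The real implementation checks that the coded size and chroma format
    // still match the decoder that was created, and recreates it if not.
    return 1;
}

static int CUDAAPI HandlePictureDecode(void *pUserData, CUVIDPICPARAMS *pPicParams)
{
    cuvidDecodePicture(g_decoder, pPicParams);  // hardware decode of one picture
    return 1;                                   // non-zero: keep parsing
}

static int CUDAAPI HandlePictureDisplay(void *pUserData, CUVIDPARSERDISPINFO *pPicParams)
{
    g_frame_queue.enqueue(pPicParams);          // frame stays in GPU memory
    return 1;
}

CUvideoparser CreateParser(void *user_data)
{
    CUVIDPARSERPARAMS params;
    memset(&params, 0, sizeof(params));
    params.CodecType              = cudaVideoCodec_H264;  // H.264 from the RTSP stream
    params.ulMaxNumDecodeSurfaces = 20;                   // decode surface pool size
    params.ulMaxDisplayDelay      = 1;                    // low display latency
    params.pUserData              = user_data;
    params.pfnSequenceCallback    = HandleVideoSequence;
    params.pfnDecodePicture       = HandlePictureDecode;
    params.pfnDisplayPicture      = HandlePictureDisplay;

    CUvideoparser parser = nullptr;
    cuvidCreateVideoParser(&parser, &params);
    return parser;
}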
1.3 VideoDecoder

This class is the core hardware-decoding class. CUVIDDECODECREATEINFO oVideoDecodeCreateInfo_ is the structure that describes the decoder to be created, and CUvideodecoder oDecoder_ is the actual CUDA hardware decoder; the decode functionality of VideoParser is in fact implemented on top of this CUvideodecoder core. (The layers of wrapping make the source a bit convoluted, so reading it to the bottom takes some patience.)
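For reference, a minimal sketch of how such a CUvideodecoder might be created. Field names are from nvcuvid.h; the surface counts and deinterlace mode here are illustrative, whereas the sample derives them from the CUVIDEOFORMAT reported in HandleVideoSequence:

#include <nvcuvid.h>
#include <cstring>

CUvideodecoder CreateDecoder(unsigned int width, unsigned int height, CUvideoctxlock lock)
{
    CUVIDDECODECREATEINFO info;
    memset(&info, 0, sizeof(info));
    info.ulWidth             = width;
    info.ulHeight            = height;
    info.ulTargetWidth       = width;
    info.ulTargetHeight      = height;
    info.CodecType           = cudaVideoCodec_H264;
    info.ChromaFormat        = cudaVideoChromaFormat_420;
    info.OutputFormat        = cudaVideoSurfaceFormat_NV12;    // what mapFrame hands back
    info.DeinterlaceMode     = cudaVideoDeinterlaceMode_Adaptive;
    info.ulNumDecodeSurfaces = 20;   // must cover the parser's decode surface pool
    info.ulNumOutputSurfaces = 2;    // surfaces that can be mapped at the same time
    info.ulCreationFlags     = cudaVideoCreate_PreferCUVID;    // use the dedicated decode engine
    info.vidLock             = lock; // CUvideoctxlock shared with the parser callbacks

    CUvideodecoder decoder = nullptr;
    cuvidCreateDecoder(&decoder, &info);
    return decoder;
}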

2 Implementation of the core decoding module
In the SDK sample, NvDecodeD3D9.cpp sets up the D3D environment and initializes the CUDA modules. The function below fetches a decoded frame for display: it takes a frame (really a device pointer) out of the decoded-frame queue, converts it from NV12 to ARGB, and finally maps it onto a D3D texture for display. I have added explanations at the key points in the code.

bool copyDecodedFrameToTexture(unsigned int &nRepeats, int bUseInterop, int *pbIsProgressive)
{
    CUVIDPARSERDISPINFO oDisplayInfo;

    if (g_pFrameQueue->dequeue(&oDisplayInfo))
    {
    CCtxAutoLock lck(g_CtxLock);
    // Push the current CUDA context (only if we are using CUDA decoding path)
    CUresult result = cuCtxPushCurrent(g_oContext);
    // Device pointers to the decoded frames; note they hold NV12 data
    CUdeviceptr  pDecodedFrame[3] = { 0, 0, 0 }; 
    // Device pointers used for the post-decode format conversion
    CUdeviceptr  pInteropFrame[3] = { 0, 0, 0 };

    *pbIsProgressive = oDisplayInfo.progressive_frame;
    g_bIsProgressive = oDisplayInfo.progressive_frame ? true : false;

    int num_fields = 1;
    if (g_bUseVsync) {            
        num_fields = std::min(2 + oDisplayInfo.repeat_first_field, 3);            
    }
    nRepeats = num_fields;

    CUVIDPROCPARAMS oVideoProcessingParameters;
    memset(&oVideoProcessingParameters, 0, sizeof(CUVIDPROCPARAMS));

    oVideoProcessingParameters.progressive_frame = oDisplayInfo.progressive_frame;
    oVideoProcessingParameters.top_field_first = oDisplayInfo.top_field_first;
    oVideoProcessingParameters.unpaired_field = (oDisplayInfo.repeat_first_field < 0);

    for (int active_field = 0; active_field < num_fields; active_field++)
    {
        unsigned int nDecodedPitch = 0;
        unsigned int nWidth = 0;
        unsigned int nHeight = 0;

        oVideoProcessingParameters.second_field = active_field;

        // map decoded video frame to CUDA surface
        // Call VideoDecoder's mapFrame to get the device address of the decoded frame and the key pitch parameter
        if (g_pVideoDecoder->mapFrame(oDisplayInfo.picture_index, &pDecodedFrame[active_field], &nDecodedPitch, &oVideoProcessingParameters) != CUDA_SUCCESS)
        {
            // release the frame, so it can be re-used in decoder
            g_pFrameQueue->releaseFrame(&oDisplayInfo);

            // Detach from the Current thread
            checkCudaErrors(cuCtxPopCurrent(NULL));

            return false;
        }
        nWidth  = g_pVideoDecoder->targetWidth();
        nHeight = g_pVideoDecoder->targetHeight();
        // map DirectX texture to CUDA surface
        size_t nTexturePitch = 0;

        // If we are Encoding and this is the 1st Frame, we make sure we allocate system memory for readbacks
        if (g_bReadback && g_bFirstFrame && g_ReadbackSID)
        {
            CUresult result;
            checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[0], (nDecodedPitch * nHeight + nDecodedPitch*nHeight/2)));
            checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[1], (nDecodedPitch * nHeight + nDecodedPitch*nHeight/2)));
            checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[2], (nDecodedPitch * nHeight + nDecodedPitch*nHeight/2)));
            checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[3], (nDecodedPitch * nHeight + nDecodedPitch*nHeight/2)));
            checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[4], (nDecodedPitch * nHeight + nDecodedPitch*nHeight / 2)));
            checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[5], (nDecodedPitch * nHeight + nDecodedPitch*nHeight / 2)));

            g_bFirstFrame = false;

            if (result != CUDA_SUCCESS)
            {
                printf("cuMemAllocHost returned %d\n", (int)result);
                checkCudaErrors(result);
            }
        }

        // If streams are enabled, we can perform the readback to the host while the kernel is executing
        if (g_bReadback && g_ReadbackSID)
        {
            CUresult result = cuMemcpyDtoHAsync(g_pFrameYUV[active_field], pDecodedFrame[active_field], (nDecodedPitch * nHeight * 3 / 2), g_ReadbackSID);

            if (result != CUDA_SUCCESS)
            {
                printf("cuMemAllocHost returned %d\n", (int)result);
                checkCudaErrors(result);
            }
        }

#if ENABLE_DEBUG_OUT
        printf("%s = %02d, PicIndex = %02d, OutputPTS = %08d\n",
               (oDisplayInfo.progressive_frame ? "Frame" : "Field"),
               g_DecodeFrameCount, oDisplayInfo.picture_index, oDisplayInfo.timestamp);
#endif

        if (g_pImageDX)
        {
            // map the texture surface
            g_pImageDX->map(&pInteropFrame[active_field], &nTexturePitch, active_field);
        }
        else
        {
            pInteropFrame[active_field] = g_pInteropFrame[active_field];
            nTexturePitch = g_pVideoDecoder->targetWidth() * 2;
        }

        // perform post processing on the CUDA surface (performs color space conversion and post processing)
        // comment this out if we include the line of code seen above
        // Invoke the CUDA module that converts NV12 to ARGB; this module is fairly involved, and a simpler alternative is given later
        cudaPostProcessFrame(&pDecodedFrame[active_field], nDecodedPitch, &pInteropFrame[active_field], 
                             nTexturePitch, g_pCudaModule->getModule(), g_kernelNV12toARGB, g_KernelSID);

        if (g_pImageDX)
        {
            // unmap the texture surface
            g_pImageDX->unmap(active_field);
        }

        // unmap video frame
        // unmapFrame() synchronizes with the VideoDecode API (ensures the frame has finished decoding)
        g_pVideoDecoder->unmapFrame(pDecodedFrame[active_field]);                  
        g_DecodeFrameCount++;

        if (g_bWriteFile)
        {
            checkCudaErrors(cuStreamSynchronize(g_ReadbackSID));
            SaveFrameAsYUV(g_pFrameYUV[active_field + 3],
                g_pFrameYUV[active_field],
                nWidth, nHeight, nDecodedPitch);
        }
    }

    // Detach from the Current thread
    checkCudaErrors(cuCtxPopCurrent(NULL));
    // release the frame, so it can be re-used in decoder
    g_pFrameQueue->releaseFrame(&oDisplayInfo);
}
else
{
    // Frame Queue has no frames, we don't compute FPS until we start
    return false;
}

// check if decoding has come to an end.
// if yes, signal the app to shut down.
if (!g_pVideoSource->isStarted() && g_pFrameQueue->isEndOfDecode() && g_pFrameQueue->isEmpty())
{
    // Let's free the Frame Data
    if (g_ReadbackSID && g_pFrameYUV)
    {
        cuMemFreeHost((void *)g_pFrameYUV[0]);
        cuMemFreeHost((void *)g_pFrameYUV[1]);
        cuMemFreeHost((void *)g_pFrameYUV[2]);
        cuMemFreeHost((void *)g_pFrameYUV[3]);
        cuMemFreeHost((void *)g_pFrameYUV[4]);
        cuMemFreeHost((void *)g_pFrameYUV[5]);
        g_pFrameYUV[0] = NULL;
        g_pFrameYUV[1] = NULL;
        g_pFrameYUV[2] = NULL;
        g_pFrameYUV[3] = NULL;
        g_pFrameYUV[4] = NULL;
        g_pFrameYUV[5] = NULL;
    }

    // Let's just stop, and allow the user to quit, so they can at least see the results
    g_pVideoSource->stop();

    // If we want to loop reload the video file and restart
    if (g_bLoop && !g_bAutoQuit)
    {
        HRESULT hr = reinitCudaResources();
        if (SUCCEEDED(hr))
        {
            g_FrameCount = 0;
            g_DecodeFrameCount = 0;
            g_pVideoSource->start();
        }
    }

    if (g_bAutoQuit)
    {
        g_bDone = true;
    }
}
return true;
}

The module above is tangled with D3D, which makes it hard to pick out the core part that retrieves the decoded image data. Below is my approach for retrieving the image data into an OpenCV Mat:
bool GetGpuDecodeFrame(shared_ptr ptr_video_stream, Mat &frame)
{
CUVIDPARSERDISPINFO oDisplayInfo;

if (ptr_video_stream->p_cuda_frame_queue)
{
    if (ptr_video_stream->p_cuda_frame_queue->FrameNumInQueue()>0 && ptr_video_stream->p_cuda_frame_queue->dequeue(&oDisplayInfo))
    {
        CCtxAutoLock lck(cuvideo_ctx_lock_);
        CUresult result = cuCtxPushCurrent(cuda_context_);
        CUdeviceptr  pDecodedFrame = 0;
        int num_fields = 1;

        CUVIDPROCPARAMS oVideoProcessingParameters;
        memset(&oVideoProcessingParameters, 0, sizeof(CUVIDPROCPARAMS));

        oVideoProcessingParameters.progressive_frame = oDisplayInfo.progressive_frame;
        oVideoProcessingParameters.top_field_first = oDisplayInfo.top_field_first;
        oVideoProcessingParameters.unpaired_field = (oDisplayInfo.repeat_first_field < 0);
        oVideoProcessingParameters.second_field = 0;

        unsigned int nDecodedPitch = 0;
        unsigned int nWidth = 0;
        unsigned int nHeight = 0;
        // Look up the device address of the decoded frame in GPU memory
        if (ptr_video_stream->p_cuda_video_decoder->mapFrame(oDisplayInfo.picture_index, &pDecodedFrame, &nDecodedPitch, &oVideoProcessingParameters) != CUDA_SUCCESS)
        {
            // release the frame, so it can be re-used in decoder
            ptr_video_stream->p_cuda_frame_queue->releaseFrame(&oDisplayInfo);
            // Detach from the Current thread
            checkCudaErrors(cuCtxPopCurrent(NULL));
            return false;
        }
        nWidth = ptr_video_stream->p_cuda_video_decoder->targetWidth();
        nHeight = ptr_video_stream->p_cuda_video_decoder->targetHeight();
        Mat raw_frame = Mat::zeros(cvSize(nWidth, nHeight), CV_8UC3);
        // Convert the frame from NV12 to RGB directly in device memory and copy the result to host memory
        VasGpuBoost::ColorConvert::Get()->ConvertD2HYUV422pToRGB24((uchar*)pDecodedFrame, raw_frame.data, nWidth, nHeight, nDecodedPitch);
        resize(raw_frame, frame, cvSize(ptr_video_stream->decode_param_.dst_width(), ptr_video_stream->decode_param_.dst_height()));
        raw_frame.release();
        ptr_video_stream->p_cuda_video_decoder->unmapFrame(pDecodedFrame);
        checkCudaErrors(cuCtxPopCurrent(NULL));
        ptr_video_stream->p_cuda_frame_queue->releaseFrame(&oDisplayInfo);

        pts = 0;
        No = 0;
        return true;
    }
    else return false;
}
else return false;
}
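For completeness, a hypothetical playback loop built on top of GetGpuDecodeFrame. The stream object's type is left as a template parameter because the shared_ptr's element type (my own stream struct) is not reproduced in this post:

#include <opencv2/opencv.hpp>

// Hypothetical display loop: keep pulling decoded frames out of the GPU queue
// and show them with OpenCV until ESC is pressed.
template <typename StreamPtr>
void PlaybackLoop(StreamPtr ptr_video_stream)
{
    cv::Mat frame;
    for (;;)
    {
        if (GetGpuDecodeFrame(ptr_video_stream, frame))
            cv::imshow("cuda_decode", frame);
        if (cv::waitKey(1) == 27)   // ESC quits the loop
            break;
    }
}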

The CUDA kernel used for the conversion is shown below. It must be stressed that the pitch is not the image width: it is the byte width of each stored row of the decoded frame, which is normally larger than the image width. The conversion itself follows the blog post "Common image format conversions".

// CUDA_KERNEL_LOOP is assumed to be the usual grid-stride loop macro:
//   for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); i += blockDim.x * gridDim.x)
__global__ void DevYuv420iToRgb(const uchar* yuv_data, uchar *rgb_data, const int width, const int height, const int pitch, const uchar *table_r, const uchar *table_g, const uchar *table_b)
{
    const int64 size = pitch * height;          // bytes in the pitch-aligned luma plane
    const int64 compute_size = width * height;  // number of pixels to convert

    CUDA_KERNEL_LOOP(i, compute_size)
    {
        // Recompute the pixel coordinates for every index this thread handles
        const int x = i % width;
        const int y = i / width;

        const int y_offset = pitch * y + x;                // luma sample
        const int nv_index = y / 2 * pitch + x - x % 2;    // start of the interleaved chroma pair

        const int v_offset = size + nv_index;
        const int u_offset = v_offset + 1;

        const int Y = *(yuv_data + y_offset);
        const int U = *(yuv_data + u_offset);
        const int V = *(yuv_data + v_offset);

        // Precomputed lookup tables map (Y, U, V) to R, G and B
        *(rgb_data + 3 * i)     = table_r[(Y << 8) + V];
        *(rgb_data + 3 * i + 1) = table_g[(Y << 16) + (U << 8) + V];
        *(rgb_data + 3 * i + 2) = table_b[(Y << 8) + U];
    }
}
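For illustration, a hypothetical host-side wrapper that launches this kernel and copies the RGB result into a cv::Mat might look as follows. The d_table_r/g/b lookup tables are assumed to have been built in device memory beforehand (as in the conversion blog referenced above), and d_nv12 is the mapped device pointer of the decoded frame:

#include <cuda_runtime.h>
#include <opencv2/opencv.hpp>

void Nv12DeviceToMat(const uchar *d_nv12, int width, int height, int pitch,
                     const uchar *d_table_r, const uchar *d_table_g, const uchar *d_table_b,
                     cv::Mat &rgb)
{
    rgb.create(height, width, CV_8UC3);

    // Temporary device buffer for the packed RGB output
    uchar *d_rgb = nullptr;
    cudaMalloc(&d_rgb, (size_t)width * height * 3);

    const int threads = 256;
    const int blocks  = (width * height + threads - 1) / threads;
    DevYuv420iToRgb<<<blocks, threads>>>(d_nv12, d_rgb, width, height, pitch,
                                         d_table_r, d_table_g, d_table_b);

    // Copy the converted frame back to the cv::Mat in host memory
    cudaMemcpy(rgb.data, d_rgb, (size_t)width * height * 3, cudaMemcpyDeviceToHost);
    cudaFree(d_rgb);
}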

Summary
This article set out to lay bare the CUDA hardware decoding framework. Comparison tests show that hardware decoding is indeed powerful: a GTX 970 decodes 720p video at roughly 800 fps. Building on this framework, I implemented a multi-stream CUDA hardware-decoding C++ interface that outputs OpenCV Mat images for subsequent video analysis.
This post was written in a hurry; if anything is wrong or unclear, please leave me a comment.