A Case Study of Deep Learning Based on Spark and BigDL

This article shares the practical experience of Intel and JD.com in building a large-scale, deep-learning-based image feature extraction framework on Spark and BigDL.

Background

Image feature extraction is widely used in applications such as similar-image retrieval and deduplication. Before adopting the BigDL framework (described below), we had tried developing and deploying feature extraction applications on single machines with multiple GPU cards and on GPU clusters. However, these approaches have obvious disadvantages:

In a GPU cluster, resource allocation at the granularity of individual GPU cards is very complicated and error-prone; for example, insufficient remaining GPU memory can cause out-of-memory (OOM) errors and application crashes.

On a single machine, compared with a cluster-based approach, developers must manually handle data sharding, load balancing, and fault tolerance.

GPU-based applications have many dependencies, such as CUDA, which increases the difficulty of deployment and maintenance. For example, differences in operating system versions or GCC versions require recompiling and repackaging.

These problems mean that GPU-based inference programs face many architectural challenges in real applications.

Let's look at the scenario itself. Many product images have complex backgrounds, and the subject usually occupies only a small portion of the image. To reduce the interference of the background on feature extraction accuracy, the subject must first be separated from the image. The image feature extraction framework therefore naturally divides into two steps: a target detection algorithm first detects the subject, and then a feature extraction algorithm extracts its features. Here we use SSD [1] (Single Shot MultiBox Detector) for target detection and the DeepBit [2] network for feature extraction.
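The two-step flow can be sketched in plain Python. The functions below are illustrative stand-ins, not the real SSD and DeepBit models; a detection is assumed to be a (box, score) pair with the box given as (x1, y1, x2, y2) pixel coordinates:

```python
# Illustrative sketch of the detect-then-extract flow.
# A detection is a (box, score) pair; box = (x1, y1, x2, y2).

def pick_subject(detections):
    """Keep only the highest-scoring detection as the subject."""
    box, score = max(detections, key=lambda d: d[1])
    return box

def crop(image, box):
    """Crop a 2-D image (list of pixel rows) to the given box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def extract_subject_features(image, detect, extract):
    """detect: image -> [(box, score)]; extract: image -> feature vector."""
    box = pick_subject(detect(image))
    return extract(crop(image, box))
```

In the real pipeline, `detect` and `extract` are distributed SSD and DeepBit predictions run through BigDL rather than local function calls.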

JD.com stores hundreds of millions of product images in a mainstream distributed open-source database. How to efficiently retrieve and process this data in a large-scale distributed environment is therefore a key issue for the image feature extraction pipeline. Existing GPU-based solutions face additional challenges in this scenario:

Downloading the data takes a long time, and GPU-based solutions do not optimize this step well.

For image data stored in the distributed open-source database, the data preprocessing in a GPU solution is very complicated, and there is no mature software framework for resource management, distributed data processing, or fault tolerance.

Because of the limitations of the GPU software and hardware stack, scaling GPU solutions to handle large numbers of images is challenging.

BigDL integration solution

In a production environment, leveraging existing software and hardware infrastructure significantly increases efficiency (for example, by reducing the development time of new products) while lowering costs. In this case, the data is stored in a mainstream distributed open-source database in a big data cluster. If the deep learning application can run on the existing big data cluster (such as a Hadoop or Spark cluster), the challenges above are easily addressed.

Intel's open-source BigDL project [3] is a distributed deep learning framework on Spark that provides comprehensive deep learning algorithm support. Leveraging the distributed scalability of the Spark platform, BigDL can easily scale out to hundreds or thousands of nodes. At the same time, BigDL uses the Intel MKL math library and parallel computing techniques to achieve high performance on Intel Xeon servers (computing performance comparable to mainstream GPUs).

In our scenario, BigDL was customized to support the various models needed (detection and classification); models originally built for specific frameworks (Caffe, Torch, TensorFlow) were ported to run in the BigDL big data environment; and the speed of the entire pipeline was optimized.

The pipeline for feature extraction on Spark with BigDL is shown in Figure 1:

1. Use Spark to read hundreds of millions of original images from the distributed open-source database and build them into an RDD.

2. Use Spark to preprocess the images, including resizing, subtracting the mean, and assembling the data into batches.

3. Load the SSD model with BigDL and run large-scale distributed target detection on the images through Spark, obtaining a series of detection coordinates and corresponding scores.

4. Keep the detection with the highest score as the subject, and crop the original image according to the detected coordinates to obtain the target image.

5. Preprocess the RDD of target images, including resizing and assembling batches.

6. Load the DeepBit model with BigDL and run distributed feature extraction on the detected target images through Spark to obtain the corresponding features.

7. Store the results (the RDD of extracted target features) on HDFS.
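The steps above can be sketched as a single pure-Python function. This is a minimal sketch, not the real job: each list comprehension below stands in for a Spark RDD transformation, the two model calls stand in for distributed BigDL predictions, and all names are illustrative:

```python
# Pure-Python sketch of the seven pipeline stages. In the real job each
# list comprehension is a Spark RDD transformation and the two model
# calls are distributed BigDL predictions. All names are illustrative.

def crop(image, box):
    """Crop a 2-D image (list of pixel rows) to box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def run_pipeline(raw_images, preprocess, ssd_detect, deepbit_extract, save):
    # 1. Read the raw images (stands in for building an RDD from the database).
    images = list(raw_images)
    # 2. Preprocess: resize, subtract mean, batch (collapsed into one call here).
    batches = [preprocess(img) for img in images]
    # 3. SSD detection: each result is a list of (box, score) pairs.
    detections = [ssd_detect(b) for b in batches]
    # 4. Keep the highest-scoring box and crop the original image.
    targets = [crop(img, max(d, key=lambda x: x[1])[0])
               for img, d in zip(images, detections)]
    # 5. Preprocess the cropped target images again.
    target_batches = [preprocess(t) for t in targets]
    # 6. DeepBit feature extraction.
    features = [deepbit_extract(t) for t in target_batches]
    # 7. Persist (stands in for writing the feature RDD to HDFS).
    save(features)
    return features
```

On Spark, stages 2 through 6 would be lazy `map`-style transformations over a partitioned RDD, so the same structure scales out across the cluster.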

Figure 1: Image feature extraction pipeline based on BigDL


The entire data analysis pipeline, including data loading, partitioning, preprocessing, prediction, and storage of results, can be easily implemented on Spark with BigDL. On an existing big data cluster (Hadoop/Spark), users can run deep learning applications with BigDL without modifying any cluster configuration. Moreover, BigDL leverages the high scalability of the Spark platform to easily scale out to a large number of nodes and tasks, greatly accelerating the data analysis process.

In addition to distributed deep learning support, BigDL provides a number of easy-to-use tools, such as an image preprocessing library and model loading utilities (including loading models from third-party deep learning frameworks), making it easier for users to build the entire pipeline.

Image preprocessing

BigDL provides an image preprocessing library based on OpenCV [5] that supports various common image transformation and augmentation functions. Users can easily chain these basic functions to build an image preprocessing pipeline, and can also call the OpenCV operations exposed by the library to implement custom image transformations.
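Conceptually, such a pipeline is a chain of small transforms applied in sequence. Below is a minimal pure-Python sketch of that idea; the toy transforms are illustrative only, whereas BigDL's actual library supplies OpenCV-backed transforms:

```python
# Minimal sketch of chaining image transforms into one preprocessing
# pipeline. Images are nested lists of pixel values; the real BigDL
# library provides OpenCV-backed transforms instead of these toy ones.

def compose(*transforms):
    """Chain transforms left to right into a single callable."""
    def pipeline(image):
        for t in transforms:
            image = t(image)
        return image
    return pipeline

def center_crop(size):
    """Crop the central size x size region of the image."""
    def t(image):
        top = (len(image) - size) // 2
        left = (len(image[0]) - size) // 2
        return [row[left:left + size] for row in image[top:top + size]]
    return t

def subtract_mean(mean):
    """Subtract a fixed mean value from every pixel."""
    def t(image):
        return [[p - mean for p in row] for row in image]
    return t

# Build one preprocessing pipeline from the individual transforms.
preprocess = compose(center_crop(2), subtract_mean(1))
```

Because the composed pipeline is just a function of one image, it maps naturally over a Spark RDD of images, which is how the preprocessing stages of the pipeline above are run in a distributed fashion.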
