An Introduction to Separable Convolutions with Literature Review

Inmeta · Sep 18, 2020

And why you would want to apply them in your machine learning projects.

By Greta Elvestuen, PhD, Data Scientist consultant at Inmeta

Throughout the last decade, convolutional neural networks (CNNs) have brought significant improvements in performance to machine learning models. Deep learning models in particular have gained popularity within the field of computer vision, owing to their achievements and their use in a wide variety of everyday applications.

However, training such networks is very time-consuming. Large datasets are necessary to train a high-performing model, leading to training times of up to several weeks. This is especially unfortunate when one needs to test how well the network performs in order to make adjustments. Even with today's virtual machines and far more powerful GPUs, training time remains a challenge in machine learning projects.

So how are separable convolutions relevant in this context? They are far more efficient, as they decrease computational complexity and require less memory during training. In addition, they tend to perform better than standard convolutions and have a wide variety of applications. Sounds unrealistic? The reason lies in the number of multiplications performed during training, which will be explained in more detail below. First, how did the concept begin?

The origins of separable convolutions

Spatial separable convolutions have a longer history than the depthwise type. They have been applied in neural networks at least since the work of Mamalet and Garcia (2012) on simplifying ConvNets for fast learning, and probably even earlier. Pioneering work on depthwise separable convolutions, on the other hand, was inspired by the research of Sifre and Mallat on transformation-invariant scattering. Laurent Sifre then developed the concept further during his Google Brain internship in 2013 and employed it in AlexNet, obtaining somewhat higher accuracy as well as a decrease in model size.

In the following two years, Szegedy and colleagues used a separable convolution as the first layer of the famous Inception V1, and Ioffe and Szegedy did the same in Inception V2. Jin and colleagues in 2015, and Wang and colleagues in 2016, applied separable convolutions to decrease the size and computational cost of convolutional neural networks. Abadi and colleagues then implemented depthwise separable convolutions in the TensorFlow framework, significantly facilitating further work with the concept.

Probably the best-known applications of these convolutions come from the work of Howard and colleagues, who introduced efficient mobile models under the name MobileNets, and Chollet, who applied them in the Xception architecture. Now, let's look in more detail at what this concept is all about.

Spatial separable convolutions

The first version of separable convolutions deals primarily with the spatial dimensions of an image and kernel: its height and width. It divides a kernel into two smaller kernels, most commonly a 3x3 kernel into a 3x1 and a 1x3 kernel. Hence, instead of one convolution requiring 9 multiplications per output position, two convolutions requiring 3 multiplications each (6 in total) achieve the same effect.

Fewer multiplications → lower computational complexity → faster network

Wang (2018)

One of the most famous kernels that can be separated spatially is the Sobel kernel, used for edge detection:
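As a quick sanity check, here is a minimal sketch (assuming NumPy and SciPy are available) showing that the Sobel kernel is the outer product of a column and a row vector, and that convolving with the two factors in sequence gives the same result as the full 3x3 kernel:

```python
import numpy as np
from scipy.signal import convolve2d

col = np.array([[1], [2], [1]])   # 3x1 smoothing component
row = np.array([[1, 0, -1]])      # 1x3 differentiation component
sobel = col @ row                 # outer product yields the 3x3 Sobel kernel

img = np.random.rand(8, 8)
full = convolve2d(img, sobel, mode="valid")  # 9 multiplications per output
separable = convolve2d(convolve2d(img, col, mode="valid"),
                       row, mode="valid")    # 3 + 3 multiplications per output

print(np.allclose(full, separable))          # True
```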

In addition, a spatially separable convolution requires fewer matrix multiplications than a standard convolution:

[Figure from Bai (2019): comparison of multiplication counts for standard vs. spatially separable convolution]
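More generally, for an N x N kernel the per-position cost drops from N*N to 2*N multiplications; a quick illustration:

```python
# Multiplications per output position: full N x N kernel vs. its
# N x 1 and 1 x N factors.
for n in (3, 5, 7):
    print(f"{n}x{n}: {n * n} vs {2 * n}")   # 9 vs 6, 25 vs 10, 49 vs 14
```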

However, despite their advantages, spatially separable convolutions are seldom applied in deep learning. This is mainly because not every kernel can be divided into two smaller ones. Replacing all standard convolutions with spatially separable ones would also restrict the search to the subset of kernels that factor this way, implying worse training results.

Depthwise separable convolutions

The second version of separable convolutions deals with kernels that cannot be divided into two smaller ones, and it is the more commonly used of the two. This is the type of separable convolution implemented in Keras and TensorFlow. In addition to the spatial dimensions, it also works with the depth dimension (the number of channels). It separates the convolutional process into two stages: a depthwise and a pointwise convolution.
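As a minimal sketch of the two stages in Keras (the sizes below are illustrative), the depthwise stage filters each channel separately, the pointwise stage mixes channels with 1x1 convolutions, and SeparableConv2D fuses the two:

```python
from tensorflow.keras import layers

# Stage 1: depthwise 3x3 convolution, one filter per input channel
depthwise = layers.DepthwiseConv2D(3, padding="same")
# Stage 2: pointwise 1x1 convolution, combining channels into 128 outputs
pointwise = layers.Conv2D(128, 1)

# Equivalent (up to weight initialization) to the fused layer:
fused = layers.SeparableConv2D(128, 3, padding="same")
```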

Here are a couple of examples of their advantages, as well as their incorporation in the Xception architecture:

[Figures from Sifre (2014): separable multidimensional convolution; comparison of the number of parameters for the first layers in the network with dense versus separable convolutions]

[Figures from Howard et al. (2017): MobileNet models applied to various recognition tasks for efficient on-device intelligence; experimental evaluation of MobileNets, including resource usage for modifications of the standard convolution, comparison to popular models, depthwise separable vs. full-convolution MobileNet, and COCO object detection results across different frameworks and network architectures]

[Figures from Chollet (2017): the Xception architecture and its experimental evaluation, with training profiles on ImageNet and on JFT with and without fully-connected layers, and with and without residual connections]

So, how do depthwise separable convolutions achieve such significantly better results? The answer lies in how they differ from standard (dense) convolutions:

[Figures from Bai (2019): a standard 2D convolution creating a one-layer output with one filter; a standard 2D convolution creating a 128-layer output with 128 filters; the first (depthwise) and second (pointwise) stages of a depthwise separable convolution]

Let us examine the number of multiplications:
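Here is a worked example with illustrative sizes, assuming a 12x12x3 input, 5x5 kernels, 128 output channels and valid padding (so each filter is applied at 8x8 positions):

```python
positions = 8 * 8  # valid 5x5 convolution over a 12x12 input

# Standard convolution: 128 kernels of size 5x5x3 at every position
standard = 128 * (5 * 5 * 3) * positions    # 614,400 multiplications

# Depthwise stage: 3 kernels of size 5x5x1 (one per input channel)
depthwise = 3 * (5 * 5 * 1) * positions     # 4,800
# Pointwise stage: 128 kernels of size 1x1x3
pointwise = 128 * (1 * 1 * 3) * positions   # 24,576

print(standard, depthwise + pointwise)      # 614400 vs 29376, about 21x fewer
```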

In case this did not seem impressive enough, let us look at an example with larger dimensions:
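A sketch with larger, still illustrative sizes, together with the general ratio:

```python
# Larger illustrative sizes: 128x128 feature map, 3x3 kernels,
# 64 input and 256 output channels, "same" padding.
H, W, k, d_in, d_out = 128, 128, 3, 64, 256
positions = H * W

standard = k * k * d_in * d_out * positions             # ~2.4 billion
separable = (k * k * d_in + d_in * d_out) * positions   # ~278 million

# The ratio works out to 1/d_out + 1/k**2, so the saving grows with the
# number of output channels and the kernel size.
print(separable / standard)  # ~0.115, i.e. roughly 9x fewer multiplications
```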

Hence,

Fewer computations → faster network

For more information on how to implement separable convolutions in your own model, the Keras documentation covers the topic thoroughly:

Chollet (2015)
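As a minimal sketch of what this looks like in practice (the architecture and sizes below are illustrative, not a recommended design), standard Conv2D layers can simply be swapped for SeparableConv2D:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # A standard convolution as the stem; keeping the first layer dense
    # is a common choice (MobileNet does the same).
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu",
                  input_shape=(128, 128, 3)),
    # Depthwise separable blocks: a depthwise 3x3 followed by a pointwise
    # 1x1, both handled internally by SeparableConv2D.
    layers.SeparableConv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.SeparableConv2D(128, 3, padding="same", activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),
])
model.summary()  # compare parameter counts against a Conv2D-only variant
```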

Examples of their wide application

As previously mentioned, separable convolutions have gained noticeable popularity in machine learning and, specifically, computer vision over the last few years, with a number of cases emphasizing their advantages and ease of use. In this section, some of the most relevant examples are presented, both to show the variety of their applications and as inspiration, given their achievements:

ShuffleNet

In 2017, Zhang and colleagues introduced a highly computation-efficient CNN architecture named ShuffleNet, designed specifically for mobile devices with very limited computing power. The architecture utilizes two operations, pointwise group convolution and channel shuffle, to retain accuracy while significantly reducing computation cost. The results showed that ShuffleNet maintains comparable accuracy while achieving an approximately 13x actual speedup over AlexNet on an ARM-based mobile device.
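The channel shuffle operation itself is just a cheap reshape-transpose-reshape; a minimal TensorFlow sketch, assuming channels-last tensors, could look like this:

```python
import tensorflow as tf

def channel_shuffle(x, groups):
    # x: (batch, H, W, C) with C divisible by groups. Reshape the channel
    # axis into (groups, C // groups), transpose the two factors and
    # flatten back, so the next grouped convolution sees channels from
    # every group.
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    x = tf.reshape(x, [-1, h, w, groups, c // groups])
    x = tf.transpose(x, [0, 1, 2, 4, 3])
    return tf.reshape(x, [-1, h, w, c])
```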

[Figure from Zhang et al. (2017): complexity comparison]

clcNet

Zhang (2018) suggested that depthwise convolution and grouped convolution can be considered special cases of a generalized convolution operation named channel local convolution (CLC), in which an output channel is computed from only a subset of the input channels. This definition entails computation dependency relations between input and output channels, which can be represented by a channel dependency graph (CDG). By modifying the CDG of grouped convolution, a new CLC kernel named interlaced grouped convolution (IGC) was created.

Stacking IGC and GC kernels results in a convolution block (the CLC block) that approximates a regular convolution. Using the CDG as an analysis tool, rules were derived for setting the meta-parameters of IGC and GC, along with a framework for minimizing computational cost. A CNN model named clcNet was then constructed from CLC blocks, showing significantly higher computational efficiency and fewer parameters than state-of-the-art networks when tested on the ImageNet-1K dataset.
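As a rough sketch of the idea (not the authors' code), IGC can be emulated by permuting the channels into interlaced order before an ordinary grouped convolution; the function name and sizes below are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

def clc_block(x, channels, groups=4):
    # Interlaced grouped convolution (IGC): gather channels i, i+g, i+2g, ...
    # into contiguous blocks, then apply an ordinary grouped convolution.
    c = x.shape[-1]
    idx = tf.concat([tf.range(i, c, groups) for i in range(groups)], axis=0)
    x = tf.gather(x, idx, axis=-1)
    x = layers.Conv2D(channels, 3, padding="same", groups=groups,
                      activation="relu")(x)
    # Grouped convolution (GC) stacked on top completes the CLC block.
    return layers.Conv2D(channels, 1, groups=groups, activation="relu")(x)
```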

[Figures from Zhang (2018): channel dependency graphs of (a) regular, (b) grouped and (c) depthwise convolution; convolution blocks, namely (a) the ResNet bottleneck structure, (b) the ResNeXt block and (c) the depthwise separable convolution in MobileNet & Xception; the CLC block and its channel dependency graph; and a comparison with previous models on classification accuracy and computational cost]

Network Decoupling

Also in 2018, Guo and colleagues analyzed the mathematical relationship between regular and depthwise separable convolutions, and showed that the former can be approximated by the latter in closed form. Depthwise separable convolutions were identified as the principal components of regular convolutions. Moreover, network decoupling (ND) was proposed: a training-free method that accelerates CNNs by transferring pre-trained models into a MobileNet-like depthwise separable structure, with promising speed-ups and negligible accuracy loss.

In addition, it was experimentally verified that this method is orthogonal to other training-free methods (e.g. channel decomposition and spatial decomposition), and combining them achieved even larger speed-ups. Finally, ND's wide applicability to both classification and object detection networks was demonstrated.
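The closed-form approximation rests on a per-input-channel SVD of the kernel; a minimal NumPy sketch of this idea (shapes and names are illustrative, not the authors' implementation) might look like this:

```python
import numpy as np

def decouple(W, rank):
    # W: (k, k, c_in, c_out) regular convolution kernel;
    # rank must not exceed min(k * k, c_out).
    k, _, c_in, c_out = W.shape
    dw = np.zeros((rank, k, k, c_in))    # depthwise kernels, one set per term
    pw = np.zeros((rank, c_in, c_out))   # pointwise kernels
    for i in range(c_in):
        # SVD of this input channel's (k*k) x c_out kernel slice
        U, S, Vt = np.linalg.svd(W[:, :, i, :].reshape(k * k, c_out),
                                 full_matrices=False)
        for r in range(rank):
            dw[r, :, :, i] = (U[:, r] * S[r]).reshape(k, k)
            pw[r, i, :] = Vt[r, :]
    # conv(x, W) is approximated by sum_r pointwise(depthwise(x, dw[r]), pw[r])
    return dw, pw
```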

[Figure from Guo et al. (2018): regular convolutions vs. depthwise separable convolutions]

ChannelNets

Gao et al. (2018) proposed compressing deep models by using channel-wise convolutions, which replace dense connections among feature maps with sparse ones. Based on this operation, lightweight CNNs known as ChannelNets were built. ChannelNets use three instances of channel-wise convolutions: group channel-wise convolutions, depth-wise separable channel-wise convolutions, and a convolutional classification layer. Compared to prior CNNs designed for mobile devices, ChannelNets achieve a significant reduction in the number of parameters and in computational cost without loss of accuracy. This was the first attempt to compress the fully-connected classification layer, which usually accounts for about 25% of the total parameters in compact CNNs. Experimental results on the ImageNet dataset demonstrated that ChannelNets consistently outperform prior methods.
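A minimal sketch of the basic channel-wise convolution, treating the channels at each spatial position as a 1D sequence (illustrative, not the authors' implementation):

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_wise_conv(x, kernel_size=8):
    # x: (batch, H, W, C). Slide a shared 1D kernel over the channel axis
    # instead of using a dense 1x1 projection across all channels.
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    seq = tf.reshape(x, [-1, c, 1])                    # (B*H*W, C, 1)
    out = layers.Conv1D(1, kernel_size, padding="same")(seq)
    return tf.reshape(out, [-1, h, w, c])
```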

[Figures from Gao et al. (2018): different compact convolutions, namely (a) depth-wise separable convolution, (b) group convolution, (c) group channel-wise convolution for information fusion and (d) depth-wise separable channel-wise convolution]

Encoder-Decoder with Atrous Separable Convolution

Also in 2018, Chen and colleagues proposed combining the advantages of the spatial pyramid pooling module and the encoder-decoder structure used in deep neural networks for semantic segmentation. Specifically, the DeepLabv3+ model extended DeepLabv3 by adding a simple yet effective decoder module to refine segmentation results, especially along object boundaries. In addition, the Xception model was explored and depthwise separable convolution was applied to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.
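In Keras, an atrous (dilated) separable convolution of this kind can be expressed directly, since SeparableConv2D accepts a dilation rate (the sizes below are illustrative):

```python
from tensorflow.keras import layers

# 3x3 depthwise separable convolution with atrous rate 2, as used in
# ASPP-style modules; the dilation applies to the depthwise stage.
atrous_sep = layers.SeparableConv2D(256, 3, padding="same",
                                    dilation_rate=2, activation="relu")
```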

[Figure from Chen et al. (2018): a 3×3 depthwise separable convolution decomposes a standard convolution into (a) a depthwise convolution applying a single filter per input channel and (b) a pointwise convolution combining the depthwise outputs across channels; (c) shows atrous convolution adopted in the depthwise convolution with rate = 2]

CNN-based methods for LF SSR

Yeung et al. (2019) proposed effective and efficient end-to-end convolutional neural network models for spatially super-resolving light field (LF) images. Specifically, these models have an hourglass shape, which allows feature extraction to be performed at the low-resolution level in order to save both computation and memory. To make full use of the 4D structure of LF data in both the spatial and angular domains, 4D convolution was proposed to characterize the relationship among pixels. Moreover, as an approximation of 4D convolution, spatial-angular separable (SAS) convolutions were proposed for more computation- and memory-efficient extraction of joint spatial-angular features.
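A rough sketch of the SAS idea, assuming a light field laid out as (batch, U, V, H, W, C): convolve spatially over (H, W) with the angular positions folded into the batch, then swap axes and convolve over the angular plane (U, V). This is illustrative, not the authors' code:

```python
import tensorflow as tf
from tensorflow.keras import layers

def sas_conv(lf, filters):
    # lf: (batch, U, V, H, W, C) with statically known dimensions
    b, u, v, h, w, c = lf.shape
    # Spatial pass: 2D convolution over (H, W), angular views in the batch
    x = tf.reshape(lf, [-1, h, w, c])
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = tf.reshape(x, [-1, u, v, h, w, filters])
    # Angular pass: 2D convolution over (U, V), pixels in the batch
    x = tf.transpose(x, [0, 3, 4, 1, 2, 5])
    x = tf.reshape(x, [-1, u, v, filters])
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = tf.reshape(x, [-1, h, w, u, v, filters])
    return tf.transpose(x, [0, 3, 4, 1, 2, 5])
```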

Extensive experiments on 57 test LF images with various challenging natural scenes showed significant advantages of the proposed models over state-of-the-art methods. Specifically, an average PSNR gain of more than 3.0 dB and higher visual quality were achieved, while the methods better preserved the LF structure of the super-resolved images, which is highly desirable for subsequent applications. In addition, the SAS convolution-based model achieves a threefold speed-up with only a negligible decrease in reconstruction quality compared to the 4D convolution-based one.

[Figure from Yeung et al. (2019): the spatial-angular separable convolution based model]

Zhu-Net

Also in 2019, Zhang and colleagues designed a new CNN structure to improve the detection accuracy of spatial-domain steganography. 3x3 kernels were used instead of the traditional 5x5 ones, and the convolution kernels in the preprocessing layer were optimized. The smaller kernels reduce the number of parameters and model features within a small local region. Further, separable convolutions were employed to exploit the channel correlation of the residuals, compress the image content and increase the signal-to-noise ratio between the stego signal and the image signal.

Moreover, spatial pyramid pooling (SPP) was used to aggregate the local features and enhance their representation ability through multi-level pooling. Finally, data augmentation was adopted to further improve network performance. The experimental results showed that this CNN structure significantly outperforms five other methods (SRM, Ye-Net, Xu-Net, Yedroudj-Net and SRNet) in detecting three spatial-domain algorithms (WOW, S-UNIWARD and HILL) across a wide variety of datasets and payloads.
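A minimal sketch of spatial pyramid pooling, assuming a square feature map whose statically known side is divisible by every pyramid level (illustrative only):

```python
import tensorflow as tf

def spp(x, levels=(1, 2, 4)):
    # x: (batch, S, S, C). Pool the map to an n x n grid for each level
    # and concatenate the flattened results into one fixed-length vector.
    s, c = x.shape[1], x.shape[3]
    outs = []
    for n in levels:
        win = s // n
        pooled = tf.nn.max_pool2d(x, ksize=win, strides=win, padding="VALID")
        outs.append(tf.reshape(pooled, [-1, n * n * c]))
    return tf.concat(outs, axis=1)
```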

[Figures from Zhang et al. (2019): Xception vs. sepconv blocks; the differences between Yedroudj-Net, SRNet and Zhu-Net]

In conclusion, what are the key takeaways about separable convolutions?

Looking at the two types of separable convolutions (spatial and depthwise), it is important to note that both save computational power while demanding less memory than standard convolutions. Spatial separable convolutions are the simpler of the two, dealing primarily with the spatial dimensions of an image and kernel. However, they are rarely used in deep learning, since not every kernel can be divided into the two smaller kernels they require. In addition, this version of separable convolutions limits the search over all possible kernels during training, meaning that training results may be suboptimal.

Depthwise separable convolutions, on the other hand, work with the depth dimension (the number of channels) in addition to the spatial dimensions. They drastically improve efficiency without significantly reducing effectiveness, as they learn richer representations with fewer parameters. On the downside, the reduced number of parameters can be suboptimal for small networks. Hence, the numerous advantages of depthwise separable convolutions come best to light when applied to large computer vision architectures, and they may well become a foundation of future designs, as they are as easy to use as standard convolutional layers.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., … & Ghemawat, S. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

Bai, K. (2019). A Comprehensive Introduction to Different Types of Convolutions in Deep Learning. Towards Data Science. Accessed October 25, 2019.

Chen, L. C., Zhu, Y., Papandreou, G., Schroff, F. & Adam H. (2018). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV). 801–818.

Chollet, F. (2015). Keras.

Chollet, F. (2017). Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1251–1258.

Gao, H., Wang, Z. & Ji, S. (2018). ChannelNets: Compact and Efficient Convolutional Neural Networks via Channel-Wise Convolutions. In Advances in Neural Information Processing Systems. 5197–5205.

Guo, J., Li, Y., Lin, W., Chen, Y., Li, J. (2018). Network Decoupling: From Regular to Depthwise Separable Convolutions. arXiv preprint arXiv:1808.05517.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861.

Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167.

Jin, J., Dundar, A. & Culurciello, E. (2015). Flattened Convolutional Neural Networks for Feedforward Acceleration. arXiv preprint arXiv:1412.5474.

Mamalet, F. & Garcia, C. (2012). Simplifying ConvNets for Fast Learning. In International Conference on Artificial Neural Networks. Springer, Berlin, Heidelberg, 58–65.

Sifre, L. (2014). Rigid-Motion Scattering for Image Classification. PhD thesis, Ecole Polytechnique, CMAP.

Sifre, L. & Mallat, S. (2013). Rotation, Scaling and Deformation Invariant Scattering for Texture Discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1233–1240.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. & Rabinovich, A. (2014). Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.

Wang, C. F. (2018). A Basic Introduction to Separable Convolutions. Towards Data Science. Accessed October 5, 2019.

Wang, M., Liu, B. & Foroosh, H. (2016). Factorized Convolutional Neural Networks. In Proceedings of the IEEE International Conference on Computer Vision. 545–553.

Yeung, H. W. F., Hou, J., Chen, X., Chen, J., Chen, Z. & Chung, Y. Y. (2019). Light Field Spatial Super-Resolution Using Deep Efficient Spatial-Angular Separable Convolution. IEEE Transactions on Image Processing, 28(5): 2319–2330.

Zhang, D. Q. (2018). clcNet: Improving the Efficiency of Convolutional Neural Network using Channel Local Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7912–7919.

Zhang, R., Zhu, F., Liu, J. & Liu, G. (2019). Depth-wise separable convolutions and multi-level pooling for an efficient spatial CNN-based steganalysis. IEEE Transactions on Information Forensics and Security.

Zhang, X., Zhou, X., Lin, M. & Sun, J. (2017). ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6848–6856.
