Review — GhostNet: More Features from Cheap Operations

GhostNet Is Formed by Stacking Ghost Modules

Sik-Ho Tsang
4 min read · Mar 22


Visualization of some feature maps generated by the first residual group in ResNet-50, where three similar feature map pair examples are annotated with boxes of the same color. One feature map in the pair can be approximately obtained by transforming the other one through cheap operations (denoted by spanners).

GhostNet: More Features from Cheap Operations,
GhostNet, by Huawei Technologies, Peking University, and University of Sydney,
2020 CVPR, Over 1100 Citations (Sik-Ho Tsang @ Medium)
Image Classification

1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [ParNet] 2023 [Vision Permutator (ViP)]

  • A novel Ghost module is proposed to generate more feature maps from cheap operations.
  • By stacking Ghost modules, GhostNet is formed.
  • GhostNetV2 was later proposed at NeurIPS 2022.


  1. GhostNet
  2. Results

1. GhostNet

1.1. Ghost Module

An illustration of the convolutional layer and the proposed Ghost module for outputting the same number of feature maps. Φ represents the cheap operation.
  • (a) Standard convolution.
  • (b) Ghost module. Its implementation is shown in the source code below.
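import math

import torch
import torch.nn as nn


class GhostModule(nn.Module):
    def __init__(self, inp, oup, kernel_size=1, ratio=2, dw_size=3, stride=1, relu=True):
        super(GhostModule, self).__init__()
        self.oup = oup
        # Number of intrinsic feature maps (m = n / s, rounded up).
        init_channels = math.ceil(oup / ratio)
        # Number of ghost feature maps produced by the cheap operation.
        new_channels = init_channels*(ratio-1)

        # Note: the official implementation also applies nn.BatchNorm2d after each convolution.
        # Standard convolution producing the intrinsic feature maps.
        self.primary_conv = nn.Sequential(
            nn.Conv2d(inp, init_channels, kernel_size, stride, kernel_size//2, bias=False),
            nn.ReLU(inplace=True) if relu else nn.Sequential(),
        )

        # Cheap operation: depthwise convolution (groups=init_channels).
        self.cheap_operation = nn.Sequential(
            nn.Conv2d(init_channels, new_channels, dw_size, 1, dw_size//2, groups=init_channels, bias=False),
            nn.ReLU(inplace=True) if relu else nn.Sequential(),
        )

    def forward(self, x):
        x1 = self.primary_conv(x)
        x2 = self.cheap_operation(x1)
        # Concatenate intrinsic and ghost maps, then keep the first oup channels.
        out = torch.cat([x1,x2], dim=1)
        return out[:,:self.oup,:,:]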
  • As seen from the source code, a standard convolution generates only part of the output feature maps (the intrinsic feature maps), giving x1.
  • A d×d cheap operation, i.e. a depthwise convolution, is applied to x1 to produce the ghost feature maps x2.
  • Then, x1 and x2 are concatenated, and the first oup channels are kept.
  • s is the ratio argument above: for n output feature maps, only m=n/s intrinsic feature maps are generated by the standard convolution.
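To make the channel bookkeeping concrete, here is a small sketch in plain Python that mirrors the arithmetic in the constructor above. The helper names (ghost_channels, ghost_params, standard_params) and the example layer sizes are my own choices for illustration, and the weight counts ignore any normalization layers.

```python
import math


def ghost_channels(oup, ratio=2):
    """Channel bookkeeping of a Ghost module: intrinsic, ghost, concatenated."""
    init_channels = math.ceil(oup / ratio)      # m = ceil(n / s)
    new_channels = init_channels * (ratio - 1)  # ghost maps from the cheap operation
    return init_channels, new_channels, init_channels + new_channels


def ghost_params(inp, oup, kernel_size=1, ratio=2, dw_size=3):
    """Weights of the two bias-free convolutions in a Ghost module."""
    init_channels, new_channels, _ = ghost_channels(oup, ratio)
    primary = inp * init_channels * kernel_size * kernel_size
    # Depthwise conv (groups=init_channels): each filter sees a single input channel.
    cheap = new_channels * dw_size * dw_size
    return primary + cheap


def standard_params(inp, oup, kernel_size=1):
    """Weights of a plain bias-free convolution with the same input/output sizes."""
    return inp * oup * kernel_size * kernel_size


# Illustrative sizes only: (n output maps, ratio s).
for n, s in [(64, 2), (100, 3)]:
    m, ghosts, total = ghost_channels(n, s)
    print(f"n={n}, s={s}: intrinsic={m}, ghost={ghosts}, concatenated={total}, kept={n}")

print("standard 3x3 conv, 64->128:", standard_params(64, 128, 3))
print("ghost module,      64->128:", ghost_params(64, 128, 3, ratio=2, dw_size=3))
```

When n is not divisible by s, the concatenation has slightly more than n channels, which is why the forward pass slices the output down to oup.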

1.2. Ghost Bottleneck

Ghost bottleneck. Left: Ghost bottleneck with stride=1. Right: Ghost bottleneck with stride=2.
  • The Ghost bottleneck is similar to the basic residual block in ResNet. It mainly consists of two stacked Ghost modules.
  • The first Ghost module acts as an expansion layer that increases the number of channels. The ratio of its output channels to its input channels is the expansion ratio.
  • The second Ghost module reduces the number of channels to match the shortcut path.
  • When stride=2, the shortcut path is implemented by a downsampling layer, and a depthwise convolution with stride=2 is inserted between the two Ghost modules.
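The flow through the bottleneck can be sketched as a shape trace in plain Python. This is only a bookkeeping aid under assumptions of mine: the function names, the 3×3 depthwise kernel, the example channel counts, and the use of a projection shortcut when stride=1 but the channel counts differ are not taken from the text above.

```python
def conv_out_size(size, kernel, stride):
    """Spatial size after a convolution with padding = kernel // 2."""
    pad = kernel // 2
    return (size + 2 * pad - kernel) // stride + 1


def ghost_bottleneck_shapes(in_ch, mid_ch, out_ch, size, stride=1, dw_kernel=3):
    """Trace (stage, channels, height/width) through a Ghost bottleneck."""
    trace = [("input", in_ch, size)]
    # First Ghost module: expansion layer, spatial size unchanged.
    trace.append(("ghost module 1 (expand)", mid_ch, size))
    if stride == 2:
        # Depthwise convolution with stride 2 between the two Ghost modules.
        size = conv_out_size(size, dw_kernel, stride)
        trace.append(("depthwise conv, stride 2", mid_ch, size))
    # Second Ghost module: reduce channels to match the shortcut path.
    trace.append(("ghost module 2 (reduce)", out_ch, size))
    # The shortcut must deliver the same shape so the two paths can be added.
    identity = stride == 1 and in_ch == out_ch
    label = "shortcut (identity)" if identity else "shortcut (downsample/projection, assumed)"
    trace.append((label, out_ch, size))
    return trace


# Hypothetical channel counts and resolution, for illustration only.
for stage, channels, hw in ghost_bottleneck_shapes(24, 72, 40, 56, stride=2):
    print(f"{stage:45s} channels={channels:3d} size={hw}x{hw}")
```

The point of the trace is that both paths must end with the same channel count and resolution, which is why the shortcut needs its own downsampling when stride=2.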

1.3. GhostNet

Overall architecture of GhostNet.
  • The model basically follows the architecture of MobileNetV3, replacing the bottleneck block in MobileNetV3 with the Ghost bottleneck.
  • All Ghost bottlenecks use stride=1, except the last one in each stage, which uses stride=2.
  • The squeeze-and-excitation (SE) module, as in SENet, is also applied to the residual layer in some Ghost bottlenecks.
  • Yet, the hard-Swish nonlinearity function as in MobileNetV3 is NOT used.
  • A width multiplier α is applied to scale the number of channels uniformly at each layer. GhostNet with width multiplier α is denoted as GhostNet-α×.
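As a rough illustration of the width multiplier, the sketch below scales a list of per-layer channel counts by α. Rounding to a multiple of 4 is an assumption borrowed from common MobileNet-style code, and the base channel list and helper names are hypothetical, chosen only to show the mechanics.

```python
def make_divisible(value, divisor=4):
    """Round to the nearest multiple of `divisor`, never dropping more than 10%."""
    rounded = max(divisor, int(value + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


def scale_channels(base_channels, alpha, divisor=4):
    """Apply the width multiplier alpha uniformly to every layer's channel count."""
    return [make_divisible(c * alpha, divisor) for c in base_channels]


# Hypothetical per-stage channel counts, for illustration only.
base = [16, 24, 40, 80, 112, 160]
for alpha in (0.5, 1.0, 1.3):
    print(f"GhostNet-{alpha}x:", scale_channels(base, alpha))
```

Since both the input and output channels of a layer scale with α, the cost of most layers changes roughly with α², which is what makes the multiplier a convenient knob for trading accuracy against FLOPs.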

2. Results

2.1. Ablation Study

The performance of the proposed Ghost module with different d on CIFAR-10.
  • s is the ratio that yields m=n/s intrinsic feature maps, and d×d is the kernel size of the linear operations (i.e. the size of the depthwise convolution filters) that compute the ghost feature maps.
  • s=2 and d is tuned in {1, 3, 5, 7}.

d=3 is the best.

The performance of the proposed Ghost module with different s on CIFAR-10.
  • d=3 and s is tuned in the range of {2, 3, 4, 5}.
  • Larger s leads to larger compression and speed-up ratio.

When s=2, which compresses VGG-16 by about 2×, the Ghost-module version performs even slightly better than the original model.
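Why a larger s gives a larger speed-up can be checked with a short calculation. The sketch below follows the paper's FLOPs analysis as I understand it: a standard convolution costs n·h'·w'·c·k·k, while a Ghost module costs (n/s)·h'·w'·c·k·k for the primary convolution plus (s−1)·(n/s)·h'·w'·d·d for the cheap operations. It assumes n is divisible by s, and the function name and the example values c=64, k=3 are my own.

```python
def theoretical_speedup(c, k, d, s):
    """FLOPs of a standard conv divided by FLOPs of a Ghost module.

    c: input channels, k: primary kernel size, d: cheap-op kernel size, s: ratio.
    The common factor n * h' * w' (output maps x output resolution) cancels.
    """
    standard = c * k * k
    ghost = (c * k * k) / s + (s - 1) * (d * d) / s
    return standard / ghost


# Sweep s over the range used in the ablation, with d=3 as selected above.
for s in (2, 3, 4, 5):
    print(f"s={s}: theoretical speed-up = {theoretical_speedup(c=64, k=3, d=3, s=s):.2f}x")
```

Because the depthwise term is small compared with the primary convolution when c is large, the ratio stays just below s.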

2.2. CIFAR-10

Comparison of state-of-the-art methods for compressing VGG-16 and ResNet-56 on CIFAR-10.

Using Ghost modules yields a smaller model size while maintaining accuracy.

2.3. ImageNet

Comparison of state-of-the-art methods for compressing ResNet-50 on ImageNet dataset.
  • ResNet-50 has about 25.6M parameters and 4.1B FLOPs with a top-5 error of 7.8%.

Ghost-ResNet-50 (s=2) obtains about 2× acceleration and compression, while maintaining the accuracy of the original ResNet-50.

Comparison of state-of-the-art small networks over classification accuracy, the number of weights and FLOPs on ImageNet dataset.
Top-1 accuracy v.s. FLOPs/Latency on ImageNet dataset.
  • GhostNet obtains about 0.5% higher top-1 accuracy than MobileNetV3 at the same latency, and GhostNet needs less runtime to achieve similar performance.
  • For example, GhostNet with 75.0% accuracy only has 40 ms latency, while MobileNetV3 with similar accuracy requires about 45 ms to process one image.

Overall, GhostNets generally outperform well-known state-of-the-art models, namely MobileNetV2, MobileNetV3, ProxylessNAS, FBNet, and MnasNet.

2.4. MS COCO

Results on MS COCO dataset.

With significantly lower computational costs, GhostNet achieves mAP similar to that of MobileNetV2 and MobileNetV3.


