CVNets: A library for training computer vision networks

Computer Vision | Poshan Pandey

CVNets is a high-performance open-source library for training deep neural networks for visual recognition tasks, including classification, detection, and segmentation. CVNets supports image and video understanding tools, including data loading, data transformations, novel data sampling methods, and implementations of several standard networks with similar or better performance than previous studies.

With the rise of deep learning, significant progress has been made in visual understanding tasks, including novel light- and heavy-weight architectures, dedicated hardware and software stacks, advanced data augmentation methods, and better training recipes. Several popular libraries provide implementations for different tasks and input modalities, including Torchvision, TensorflowLite, timm, and PyTorchVideo. Many of these libraries are modular, are designed around a particular task and input modality, and provide implementations and pre-trained weights of different networks with varying performance. However, reproducibility varies across these libraries. For example, the Torchvision library uses advanced training recipes (e.g., better augmentation) to achieve the same performance for training MobileNetv3 on the ImageNet dataset [3] as TensorflowLite achieves with simple training recipes.

We introduce CVNets, a PyTorch-based deep learning library for training computer vision models with higher performance. With CVNets, we enable researchers and practitioners in academia and industry to train either novel or existing deep learning architectures with high performance across different tasks and input modalities. CVNets is a modular and flexible framework that aims to train deep neural networks faster with simple or advanced training recipes. Simple recipes are useful for research in resource-constrained environments because they train models for fewer epochs with basic data augmentation (random resized crop and flipping). Advanced training recipes, by contrast, train a model for 2-4× longer with advanced augmentation methods (e.g., CutMix and MixUp).

With simple recipes (similar to the ones in the original publications) and the variable batch sampler, CVNets improve the performance of ResNet-101 significantly on the ImageNet dataset. With advanced training recipes at the same batch size and number of epochs, CVNets deliver similar performance to previous methods while requiring 1.3× fewer optimization updates.

Figure: CVNets can be used to improve the performance of different deep neural networks on the ImageNet dataset significantly with simple training recipes (e.g., random resized cropping and horizontal flipping). The official MobileNetv1 and ResNet results are from TensorflowLite and Torchvision, respectively.

CVNets Design Principles

Modularity: CVNets provide independent components, allowing users to plug and play different components across different visual recognition tasks for both research and production use cases. CVNets implement different components, including datasets and models for different tasks and input modalities, independently. For example, different classification backbones (e.g., ResNet-50) trained in CVNets can be seamlessly integrated with object detection (e.g., SSD) or semantic segmentation (e.g., DeepLabv3) pipelines for studying the generic nature of an architecture.
Flexibility: With CVNets, we would like to enable new use cases in research as well as production. We designed CVNets so that new components (e.g., models, datasets, loss functions, data samplers, and optimizers) can be integrated easily. We achieve this by registering each component. As an example, the ADE20k dataset for the task of segmentation is registered in CVNets as @register_dataset(name="ade20k", task="segmentation"). To use this dataset for training, one can use dataset.name and dataset.category as command-line arguments.
Reproducibility: CVNets provide reproducible implementations of standard models for different computer vision tasks. Each model is benchmarked against the performance reported in original publications as well as the previous best reproduction studies. The pre-trained weights of each model are released online to enable future research.
Compatibility: CVNets are compatible with hardware-accelerated frameworks (e.g., CoreML) and domain-specific libraries (e.g., PyTorchVideo). Models from domain-specific libraries can be easily consumed in CVNets (shown in Listing 1 of the original paper), reducing researchers' overhead in implementing new components or sub-modules in CVNets.
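
The registration mechanism described under Flexibility can be illustrated with a minimal Python sketch. This is not the CVNets implementation: the registry dictionary `DATASET_REGISTRY`, the helper `build_dataset`, the placeholder class `ADE20KDataset`, and the body of `register_dataset` are assumptions made for illustration. Only the decorator name and its `name` and `task` arguments come from the text above.

```python
from typing import Callable, Dict, Tuple, Type

# Hypothetical registry: maps (name, task) to a dataset class.
DATASET_REGISTRY: Dict[Tuple[str, str], Type] = {}


def register_dataset(name: str, task: str) -> Callable[[Type], Type]:
    """Return a class decorator that records the class under (name, task)."""

    def decorator(cls: Type) -> Type:
        key = (name, task)
        if key in DATASET_REGISTRY:
            raise ValueError(f"Dataset {key} is already registered")
        DATASET_REGISTRY[key] = cls
        return cls

    return decorator


def build_dataset(name: str, task: str, **kwargs):
    """Look up a registered dataset class and instantiate it (hypothetical helper)."""
    key = (name, task)
    if key not in DATASET_REGISTRY:
        raise KeyError(f"No dataset registered for name={name!r}, task={task!r}")
    return DATASET_REGISTRY[key](**kwargs)


@register_dataset(name="ade20k", task="segmentation")
class ADE20KDataset:
    """Placeholder dataset; a real one would load images and masks."""

    def __init__(self, root: str = "data/ade20k") -> None:
        self.root = root


# A training script could resolve the class from two string arguments.
dataset = build_dataset("ade20k", "segmentation", root="/tmp/ade20k")
print(type(dataset).__name__, dataset.root)  # prints: ADE20KDataset /tmp/ade20k
```

With a registry of this kind, adding a new dataset only requires defining and decorating a class; the code that selects a dataset from command-line strings does not change.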

CVNets Library Components

CVNets include efficient data sampling and training methods, in addition to standard components (e.g., optimizers; see Standard Components), which are discussed below.

Data Samplers: CVNets offer data samplers with three sampling strategies:

Single-scale with fixed batch size (SSc-FBS): This method is the default sampling strategy in most deep learning frameworks (e.g., PyTorch, Tensorflow, and MXNet) and libraries built on top of them (e.g., the timm library [21]). At the 𝑡-th training iteration, this method samples a batch of 𝑏 images per GPU with a pre-defined spatial resolution of height 𝐻 and width 𝑊.
Multi-scale with fixed batch size (MSc-FBS): The SSc-FBS method allows a network to learn representations at a single scale (or resolution). However, objects in the real world appear at different scales. To allow a network to learn representations at multiple scales, MSc-FBS extends SSc-FBS to multiple scales [16]. Unlike the SSc-FBS method, which takes a pre-defined spatial resolution as an input, this method takes a sorted set of 𝑛 spatial resolutions S = {(𝐻1,𝑊1), (𝐻2,𝑊2), ..., (𝐻𝑛,𝑊𝑛)} as an input. At the 𝑡-th iteration, this method randomly samples 𝑏 images per GPU of spatial resolution (𝐻𝑡,𝑊𝑡) ∈ S.
Multi-scale with variable batch size (MSc-VBS): Networks trained using the MSc-FBS method are more robust to scale changes than those trained with SSc-FBS [13]. However, depending on the maximum spatial resolution in S, MSc-FBS may have a higher peak GPU memory utilization (see Figure 2c in the paper) than SSc-FBS, causing out-of-memory errors on GPUs with limited memory. For example, MSc-FBS with S = {(128, 128), (192, 192), (224, 224), (320, 320)} and 𝑏 = 256 would need about 2× more GPU memory (for images only) than SSc-FBS with a spatial resolution of (224, 224) and 𝑏 = 256, because a 320 × 320 image has about twice as many pixels as a 224 × 224 image. To address this memory issue, we extended MSc-FBS to variable batch sizes in our previous work [13]. For a given sorted set of spatial resolutions S = {(𝐻1,𝑊1), (𝐻2,𝑊2), ..., (𝐻𝑛,𝑊𝑛)} and a batch size 𝑏 for the maximum spatial resolution (𝐻𝑛,𝑊𝑛), a spatial resolution (𝐻𝑡,𝑊𝑡) ∈ S with a batch size of 𝑏𝑡 = (𝐻𝑛𝑊𝑛𝑏) / (𝐻𝑡𝑊𝑡) is sampled randomly at the 𝑡-th training iteration on each GPU.
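
The batch-size rule for MSc-VBS can be checked with a few lines of Python. The sketch below is an illustration only, not the CVNets sampler: the function names `variable_batch_size` and `sample_msc_vbs` are hypothetical, and rounding the batch size down to an integer is an assumption, since the text does not say how fractional values are handled.

```python
import random
from typing import List, Tuple

Resolution = Tuple[int, int]  # (height, width)


def variable_batch_size(res: Resolution, max_res: Resolution, base_batch: int) -> int:
    """b_t = (H_n * W_n * b) / (H_t * W_t), rounded down (assumption), at least 1."""
    h_t, w_t = res
    h_n, w_n = max_res
    return max(1, (h_n * w_n * base_batch) // (h_t * w_t))


def sample_msc_vbs(
    resolutions: List[Resolution], base_batch: int, rng: random.Random
) -> Tuple[Resolution, int]:
    """Pick a random resolution from the sorted set and its matching batch size."""
    max_res = resolutions[-1]  # the set is sorted, so the last entry is the largest
    res = rng.choice(resolutions)
    return res, variable_batch_size(res, max_res, base_batch)


scales = [(128, 128), (192, 192), (224, 224), (320, 320)]
base_batch = 256  # batch size at the largest resolution

# Batch sizes per resolution: 1600, 711, 522, 256
for res in scales:
    print(res, variable_batch_size(res, scales[-1], base_batch))

# One sampling step, as would happen at a single training iteration on one GPU.
rng = random.Random(0)
res, batch = sample_msc_vbs(scales, base_batch, rng)

# Pixel-count ratio behind the "about 2x" memory figure for MSc-FBS.
memory_ratio = (320 * 320) / (224 * 224)
print(round(memory_ratio, 2))  # prints: 2.04
```

Under this rule, every batch holds roughly the same number of pixels, so peak memory for the input images is bounded by the batch at the largest resolution.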

Sample Efficient Training: Previous works remove and re-weight data samples to reduce optimization updates (or the number of forward passes) at the cost of performance degradation. Moreover, these methods are compute- and memory-intensive and do not scale well to large models and datasets. This work aims to reduce optimization updates with minimal or no performance degradation.

Models learn quickly during the initial phase of training. In other words, models can accurately classify many training data samples during earlier epochs, and such samples do not contribute much to learning. Therefore, a natural question arises: can we remove such samples to reduce total optimization updates? CVNets answer this question with a simple heuristic method, which we call Sample Efficient Training (SET). At each epoch, SET categorizes each sample as either hard or easy: if the model predicts the training data sample correctly with a confidence greater than a pre-defined threshold 𝜏, then it is an easy sample and we remove it from the training data. At each epoch, the model trains only on hard samples. Because of randomness in training due to data augmentation (e.g., random cropping), the region of interest corresponding to the object category may be only partially present (or absent) in the model's input, so samples classified as easy earlier may be classified as hard during later training stages. SET adds such samples back to the training data (see Figure 3c in the paper).

Figure 3 in the paper shows results for ResNet-50 trained with and without SET using MSc-VBS. ResNet-50 without SET requires 22% more optimization updates while delivering similar performance, demonstrating the effectiveness of SET on top of MSc-VBS. Note that SET has an overhead, so the reduction in optimization updates does not translate to a reduction in training time. We believe SET can serve as a baseline in this direction and inspire future research to improve training speed while maintaining performance.
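
The easy/hard rule in SET can be sketched as follows. This is a simplified illustration under stated assumptions, not the CVNets code: the function names `is_easy` and `select_hard_samples` are hypothetical, the per-sample class probabilities are invented toy numbers, and the sketch assumes that predictions for every training sample are available at each epoch so that samples can move back from easy to hard.

```python
from typing import Dict, List, Sequence


def is_easy(probs: Sequence[float], label: int, tau: float) -> bool:
    """A sample is easy if the prediction is correct with confidence above tau."""
    predicted = max(range(len(probs)), key=lambda i: probs[i])
    return predicted == label and probs[predicted] > tau


def select_hard_samples(
    predictions: Dict[int, Sequence[float]], labels: Dict[int, int], tau: float
) -> List[int]:
    """Return the ids of samples the model should train on in this epoch."""
    return [
        sample_id
        for sample_id, probs in predictions.items()
        if not is_easy(probs, labels[sample_id], tau)
    ]


labels = {0: 0, 1: 0, 2: 0, 3: 1}
tau = 0.7  # pre-defined confidence threshold (toy value)

# Toy class probabilities for four samples in an early epoch.
epoch_1 = {0: [0.95, 0.05], 1: [0.60, 0.40], 2: [0.10, 0.90], 3: [0.20, 0.80]}
hard_1 = select_hard_samples(epoch_1, labels, tau)
print(hard_1)  # prints: [1, 2]

# Later epoch: a different random crop makes sample 0 uncertain again,
# so it is classified as hard and returns to the training set.
epoch_2 = {0: [0.55, 0.45], 1: [0.90, 0.10], 2: [0.80, 0.20], 3: [0.25, 0.75]}
hard_2 = select_hard_samples(epoch_2, labels, tau)
print(hard_2)  # prints: [0]
```

Scoring every sample at each epoch is what lets easy samples return to the training set, and it is consistent with the overhead noted above: fewer optimization updates do not by themselves mean less training time.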

Standard Components: CVNets support different tasks (e.g., image classification, detection, segmentation), data augmentation methods (e.g., flipping, random resized crop, RandAug, and CutMix), datasets (e.g., ImageNet1k/21k for image classification, Kinetics-400 for video classification, MS-COCO for object detection, and ADE20k for segmentation), optimizers (e.g., SGD, Adam, and AdamW), and learning rate annealing methods (e.g., fixed, cosine, and polynomial).
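
For readers unfamiliar with the three learning rate annealing methods named above, the sketch below shows their common textbook forms. It is an assumption that CVNets uses these formulas; the function names, the `min_lr` and `power` parameters, and the absence of warm-up are illustrative choices, not the library's API.

```python
import math


def fixed_lr(step: int, total_steps: int, max_lr: float) -> float:
    """Fixed schedule: the learning rate never changes."""
    return max_lr


def cosine_lr(step: int, total_steps: int, max_lr: float, min_lr: float = 0.0) -> float:
    """Cosine annealing from max_lr down to min_lr over total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def polynomial_lr(
    step: int, total_steps: int, max_lr: float, min_lr: float = 0.0, power: float = 2.0
) -> float:
    """Polynomial decay from max_lr down to min_lr over total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + (max_lr - min_lr) * (1.0 - progress) ** power


total = 100
for step in (0, 50, 100):
    print(
        step,
        round(fixed_lr(step, total, 0.1), 4),
        round(cosine_lr(step, total, 0.1), 4),
        round(polynomial_lr(step, total, 0.1), 4),
    )
```

All three schedules start at `max_lr`; the cosine and polynomial schedules reach `min_lr` at the final step, while the fixed schedule stays constant.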

Benchmarks: CVNets support different visual recognition tasks, including classification, detection, and segmentation. We provide comprehensive benchmarks for standard methods along with pre-trained weights.

Classification on the ImageNet dataset. CVNets implement popular light- and heavy-weight image classification models. The performance of some of these models on the ImageNet dataset is shown in Table 1 of the paper. With CVNets, we are able to achieve better performance (e.g., MobileNetv1/v2) or similar performance (ResNet-50/101) with fewer optimization updates (faster training).
Similar to image classification, CVNets can be used to train standard detection and segmentation models with better performance. For example, SSD with a ResNet-101 backbone trained with CVNets at a resolution of 384 × 384 delivers a 1.6% better mAP than the same model trained at a resolution of 512 × 512. Similarly, on the task of semantic segmentation on the ADE20k dataset using DeepLabv3 with MobileNetv2 as the backbone, CVNets deliver 1.1% better performance than the MMSegmentation library with 2× fewer epochs and optimization updates. For more details, please see our benchmarking results at https://github.com/apple/ml-cvnets.

Installation

CVNets can be installed in a local Python environment using the commands below:

    git clone git@github.com:apple/ml-cvnets.git
    cd ml-cvnets
    pip install -r requirements.txt
    pip install --editable .

We recommend using Python 3.7+ and PyTorch (version >= 1.8.0) in a conda environment. For setting up a Python environment with conda, see here.

Getting Started

General instructions for working with CVNets are given here.
Examples for training and evaluating models are provided here.
Examples for converting a PyTorch model to CoreML are provided here.

References:

This article is based entirely on the original CVNets paper, which can be found here.
