How to Detect Persons on Bicycle, Motor or Scooter?

Yu Huang
32 min read · Feb 15, 2021

Outline

— — — —

1. Introduction

2. Multi-sensor data for 3-D object detection

3. Multi-scale aggregation structure

4. Attention mechanism

5. Spatial contextual modeling

6. Spatio-temporal contextual modeling

a. RNN/LSTM/GRU

b. Transformer

7. Human pose estimation

8. Human action/behavior understanding

9. Few/zero shot & long tailed/open set learning

10. Adversarial learning

11. Multi-label learning

12. Uncertainty modeling

13. Continual, lifelong or incremental learning

14. Active learning

15. Learning from noisy labels

16. Unsupervised/Self-supervised learning

17. Multi-task learning

18. Conclusion

References

1. Introduction

— — — — — — — —

Recognizing objects in a visual scene is an effortless task for humans, yet one that has challenged computer vision since its foundation. The advent of deep learning approaches [1] over the last decade has delivered major improvements on this task across all benchmarks, although some difficulties remain.

Object detection is a fundamental and challenging task in computer vision. State-of-the-art deep learning based object detection methods usually assume that training data and test data are drawn from an identical distribution, and are commonly classified as one-stage or two-stage approaches [2].

“Vulnerable road user” (VRU) is a term applied to those most at risk in traffic, i.e. those unprotected by an outside shield. Pedestrians, pedal cyclists, and scooter riders are accordingly considered vulnerable, since they benefit from little or no external protective device that would absorb energy in a collision. Some example images of VRUs from a web search are shown in Fig. 1.

In this article, we discuss how to implement a deep neural network (NN) detector for persons on a bicycle, motor, moped or scooter, addressing various technical ways to achieve the best performance for autonomous driving [3].

2. Multi-sensor data for 3-D object detection

— — — — — — — — — — — — — — — — — — — — — — — — —

A camera is a passive sensor that provides dense, high-resolution RGB information as images, but it is challenging to recover 3-D information from it due to the inherent shortcomings of computer vision, which is quite sensitive to illumination and occlusions. LiDAR is an active sensor that captures the scene’s 3-D point cloud (as well as reflectivity) directly; however, its 3-D data is mostly sparse and incomplete due to limitations of the laser source, scene materials (such as mirrors) and scanning distance. Point clouds have highly variable point density, which may make it difficult for a detector to find distant or small objects. Fig. 2 gives an example of a camera image and LiDAR point clouds in a street view.

Fig. 2 (a) Image from frontal camera
Fig. 2 (b) Point clouds from LiDAR

Different from 2-D detection, 3-D object detection outputs the object category, its 3-D bounding box (height, width, length, and centroid) and orientation angle.

3D object detection on LiDAR point clouds plays an important role in robotic perception and autonomous driving applications [4–5]. Although image and video based object detection has witnessed great improvements in recent years [6–7], including “pseudo-LiDAR” [8] obtained from depth maps inferred from RGB images, recognizing and locating 3D objects in point cloud data remains challenging due to the irregular and uneven distribution of the data points.

These 3D methods can also be divided into two categories: two-stage and one-stage methods. Two-stage methods generate and classify region proposals, typically using multi-view, point cloud segmentation or frustum representations. One-stage methods directly predict class probabilities and regress 3D bounding boxes with a single-stage network, applied to Bird’s Eye View (BEV), discretized voxel, or raw point cloud representations.

The problem of fusing features from different views is challenging, especially for LiDAR points and RGB images, as the features obtained from RGB images and from LiDAR points are represented in different perspectives. When the features from RGB images are projected onto 3D LiDAR coordinates, some useful spatial information about the objects might be lost, since this transformation is a one-to-many mapping.

Furthermore, the occlusions and illumination in RGB images from the camera view may also introduce interference that is harmful to the object detection task. Indeed, it has been difficult for LiDAR-camera fusion-based methods to surpass LiDAR-only methods in terms of performance. In addition, the large scale variations and occlusions in range view (compared to BEV) also introduce noise into the feature fusion process.

To detect a street person on a bicycle, motor, moped or scooter, camera images provide more information about texture and semantics, while LiDAR point clouds give more clues about shape and location.

3. Multi-scale aggregation structure

— — — — — — — — — — — — — — — — — — — —

Multi-scale aggregation structures in deep NN models are becoming popular for handling complicated object detection situations.

FPN (feature pyramid network) is one of the representative works in this direction. Specifically, FPN builds a feature pyramid upon the inherent feature hierarchy in a ConvNet by propagating the semantically strong features from high levels into features at lower levels. There are various modifications of FPN, like NAS-FPN, BiFPN and AugFPN etc. As an example, AugFPN [9] is introduced here. Built on the original FPN, it consists of three components: 1) consistent supervision, which narrows the semantic gaps between features of different scales before feature fusion; 2) residual feature augmentation, where ratio-invariant context information is extracted during feature fusion to reduce the information loss of the feature map at the highest pyramid level; 3) soft RoI selection, which adaptively learns a better RoI feature after feature fusion. Fig. 3 shows the AugFPN detector pipeline.

Fig. 3 AugFPN detector pipeline [9]

It is known that the multi-scale aggregation structure is very effective for small object detection. The task of detecting persons on a bike, motor or scooter clearly belongs to this topic, especially for the bike or scooter (a footboard mounted on two wheels with a long steering handle), which is thin and small in shape and size.
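To make the top-down fusion concrete, below is a minimal sketch of an FPN-style pyramid in PyTorch; it is not the AugFPN implementation, and the channel widths, strides and module names are assumptions chosen only for illustration. Small riders mostly live on the finer pyramid levels, which is why this kind of fusion matters for the thin bike/scooter case.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal FPN-style top-down fusion (illustrative sketch, not AugFPN)."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project each backbone stage to a common channel width.
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 convs smooth each merged map.
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):
        # feats: backbone maps ordered from fine (large) to coarse (small) resolution.
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Propagate semantically strong coarse features down into the finer levels.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(x) for s, x in zip(self.smooth, laterals)]

# Dummy backbone outputs at strides 4, 8, 16, 32 for a 256x256 input.
feats = [torch.randn(1, c, 256 // s, 256 // s)
         for c, s in zip((256, 512, 1024, 2048), (4, 8, 16, 32))]
pyramid = SimpleFPN()(feats)  # four maps, each with 256 channels
```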

4. Attention mechanism

— — — — — — — — — — — — —

Attention has become enormously popular within the Artificial Intelligence (AI) community as an essential component of neural architectures for a remarkably large number of applications in Natural Language Processing, Speech and Computer Vision [10].

The intuition behind attention can be best explained using human biological systems. For example, our visual processing system tends to focus selectively on some parts of the image, while ignoring other irrelevant information in a manner that can assist in perception. The attention model incorporates relevance by allowing the model to dynamically pay attention to only certain parts of the input that help in performing the task at hand effectively.

CBAM (Convolutional Block Attention Module) [11] is a simple yet effective attention module for feed-forward CNNs. Given an intermediate feature map, CBAM sequentially infers attention maps along two separate dimensions, channel and spatial, and the attention maps are then multiplied with the input feature map for adaptive feature refinement. It can be integrated into any CNN architecture and is end-to-end trainable along with the base CNN. Fig. 4 is the diagram of both the channel attention and spatial attention modules.

Fig. 4 Diagram of each attention module [11]
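A minimal PyTorch sketch of the two CBAM sub-modules is given below, following the description above (global average/max pooling plus a shared MLP for the channel branch, and a 7x7 convolution over pooled channel maps for the spatial branch); the layer sizes and reduction ratio are assumptions for illustration, not the authors' exact code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))     # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))      # global max pooling branch
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)      # channel-wise average map
        mx, _ = x.max(dim=1, keepdim=True)     # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAMBlock(nn.Module):
    """Sequential channel-then-spatial attention refinement of a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)

refined = CBAMBlock(256)(torch.randn(2, 256, 32, 32))
```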

The Transformer is a type of deep neural network mainly based on the self-attention mechanism [12]. It consists of a set of encoders and decoders, each composed of a stack of layers. Each encoder has two major components: (1) a self-attention layer and (2) a feed-forward neural network. The decoder has two similar components plus an additional one: (1) a self-attention layer, (2) an encoder-decoder attention layer, and (3) a feed-forward neural network.
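For illustration, PyTorch's built-in layers already express this encoder structure (self-attention followed by a feed-forward network, stacked several times); the dimensions below are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# One encoder layer: multi-head self-attention followed by a feed-forward network.
encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, dim_feedforward=1024)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

tokens = torch.randn(100, 2, 256)   # (sequence length, batch, embedding dim)
memory = encoder(tokens)            # contextualized token features, same shape
```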

Inspired by the strong representation ability of transformer, researchers propose to extend transformer for computer vision tasks. Transformer-based models show competitive and even better performance on various visual benchmarks compared to other network types such as convolutional networks and recurrent networks.

The attention mechanism (including the Transformer) has been used in object detection. Intuition tells us that it should work well for the detection of persons on a bike, motor, moped or scooter, because we expect the attention-based NN model to learn the specific focus over feature maps that is embedded in the visual layout.

5. Spatial contextual modeling

— — — — — — — — — — — — — — — — —

A notable feature of our visual sensory system is its ability to exploit contextual cues present in a scene, enhancing our perception and understanding of the image. Biederman [13] groups relationships between an object and its surroundings into five classes: interposition (spatially, objects interrupt their background), support (spatially, objects often rest on surfaces) and position (given an object in a scene, spatially it is often found in some positions but not others), probability (semantically, objects tend to be found in some environments but not others), and size (in scale, objects have a limited set of sizes relative to other objects).

Contextual information is defined [14] as any data obtained from an object’s own statistical property and/or from its vicinity, including intraclass and inter-class details. It is a tool used more with multiple objects so that relationships among objects can be deeply understood [15].

One way is to use the attention mechanism to obtain global context information. Another way, specific to two-stage detection methods, is to expand the object proposal by a few pixels when cropping the target from the feature map, in order to obtain more surrounding information [16].
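A hedged sketch of this second idea is shown below: each proposal is enlarged by a margin before its features are cropped with torchvision's RoIAlign. The 10% margin, the stride-8 feature map and the example box are assumptions for illustration.

```python
import torch
from torchvision.ops import roi_align

def expand_boxes(boxes, ratio=0.1, image_size=(720, 1280)):
    """Enlarge proposals (x1, y1, x2, y2) by a margin to include surrounding context."""
    h, w = image_size
    cx, cy = (boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2
    bw = (boxes[:, 2] - boxes[:, 0]) * (1 + ratio)
    bh = (boxes[:, 3] - boxes[:, 1]) * (1 + ratio)
    x1, y1 = (cx - bw / 2).clamp(0, w - 1), (cy - bh / 2).clamp(0, h - 1)
    x2, y2 = (cx + bw / 2).clamp(0, w - 1), (cy + bh / 2).clamp(0, h - 1)
    return torch.stack([x1, y1, x2, y2], dim=1)

feat = torch.randn(1, 256, 90, 160)                    # stride-8 feature map
boxes = torch.tensor([[100., 200., 180., 420.]])       # one cyclist proposal (pixels)
rois = torch.cat([torch.zeros(1, 1), expand_boxes(boxes)], dim=1)  # prepend batch index
crop = roi_align(feat, rois, output_size=(7, 7), spatial_scale=1 / 8)
```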

Assembling global context information about the target can enable the neural network to learn more about the relationship between the foreground and the background (extended to scene parsing or semantic segmentation), so that it can rely on this latent relationship to highlight and identify the target.

PyramidBox [17] introduces a context anchor to supervise high-level contextual feature learning with a semi-supervised method, called PyramidAnchors. Next, a Low-level Feature Pyramid Network (LFPN) is designed to combine adequate high-level contextual semantic features and low-level object features (for instance, a face is surrounded by head, shoulders and body) together, which also allows PyramidBox to predict objects of all scales in a single shot. Fig. 5 illustrates the PyramidBox loss, which takes multiple objects into account.

Fig. 5 PyramidBox Loss [17]

For a detector of persons on a bike, motor, moped or scooter, the contextual information of multiple objects is also specific, i.e. person, scooter/bike/moped/motor, steering handle and/or helmet. Intra-class variations are taken into account, for example, kids or adults, kick or electric scooters. Globally, the background is also a helpful context, i.e. a road in an urban street, an enclosed highway or the countryside.

6. Spatio-temporal contextual modeling

— — — — — — — — — — — — — — — — — — — — — — —

The vast majority of algorithms only model single frame data, ignoring the temporal information of the sequence of data. Utilizing contextual information from adjacent frames of point cloud video to enhance the target frame is a promising direction to solve the prominent data sparsity issue.

a. RNN/LSTM/GRU

Follow-up works take spatio-temporal correlations among consecutive frames into account, based on the Recurrent Neural Network (RNN) or its variants, i.e. GRU and LSTM. LSTMs and GRUs were created as a solution to short-term memory. They have internal mechanisms called gates that can regulate the flow of information. These gates can learn which data in a sequence is important to keep or throw away, and thereby pass relevant information down a long chain of sequence steps to make predictions.
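As a minimal sketch of this idea (assuming per-frame features, e.g. pooled BEV descriptors, have already been extracted), a GRU can aggregate a short sequence of LiDAR sweeps into one temporally fused representation; the dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TemporalAggregator(nn.Module):
    """Fuse per-frame feature vectors with a GRU; the last hidden state summarizes the clip."""
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim), e.g. pooled BEV features per sweep
        _, h_n = self.gru(frame_feats)
        return h_n.squeeze(0)          # (batch, hidden_dim)

clip = torch.randn(4, 5, 256)          # 4 samples, 5 consecutive LiDAR sweeps each
fused = TemporalAggregator()(clip)
```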

All of these spatio-temporal methods applied to the LiDAR sensor treat the information from all point cloud frames equally and integrate them together to get the final representation of the target frame, which is not precise and introduces irrelevant information for two reasons. First, consecutive LiDAR data are highly redundant due to the high sampling frequency (e.g., 20 point cloud frames per second with a 32-beam LiDAR sensor). Second, LiDAR also records massive information about the surrounding environment rather than only the objects of interest. Thus, achieving a dense yet precise representation of the target frame based on the adjacent frames is an indispensable and critical problem for accurate LiDAR-based 3D object detection.

b. Transformer

The Transformer is a novel architecture for learning long-range sequential dependencies, which abandons the traditional building style of directly using RNN or LSTM architectures. It has been successfully applied to numerous natural language processing (NLP) tasks, such as machine translation and speech recognition.

In [18] a Temporal-Channel Transformer (TCTR) module is proposed to reconstruct the target frame with both intra-frame and inter-frame relevant information in a fine-grained voxel-wise manner. The basic idea is to use the attention mechanism to find and integrate the useful information from correlated voxels in the target frame (intra-frame) and correlated frames in the input LiDAR video (inter-frame) for each voxel of the target frame.

Instead of deploying shallow attention layers, [18] adapts the Transformer to 3D LiDAR video data analysis by taking each channel of the compressed input frames as a Transformer encoder node and each voxel of the target frame as a Transformer decoder node, which makes it possible to exploit diverse and complex spatial, temporal, and channel correlations across the whole set of input frames. Fig. 6 shows the object detector flowchart and the TCTR network structure.

Fig. 6 TCTR network structure [18]

While the learned representation from TCTR is dense and only integrates information relevant to the target frame, it still contains object-irrelevant information, as the sparse target frame does, which harms detection performance. In [18], this problem is solved by combining the dense and sparse representations of the target frame into the final representation with a gating mechanism, which controls the information flow and consistently refines the final representation by removing object-irrelevant information.
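A simplified sketch of such a gating step is shown below; it only illustrates the general gated-fusion idea, not the exact TCTR implementation, and the channel counts and map sizes are assumptions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Blend a dense aggregated representation with the sparse target-frame features."""
    def __init__(self, channels=128):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, dense, sparse):
        # dense, sparse: (batch, channels, H, W) BEV feature maps of the target frame
        g = torch.sigmoid(self.gate(torch.cat([dense, sparse], dim=1)))
        return g * dense + (1 - g) * sparse   # the gate controls the information flow

fused = GatedFusion()(torch.randn(1, 128, 200, 200), torch.randn(1, 128, 200, 200))
```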

Besides spatial contextual information, temporal context also exists in the detection of persons on a bike, moped, motor or scooter. Apparently, the body movements of pedestrians, bicyclists and motorcyclists are different. Meanwhile, pixel flow (images) or scene flow (point clouds) contributes to the detection process.

7. Human pose estimation

— — — — — — — — — — — — — —

Human pose estimation aims to locate the human body parts and build human body representation (e.g., body skeleton) from input data such as images and videos.

2D pose estimation is easily achievable, and high performance has been reached for single-person pose estimation using deep learning techniques in a top-down or bottom-up pipeline. More recently, attention has been paid to highly occluded multi-person scenarios in complex scenes. In contrast, for 3D pose estimation, obtaining accurate 3D pose annotations is much more difficult than for its 2D counterpart. When multiple viewpoints are available or LiDAR is deployed, 3D pose estimation can become a well-posed problem employing sensor fusion techniques [19].

Most 3D body shape estimation methods from depth/RGB images mainly focus on utilizing temporal information to build point correspondences between consecutive frames and recover the 3D model of each frame with these correspondences. Instead, in [19], point clouds are used to estimate sequential 3D human shape by predicting the vertex coordinates of multi-resolution 3D body meshes with deep learning.

Heatmap regression has become the most prevalent choice for pose estimation methods. However, bottom-up methods need to handle a large variance of human scales and labeling ambiguities. In [20], the scale-adaptive heatmap regression (SAHR) method is proposed, which can adaptively adjust the standard deviation for each keypoint. A weight-adaptive heatmap regression (WAHR) is further introduced to help balance the foreground and background samples, which largely improves the accuracy of bottom-up human pose estimation. Fig. 7 illustrates the flowchart of how the pose estimation NN model is trained and tested.

Fig. 7 Flowchart of the pose estimation model training and testing [20]
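To illustrate the scale-adaptive idea, the sketch below renders Gaussian keypoint heatmaps whose standard deviation varies per keypoint (e.g. tighter peaks for small, distant riders); the coordinates and sigmas are arbitrary assumptions, not the SAHR formulation itself.

```python
import torch

def keypoint_heatmap(xy, sigma, size=(128, 128)):
    """Render one Gaussian heatmap; sigma can differ per keypoint (scale-adaptive idea)."""
    h, w = size
    ys = torch.arange(h).view(h, 1).float()
    xs = torch.arange(w).view(1, w).float()
    d2 = (xs - xy[0]) ** 2 + (ys - xy[1]) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

small_person = keypoint_heatmap(torch.tensor([40., 60.]), sigma=1.5)  # tighter peak
large_person = keypoint_heatmap(torch.tensor([90., 70.]), sigma=4.0)  # broader peak
```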

Body pose, as one of the person attributes, provides additional information for the detection of persons on a bike, moped, motor or scooter. Just by looking at human skeleton figures from motion capture data, we can roughly recognize people's activities. Therefore, human pose estimation definitely helps in person detection.

8. Human action/behavior understanding

— — — — — — — — — — — — — — — — — — — — — — — -

Through the human vision system, we can understand the actions and purposes of other humans. We can easily tell that a person is exercising, and we could guess with a certain confidence whether the person's action complies with an instruction or not. The term “human action/behavior” studied in computer vision research ranges from simple limb movement to joint complex movement of multiple limbs and the human body [21].

A typical action/behavior understanding flowchart generally contains two major components: action/behavior representation and classification. The action representation component basically converts an action video into a feature vector or a series of vectors, and the action classification component infers an action label from the vector. Recently, deep networks have merged these two components into a unified end-to-end trainable framework, which further enhances classification performance in general.

Action or motion prediction approaches reason about the future and infer labels before action execution ends. These labels could be discrete action categories, or continuous positions on a motion trajectory. The capability of making a prompt reaction makes action/motion prediction approaches more appealing in time-sensitive tasks. As a matter of fact, a joint perception and prediction deep NN model, called MotionNet, has been designed in [22]. It takes a sequence of LiDAR sweeps as input and outputs a BEV map, which encodes the object category and motion information in each grid cell. The backbone is a spatio-temporal pyramid network, which extracts deep spatial and temporal features in a hierarchical fashion. To enforce the smoothness of predictions over both space and time, the training is further regularized with spatial and temporal consistency losses. Fig. 8 illustrates an example of object detection (a disabled person on a wheelchair, a class not included in the training data) by MotionNet, which also yields the motion prediction.

Fig. 8 Object (person on a wheelchair) perceiving and motion forecasting by MotionNet [22]
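As a hedged illustration of the consistency idea (not MotionNet's exact losses), a temporal smoothness regularizer on predicted BEV motion fields could simply penalize abrupt changes between consecutive predictions:

```python
import torch

def temporal_consistency_loss(motion_t, motion_t1):
    """Penalize abrupt changes between motion fields predicted for consecutive frames.
    motion_t, motion_t1: (batch, 2, H, W) per-cell displacement predictions."""
    return (motion_t - motion_t1).abs().mean()

loss = temporal_consistency_loss(torch.randn(1, 2, 256, 256), torch.randn(1, 2, 256, 256))
```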

The current autonomous driving system takes human action prediction into account for planning and control, which are the downstream modules of perception and prediction. It is apparent that motion forecasting and perception (detection) are complementary to each other in autonomous driving. Person detection and behavior or action understanding/prediction are not performed in a sequential pipeline, but in parallel.

9. Few/zero shot & long tailed/open set learning

— — — — — — — — — — — — — — — — — — — — — — — —

AI succeeds in data-intensive applications, but it lacks the ability to learn from a limited number of examples. To tackle this problem, few-shot learning has been proposed; it can rapidly generalize to new tasks with limited supervised experience using prior knowledge [23]. As a learning paradigm, many methods endeavor to solve it, such as meta-learning (learning-to-learn) methods [24], embedding learning methods and generative modeling methods.

Note: Transfer learning transfers knowledge learned from the source domain and source task where sufficient training data is available, to the target domain and target task where training data is limited. Domain adaptation is a type of transfer learning problem, where the tasks are the same but the domains are different.

The method in [25] leverages fully labeled base classes and quickly adapts to novel classes, using a meta feature learner and a reweighting module within a one-stage detection architecture. The feature learner extracts meta features that are generalizable for detecting novel classes, using training data from base classes with sufficient samples. The reweighting module transforms a few support examples from the novel classes into a global vector that indicates the importance or relevance of the meta features for detecting the corresponding objects. These two modules, together with a detection prediction module, are trained end-to-end based on an episodic few-shot learning scheme and a carefully designed loss function. Fig. 9 is the architecture of the few-shot detection model.

Fig. 9 Architecture of few shot detection model [25]
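A minimal sketch of the reweighting idea follows: per-class vectors predicted from support images rescale the channels of the query meta features, yielding class-specific feature maps. The shapes, class choices and function names are assumptions for illustration, not the exact model in [25].

```python
import torch

def reweight_features(meta_feats, class_vectors):
    """Channel-wise reweighting of query meta features by per-class vectors.
    meta_feats: (batch, C, H, W); class_vectors: (num_classes, C) from support images."""
    b, c, h, w = meta_feats.shape
    n = class_vectors.shape[0]
    w_cls = class_vectors.view(1, n, c, 1, 1)
    return meta_feats.unsqueeze(1) * w_cls      # (batch, num_classes, C, H, W)

feats = torch.randn(2, 256, 32, 32)
vectors = torch.randn(3, 256)                   # e.g. cyclist, scooter rider, motorcyclist
class_specific = reweight_features(feats, vectors)
```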

Sometimes, unseen person-related classes occur, and we need to apply the few-shot learning scheme to handle them using prior knowledge.

Zero-shot learning aims to recognize objects whose instances may not have been seen during training. Recent advances directly learn a mapping from an image feature space to a semantic space [26]. Other zero-shot learning approaches learn non-linear multimodal embeddings. In practice, some form of side information is required to share information between classes, so that the knowledge learned from seen classes can be transferred to unseen classes.

In [27] a zero-shot object detection algorithm called “Don’t Even Look Once” (DELO) is proposed, which synthesizes visual features for unseen objects and augments existing training algorithms to incorporate unseen object detection. As shown in Fig. 10, (a) illustrates the seen/unseen classes and their semantic descriptions; (b) shows that a vanilla detector trained using seen objects only tends to relegate the confidence scores of unseen objects; (c) is the proposed DELO: first a visual feature generator is trained on a pool of visual features of foreground/background objects and their semantics with a balanced ratio; then it is used to synthesize visual features for unseen objects; finally, the synthesized visual features are added back to the pool and the confidence predictor module of the vanilla detector is re-trained, so that the re-trained confidence predictor can be plugged back into the detector to detect unseen objects.

Fig. 10 Illustration of the zero shot object detector DELO [27]

Unseen classes related to human events are unavoidable in the perception process of autonomous driving. Zero-shot learning is a strong tool for handling person-related visual categories, by learning the feature-semantic mapping and transferring knowledge.

Real scene datasets naturally exhibit imbalanced and long-tailed distributions, where a few categories (majority categories) occupy most of the data while most categories (minority categories) are under-represented. CNNs trained on these long-tailed datasets deliver poor recognition accuracy, especially for the under-represented minority categories.

Various methods, e.g. metric learning, meta learning and knowledge transfer, have been successfully explored for long-tailed recognition. Apart from these methods, existing training tricks also play a major role in long-tailed visual recognition [28]; they make simple refinements to the vanilla training procedure, such as adjustments in loss functions or data sampling strategies, including resampling, reweighting, mix-up training, and two-stage training etc.

In [29] a balanced group softmax (BAGS) module is proposed for balancing the classifiers within the detection frameworks through group-wise training. It implicitly modulates the training process for the head and tail classes and ensures they are both sufficiently trained, without requiring any extra sampling for the instances from the tail classes. Fig. 11 illustrates the framework of the balanced group softmax module.

Fig. 11 The framework of the balanced group softmax module [29]
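Below is a simplified, hedged sketch of the group-wise softmax idea: classes are partitioned by training-instance counts, and each sample is classified only against the other classes in its own group, so head classes do not dominate the normalization for tail classes. It omits the "others" categories and other details of the full BAGS module; the grouping and sizes are assumptions.

```python
import torch
import torch.nn.functional as F

def group_softmax_loss(logits, labels, groups):
    """Simplified group-wise softmax cross-entropy over frequency groups.
    logits: (batch, num_classes); labels: (batch,); groups: list of class-index tensors."""
    loss, count = logits.new_zeros(()), 0
    for g in groups:
        mask = (labels.unsqueeze(1) == g.unsqueeze(0)).any(dim=1)   # samples in this group
        if mask.any():
            sub_logits = logits[mask][:, g]                          # restrict to group classes
            sub_labels = (labels[mask].unsqueeze(1) == g.unsqueeze(0)).float().argmax(dim=1)
            loss = loss + F.cross_entropy(sub_logits, sub_labels, reduction="sum")
            count += int(mask.sum())
    return loss / max(count, 1)

logits = torch.randn(8, 6)
labels = torch.randint(0, 6, (8,))
groups = [torch.tensor([0, 1, 2]), torch.tensor([3, 4, 5])]   # head vs. tail classes
loss = group_softmax_loss(logits, labels, groups)
```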

Apparently, perception in autonomous driving is a long-tailed learning task, where we spend a lot of time handling the scarce tail classes, such as persons along with unseen devices or objects, via such “tricks” as resampling and reweighting.

In real-world recognition/classification tasks, limited by various objective factors, it is usually difficult to collect training samples to exhaust all classes when training a recognizer or classifier. A more realistic scenario is open set recognition, where incomplete knowledge of the world exists at training time, and unknown classes can be submitted to an algorithm during testing [30], requiring the classifiers to not only accurately classify the seen classes, but also effectively deal with unseen ones.

While it is easy to draw a parallel to the prior definition of the open-set classification problem, the additional category, mixed unknown, is introduced in [31] because its determination is considered crucial and unique to the practical open-set object detection problem, from which an open-set object detection protocol is defined.

10. Adversarial learning

— — — — — — — — — — — —

The vulnerability of deep neural network architectures lies in small-amplitude perturbations optimized to damage the networks’ performance. Deep learning systems are vulnerable to crafted adversarial examples, which may be imperceptible to the human eye but can lead the model to misclassify the output.

One way of addressing this issue is adding better intuition about the models through explainability, but such approaches do not target direct improvement of the model. The primary objective of adversarial training is to increase model robustness by injecting adversarial examples into the training set [32]. Adversarial training is a standard brute-force approach where the defender simply generates many adversarial examples and augments the training data with them while training the targeted model. The augmentation can be done either by feeding the model with both the original data and the crafted data, or by learning with a modified objective function.
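A minimal sketch of this brute-force recipe for a classifier is shown below, using the fast gradient sign method (FGSM) to craft the perturbations; for a detector the same loop would be applied to the detection losses. The epsilon value and model interface are assumptions.

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, images, labels, epsilon=0.01):
    """Craft an FGSM perturbation: one signed-gradient step on the input."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    return (images + epsilon * images.grad.sign()).detach()

def adversarial_training_step(model, optimizer, images, labels):
    """Train on the clean batch and its adversarial counterpart together."""
    adv = fgsm_example(model, images, labels)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels) + F.cross_entropy(model(adv), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```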

Compositional convolutional neural networks (Compositional Nets) have been shown to be robust at classifying occluded objects by explicitly representing the object as a composition of parts [33]. Fig. 12 illustrates the feed-forward inference flowchart with a Compositional Net.

Fig. 12 Feed-forward inference flowchart with a Compositional Net [33]

In [34], Compositional Nets are applied to detect partially occluded objects after analyzing two of their limitations: 1) Compositional Nets, as well as other DCNN architectures, do not explicitly separate the representation of the context from the object itself, so the authors propose to segment the context during training via bounding box annotations and then use the segmentation to learn a context-aware Compositional Net that disentangles the representation of the context and the object; 2) they extend the part-based voting scheme in Compositional Nets to vote for the corners of the object’s bounding box, which enables the model to reliably estimate bounding boxes for partially occluded objects.

Adversarial learning is an efficient approach to enhance the detector, via data augmentation or a refined objective function, and it benefits the detection of persons in cluttered environments who are “disguised” by various camouflages.

11. Multi-label learning

— — — — — — — — — — — —

Exabytes of data are generated daily by humans, leading to growing needs for new efforts to deal with the grand challenges that big data brings to multi-label learning. For example, extreme multi-label classification is an active and rapidly growing research area that deals with classification tasks with an extremely large number of classes or labels; utilizing massive data with limited supervision to build a multi-label classification model becomes valuable for practical applications.

Besides these, there are tremendous efforts on how to harvest the strong learning capability of deep learning to better capture the label dependencies in multi-label learning, which is the key for deep learning to address real-world classification tasks [35].

An end-to-end unsupervised deep domain adaptation model, called the Multi-label Conditional distribution Alignment and detection Regularization (MCAR) model, is proposed in [36] for adaptive object detection by exploiting multi-label object recognition as a dual auxiliary task. The model exploits multi-label prediction to reveal the object category information in each image and then uses the prediction results to perform conditional adversarial global feature alignment, such that the multimodal structure of image features can be exploited to bridge the domain divergence at the global feature level while preserving the discriminability of the features. Moreover, a prediction consistency regularization mechanism is introduced to assist object detection, which uses the multi-label prediction results as auxiliary regularization information to ensure consistent object category discoveries between the object recognition task and the object detection task. Fig. 13 shows the MCAR model structure.

Fig. 13 MCAR model structure [36]

The detection of persons on a bike, moped, motor or scooter can be considered in the multi-label prediction domain. Looking at the example shown in Fig. 14, we can see multiple labels closely related to this task, i.e. person, moped, wheels, steering handle, battery and helmet etc.

Fig. 14 Multi-labels in detection of persons on scooter
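As a hedged sketch of the multi-label formulation (not the MCAR model itself), a detector's pooled image feature can feed a small head trained with per-label binary cross-entropy over such attribute labels; the feature dimension, label set and threshold are assumptions.

```python
import torch
import torch.nn as nn

labels = ["person", "moped", "wheels", "steering handle", "battery", "helmet"]

head = nn.Linear(2048, len(labels))                  # on top of a pooled image feature
criterion = nn.BCEWithLogitsLoss()                   # independent sigmoid per label

feat = torch.randn(4, 2048)                          # e.g. global-pooled backbone output
target = torch.randint(0, 2, (4, len(labels))).float()
loss = criterion(head(feat), target)

probs = torch.sigmoid(head(feat))                    # per-label probabilities at inference
predicted = [[l for l, p in zip(labels, row) if p > 0.5] for row in probs]
```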

12. Uncertainty modeling

— — — — — — — — — — — — —

Capturing uncertainty in object detection is indispensable for safe autonomous driving. Determining reliable perceptual uncertainties, which reflect perception inaccuracy or sensor noises, could provide valuable information to introspect the perception performance, and help an autonomous car react accordingly. Further, cognitive psychologists have found that humans are good intuitive statisticians, and have a frequentist sense of uncertainties.

Predictive uncertainty in deep neural networks can be decomposed into epistemic uncertainty and aleatoric uncertainty [37]. Epistemic, or model uncertainty, indicates how certain a model is in using its parameters to describe an observed dataset. For instance, detecting an unknown object which is different from the training dataset is expected to show high epistemic uncertainty. Aleatoric, or data uncertainty, reflects observation noise inherent in sensor measurements of the environment. For example, detecting a distant object with only sparse LiDAR reflections, or using RGB cameras during the night drive should produce high aleatoric uncertainty.

Uncertainty estimation can be used in active learning to improve data efficiency. One measurement method is to evaluate the model on OOD (out of distribution) input data which do not belong to any of the existing classes.

Four practical methods for predictive uncertainty estimation in deep learning are: MC-Dropout, Deep Ensembles, Direct Modelling, and Error Propagation.
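As one concrete example, a hedged MC-Dropout sketch follows: dropout is kept active at inference time, several stochastic forward passes are run, and the spread of the predictions serves as an (epistemic) uncertainty estimate. The toy network and sample count are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.5), nn.Linear(128, 4))

def mc_dropout_predict(model, x, num_samples=20):
    model.train()                      # keep dropout active at inference time
    with torch.no_grad():
        preds = torch.stack([model(x).softmax(dim=-1) for _ in range(num_samples)])
    return preds.mean(dim=0), preds.var(dim=0)   # predictive mean and per-class variance

mean, var = mc_dropout_predict(model, torch.randn(1, 256))
```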

Probabilistic object detection, the task of detecting objects in images and accurately quantifying the spatial and semantic uncertainties of the detections, is introduced in [38]. Given the lack of methods capable of assessing such probabilistic object detections, the new Probability-based Detection Quality (PDQ) measure is presented. Fig. 15 illustrates the key building blocks usually present in state-of-the-art probabilistic object detectors [37], including the base network, detection head and post-processing stages.

Fig. 15 Illustration of key blocks for probabilistic object detection [37]

As a special kind of object, persons on a bike, motor, moped or scooter are more difficult to detect in terms of appearance and structure. Uncertainty-aware design and training of a detector, which learns a probability distribution over its predictions, is therefore important.

13. Continual, lifelong or incremental learning

— — — — — — — — — — — — — — — — — — — — — — — — — -

Humans and animals have the ability to continually acquire, fine-tune, and transfer knowledge and skills throughout their lifespan. This ability, referred to as continual learning, incremental learning or lifelong learning, is mediated by a rich set of neurocognitive mechanisms that together contribute to the development and specialization of our sensorimotor skills as well as to long-term memory consolidation and retrieval. Consequently, lifelong learning capabilities are crucial for computational systems and autonomous agents interacting in the real world and processing continuous streams of information.

However, lifelong learning remains a long-standing challenge for machine learning and neural network models since the continual acquisition of incrementally available information from non-stationary data distributions generally leads to catastrophic forgetting or interference [39]. This limitation represents a major drawback for state-of-the-art deep neural network models that typically learn representations from stationary batches of training data, thus without accounting for situations in which information becomes incrementally available over time.

In [40], the fact that new training classes arrive in a sequential manner is leveraged, and the model is incrementally refined so that it additionally detects new object classes in the absence of the previous training data. To prevent abrupt performance degradation due to catastrophic forgetting, knowledge distillation is applied on both the region proposal network and the region classification network in Faster R-CNN, to retain the detection of previously trained classes. A pseudo-positive-aware sampling strategy is also introduced for distillation sample selection. Fig. 16 illustrates the architecture of lifelong object detection.

Fig. 16 Architecture of a lifelong object detection [40]
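A hedged sketch of a distillation term on the classification logits is given below; the temperature, class counts and reduction are assumptions, and the actual method in [40] also distills the region proposal network.

```python
import torch
import torch.nn.functional as F

def distillation_loss(new_logits, old_logits, temperature=2.0):
    """KL distillation on the old classes' logits; the frozen old model is the teacher."""
    t = temperature
    teacher = F.softmax(old_logits / t, dim=-1)
    student = F.log_softmax(new_logits[:, : old_logits.shape[-1]] / t, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * t * t

old = torch.randn(16, 10)        # frozen model's logits over 10 previously learned classes
new = torch.randn(16, 12)        # new model's logits over 10 old + 2 new classes
loss = distillation_loss(new, old)
```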

Perception in autonomous driving is absolutely a field where continual learning can play a significant role, subject to frequent and unpredictable changes in the complex environment. We should be able to learn (without forgetting) objects of unseen classes as well as improve detection capabilities as new instances of seen classes are discovered. Humans are exactly this kind of object class, generating new instances in different scenarios and events.

14. Active learning

— — — — — — — —

Active learning approaches [41] have been proposed to progressively select and annotate the most informative unlabeled samples in a dataset to facilitate model refinement through user interaction when necessary. These approaches inspired us to attempt to give the more labor- and computation-intensive tasks to computers, while assigning the less labor-intensive tasks and those that require intelligence to humans. Therefore, the sample selection criteria play an important role in conventional active learning pipelines, and are typically defined in accordance with the classification uncertainty of samples. Specifically, the minority of unlabeled samples with low prediction confidences (i.e., high uncertainties), together with other informative criteria such as diversity and density, are generally treated as good candidates for model retraining.
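As a minimal sketch of the uncertainty criterion, unlabeled samples can be ranked by predictive entropy and the top-k sent for annotation; the pool size, class count and k below are assumptions for illustration.

```python
import torch

def select_most_uncertain(probs, k=100):
    """Rank unlabeled samples by predictive entropy and return the k most uncertain.
    probs: (num_samples, num_classes) softmax outputs on the unlabeled pool."""
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return entropy.topk(k).indices     # indices to send to human annotators

pool_probs = torch.softmax(torch.randn(10000, 4), dim=-1)
to_annotate = select_most_uncertain(pool_probs, k=100)
```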

Recently, active learning-based approaches have been proposed for object detection in a semi-supervised or weakly supervised manner. However, these approaches usually ignore the fact that the remaining majority samples (e.g., those with low uncertainty and high confidence) are also valuable for improving the detection performance. Moreover, manual annotations of unlabeled data are often noisy due to ambiguities or misunderstandings among different annotators, especially for the object detection task. Adding samples with incorrectly annotated bounding boxes may also reduce the detection performance. Therefore, both reducing user effort by mining the remaining majority samples and ensuring the appropriate treatment of outliers and noisy samples should be considered to improve the accuracy and robustness of object detectors.

Curriculum learning and self-paced learning are two learning regimes that mimic human and animal learning processes, in which training gradually progresses from easy to complex samples, providing a natural and iterative way to exploit labeled data for robust learning. In curriculum learning, a predefined learning constraint (i.e., a curriculum or curricular constraint) is employed to incrementally include additional labeled samples during training.

In self-paced learning, a weighted loss is introduced on all labeled samples, which acts as a general regularizer over the sample weights. By sequentially optimizing the model while gradually controlling the learning pace via the regularizer, labeled samples can be incrementally added into the training process in a self-paced manner. A so-called pseudo-labeling strategy is introduced, which is intended to automatically select unlabeled samples with high prediction confidence and iteratively assign pseudo-labels to them in a self-paced manner.

In [42], a principled Self-supervised Sample Mining (SSM) process is proposed for object detection. The SSM process concentrates on automatically discovering and pseudo-labeling reliable region proposals to enhance the object detector via the introduced cross-image validation, i.e., pasting these proposals into different labeled images to comprehensively measure their value under different image contexts. Fig. 17 shows the pipeline of the object detection framework with the SSM process.

Fig. 17 Illustration of object detection pipeline with SSM process [42].

Apparently, person detection under complex backgrounds and with various actions faces challenging data annotation difficulties for deep NN model training. Active learning provides a way to exploit large amounts of unlabeled data to improve detector performance when the person is on a bike, motor, moped or scooter.

15. Learning from noisy labels

— — — — — — — — — — — — —

It is known that labeled data are expensive and time-consuming to obtain. Some non-expert sources, such as Amazon’s Mechanical Turk and the surrounding tags of collected data, have been widely used to mitigate the high labeling cost; however, the use of these sources often results in unreliable labels. In addition, data labeling can be extremely complex even for experienced people; labels can also be adversarially manipulated by a label-flipping attack. Such unreliable labels are called noisy labels because they may be corrupted from the ground-truth labels.

For decades, numerous methods have been proposed to manage noisy labels using conventional machine learning techniques. These methods can be categorized into four groups: data cleaning, surrogate loss, probabilistic, and model-based.

Deep learning methods to handle noisy labels have mostly focused on making the supervised learning process more robust to label noise [43]. Robust loss functions and loss adjustment aim to modify the loss function or its value; robust architectures aim to change the architecture to model the noise transition matrix of a noisy dataset; robust regularization aims to prevent a DNN from overfitting to falsely labeled samples; sample selection aims to identify true-labeled samples from noisy training data. Beyond supervised learning, researchers have recently attempted to further improve noise robustness by adopting meta learning and semi-supervised learning. In general, possibly false-labeled samples in noisy data are treated as unlabeled, whereas the remaining samples are treated as labeled, and semi-supervised learning is then performed on the transformed data.

In [44], researchers apply imperfect label assignment in a one-stage object detection framework, where the contribution of each anchor is dynamically determined by a carefully constructed cleanliness score associated with it. Exploiting outputs from both the regression and classification branches, the cleanliness scores, estimated without incurring any additional computational overhead, are used as soft labels to supervise the training of the classification branch and as sample re-weighting factors for improved localization and classification accuracy.
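A simplified, hedged sketch of using soft labels and per-anchor re-weighting factors in a binary classification loss is shown below; it illustrates the general idea rather than the exact formulation in [44], and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_label_cls_loss(cls_logits, cleanliness, sample_weights):
    """Binary classification loss supervised by soft labels and re-weighted per anchor.
    cls_logits, cleanliness, sample_weights: (num_anchors,) tensors."""
    loss = F.binary_cross_entropy_with_logits(cls_logits, cleanliness, reduction="none")
    return (sample_weights * loss).mean()

logits = torch.randn(512)
cleanliness = torch.rand(512)          # soft target per anchor, in [0, 1]
weights = torch.rand(512)              # per-anchor re-weighting factor
loss = soft_label_cls_loss(logits, cleanliness, weights)
```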

Large-scale data labeling unavoidably brings noisy labels, especially for hard cases like person detection. It is necessary to deploy a learning framework that handles noisy data.

16. Unsupervised/Self-supervised learning

— — — — — — — — — — — — — — — — — —

As mentioned before, researchers try to incorporate unlabeled data into the training process to reach equal results with fewer labels. Due to this benefit, many researchers and companies work in the field of semi-, self- and unsupervised learning [45]. The main goal is to close the gap between semi-supervised and supervised learning or even surpass supervised results. Examples of semi-supervised methods are Fast-Stochastic Weight Averaging, Mean Teacher, MixMatch, Temporal Ensembling, Pseudo-Labeling, Unsupervised Data Augmentation and Virtual Adversarial Training. In self-supervised learning, methods such as Augmented Multiscale Deep InfoMax, Contrastive Predictive Coding, DeepCluster, Invariant Information Clustering, and representation learning (context, jigsaw, rotation or exemplar) have been proposed. In unsupervised learning, there are methods like Deep Adaptive Image Clustering, Invariant Information Clustering and Information Maximizing Self-Augmented Training etc.
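As a small hedged sketch of one of the simplest ingredients above, confidence-thresholded pseudo-labeling keeps only the unlabeled samples whose top predicted probability exceeds a threshold and reuses the argmax class as their label for the next training round; the threshold and shapes are assumptions.

```python
import torch

def pseudo_label(probs, threshold=0.95):
    """Keep unlabeled samples whose top predicted probability exceeds the threshold
    and use the argmax class as a pseudo-label for the next training round."""
    conf, label = probs.max(dim=-1)
    keep = conf > threshold
    return keep.nonzero(as_tuple=True)[0], label[keep]

unlabeled_probs = torch.softmax(torch.randn(1000, 5), dim=-1)
indices, pseudo_labels = pseudo_label(unlabeled_probs)
```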

One issue in object detection is that object boundaries derived only from 2D image appearance are ambiguous and unreliable. In [46], LiDAR clues are exploited to aid unsupervised object detection. By exploiting the 3D scene structure, the localization issue can be considerably mitigated. First, candidate object segments are generated based on 3D point clouds. Then, an iterative segment labeling process is conducted to assign segment labels and to train a segment labeling network, which is based on features from both 2D images and 3D point clouds. The labeling process is carefully designed to mitigate the issue of long-tailed and open-ended distributions. The final segment labels are used as pseudo annotations for object detection network training. Fig. 18 illustrates this unsupervised object detection framework.

Fig. 18 Illustration of unsupervised object detection method [46]

17. Multi-task learning

— — — — — — — — —

Many real-world problems call for a multi-modal approach and, therefore, for multi-tasking models. Multi-task learning aims to leverage useful information across tasks to improve the generalization capability of a model. To achieve this, earlier works placed assumptions on the task parameter space, such as: task parameters should lie close to each other w.r.t. some distance metric, share a common probabilistic prior, or reside in a low-dimensional subspace or manifold. These assumptions work well when all tasks are related, but can lead to performance degradation if information sharing happens between unrelated tasks. The latter is a known problem in multi-task learning, referred to as negative transfer. To mitigate this problem, some of these works opted to cluster tasks into groups based on prior beliefs about their similarity or relatedness.

In the deep learning era, multi-task learning translates to designing networks capable of learning shared representations from multi-task supervisory signals [47]. Compared to the single-task case, where each individual task is solved separately by its own network, such multi-task networks theoretically bring several advantages. First, due to their inherent layer sharing, the resulting memory cost is substantially reduced. Second, as they explicitly avoid repeatedly calculating the features in the shared layers, once for every task, they show increased inference speed. Most importantly, they have the potential for improved performance if the associated tasks share complementary information, or act as a regularizer for one another.

These methods are historically classified into soft or hard parameter sharing techniques. In soft parameter sharing, each task is assigned its own set of parameters (i.e. task-specific networks) and feature sharing mechanisms handle the cross-task talk. A concern with this kind of approach is scalability, as the size of the multi-task network tends to grow linearly with the number of tasks. In hard parameter sharing, the parameter set is divided into shared and task-specific operations. The most characteristic hard parameter sharing design consists of a shared encoder that branches out into task-specific decoding heads. Despite the progress, the joint learning of multiple tasks is prone to negative transfer if the task dictionary contains unrelated tasks.
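A minimal sketch of hard parameter sharing follows: a shared encoder is computed once and branches into lightweight task-specific heads (here a toy per-pixel classification head and a segmentation head); the layer sizes and task choices are assumptions for illustration.

```python
import torch
import torch.nn as nn

class HardSharingNet(nn.Module):
    """Hard parameter sharing: one shared encoder, one lightweight head per task."""
    def __init__(self, num_det_classes=4, num_seg_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(                       # shared layers
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.det_head = nn.Conv2d(128, num_det_classes, 1)  # task-specific heads
        self.seg_head = nn.Conv2d(128, num_seg_classes, 1)

    def forward(self, x):
        shared = self.encoder(x)                            # computed once for all tasks
        return {"detection": self.det_head(shared), "segmentation": self.seg_head(shared)}

outputs = HardSharingNet()(torch.randn(1, 3, 256, 256))
```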

In [48] a multi-task visual perception network operating on unrectified fisheye images is presented to enable the vehicle to sense its surrounding environment. It consists of six primary tasks necessary for an autonomous driving system: depth estimation, visual odometry, semantic segmentation, motion segmentation, object detection, and lens soiling detection. It is demonstrated that the jointly trained model performs better than the respective single-task versions. The multi-task model has a shared encoder providing a computational advantage and synergized decoders where tasks support each other. Fig. 19 gives an overview of OmniDet, a multi-task visual perception framework based on surround-view cameras.

Fig. 19 Overview of the multi-task visual perception framework based on surround view cameras [48]

18. Conclusion

— — — — — — —

In this article, we have discussed how to realize object detection for a special case: persons on a bike, motor or scooter. The topic is elaborated from different points of view, i.e. visual domain knowledge (such as camera/LiDAR sensor data, human pose estimation and human action recognition/prediction), deep learning architecture (such as multi-scale aggregation, attention mechanisms and contextual modeling), and machine learning strategies (few-shot/zero-shot/long-tailed/open-set learning, adversarial learning, multi-label learning, uncertainty modeling & OOD, continual/incremental learning, active learning, unsupervised/self-supervised learning and multi-task learning) etc.

References

— — — — — — —

【1】 S Pouyanfar et al., “A Survey on Deep Learning: Algorithms, Techniques, and Applications”, ACM Computing Surveys, Vol. 51, №5, 9, 2018

【2】 L Jiao et al., “A Survey of Deep Learning-based Object Detection”, arXiv 1907.09408, 2019

【3】 Y Huang, Y Chen, “Autonomous Driving with Deep Learning: A Survey of State-of-Art Technologies ”, arXiv 2006.06091, 2020

【4】 Guo Y et al., “Deep Learning for 3D Point Clouds: A Survey”, arXiv 1912.12033, 2019

【5】 Bello Y, Yu S, Wang C, “Review: Deep Learning on 3D Point Clouds”, arXiv 2001.06280, 2020

【6】 Arnold E et al., “A Survey on 3D Object Detection Methods for Autonomous Driving Applications”, IEEE T-ITS, 20(10), Oct. 2019

【7】 Rahman M M et al., “Recent Advances in 3D Object Detection in the Era of Deep Neural Networks: A Survey”, IEEE T-IP, Nov. 2019

【8】 A Simonelli et.al, “Demystifying Pseudo-LiDAR for Monocular 3D Object Detection”, arXiv:2012.05796, 2020.

【9】 C Guo et al., “AugFPN: Improving Multi-scale Feature Learning for Object Detection”, IEEE CVPR 2020.

【10】 S Chaudhari et al., “An Attentive Survey of Attention Models”, arXiv: 1904.02874, 2020

【11】 S Woo et al., “CBAM: Convolutional Block Attention Module”, ECCV 2018.

【12】 A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, “Attention is all you need”, Advances in NIPS, 2017.

【13】 I. Biederman, R. J. Mezzanotte, and J. C. Rabinowitz, “Scene Perception: Detecting and Judging Objects Undergoing Relational Violations”, Cognitive Psychology, 1982.

【14】 F Alamri and N Pugeault. “Improving object detection performance using scene contextual constraints”. IEEE T-CDS, July 2020.

【15】 C Desai, D Ramanan, and C C. Fowlkes. “Discriminative models for multi-class object layout”. Int. J. of Computer Vision, Oct 2011.

【16】 J.Wu, Z. Kuang, L.Wang, W. Zhang, G.Wu, “Context-aware rcnn: A baseline for action detection in videos”, arXiv:2007.09861, 2020.

【17】 X Tang et al., “PyramidBox: A Context-assisted Single Shot Face Detector”, arXiv:1803.07737, 2018

【18】 Z Yuan et al., “Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving”, arXiv:2011.13628, 2020.

【19】 K. Wang, J. Xie, G. Zhang, L. Liu, and J. Yang, “Sequential 3d human pose and shape estimation from point clouds,” IEEE CVPR, 2020.

【20】 Z Luo et al., “Rethinking the Heatmap Regression for Bottom-up Human Pose Estimation”, arXiv:2012.15175, 2021

【21】 Y Kong, Y Fu, “Human Action Recognition and Prediction: A Survey”, arXiv:1806.11230, 2018

【22】 P Wu, S Chen, D Metaxas, “MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird’s Eye View Maps”, IEEE CVPR, 2020.

【23】 Y Wang et al., “Generalizing from a Few Examples: A Survey on Few-Shot Learning”, arXiv 1904.05046, May 2019

【24】 Vanshoren J, “Meta-Learning: A Survey”, arXiv 1810.03548, 2018

【25】 B Kang et al., “Few-shot Object Detection via Feature Reweighting”, ICCV 2019.

【26】 W Wang et al., “A Survey of Zero-Shot Learning: Settings, Methods, and Applications”, ACM Trans. Intell. Syst. Technol. 10, 2, January 2019

【27】 P Zhu, H Wang, V Saligrama, “Don’t Even Look Once: Synthesizing Features for Zero-Shot Detection”, IEEE CVPR 2020.

【28】 Y Zhang et al., “Bag of Tricks for Long-Tailed Visual Recognition with Deep Convolutional Neural Networks”, AAAI, 2021.

【29】 Y Li et al, “Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax”, arXiv:2006.10408, 2020

【30】 C Geng, S Huang, S Chen, “Recent Advances in Open Set Recognition: A Survey”, arXiv:1811.08581, 2020

【31】 A R Dhamija et al, “The Overlooked Elephant of Object Detection: Open Set”, WACV, 2020

【32】 A Chakraborty et al., “Adversarial Attacks and Defences: A Survey”, arXiv:1810.00069, 2018

【33】 A Kortylewski, J He, Q Liu, A Yuille. “Compositional convolutional neural networks: A deep architecture with innate robustness to partial occlusion”. IEEE CVPR, 2020

【34】 A Wang et al., “Robust Object Detection under Occlusion with Context-Aware Compositional Nets”, arXiv:2005.11643, 2020

【35】 W Liu et al., “The Emerging Trends of Multi-Label Learning”, arXiv:2011.11197, 2020.

【36】 Z Zhao et al., “Adaptive Object Detection with Dual Multi-Label Prediction”, ECCV, 2020

【37】 D Feng et al., “A Review and Comparative Study on Probabilistic Object Detection in Autonomous Driving”, arXiv:2011.10671, Nov. 2020.

【38】 D Hall et al., “Probabilistic Object Detection: Definition and Evaluation”, WACV, 2020

【39】 G I Parisi et al., “Continual Lifelong Learning with Neural Networks: A Review”, arXiv:1802.07569, 2019

【40】 W Zhou et al., “Lifelong Object Detection”, arXiv:2009.01129, 2020

【41】P Ren et al., “A survey of active learning”, arXiv:2009.00236, 2020

【42】 K Wang et al., “Towards Human-Machine Cooperation: Self-supervised Sample Mining for Object Detection”, IEEE CVPR 2018.

【43】 H Song et al., “Learning from Noisy Labels with Deep Neural Networks: A Survey“, arXiv: 2007.08199, 2020

【44】 H Li et al., “Learning from Noisy Anchors for One-stage Object Detection”, IEEE CVPR 2020.

【45】 L Schmarje et al., “A survey on Semi-, Self- and Unsupervised Techniques in Image Classification -Similarities, Differences & Combinations”, arXiv: 2002.08721, 2020

【46】 H Tian et al., “Unsupervised Object Detection with LiDAR Clues”, arXiv:2011.12953, 2020

【47】 S Vandenhende et al., “Revisiting Multi-Task Learning in the Deep Learning Era”, arXiv: 2004.13379, 2020

【48】 V R Kumar et al., “OmniDet: Surround View Cameras based Multi-task Visual Perception Network for Autonomous Driving”, arXiv: 2102.05150, 2021
