Some Words about Autonomous Driving’s Engineering Landing

Feb 17, 2022

1. Introduction

There is a sense that autonomous driving has entered its “second half”. Early work such as demos and proofs of concept (POC) no longer attracts much attention. The “first half” was mostly about solving common problems: algorithms for perception, localization, prediction, planning, decision-making and control in typical scenes (highway, urban street and parking lot), and the means of executing them (chassis-by-wire technology).

In addition, during the “first half”, R&D on computing platforms (AI chips and their SoCs) and sensor technology also produced initial results, such as NVIDIA’s Xavier and Orin, HDR cameras, solid-state LiDAR and 4-D radar.

It has become a “consensus” that the main tasks of the “second half” are to build a “closed-loop” data framework and to address the “long tail” distribution of rare cases. In this process, the key is how to land autonomous driving technology in engineering practice, which includes standardization and platformization of development, mass production and scalability, and commercialization (cost, vehicle regulations and OTA).

In the following sections, we list typical issues in autonomous driving engineering.

2. Chassis by Wire

The chassis system accounts for about 10% of the whole vehicle cost, and chassis by wire is a key component of autonomous driving, because without it the final control signals output by the autonomous driving system may not be executed correctly.

Drive by wire, or X-by-wire, replaces mechanical, hydraulic or pneumatic connections with wires (electrical signals), so that actuation no longer relies on the driver’s input of force or torque.

Chassis by wire mainly includes the braking, steering, drive and suspension systems. It offers fast response, high control accuracy and strong energy recovery, and it is indispensable for autonomous driving.

The safety of chassis-by-wire technology is the most critical element for autonomous driving. In the past, purely mechanical control had low efficiency but high reliability. Although by-wire control suits autonomous driving, it also faces latent dangers from failures of electronics and software. Only dual or even multiple functional redundancy can ensure that the basic functions remain available when a failure occurs.

The Lincoln MKZ is the most popular development and debugging vehicle among L4 self-driving start-ups worldwide. This is due to its high-performance, high-precision by-wire control, the ease of modifying it through reverse engineering, and mature by-wire retrofit providers such as AutonomouStuff and Dataspeed, which together offer a stable and easy-to-use R&D platform for self-driving start-ups.

An electric vehicle chassis comprises the three electric systems (battery, motor and electronic control), energy recovery and thermal management, steer-by-wire, the braking system, the suspension system and so on. A skateboard chassis is a modular layout in which the steering, braking, three-electric and suspension systems are all installed on the chassis. Modules are swapped according to the requirements of the vehicle model, which shortens the development cycle. It is called a “skateboard chassis” because its shape resembles a skateboard. Its high flexibility suits the needs of autonomous driving systems.

European and American start-ups such as Arrival, Rivian, Canoo and REE have announced the adoption of skateboard chassis. Automakers such as Toyota, Hyundai and Citroen, as well as Tier-1 suppliers such as Schaeffler and ZF, have begun to develop skateboard chassis.

3. E2A (Electronic and Electrical Architecture)

Driven by the automotive industry trend of “connected, autonomous, shared and electric” (CASE), vehicle architecture is shifting from distributed to centralized. E2A is the overall layout scheme that integrates the various sensors, processors, electronic and electrical distribution systems, software and hardware (including the data center platform and the high-performance computing platform).

Through E2A, functions such as the powertrain, driving information and infotainment are translated into electronic and electrical solutions covering physical layout, signal network, data network, diagnosis, fault tolerance and power management.

E2A has broadly gone through three generations: a distributed multi-MCU networked architecture, functional-cluster domain controllers, and zone controllers connected to a central platform computer (CPC).

Autonomous vehicles require a large number of sensors, so the in-vehicle wiring harness is growing rapidly and the amount of data transmitted in the vehicle is soaring. The harness must not only carry more signals but also support higher data rates.

Under the new-generation E2A platform, the autonomous vehicle decouples software from hardware through standardized APIs and is backed by stronger computing power. Data communication bandwidth is also higher, and resource allocation and task scheduling are more flexible. It also makes OTA (over-the-air) updates more convenient.

For smart vehicle E2A, Aptiv proposed a scheme combining a “brain” and “neurons” in three parts: a central compute cluster, a standard power and data backbone network, and power data centers. The architecture emphasizes three characteristics: flexibility, continuous upgradability over the life cycle, and fault tolerance and robustness of the system architecture.

The E2A of the Tesla Model 3 is divided into a domain control architecture and a power distribution architecture. The AICM (Autopilot & Infotainment Control Module) is merged into the CCM (central compute module), while the power distribution architecture takes into account the power redundancy requirements of the autonomous driving system.

4. Middleware Software Platform

Middleware is a large category of basic software. It sits above the operating system, network and database, and below the application software. It provides an environment for running and developing application software, which makes the development and integration of complex applications flexible and efficient. It shares resources and manages computing resources and network communication.

In addition, middleware is not an operating system but a set of software frameworks, although it includes a real-time operating system (RTOS), a microcontroller abstraction layer (MCAL), a service communication layer, protocols and services.

The core of middleware is “unified standards, decentralized implementation and centralized configuration”. It serves the following functions: meeting the availability and safety requirements of automotive functions; maintaining a degree of redundancy in automotive electronic systems; porting across different platforms; implementing the basic system functions of the standard; sharing software functions over the network; integrating software modules from multiple developers; better software maintenance over the product life cycle; full use of the hardware platform; and updating and upgrading automotive electronic system software.

Service-Oriented Architecture (SOA) is loosely coupled: it has neutral interface definitions, which means that an application’s components and functions are not rigidly bound to one another or tied closely to its structure. When the internal structure and implementation of a service change over time, the overall software architecture is not greatly affected.

SOA makes the deployment of service components independent of any specific operating system or programming language, and to some extent separates software from hardware. SOA software development considers functions from the user’s perspective, is centered on services, and abstracts and encapsulates service logic.

Autonomous driving software supported by a new-generation middleware platform uses SOA to achieve function abstraction at an appropriate granularity, pluggable software (independently developed, tested, deployed and released), service-oriented functions and loose coupling between functions.
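The loose coupling described above can be illustrated with a small sketch. It is not tied to AUTOSAR or to any real middleware: `ServiceRegistry`, `SpeedService`, `WheelSpeedProvider`, `GpsSpeedProvider` and the service name `"vehicle/speed"` are all hypothetical names invented for illustration, and the returned speeds are made-up values. The point is only that the consumer depends on a neutral interface, so the provider can be swapped without touching the consumer.

```python
from typing import Dict, Protocol


class SpeedService(Protocol):
    """Neutral service interface: consumers see only this contract."""

    def current_speed_mps(self) -> float: ...


class WheelSpeedProvider:
    """One hypothetical implementation, e.g. backed by wheel encoders."""

    def current_speed_mps(self) -> float:
        return 13.9  # made-up value for the example


class GpsSpeedProvider:
    """A drop-in replacement with a different internal implementation."""

    def current_speed_mps(self) -> float:
        return 14.1  # made-up value for the example


class ServiceRegistry:
    """Minimal in-process stand-in for middleware service discovery."""

    def __init__(self) -> None:
        self._services: Dict[str, SpeedService] = {}

    def offer(self, name: str, service: SpeedService) -> None:
        self._services[name] = service  # re-offering replaces the provider

    def find(self, name: str) -> SpeedService:
        return self._services[name]


def speed_kmh(registry: ServiceRegistry) -> float:
    """Consumer: looks the service up by name and uses only the interface."""
    return registry.find("vehicle/speed").current_speed_mps() * 3.6


registry = ServiceRegistry()
registry.offer("vehicle/speed", WheelSpeedProvider())
before = speed_kmh(registry)
registry.offer("vehicle/speed", GpsSpeedProvider())  # swap provider, consumer unchanged
after = speed_kmh(registry)
```

A production middleware would add discovery across processes and ECUs, serialization and quality-of-service settings, none of which are modeled here.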

AUTOSAR is a set of standardized software interfaces, jointly developed by major automotive OEMs and component suppliers.

5. AI Model Compression and Acceleration

AI model compression and acceleration are two different topics. Compression focuses on reducing the number of network parameters, while acceleration aims to reduce computational complexity and improve parallelism.

Techniques for compressing and accelerating AI models can be roughly divided into four categories:

1) Parameter pruning and sharing: explore the redundancy in model parameters and try to remove redundant and unimportant parameters;

2) Low-rank factorization: use matrix / tensor decomposition to estimate the informative parameters of the deep CNN model;

3) Transferred/compact convolution filters: design specially structured convolution filters to reduce the parameter space and save storage / computation;

4) Knowledge distillation: learn a distilled model, that is, train a more compact neural network to reproduce the output of a larger network.
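As a rough illustration of the first two categories, the sketch below shows magnitude-based pruning (one simple pruning criterion among many, chosen here as an assumption) and the parameter arithmetic behind low-rank factorization. The function names, the example weights and the 1024 x 1024 layer size are hypothetical.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the given fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def low_rank_param_count(m: int, n: int, rank: int) -> int:
    """Parameters after replacing an m x n matrix W by U (m x r) times V (r x n)."""
    return rank * (m + n)


# Pruning: half of the weights (the three smallest in magnitude) become zero.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)

# Low-rank factorization: a dense 1024 x 1024 layer versus a rank-64 approximation.
dense_params = 1024 * 1024
factored_params = low_rank_param_count(1024, 1024, rank=64)
```

With these numbers the factored layer stores 131,072 parameters instead of 1,048,576, an 8x reduction. How much accuracy survives depends on how close the original weight matrix is to low rank, which this arithmetic does not capture.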

Generally, parameter pruning and sharing, low-rank factorization and knowledge distillation can be used in deep neural network models with fully connected and convolutional layers, whereas transferred / compact filters apply only to models with convolutional layers. Methods based on low-rank factorization and transferred / compact filters provide an end-to-end pipeline that can be easily implemented in a CPU / GPU environment. Parameter pruning and sharing, by contrast, relies on different techniques such as vector quantization, binary coding and sparse constraints, so it usually takes several steps to reach its goal.

As for training, models based on parameter pruning / sharing and low-rank factorization can be extracted from pre-trained models or trained from scratch, whereas transferred / compact convolution filters and knowledge distillation models can only be trained from scratch. These methods are designed independently and complement each other. For example, transferred layers can be combined with parameter pruning and sharing, and model quantization and binarization can be combined with low-rank approximation.

Knowledge distillation compresses a deep and wide network into a shallower one, where the compressed model mimics the function learned by the complex model. The main idea of distillation-based methods is to transfer knowledge from a large teacher model to a small student model by learning the class distribution output by the teacher’s softmax.
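The idea of matching the teacher’s softmax distribution can be sketched in a few lines. This is a minimal illustration, assuming a temperature-softened softmax and a KL-divergence loss, which is one common formulation; the logits and the temperature value of 4.0 are made up for the example.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax with temperature; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: List[float],
                      student_logits: List[float],
                      temperature: float = 4.0) -> float:
    """KL(teacher || student) between temperature-softened class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


teacher_logits = [6.0, 2.0, -1.0]
loss_same = distillation_loss(teacher_logits, teacher_logits)   # student matches teacher
loss_diff = distillation_loss(teacher_logits, [1.0, 3.0, 0.0])  # student disagrees
```

The loss is zero when the student reproduces the teacher’s distribution and grows as they diverge. In practice this term is usually combined with the ordinary loss on ground-truth labels.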

6. In-Vehicle Autonomous Driving Chips

Autonomous driving chips and SoCs (systems-on-chip) are designed to deliver an efficient, low-cost and low-power computing platform. An autonomous driving platform built on an industrial PC can hardly achieve mass production, scalability and cost control.

An SoC may include an AI accelerator (for running deep learning models), CPU/GPU cores, a DSP, an ISP and a CV (computer vision) unit. On top of the chip platform, a neural network compiler supporting deep learning models needs to be developed to maximize utilization of the chip and avoid data bottlenecks in the processors.

On such a platform, the adaptation of algorithms (modularization and multi-process decomposition), the efficient running of autonomous driving software (including inter-process data communication, deep learning model acceleration, task scheduling and resource management) and safety assurance (functional safety and safety of the intended functionality) all require a great deal of hard engineering work and necessary cost (such as system redundancy).

So far, NVIDIA’s Xavier and Orin are the most successful and open autonomous driving chips on the market.

7. Data Closed-Loop Platform

Autonomous driving, one of the most challenging applications of AI, is a typical example of the “long tail” effect. Many rare “corner cases” lack data, so they have to be discovered continuously in a closed loop, labeled, and added to the training set as well as to the test set or the simulation scenario database. After the NN model is iteratively upgraded, it is deployed to the autonomous vehicles and a new cycle begins. This is the “data closed loop”.

The figure below shows Tesla’s data closed-loop framework: identifying inaccuracies in the NN models, data labeling and cleaning, model training, and deployment/delivery.

The following figure shows the data closed-loop platform of Google’s Waymo: data mining/active learning, automatic/manual labeling, automatic model tuning and optimization, test and verification, and deployment/release.

The data closed loop requires a cloud / edge computing platform and big data processing technology; it cannot be realized in a single vehicle or on a single machine. Big data cloud computing, developed over many years, provides the infrastructure for the data closed loop in batch/stream data processing, workflow management, distributed computing, state monitoring and database storage.

As for NN model training platforms (machine learning, mainly deep learning), Caffe was among the early open-source frameworks, and TensorFlow and PyTorch (into which Caffe2 was merged) are the most popular at present. Training deep learning models on a cloud platform generally uses distributed training. Depending on how the work is parallelized, distributed training is divided into data parallelism and model parallelism; a mixture of the two can also be used:

• Model parallelism: different GPUs are responsible for different parts of the network model. For example, different network layers are assigned to different GPUs, or different parameters of the same layer are assigned to different GPUs.

• Data parallelism: each GPU holds a copy of the model and is assigned different data; the results from all GPUs are then combined in some way.

Model parallelism is less commonly used, while data parallelism raises the question of how to synchronize model parameters between GPUs, which can be done by synchronous or asynchronous update. With synchronous update, the new weights are computed only after all GPUs have finished their gradient calculations; once the new values are synchronized, the next iteration begins. With asynchronous update, each GPU updates the weights as soon as its own gradient is computed, without waiting for the others, and then fetches the new values for its next iteration.
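Synchronous data-parallel training can be sketched without any GPU framework. In the toy example below each “worker” is just a list of samples, the model is a single weight `w` in `y = w * x`, and the shards are equally sized so that averaging the per-worker gradients equals the full-batch gradient. All names and numbers are illustrative assumptions.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of the mean squared error of y = w * x over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def sync_step(w: float, shards: List[List[Sample]], lr: float) -> float:
    """One synchronous update: wait for every worker, average, then step."""
    grads = [gradient(w, shard) for shard in shards]  # one gradient per worker
    mean_grad = sum(grads) / len(grads)               # the combine step
    return w - lr * mean_grad                         # same new weight everywhere


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
shards = [data[:2], data[2:]]                             # two equally sized shards

w = 0.0
for _ in range(200):
    w = sync_step(w, shards, lr=0.01)
```

After the loop `w` has converged close to 2. An asynchronous variant would apply each worker’s gradient as soon as it arrives, which removes the waiting but means some gradients are computed against stale weights.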

Distributed training systems use two main architectures: the parameter server (PS) architecture and the ring-AllReduce architecture.
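To make the ring-AllReduce idea concrete, here is a single-process simulation. It assumes the usual two-phase scheme (reduce-scatter followed by all-gather) in which each of `n` workers splits its gradient into `n` chunks and only ever sends to its right-hand neighbor. The function name and the data are hypothetical, and real implementations run these sends in parallel over the network.

```python
from typing import List

Chunk = List[int]


def ring_allreduce(worker_chunks: List[List[Chunk]]) -> List[List[Chunk]]:
    """Simulate ring all-reduce (sum). worker_chunks[i][c] is chunk c on worker i."""
    n = len(worker_chunks)
    data = [[list(chunk) for chunk in worker] for worker in worker_chunks]

    # Phase 1, reduce-scatter: after n - 1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        messages = []
        for i in range(n):
            idx = (i - step) % n
            messages.append(((i + 1) % n, idx, list(data[i][idx])))
        for dst, idx, payload in messages:  # all sends happen "at once"
            data[dst][idx] = [a + b for a, b in zip(data[dst][idx], payload)]

    # Phase 2, all-gather: pass the completed chunks around the ring.
    for step in range(n - 1):
        messages = []
        for i in range(n):
            idx = (i + 1 - step) % n
            messages.append(((i + 1) % n, idx, list(data[i][idx])))
        for dst, idx, payload in messages:
            data[dst][idx] = payload

    return data


inputs = [
    [[1, 2], [3, 4], [5, 6]],              # worker 0: three chunks
    [[10, 20], [30, 40], [50, 60]],        # worker 1
    [[100, 200], [300, 400], [500, 600]],  # worker 2
]
result = ring_allreduce(inputs)
```

Every worker ends up with the same element-wise sum. Each worker sends only one chunk per step, so the traffic per worker stays roughly constant as workers are added, whereas a single parameter server has to receive gradients from every worker.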

The goal of active learning is to find an effective way to select, from the unlabeled data pool, the data to be labeled so as to maximize accuracy. Active learning is usually iterative: in each iteration a model is trained, and heuristics are used to select a batch of data from the unlabeled pool for annotation.
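The selection step of one active learning iteration might look like the sketch below. It assumes entropy-based uncertainty sampling, which is only one of many possible heuristics, and it takes the model’s predicted class probabilities for the unlabeled pool as given. The function names and the probabilities are made up.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy of a class distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(pool_probs: List[List[float]], budget: int) -> List[int]:
    """Return the indices of the `budget` most uncertain pool samples."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,  # highest entropy first
    )
    return ranked[:budget]


# Predicted class probabilities for four unlabeled samples.
pool_probs = [
    [0.98, 0.01, 0.01],  # confident
    [0.40, 0.35, 0.25],  # uncertain
    [0.70, 0.20, 0.10],  # fairly confident
    [0.34, 0.33, 0.33],  # almost uniform
]
to_label = select_for_labeling(pool_probs, budget=2)
```

The two selected samples would be sent for annotation, added to the training set, and the model retrained before the next round of selection.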

The data contains unexpected cases that deviate from the norm, the so-called corner cases. Online corner case detection can serve as a safety monitoring and warning system that identifies a corner case when it occurs. Offline corner case detection can be applied to large amounts of collected data to select suitable training and test data.

Machine learning models often fail on out-of-distribution (OOD) data. Detecting OOD inputs is one way to assess uncertainty: it can not only trigger safety warnings but also surface valuable data samples for training.

There are two sources of uncertainty: aleatoric and epistemic. Aleatoric uncertainty (also known as data uncertainty) is the irreducible uncertainty in the data that leads to uncertainty in predictions. Epistemic uncertainty (also known as knowledge or model uncertainty) is caused by insufficient knowledge and data.

The most commonly used uncertainty estimation methods are Bayesian approximation and ensemble learning.

One class of OOD detection methods is based on Bayesian neural network inference, including dropout-based variational inference, Markov chain Monte Carlo (MCMC) and Monte Carlo dropout. Another class includes (1) training-time methods such as auxiliary losses or NN architecture modifications, and (2) post hoc statistics.
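A small sketch can show how an ensemble (or several Monte Carlo dropout passes) separates the two kinds of uncertainty. It assumes one commonly used entropy-based decomposition: total uncertainty is the entropy of the averaged prediction, the aleatoric part is the average entropy of the individual predictions, and the epistemic part is the difference. The probabilities are invented.

```python
import math
from typing import List, Tuple


def entropy(probs: List[float]) -> float:
    """Shannon entropy of a class distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decompose_uncertainty(member_probs: List[List[float]]) -> Tuple[float, float, float]:
    """Return (total, aleatoric, epistemic) for one input.

    total     = entropy of the averaged prediction
    aleatoric = average entropy of the individual predictions
    epistemic = total - aleatoric (disagreement between members)
    """
    n = len(member_probs)
    n_classes = len(member_probs[0])
    mean = [sum(m[c] for m in member_probs) / n for c in range(n_classes)]
    total = entropy(mean)
    aleatoric = sum(entropy(m) for m in member_probs) / n
    return total, aleatoric, total - aleatoric


# Three ensemble members agree: little epistemic uncertainty.
familiar = decompose_uncertainty([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]])

# The members contradict each other: high epistemic uncertainty.
unfamiliar = decompose_uncertainty([[0.95, 0.05], [0.05, 0.95], [0.5, 0.5]])
```

An input whose epistemic term exceeds some threshold could be flagged as a possible OOD sample. The threshold itself has to be tuned on validation data and is not part of this sketch.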

8. DevOps and MLOps

DevOps, in short, optimizes the processes of development (Dev), testing (QA) and operations (Ops). It integrates development with operations and maintenance, and uses highly automated tools and processes to make software development, testing and release faster, more frequent and more reliable.

DevOps is a complete workflow for IT operations. It uses IT automation and continuous integration (CI) / continuous deployment (CD) as the basis for optimizing every stage of development, testing and operations.

Trunk-based development (TBD) is a prerequisite for CI. Automation and centralized code management are necessary conditions for implementing CI. DevOps is an extension of the CI idea, and CI/CD is the technical core of DevOps.

The core goal of MLOps is to make the AI model’s entire end-to-end pipeline, from training to deployment, run stably and efficiently in the production environment and meet customers’ business requirements.

MLOps also places corresponding requirements on the core technology of the AI system. For example, deployment automation places clear requirements on the front-end design of the AI framework: if that design does not make it easy to export complete AI model files, many downstream components have to introduce “plug-ins” for their respective service scenarios in the deployment pipeline.

The demand for deployment automation will also give rise to software components around the AI core system, for example for optimizing model inference deployment, reproducing model training and prediction results, and scaling AI production systems.

9. Scenario Database Building and Testing

Scenario-based autonomous driving testing is an effective way to accelerate test and evaluation.

“As a comprehensive embodiment of the driving environment and scene, the scenario describes the road layout, surrounding traffic and atmosphere (weather and illumination) of the external driving environment, as well as the driving task and state. It is an abstraction and mapping of the set of factors that affect and judge autonomous driving function and performance, and it is characterized by high uncertainty, unrepeatability, unpredictability and inexhaustibility.”

Test scenarios can be classified in two ways:

1) By level of abstraction: functional scenarios, logical scenarios and concrete scenarios;

2) By data source: natural driving scenarios, hazardous-condition scenarios, standard and regulation scenarios, and parameter-recombination scenarios.
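The three abstraction levels can be made concrete with a short sketch. It assumes the common reading in which a functional scenario is a verbal description, a logical scenario adds a value range for each parameter, and a concrete scenario fixes each parameter to one value. The class, the cut-in example and its parameter ranges are hypothetical, and uniform random sampling is only one way to derive concrete scenarios.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LogicalScenario:
    """A functional description plus a value range for each parameter."""

    description: str
    parameter_ranges: Dict[str, Tuple[float, float]]


def sample_concrete(logical: LogicalScenario, count: int, seed: int = 0) -> List[Dict[str, float]]:
    """Derive concrete scenarios by fixing every parameter to one value in its range."""
    rng = random.Random(seed)  # seeded so that test runs are repeatable
    return [
        {name: rng.uniform(low, high) for name, (low, high) in logical.parameter_ranges.items()}
        for _ in range(count)
    ]


cut_in = LogicalScenario(
    description="A vehicle cuts in ahead of the ego vehicle on a highway",
    parameter_ranges={
        "ego_speed_kmh": (80.0, 130.0),
        "cut_in_gap_m": (5.0, 40.0),
        "lead_speed_kmh": (60.0, 120.0),
    },
)
concrete_scenarios = sample_concrete(cut_in, count=5)
```

Each dictionary in `concrete_scenarios` could then be handed to a simulator as one test case. Parameter-recombination scenarios, mentioned in the second classification, are produced in much the same way by recombining parameter values observed in real data.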

The dimensions of a scenario database include:

• Scenario: static and dynamic part

• Traffic: driving behavior and non-motorized behavior, such as that of vulnerable road users (VRU), e.g. pedestrians and cyclists

• Weather: effects and disturbances on sensors (camera, radar, lidar)

The construction of a scenario database draws on different data sources such as real, virtual and expert data, and a complete system is built hierarchically through scenario data mining, scenario classification and scenario deduction.

PEGASUS is a German project that establishes a scenario database using traditional methods.

10. Safety Redundancy Design

ISO 26262 applies to passenger vehicles, motorcycles and commercial motor vehicles, and specifically to the practice of functional safety. In this standard, risk is determined and communicated, or mapped, using Automotive Safety Integrity Levels (ASIL). Practitioners of functional safety understand that an E/E fault leading to a failure of system-level capabilities and functions can contribute to incorrect steering or braking, which are rated at the highest safety-related risk level, ASIL D.

The basic concept of the ISO/PAS 21448 safety of the intended functionality (SOTIF) approach is to introduce an iterative function development and design process that includes validation and verification (V&V) and that leads to an intended function that could be declared safe. Several activities will be derived based on an approach that argues that these activities are adequate for developing an automated functionality that is safe. The goal is to reduce the known potentially unintended behaviors and the unknown potential behavior to an acceptable level of residual risk.

A minimal risk maneuver (MRM) is the system’s capability of transitioning the vehicle between minimal risk conditions (MRC). The concept of MRCs and MRMs derives from the principles of ISO 26262 and is defined as an operating mode (in the case of a failure) of an item with a tolerable level of risk. In terms of ISO 26262, an MRM is an emergency operation to reach an MRC — referred to as a safe state.

The Operational Design Domain (ODD) is the set of design parameters within which the ADS is programmed to operate. The ODD includes all conditions that must be met in order for the ADS to be operable. These conditions include but are not limited to: geographical limitations, environmental limitations, and human driver limitations.
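An ODD check can be pictured as a predicate over the current driving state. The sketch below is a deliberately simplified assumption: the field names, the highway-only example and the 130 km/h limit are all hypothetical, and a real ODD covers many more dimensions.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class OperationalDesignDomain:
    allowed_road_types: FrozenSet[str]  # geographical limitation
    allowed_weather: FrozenSet[str]     # environmental limitation
    max_speed_kmh: float
    requires_attentive_driver: bool     # human driver limitation


@dataclass(frozen=True)
class DrivingState:
    road_type: str
    weather: str
    speed_kmh: float
    driver_attentive: bool


def within_odd(odd: OperationalDesignDomain, state: DrivingState) -> bool:
    """True only if every ODD condition is met at the same time."""
    return (
        state.road_type in odd.allowed_road_types
        and state.weather in odd.allowed_weather
        and state.speed_kmh <= odd.max_speed_kmh
        and (state.driver_attentive or not odd.requires_attentive_driver)
    )


highway_odd = OperationalDesignDomain(
    allowed_road_types=frozenset({"highway"}),
    allowed_weather=frozenset({"clear", "light_rain"}),
    max_speed_kmh=130.0,
    requires_attentive_driver=True,
)
inside = DrivingState("highway", "clear", 110.0, driver_attentive=True)
outside = DrivingState("highway", "snow", 110.0, driver_attentive=True)
```

Here `within_odd(highway_odd, inside)` holds while `within_odd(highway_odd, outside)` does not, because snow is not among the allowed weather conditions. Leaving the ODD is typically what triggers a takeover request or a fallback.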

Under fault conditions, the safe function can be achieved via a “fail operational” strategy (redundancy), a “fail degraded” strategy (operating with degradation), or a “fail safe” strategy (bringing the vehicle to a safe stop). Which approach is chosen always depends on the nature of the design element under fault condition and the remaining capabilities of the system.
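The choice among the three strategies can be sketched as a simple decision function. The order of preference used here (full function first, then degraded operation, then a safe stop) and the two boolean inputs are illustrative assumptions, not a rule taken from any standard.

```python
from enum import Enum


class FaultStrategy(Enum):
    FAIL_OPERATIONAL = "continue with full function on a redundant channel"
    FAIL_DEGRADED = "continue with reduced function"
    FAIL_SAFE = "bring the vehicle to a safe stop"


def choose_strategy(redundant_channel_healthy: bool,
                    degraded_mode_available: bool) -> FaultStrategy:
    """Pick a strategy from the capabilities that remain after a fault."""
    if redundant_channel_healthy:
        return FaultStrategy.FAIL_OPERATIONAL
    if degraded_mode_available:
        return FaultStrategy.FAIL_DEGRADED
    return FaultStrategy.FAIL_SAFE
```

For example, `choose_strategy(False, True)` returns `FaultStrategy.FAIL_DEGRADED`. A real system would decide per design element and would also weigh the current driving situation.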

Design safety considerations taken into account include:

• design architecture
• sensors
• actuators
• communication failure
• potential software errors
• reliability
• potential inadequate control
• undesirable control actions
• potential collisions with environmental objects and other road users
• potential collisions that could be caused by actions of an ADS
• leaving the roadway
• loss of traction or stability
• violation of traffic laws
• deviations from normal/expected driving practices

Safety focuses on the proper functioning of a system, while security focuses on the system’s ability to resist intentionally malicious action. In particular, safety is concerned with risks presented by passive adversaries, such as randomness in nature and human-caused accidents or crashes, whereas security is concerned with risks presented by active adversaries: creative, determined and malicious human beings acting intentionally.

Security makes heavy use of cryptography, which is often resource-intensive, but active safety mechanisms should be deterministic. Safety-related data often comes with requirements for short processing deadlines, which makes it difficult to ensure required levels of data authenticity, confidentiality, etc. Satisfying both safety and security will impact resources and the architecture.

ISO/SAE 21434 is the standard for automotive cybersecurity. Cybersecurity vulnerabilities may undermine the system’s ability to achieve its basic security objectives, so cybersecurity is an important part of the design. Ensuring the integrity and security of autonomous vehicles is urgent.

Autonomous vehicles pose a unique cybersecurity challenge: they combine advanced vehicle technology, frequent software updates and cloud control features. A general cybersecurity strategy covers not only the vehicle’s electronic devices, sensors and autonomous driving system, but also anything connected to them, such as data ports, mobile applications and customer service systems.

The cybersecurity industry standard is a “defense in depth” approach that layers security across multiple overlapping systems. It reduces the chance of hackers entering the vehicle network by applying component isolation, memory protection and access control to every embedded system (especially those with an external interface or a security function).

The cybersecurity system also limits the ability of a compromised device to affect vehicle behavior through message authentication, verification and credential provisioning. In addition, the impact of any breach can be minimized through measures such as network isolation and physical and virtual partitioning. Together, these validation techniques help ensure that no single compromised component can endanger the health of the whole system.
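Message authentication, one of the measures listed above, can be illustrated with Python’s standard `hmac` module. This is a minimal sketch rather than an automotive protocol: the key, the payload format and the helper names `sign` and `verify` are made up for the example.

```python
import hashlib
import hmac
from typing import Optional

TAG_LENGTH = 32  # bytes in a SHA-256 digest


def sign(key: bytes, payload: bytes) -> bytes:
    """Append an HMAC-SHA256 tag to the payload."""
    return payload + hmac.new(key, payload, hashlib.sha256).digest()


def verify(key: bytes, message: bytes) -> Optional[bytes]:
    """Return the payload if the tag is valid, otherwise None."""
    if len(message) < TAG_LENGTH:
        return None
    payload, tag = message[:-TAG_LENGTH], message[-TAG_LENGTH:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking information through timing differences.
    return payload if hmac.compare_digest(tag, expected) else None


key = b"demo-key-not-for-production"
message = sign(key, b"brake_request=0.3")

# An attacker changes the payload but cannot recompute the tag without the key.
tampered = b"brake_request=1.0" + message[-TAG_LENGTH:]

accepted = verify(key, message)
rejected = verify(key, tampered)
```

The genuine message is accepted and the tampered one is rejected. A real system would also need replay protection (for example a message counter) and secure key provisioning, both of which are left out here.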

Upon system fallback, the Level 3 BMW ADS Vision iNEXT will send a cascade of warnings to the human driver in the form of visual, auditory, and haptic alerts with increasing levels of urgency. This warning cascade comprises the Level 3 BMW ADS takeover request and utilizes the Human Machine Interface (HMI).

In the event that the fallback-ready user (i.e., the human driver) does not respond to the warning cascade of the takeover request, the Level 3 BMW ADS will perform a risk mitigation maneuver. This simply means that the vehicle will take action up to and including bringing itself to a safe stop on the hard shoulder, or in the traffic lane if reaching the shoulder is not feasible (for example during heavy traffic), as shown in the figure below.

The BMW Group sees the necessity of a diverse redundancy (diversity): both the primary and the secondary channel are themselves redundant, and have their own diagnostic units. This allows the detection of the channel under fault and lets the other channel take over. In the case a fault affects both channels, a third rudimentary channel takes over to allow for reaching a minimal risk condition. The whole redundancy concept is shown in the following Figure:

BMW is an active participant in the Automotive Information Sharing and Analysis Center (Auto-ISAC), an industry platform for sharing cybersecurity threats and intelligence related to the automotive industry. BMW implements a security architecture for all on-board and off-board vehicle systems, including connected devices and the BMW backend, which uses security-by-design methods and is based on the latest industry best practices.

For all cyber-physical systems, BMW implements a basic level of protection, which may include encryption and authentication. In addition, for the most critical systems and data of the vehicle, additional protection measures are implemented to provide BMW customers and all road users with a higher level of protection.

BMW Group adopts “defense in depth”, among many other cybersecurity principles, to ensure that different controls are in place on different system layers, so that vehicles do not rely solely on their perimeter to withstand cyberattacks. The effective use of cybersecurity technology and functions in BMW Group’s products is the result of this “defense in depth” approach.

11. Conclusion

Autonomous driving has entered a period of engineering implementation. Some necessary elements of engineering landing are discussed here: chassis by wire, electronic and electrical architecture, middleware software platform, AI model compression and acceleration, in-vehicle autonomous driving chips (computing platform), data closed loop, DevOps/MLOps, scenario database construction and testing, and safety redundancy design.

In addition, there are some engineering issues not covered here, such as sensor cleaning, memory/instruction optimization and task scheduling.