1. Motivation
The author first poses a question: how do we handle the long-tail problem? Most driving data consists of easy and uninteresting behaviors, and as humans we rarely experience car crashes ourselves. Yet we have all seen such scenes. By analogy, even when the data-collecting ego-vehicle records very little crash data of its own, it should at least have logs of surrounding vehicles in those interesting and safety-critical situations. 【So the data of other vehicles can be used for training; I will come back later to check whether this reading is accurate.】
This work seems to draw heavily on Learning by Cheating; opening that paper, I found it is by the same author, wow!
2. Related Work
2.1 Perception for autonomous driving
【For details on the related work, see the references cited in the paper.】
- A typical perception system performs detection and tracking on LiDAR scans. Some works also fuse RGB images with LiDAR scans to obtain richer semantic information.
- Depending on whether a pre-built HD map is used, perception systems fall into two categories: map-based and map-less.
- Map-based systems localize themselves in pre-recorded maps.
- Map-less systems either perform online mapping or implicitly predict road-related affordances.
- Some works represent the perception output as a bird's-eye-view (BEV) spatial grid, while others use a parameterized vector space for a more compact representation.
2.2 Behavior prediction
This module focuses on predicting the future state of the road: taking either the output of the perception module or raw sensor data as input, it predicts the trajectories of other dynamic vehicles.
- Some works predict single, deterministic future trajectories for the detected vehicles.
- Some model multi-modal future trajectories using conditional models.
- Some predict trajectories as Gaussian mixtures to represent uncertainty in Euclidean space.
- Some use latent variables and VAEs to model behavior- and scene-specific uncertainty.
- Others merge the perception and behavior prediction modules and directly predict occupied map states.
2.3 Learning-based motion planning
The two main approaches here are imitation learning and reinforcement learning.
- Pomerleau pioneered the application of imitation learning to autonomous driving: regressing sensor inputs to controls by imitating recorded expert trajectories.
- Some works use conditional branching and high-level commands to extend imitative models to urban driving.
- Some use imitation learning to train a cost volume predictor. 【What exactly is this?】
- Some predict actions from learned affordances. 【Affordances?】
- Some use on-policy distillation to handle distribution shift and to provide a stronger imitation supervision signal.
On the other hand, reinforcement learning trains policies from a user-defined reward function.
- [Some](learning to drive in a day) use DDPG to train a lane-following policy.
- Some use Rainbow-IQN to train an urban driving policy.
- Some use model-based reinforcement learning and distillation to train driving policies offline.
Closely related to the authors' approach are works that train a privileged imitation-learning policy and learn from the other vehicles in the scene:
- PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning
- Learning by Watching
3. Method
The method is split into three parts: perception, motion planner, and controller.
3.1 A vehicle-independent perception model
As Figure 3(a) shows, the perception model is trained with a combination of semantic segmentation and detection losses.
We use three RGB cameras $\mathbf{I}_t=\{I^1_t, I^2_t, I^3_t\}$ surrounding the vehicle and one LiDAR sensor $L_t$ as input. The color and LiDAR inputs are combined using PointPainting 【?】 on the RGB inputs, followed by a light-weight CenterPoint 【?】 with a PointPillars 【?】 3D backbone. The backbone provides a map-view feature representation $f\in\mathbb{R}^{W \times H\times C}$ of width $W$ and height $H$ with $C$ channels.
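To make the fusion step concrete, below is a minimal, hypothetical sketch of the PointPainting idea (decorating LiDAR points with image semantic scores) followed by a drastically simplified pillar-style scatter into a map-view grid. All function names, shapes, and grid sizes are my own assumptions, not the paper's implementation.

```python
import torch

def paint_points(points, sem_scores, pixel_idx):
    """PointPainting-style fusion (simplified): append the per-pixel semantic
    scores of the camera branch to each LiDAR point it projects onto."""
    # points:     (N, 4) LiDAR points (x, y, z, intensity)
    # sem_scores: (K, H, W) per-pixel class scores from an image segmenter
    # pixel_idx:  (N, 2) pixel (row, col) that each point projects to
    per_point_sem = sem_scores[:, pixel_idx[:, 0], pixel_idx[:, 1]].T   # (N, K)
    return torch.cat([points, per_point_sem], dim=1)                    # (N, 4 + K)

def pillar_scatter(painted, grid=128, extent=50.0):
    """Very rough stand-in for a pillar-style backbone: average the painted
    point features falling into each BEV cell, giving f in R^{W x H x C}."""
    C = painted.shape[1]
    f = torch.zeros(grid, grid, C)
    counts = torch.zeros(grid, grid, 1)
    # map x, y in [-extent, extent] metres to integer cell indices
    ij = ((painted[:, :2] + extent) / (2 * extent) * grid).long().clamp(0, grid - 1)
    for (i, j), feat in zip(ij.tolist(), painted):
        f[i, j] += feat
        counts[i, j] += 1
    return f / counts.clamp(min=1)

# Toy usage: 1000 points, 6 semantic classes, 96x320 camera images.
pts = torch.randn(1000, 4) * 10
sem = torch.rand(6, 96, 320)
idx = torch.stack([torch.randint(0, 96, (1000,)), torch.randint(0, 320, (1000,))], dim=1)
bev_features = pillar_scatter(paint_points(pts, sem, idx))   # (128, 128, 10)
```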
3.2 Learning to plan motion from all vehicles
This module has a two-stage design:
First stage:
- We use a standard RNN formulation to predict $n = 10$ future waypoints $y_1, \ldots, y_n \in \mathbb{R}^2$. The motion planner uses a high-level command $c$ and an intermediate GNSS coordinate goal $g\in \mathbb{R}^2$ to perform different driving maneuvers.
- Possible high-level commands $c$ include turn-left, turn-right, go-straight, follow-lane, change-lane-to-left, and change-lane-to-right.
- Let $M(\hat{f}, c) \in \mathbb{R}^{n\times 2}$ be the motion planner conditioned on the high-level command $c$ and the warped features $\hat{f}$ for the Region of Interest (RoI) at the location and orientation of the vehicle in question.
- For all vehicles, we observe their future trajectory to obtain supervision for future waypoints $y$.
- For the ego-vehicle, the simulator provides a ground-truth high-level command $\hat{c}$, which gives sufficient supervision to train the motion planner.
$$\mathcal{L}^{ego}_M = E_{\hat{f},y,\hat{c}}\left[\left\| y - M(\hat{f}, \hat{c})\right\|_1\right]$$
- However, other vehicles do not expose their high-level commands. We instead allow the model to infer the high-level command directly and optimize the plan for the most fitting high-level command.
$$\mathcal{L}^{other}_M = E_{\hat{f},y}\left[\min_{c} \left\| y - M(\hat{f}, c)\right\|_1\right]$$
- At training time we optimize both losses $\mathcal{L}^{ego}_M + \mathcal{L}^{other}_M$ jointly (see the sketch after this list).
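A minimal sketch of how the two losses above could be computed, assuming a planner callable `M(feat, cmd)` that maps a batch of RoI features and command indices to $n\times 2$ waypoints; all names and tensor shapes are my own assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

NUM_COMMANDS = 6  # turn-left, turn-right, go-straight, follow-lane, change-lane-left/right

def ego_loss(M, feat, y, cmd):
    """L^ego_M: L1 error between the plan for the ground-truth command cmd
    and the observed future waypoints y of shape (B, n, 2)."""
    return F.l1_loss(M(feat, cmd), y)

def other_loss(M, feat, y):
    """L^other_M: other vehicles expose no command, so score the plan for
    every candidate command and keep only the best-fitting one."""
    batch = feat.shape[0]
    per_cmd = torch.stack(
        [(M(feat, torch.full((batch,), c)) - y).abs().mean(dim=(1, 2))
         for c in range(NUM_COMMANDS)],
        dim=1)                                   # (B, NUM_COMMANDS)
    return per_cmd.min(dim=1).values.mean()

# Joint objective at training time (sketch):
# loss = ego_loss(M, feat_ego, y_ego, cmd_ego) + other_loss(M, feat_other, y_other)
```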
Second stage:
- In a second stage, we refine the motion plan using an additional RNN-based motion planning network $M'(\hat{f},g,\tilde{y})\in\mathbb{R}^{n\times2}$. The motion refinement network uses the same RoI-warped feature $\hat{f}$, the previously predicted motion plan $\tilde{y}$, and the more fine-grained GNSS goal $g$ as input.
- Since GNSS goals are only available for the ego-vehicle, we train the refinement $M'$ only on ego-vehicle trajectories:
$$
\mathcal{L}^{refine}_{M} = E_{\hat{f},y,\tilde{y},\hat{g}}\left[\left\| \tilde{y} + M'(\hat{f},\hat{g},\tilde{y}) - y \right\|_1 \right]
$$
- During both training and testing, we roll out the same refinement network multiple times to recursively refine the predicted trajectory. The above loss then applies to each step of the rollout.
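How this recursive rollout might look, as a hedged sketch: the same refinement network is applied several times, each pass adds a predicted offset to the current plan, and the L1 loss is accumulated at every step. The names `M_refine` and `num_iters` and the offset formulation are assumptions consistent with the loss above, not confirmed implementation details.

```python
def refine_rollout(M_refine, feat, goal, y_init, y_gt=None, num_iters=3):
    """Recursively refine a motion plan with the same network (sketch).

    feat:    RoI-warped BEV feature for the ego-vehicle
    goal:    fine-grained GNSS goal g
    y_init:  plan from the first-stage planner, shape (B, n, 2)
    y_gt:    ground-truth waypoints; if given, accumulate L^refine_M
             at every rollout step as described above.
    """
    y, loss = y_init, 0.0
    for _ in range(num_iters):
        y = y + M_refine(feat, goal, y)       # M' predicts an offset on the current plan
        if y_gt is not None:
            loss = loss + (y - y_gt).abs().mean()
    return y, loss
```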
As Figures 3(b) and 3(c) show, we learn the motion planner in a privileged distillation framework. We first learn motion planning on ground-truth trajectories, ground-truth perception outputs, and regions of interest using the losses $\mathcal{L}^{ego}_M$, $\mathcal{L}^{other}_M$, $\mathcal{L}^{refine}_M$.
【We then use the privileged motion planner to supervise a motion planner that uses the inferred perception outputs. During this second stage, we supervise predictions on all high-level commands, which leads to a richer supervisory signal. We additionally distill a high-level command classifier for other vehicles, which we use later in the vehicle-aware controller. This stage trains end-to-end by backpropagating gradients from motion prediction and planning to the perception backbone, allowing perception models to attend to the low-level details in the scenes. We keep the pre-training perception loss from the previous stage as auxiliary supervision to regularize the features.】What is this passage saying? I don't fully understand it.
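My reading of this paragraph, as a hedged sketch: the privileged (teacher) planner trained on ground-truth perception supervises a student planner that runs on inferred perception, across all high-level commands rather than only the executed one, while the perception loss is kept as an auxiliary term. All names, the loop over commands, and the weighting below are illustrative assumptions.

```python
import torch

def distillation_step(student, teacher, feat_student, feat_teacher,
                      num_commands=6, perception_loss=0.0, w_aux=1.0):
    """One step of privileged distillation (sketch of my reading).

    The teacher planner sees ground-truth perception (feat_teacher); the
    student planner sees inferred perception (feat_student). Supervising
    the student on every high-level command gives a denser signal than the
    single executed command, and because feat_student comes from the
    perception backbone, gradients flow end-to-end into perception.
    """
    loss = 0.0
    for c in range(num_commands):
        with torch.no_grad():
            target = teacher(feat_teacher, c)             # teacher plan for command c
        loss = loss + (student(feat_student, c) - target).abs().mean()
    # keep the pre-training perception loss as auxiliary regularization
    return loss + w_aux * perception_loss
```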
3.3 Vehicle-aware control
Two PID controllers are used, one for steering and one for acceleration.
Perception
We use PointPillars with PointPainting as our multi-modal 3D perception backbone $P_B$.
Prediction and Planning
There are rollouts along waypoints and rollouts along refinement iterations.
Both motion planners use a linear layer to transform GRU states into the desired outputs.
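To make the "GRU states → linear layer → outputs" statement concrete, here is a small, hypothetical waypoint decoder in the same spirit; the layer sizes, the autoregressive offset formulation, and the conditioning are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class WaypointDecoder(nn.Module):
    """Toy version of the 'GRU state -> linear layer -> output' idea: a GRUCell
    is rolled out n steps and a linear layer turns each hidden state into a 2-D
    offset that is accumulated into waypoints."""
    def __init__(self, feat_dim=128, hidden_dim=64, n=10):
        super().__init__()
        self.n = n
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # condition on the RoI feature
        self.gru = nn.GRUCell(input_size=2, hidden_size=hidden_dim)
        self.out = nn.Linear(hidden_dim, 2)            # linear layer on the GRU state

    def forward(self, feat):
        h = torch.tanh(self.init_h(feat))
        wp = feat.new_zeros(feat.shape[0], 2)          # start at the vehicle origin
        waypoints = []
        for _ in range(self.n):
            h = self.gru(wp, h)
            wp = wp + self.out(h)
            waypoints.append(wp)
        return torch.stack(waypoints, dim=1)           # (B, n, 2)

# plan = WaypointDecoder()(torch.randn(4, 128))        # -> (4, 10, 2)
```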
Control
The controller $C$ takes as input the refined ego-trajectory $\tau = M'(\hat{f},\hat{g},\tilde{y})$.
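A hedged sketch of the two-PID control described in Section 3.3: one loop tracks the heading toward a look-ahead waypoint of the refined trajectory $\tau$, the other tracks a target speed. The gains, the look-ahead index, and the waypoint-to-target conversion are illustrative assumptions, not the paper's tuned values.

```python
import math

class PID:
    """Textbook PID loop; gains below are illustrative, not the paper's tuning."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, err, dt=0.05):
        self.integral += err * dt
        deriv = (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

steer_pid = PID(kp=1.0, ki=0.0, kd=0.2)   # lateral: track heading to a look-ahead waypoint
speed_pid = PID(kp=0.5, ki=0.05, kd=0.0)  # longitudinal: track a target speed

def control(trajectory, current_speed, target_speed, lookahead=3):
    """trajectory: refined ego plan tau in the vehicle frame, a list of (x, y) waypoints."""
    x, y = trajectory[min(lookahead, len(trajectory) - 1)]
    steer = steer_pid.step(math.atan2(y, x))                 # heading error to the waypoint
    accel = speed_pid.step(target_speed - current_speed)
    throttle, brake = (accel, 0.0) if accel >= 0 else (0.0, -accel)
    return steer, throttle, brake
```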
4. Experiments
Performance on the CARLA leaderboard plus an ablation study. The ablation evaluations cannot be obtained on the leaderboard, so the authors ran them in a local environment.
5. Discussion
The traffic infractions metric is not very good, i.e., the agent often fails to obey traffic rules.