An Introduction to Motion Prediction with Literature Review

Inmeta
22 min read · Jan 29, 2021

And which approaches you can choose from for your computer vision projects

By Greta Elvestuen, PhD, Data Scientist consultant at Inmeta

Most humans are able to naturally navigate through many social interaction scenarios, having an intrinsic capacity to reason about other people’s intents, beliefs and desires. They use such reasoning in order to predict what might happen in the future and make corresponding decisions. As technology develops, the relevance of this ability to predict changes in an environment, as well as object behavior, has been steadily increasing over recent decades.

Hence, motion prediction has found its way into a variety of areas, such as autonomous driving, surveillance systems, scene understanding, weather forecasting and many others, making it one of the most engaging domains within the field of computer vision today. While motion may be predicted using both vision-based data (visual information about subjects and scenes) and sensor data (motion tracking sensors, either wearable or environmental), the focus of the present review is work based on visual appearance.

In this sense, motion prediction may be divided into five main areas — video prediction, action prediction, trajectory prediction, human motion prediction and other applications. In the following sections, each area is briefly presented, followed by recent and arguably most relevant examples from research on the matter. Since the aim is to lower the threshold for getting familiar with the field and to start employing these ideas in one’s own computer vision projects, all examples are articles with publicly available code.

Video prediction

The task of video prediction is to forecast future frames based on previous ones. It has received much interest due to its relevance for many computer vision applications, such as autonomous vehicles or robotics. However, supervised methods for video frame prediction rely on labeled data, which may not always be available. In addition, despite much recent progress, the task remains challenging, primarily due to the high nonlinearity in the spatial domain.

Prediction of the next frames (X(t+1), …, X(t+m)) given a sequence of context frames (X(t−n), …, X(t)), where n and m denote the number of context and predicted frames, respectively.

Oprea et al. (2020)

Uncertainty is a fundamental issue in the field of video prediction, as many future outcomes are possible for a sequence of observations, implying that prediction of future frames in a video sequence is a demanding generative modeling task. Predictions from deterministic models rapidly degrade over time as uncertainty grows, converging to an average of the plausible future outcomes. Moreover, the predictor needs to model both scene contents and motion. In recent years, deep learning approaches have become the preferred choice for this task. Although having a deep network which can learn all the aspects of the task by itself is appealing, the history of deep learning shows that an appropriate network structure is key for learning from limited data. For instance, typical properties of images are reflected in the structure of hierarchical convolutional networks.

The most promising approaches in video prediction include feedforward architectures and Recurrent Neural Networks (RNNs) — such as Gated Recurrent Units (GRUs), Long Short-Term Memory (LSTM) networks, Convolutional LSTMs (ConvLSTMs) or combinations thereof. Architectural designs and training strategies such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) or Conditional VAEs (CVAEs) are also highly popular within the field, while the most common metrics are presented in the figure below. Following are exemplary research papers within the area.

Metrics most commonly used in video prediction applications.

Rasouli (2020)
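To make the recurrent approach mentioned above more concrete, below is a minimal sketch (in PyTorch) of a ConvLSTM-based next-frame predictor: the cell computes the LSTM gates with convolutions, runs over the context frames and decodes the final hidden state into a predicted frame X(t+1). All layer sizes, names and the training loss are illustrative assumptions, not taken from any of the cited papers.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """A single ConvLSTM cell: the LSTM gates are computed with convolutions."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.hidden_channels = hidden_channels
        # One convolution produces all four gates (input, forget, cell, output).
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class NextFramePredictor(nn.Module):
    """Encode n context frames with a ConvLSTM, then decode one future frame."""

    def __init__(self, channels=1, hidden=32):
        super().__init__()
        self.cell = ConvLSTMCell(channels, hidden)
        self.to_frame = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, frames):                            # frames: (B, T, C, H, W)
        b, t, _, height, width = frames.shape
        h = frames.new_zeros(b, self.cell.hidden_channels, height, width)
        c = torch.zeros_like(h)
        for step in range(t):                             # run over the context frames
            h, c = self.cell(frames[:, step], (h, c))
        return torch.sigmoid(self.to_frame(h))            # predicted frame X(t+1)


model = NextFramePredictor()
context = torch.rand(2, 5, 1, 64, 64)                     # five context frames
prediction = model(context)                               # (2, 1, 64, 64)
loss = nn.functional.mse_loss(prediction, torch.rand(2, 1, 64, 64))
```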

Order Matters: Shuffling Sequence Generation for Video Prediction

In contrast to previous methods focusing on generating more realistic content, Wang and colleagues extensively studied in 2019 the importance of sequential order information for video generation. A novel Shuffling sEquence gEneration network (SEE-Net) was proposed that learns to discriminate unnatural sequential orders by shuffling the video frames and comparing them to the real video sequence. Systematic experiments on three datasets with both synthetic and real-world videos confirmed the effectiveness of shuffling sequence generation for video prediction in the proposed model. State-of-the-art performance was also demonstrated by both qualitative and quantitative evaluations.

Humans can figure out the correct order of shuffled video frames (2–1–3); doing so forces attention to be paid to the temporal information
The proposed video prediction framework
Qualitative comparison to state-of-the-art methods on the Moving MNIST dataset
Quantitative results of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) on the KTH (a and b) and MSR (c and d) datasets, compared with MCNet, DrNet and the proposed model without shuffling sequence.

Wang et al. (2019)
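As a rough illustration of the order-discrimination idea behind SEE-Net (not the actual published architecture), the toy sketch below shuffles the frames of a short clip and trains a binary classifier to tell shuffled from naturally ordered sequences; the encoder, feature sizes and clip length are assumptions.

```python
import torch
import torch.nn as nn


class OrderDiscriminator(nn.Module):
    """Classify whether a three-frame clip is in its natural temporal order."""

    def __init__(self, channels=1, feat=64, clip_len=3):
        super().__init__()
        self.encode = nn.Sequential(                       # per-frame feature extractor
            nn.Conv2d(channels, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classify = nn.Linear(feat * clip_len, 1)      # natural-order logit

    def forward(self, clip):                               # clip: (B, clip_len, C, H, W)
        feats = [self.encode(clip[:, t]) for t in range(clip.shape[1])]
        return self.classify(torch.cat(feats, dim=1))


disc = OrderDiscriminator()
real = torch.rand(4, 3, 1, 32, 32)                         # naturally ordered clips
# A random shuffle may occasionally equal the natural order; ignored in this toy example.
shuffled = real[:, torch.randperm(3)]                      # temporally shuffled clips
logits = disc(torch.cat([real, shuffled], dim=0))
labels = torch.cat([torch.ones(4, 1), torch.zeros(4, 1)])  # 1 = natural order
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
```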

Unsupervised Keypoint Learning for Guiding Class-Conditional Video Prediction

Also in 2019, Kim and colleagues proposed a deep video prediction model conditioned on a single image and an action class. In order to generate future frames, the keypoints of a moving object are first detected and the future motion is predicted as a sequence of keypoints. The input image is then translated following the predicted keypoint sequence to compose the future frames. Detecting the keypoints is central to this algorithm, and the method was trained to detect the keypoints of arbitrary objects in an unsupervised manner.

Moreover, the detected keypoints of the original videos were used as pseudo-labels in order to learn the motion of objects. Experimental results showed that this method can be successfully applied to various datasets without the cost of labeling keypoints in videos. The detected keypoints were similar to human-annotated labels, and the prediction results were more realistic than those of previous methods.

Overview of the method at inference time

The method generates future frames through three stages: keypoint detection, motion generation and keypoints-guided image translation

Failure cases.

Kim et al. (2019)

The results of the keypoints-guided image translation from:

a) The baseline method

b) Proposed network without the mask

c) Proposed network

Inception-inspired LSTM for Next-frame Video Prediction

The same year, Hosseini and colleagues presented a novel self-supervised deep-learning method, Inception-inspired LSTM, for video frame prediction. The general idea of inception networks is to implement wider networks instead of deeper networks, a design that has been shown to improve the performance of image classification. The method was evaluated on both Inception-v1 and Inception-v2 structures. The proposed Inception LSTM methods were compared with the convolutional LSTM when applied within the PredNet predictive coding framework on both the KITTI and KTH datasets. It was observed that the Inception-based LSTMs outperformed the convolutional LSTM. In addition, Inception LSTM had better prediction performance than Inception v2 LSTM, while Inception v2 LSTM had a lower computational cost.

Comparison of output of the Convolutional LSTMs and Inception LSTM on the KITTI dataset

a) The actual frame

b) Prediction using a convolutional LSTM

c) Prediction using Inception-inspired LSTM Version 1

d) Prediction using Inception-inspired LSTM Version 2

KITTI dataset next-frame prediction performance as a function of the number of previous frames used in the history. Left: Mean Square Error (MSE). Right: SSIM.

Hosseini et al. (2019)
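The core idea — make the recurrent cell wider rather than deeper — can be sketched as a ConvLSTM whose gates are computed by parallel convolutions with different kernel sizes, roughly in the spirit of the paper; the exact wiring of the published Inception LSTM variants may differ, and all sizes here are assumptions.

```python
import torch
import torch.nn as nn


class InceptionGate(nn.Module):
    """Compute gate pre-activations with parallel convolutions of several kernel sizes."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, 1)
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, out_ch, 5, padding=2)

    def forward(self, x):
        # Wider, not deeper: three receptive fields contribute to every gate.
        return self.branch1(x) + self.branch3(x) + self.branch5(x)


class InceptionLSTMCell(nn.Module):
    def __init__(self, in_ch, hidden_ch):
        super().__init__()
        self.gates = InceptionGate(in_ch + hidden_ch, 4 * hidden_ch)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


cell = InceptionLSTMCell(in_ch=1, hidden_ch=16)
x = torch.rand(2, 1, 64, 64)
h = torch.zeros(2, 16, 64, 64)
h, c = cell(x, (h, torch.zeros_like(h)))                   # one recurrent step
```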

Action prediction

Anticipating the near future is a natural task for humans and a fundamental one for intelligent systems when it is necessary to react before an action is completed (e.g., an autonomous vehicle anticipating that a pedestrian will cross the street) or even before it starts (e.g., notifying a user who is performing the wrong action in a known workflow). Nevertheless, tasks such as action anticipation and early action recognition pose a series of fundamental challenges from a computational perspective. Indeed, methods addressing these tasks need to model the relationships between past events, future events and incomplete observations.

Action prediction from the observed sequence.

Liu et al. (2017)

One of the major challenges for autonomous vehicles in urban environments is to understand and predict other road users’ actions, especially pedestrians’ actions at the point of crossing. The common approach to solving this problem is to use the motion history of the agents in order to predict their future trajectories. However, pedestrians exhibit highly variable actions, most of which cannot be understood without visual observation of the pedestrians themselves and their surroundings. While most work on analyzing human actions in videos focuses on classifying and labeling observed video frames or anticipating the very recent future, making long-term predictions over more than just a few seconds is an important and challenging task with many practical applications.

The most popular methods within action anticipation are variations of RNN-based architectures, such as LSTMs, ConvLSTMs and Quasi-RNNs (QRNNs). In early action prediction, there is no strong preference for feedforward or recurrent approaches, and temporal CNNs are also often applied. The most common metrics used in the area of action prediction are depicted in the figure below. Following are exemplary research papers in this domain.

Metrics used in action prediction applications.

Rasouli (2020)
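As a minimal illustration of the recurrent approach to early action prediction, the sketch below runs an LSTM over per-frame features and emits a class prediction at every timestep, so an action can be recognized from a partial observation; the feature dimensions, class count and per-step supervision are assumptions rather than any specific published method.

```python
import torch
import torch.nn as nn


class EarlyActionLSTM(nn.Module):
    """Predict the action class after every observed frame feature."""

    def __init__(self, feat_dim=512, hidden=256, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, feats):              # feats: (B, T, feat_dim) per-frame features
        out, _ = self.lstm(feats)
        return self.head(out)              # (B, T, classes): one prediction per step


model = EarlyActionLSTM()
feats = torch.rand(2, 30, 512)             # 30 observed frames per video
logits = model(feats)

# Supervising every timestep with the video-level label encourages confident
# predictions from only a partial observation of the action.
labels = torch.randint(0, 10, (2,))
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 10), labels.repeat_interleave(30))
```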

When Will You Do What? — Anticipating Temporal Occurrences of Activities

In 2018, Abu Farha and colleagues proposed two methods for predicting a considerable number of future human actions and their durations. Both a CNN and an RNN were trained to predict future video labels based on previously seen content. These methods generated accurate predictions of the future even for long videos of up to several minutes in length. The predictions scaled well across different datasets and videos with varying lengths, varying quality of input information (noisy or erroneous observed data) and large variations in the possible future actions.

Architecture of the RNN system
Architecture of the CNN-based anticipation approach
Results for future action prediction without ground-truth observations.

Abu Farha et al. (2018)

What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention

Egocentric action anticipation refers to understanding which objects the camera wearer will interact with in the near future and what actions they will perform. Furnari and Farinella tackled this challenge in 2019 by proposing an architecture able to anticipate actions at multiple temporal scales using two LSTMs in order to 1) summarize the past and 2) formulate predictions about the future. The input video was processed considering three complementary modalities: appearance (RGB), motion (optical flow) and objects (object-based features). Modality-specific predictions were fused using a novel Modality ATTention (MATT) mechanism, which learns to weigh the modalities in an adaptive fashion.

Extensive evaluations on two large-scale benchmark datasets showed that this method outperformed prior art by up to +7 % on the challenging EPIC-Kitchens dataset, which includes more than 2,500 actions, and generalizes to EGTEA Gaze+. The approach was also shown to generalize to the tasks of early action recognition and action prediction. The method additionally achieved the best ranking in the public leaderboard of the EPIC-Kitchens egocentric action anticipation challenge 2019.

Egocentric Action Anticipation
Example of the proposed architecture with two modalities.

Furnari & Farinella (2019)
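Below is a hedged sketch of the modality-attention idea (not the published MATT module): a small network scores each modality from the concatenated modality features, and the per-modality anticipation scores are fused with the resulting softmax weights; all shapes and dimensions are assumptions.

```python
import torch
import torch.nn as nn


class ModalityAttentionFusion(nn.Module):
    """Fuse per-modality action scores with adaptively learned modality weights."""

    def __init__(self, feat_dim=256, num_modalities=3, num_actions=100):
        super().__init__()
        self.att = nn.Linear(feat_dim * num_modalities, num_modalities)

    def forward(self, feats, scores):
        # feats:  (B, M, feat_dim)  one summary feature per modality (RGB, flow, objects)
        # scores: (B, M, actions)   anticipation scores from each modality branch
        w = torch.softmax(self.att(feats.flatten(1)), dim=-1)   # (B, M) modality weights
        return (w.unsqueeze(-1) * scores).sum(dim=1)            # fused scores (B, actions)


fuse = ModalityAttentionFusion()
feats = torch.rand(4, 3, 256)
scores = torch.rand(4, 3, 100)
fused = fuse(feats, scores)                                      # (4, 100)
```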

Pedestrian Action Anticipation using Contextual Feature Fusion in Stacked RNNs

A year later, Rasouli and colleagues proposed a solution to the challenge of pedestrian action anticipation at the point of crossing in traffic. This approach used a novel stacked RNN architecture in which information collected from various sources (both scene dynamics and visual features) is gradually fused into the network at different levels of processing. In extensive empirical evaluations, the proposed algorithm achieved higher prediction accuracy than alternative recurrent network architectures. Experiments were conducted in order to investigate the impact of the length of observation, time to event and types of features on the performance of the proposed method. In addition, it was demonstrated how different data fusion strategies impact prediction accuracy.

Examples of pedestrians prior to making crossing decisions. Green and red colors indicate whether the pedestrian will or will not cross
The architecture of the proposed algorithm SF-GRU, comprising five GRUs, each processing a concatenation of features from different modalities and the hidden states of the GRU in the previous level.

Rasouli et al. (2020)
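A sketch of the gradual-fusion idea, in the spirit of SF-GRU rather than the exact published architecture: each GRU level processes one feature source concatenated with the hidden sequence of the level below, and the last hidden state drives the crossing prediction. The number of sources and all dimensions are assumptions.

```python
import torch
import torch.nn as nn


class StackedFusionGRU(nn.Module):
    """Stacked GRUs that gradually fuse feature modalities level by level."""

    def __init__(self, feat_dims, hidden=128):
        super().__init__()
        self.levels = nn.ModuleList()
        prev = 0
        for d in feat_dims:                          # one GRU per feature source
            self.levels.append(nn.GRU(d + prev, hidden, batch_first=True))
            prev = hidden
        self.head = nn.Linear(hidden, 1)             # crossing / not crossing logit

    def forward(self, modalities):                   # list of (B, T, d_i) tensors
        h_seq = None
        for gru, feats in zip(self.levels, modalities):
            x = feats if h_seq is None else torch.cat([feats, h_seq], dim=-1)
            h_seq, _ = gru(x)                        # (B, T, hidden)
        return self.head(h_seq[:, -1])               # prediction at the last step


model = StackedFusionGRU(feat_dims=[512, 512, 64, 32, 4])   # e.g. five feature sources
mods = [torch.rand(2, 15, d) for d in [512, 512, 64, 32, 4]]
logit = model(mods)                                          # (2, 1)
```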

Trajectory prediction

Trajectory prediction algorithms predict future trajectories of objects, hence their future positions over time. These approaches are especially relevant for applications within intelligent driving and surveillance. Predicted trajectories may be applied directly, for instance in route planning for autonomous vehicles, or employed for predicting future events, actions or anomalies.

Moreover, deciphering human behavior from videos in order to predict future paths/trajectories and actions is highly important in many areas. For instance, for autonomous vehicles (AVs) to behave appropriately and safely on roads populated by human-driven vehicles, they need to be able to reason about the uncertain intentions and decisions of other drivers from rich perceptual information. Autonomous driving requires reasoning about the future behaviors of agents in a variety of situations, such as stop signs, roundabouts, crosswalks, parking, merging, etc. In multi-agent settings, each agent’s behavior affects the behavior of many other agents involved.

Trajectory prediction of other agents.

Ivanovic (2020)

The most common architectures for trajectory prediction are recurrent, such as LSTMs and GRUs. Feedforward algorithms are also applied, while a few algorithms use hybrid approaches in which both convolutional and recurrent reasoning are employed. The most prevalent metrics in trajectory prediction are presented in the figure below, followed by brief presentations of exemplary research in the area.

Metrics used in trajectory prediction applications.

Rasouli (2020)
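To make the recurrent trajectory-prediction setup concrete, here is a minimal sketch of an LSTM encoder-decoder that consumes observed (x, y) positions and rolls out a fixed future horizon as per-step displacements; the observation and prediction lengths and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TrajectoryPredictor(nn.Module):
    """Encode observed positions with an LSTM and decode future positions step by step."""

    def __init__(self, hidden=64, pred_len=12):
        super().__init__()
        self.encoder = nn.LSTM(2, hidden, batch_first=True)
        self.decoder = nn.LSTMCell(2, hidden)
        self.out = nn.Linear(hidden, 2)
        self.pred_len = pred_len

    def forward(self, obs):                      # obs: (B, T_obs, 2) observed positions
        _, (h, c) = self.encoder(obs)
        h, c = h[0], c[0]
        pos = obs[:, -1]                         # start from the last observed point
        preds = []
        for _ in range(self.pred_len):
            h, c = self.decoder(pos, (h, c))
            pos = pos + self.out(h)              # predict a displacement per step
            preds.append(pos)
        return torch.stack(preds, dim=1)         # (B, pred_len, 2) future positions


model = TrajectoryPredictor()
observed = torch.rand(8, 8, 2)                   # 8 agents, 8 observed positions each
future = model(observed)                         # (8, 12, 2)
```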

TraPHic: Trajectory Prediction in Dense and Heterogeneous Traffic Using Weighted Interactions

In 2019, Chandra and colleagues presented a new algorithm for predicting the near-term trajectories of road agents in dense traffic videos. It was designed for heterogeneous traffic, where the road agents may be buses, cars, scooters, bicycles or pedestrians. The interactions between different road agents were modeled using a novel LSTM-CNN hybrid network for trajectory prediction. Specifically, heterogeneous interactions that implicitly account for the varying shapes, dynamics and behaviors of different road agents were taken into account.

In addition, horizon-based interactions were used to implicitly model the driving behavior of each road agent. The performance of this prediction algorithm, TraPHic, was evaluated on standard datasets, and a new dense heterogeneous traffic dataset corresponding to urban Asian videos and agent trajectories was also introduced. TraPHic outperformed state-of-the-art methods on dense traffic datasets by 30 %.

Trajectory prediction in dense heterogeneous traffic conditions. Proposed algorithm (TraPHic) in red and ground truth (GT) in green
TraPHic network architecture.

Chandra et al. (2019)

Peeking into the Future: Predicting Future Person Activities and Locations in Videos

Also in 2019, Liang and colleagues studied the joint prediction of a pedestrian’s future path and future activities. An end-to-end, multi-task learning system utilizing rich visual features about human behavioral information and interaction with the surroundings was proposed. To facilitate training, the network was learned with an auxiliary task of predicting the future location in which the activity will happen. Experimental results demonstrated state-of-the-art performance on two public benchmarks for future trajectory prediction. Moreover, this method was able to produce meaningful future activity predictions in addition to the path. The results provide the first empirical evidence that joint modeling of paths and activities benefits future path prediction.

The goal is to jointly predict a person’s future path and activity
Overview of the model.

Liang et al. (2019)

PRECOG: PREdiction Conditioned On Goals in Visual Multi-Agent Settings

The same year, Rhinehart and colleagues presented a probabilistic forecasting model of future interactions between a variable number of agents, with the aim of enabling AVs to reason about the uncertain intentions and decisions of other drivers in traffic. Both standard forecasting and the novel task of conditional forecasting were performed, the latter reasoning about how all agents will likely respond to the goal of a controlled agent (here, the AV). The models were trained on real and simulated data in order to forecast vehicle trajectories given past positions and LIDAR.

The evaluation showed that the proposed model was substantially more accurate in multi-agent driving scenarios than the existing state of the art. Beyond its general ability to perform conditional forecasting queries, it was shown that the model’s predictions of all agents improved when conditioned on knowledge of the AV’s goal, further illustrating its capability to model agent interactions.

Forecasting on nuScenes
Factorized latent variable model of forecasting and planning shown for two agents.

Rhinehart et al. (2019)

Human motion prediction

Human motion prediction aims to forecast future human poses given some past motion. These algorithms primarily focus on predicting changes in the dynamics of the observed agents. They play a crucial role in other prediction approaches (such as video or trajectory prediction) as an intermediate step, for instance by reflecting which types of actions to expect or what future visual representations would look like. In line with other fields of prediction, human motion prediction is dominated by deep learning methods, while some models are still based on classical approaches.

Human motion prediction, short- and long-term.

Mao et al. (2019)

These methods broadly use recurrent architectures, such as LSTMs and GRUs, or their combinations. Some approaches also apply feedforward network architectures. In order to train prediction models, adversarial training methods may be employed as well, where a discriminator is used to classify whether the predicted poses are real. The most common metrics within human motion prediction are depicted in the figure below.

Metrics used in human motion prediction applications.

Rasouli (2020)

We know that human movement is goal-directed and influenced by the spatial layout of the objects in the scene. In order to plan future human motion, it is essential to perceive the environment (imagine how hard it is to navigate a new room with the lights off). Notwithstanding the development of many effective human motion prediction algorithms, the majority of these works have paid little attention to the scene context and have thus struggled with long-term predictions. Whether based on recurrent or feed-forward neural networks, most methods also struggle to exploit the observation that human motion has a tendency to repeat itself, even for complex sports and cooking activities. The following exemplary research looks into some of these challenges.

Long-Term Human Motion Prediction with Scene Context

In 2020, Cao and colleagues proposed a novel three-stage framework that exploits scene context in order to tackle long-term prediction of human motion. Given a single scene image and 2D pose histories, the method first samples multiple human motion goals, then plans 3D human paths towards each goal, and finally predicts 3D human pose sequences following each path. A diverse synthetic dataset with clean annotations was contributed for stable training and rigorous evaluation. On both synthetic and real datasets, the method showed consistent quantitative and qualitative improvements over previous methods.

Long-term 3D human motion prediction
Network architecture.

Cao et al. (2020)

History Repeats Itself: Human Motion Prediction via Motion Attention

Also in 2020, Mao and colleagues introduced an attention-based feed-forward network that explicitly leverages the observation that human motion tends to repeat itself. Specifically, instead of modeling frame-wise attention via pose similarity, they proposed extracting motion attention in order to capture the similarity between the current motion context and historical motion sub-sequences. Aggregating the relevant past motions and processing the result with a graph convolutional network made it possible to effectively exploit motion patterns from the long-term history when predicting future poses. Experiments on Human3.6M, AMASS and 3DPW showed the benefits of this approach for both periodical and non-periodical actions, and the attention model yielded state-of-the-art results on all three datasets.

Overview of the approach
Long-term prediction of joint angles on H3.6M.

Mao et al. (2020)
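A hedged sketch of the motion-attention idea: the most recent poses form the query, sub-sequences from the longer history form the keys and values, and the history is aggregated with the resulting attention weights before being passed on to the predictor (in the paper, a graph convolutional network). The window lengths, dimensions and dot-product scoring below are assumptions.

```python
import torch
import torch.nn as nn


class MotionAttention(nn.Module):
    """Attend from the current motion context to sub-sequences of the history."""

    def __init__(self, pose_dim=48, window=10, dim=128):
        super().__init__()
        self.q = nn.Linear(pose_dim * window, dim)
        self.k = nn.Linear(pose_dim * window, dim)
        self.window = window
        self.pose_dim = pose_dim

    def forward(self, history, context):
        # history: (B, N, window, pose_dim)  N candidate sub-sequences from the past
        # context: (B, window, pose_dim)     the most recent observed poses
        q = self.q(context.flatten(1)).unsqueeze(1)        # (B, 1, dim)
        k = self.k(history.flatten(2))                      # (B, N, dim)
        att = torch.softmax((q * k).sum(-1), dim=-1)        # (B, N) attention weights
        v = history.flatten(2)                              # (B, N, window * pose_dim)
        agg = (att.unsqueeze(-1) * v).sum(dim=1)            # weighted history summary
        return agg.view(-1, self.window, self.pose_dim)


att = MotionAttention()
hist = torch.rand(2, 20, 10, 48)        # 20 sub-sequences of 10 poses each
ctx = torch.rand(2, 10, 48)             # the last 10 observed poses
summary = att(hist, ctx)                # (2, 10, 48): history aggregated by attention
```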

Dynamic Multiscale Graph Neural Networks for 3D Skeleton Based Human Motion Prediction

The same year, Li and colleagues proposed a novel dynamic multiscale graph neural network (DMGNN) for predicting 3D skeleton-based human motion. The core idea of DMGNN is to use a multiscale graph to comprehensively model the internal relations of the human body for motion feature learning. This multiscale graph is adaptive during training and dynamic across network layers. Based on this graph, a multiscale graph computational unit (MGCU) was proposed to extract features at individual scales and fuse features across scales.

The entire model is action-category-agnostic and follows an encoder-decoder framework. The encoder consists of a sequence of MGCUs for learning motion features, while the decoder uses a proposed graph-based gated recurrent unit to generate future poses. Extensive experiments showed that the proposed DMGNN outperformed state-of-the-art methods in both short- and long-term predictions on the Human3.6M and CMU Mocap datasets. The learned multiscale graphs were further investigated for interpretability.

The architecture of DMGNN, which used an encoder-decoder framework for motion prediction
Comparison of Mean Absolute Errors (MAEs) between proposed model and the state-of-the-art methods on the eight actions of CMU Mocap dataset.

Li et al. (2020)
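As a minimal illustration of graph-based reasoning over a skeleton (a single-scale building block, not the full multiscale DMGNN), the sketch below mixes joint features through a learnable, adaptive adjacency matrix before a per-joint linear transform; the joint count, dimensions and initialization are assumptions.

```python
import torch
import torch.nn as nn


class GraphConv(nn.Module):
    """One graph convolution over skeleton joints with a learnable adjacency."""

    def __init__(self, in_dim, out_dim, num_joints):
        super().__init__()
        # Adaptive adjacency between joints, initialized close to the identity.
        self.adj = nn.Parameter(
            torch.eye(num_joints) + 0.01 * torch.randn(num_joints, num_joints))
        self.weight = nn.Linear(in_dim, out_dim)

    def forward(self, x):                          # x: (B, num_joints, in_dim)
        x = torch.softmax(self.adj, dim=-1) @ x    # mix features across joints
        return torch.relu(self.weight(x))          # per-joint feature transform


layer = GraphConv(in_dim=3, out_dim=64, num_joints=22)
poses = torch.rand(8, 22, 3)                       # 22 joints with 3D coordinates
feats = layer(poses)                               # (8, 22, 64)
```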

Other motion prediction applications

Besides the four areas of motion prediction elaborated above, there are many other, smaller applications that are rather distinct from each other and best fit within the merged category of “other”. These applications of motion prediction involve autonomous robotics, predicting the popularity of tweets based on the tweeted images and the users’ histories, predicting election results based on candidates’ facial attributes, weather and solar irradiance forecasting, predicting fashion trends, steering angle prediction, storyline forecasting, anticipating the effect of force after manipulating objects, pain prediction, anticipating the winner of Miss Universe based on the appearances of contestants’ gowns and many others.

In the field of autonomous robotics, some algorithms are intended to predict occupancy grid maps (OGMs), grayscale representations of the robot’s surroundings indicating which sections of the environment are traversable. Such methods are often object-agnostic and focused on generating future OGMs, which are then used by an autonomous agent for path planning. Prediction techniques are also broadly applied in other visual processing areas, such as anomaly detection, video summarization, active object recognition, tracking, action recognition and detection, etc.

Anomaly detection.

Liu et al. (2018)

Both classical and deep learning methods have been applied in these approaches over recent years. The deep learning techniques are rather similar to the video prediction approaches, where the model receives a sequence of OGMs as input and predicts the future ones for some time period. Both feedforward and recurrent architectures are common in this setting. Metrics often applied among other motion prediction applications are rather similar to those within the four main areas and include accuracy, precision, recall, percentage of correct predictions (PCP), Matthews Correlation Coefficient (MCC), Euclidean Distance (ED), Root Mean Square Error (RMSE), MSE, MAE, Mean Absolute Percentage Error (MAPE), normalized MAPE (nMAPE), Spearman’s Rank Correlation (SRC), OGM, segmentation maps, SSIM, PSNR, psi (ψ), True Positive (TP), True Negative (TN), the Receiver Operating Characteristic (ROC) curve over TP and TN, F1-score, Area Under the Curve (AUC), Run Time (RT), Intersection over Union (IoU), Average Precision (AP), End-Point Error (EPE), Probabilistic Rand Index (RI), Global Consistency Error (GCE) and Variation of Information (VoI). Following are exemplary research papers within this area.

Predicting Deeper into the Future of Semantic Segmentation

In 2017, Luc and colleagues introduced the novel task of predicting semantic segmentations of future frames. Given a sequence of video frames, the goal is to predict segmentation maps of not-yet-observed video frames that lie up to a second or further in the future. An autoregressive convolutional neural network was developed that learns to iteratively generate multiple frames. The results on the Cityscapes dataset showed that directly predicting future segmentations is substantially better than predicting and then segmenting future RGB frames. Prediction results up to half a second into the future were visually convincing and much more accurate than those of a baseline based on warping semantic segmentations using optical flow.

Proposed models learn semantic-level scene dynamics to predict semantic segmentations of unobserved future frames given several past frames
Multi-scale architecture of the S2S model that predicts the semantic segmentation of the next frame given the segmentation maps of the previous frames.

Luc et al. (2017)

Predicting Future Instance Segmentation by Forecasting Convolutional Features

For predicting the semantic segmentation of future frames, forecasting at the semantic level has been shown to be more effective than forecasting RGB frames and then segmenting them. In 2018, Luc and colleagues considered the more challenging problem of future instance segmentation, which additionally segments out individual objects. In order to deal with a varying number of output labels per image, a predictive model was developed in the space of fixed-sized convolutional features of the Mask R-CNN instance segmentation model. The “detection head” of Mask R-CNN was applied to the predicted features to produce the instance segmentation of future frames. Experiments showed that this approach significantly improved over strong baselines based on optical flow and repurposed instance segmentation architectures.

Predicting 0.5 s into the future
Left: Features in the Feature Pyramid Network (FPN) backbone are obtained by upsampling features in the top-down path and combining them with features from the bottom-up path at the same resolution. Right: FPN features are extracted from frames t − τ to t and used to predict the FPN features for frame t + 1.

Luc et al. (2018)

Real-World Anomaly Detection in Surveillance Videos

The same year, Sultani and colleagues proposed learning anomalies by exploiting both normal and anomalous videos, since surveillance videos capture a variety of realistic anomalies. In order to avoid annotating the anomalous segments or clips in training videos (which is extremely time consuming), they proposed learning anomalies through a deep multiple instance ranking framework that leverages weakly labeled training videos, i.e. the training labels (anomalous or normal) are at the video level instead of the clip level. In this approach, normal and anomalous videos are considered as bags and video segments as instances in multiple instance learning (MIL). A deep anomaly ranking model was automatically learned that predicts high anomaly scores for anomalous video segments. Furthermore, sparsity and temporal smoothness constraints were introduced in the ranking loss function in order to better localize anomalies during training.
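A hedged sketch of such a MIL ranking objective: the highest-scoring segment of an anomalous video is pushed above the highest-scoring segment of a normal video, with temporal-smoothness and sparsity terms on the anomalous segment scores. The weighting coefficients and segment count below are assumptions, not the published values.

```python
import torch


def mil_ranking_loss(scores_anomalous, scores_normal,
                     lambda_smooth=1e-4, lambda_sparse=1e-4):
    """Hinge ranking loss between the top segments of an anomalous and a normal video,
    plus temporal smoothness and sparsity terms on the anomalous segment scores."""
    hinge = torch.clamp(1.0 - scores_anomalous.max() + scores_normal.max(), min=0.0)
    smooth = ((scores_anomalous[1:] - scores_anomalous[:-1]) ** 2).sum()
    sparse = scores_anomalous.sum()
    return hinge + lambda_smooth * smooth + lambda_sparse * sparse


# Toy segment scores for one anomalous and one normal video (32 segments each).
anom_logits = torch.rand(32, requires_grad=True)
norm_logits = torch.rand(32, requires_grad=True)
loss = mil_ranking_loss(torch.sigmoid(anom_logits), torch.sigmoid(norm_logits))
loss.backward()
```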

In addition, a new large-scale, first-of-its-kind dataset of 128 hours of video was presented. It consists of 1,900 long and untrimmed real-world surveillance videos with 13 realistic anomalies, such as fighting, road accidents, burglary, robbery, etc., as well as normal activities. This dataset is applicable to two tasks: 1) general anomaly detection, considering all anomalies in one group and all normal activities in another group; and 2) recognizing each of the 13 anomalous activities.

The experimental results showed that the MIL method achieved a significant improvement in anomaly detection performance compared to state-of-the-art approaches. Results of several recent deep learning baselines on anomalous activity recognition were also provided. The low recognition performance of these baselines reveals that this dataset is very challenging and opens up opportunities for future work.

The flow diagram of the proposed anomaly detection approach
Qualitative results of the proposed method on testing videos. Colored window shows ground truth anomalous region.

Sultani et al. (2018)

a) Animal abuse (beating a dog)

b) Explosion

c) Road accident

d) Shooting

e) and f) Normal videos

g) and h) Failure cases

In conclusion, what are the key takeaways from motion prediction?

The ability to anticipate future events is an essential prerequisite for intelligent behavior. It is of utmost importance in real-time systems, such as robotics or autonomous driving, which depend on visual scene understanding for decision making. Video forecasting has been studied as a proxy task towards this goal. Hence, there has been exponential growth in published research within the field of motion prediction in recent years, which can be challenging to structure and get familiar with.

Accordingly, there are three main takeaways from this review. First, there are five core areas within motion prediction — 1) video prediction, 2) action prediction, 3) trajectory prediction, 4) human motion prediction, and 5) other applications. Second, there are several architectures that may be successfully applied to the motion prediction challenge at hand. The best choice depends mainly on factors like the types of input data and expected output, computational efficiency, application-specific constraints, etc. The most common ones are recurrent and feedforward networks, generative models and attention modules.

Finally, the most relevant metrics within motion prediction are MSE, SSIM, PSNR, accuracy, precision, recall, Average Displacement Error (ADE), Final Displacement Error (FDE), Negative Log-Likelihood (NLL), Mean Average Error in joint space (MJE), Mean Average Error in angle space (MAnE), Mean Per Joint Position Error (MPJPE) and Percentage of Correct Keypoints (PCK). Which is most applicable for the task at hand depends on the type of application.
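As a small, self-contained illustration of a few of these metrics, the sketch below computes PSNR for predicted frames and ADE/FDE for predicted trajectories on toy tensors; the tensor shapes are assumptions.

```python
import torch


def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio between a predicted and a ground-truth frame."""
    mse = torch.mean((pred - target) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)


def ade_fde(pred_traj, gt_traj):
    """Average and Final Displacement Errors for predicted trajectories of shape (B, T, 2)."""
    dist = torch.linalg.norm(pred_traj - gt_traj, dim=-1)   # (B, T) per-step errors
    return dist.mean(), dist[:, -1].mean()


frames_pred, frames_gt = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
print(psnr(frames_pred, frames_gt))

traj_pred, traj_gt = torch.rand(8, 12, 2), torch.rand(8, 12, 2)
print(ade_fde(traj_pred, traj_gt))
```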

With more than 100 publicly available datasets for evaluating motion prediction algorithms, and with deep learning approaches targeting deployment on real-world safety-critical robotic systems, there are plenty of opportunities and a vast need for building more reliable, safer and more robust computer vision models within this area.

References

Abu Farha, Y., Richard, A., & Gall, J. (2018). When will you do what? — Anticipating temporal occurrences of activities. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5343–5352).

Cao, Z., Gao, H., Mangalam, K., Cai, Q. Z., Vo, M., & Malik, J. (2020, August). Long-term human motion prediction with scene context. In European Conference on Computer Vision (pp. 387–404). Springer, Cham.

Castrejon, L., Ballas, N., & Courville, A. (2019). Improved Conditional VRNNs for Video Prediction. In Proceedings of the IEEE International Conference on Computer Vision (pp. 7608–7617).

Chandra, R., Bhattacharya, U., Bera, A., & Manocha, D. (2019). Traphic: Trajectory prediction in dense and heterogeneous traffic using weighted interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8483–8492).

Farazi, H., & Behnke, S. (2019). Frequency Domain Transformer Networks for Video Prediction. arXiv preprint arXiv:1903.00271.

Furnari, A., & Farinella, G. M. (2019). What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMs and modality attention. In Proceedings of the IEEE International Conference on Computer Vision (pp. 6252–6261).

Hosseini, M., Maida, A. S., Hosseini, M., & Raju, G. (2019). Inception-inspired LSTM for Next-frame Video Prediction. arXiv preprint arXiv:1909.05622.

Ivanovic, B. (2020). Back to the Future: Planning-Aware Trajectory Forecasting for Autonomous Driving. The Stanford AI Lab Blog. Accessed at http://ai.stanford.edu/blog/trajectory-forecasting/ on December 21st, 2020.

Kim, Y., Nam, S., Cho, I., & Kim, S. J. (2019). Unsupervised Keypoint Learning for Guiding Class-Conditional Video Prediction. In Advances in Neural Information Processing Systems (pp. 3814–3824).

Li, M., Chen, S., Zhao, Y., Zhang, Y., Wang, Y., & Tian, Q. (2020). Dynamic Multiscale Graph Neural Networks for 3D Skeleton Based Human Motion Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 214–223).

Liang, J., Jiang, L., Niebles, J. C., Hauptmann, A. G., & Fei-Fei, L. (2019). Peeking into the future: Predicting future person activities and locations in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5725–5734).

Liu, C., Lu, Y., Shi, X., Li, Z., & Zhao, L. (2017). Action Prediction Using Unsupervised Semantic Reasoning. In: Liu D., Xie S., Li Y., Zhao D., El-Alfy ES. (eds) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science, vol 10636. Springer, Cham. https://doi.org/10.1007/978-3-319-70090-8_50

Liu, W., Luo, W., Lian, D., & Gao, S. (2018). Future Frame Prediction for Anomaly Detection — A New Baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6536–6545).

Luc, P., Couprie, C., Lecun, Y., & Verbeek, J. (2018). Predicting future instance segmentation by forecasting convolutional features. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 584–599).

Luc, P., Neverova, N., Couprie, C., Verbeek, J., & LeCun, Y. (2017). Predicting deeper into the future of semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (pp. 648–657).

Mao, W., Liu, M., & Salzmann, M. (2020, August). History repeats itself: Human motion prediction via motion attention. In European Conference on Computer Vision (pp. 474–489). Springer, Cham.

Mao, W., Liu, M., Salzmann, M., & Li, H. (2019). Learning Trajectory Dependencies for Human Motion Prediction. In Proceedings of the IEEE International Conference on Computer Vision (pp. 9489–9497).

Oprea, S., Martinez-Gonzalez, P., Garcia-Garcia, A., Castro-Vargas, J. A., Orts-Escolano, S., Garcia-Rodriguez, J., & Argyros, A. (2020). A Review on Deep Learning Techniques for Video Prediction. arXiv preprint arXiv:2004.05214.

Rasouli, A. (2020). Deep Learning for Vision-based Prediction: A Survey. arXiv preprint arXiv:2007.00095.

Rasouli, A., Kotseruba, I., & Tsotsos, J. K. (2020). Pedestrian Action Anticipation using Contextual Feature Fusion in Stacked RNNs. arXiv preprint arXiv:2005.06582.

Rhinehart, N., McAllister, R., Kitani, K., & Levine, S. (2019). Precog: Prediction conditioned on goals in visual multi-agent settings. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2821–2830).

Sultani, W., Chen, C., & Shah, M. (2018). Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6479–6488).

Wang, J., Hu, B., Long, Y., & Guan, Y. (2019). Order matters: Shuffling sequence generation for video prediction. arXiv preprint arXiv:1907.08845.
