Automatic understanding of the visual world

Rolf Stephens
5 years ago

1 Automatic understanding of the visual world 1

2 Machine visual perception Artificial capacity to see, understand the visual world Object recognition Image or sequence of images Action recognition 2

3 Machine visual perception applications Face detection (auto focus in cameras, surveillance) Courtesy Fujifilm Courtesy Ricoh 3

4 Machine visual perception applications Pedestrian detection, action recognition (car safety, surveillance) Courtesy Volvo Courtesy Embedded Vision Alliance 4

5 Machine visual perception applications Image retrieval (search for places/objects with a smartphone) Courtesy Google 5

6 Machine visual perception applications Complete description (story) of a video As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa... 6

10 Today's machine visual perception Machine learning Large-scale & deep learning Learning with noisy labels Data (images & videos) Large quantity, but quality? Manual / weakly-supervised annotation, synthetic data Machine visual perception Understanding of the visual world Design of models 10

11 Image and video data Manually annotated data Weakly-supervised learning Synthetic data Self-supervised learning 11

12 Manually annotated collections Increase in size: images in 2007, several million images today Increase in complexity of annotation Image labels Bounding box labels Semantic segmentation labels Action labels 12

13 PASCAL VOC 2007-2012 Images collected from Flickr with keywords Exhaustive manual annotation with 20 class labels and corresponding bounding boxes Separate training and test set (held out) [The PASCAL Visual Object Classes (VOC) Challenge, M. Everingham, L. Van Gool, C. Williams, J. Winn, A. Zisserman, IJCV] 13

14 ImageNet dataset ImageNet has 14M images from 22k classes ImageNet Large Scale Visual Recognition Challenge: 1000 classes and 1.4M images, image labels only [O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg and L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, IJCV, 2015] 14

15 COCO dataset COCO: Common Objects in Context 80 object classes with segmentation masks, 200,000 images [T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, L. Zitnick, P. Dollár, Microsoft COCO: Common Objects in Context, ECCV, 2014] 15

16 Atomic Visual Actions (AVA) dataset Definition of atomic actions, 80 atomic actions in 65k movie clips with 197k labels, multiple labels per person [AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions; Gu, Sun, Vijayanarasimhan, Pantofaru, Ross, Toderici, Li, Ricco, Sukthankar, Schmid, Malik, arXiv 17] 16

17 Information difficult to annotate Optical flow: dense correspondence between pixels Impossible to precisely annotate manually FlyingThings dataset [Mayer et al., CVPR 16] 17

18 Information difficult to annotate 3D human shape Impossible to precisely annotate manually [F. Bogo et al., Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image, ECCV 2016] 18

19 Image and video data Manually annotated data Weakly-supervised learning Synthetic data Self-supervised learning 19

20 Visual models from weakly-supervised data Massive and ever-growing amount of digital image and video content Flickr and YouTube Audio-visual archives (BBC, INA) Personal collections Comes with metadata Text, audio, user click data, 20

21 Visual models from weakly-supervised data Large-scale & weakly-supervised learning of visual models Object detection Action recognition 21

22 Joint learning of actors and actions Rick? Rick? Walks? Walks? Rick walks up behind Ilsa [Bojanowski et al. ICCV 2013] 22

23 Joint learning of actors and actions Rick Walks Rick walks up behind Ilsa [Bojanowski et al. ICCV 2013] 23

25 Weakly-supervised learning of actions [Weinzaepfel et al., arXiv 17] 25

26 Extraction of human tubes State-of-the-art Faster R-CNN detector Large annotated dataset of humans Human-specific tracking-by-detection approach DALY: 95% at

28 Weakly supervised learning of relations Input: Object detections + Image labels [Peyre et al., ICCV 17] 28

29 Weakly supervised learning of relations Output: Learnt spatial relations [Peyre et al., ICCV 17] 29

32 Image and video data Manually annotated data Weakly-supervised learning Synthetic data Self-supervised learning 32

33 Learning optical flow FlyingThings dataset [Mayer et al., CVPR 16] FlowNet 2.0 [Ilg et al., CVPR 17] 33

34 Visual models from synthetic data [Learning from Synthetic Humans, Varol, Romero, Martin, Mahmood, Black, Laptev, Schmid, CVPR 17] 34

35 SURREAL dataset Synthetic humans for REAL tasks a body with random 3D shape + 3D pose from MoCap data 2D image is rendered with a random camera + random lighting + random cloth texture + a random static scene image Output: RGB image, 2D/3D pose, optical flow, depth image, segmentation map for body parts

36 CAESAR dataset for human body shapes LSUN dataset for static background images CAESAR dataset and another collection of 3D scans for body textures (clothes) CMU dataset for MoCap sequences (marker data)
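
The random sampling behind each synthetic image can be sketched roughly as follows. This is only an illustration of the idea on the two slides above, not the SURREAL generation code: the `RenderConfig` fields, the number of shape coefficients, and the sampling ranges are all assumptions, and the rendering step itself is left out.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class RenderConfig:
    """One randomly sampled scene description (all field names are hypothetical)."""
    shape: List[float]      # body shape coefficients
    pose_index: int         # index of a MoCap frame
    camera_yaw_deg: float   # random camera placement around the body
    light_intensity: float  # random lighting strength
    texture_id: int         # index of a cloth texture
    background_id: int      # index of a static background image


def sample_render_config(rng, n_mocap_frames, n_textures, n_backgrounds,
                         n_shape_coeffs=10):
    """Draw every factor independently, as listed on the slide."""
    return RenderConfig(
        shape=[rng.gauss(0.0, 1.0) for _ in range(n_shape_coeffs)],
        pose_index=rng.randrange(n_mocap_frames),
        camera_yaw_deg=rng.uniform(0.0, 360.0),      # assumed range
        light_intensity=rng.uniform(0.5, 1.5),       # assumed range
        texture_id=rng.randrange(n_textures),
        background_id=rng.randrange(n_backgrounds),
    )
```

Because the scene is generated, a renderer fed with such a configuration can emit the ground truth (pose, depth, flow, part segmentation) alongside the RGB image at no annotation cost, which is the point of the synthetic approach.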

37 37

38 Approach for body part segmentation Stacked hourglass network [Newell et al., 2016] head 2D pose Segmentation left arm left arm backg. right foot torso MSE for regressing heatmaps Softmax error for classifying pixels as one of the parts
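
The two training losses named on this slide can be written out in a few lines. The sketch below uses plain Python lists instead of tensors and is not taken from the authors' code; the function names and the data layout (a heatmap as a 2D list, one list of class scores per pixel) are assumptions made for illustration.

```python
import math


def mse_heatmap_loss(pred, target):
    """Mean squared error between two equally sized 2D heatmaps (pose regression)."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def softmax_cross_entropy(logits, label):
    """Softmax error for one pixel: logits are class scores, label is the true part index."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def segmentation_loss(pixel_logits, pixel_labels):
    """Average softmax error over all pixels of an image."""
    losses = [softmax_cross_entropy(l, y) for l, y in zip(pixel_logits, pixel_labels)]
    return sum(losses) / len(losses)
```

The pose branch is a regression problem (one heatmap per joint), while the segmentation branch is a per-pixel classification over body parts plus background, which is why the two outputs use different losses.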

39 Experimental results Evaluation on Freiburg Sitting People Dataset

40 Results on YouTubePose dataset

41 Image and video data Manually annotated data Weakly-supervised learning Synthetic data Self-supervised learning 41

42 Self-supervised learning from video Regularities in the video data are used for learning [I. Misra et al., Shuffle and Learn: Unsupervised Learning using Temporal Order Verification, ECCV 16] 42
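
Temporal order verification gets its labels for free from the video itself. The sketch below shows one simplified way to build such training examples; it is not the sampling scheme of the cited paper, and the function name, the 50/50 split, and the choice to exclude the fully reversed order from the negatives are assumptions.

```python
import random


def sample_order_example(num_frames, rng):
    """Return (frame_indices, label): label 1 if the three frames are in temporal order."""
    # Pick three distinct frame indices and put them in temporal order.
    a, b, c = sorted(rng.sample(range(num_frames), 3))
    if rng.random() < 0.5:
        return (a, b, c), 1  # positive: correct order
    # Negative: an ordering that is neither ascending nor fully reversed.
    wrong = rng.choice([(a, c, b), (b, a, c), (b, c, a), (c, a, b)])
    return wrong, 0
```

A network trained to predict the label from the three frames has to pick up on motion and pose regularities, without any manual annotation being involved.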

43 Automatic video object segmentation [Learning Video Object Segmentation with Visual Memory, Tokmakov et al., ICCV 17] 43

44 Conclusion Recent large-scale data collection Key to next generation systems Importance of moving away from fully supervised approaches Cross-modal learning from vision, language and robotics 44

45 Thank you! Follow us on
