Understanding egocentric imagery, for fun and science
1 Understanding egocentric imagery, for fun and science. David Crandall, School of Informatics and Computing, Indiana University. Joint work with: Denise Anthony (Dartmouth), Apu Kapadia, Chen Yu; PhD students: Sven Bambach, Roberto Hoyle, Mohammed Korayem, Stefan Lee, Robert Templeman; undergrads: Steven Armes, Dennis Chen.
2 - Google - Narrative - Autographer - Samsung - GoPro
3
4 June 2, 11:10 am July 2, 11:10 am Aug 2, 11:10 am Sept 2, 11:10 am
5
6 Daguerreotypes, 1839 The Kodak, 1888 Polaroid Land Camera, 1948 Digital camera, 1975 J-phone, 2000
7 (Slide 7 repeats the camera timeline above.)
8
9
10 What if your device were hacked? Robert Templeman, Zahid Rahman, David Crandall, Apu Kapadia. PlaceRaider: Virtual Theft in Physical Spaces with Smartphones. NDSS, 2013.
11 R. Templeman, Z. Rahman, D. Crandall, A. Kapadia. PlaceRaider: Virtual Theft in Physical Spaces with Smartphones. NDSS, 2013.
12 Vision to the rescue! Could computer vision automatically help:
- Organize and analyze egocentric image streams?
- Find the great photos among all the bad ones?
- Warn before a photo containing private data is shared to Facebook?
- Block or censor private photos from being taken by the device, and/or uploaded to the cloud?
13 What makes an image sensitive? 37 undergrads, wearing life-logging cameras for a week, each day reviewing images and labeling them in various ways. R. Hoyle, R. Templeman, S. Armes, D. Anthony, D. Crandall, A. Kapadia. Privacy behaviors of lifeloggers using wearable cameras. UBICOMP, 2014.
14 Ethical, legal, and IRB considerations
15 Why not share the photo? (Reason: share of responses)
- No good reason to share it: 36.0%
- Objects (other than people) in the photo: 30.7%
- Where this photo was taken: 22.6%
- People within the photo: 18.4%
- Participant was in the photo: 11.5%
- It had private information: 11.5%
- It would have been embarrassing to share it: 5.4%
- It would have violated someone else's privacy: 3.8%
- It was a bad photo: 1.5%
16-19 (Slides 16-19 repeat the table above.)
20 Place recognition in lifelogging images. Given a stream of lifelogging photos, where was each photo taken? In which specific room? In which type of room? Use SVMs with standard image features: HOG, GIST, LBP, SIFT, ... (a sketch of this step follows below). R. Templeman, M. Korayem, D. Crandall, A. Kapadia. PlaceAvoider: Steering first-person cameras away from sensitive spaces. NDSS, 2014.
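As a rough sketch of the per-photo classification step described above (not the PlaceAvoider implementation itself), one could train an SVM on a single global descriptor such as HOG; the train_images/train_labels/test_images variables and all feature parameters here are illustrative assumptions:

```python
# Illustrative sketch only: per-photo room classification with an SVM over
# a global HOG descriptor. train_images/train_labels/test_images are
# assumed to exist (lists of RGB arrays and room labels).
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC

def global_feature(image):
    """HOG on a fixed-size grayscale version of the photo."""
    gray = resize(rgb2gray(image), (128, 128))
    return hog(gray, orientations=9, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2))

X = np.array([global_feature(im) for im in train_images])
clf = SVC(kernel="linear", probability=True).fit(X, train_labels)

# Per-photo class probabilities; these feed the HMM smoothing on slide 21.
probs = clf.predict_proba(np.array([global_feature(im) for im in test_images]))
```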
21 Classifying photo streams with HMMs. Probabilities from individual photo classifiers vs. probabilities after applying the HMM, for the classes Bathroom, Bedroom, Garage, Living, Office (per-class values shown graphically on the slide).
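A minimal sketch of this HMM smoothing step, assuming a simple hand-set transition matrix with strong self-transitions (an illustrative assumption, not the paper's trained model); Viterbi decoding turns noisy per-photo probabilities into a temporally coherent room label per photo:

```python
# Viterbi smoothing over a stream of per-photo class probabilities.
import numpy as np

def viterbi_smooth(frame_probs, stay=0.95):
    """frame_probs: (T, K) per-photo class probabilities."""
    T, K = frame_probs.shape
    trans = np.full((K, K), (1.0 - stay) / (K - 1))  # assumed transitions
    np.fill_diagonal(trans, stay)
    log_p = np.log(frame_probs + 1e-12)
    log_t = np.log(trans)
    score = log_p[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_t          # (K, K): prev -> cur
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_p[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):              # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]                          # most likely room per photo
```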
22 Evaluation. Tested on 5 realistic lifelogging datasets. Accuracy with local features + HMM / global features + HMM / local+global features + HMM / local+global+human + HMM (per-dataset baselines missing; average baseline 28.5%):
- House 1: 89.2% / 64.0% / 89.2% / 95.0%
- House 2: 55.0% / 56.4% / 74.6% / 76.8%
- House 3: 97.4% / 86.9% / 98.7% / 99.8%
- Workplace 1: 75.5% / 89.2% / 87.7% / 91.0%
- Workplace 2: 92.3% / 81.2% / 98.7% / 100.0%
- Average: 81.9% / 74.8% / 89.8% / 92.5%
23 Sample results
24 Detecting screens. CNNs plus temporal smoothing give ~95% two-way (monitor vs. no monitor) accuracy on real lifelogging data (vs. ~73% baseline). M. Korayem, R. Templeman, D. Chen, D. Crandall, A. Kapadia. ScreenAvoider: Protecting computer screens from ubiquitous cameras. arXiv preprint, 2014.
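One plausible form of the temporal smoothing mentioned here (the slide does not specify the method) is an exponential moving average over per-frame CNN scores; the alpha and threshold values below are assumptions:

```python
# Assumed smoothing scheme: EMA over per-frame P(monitor) scores from a CNN.
def smooth_and_decide(frame_scores, alpha=0.8, threshold=0.5):
    """frame_scores: iterable of P(monitor) values, one per frame."""
    smoothed, labels = [], []
    running = None
    for s in frame_scores:
        running = s if running is None else alpha * running + (1 - alpha) * s
        smoothed.append(running)
        labels.append(running > threshold)   # True = frame shows a monitor
    return smoothed, labels
```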
25 Vannevar Bush, The Atlantic, 1945
26 Child's first-person view and parent's first-person view, each captured from an eye camera and a head camera.
27 (Very broad) research questions:
- How do children coordinate their hand movements, eye gaze, and head movements? How do these develop and change over time? Are certain patterns predictive of later deficiencies?
- How do children and parents interact, and how does this differ across subjects? How do they jointly coordinate attention? Which interaction patterns are most successful for learning?
28 Typical experiments. 13 child-parent dyads (children's mean age = 13 months, σ = 3.2 months). Parents told to engage the child with toys, naturally, e.g.: exchanging toys back and forth, joint actions with toys. ~5 minutes of video per trial. In a lab setting, more recently in naturalistic environments.
29 Analysis. Video data processing: pixel-level visual saliency estimation (saliency map models, e.g., that of Itti et al.); object segmentation and recognition; positions of toys, hands, and faces; object-holding activities. Motion data processing: optical flow; head and hand inertial sensors.
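For the optical-flow step, a minimal sketch with OpenCV's Farneback method, reducing dense flow to a single per-frame motion magnitude (e.g., as a rough head-motion proxy); the parameter values are common defaults, not the study's settings:

```python
# Sketch: dense optical flow between consecutive frames, then mean magnitude.
import cv2
import numpy as np

def motion_magnitude(prev_bgr, cur_bgr):
    prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    cur = cv2.cvtColor(cur_bgr, cv2.COLOR_BGR2GRAY)
    # args: prev, next, flow, pyr_scale, levels, winsize, iterations,
    #       poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return float(np.linalg.norm(flow, axis=2).mean())  # mean pixel speed
```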
30 Eye gaze within the visual field. Children: μ = (340, 231), σx = 86, σy = 65, N = 148,279. Parents: μ = (361, 224), σx = 53, σy = 60, N = 148,279. S. Bambach, D. Crandall, C. Yu. Understanding embodied visual attention in child-parent interactions. IEEE International Conference on Development and Learning, 2013.
31 Head-eye coordination (gaze distribution with moving vs. stationary head).
- Children, moving head: μ = (333, 253), σx = 80, σy = 70, N = 15,626; stationary head: μ = (335, 226), σx = 84, σy = 63, N = 71,388.
- Parents, moving head: μ = (351, 229), σx = 57, σy = 62, N = 63,035; stationary head: μ = (374, 210), σx = 47, σy = 56, N = 33,981.
32 Eye-hand coordination (gaze distribution with empty hands vs. holding a toy).
- Children, empty hands: μ = (332, 231), σx = 84, σy = 68, N = 55,704; holding toy: μ = (301, 242), σx = 85, σy = 62, N = 10,937.
- Parents, empty hands: μ = (336, 244), σx = 71, σy = 70, N = 27,062; holding toy: μ = (333, 270), σx = 63, σy = 73, N = 11,239.
33 Saliency in first-person views. Comparison of average saliency (based on 148,000 frames each, using the Itti et al. model) → no significant difference between child and parent views. (Figure panels: Child, Parent.)
34 Saliency in first-person views. Comparison of average saliency within a hotspot around the gaze location → gaze predictiveness differs significantly. (Figure panels: Child, Parent.)
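A minimal sketch of the hotspot comparison, under assumptions: average the saliency map inside a fixed-radius disk around the gaze point and compare it to the map's overall mean; the radius is an arbitrary illustrative choice:

```python
# Assumed metric: mean saliency inside a disk around gaze vs. overall mean.
import numpy as np

def hotspot_ratio(saliency, gaze_xy, radius=40):
    """saliency: (H, W) map; gaze_xy: (x, y) gaze location in pixels."""
    h, w = saliency.shape
    ys, xs = np.ogrid[:h, :w]
    disk = (xs - gaze_xy[0]) ** 2 + (ys - gaze_xy[1]) ** 2 <= radius ** 2
    return saliency[disk].mean() / (saliency.mean() + 1e-12)
```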
35 Longer-term question: Can we jointly model head and hand pose, eye gaze, saliency, and activity, both to better perform egocentric computer vision and to help explain human vision? (Diagram: Head, Eye, Hand.)
36 Why start with hands? Hands are in nearly every frame of egocentric video. Hand configuration reflects what we are doing and what we are paying attention to. Detecting hands is a fundamental problem for both computers and people: "The feeling of ownership of our limbs is a fundamental aspect of self-consciousness." [Ehrsson 2004]
37 A. Fathi, X. Ren, J. Rehg. Learning to recognize objects in egocentric activities. CVPR 2011. H. Pirsiavash, D. Ramanan. Detecting activities of daily living in first-person camera views. CVPR 2012. A. Fathi, A. Farhadi, J. Rehg. Understanding egocentric activities. ICCV 2011.
38 Hand detection and disambiguation. In egocentric video of interacting people, locate: the observer's hands (my left, my right), the other person's hands (your left, your right), and the other person's head (your head). Our approaches so far:
- Strong temporal models, weak spatial/appearance models: S. Bambach, J. Franchak, D. Crandall, C. Yu. Detecting hands in children's egocentric views to understand embodied attention during social interaction. CogSci 2014.
- Strong spatial/temporal models, weak appearance models, explicit camera (head) motion model: S. Lee, S. Bambach, C. Yu, D. Crandall. This hand is my hand: A probabilistic approach to hand disambiguation in egocentric video. CVPR EgoVision workshop, 2014.
- Strong appearance models, weak spatial/temporal models: S. Bambach, S. Lee, D. Crandall, C. Yu. Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. ICCV 2015.
39 Strong appearance models: CNNs (of course). Mostly off-the-shelf Caffe, fine-tuned from ImageNet. Generate candidates using domain-specific information: sample from a distribution over the size, position, and shape of hand regions in the training data, biased by a skin-color detector. Classes: other's right hand, my left hand, other's left hand, my right hand. (Plot: coverage vs. number of window proposals for direct sampling, direct sampling (spatial only), sampling from selective search, selective search, sampling from objectness, and objectness.)
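The candidate-generation idea might be sketched as follows, with a kernel density estimate over training hand boxes and a crude YCrCb skin test standing in for the paper's learned distribution and skin-color detector (both are assumptions here):

```python
# Assumed scheme: sample (x, y, w, h) windows from a KDE over training hand
# boxes, keeping only windows with enough skin-colored pixels.
import numpy as np
import cv2
from scipy.stats import gaussian_kde

def skin_fraction(bgr, box):
    """Fraction of pixels in the box passing a crude YCrCb skin test."""
    x, y, w, h = box
    patch = cv2.cvtColor(bgr[y:y + h, x:x + w], cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(patch, np.array([0, 133, 77]), np.array([255, 173, 127]))
    return float(mask.mean()) / 255.0

def propose_windows(bgr, train_boxes, n=2500, min_skin=0.05):
    """train_boxes: (N, 4) array of ground-truth (x, y, w, h) hand boxes."""
    kde = gaussian_kde(np.asarray(train_boxes, dtype=float).T)  # (4, N) data
    samples = kde.resample(n).T.astype(int)
    H, W = bgr.shape[:2]
    keep = []
    for x, y, w, h in samples:
        inside = w > 0 and h > 0 and x >= 0 and y >= 0 and x + w <= W and y + h <= H
        if inside and skin_fraction(bgr, (x, y, w, h)) >= min_skin:
            keep.append((x, y, w, h))
    return keep
```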
40 Hand detection & disambiguation results. 4 actors × 4 activities × 3 locations = 48 unique videos; 15,053 pixel-level ground-truth segmentations. (Plots: precision-recall curves with AP and AUC for our method vs. selective search (SS) and objectness, and per-class curves for own left, own right, other left, and other right hands.)
41 Hand detection & segmentation
42
43 Does hand pose alone reveal first-person activities? Different interactions afford different hand grasps [Napier 1965]. Train and test CNNs for 4 activities (puzzle, Jenga, cards, chess) on masked-out images. Single-frame accuracy: 66.4% with ground-truth masks, 50.9% with automatic masks (vs. a 25% baseline). With temporal cues (50 frames): 92.9% with ground-truth masks, 73.4% with automatic masks.
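The 50-frame temporal aggregation could be as simple as averaging per-frame softmax scores over each window and taking the argmax; the slide states the window size but not the voting scheme, so the scheme below is an assumption:

```python
# Assumed aggregation: mean softmax over each 50-frame window, then argmax.
import numpy as np

def windowed_activity(frame_softmax, window=50):
    """frame_softmax: (T, 4) scores for (puzzle, jenga, cards, chess)."""
    T = len(frame_softmax)
    preds = []
    for t in range(0, T, window):
        chunk = frame_softmax[t:t + window]
        preds.append(int(np.mean(chunk, axis=0).argmax()))
    return preds  # one activity label per 50-frame chunk
```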
44 Hand positions during sustained attention, within the child's field of view. (Panels: parent's right hand, parent's left hand, child's left hand, child's right hand.) 6 child-parent dyads (children's mean age = 19 months, σ = 2.56 months); 67,913 frames (~38 minutes) of child-view video.
45 Daguerreotypes, 1839 The Kodak, 1888 Polaroid Land Camera, 1948 Digital camera, 1975 J-phone, 2000
46 Future work. Detecting and disambiguating hands: generalize hand-based activity recognition to more activities and finer-grained actions; more challenging social situations (e.g., more than two interacting people, a moving camera wearer). Applications for egocentric video data.
47 For more information: http://vision.soic.indiana.edu/. Based on joint work with: Faculty: Denise Anthony, Apu Kapadia, Chen Yu. PhD students: Sven Bambach, Roberto Hoyle, Mohammed Korayem, Stefan Lee, Robert Templeman. Undergrad students: Steven Armes, Dennis Chen. Sponsors: NSF (III CAREER, SaTC, EAGER, DIBBs), Intelligence Advanced Research Projects Activity (IARPA), Air Force Office of Scientific Research (AFOSR), Google, Nvidia, IU FRSP, IUCRG, IU D2I Center, Lilly Endowment.