
Data Privacy: Foe, then Friend
Bradley Malin, malin@cs.cmu.edu
Data Privacy Laboratory, School of Computer Science, Carnegie Mellon University
October 26, 2004

What is Data Privacy?
- WHERE does data come from?
- WHAT does data reveal?
- HOW do we prove data does not reveal more than specified?

Privacy Policy: Some Positives
- Procedure: specifies how data can (and cannot) be used.
- Logical cognition: requires active involvement and thought regarding information.
- Standardization: equal opportunity.
- Legal enforcement.

Privacy Policy: Some Limitations
- Needs robust language.
- Needs enforcement.
- Scope of world / interaction.
- Syntax, not semantics.

What is Data Privacy?
The study of computational solutions for releasing data such that the data remain practically useful while aspects of the subjects of the data are not revealed.
- Privacy protection ("data protectors"): release information such that entity-specific properties (e.g., identity) are controlled; restrict what can be learned.
- Data linkage ("data detectives"): combine disparate pieces of entity-specific information to learn more about an entity.

Relationships Between Related Areas
(Diagram courtesy of Michael Shamos.) Data privacy sits among related areas: inference control, disclosure control, privacy-preserving data mining, anonymity (de-identification), privacy specification, public policy, law, human interfaces, organizational practices, enforcement and implementation, audit and accountability, and security (trusted hardware, cryptography).

What Kind of Data? All Kinds!
- Field-structured databases
- Text documents
- Genomic
- Image
- Video
- Network (physical or social)
- Communications

Data Privacy is Interdisciplinary
(Table courtesy of Latanya Sweeney, relating privacy topics such as anonymity, rights management, databases, and ubiquitous computing to AI, learning, theory, databases, language, security, and information systems.)
- AI primarily concerns knowledge representation and semantics.
- Learning focuses on data mining algorithms.
- Theory includes zero-knowledge proofs and multi-party computations.

Information Explosion [Sweeney 1997]
- Increase in technological capability for collection, storage, and transfer.
- Growth in active web servers; decrease in cost.
- Global Disk Storage Per Person (GDSP): roughly (hard drive space) / (world population).
(Charts, not reproduced: web servers in millions and GDSP in MB/person over 1983-2003, growth in available disk storage, and the declining person-time per page.)

Anonymity & De-identification
- Anonymous: the data cannot be manipulated or linked to identify an individual.
- De-identified: all explicit identifiers, such as name, address, and phone number, are removed, generalized, or replaced with made-up values.
- Does anonymous = de-identified?

Direct Linkage
- Uses a combination of attributes to determine the uniqueness of an entity in a dataset.
- A second dataset with identified subjects is used to make the re-identification, by drawing inferences between the two datasets on the related attributes.
- The attributes do not have to be equal, but there must exist some ability to infer between attributes.

Medical Information: Linking to Re-identify Data
Attributes recommended by NAHDO (outside the scope of HIPAA): patient ZIP code, patient birth date, patient gender, patient racial background, patient number, visit date, principal diagnosis codes (ICD-9), procedure codes, physician ID number, physician ZIP code, total charges.
Medical data fields in the example: ethnicity, visit date, diagnosis, procedure, medication, total charge, ZIP, birthdate, sex.

Linking to Re-identify Data
Voter list fields: name, address, date registered, party affiliation, date last voted, plus the ZIP, birthdate, and sex fields shared with the medical data.
- {date of birth, gender, 5-digit ZIP} uniquely identifies 87% of the USA [Sweeney 1997, 1998].
- Few fields are needed to uniquely identify individuals: 87% of the United States is RE-IDENTIFIABLE.
- The rate varies by locale: one example ZIP code falls short of 87% because an insufficient number of residents above the age of 55 live there.
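The voter-list linkage attack described above can be sketched in a few lines of Python. All records, names, and field values below are invented for illustration; a real attack runs the same join over full public datasets.

```python
# Sketch of a linkage attack: join a "de-identified" medical table with an
# identified voter list on the quasi-identifier {birth date, sex, ZIP}.

medical = [  # de-identified release: no names, but quasi-identifiers remain
    {"birth": "1965-07-02", "sex": "F", "zip": "02138", "diagnosis": "cardiac"},
    {"birth": "1971-01-15", "sex": "M", "zip": "02139", "diagnosis": "cancer"},
]

voters = [  # public, identified record
    {"name": "Ann Smith", "birth": "1965-07-02", "sex": "F", "zip": "02138"},
    {"name": "Bob Jones", "birth": "1980-03-09", "sex": "M", "zip": "02139"},
]

QI = ("birth", "sex", "zip")  # quasi-identifier shared by both tables

def link(medical, voters):
    """Return (name, diagnosis) pairs for records with a unique QI match."""
    index = {}
    for v in voters:
        index.setdefault(tuple(v[f] for f in QI), []).append(v["name"])
    matches = []
    for m in medical:
        names = index.get(tuple(m[f] for f in QI), [])
        if len(names) == 1:  # a unique match is a re-identification
            matches.append((names[0], m["diagnosis"]))
    return matches
```

Here Ann's quasi-identifier is unique in both tables, so her diagnosis is re-identified; Bob's birth date differs between the tables, so his record stays unlinked.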

- In another example ZIP code, residents fall primarily in one narrow age band (about 86% of the population), and only 3% are uniquely identified [Sweeney 1997, 1998].

Chain of Links
Voter list and medical data share {DOB, gender, ZIP}; re-identification follows this chain of shared attributes.

Semantic Learning
Algorithms for learning sensitive information from seemingly innocent information:
- Mining strategic information from text and from video.
- Automated profiles (putting disparate pieces together).
- Resolving ambiguous identities in data (e.g., Michael Jordan the basketball player vs. Michael Jordan the computer scientist).
(Contributors: Edoardo Airoldi, William Gronim, Ralph Gross, Kishore Madhava, Bradley Malin.)

So What Do You Do? DNA Data (ATCGATCGAT...)
- Mutation analysis
- Prediction and risk
- Pharmaco-genomic relations
- Familial relations

Building Up Relations
Medical data (ethnicity, visit date, ZIP, diagnosis, birthdate, procedure, sex, medication, total charge) can be related to DNA data (ATCGATCGAT...).

CleanGene [Malin & Sweeney 2000]
- Step 1: infer gender via a genetic locus.
- Step 2: classify disease by known mutations.
- Step 3: infer hospital ID and other attributes.
- Step 4: link the DNA with the medical profile.
(Diagram, not reproduced: full sequences and ESTs/cDNAs accumulating {ZIP, DOB, hospital ID, gender} tuples at each step.)

Genotype-Phenotype Relations
Genotype-phenotype relationships can be inferred from DNA and medical databases together: the medical database supplies diagnoses (phenotypes with a genetic trait), and the DNA database supplies disease sequences.

Longitudinal Genomic Learning Model [Malin & Sweeney 2002]
- Map clinical profiles (diagnoses across visits) to clinical phenotype states.
- Classify the profile.
- Constrain the profile-state alignment against the disease sequences.
- Output DNA predictions.

Experimental Results: DNA with Huntington's Disease
- Example: Huntington's disease. A strong correlation exists between age of onset and the DNA mutation (the number of CAG repeats), following a logarithmic trend.
- Given longitudinal clinical information, the age of onset was accurately inferred in 20 of 22 cases.
(Charts, not reproduced: size of repeat versus age of onset with a fitted logarithmic curve, and actual versus predicted onset-age ranges per individual.)
Malin B, Sweeney L. Inferring genotype from clinical phenotype through a knowledge-based algorithm. Pacific Symposium on Biocomputing, pp. 41-52, Jan 2002.

Learning from Trails
Identity algorithms learn where a person has been from the trail left behind, e.g., the IP addresses left behind while visiting websites [Malin & Sweeney 2004], [Malin 2002].
- Websites share weblogs: each site logs the IP addresses that visited it.
- Websites share consumer lists: each site also knows the names of its customers.
- Matching the trail of an IP across sites against the trails of names re-identifies the IP.
(Example tables, not reproduced: IP trails across sites IP1-IP4 and name trails for Yoda, Leah, Obi, Han, C3PO, Jabba, and Lando.)
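The logarithmic onset-age trend above can be estimated with ordinary least squares on (ln x, y). A minimal sketch with synthetic data; the coefficients and repeat counts here are invented for illustration, not the study's values.

```python
import math

def fit_log_model(repeats, ages):
    """Least-squares fit of age_of_onset = a * ln(repeats) + b."""
    xs = [math.log(r) for r in repeats]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ages) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ages))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Synthetic data drawn from an assumed trend y = -50 ln(x) + 200
repeats = [40, 42, 45, 50, 55, 60]
ages = [-50 * math.log(r) + 200 for r in repeats]
```

Because the synthetic ages lie exactly on the assumed curve, the fit recovers a = -50 and b = 200.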
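The trail matching behind REIDIT can be sketched as follows, in the simplest setting where trails are complete and exact (the published algorithms also handle partial trails). The sites, IPs, and names below are illustrative, echoing the slide's examples.

```python
def reidit_i(ip_trails, name_trails):
    """Link an IP to a name when exactly one name's trail (the set of
    websites where that name appears) equals the IP's trail."""
    links = {}
    for ip, sites in ip_trails.items():
        candidates = [n for n, s in name_trails.items() if s == sites]
        if len(candidates) == 1:  # unique trail match -> re-identification
            links[ip] = candidates[0]
    return links

# Shared weblogs flipped into sites-per-IP, and shared consumer lists
# flipped into sites-per-name:
ip_trails = {"ip1": {"siteA", "siteB"}, "ip2": {"siteB"},
             "ip3": {"siteA", "siteC"}}
name_trails = {"Yoda": {"siteA", "siteB"}, "Leah": {"siteB"},
               "Obi": {"siteA", "siteB", "siteC"}}
```

ip1 and ip2 each match exactly one name's trail and are re-identified; ip3 matches no name exactly and stays unlinked.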

Identity: REIDIT-I Example
Example of what was learned: a person (known by name) visited Amazon even though he never bought anything at Amazon. The IP addresses IP1-IP4 were re-identified.

Protection Models
A population universe of subjects (Jcd, Jwq, Jxy, Dan, Don, Dave, Ann, Abe, Al) holds private information: per-person birthdate, ZIP, and diagnosis (cardiac, cancer, liver).
- Null-map: the released data cannot be mapped back to the subjects.
- Wrong-map: the released data maps to the wrong subjects.
- k-anonymity: quasi-identifying values are generalized (e.g., A*, a truncated birth year, a truncated ZIP) so that each released record matches several subjects.

Overview
- General idea of data privacy
- Data analysis: learning in personal information (demographic data, genetic data)
- Data protection: formal models; video data

Idea of k-map and k-anonymity [Sweeney 1997, 1998]
For every record released, there will be at least k individuals to whom the record indistinctly refers.
- In k-map, the k individuals exist in the world.
- In k-anonymity, the k individuals appear in the release.

Model Examples
- k-map: for each tuple t in the release, t must indistinctly refer to at least k entities in the population.
- k-anonymity: the k matching tuples must appear in the release itself, so released tuples are generalized until at least k of them are identical.
- k-anonymity is enforced on release over a quasi-identifier profile such as {birth, ZIP, gender}, using generalization (e.g., a full birth date becomes a birth year), suppression, or encryption of values.

Distributions of a Population Register
A register of six people: Gil, Hal, Jim, Ken, Len, Mel.
- There are three colors with frequencies: 1 red, 3 green, and 2 blue.
- There are 2 types of figures, with 2 of one type and 4 of the other.
- The combinations of color and figure labeled Hal and Len are each unique.

Re-identification Example
The release contains a green figure of a particular type. In the register there are 3 green figures and 2 figures with the same profile as the release, but only Hal is both green and of the matching figure type: a unique match.
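The k-anonymity check and one generalization step (ZIP truncation) can be sketched as follows. The field names and records are invented for illustration; a k-map check would count matches against the population register rather than the release.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifier, k):
    """True iff every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[f] for f in quasi_identifier) for r in records)
    return all(c >= k for c in counts.values())

def generalize_zip(records, digits=3):
    """Generalize ZIP codes by truncation, e.g. 15213 -> 152**."""
    out = []
    for r in records:
        g = dict(r)
        g["zip"] = g["zip"][:digits] + "*" * (len(g["zip"]) - digits)
        out.append(g)
    return out

rows = [
    {"zip": "15213", "sex": "M"},
    {"zip": "15217", "sex": "M"},
    {"zip": "15213", "sex": "M"},
]
QI = ("zip", "sex")
```

The raw rows fail 2-anonymity because the 15217 row is unique; after truncating ZIPs to a shared prefix, all three rows carry the same quasi-identifier and the release is 2-anonymous.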

Re-identification Example (continued)
Another released profile has two matches in the register, Jim and Mel; there is no unique match.

Formal Protection Example
To achieve k-map where k = 2, agents for Gil, Hal, and Ken agree to merge their information together. Information released about any of them results in the same merged image.

Video Data Privacy
Modify video images so that:
- Privacy: automated attempts to recognize faces fail.
- Utility: knowledge learned from the data remains useful.
A solution to this problem enables sharing of data for specified purposes and protects rights as specified in policy, e.g., your identity won't be revealed unless you have done something illegal.

De-identification of Faces

Face Recognition: The Big Idea
- Here is a known image of Bob; captured images are below. Which person is Bob?
- An identification algorithm returns either the name of the person or "identity unknown."
- Face recognition software finds the correct match roughly 70% of the time.
(Chart, not reproduced: percent of samples where the correct match is found, by rank.)

PCA-Based Face Recognition Systems
Limiting the success of Module 2 (the recognition module) is our goal.

Basic Approach in Eigenfaces
1. Use a training set to identify a set of characteristic faces ("eigenfaces"). The eigenvectors satisfy the characteristic equation (A - λI)v = 0, where A is the covariance matrix C.
2. Given a gallery of known faces and a probe image of an unknown person, compare each face to the characteristic faces to get a distance measure for each.
3. The probe's identity is determined by the shortest distance to a gallery image.
4. There is one image per person in the gallery and one corresponding picture per person in the probe set.

Module 2: Eigenfaces / PCA
The training set defines a face space around the average face; the gallery and the probe are projected into it, and a distance measure compares the projected probe against the projected gallery.

De-identification: T-mask
Example continued: captured images are de-identified below. Here is a known image of Bob. Which person is Bob?
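The eigenfaces pipeline above can be sketched in pure Python; power iteration with deflation stands in for a proper eigensolver, faces are flattened intensity vectors, and all function names and data are illustrative, not a production implementation.

```python
import math

def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def power_iteration(A, seed=0, iters=200):
    """Dominant eigenvector of a symmetric matrix via power iteration."""
    v = [math.sin((seed + 1) * (i + 1)) for i in range(len(A))]  # fixed start
    for _ in range(iters):
        w = mat_vec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def covariance(X, mean):
    d = len(mean)
    C = [[0.0] * d for _ in range(d)]
    for row in X:
        c = [x - m for x, m in zip(row, mean)]
        for i in range(d):
            for j in range(d):
                C[i][j] += c[i] * c[j] / len(X)
    return C

def train_eigenfaces(training, n):
    """Step 1: mean face plus the top n eigenvectors of the covariance."""
    mean = [sum(col) / len(training) for col in zip(*training)]
    C = covariance(training, mean)
    eigenfaces = []
    for t in range(n):
        v = power_iteration(C, seed=t)
        lam = sum(mat_vec(C, v)[i] * v[i] for i in range(len(v)))
        eigenfaces.append(v)
        for i in range(len(v)):        # deflate: C <- C - lam * v v^T
            for j in range(len(v)):
                C[i][j] -= lam * v[i] * v[j]
    return mean, eigenfaces

def identify(probe, gallery, mean, eigenfaces):
    """Steps 2-3: project into face space, return nearest gallery index."""
    def project(f):
        c = [x - m for x, m in zip(f, mean)]
        return [sum(e[i] * c[i] for i in range(len(c))) for e in eigenfaces]
    p = project(probe)
    dists = [sum((a - b) ** 2 for a, b in zip(project(g), p)) for g in gallery]
    return dists.index(min(dists))
```

With a tiny gallery and a slightly perturbed probe, the probe projects nearest to its own gallery image; limiting that match rate is exactly what the de-identification methods that follow try to do.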

De-identification: T-mask (continued)
Captured images are de-identified below. Here is a known image of Bob. Which person is Bob?

De-identification: Pixel Reduction (continued)
Captured images are de-identified below. Which person is Bob? Automated recognition fails!
(Chart, not reproduced: percent of samples where the correct match is found, by rank.)
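Pixel reduction (pixelation) replaces each tile of the image with its average intensity. A minimal grayscale sketch on a list-of-lists image; as the following slides show, this naive step does not stop a recognizer trained on pixelated images.

```python
def pixelate(image, block):
    """Replace each block x block tile with its average intensity."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for i in range(0, h, block):
        for j in range(0, w, block):
            tile = [image[y][x]
                    for y in range(i, min(i + block, h))
                    for x in range(j, min(j + block, w))]
            avg = sum(tile) // len(tile)
            for y in range(i, min(i + block, h)):
                for x in range(j, min(j + block, w)):
                    out[y][x] = avg
    return out
```

A 2x2 block over a 2x2 image collapses it to a single average value; a block size of 1 leaves the image unchanged.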

Face Recognition Works, Humans Don't!
(Chart, not reproduced: percent of samples where the correct match is found, by rank.)

Some Experiments
- Single bar mask, T-mask, black blob, mouth only (grayscale and black & white).
- Ordinal data: threshold, pixelation; negative grayscale and black & white; random grayscale and black & white.
- "Mr. Potato Head."

Pixelation: Automated Recognition Easier!
Ad hoc methods = poor protection.
(Chart, not reproduced: percent identified for the all-black test.)

Uh Oh, Not Looking Good
(Charts, not reproduced: recognition rates across threshold levels, and identification rate versus the number of randomly changed grayscale pixels, for randomly changed gallery and probe sets.)

Don't Be Naive
- Again: de-identified does not equal anonymous.
- Masks can be removed and trained against.
- In some cases naive de-identification even harms privacy: pixelation and blur improve recognition performance.
- Time to get logical.

Back to k-Protection Models [Newton, Sweeney, Malin 2004, 2005]
- k-anonymity: for every record, there are at least k individuals to whom it refers (realized upon release).
- k-Same: for every face, there are at least k people to whom that face refers; no released face actually refers to a single real person.
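A simplified k-Same-Pixel sketch, assuming faces are flat lists of pixel intensities: repeatedly take a face, cluster it with its k-1 nearest remaining neighbors, and release the cluster average for all of them, so every released face averages at least k real ones. The published algorithm has details omitted here (e.g., averaging in eigenface space for k-Same-Eigen).

```python
def k_same_pixel(faces, k):
    """Release one average image per cluster of >= k faces."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    remaining = list(range(len(faces)))
    released = [None] * len(faces)
    while remaining:
        if len(remaining) < 2 * k:   # leftovers would fall below k: merge them all
            cluster = list(remaining)
        else:
            probe = remaining[0]
            cluster = sorted(remaining,
                             key=lambda i: dist(faces[probe], faces[i]))[:k]
        avg = [sum(faces[i][p] for i in cluster) / len(cluster)
               for p in range(len(faces[0]))]
        for i in cluster:
            released[i] = avg
            remaining.remove(i)
    return released
```

Each released image is shared by at least k originals, which is what bounds the recognizer's success at 1/k.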

Ranking of Faces
How does everyone rank against each other? Who is closest? Who is farthest?

Examples of k-Same
k-Same-Pixel and k-Same-Eigen results for several values of k, including an example of k-Same for k = 2.

Demonstration Time! (k-Same demo)

Performance of k-Same Algorithms
(Chart, not reproduced: percent correct at top rank versus k for Expected[k-Same], k-Same-Pixel, and k-Same-Eigen; the upper bound on recognition performance is 1/k.)

Thanks!
Some slides adapted from: Elaine Newton, Michael Shamos, Latanya Sweeney.
More information: http://privacy.cs.cmu.edu and http://www.cs.cmu.edu/~malin