Neural Network-Based Machine Translation Technology. Konkuk University Computational Intelligence Lab. 김강일

Neural Network-Based Machine Translation Technology. Konkuk University Computational Intelligence Lab. http://ci.konkuk.ac.kr kikim01@kunkuk.ac.kr 김강일

Index: Issues in AI and Deep Learning / Overview of Machine Translation / Advanced Techniques in NMT / Issues in NMT Research

Issues in AI and Deep Learning

Issues in AI and Deep Learning 1. What is the distinguishing property of deep learning? 2. What range of problems can deep learning solve? 3. Why can deep learning abstract features? 4. Why can deep learning extract features?

Issues in AI and Deep Learning No Free Lunch Theorem: to get good performance, algorithms must be adapted to specific problems. [Diagram: an AI expert and a domain expert adapt Algorithm1/Algorithm2 to the problem through feature engineering, structure design, and heuristics]

Issues in AI and Deep Learning No Free Lunch Theorem: with deep learning, the same algorithm can be applied to different problems with little adaptation. Benefit: (almost) automated AI system building, which is very good for industrialization.

Issues in AI and Deep Learning 1. What is the distinguishing property of deep learning? 2. What range of problems can deep learning solve? 3. Why can deep learning abstract features? 4. Why can deep learning extract features?

Issues in AI and Deep Learning To represent information (minimum description length): Information = Model + Error

Issues in AI and Deep Learning Small model: good for representing information as regular patterns; may restrict the representation of very complex patterns through implicit model constraints; a simplified pattern generalizes better to unseen data (a belief). VS Large model: good for representing all patterns, but tends to represent only the observed patterns (overfitting).

Issues in AI and Deep Learning [Diagram: small model vs. large model, each drawn against the sets of predictable cases, training cases, and all cases it covers]

Issues in AI and Deep Learning To represent information (minimum description length): Information = Model + Error. Neural networks are good at representing very accurate, large models.
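
In two-part minimum description length terms (standard MDL notation, not taken from the slides), the same decomposition reads:

L(D) = L(M) + L(D | M)

i.e. the total description length of the data D is the length of the model M plus the length of the errors (residuals) of the data given the model.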

Issues in AI and Deep Learning Overfitting? -> collect more and more data. Collect data!!!

Issues in AI and Deep Learning Problems where large data can be collected -> deep learning (neural networks); problems where large data cannot be collected -> other AI approaches.

Issues in AI and Deep Learning 1. What is the distinguishing property of deep learning? 2. What range of problems can deep learning solve? 3. Why can deep learning abstract features? 4. Why can deep learning extract features?

Issues in AI and Deep Learning Simple example in NLP [Diagram: input vector space with input nodes X, Y, Z; example word sequences plotted as points] Each dimension corresponds to one word, and the dimensions together span the whole vocabulary, e.g. I, you, we, love, like, on, in, for, ., period.

Issues in AI and Deep Learning Simple example in NLP [Diagram: a single neuron I over the input nodes X, Y, Z, with weights w_x, w_y, w_z and bias b, applied to the input vector space]

Issues in AI and Deep Learning Simple example in NLP [Diagram: neurons I1, I2, I3, each with weights w_x, w_y, w_z and a bias b over the input nodes X, Y, Z]

Issues in AI and Deep Learning Simple example in NLP [Diagram: the outputs I1, I2, I3 computed from the input nodes X, Y, Z with weights w_x, w_y, w_z and bias b]
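
A minimal numerical sketch of the single neuron shown on these slides; the weight names w_x, w_y, w_z and bias b follow the figure, and the values are purely illustrative:

import numpy as np

# One-hot input over a 3-word vocabulary: dimensions X, Y, Z.
x = np.array([1.0, 0.0, 0.0])      # the word on dimension X is present

w = np.array([0.7, -0.2, 0.1])     # weights w_x, w_y, w_z
b = 0.05                           # bias b

output = np.dot(w, x) + b          # the neuron's value: w.x + b
print(output)                      # 0.75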

Issues in AI and Deep Learning Simple example in NLP [Diagram: the input vector space is mapped to an output vector space whose regions correspond to classes such as NP, VP, and Others]

Issues in AI and Deep Learning In training with supervised data [Diagram: input vector space mapped to the output vector space (softmax) with regions NP, VP, Others]

Issues in AI and Deep Learning In two layers [Diagram: input vector space -> layer-1 output space -> output vector space (softmax)]

Issues in AI and Deep Learning Feature abstraction [Diagram: input vector space -> layer-1 output space -> output vector space (softmax); each layer maps its input to a more abstract feature space]

Issues in AI and Deep Learning In many layers [Diagram: input vector space -> a sequence of intermediate layer output spaces -> output vector space (softmax)]
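
A minimal sketch of the layered mapping described above, with random weights and hypothetical dimensions; the only point is the chain input space -> layer-1 output space -> softmax output over classes such as NP, VP, Others:

import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, classes = 8, 4, 3               # word dimensions, layer-1 dimensions, output classes

W1, b1 = rng.normal(size=(hidden, vocab)), np.zeros(hidden)
W2, b2 = rng.normal(size=(classes, hidden)), np.zeros(classes)

x = np.zeros(vocab)
x[2] = 1.0                                     # one-hot input word

h = np.tanh(W1 @ x + b1)                       # layer-1 output space (abstracted features)
logits = W2 @ h + b2
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the output classes
print(probs, probs.sum())                      # three probabilities that sum to 1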

Issues in AI and Deep Learning 1. What is the distinguishing property of deep learning? 2. What range of problems can deep learning solve? 3. Why can deep learning abstract features? 4. Why can deep learning extract features?

Issues in AI and Deep Learning Compared to a generative probabilistic graphical model? For "I want to go to school": how do we assign each observation to a random variable, and what does that cost in model accuracy? In neural networks, if two observation values are dependent, their hidden outputs generate the same output; if the values are independent, the hidden vectors generate different values.

Issues in AI and Deep Learning In classification (determined by segmentation), the final decision is dependent only on X. [Diagram: input vector space -> layer-1 output space -> output vector space (softmax), segmented into regions such as NP]

Issues in AI and Deep Learning In regression (determined by the location in the effective, nonzero-gradient region), the final value is dependent only on X. [Diagram: input vector space -> layer-1 output space -> output vector space (softmax)]

Issues in AI and Deep Learning In classification (determined by segmentation), a small rotation or shift of a segment boundary changes the dependency of many input vectors. [Diagram: input vector space -> layer-1 output space -> output vector space (softmax)]

Overview of Machine Translation

Overview of Machine Translation The range of translation discussed in this tutorial: a translator receives a sentence (or sentences) through an interface and returns a sentence (or sentences); the translator may be a bilingual human or a computer.

Overview of Machine Translation How to build a translator? Simplified problem definition used in the current academic community - Input: a source sentence - Output: a target sentence - To build: f(source) = target How to build f? How to model f?

Overview of Machine Translation Save the mapping between two sentences in the computer; if the source matches a saved mapping, translate it. 나는사과먹고싶어 -> I want to eat an apple. Too many sentences! The usual vocabulary in simple conversation exceeds 40,000 words, and the mean sentence length is about 10 words (actually closer to 30), giving roughly 40,000^10 ~ 10^46 sentences. Too large a model -> weak on unseen data.
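
A toy sketch of this memorize-everything approach (the single table entry is the slide's example; everything else is hypothetical); the rough count shows why it cannot scale:

# Whole-sentence lookup table: only exact matches can be translated.
memory = {"나는사과먹고싶어": "I want to eat an apple."}

def translate(source):
    return memory.get(source, "<unknown sentence>")

print(translate("나는사과먹고싶어"))    # hit
print(translate("나는 학교에 가"))       # miss: any unseen sentence fails

# Rough size of the sentence space: 40,000 words, about 10 words per sentence.
print(f"{40_000 ** 10:.1e}")             # ~1.0e+46 possible sentences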

Overview of Machine Translation Save the mappings between partial components and build a translation from them: 나 -> I, 사과 -> an apple, 먹 -> eat, ~고싶다 -> want to. 나는사과먹고싶어 -> I 사과먹고싶어 -> I an apple 먹고싶어 -> I an apple eat 고싶어 -> I an apple eat want to -> I want to eat an apple. We don't need to save frequently used expressions and words repeatedly. But we may ignore dependencies between expressions.
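
A minimal sketch of the slide's substitution chain; the phrase table is hypothetical and the segmentation is deliberately naive:

# Replace each known source fragment in turn, as on the slide.
mapping = [("나는", "I"), ("사과", "an apple"), ("먹", "eat"), ("고싶어", "want to")]

sentence = "나는사과먹고싶어"
for src, tgt in mapping:
    sentence = sentence.replace(src, tgt + " ")
print(sentence.strip())
# -> "I an apple eat want to": the pieces are right, but ordering and dependency are still missing.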

Overview of Machine Translation I want to have an apple -> 나는사과를먹고싶어 I want to have a car -> 나는차를가지고싶어 have -> 먹 have -> 가지 Translation: I want to have a car -> 나는차를먹 / 가지고싶어 How to select the correct expression? This is not caused by ambiguity but by lost dependency information.

Overview of Machine Translation I want to have an apple -> 나는사과를먹고싶어 I want to have a car -> 나는차를가지고싶어 have an apple -> 사과를먹 have a car -> 차를가지 Translation: I want to have a car -> 나는차를가지고싶어 Issue 1: How do we know the dependencies of an expression? Issue 2: How do we collect all expressions together with all of their dependent components?
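
A toy illustration of Issue 1 with a hypothetical phrase table: preferring the longest matching source expression keeps the context that disambiguates "have".

phrases = {
    "have an apple": "사과를 먹",
    "have a car": "차를 가지",
    "have": "먹",            # ambiguous short entry
    "want to": "고 싶",
    "i": "나는",
}

def lookup(tokens):
    """Translate by always preferring the longest matching phrase."""
    out, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):     # try the longest span first
            key = " ".join(tokens[i:j]).lower()
            if key in phrases:
                out.append(phrases[key])
                i = j
                break
        else:
            out.append(tokens[i])               # pass unknown words through
            i += 1
    return out

print(lookup("I want to have a car".split()))
# ['나는', '고 싶', '차를 가지']: "have a car" wins over the bare "have" entry.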

Overview of Machine Translation Rule-based machine translation - Collect rules from a corpus through algorithms or human experts. A simple rule-based translation: source sentence analysis -> rule application -> reordering -> additional post-processing. So many rules!! - Collecting rules costs too much - Rules conflict with each other

Overview of Machine Translation I want to have an apple -> 나는사과를먹고싶어 have an apple -> 사과를먹 want to have -> 가지고싶 Translation: I want to have an apple -> 나는사과를가지고 / 먹고싶어

Overview of Machine Translation Statistical machine translation (SMT) - Managing all rules and combinations in a probabilistic model - Rule selection completely relies on the probabilistic model Goal of SMT? Selecting rules and combinations maximizing the probability of generating the target sentence

Overview of Machine Translation argmax_e p(e | f) = argmax_e p(f | e) p(e), where f is a source sentence and e is a target sentence. Translation model p(f | e): the probability of the component mappings. Language model p(e): the probability of the sentence in the target language.
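
The equation follows from Bayes' rule (standard derivation, spelled out for completeness):

argmax_e p(e | f) = argmax_e p(f | e) p(e) / p(f) = argmax_e p(f | e) p(e)

since p(f) is fixed for a given source sentence and does not affect the argmax over e.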

Overview of Machine Translation Probabilistic model representations for the TM and LM: - N-gram, Bayesian Network, Markov Random Field, discriminative approaches - SVM, Gaussian Mixtures, other classifiers.. - Hidden Markov Model, Conditional Random Field, other sequential classifiers.. Any traditional probabilistic model can be applied. Because each variable has a large number of categories, n-grams are usually used (a fully connected graphical model with a given cardinality).
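
A minimal bigram language model sketch with a toy corpus and maximum-likelihood counts, purely to illustrate the n-gram case mentioned above:

from collections import Counter

corpus = [["<s>", "i", "want", "to", "eat", "an", "apple", "</s>"],
          ["<s>", "i", "want", "to", "go", "to", "school", "</s>"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent[:-1])
    bigrams.update(zip(sent[:-1], sent[1:]))

def p_sentence(tokens):
    """Maximum-likelihood bigram probability of a token sequence (no smoothing)."""
    p = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(p_sentence(["<s>", "i", "want", "to", "eat", "an", "apple", "</s>"]))   # ~0.33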

Overview of Machine Translation Information in flat structures is insufficient. Expressions often have long-distance dependencies -> difficult to detect with a simple word-level decomposition of the source sentence. Mapping patterns are often very abstract (e.g. S V O -> S O V). Syntactic and semantic analysis is required.

Overview of Machine Translation Error propagation makes the final translation quality very low. [Diagram: pipeline from the source sentence through word/phrase level (POS tagging, 99%), syntax level (dependency, 90%), semantic level (grammatical relation, 90%), and logic (80%), then the decoder (segmentation, alignment, reordering), to the target sentence; errors propagate through the stages]

Overview of Machine Translation Neural Machine Translation? argmax_e p(e | f): learn the probability directly with neural networks -> learning a conditional language model -> no separate analysis and decoding process -> every step is trained inside the neural network
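
Written out, the conditional language model view is the standard factorization

p(e | f) = Π_t p(e_t | e_1, ..., e_{t-1}, f)

i.e. each target word is predicted from the target words generated so far plus the whole source sentence, with the network's hidden state carrying both.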

Neural Machine Translation

Neural Machine Translation Recurrent Neural Networks (Simple Elman Network), unfolded over time (* figure from the Wikipedia "Recurrent neural network" page)

Neural Machine Translation Applicable to various types of sequence classification problems, including translation

Neural Machine Translation Recurrent Neural Networks in translation [Diagram: an RNN reads the source <s> 나는 학교에 가 <e> and emits the target I go to school <e>]

Neural Machine Translation Recurrent Neural Networks - Gradient Vanishing over time

Neural Machine Translation Recurrent Neural Networks with Long Short Term Memory

Neural Machine Translation Recurrent Neural Networks with Long Short Term Memory [Diagram: one cell of an RNN-LSTM layer: the previous memory cell c(t-1) is updated to c(t) through the forget (f), input (i), and output (o) gates of the cell control vector; the previous history h(t-1) and the current word vector feed the history-decoding and word-info-decoding layers, producing h(t)]
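
For reference, the standard LSTM cell update that this diagram corresponds to (common formulation; the slide's own notation may differ in detail):

i_t  = sigmoid(W_i x_t + U_i h_{t-1} + b_i)      (input gate)
f_t  = sigmoid(W_f x_t + U_f h_{t-1} + b_f)      (forget gate)
o_t  = sigmoid(W_o x_t + U_o h_{t-1} + b_o)      (output gate)
c~_t = tanh(W_c x_t + U_c h_{t-1} + b_c)         (candidate cell value)
c_t  = f_t * c_{t-1} + i_t * c~_t                (memory cell update)
h_t  = o_t * tanh(c_t)                           (history / output vector)

where x_t is the current word vector and * is elementwise multiplication.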

Neural Machine Translation Recurrent Neural Networks with Long Short Term Memory Stacked LSTM: many input values influence many output values; a too-dense vector distribution is difficult to train -> sufficient expressive power is required.

Neural Machine Translation Recurrent Neural Networks with Long Short Term Memory Stacked LSTM What if structural information is required? Stacking!

Neural Machine Translation Recurrent Neural Networks with Long Short Term Memory Stacked LSTM

Neural Machine Translation Recurrent Neural Networks with Long Short Term Memory Stacked LSTM [Diagram: a stacked RNN-LSTM reads <s> 나는 학교에 가 <e> and emits I go to school <e>] 4 ~ 8 stacked layers are required for good translation (in empirical reports).

Neural Machine Translation Recurrent Neural Networks with Long Short Term Memory Stacked LSTM - detailed structure [Diagram: the stacked RNN-LSTM layers read the source sentence as the input sequence, producing hidden states h0 ... hk, from which the target sentence is generated]
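
A minimal sketch of such a stacked-LSTM encoder-decoder, assuming PyTorch; the hyperparameters are illustrative and there is no attention, matching the plain stacked picture above:

import torch.nn as nn

class Seq2Seq(nn.Module):
    """Stacked-LSTM encoder-decoder sketch (teacher forcing, no attention)."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source sentence; keep only the final (h, c) states of every layer.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decode the target sentence conditioned on the encoder state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)   # per-step scores over the target vocabulary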

Neural Machine Translation We saw: - how to apply RNN, RNN with LSTM, and RNN with LSTM stacks - why we need the more complex LSTM and LSTM stacks - how LSTM is applied to translation. Some issues to discuss: - LSTM was proposed in the 1990s, so why has LSTM-based translation become popular only now? GPUs, computing power! (Jürgen Schmidhuber, 2014, Deep Learning in Neural Networks: An Overview, IDSIA lab, Switzerland)

Neural Machine Translation A stacked LSTM is expected to learn structural information, long-distance relations, translation equivalence, and sentence decomposition (segmentation, tagging, parsing, alignment, reordering, post-processing, everything). Can a simple LSTM learn all the information needed for a good translation? No: it may be able to represent all of the conditions, but training is difficult -> the next issue in NMT is how to build networks that efficiently train the required information.

Advanced Techniques in Neural Machine Translation

Advanced Techniques in NMT: recurrent neural network, LSTM/GRU, bidirectional, attention, syntactic guide, direct link from input to hidden layers, 2-dimensional grid structure, ensemble, explicit rare word models, zero-resource training

Advanced Techniques in NMT Recurrent Neural Network with Long Short Term Memory (Sutskever, 2014, Sequence to Sequence Learning with Neural Networks)

Advanced Techniques in NMT LSTM/GRU (Chung, 2014, Empirical evaluation of gated recurrent neural networks on sequence modeling)

Advanced Techniques in NMT Attention and Bidirectional Model (Bahdanau, 2015, NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE)
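
The attention mechanism of that paper, in its standard form (added for reference):

e_tj     = a(s_{t-1}, h_j)                       (score of source position j at target step t)
alpha_tj = exp(e_tj) / sum_k exp(e_tk)           (attention weights, normalized by softmax)
c_t      = sum_j alpha_tj h_j                    (context vector used by the decoder at step t)

where h_j are the bidirectional encoder states, s_{t-1} is the previous decoder state, and a is a small feed-forward scoring network.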

Advanced Techniques in NMT Rare Word Modeling (Sutskever, 2015, Addressing the Rare Word Problem in Neural Machine Translation)

Advanced Techniques in NMT Syntactic Guide (Stahlberg, 2016, Syntactically Guided Neural Machine Translation)

Advanced Techniques in NMT Direct Link between LSTM Stacks (Deep-Att.) (J Zhou, 2016, Deep recurrent models with fast-forward connections for neural machine translation)

Advanced Techniques in NMT Multidimensional LSTM (Kalchbrenner, 2016, Grid Long Short-Term Memory)

Advanced Techniques in NMT Combining most of the techniques.. (Wu, 2016, Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation)

Advanced Techniques in NMT Zero-Resource Training (Shared Attention Model) (Firat, 2016, Zero-Resource Translation with Multi-Lingual Neural Machine Translation) [Diagram: pivot-based translation vs. the shared attention model; the language pairs are not trained independently]

Issues in NMT Research

Issues in NMT Research Google NMT Report

Issues in NMT Research Google NMT Report

Issues in NMT Research Google NMT Report Model representation: bidirectional (shallow layer only), 1024 nodes per layer, simple attention, direct link (input to LSTM stacks). Optimization: stochastic gradient descent / Adam mixture, gradient clipping, uniform weight initialization, asynchronous parallel computation of gradients, dropout. Translation: quantization, beam search, postprocessing model (reinforcement learning), rare word replacement (target side; explicit model).

Issues in NMT Research Google NMT Report Training data set (En-Fr): internal set (3.6G ~ 36G sentences?), WMT14 (36M sentences). Hardware: 12-node cluster (8 GPUs per node), Nvidia K80 (24 GB), Tensor Processing Unit (?). Training time: 6 days.

Issues in NMT Research Following the state of the art in NMT requires GPU clusters. For one best-performance validation run: Google: 6 days; a single Titan X: 96 (GPUs) x 8 (ensembles) x 6 (days) = 4608 days (over 12 years). The speed improvement from parallelism may be overestimated; even assuming the effective speedup factor is just 2 (not likely), a single Titan X would still need 96 days. So Google's setup is 16 ~ 768 times faster than a single GPU; if they use TPUs in training, 160 ~ 7680 times faster.

Summary We saw: - properties of AI and deep learning - machine translation history - basic NMT - the latest NMT techniques. Next NMT issues? - efficient network structures for training - reducing training time (parallel processing, HW/SW, architecture). Google NMT: huge computing power is required (20M or more sentences, En-Fr) - a machine with at least 8 GPUs is recommended.