P(o|w), P(o|s); s = speaker, w = word. Independence between phonemes and pitch. Insensitivity to phase differences. Phase characteristics

Transcription:

Independence between phonemes and pitch
[Figure: speech waveforms, plot labeled "A_a_512"]

Insensitivity to phase differences
[Figure: phase characteristics, amplitude characteristics, source characteristics, filter characteristics]

P(o|w) and P(o|s), where s = speaker and w = word:

$$P(o \mid w) = \sum_s P(o, s \mid w) = \sum_s P(o \mid w, s)\,P(s \mid w) \approx \sum_s P(o \mid w, s)\,P(s)$$

$$P(o \mid s) = \sum_w P(o, w \mid s) = \sum_w P(o \mid w, s)\,P(w \mid s) \approx \sum_w P(o \mid w, s)\,P(w)$$
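The two marginalizations over speakers and words can be checked numerically. The following is a minimal sketch, assuming that speaker and word are independent so that P(s|w) = P(s) and P(w|s) = P(w). The table sizes and the random tables are hypothetical and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_words, n_speakers = 4, 3, 5  # hypothetical sizes

# Priors over words and speakers, assumed independent: P(w, s) = P(w) P(s)
p_w = rng.dirichlet(np.ones(n_words))
p_s = rng.dirichlet(np.ones(n_speakers))

# Conditional table P(o | w, s), indexed [o, w, s]; sums to 1 over o
p_o_given_ws = rng.dirichlet(np.ones(n_obs), size=(n_words, n_speakers))
p_o_given_ws = np.moveaxis(p_o_given_ws, -1, 0)

# P(o | w) = sum_s P(o | w, s) P(s)   and   P(o | s) = sum_w P(o | w, s) P(w)
p_o_given_w = np.einsum("ows,s->ow", p_o_given_ws, p_s)
p_o_given_s = np.einsum("ows,w->os", p_o_given_ws, p_w)

# Cross-check P(o | w) against the full joint P(o, w, s)
joint = p_o_given_ws * p_w[None, :, None] * p_s[None, None, :]
p_o_given_w_from_joint = joint.sum(axis=2) / joint.sum(axis=(0, 2))[None, :]
```

Under the independence assumption the last step of each equation holds exactly, so `p_o_given_w` and `p_o_given_w_from_joint` agree up to rounding.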

[Figure: spectrogram (spectrum slice sequence) -> cepstrum vector sequence -> distribution sequence c_1, c_2, c_3, c_4, ..., c_D -> BD-based distance matrix (BD = Bhattacharyya distance)]

Really speaker-independent features
- Deep neural network [Hinton+ 06, 12]
  - Deeply stacked artificial neural networks
  - Results in a huge number of weights
  - Unsupervised pre-training and supervised fine-tuning
- Findings in DNN-based ASR [Mohamed+ 12]
  - The first several layers seem to work as extractors of invariant features or speaker-normalized features.
  - It is still difficult to interpret the structure and weights of a DNN physically.
  - Interpretable DNNs are becoming one of the hot topics [Sim 15].
- A simple question asked in tutorial talks on DNNs
  - What are really speaker-independent features?
  - Asked by N. Morgan at APSIPA2013 and ASRU2013

DNN as posterior estimator
- General framework for training a DNN
  - Unsupervised pre-training and supervised training
  - In the latter, speaker-adapted HMMs are used to prepare posteriors (= labels) for each frame of the training data.
  - The DNN is trained so that it can extract speaker-invariant features and estimate posteriors in a speaker-independent way.
- Output of DNN = posteriors (phoneme state posteriors in ASR)

Posteriors = normalized similarities
- Posteriors of { }
  - Can be interpreted as normalized similarity scores biased by priors.
  - Output of DNN = normalized similarity scores to a definite set of speaker-adapted acoustic anchors of { }.
- [Figure: two rows of anchors 1, 2, 3, ..., N; the legend marks speaker-dependent and speaker-independent (invariant) elements]
- Similarity scores can be converted to distances to anchors.
- Either a similarity matrix or a distance matrix can be used for clustering.
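The reading of posteriors as prior-biased, normalized similarity scores can be sketched as follows. This is only an illustration: the log-similarity values and priors are made up, and the negative-log conversion from posteriors to distances is an assumption, since the slide does not say which conversion is meant.

```python
import numpy as np

def posteriors_from_similarities(log_similarities, priors):
    """Weight similarity scores by priors and normalize them (Bayes' rule)."""
    log_scores = np.asarray(log_similarities, dtype=float) + np.log(priors)
    log_scores -= log_scores.max()  # subtract the max for numerical stability
    scores = np.exp(log_scores)
    return scores / scores.sum()

def distances_from_posteriors(posteriors):
    """Assumed conversion: a larger posterior means a smaller distance."""
    return -np.log(posteriors)

# Hypothetical log-similarities of one frame to three anchors, and their priors
log_sim = np.array([-1.0, -2.0, -4.0])
priors = np.array([0.5, 0.3, 0.2])

post = posteriors_from_similarities(log_sim, priors)
dist = distances_from_posteriors(post)
```

Stacking such distance vectors over frames gives a frame-by-anchor distance matrix of the kind that could be fed to a clustering routine.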

Distances to anchors
- Speech structure extracted from an utterance
- [Figure: spectrogram (spectrum slice sequence), cepstrum vector sequence, distribution sequence]
- Structure extraction for speakers and
- [Legend: speaker-dependent vs. speaker-independent (invariant)]

Invariant contrasts
- DNN as speaker-invariant contrast estimation
  - Uses speaker-dependent HMMs to prepare posterior labels.
  - Needs a huge amount of data to train the DNN so that speaker invariance is guaranteed.
- Structure extraction as speaker-invariant contrast detection
  - Uses within-utterance acoustic events only.
  - Speaker invariance is guaranteed by the invariant properties of f-divergence.
- [Figure: anchors 1, 2, 3, ..., N]
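The invariance claim can be illustrated with the Bhattacharyya distance, which is derived from an f-divergence. The sketch below assumes that both events are Gaussian and that the speaker difference is modeled as one invertible affine map applied to both. The matrix `A`, the offset `b`, and the random Gaussians are hypothetical.

```python
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians with full covariances."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term_cov = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term_mean + term_cov

def affine(mu, cov, A, b):
    """Gaussian parameters after the map x -> A x + b."""
    return A @ mu + b, A @ cov @ A.T

rng = np.random.default_rng(0)
dim = 3

def random_gaussian():
    mu = rng.normal(size=dim)
    L = rng.normal(size=(dim, dim))
    return mu, L @ L.T + np.eye(dim)  # positive-definite covariance

mu1, cov1 = random_gaussian()
mu2, cov2 = random_gaussian()

# Hypothetical invertible "speaker" transform (determinant = 5)
A = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, -1.0], [1.0, 0.0, 3.0]])
b = np.array([0.5, -1.0, 2.0])

before = bhattacharyya_distance(mu1, cov1, mu2, cov2)
after = bhattacharyya_distance(*affine(mu1, cov1, A, b), *affine(mu2, cov2, A, b))
```

The two values agree up to rounding, so a contrast measured between two events within one utterance does not change when the same transform is applied to both.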

Origin and evolution of language

A MODULATION-DEMODULATION MODEL FOR SPEECH COMMUNICATION AND ITS EMERGENCE

NOBUAKI MINEMATSU
Graduate School of Info. Sci. and Tech., The University of Tokyo, Japan, mine@gavo.t.u-tokyo.ac.jp

Perceptual invariance against large acoustic variability in speech has been a long-discussed question in speech science and engineering (Perkell & Klatt, 2002), and it is still an open question (Newman, 2008; Furui, 2009). Recently, we proposed a candidate answer based on mathematically guaranteed relational invariance (Minematsu et al., 2010; Qiao & Minematsu, 2010). Here, transform-invariant features, f-divergences, are extracted from the speech dynamics in an utterance to form an invariant topological shape which characterizes and represents the linguistic message conveyed in that utterance. In this paper, this representation is interpreted from the viewpoints of telecommunications, linguistics, and evolutionary anthropology. Speech production is often regarded as a process of modulating the baseline timbre of a speaker's voice by manipulating the vocal organs, i.e., spectrum modulation. Then, extraction of the linguistic message from an utterance can be viewed as a process of spectrum demodulation. This modulation-demodulation model of speech communication has a strong link to known morphological and cognitive differences between humans and apes.

Modulation used in telecommunication
- From Wikipedia: A musician modulates the tone from a musical instrument by varying its volume, timing and pitch. The three key parameters of a carrier sine wave are its amplitude ("volume"), its phase ("timing") and its frequency ("pitch"), all of which can be modified in accordance with a content signal to obtain the modulated carrier.
- [Figure: modulation turns a carrier into a modulated carrier; demodulation turns the modulated carrier back into the carrier and the extracted content]
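As a concrete instance of the amplitude ("volume") case, the sketch below modulates a sine carrier with a slow content signal and recovers the content by synchronous demodulation with an ideal FFT low-pass filter. The sampling rate, carrier frequency, content frequency, and modulation depth are all hypothetical values chosen so that every component falls on an exact FFT bin.

```python
import numpy as np

def am_modulate(message, carrier_hz, fs, depth=0.5):
    """Amplitude-modulate a cosine carrier with the given content signal."""
    t = np.arange(len(message)) / fs
    return (1.0 + depth * message) * np.cos(2 * np.pi * carrier_hz * t)

def am_demodulate(signal, carrier_hz, fs, cutoff_hz, depth=0.5):
    """Remove the carrier: mix down to baseband, low-pass, undo offset and depth."""
    t = np.arange(len(signal)) / fs
    mixed = 2.0 * signal * np.cos(2 * np.pi * carrier_hz * t)
    spectrum = np.fft.rfft(mixed)
    freqs = np.fft.rfftfreq(len(mixed), d=1.0 / fs)
    spectrum[freqs > cutoff_hz] = 0.0  # ideal low-pass filter
    baseband = np.fft.irfft(spectrum, n=len(mixed))
    return (baseband - 1.0) / depth

fs = 8000                                # samples per second
t = np.arange(fs) / fs                   # one second of signal
message = np.sin(2 * np.pi * 5 * t)      # 5 Hz content signal
modulated = am_modulate(message, carrier_hz=1000, fs=fs)
recovered = am_demodulate(modulated, carrier_hz=1000, fs=fs, cutoff_hz=50)
```

Here the demodulator needs only knowledge of the carrier, not a collection of example signals, which is the property the later slides on demodulation rely on.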

A way of characterizing speech production
- Speech production as spectrum modulation
  - Modulation in frequency (FM), amplitude (AM), and phase (PM)
    = Modulation in pitch, volume, and timing (from Wikipedia)
    = Pitch contour, intensity contour, and rhythm (= prosodic features)
  - What about a fourth parameter, which is spectrum (timbre)?
    = Modulation in spectrum (timbre) [Scott 07]
    = Another prosodic feature?
- Tongue = modulator
- Schwa = most lax = most frequent = home position = speaker-specific baseline timbre
- [Figure: vowel chart over time (Front/Central/Back by High/Mid/Low) with the words beat, bit, bet, bat, boot, put, bought, pot, bird, about, but]

Modulation spectrum
- Critical-band based temporal dynamics of speech
  - In pursuit of an invariant representation (Greenberg 97)
  - RASTA (= RelAtive SpecTrA, Hermansky 94)
- [Figure (Greenberg 97): block diagram with the labels: speech; critical-band FIR filter bank; lowpass, cutoff = 28 Hz; 100X; normalize by long-term avg.; |FFT|^2; limiting to peak 30 dB and bilinear interpolation; image]
- No mathematical proof for invariance
  - The direction of a trajectory is rotated by a VTL difference (Saito 08).
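A heavily simplified, single-band version of such a computation is sketched below: rectify the band signal, low-pass the envelope at 28 Hz, normalize by its long-term average, and take the squared FFT magnitude. It is not the Greenberg pipeline itself. The filter bank, the 100X stage, and the peak limiting are omitted, and the test signal (a 1000 Hz tone with a 4 Hz envelope) is hypothetical.

```python
import numpy as np

def lowpass_fft(x, fs, cutoff_hz):
    """Ideal low-pass filter implemented by zeroing FFT bins above the cutoff."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def modulation_spectrum(band_signal, fs, cutoff_hz=28.0):
    """Squared FFT magnitude of the normalized temporal envelope of one band."""
    envelope = lowpass_fft(np.abs(band_signal), fs, cutoff_hz)  # rectify, then smooth
    envelope = envelope / envelope.mean()                       # long-term average = 1
    power = np.abs(np.fft.rfft(envelope - 1.0)) ** 2            # drop DC, then |FFT|^2
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs)
    return freqs, power

fs = 8000
t = np.arange(2 * fs) / fs  # two seconds, so bins are 0.5 Hz apart
band = (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.cos(2 * np.pi * 1000 * t)

freqs, power = modulation_spectrum(band, fs)
peak_hz = freqs[np.argmax(power)]
```

The peak of `power` sits at the envelope rate of the test signal, which is the quantity a modulation spectrum is meant to expose.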

Invariant speech structure
- Utterance to structure conversion using f-divergence [Minematsu 06]
- [Figure: spectrogram (spectrum slice sequence) -> cepstrum vector sequence -> distribution sequence c_1, c_2, c_3, c_4, ..., c_D -> BD-based distance matrix (BD = Bhattacharyya distance)]
- An event (distribution) has to be much smaller than a phoneme.
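The conversion from a feature sequence to a distance matrix might be sketched as below. Several choices here are assumptions made for illustration and are not taken from the slides: events are formed by fixed-length segmentation, each event is a diagonal-covariance Gaussian, and the input is random data standing in for a cepstrum vector sequence.

```python
import numpy as np

def fit_events(features, frames_per_event):
    """Split a (frames, dims) sequence into fixed-length events, one Gaussian each."""
    n_events = len(features) // frames_per_event
    events = []
    for i in range(n_events):
        chunk = features[i * frames_per_event:(i + 1) * frames_per_event]
        events.append((chunk.mean(axis=0), chunk.var(axis=0) + 1e-6))  # variance floor
    return events

def bd_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    var = 0.5 * (var1 + var2)
    term_mean = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
    term_var = 0.5 * np.sum(np.log(var) - 0.5 * (np.log(var1) + np.log(var2)))
    return term_mean + term_var

def structure_matrix(events):
    """Symmetric matrix of pairwise distances between all events."""
    n = len(events)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = bd_diag(*events[i], *events[j])
    return D

# Stand-in for a cepstrum sequence: 60 frames, 4 dims, mean drifting every 10 frames
rng = np.random.default_rng(0)
features = rng.normal(size=(60, 4)) + np.repeat(np.arange(6)[:, None], 10, axis=0)

events = fit_events(features, frames_per_event=10)
D = structure_matrix(events)
```

The resulting `D` holds only contrasts between events of the same utterance, so no absolute spectral value enters the representation.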

Demodulation used in telecommunication
- Demodulation in frequency, amplitude, and phase
- Demodulation = a process of extracting a message intact by removing the carrier component from the modulated carrier signal.
  - Not by extensive collection of samples of modulated carriers
  - (Not by hiding the carrier component by extensive collection)
- [Figure: modulation turns a carrier into a modulated carrier; demodulation turns the modulated carrier back into the carrier and the extracted message]

Spectrum demodulation
- Speech recognition = spectrum (timbre) demodulation
- Demodulation = a process of extracting a message intact by removing the carrier component from the modulated carrier signal.
  - By removing speaker-specific baseline spectrum characteristics
  - Not by extensive collection of samples of modulated carriers
  - (Not by hiding the carrier component by extensive collection)
- [Figure: modulation turns a carrier into a modulated carrier; demodulation turns the modulated carrier back into the carrier and the extracted message]

Two questions
- Q1: Does the ape have a good modulator?
  - Does the tongue of the ape work as a good modulator?
- Q2: Does the ape have a good demodulator?
  - Does the ear (brain) of the ape extract the message intact?
- [Figure: modulation and demodulation between carrier and modulated carrier]

Structural differences in the mouth and the nose
[Figure: two anatomical diagrams, each labeled pharynx, larynx, lung, stomach]

Flexibility of tongue motion
- The chimp's tongue is much stiffer than the human's.
  - Morphological analyses and 3D modeling of the tongue musculature of the chimpanzee (Takemoto 08)
  - Less capability of manipulating the shape of the tongue.

Q1: Does the ape have a good modulator?
- Morphological characteristics of the ape's tongue
  - Two (almost) independent tracts [Hayama 99]
    - One is from the nose to the lung for breathing.
    - The other is from the mouth to the stomach for eating.
  - Much lower ability of deforming the tongue shape [Takemoto 08]
    - The chimp's tongue is stiffer than the human's.
- [Figure: carrier and modulation]

Nature's solution for static bias?
- How old is invariant perception in evolution? [Hauser 03]
- [Figure: two stimuli labeled 1 and 2, with "1 = 2"]
- At least, frequency (pitch) demodulation seems difficult.

Language acquisition through vocal imitation
- VI = children's active imitation of parents' utterances
  - Language acquisition is based on vocal imitation [Jusczyk 00].
  - VI is very rare in animals. No other primate does VI [Gruhn 06].
  - Only small birds, whales, and dolphins do VI [Okanoya 08].
- Animals' VI = acoustic imitation, but humans' VI = acoustic = ??
  - Acoustic imitation performed by myna birds [Miyamoto 95]
    - They imitate the sounds of cars, doors, dogs, and cats as well as human voices.
    - Hearing a very good myna bird say something, one can guess its owner.
  - Beyond-scale imitation of utterances performed by children
    - No one can guess a parent by hearing the voice of his/her child.
    - Very weird imitation from the viewpoint of animal science [Okanoya 08].

Q2: Does the ape have a good demodulator?
- Cognitive difference between the ape and the human
  - Humans can extract messages embedded in the modulated carrier.
  - It seems that animals treat the modulated carrier as it is.
- From the modulated carrier, what can they know?
  - Apes can identify individuals by hearing their voices.
  - Lower/higher formant frequencies = larger/smaller apes
- [Figure: modulation and demodulation between carrier and modulated carrier]

Function of the voice timbre
- What is the original function of the voice timbre?
- For apes
  - The voice timbre is an acoustic correlate of the identity of apes.
- For speech scientists and engineers
  - They started research by correlating the voice timbre with messages conveyed by the speech stream, such as words and phonemes.
  - Formant frequencies are treated as acoustic correlates of vowels.
  - Speech recognition started first; speaker recognition followed.

$$f_n = \frac{c}{2 l_1}\,n, \qquad f_n = \frac{c}{2 l_2}\,n, \qquad f = \frac{c}{2\pi}\left[\frac{A_2}{A_1 l_1 l_2}\right]^{1/2}$$
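The resonance formulas on this slide can be evaluated directly. The sketch below assumes the usual two-tube reading, which the slide itself does not spell out: c is the speed of sound, l_1 and A_1 are the length and cross-sectional area of one tube section, and l_2 and A_2 those of the other. All numeric dimensions are hypothetical.

```python
import math

SPEED_OF_SOUND_CM_S = 35000.0  # assumed value for c, in cm/s

def half_wave_resonances(length_cm, n_max=3, c=SPEED_OF_SOUND_CM_S):
    """f_n = n * c / (2 * l) for n = 1..n_max."""
    return [n * c / (2.0 * length_cm) for n in range(1, n_max + 1)]

def helmholtz_resonance(a1, l1, a2, l2, c=SPEED_OF_SOUND_CM_S):
    """f = (c / 2 pi) * sqrt(A2 / (A1 * l1 * l2))."""
    return c / (2.0 * math.pi) * math.sqrt(a2 / (a1 * l1 * l2))

# Hypothetical dimensions (cm and cm^2)
a1, l1, a2, l2 = 8.0, 9.0, 1.0, 6.0
f_helmholtz = helmholtz_resonance(a1, l1, a2, l2)
f_tube1 = half_wave_resonances(l1)
f_tube2 = half_wave_resonances(l2)

# Scale every length by k (so areas scale by k^2): a uniformly larger vocal tract
k = 1.2
f_helmholtz_large = helmholtz_resonance(a1 * k**2, l1 * k, a2 * k**2, l2 * k)
f_tube1_large = half_wave_resonances(l1 * k)
```

Under uniform scaling every resonance drops by the factor 1/k, which matches the statement that lower formant frequencies go with larger bodies.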

What is the goal of speech engineering?

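At an open campus for high school students

What kind of computer is a computer that understands language?
A simple question for high school students from an engineering faculty member who studies language at the University of Tokyo

Siri, Shabette Concier, IBM Watson: are they computers that "understand language"?
- "What time is it now in New York?" "It is 10 p.m. on August 6."
- "How high is the stage of Kiyomizu-dera?" "About 13 meters."
- "The game in which, when the soda bottle stops spinning, the person in front of the mouth of the bottle puckers their lips is ..." "Spin-the-bottle."

They appear to understand what was spoken or written, examine it, and reply. But do they really "understand language"? Or do they merely make it look as if they understand language? This poster was made to encourage high school students to think a little more deeply about what it means to "understand language". It introduces how earlier thinkers approached the question above. Perhaps the person who ends up building a computer that truly understands language will be you, a few years, or rather a few decades, from now.

Do you know the "Turing test"?
A test devised by the mathematician Alan Turing to judge whether a machine is intelligent. A human judge C converses in ordinary language with two isolated partners, A and B. One of A and B is a machine and the other is a human. After the conversation, C guesses which is the human and which is the machine. If the two are difficult to tell apart, the machine passes the test, that is, it is judged to be intelligent. It is a criterion still often used in artificial intelligence research today.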

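At an open campus for high school students

Do you know the "Chinese room"?
A sharp rejoinder (a thought experiment) that the philosopher John Searle posed against the Turing test. A person who understands only the alphabet is shut in a small room. The room has a single hole for exchanging slips of paper with the outside. A slip of paper is passed in to this person through the hole. Something is written on it in Chinese characters, but to him it is nothing more than a string of symbols. His job is to add a new string of symbols to this string and pass it back out. Which string to add is written in a manual, for example: "If it says ..., add ... and pass it out." To someone outside the room observing the slips of paper, it would seem that a person who understands Chinese is inside, even though the person in the room does not understand Chinese characters at all.

There may be quite a few examples of things that only appear to do XX.
- Planetarium: it basically moves the stars according to the geocentric model, because the seats do not move. But if the goal is to reproduce the apparent motion of the stars, the geocentric and heliocentric models give almost the same result.
- Clever Hans: a horse that "could calculate" and became famous in Germany in the early 20th century. The trick was later revealed by scientific methods.
- DaiGo: a mentalist who has been making a stir in the Japanese TV industry in the early 21st century. In his case, he states openly that "there is a trick".

Do we carefully craft the outward appearance, or do we insist on the internal mechanism? In the end, what must a computer be able to do to count as a computer that "understands language"? That definition is the hard part.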

At an open campus for high school students

It is difficult to define the necessary and sufficient conditions for realizing a computer that "understands language". Perhaps all we can do is to enumerate necessary conditions. Which necessary condition to focus on and implement as technology then becomes a matter of each researcher's own commitment, and it shows up in his or her research strategy.

Now, if you were to build a computer that "understands language", what kind of computer would you build? Please try to find your own answer in this room.