    • 1. Invention Application
    • Title: EVALUATING TEXT-TO-SPEECH INTELLIGIBILITY USING TEMPLATE CONSTRAINED GENERALIZED POSTERIOR PROBABILITY
    • Publication No.: US20140025381A1
    • Publication Date: 2014-01-23
    • Application No.: US13554480
    • Filing Date: 2012-07-20
    • Inventors: Linfang Wang, Yan Teng, Lijuan Wang, Frank Kao-Ping Soong, Zhe Geng, William Brad Waller, Mark Tillman Hanson
    • IPC: G10L13/08
    • CPC: G10L25/69; G10L13/00
    • Abstract: Instead of relying on humans to subjectively evaluate the speech intelligibility of a subject, a system objectively evaluates speech intelligibility. The system receives speech input and calculates confidence scores at multiple different levels using a Template Constrained Generalized Posterior Probability algorithm. One or multiple intelligibility classifiers are used to classify the desired entities on an intelligibility scale. A specific intelligibility classifier uses features such as the various confidence scores. The scale of the intelligibility classification can be adjusted to suit the application scenario. Based on the confidence score distributions and the intelligibility classification results at multiple levels, an overall objective intelligibility score is calculated. The objective intelligibility scores can be used to rank the different subjects or systems being assessed according to their intelligibility levels. Speech that falls below a predetermined intelligibility threshold (e.g., utterances with low confidence scores and the most severe intelligibility issues) can be automatically selected for further analysis.
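The pipeline this abstract describes lends itself to a short sketch: multi-level confidence scoring, an intelligibility classifier fed by those scores, an aggregated objective score, and automatic selection of the worst utterances. Computing TCGPP itself requires an ASR decoder and a recognition lattice, so the per-word confidence scores below are assumed to be precomputed; the utterance IDs, thresholds, and 3-point scale are illustrative only, not from the patent.

```python
"""Sketch of the multi-level intelligibility scoring pipeline in
US20140025381A1, with precomputed per-word TCGPP confidence scores
standing in for the decoder/lattice stage."""

from statistics import mean

# Hypothetical per-word TCGPP confidence scores for a subject's utterances.
utterances = {
    "utt_001": [0.95, 0.91, 0.88, 0.97],
    "utt_002": [0.40, 0.35, 0.72, 0.28],
    "utt_003": [0.81, 0.66, 0.74, 0.79],
}

def classify_utterance(word_scores, lo=0.5, hi=0.8):
    """Toy intelligibility classifier whose feature is the word-level
    confidence distribution, graded on an adjustable 3-point scale."""
    avg = mean(word_scores)
    if avg >= hi:
        return 2  # clearly intelligible
    if avg >= lo:
        return 1  # partially intelligible
    return 0      # unintelligible

labels = {utt: classify_utterance(scores) for utt, scores in utterances.items()}

# Overall objective score aggregated over the multi-level results; usable
# for ranking subjects or systems by intelligibility level.
overall = mean(mean(scores) for scores in utterances.values())
print(f"overall objective intelligibility: {overall:.2f}")

# Automatically select the lowest-confidence utterances for further analysis.
flagged = [utt for utt, label in labels.items() if label == 0]
print("flagged for review:", flagged)
```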
    • 2. Invention Application
    • Title: Minimum Converted Trajectory Error (MCTE) Audio-to-Video Engine
    • Publication No.: US20120116761A1
    • Publication Date: 2012-05-10
    • Application No.: US12939528
    • Filing Date: 2010-11-04
    • Inventors: Lijuan Wang, Frank Kao-Ping Soong
    • IPC: G10L15/00
    • CPC: G10L21/06; G10L21/10; G10L2021/105
    • Abstract: Embodiments of an audio-to-video engine are disclosed. In operation, the audio-to-video engine generates facial movement (e.g., a virtual talking head) based on an input speech. The audio-to-video engine receives the input speech and recognizes the input speech as a source feature vector. The audio-to-video engine then determines a Maximum A Posterior (MAP) mixture sequence based on the source feature vector. The MAP mixture sequence may be a function of a refined Gaussian Mixture Model (GMM). The audio-to-video engine may then use the MAP mixture sequence to estimate video feature parameters. The video feature parameters are then interpreted as facial movement. The facial movement may be stored as data in a storage module and/or displayed as video on a display device.
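A minimal sketch of the GMM-based mapping this abstract describes, under stated assumptions: for each source (audio) feature frame, the Maximum A Posterior mixture component of a joint audio-video GMM is selected, and the video feature is estimated as that component's conditional mean. The "refined" GMM and the minimum-converted-trajectory-error training criterion are not reproduced here; the dimensions and parameters are toy values invented for illustration.

```python
"""Sketch of MAP mixture selection and conditional-mean video estimation
for the audio-to-video engine of US20120116761A1, using an invented
two-component joint GMM."""

import numpy as np

A, V = 2, 2  # audio / video feature dimensions (illustrative)

# Toy joint GMM over stacked [audio; video] vectors, with 2 components.
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0, 1.0, 1.0],
                  [2.0, 2.0, -1.0, -1.0]])
covs = np.stack([np.eye(A + V) + 0.1, np.eye(A + V) + 0.2])

def log_gauss(x, mu, cov):
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.solve(cov, diff)
                   + logdet + len(x) * np.log(2 * np.pi))

def map_video_frame(audio):
    """MAP mixture selection on the audio marginal, then the conditional
    mean E[video | audio, component] as the video feature estimate."""
    log_post = [np.log(w) + log_gauss(audio, mu[:A], cov[:A, :A])
                for w, mu, cov in zip(weights, means, covs)]
    m = int(np.argmax(log_post))                 # MAP mixture component
    mu, cov = means[m], covs[m]
    gain = cov[A:, :A] @ np.linalg.inv(cov[:A, :A])
    return mu[A:] + gain @ (audio - mu[:A])      # conditional mean

# Two source feature frames in, two video feature frames out; the video
# feature parameters would then be interpreted as facial movement.
audio_track = np.array([[0.1, -0.2], [2.1, 1.8]])
video_track = np.array([map_video_frame(frame) for frame in audio_track])
print(video_track)
```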
    • 4. Invention Grant
    • Title: Speech and text driven HMM-based body animation synthesis
    • Publication No.: US08224652B2
    • Publication Date: 2012-07-17
    • Application No.: US12239564
    • Filing Date: 2008-09-26
    • Inventors: Lijuan Wang, Lei Ma, Frank Kao-Ping Soong
    • IPC: G10L21/00
    • CPC: G10L21/06; G06T13/205; G10L13/00
    • Abstract: An "Animation Synthesizer" uses trainable probabilistic models, such as Hidden Markov Models (HMMs), Artificial Neural Networks (ANNs), etc., to provide speech- and text-driven body animation synthesis. Probabilistic models are trained using synchronized motion and speech inputs (e.g., live or recorded audio/video feeds) at various speech levels, such as sentences, phrases, words, phonemes, sub-phonemes, etc., depending upon the available data and the motion type or body part being modeled. The Animation Synthesizer then uses the trainable probabilistic model to select animation trajectories for one or more different body parts (e.g., face, head, hands, arms, etc.) based on an arbitrary text and/or speech input. These animation trajectories are then used to synthesize a sequence of animations for digital avatars, cartoon characters, computer-generated anthropomorphic persons or creatures, actual motions for physical robots, etc., synchronized with a speech output corresponding to the text and/or speech input.
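A toy sketch of the synthesis flow in this abstract, under stated assumptions: a lookup table of per-phoneme motion keyframes stands in for the trained per-unit HMMs, and for each body part the trajectories of the input unit sequence are concatenated and stretched to the unit durations a TTS front end would report, keeping motion synchronized with the speech output. The phoneme labels, keyframes, and durations are invented for illustration.

```python
"""Sketch of per-body-part, speech-synchronized trajectory synthesis in
the style of US08224652B2, with a lookup table standing in for the
trained probabilistic models."""

import numpy as np

# Stand-in "trained models": mean motion keyframes per (body part, phoneme).
motion_models = {
    ("head", "B"):  np.array([0.1, 0.0, -0.1]),
    ("head", "AH"): np.array([0.0, 0.2, 0.0]),
    ("hand", "B"):  np.array([0.4, 0.1, 0.0]),
    ("hand", "AH"): np.array([0.0, 0.5, 0.3]),
}

def synthesize(body_part, phonemes, durations):
    """Concatenate per-unit trajectories, resampled to each unit's
    duration (in frames) so the animation tracks the speech timing."""
    track = []
    for phone, frames in zip(phonemes, durations):
        keyframes = motion_models[(body_part, phone)]
        t = np.linspace(0, len(keyframes) - 1, frames)
        track.append(np.interp(t, np.arange(len(keyframes)), keyframes))
    return np.concatenate(track)

# Unit sequence and frame durations as a TTS front end might report them.
phonemes, durations = ["B", "AH"], [4, 6]
for part in ("head", "hand"):
    print(part, synthesize(part, phonemes, durations))
```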
    • 6. Invention Grant
    • Title: Minimum converted trajectory error (MCTE) audio-to-video engine
    • Publication No.: US08751228B2
    • Publication Date: 2014-06-10
    • Application No.: US12939528
    • Filing Date: 2010-11-04
    • Inventors: Lijuan Wang, Frank Kao-Ping Soong
    • IPC: G10L21/06
    • CPC: G10L21/06; G10L21/10; G10L2021/105
    • Abstract: Embodiments of an audio-to-video engine are disclosed. In operation, the audio-to-video engine generates facial movement (e.g., a virtual talking head) based on an input speech. The audio-to-video engine receives the input speech and recognizes the input speech as a source feature vector. The audio-to-video engine then determines a Maximum A Posterior (MAP) mixture sequence based on the source feature vector. The MAP mixture sequence may be a function of a refined Gaussian Mixture Model (GMM). The audio-to-video engine may then use the MAP mixture sequence to estimate video feature parameters. The video feature parameters are then interpreted as facial movement. The facial movement may be stored as data in a storage module and/or displayed as video on a display device.
    • 8. Invention Application
    • Title: SPEECH AND TEXT DRIVEN HMM-BASED BODY ANIMATION SYNTHESIS
    • Publication No.: US20100082345A1
    • Publication Date: 2010-04-01
    • Application No.: US12239564
    • Filing Date: 2008-09-26
    • Inventors: Lijuan Wang, Lei Ma, Frank Kao-Ping Soong
    • IPC: G10L13/08; G10L13/00; G06T15/70
    • CPC: G10L21/06; G06T13/205; G10L13/00
    • Abstract: An "Animation Synthesizer" uses trainable probabilistic models, such as Hidden Markov Models (HMMs), Artificial Neural Networks (ANNs), etc., to provide speech- and text-driven body animation synthesis. Probabilistic models are trained using synchronized motion and speech inputs (e.g., live or recorded audio/video feeds) at various speech levels, such as sentences, phrases, words, phonemes, sub-phonemes, etc., depending upon the available data and the motion type or body part being modeled. The Animation Synthesizer then uses the trainable probabilistic model to select animation trajectories for one or more different body parts (e.g., face, head, hands, arms, etc.) based on an arbitrary text and/or speech input. These animation trajectories are then used to synthesize a sequence of animations for digital avatars, cartoon characters, computer-generated anthropomorphic persons or creatures, actual motions for physical robots, etc., synchronized with a speech output corresponding to the text and/or speech input.
    • 10. Invention Application
    • Title: Unnatural prosody detection in speech synthesis
    • Publication No.: US20090083036A1
    • Publication Date: 2009-03-26
    • Application No.: US11903020
    • Filing Date: 2007-09-20
    • Inventors: Yong Zhao, Frank Kao-ping Soong, Min Chu, Lijuan Wang
    • IPC: G10L13/08; G06F17/30
    • CPC: G10L13/10
    • Abstract: Described is a technology by which synthesized speech generated from text is evaluated against a prosody model (trained offline) to determine whether the speech will sound unnatural. If so, the speech is regenerated with modified data. The evaluation and regeneration may be iterated until the speech is deemed natural-sounding. For example, text is built into a lattice that is then searched (e.g., via Viterbi search) to find a best path. The sections (e.g., units) of data on the path are evaluated via a prosody model. If the evaluation deems a section to correspond to unnatural prosody, that section is replaced, e.g., by modifying/pruning the lattice and re-performing the search. Replacement may be iterative until all sections pass the evaluation. Unnatural prosody detection may be biased such that, during evaluation, unnatural prosody is falsely detected at a higher rate than the rate at which it is missed.
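The detect-and-regenerate loop in this abstract can be sketched as follows: pick a best path through a unit lattice, check each unit with a prosody model, prune flagged units, and re-search until the path passes. A greedy per-slot search stands in for the Viterbi lattice search, and a pitch-jump threshold stands in for the offline-trained prosody model; all unit IDs, join costs, and pitch values are invented for illustration.

```python
"""Sketch of the iterative unnatural-prosody detection and regeneration
loop of US20090083036A1, over a toy unit-selection lattice."""

# Candidate units per lattice slot: (unit_id, join_cost, pitch_hz).
lattice = [
    [("a1", 0.1, 120.0), ("a2", 0.3, 118.0)],
    [("b1", 0.2, 260.0), ("b2", 0.4, 125.0)],  # b1 makes an implausible pitch jump
    [("c1", 0.1, 122.0)],
]

def best_path(lat):
    """Greedy stand-in for the Viterbi search: cheapest unit per slot."""
    return [min(candidates, key=lambda u: u[1]) for candidates in lat]

def is_natural(prev, unit, max_jump=80.0):
    """Toy prosody model: flag large unit-to-unit pitch jumps. A low
    threshold biases toward false detections, as the abstract suggests,
    since a missed unnatural unit is costlier than a re-synthesis."""
    return prev is None or abs(unit[2] - prev[2]) <= max_jump

path = best_path(lattice)
for _ in range(10):  # bounded detect-and-regenerate iterations
    flagged = [i for i, unit in enumerate(path)
               if not is_natural(path[i - 1] if i else None, unit)]
    if not flagged:
        break  # every unit on the path passed the prosody evaluation
    for i in flagged:
        remaining = [c for c in lattice[i] if c is not path[i]]
        if remaining:            # prune the flagged unit if alternatives exist
            lattice[i] = remaining
    path = best_path(lattice)    # re-search the modified lattice

print([unit[0] for unit in path])  # -> ['a1', 'b2', 'c1']
```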