{"id":62710,"date":"2024-02-29T06:40:02","date_gmt":"2024-02-29T11:40:02","guid":{"rendered":"https:\/\/languagelog.ldc.upenn.edu\/nll\/?p=62710"},"modified":"2024-02-29T08:36:48","modified_gmt":"2024-02-29T13:36:48","slug":"emote-portrait-alive","status":"publish","type":"post","link":"https:\/\/languagelog.ldc.upenn.edu\/nll\/?p=62710","title":{"rendered":"\"Emote Portrait Alive\""},"content":{"rendered":"<p><a href=\"https:\/\/humanaigc.github.io\/emote-portrait-alive\/\" target=\"_blank\" rel=\"noopener\">EMO<\/a>, by Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo from Alibaba's Institute for Intelligent Computing, is <span style=\"color: #800000;\">\"an expressive audio-driven portrait-video generation framework. Input a single reference image and the vocal audio, e.g. talking and singing, our method can generate vocal avatar videos with expressive facial expressions, and various head poses\"<\/span>.<\/p>\n<p>As far as I know, there's no interactive demo so far, much less code &#8212; just a <a href=\"https:\/\/humanaigc.github.io\/emote-portrait-alive\/\" target=\"_blank\" rel=\"noopener\">github demo page<\/a> and an <a href=\"https:\/\/arxiv.org\/abs\/2402.17485\" target=\"_blank\" rel=\"noopener\">arXiv.org paper<\/a>.<\/p>\n<p>Their demo clips are very impressive &#8212; a <a href=\"https:\/\/twitter.com\/minchoi\/status\/1762812204884074979\" target=\"_blank\" rel=\"noopener\">series of X posts from yesterday<\/a> has gotten 1.1M views already. Here's <a href=\"http:\/\/languagelog.ldc.upenn.edu\/myl\/AlibabaSong1.mp4\" target=\"_blank\" rel=\"noopener\">Leonardo DiCaprio artificially lip-syncing Eminem<\/a>:<\/p>\n<p><iframe loading=\"lazy\" title=\"AlibabaSong1\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/4znFrPcnRnY?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><br \/>\n<!--more--><\/p>\n<p>(There are plenty more <a href=\"https:\/\/humanaigc.github.io\/emote-portrait-alive\/\" target=\"_blank\" rel=\"noopener\">where that came from<\/a> &#8212; I uploaded that sample to YouTube because I was unable to persuade the html &lt;video&gt; element to modify the display width without distortion.)<\/p>\n<p>I'm always skeptical of hand-selected synthesis demos, since we have no way of knowing how many problematic attempts were discarded, or what sorts of side-information might have been provided. Still, their examples are impressive. The <a href=\"https:\/\/arxiv.org\/abs\/2402.17485\" target=\"_blank\" rel=\"noopener\">arXiv.org paper<\/a> explains:<\/p>\n<p style=\"padding-left: 40px;\"><span style=\"color: #800000;\">To train our model, we constructed a vast and diverse audio-video dataset, amassing over 250 hours of footage and more than 150 million images. <\/span><span style=\"color: #800000;\">This expansive dataset encompasses a wide range of content, including speeches, film and television clips, and singing performances, and covers multiple languages such as Chinese and English. The rich variety of speaking and singing videos ensures that our training material captures a broad spectrum of human expressions and vocal styles, providing a solid foundation for the development of EMO. 
> We conducted extensive experiments and comparisons on the HDTF dataset, where our approach surpassed current state-of-the-art (SOTA) methods, including DreamTalk, Wav2Lip, and SadTalker, across multiple metrics such as FID, SyncNet, F-SIM, and FVD. In addition to quantitative assessments, we also carried out a comprehensive user study and qualitative evaluations, which revealed that our method is capable of generating highly natural and expressive talking and even singing videos, achieving the best results to date.

…and provides this [graphical overview](https://languagelog.ldc.upenn.edu/myl/AlibabaEMOflow.png).

Here's [AI Mona Lisa](https://www.youtube.com/embed/KzEkyMJLUfo) lip-syncing [Miley Cyrus covered by Yuqi](https://www.youtube.com/watch?v=0X4XGrK_PwM). Because the syllabic pace is slower, it's easier to visually evaluate the (generally excellent) phonetic lip/face synchrony.

**Update:** I was a bit puzzled by their account of the training set size: "over 250 hours of footage and more than 150 million images". If the images came from the standard digital video frame rate of 30 fps, 250 hours would yield `250*60*60*30` = 27 million frames, and the standard movie rate of 24 fps would yield an even lower total. So did they actually use more "footage"? Or did they do some interpolation? Or did they multiply wrong?
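To make the arithmetic concrete, here is a minimal Python sketch; the only inputs are the paper's own figures, plus the standard 30 fps and 24 fps frame rates:

```python
# Frame counts implied by "over 250 hours of footage" at standard
# frame rates, versus the claimed "more than 150 million images".
HOURS = 250
CLAIMED_IMAGES = 150_000_000

for fps in (30, 24):
    frames = HOURS * 60 * 60 * fps
    print(f"{HOURS} h at {fps} fps = {frames:,} frames")
# 250 h at 30 fps = 27,000,000 frames
# 250 h at 24 fps = 21,600,000 frames

# Hours of 30 fps video that 150 million frames would require:
print(f"{CLAIMED_IMAGES / (30 * 60 * 60):,.0f} hours")  # ~1,389 hours
```

Either way, 250 hours of video alone supplies only a fifth to a seventh of the claimed image count.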
The paper says in another section that

> We collected approximately 250 hours of talking head videos from the internet and supplemented this with the HDTF [34] and VFHQ [31] datasets to train our models. As the VFHQ dataset lacks audio, it is only used in the first training stage.

That would explain the difference, but it means that the training was actually based on a lot more than "250 hours of footage".

**Update #2:** There are quite a few recent experiments in similar directions, mostly [following up on Wav2Lip](https://scholar.google.com/scholar?cites=14573981281897840469&as_sdt=5,39&sciodt=0,39&hl=en) (K.R. Prajwal et al., "[A lip sync expert is all you need for speech to lip generation in the wild](https://dl.acm.org/doi/pdf/10.1145/3394171.3413532)", 2020), whose authors provided [an interactive demo](https://bhaasha.iiit.ac.in/lipsync/) and [source code](https://github.com/Rudrabha/Wav2Lip).

For example, some [recent work at NVidia](https://openaccess.thecvf.com/content/ICCV2023/html/Gururani_SPACE_Speech-driven_Portrait_Animation_with_Controllable_Expression_ICCV_2023_paper.html) provides [an interesting set of videos](https://research.nvidia.com/labs/dir/space/) comparing their system's outputs to those of several others.
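For readers who want to try the Wav2Lip baseline themselves, the [source code](https://github.com/Rudrabha/Wav2Lip) linked above includes an inference script. Here is a minimal sketch of driving it from Python; the file paths are placeholders, and the checkpoint filename is my assumption about the repo's released pretrained weights, so check its README for the exact arguments:

```python
# Minimal sketch: run Wav2Lip inference from a clone of
# https://github.com/Rudrabha/Wav2Lip. Paths and the checkpoint
# filename are placeholder assumptions; see the repo README for
# the released weights and full argument list.
import subprocess

subprocess.run(
    [
        "python", "inference.py",
        "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # assumed filename
        "--face", "input_face.mp4",      # video (or still image) of the target face
        "--audio", "driving_audio.wav",  # the speech or singing to lip-sync to
    ],
    cwd="Wav2Lip",  # the cloned repository directory
    check=True,     # raise if the script exits with an error
)
# By default the script writes its output under results/ in the repo.
```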
ref":"https:\/\/api.w.org\/{rel}","templated":true}]}}