
[ECCV 2024]UMBRAE: Unified Multimodal Brain Decoding

Paper: 01133.pdf

Code: GitHub - weihaox/UMBRAE: [ECCV 2024] UMBRAE: Unified Multimodal Brain Decoding | Unveiling the 'Dark Side' of Brain Modality

The English here is typed entirely by hand, summarizing and paraphrasing the original paper, so some unavoidable spelling and grammar mistakes may appear; corrections in the comments are welcome. This post leans toward personal notes, so read with that in mind.

Contents

1. Thoughts

2. Section-by-Section Reading of the Paper

2.1. Abstract

2.2. Introduction

2.3. Related Works

2.4. UMBRAE

2.4.1. Architecture

2.4.2. Cross-Subject Alignment

2.4.3. Multimodal Alignment

2.4.4. Brain Prompting Interface

2.5. Experiments

2.5.1. Implementation Details

2.5.2. BrainHub

2.5.3. Brain Captioning

2.5.4. Brain Grounding

2.5.5. Brain Retrieval

2.5.6. Visual Decoding

2.5.7. Weakly-Supervised Adaptation

2.6. Ablation Study

2.6.1. Architectural Improvements

2.6.2. Training Strategies

2.7. Conclusion

1. Thoughts

(1) Hmm.

2. Section-by-Section Reading of the Paper

2.1. Abstract

        ①Challenges addressed: recovering spatial information from brain signals, and cross-subject decoding

 granularity  n. level of detail; (geology) grain size

2.2. Introduction

        ①Target users of brain signal decoding: people with cognitive or physical disabilities, or even locked-in patients

        ②⭐Challenges: a) single-modality decoding loses brain information; b) text-based decoding ignores spatial information

2.3. Related Works

        ①Reviews generative models, LLM-based models, and alignment models

2.4. UMBRAE

        ①UMBRAE denotes unified multimodal brain decoding

        ②Overall framework of UMBRAE:

(huh, this looks just like OneLLM at first glance)

2.4.1. Architecture

        ①Brain encoder: lightweight Transformer

        ②The brain signal s\in\mathbb{R}^{1\times L_{s}} of each person comes from the subject set \mathcal{S}_{\Omega}, where the length L_s may vary across subjects

        ③A subject-specific tokenizer transforms s\in\mathbb{R}^{1\times L_{s}} into \mathbf{s}_{k}\in\mathbb{R}^{M\times D}, i.e., M tokens of dimension D (presumably the small green squares in the figure?)

        ④Brain tokens \mathbf{x}\in\mathbb{R}^{L\times D} (what exactly are these? "brain tokens" appear everywhere in the text but not in the figure, which only labels subject tokens; from the text they seem to be the small purple squares)

        ⑤Universal Perceive Encoder: a cross-attention module (a structural sketch follows the vocabulary note below)

prepend  v. to add or attach to the front
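To make the pipeline in ①–⑤ concrete, here is a minimal PyTorch sketch of that structure: a subject-specific tokenizer followed by a shared cross-attention encoder with learned queries. All module names, sizes, and the single-layer design here are my own illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class SubjectTokenizer(nn.Module):
    """Maps a flat brain signal of subject-specific length L_s to M tokens of dim D."""
    def __init__(self, signal_len: int, num_tokens: int = 256, dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(signal_len, num_tokens * dim)
        self.num_tokens, self.dim = num_tokens, dim

    def forward(self, s: torch.Tensor) -> torch.Tensor:  # s: (B, L_s)
        return self.proj(s).view(s.size(0), self.num_tokens, self.dim)

class UniversalPerceiveEncoder(nn.Module):
    """Shared across subjects: learned queries cross-attend to the subject tokens."""
    def __init__(self, num_queries: int = 256, dim: int = 1024, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, M, D)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.cross_attn(q, tokens, tokens)  # queries attend to brain tokens
        return out  # (B, num_queries, D)
```

Because only the tokenizer depends on the subject, a new subject in principle needs just a new tokenizer while the perceive encoder stays fixed.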

2.4.2. Cross-Subject Alignment

        ①Each subject k is sampled with probability p_k proportional to the size of its data:

p_k=\frac{\|\mathcal{S}_k\|}{\sum_{n=1}^K\|\mathcal{S}_n\|}

In other words, if the batch size is B, a subject S_{k} is picked with probability p_k and \theta B samples are drawn from S_{k}; the remaining \left ( 1-\theta \right ) B samples are sampled uniformly from the other subjects.
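A small Python sketch of this sampling rule, assuming each subject's data is simply a list of samples (a hypothetical layout); sample_batch is my own helper name.

```python
import random

def sample_batch(subjects: dict, batch_size: int, theta: float = 0.5):
    """subjects: {subject_id: [samples]}; p_k is proportional to dataset size."""
    sizes = {k: len(v) for k, v in subjects.items()}
    total = sum(sizes.values())
    # pick the "main" subject k with probability p_k = |S_k| / sum_n |S_n|
    main = random.choices(list(subjects), weights=[sizes[k] / total for k in subjects])[0]
    n_main = int(theta * batch_size)
    batch = random.choices(subjects[main], k=n_main)           # theta * B from S_k
    # the remaining (1 - theta) * B samples come uniformly from the other subjects
    others = [x for k, v in subjects.items() if k != main for x in v]
    batch += random.choices(others, k=batch_size - n_main)
    return main, batch
```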

2.4.3. Multimodal Alignment

        ①Instead of mapping all modalities into a shared space, they align the brain features element-by-element to pretrained image features

        ②To align the brain response s\in\mathbb{R}^{1\times L_{s}} with the image v\in\mathbb{R}^{W\times H\times C}, they minimize the discrepancy between the outputs of the brain encoder \mathcal{B} and the image encoder \mathcal{V}:

\mathcal{L}_{\mathrm{rec}}=\mathbb{E}_{\mathbf{b}\sim\mathbf{B},\mathbf{v}\sim\mathbf{V}}[\|\mathcal{V}(v)-\mathcal{B}(b)\|_2^2]
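A minimal sketch of this objective, assuming the image encoder \mathcal{V} is pretrained and kept frozen while the brain encoder \mathcal{B} is trained; the encoders themselves are stand-ins, only the loss mirrors the formula.

```python
import torch
import torch.nn.functional as F

def alignment_loss(brain_encoder, image_encoder, s, v):
    """s: brain signals (B, L_s); v: images (B, C, H, W)."""
    with torch.no_grad():                  # V is pretrained and frozen
        target = image_encoder(v)          # (B, 256, D) image patch features
    pred = brain_encoder(s)                # (B, 256, D) predicted brain tokens
    return F.mse_loss(pred, target)        # mean squared error, as in L_rec
```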

2.4.4. Brain Prompting Interface

        ①Prompt template of the MLLM:

For brain captioning, <instruction> is defined as: 'Describe this image <image> as simply as possible.' For brain grounding, <instruction> is defined as: 'Locate <expr> in <image> and provide its coordinates, please.', where <expr> is the referring expression.
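A tiny sketch of how these two templates could be assembled in code; build_prompt is a hypothetical helper, while the prompt strings follow the paper's templates verbatim.

```python
def build_prompt(task: str, expr: str = "<expr>") -> str:
    """Return the <instruction> string for the given brain decoding task."""
    if task == "captioning":
        return "Describe this image <image> as simply as possible."
    if task == "grounding":
        return f"Locate {expr} in <image> and provide its coordinates, please."
    raise ValueError(f"unknown task: {task}")

# The <image> slot is later filled with the predicted brain tokens.
print(build_prompt("grounding", expr="the red bus"))
```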

2.5. Experiments

2.5.1. Implementation Details

        ①Visual encoder: CLIP ViT-L/14

        ②LLM: Vicuna-7B/13B

        ③Image feature: \mathbf{T}\in\mathbb{R}^{16\times16\times1024}, taken from the second-to-last layer of the transformer encoder and converted to \mathbf{T}^{\prime}\in\mathbb{R}^{256\times D}, where D=4,096 for Vicuna-7B and D=5,120 for Vicuna-13B

        ④Epoch: 240 

        ⑤Batch size: 256

        ⑥Training time: about 12 hours on one A100 GPU

        ⑦Optimizer: AdamW with \beta _1=0.9, \beta _2=0.95, weight decay of 0.01, learning rate of 3e-4

        ⑧\theta =0.5, meaning that in each batch of 256 samples, 128 come from the sampled subject and the remaining 128 are drawn uniformly from the other subjects (see the sketch below)
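For reference, a sketch that wires up the listed hyperparameters in PyTorch; the projection mirrors the feature reshaping in ③, and the model here is only a placeholder for the brain encoder.

```python
import torch

D = 4096                                   # Vicuna-7B width (5120 for 13B)
proj = torch.nn.Linear(1024, D)
T = torch.randn(16, 16, 1024)              # second-to-last-layer CLIP feature
T_prime = proj(T.reshape(256, 1024))       # (256, D), fed to the LLM

model = torch.nn.Linear(1024, 1024)        # placeholder for the brain encoder
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.01,
)
```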

2.5.2. BrainHub

        ①Dataset: NSD

        ②Tasks: brain captioning, brain grounding, brain retrieval, visual decoding

2.5.3. Brain Captioning

        ①Brain captioning performance comparison table:

where the suffix -S1 denotes a model trained only on subject 01

2.5.4. Brain Grounding

        ①Example of brain captioning and brain grounding tasks:

        ②Performance of brain grounding:

2.5.5. Brain Retrieval

        ①Forward retrieval, backward retrieval, and exemplar retrieval performance:

where the model must match a brain embedding to the correct image embedding (and vice versa), or retrieve the exemplar image itself
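A minimal sketch of how forward retrieval could be scored under a common cosine-similarity protocol (the paper's exact candidate-set size is not reproduced here); forward_retrieval_acc is my own helper name.

```python
import torch
import torch.nn.functional as F

def forward_retrieval_acc(brain_emb: torch.Tensor, image_emb: torch.Tensor) -> float:
    """brain_emb, image_emb: (N, D); row i of each belongs to the same stimulus."""
    sims = F.normalize(brain_emb, dim=-1) @ F.normalize(image_emb, dim=-1).T  # (N, N)
    pred = sims.argmax(dim=-1)             # best-matching image per brain embedding
    return (pred == torch.arange(len(sims))).float().mean().item()
```

Backward retrieval is the transpose of the same score matrix: argmax over brain embeddings for each image.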

2.5.6. Visual Decoding

        ①Image reconstruction:

2.5.7. Weakly-Supervised Adaptation

        ①Performance with different training data on S7:

2.6. Ablation Study

2.6.1. Architectural Improvements

        ①UMBRAE has fewer parameters

2.6.2. Training Strategies

        ①Module ablation:

2.7. Conclusion

        ~

