Shakker-Labs proposes RepText, a method that improves FLUX's ability to render text

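We present RepText, which aims to give pre-trained monolingual text-to-image generation models the ability to accurately render (or, more precisely, replicate) multilingual visual text in user-specified fonts, without needing to actually understand the languages involved. Specifically, we adopt the ControlNet setting and additionally feed in language-agnostic glyphs and the positions of the rendered text, which produces harmonized visual text and lets users customize the text content, font, and position as needed. To improve accuracy, a text perceptual loss is used alongside the diffusion loss. To stabilize rendering, at inference time we initialize directly from noisy glyph latents instead of random noise, and we use region masks to restrict feature injection to the text regions only, which avoids distorting the background. Extensive experiments show that RepText compares favorably with existing work: it outperforms existing open-source methods and achieves results comparable to natively multilingual closed-source models.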
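The two inference-time stabilizers described above can be illustrated with a small NumPy sketch. This is not the RepText implementation: the linear blend in `init_from_glyph`, the `glyph_weight` parameter, and both function names are assumptions made purely for illustration, and the arrays are random stand-ins for real latents and ControlNet features.

```python
import numpy as np


def init_from_glyph(glyph_latents, noise, glyph_weight=0.1):
    """Start denoising from a noisy copy of the glyph latents.

    The linear blend and `glyph_weight` are assumptions of this sketch,
    not values taken from the paper or the released code.
    """
    return glyph_weight * glyph_latents + (1.0 - glyph_weight) * noise


def masked_injection(hidden, control_residual, text_mask):
    """Add the ControlNet residual only where text_mask is 1."""
    # Broadcast the (H, W) mask over the channel axis.
    return hidden + control_residual * text_mask[..., None]


rng = np.random.default_rng(0)
height, width, channels = 8, 8, 4

# Binary text-region mask: 1 inside a toy text box, 0 elsewhere.
text_mask = np.zeros((height, width), dtype=np.float32)
text_mask[2:5, 1:7] = 1.0

# Random stand-ins for encoded glyphs, starting noise, and features.
glyph_latents = rng.standard_normal((height, width, channels)).astype(np.float32)
noise = rng.standard_normal((height, width, channels)).astype(np.float32)
hidden = rng.standard_normal((height, width, channels)).astype(np.float32)
control_residual = rng.standard_normal((height, width, channels)).astype(np.float32)

init_latents = init_from_glyph(glyph_latents, noise)
injected = masked_injection(hidden, control_residual, text_mask)

# Features outside the text box are left untouched by the injection.
background = text_mask == 0
print(np.array_equal(injected[background], hidden[background]))
```

The point of the mask is visible in the last check: background features pass through unchanged, so only the text region is steered by the control signal.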


Code

import torch
from controlnet_flux import FluxControlNetModel
from pipeline_flux_controlnet import FluxControlNetPipeline
from PIL import Image, ImageDraw, ImageFont
import numpy as np
import cv2
import re
import os


def contains_chinese(text):
    # True if the text contains at least one CJK unified ideograph
    return re.search(r'[\u4e00-\u9fff]', text) is not None


def canny(img):
    # edge map of the glyph image, inverted to black edges on white, 3 channels
    low_threshold = 50
    high_threshold = 100
    img = cv2.Canny(img, low_threshold, high_threshold)
    img = img[:, :, None]
    img = 255 - np.concatenate([img, img, img], axis=2)
    return img


base_model = "black-forest-labs/FLUX.1-dev"
controlnet_model = "Shakker-Labs/RepText"

controlnet = FluxControlNetModel.from_pretrained(controlnet_model, torch_dtype=torch.bfloat16)
pipe = FluxControlNetPipeline.from_pretrained(
    base_model, controlnet=controlnet, torch_dtype=torch.bfloat16
).to("cuda")

## set resolution
width, height = 1024, 1024

## set font
font_path = "./assets/Arial_Unicode.ttf"  # use your own font
font_size = 80  # it is recommended to use a font size >= 60
font = ImageFont.truetype(font_path, font_size)

## set text content, position, color
text_list = ["哩布哩布"]
text_position_list = [(370, 200)]
text_color_list = [(255, 255, 255)]

## set controlnet conditions
control_image_list = []  # canny list
control_position_list = []  # position list
control_mask_list = []  # regional mask list
control_glyph_all = np.zeros([height, width, 3], dtype=np.uint8)  # all glyphs

## handle each line of text
for text, text_position, text_color in zip(text_list, text_position_list, text_color_list):

    ### glyph image, render text to black background
    control_image_glyph = Image.new("RGB", (width, height), (0, 0, 0))
    draw = ImageDraw.Draw(control_image_glyph)
    draw.text(text_position, text, font=font, fill=text_color)

    ### get bbox
    bbox = draw.textbbox(text_position, text, font=font)

    ### position condition
    control_position = np.zeros([height, width], dtype=np.uint8)
    control_position[bbox[1]:bbox[3], bbox[0]:bbox[2]] = 255
    control_position = Image.fromarray(control_position.astype(np.uint8))
    control_position_list.append(control_position)

    ### regional mask
    control_mask_np = np.zeros([height, width], dtype=np.uint8)
    control_mask_np[bbox[1]-5:bbox[3]+5, bbox[0]-5:bbox[2]+5] = 255
    control_mask = Image.fromarray(control_mask_np.astype(np.uint8))
    control_mask_list.append(control_mask)

    ### accumulate glyph
    control_glyph = np.array(control_image_glyph)
    control_glyph_all += control_glyph

    ### canny condition
    control_image = canny(cv2.cvtColor(np.array(control_image_glyph), cv2.COLOR_RGB2BGR))
    control_image = Image.fromarray(cv2.cvtColor(control_image, cv2.COLOR_BGR2RGB))
    control_image_list.append(control_image)

control_glyph_all = Image.fromarray(control_glyph_all.astype(np.uint8))
control_glyph_all = control_glyph_all.convert("RGB")
# control_glyph_all.save("./results/control_glyph.jpg")

# it is recommended to use words such as 'sign', 'billboard', 'banner' in your prompt
# for English text, it helps if you add the text to the prompt
prompt = "a street sign in city"
for text in text_list:
    if not contains_chinese(text):
        prompt += f", '{text}'"
prompt += ", filmfotos, film grain, reversal film photography"  # optional
print(prompt)

generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt,
    control_image=control_image_list,  # canny
    control_position=control_position_list,  # position
    control_mask=control_mask_list,  # regional mask
    control_glyph=control_glyph_all,  # as init latent, optional, set to None if not used
    controlnet_conditioning_scale=1.0,
    controlnet_conditioning_step=30,
    width=width,
    height=height,
    num_inference_steps=30,
    guidance_scale=3.5,
    generator=generator,
).images[0]

# create the output directory if it does not exist yet
os.makedirs("./results", exist_ok=True)
image.save("./results/result.jpg")
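The script pads each bounding box by 5 pixels when building the regional mask. If text is placed within 5 pixels of the top or left edge, the slice start becomes negative, and NumPy interprets a negative start as counting from the end of the axis, so the mask would come out empty or wrong. A minimal sketch of a guard follows; `clamp_padded_bbox` is a hypothetical helper and not part of the RepText repository, and the example boxes are made-up values.

```python
import math


def clamp_padded_bbox(bbox, pad, width, height):
    """Pad a (left, top, right, bottom) box and clip it to the image bounds.

    Hypothetical helper for illustration only. Coordinates are rounded
    outward first in case the box contains floats.
    """
    left, top, right, bottom = bbox
    left = max(0, math.floor(left) - pad)
    top = max(0, math.floor(top) - pad)
    right = min(width, math.ceil(right) + pad)
    bottom = min(height, math.ceil(bottom) + pad)
    return left, top, right, bottom


# A box well inside a 1024x1024 canvas is simply padded.
print(clamp_padded_bbox((370, 200, 690, 290), 5, 1024, 1024))

# A box touching the edges is clipped instead of wrapping around.
print(clamp_padded_bbox((2, 3, 400, 1022), 5, 1024, 1024))
```

In the loop, the result could replace the hand-written slice bounds, for example `l, t, r, b = clamp_padded_bbox(bbox, 5, width, height)` followed by `control_mask_np[t:b, l:r] = 255`.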