MaxText is a simple, high-performance, and scalable open-source JAX LLM!

2025/7/3 20:07:53

1. Software Introduction

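MaxText is a high-performance, highly scalable, open-source LLM written in pure Python/JAX and targeting Google Cloud TPUs and GPUs for training and inference. Thanks to the power of JAX and the XLA compiler, MaxText achieves high MFU and scales from a single host to very large clusters while staying simple and "optimization-free".

MaxText aims to be a starting point for ambitious LLM projects in both research and production. We encourage users to start by experimenting with MaxText out of the box, and then fork and modify MaxText to meet their needs.

We have used MaxText to demonstrate high-performance, well-converging training in int8 and to scale training to ~51K chips.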

Key supported features:

TPUs and GPUs
Training and Inference
Models: Llama 2, Llama 3, Llama 4, Mistral and Mixtral family, Gemma, Gemma 2, Gemma 3, and DeepSeek family

2. Getting Started

For your first time running MaxText, we provide specific instructions.

MaxText supports training and inference of various open models. Follow the user guides in the getting started folder to learn more.

Some extra helpful guides:

Gemma (generations 1-3): a family of open-weights Large Language Models (LLMs) by Google DeepMind, based on Gemini research and technology. You can run decoding and fine-tuning using these instructions. For Gemma 2 and 3, use the corresponding gemma2 and gemma3 scripts for checkpoint conversion and decoding.
Llama2: a family of open-weights Large Language Models (LLMs) by Meta. You can run decoding and fine-tuning using these instructions.
Mixtral: a family of open-weights sparse mixture-of-experts (MoE) models by Mistral AI. You can run decoding and fine-tuning using these instructions.
DeepSeek: a novel family of open-weights sparse MoE models by DeepSeek AI. DeepSeek-V3 features advanced techniques, including Multi-Head Latent Attention (MLA), finer-grained and shared experts, Multi-Token Prediction (MTP), and FP8 mixed precision, designed for enhanced efficiency and performance. You can run pre-training, fine-tuning, and decoding using these instructions.

Beyond the getting started guides, new MaxText capabilities are constantly being added. The full suite of end-to-end tests is in end_to_end. We run them on a nightly cadence, and they can be a good source for understanding MaxText. Alternatively, you can look at the unit tests, which are run almost continuously.

3. Runtime Performance Results

More details on reproducing these results can be found in MaxText/configs/README.md.

TPU v5p

No. of params	Accelerator Type	TFLOP/chip/sec	Model flops utilization (MFU)
32B	v5p-128	3.28e+02	67.76%
64B	v5p-128	3.23e+02	70.31%
128B	v5p-256	3.15e+02	68.68%
128B	v5p-512	3.15e+02	68.53%
256B	v5p-1024	3.16e+02	68.82%
512B	v5p-1024	2.94e+02	63.99%
1024B	v5p-2048	2.49e+02	64.05%
1024B	v5p-4096	2.97e+02	64.80%
1160B	v5p-7680	2.95e+02	64.27%
1160B	v5p-12288	3.04e+02	66.23%

TPU v5e

For 16B, 32B, 64B, and 128B models. See full run configs in MaxText/configs/v5e/ as 16b.sh, 32b.sh, 64b.sh, 128b.sh.

Hardware	16B TFLOP/sec/chip	16B MFU	32B TFLOP/sec/chip	32B MFU	64B TFLOP/sec/chip	64B MFU	128B TFLOP/sec/chip	128B MFU
1x v5e-256	120	61.10%	132	66.86%	118	59.90%	110	56.06%
2x v5e-256	117	59.37%	128	64.81%	112	56.66%	110	55.82%
4x v5e-256	117	59.14%	126	64.10%	110	55.85%	108	54.93%
8x v5e-256	115	58.27%	125	63.67%	108	54.96%	104	52.93%
16x v5e-256	111	56.56%	123	62.26%	105	53.29%	100	50.86%
32x v5e-256	108	54.65%	119	60.40%	99	50.18%	91	46.25%
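
The MFU columns above are the ratio of achieved to peak per-chip throughput. Here is a minimal sketch of that arithmetic, assuming a peak of 197 bf16 TFLOP/s per v5e chip. That peak figure is an assumption taken from public TPU v5e specifications, not from this document, and the function name `mfu` is hypothetical. Because the TFLOP/sec/chip values in the table are rounded, the result only approximately matches the listed percentages.

```python
# Assumed peak bf16 throughput of one TPU v5e chip, in TFLOP/s.
# This figure is an assumption; it does not appear in the tables above.
V5E_PEAK_TFLOPS = 197.0


def mfu(achieved_tflops_per_chip: float, peak_tflops_per_chip: float) -> float:
    """Return model FLOPs utilization as a fraction of peak throughput."""
    if peak_tflops_per_chip <= 0:
        raise ValueError("peak throughput must be positive")
    return achieved_tflops_per_chip / peak_tflops_per_chip


if __name__ == "__main__":
    # 16B model on 1x v5e-256: the table lists 120 TFLOP/sec/chip and 61.10% MFU.
    print(f"{mfu(120, V5E_PEAK_TFLOPS):.2%}")
```

With the rounded value of 120 TFLOP/sec/chip this gives roughly 60.9%, close to the 61.10% reported in the table.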

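4. Comparison to Alternatives

MaxText is heavily inspired by MinGPT/NanoGPT, elegant standalone GPT implementations written in PyTorch and targeting Nvidia GPUs. MaxText is more complex, supporting more industry-standard models and scaling to tens of thousands of chips. Ultimately, MaxText has an MFU more than three times the 17% most recently reported with that codebase, is massively scalable, and implements a key-value cache for efficient auto-regressive decoding.

MaxText is more similar to Nvidia/Megatron-LM, a very well-tuned LLM implementation targeting Nvidia GPUs. The two implementations achieve comparable MFUs. The differences between the codebases highlight their different programming strategies. MaxText is pure Python, relying heavily on the XLA compiler to achieve high performance. By contrast, Megatron-LM is a mix of Python and CUDA, relying on well-optimized CUDA kernels to achieve high performance.

MaxText is also comparable to Pax. Like Pax, MaxText provides high-performance and scalable implementations of LLMs in JAX. Pax focuses on enabling powerful configuration parameters, letting developers change the model by editing config parameters. By contrast, MaxText is a simple, concrete implementation of various LLMs that encourages users to extend it by forking and directly editing the source code.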

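5. Features and Diagnostics

When running a Single Program, Multiple Data (SPMD) job on accelerators, the overall process can hang if there is any error or if any VM hangs or crashes for some reason. In this scenario, capturing stack traces helps to identify and troubleshoot issues with jobs running on TPU VMs.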

The following configurations help to debug a fault, or a program that is stuck or hung somewhere, by collecting stack traces. Change the parameter values accordingly in MaxText/configs/base.yml:

Set collect_stack_trace: True to enable collection of stack traces on faults or when the program is hung. This setting periodically dumps the traces for the program to help in debugging. To disable this, set collect_stack_trace: False.
Set stack_trace_to_cloud: False to display stack traces on the console. stack_trace_to_cloud: True creates a temporary file in /tmp/debugging on the TPUs to store the stack traces. An agent running on the TPU VMs periodically uploads the traces from the temporary directory to Cloud Logging in the GCP project. You can view the traces in Logs Explorer on Cloud Logging using the following query:

```
logName="projects/<project_name>/logs/tpu.googleapis.com%2Fruntime_monitor"
jsonPayload.verb="stacktraceanalyzer"
```
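
If you need this filter for several projects, it can be generated rather than edited by hand. The sketch below only fills in the `<project_name>` placeholder of the query above; the helper name `stack_trace_log_filter` and the project ID `my-project` are hypothetical.

```python
def stack_trace_log_filter(project_name: str) -> str:
    """Build the Logs Explorer filter shown above for one GCP project."""
    if not project_name:
        raise ValueError("project_name must be non-empty")
    log_name = f"projects/{project_name}/logs/tpu.googleapis.com%2Fruntime_monitor"
    # Logs Explorer treats separate lines of a filter as AND-ed conditions.
    return "\n".join([
        f'logName="{log_name}"',
        'jsonPayload.verb="stacktraceanalyzer"',
    ])


if __name__ == "__main__":
    # "my-project" is a placeholder project ID.
    print(stack_trace_log_filter("my-project"))
```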

stack_trace_interval_seconds is the time in seconds between stack trace collection events. Setting stack_trace_interval_seconds: 600 collects the stack traces every 600 seconds (10 minutes).
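
As a rough local analogy for periodic collection, Python's standard `faulthandler` module can dump the stack of every thread at a fixed interval. This is not a description of how the diagnostics package works internally, which this document does not cover, and the helper names `start_periodic_stack_dump` and `stop_periodic_stack_dump` are hypothetical.

```python
import faulthandler
import sys


def start_periodic_stack_dump(interval_seconds: float, file=sys.stderr) -> None:
    """Dump all thread stacks to `file` every `interval_seconds` until cancelled."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # repeat=True re-arms the timer after each dump; exit=False keeps the program running.
    faulthandler.dump_traceback_later(
        interval_seconds, repeat=True, file=file, exit=False
    )


def stop_periodic_stack_dump() -> None:
    """Cancel the pending periodic dump, if any."""
    faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Mirrors stack_trace_interval_seconds: 600 (one dump every 10 minutes).
    start_periodic_stack_dump(600)
    try:
        pass  # long-running work would go here
    finally:
        stop_periodic_stack_dump()
```

The `file` argument must be a real file with a file descriptor, since `faulthandler` writes to it directly.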

Here is the related PyPI package: https://pypi.org/project/cloud-tpu-diagnostics.

Ahead of Time Compilation (AOT)

To compile your training run ahead of time, we provide a tool, train_compile.py. This tool allows you to compile the main train_step in train.py for target hardware (e.g. a large number of v5e devices) without using the full cluster.

TPU Support

You may use only a CPU or a single VM from a different family to pre-compile for a TPU cluster. This compilation helps with two main goals:

It flags out-of-memory (OOM) conditions, such as when per_device_batch_size is set too high, with the same OOM stack trace you would get if compiling on the target hardware.
The ahead-of-time compilation can be saved and then loaded for fast startup and restart times on the target hardware.

The tool train_compile.py is tightly linked to train.py and uses the same configuration file configs/base.yml. Although you don't need to run on a TPU, you do need to install jax[tpu] in addition to the other dependencies, so we recommend running setup.sh to install these if you have not already done so.

Example AOT 1: Compile ahead of time basics

After installing the dependencies listed above, you are ready to compile ahead of time:

```shell
# Run the below on a single machine, e.g. a CPU
python3 -m MaxText.train_compile MaxText/configs/base.yml compile_topology=v5e-256 compile_topology_num_slices=2 \
global_parameter_scale=16 per_device_batch_size=4
```

This will compile a 16B parameter MaxText model on 2 v5e pods.
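
To make the scale of this example concrete, here is a small sketch of the device arithmetic. It assumes that the number in a v5e topology name such as `v5e-256` is the chip count per slice, and that the global batch is `per_device_batch_size` times the total number of devices. Both are assumptions rather than statements from this document, and the helper names are hypothetical.

```python
def total_v5e_chips(compile_topology: str, num_slices: int) -> int:
    """Chips targeted by a v5e topology string such as 'v5e-256' across slices."""
    family, _, chips = compile_topology.partition("-")
    if family != "v5e" or not chips.isdigit():
        raise ValueError(f"expected a topology like 'v5e-256', got {compile_topology!r}")
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    return int(chips) * num_slices


def global_batch_size(per_device_batch_size: int, num_devices: int) -> int:
    """Global batch under the assumption batch = per-device batch x devices."""
    return per_device_batch_size * num_devices


if __name__ == "__main__":
    # Values from the command above: compile_topology=v5e-256,
    # compile_topology_num_slices=2, per_device_batch_size=4.
    chips = total_v5e_chips("v5e-256", 2)
    print(chips, global_batch_size(4, chips))
```

Under these assumptions the command above targets 512 chips with a global batch of 2048 sequences, all compiled from a single machine.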

Example AOT 2: Save compiled function, then load and run it

Here is an example that saves and then loads the compiled train_step, starting with the save:

Step 1: Run AOT and save compiled function

```shell
# Run the below on a single machine, e.g. a CPU
export LIBTPU_INIT_ARGS="--xla_enable_async_all_gather=true"
python3 -m MaxText.train_compile MaxText/configs/base.yml compile_topology=v5e-256 \
compile_topology_num_slices=2 \
compiled_trainstep_file=my_compiled_train.pickle global_parameter_scale=16 \
per_device_batch_size=4 steps=10000 learning_rate=1e-3
```

Step 2: Run train.py and load the compiled function

To load the compiled train_step, you just need to pass compiled_trainstep_file=my_compiled_train.pickle into train.py:

```shell
# Run the below on each host of the target hardware, e.g. each host on 2 slices of v5e-256
export LIBTPU_INIT_ARGS="--xla_enable_async_all_gather=true"
python3 -m MaxText.train MaxText/configs/base.yml run_name=example_load_compile \
compiled_trainstep_file=my_compiled_train.pickle \
global_parameter_scale=16 per_device_batch_size=4 steps=10000 learning_rate=1e-3 \
base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket
```

In the save step of Example 2 above, we exported the compiler flag LIBTPU_INIT_ARGS and set learning_rate because both affect the compiled object my_compiled_train.pickle. The model sizes (e.g. global_parameter_scale, max_sequence_length and per_device_batch_size) are fixed when you initially compile via train_compile.py; if you try to run the saved compiled object with different sizes than you compiled with, you will get a clear error message reporting that the compiled signature expects different shapes than what was input. A subtler point is that the learning rate schedule, which is determined by both steps and learning_rate, is also fixed when you run train_compile.py. Optimizer parameters such as adam_b1 are passed to the compiler only as shaped objects, so their real values are determined when you run train.py, not during compilation. If you attempt to run on different hardware than the compilation target requested via compile_topology, you will get an error saying there is a failure to map the devices from the compiled object to your real devices. Using different XLA flags or a different LIBTPU than what you compiled with will probably run silently, without error, using the settings of the environment you compiled in. However, there is no guaranteed behavior in this case; you should run in the same environment you compiled in.
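
One way to guard against the mismatches described above is to compare the settings recorded at compile time with those used at run time before launching. The sketch below is an illustration only: the helper `find_mismatches` and the choice of keys are assumptions based on the parameters named in this section, not part of MaxText.

```python
# Settings this section says are baked into the compiled train_step.
# The list is an illustrative assumption, not an exhaustive MaxText contract.
COMPILE_TIME_KEYS = (
    "LIBTPU_INIT_ARGS",
    "global_parameter_scale",
    "max_sequence_length",
    "per_device_batch_size",
    "steps",
    "learning_rate",
)


def find_mismatches(compile_settings: dict, run_settings: dict) -> dict:
    """Return {key: (compile_value, run_value)} for keys whose values differ."""
    mismatches = {}
    for key in COMPILE_TIME_KEYS:
        compiled, run = compile_settings.get(key), run_settings.get(key)
        if compiled != run:
            mismatches[key] = (compiled, run)
    return mismatches


if __name__ == "__main__":
    compiled = {
        "LIBTPU_INIT_ARGS": "--xla_enable_async_all_gather=true",
        "global_parameter_scale": 16,
        "per_device_batch_size": 4,
        "steps": 10000,
        "learning_rate": 1e-3,
    }
    # adam_b1 is not compared: optimizer values are only read at run time.
    run = dict(compiled, per_device_batch_size=8, adam_b1=0.9)
    print(find_mismatches(compiled, run))
```

Here only per_device_batch_size is reported, since adam_b1 is deliberately left out of the compared keys. A check like this is most useful for the XLA and LIBTPU flags, where a mismatch may otherwise go unnoticed.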

GPU Support

Ahead-of-time compilation is also supported for GPUs, with some differences from TPUs:

GPU does not support compilation across hardware: a GPU host is still required to run AOT compilation, but a single GPU host can compile a program for a larger cluster of the same hardware.
For A3 Cloud GPUs, the maximum "slice" size is a single host, and the compile_topology_num_slices parameter represents the number of A3 machines to precompile for.

Example

This example illustrates the flags to use for a multihost GPU compilation targeting a cluster of 4 A3 hosts:

Step 1: Run AOT and save compiled function

```shell
# Run the below on a single A3 machine
export XLA_FLAGS="--xla_gpu_enable_async_collectives=true"
python3 -m MaxText.train_compile MaxText/configs/base.yml compile_topology=a3 \
compile_topology_num_slices=4 \
compiled_trainstep_file=my_compiled_train.pickle global_parameter_scale=16 \
attention=dot_product per_device_batch_size=4 steps=10000 learning_rate=1e-3
```

Step 2: Run train.py and load the compiled function

```shell
# Run the below on each of the 4 target A3 hosts.
export XLA_FLAGS="--xla_gpu_enable_async_collectives=true"
python3 -m MaxText.train MaxText/configs/base.yml run_name=example_load_compile \
compiled_trainstep_file=my_compiled_train.pickle \
attention=dot_product global_parameter_scale=16 per_device_batch_size=4 steps=10000 learning_rate=1e-3 \
base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket
```

As in the TPU case, the compilation environment must match the execution environment, in this case by setting the same XLA_FLAGS.