How to Install Flash-Attention

PyPI has no wheels for flash-attn. Every pip install flash-attn triggers a from-source CUDA compilation that can take over two hours (see Why Installing GPU Python Packages Is So Complicated for background). Prebuilt wheels do exist on GitHub Releases, and the package’s setup.py can fetch them automatically if your environment matches.

Requirements

  • Platform: Linux on NVIDIA Ampere (A100, RTX 3090), Ada Lovelace (RTX 4090), or Hopper (H100). Windows has experimental support since v2.3.2. No macOS support.
  • Software: CUDA toolkit >=12.0 with nvcc on PATH, PyTorch >=2.2 already installed in the target environment.

Install with prebuilt wheels (recommended)

The flash-attn package includes a CachedWheelsCommand in its setup.py that tries to download a matching prebuilt wheel from GitHub Releases before falling back to compilation. The --no-build-isolation flag is required because setup.py imports torch and packaging at the top level. Both must be installed in the environment before running the install:

uv pip install packaging
uv pip install flash-attn --no-build-isolation

pip and uv both support --no-build-isolation. The flag tells the installer to use packages from the current environment during the build instead of creating an isolated one.

When a prebuilt wheel matches, installation completes in seconds. When it doesn’t match, the command silently falls back to compiling from source, which takes much longer. If the install takes more than a minute, a prebuilt wheel was not found for your configuration.

Prebuilt wheel coverage (v2.8.3)

All prebuilt wheels target CUDA 12 on Linux x86_64; the only exceptions are two aarch64 wheels for torch 2.9.

PyTorch   Python 3.9   Python 3.10   Python 3.11   Python 3.12   Python 3.13
2.4       yes          yes           yes           yes           —
2.5       yes          yes           yes           yes           yes
2.6       yes          yes           yes           yes           yes
2.7       yes          yes           yes           yes           yes
2.8       yes          yes           yes           yes           yes

Each cell has both CXX11 ABI TRUE and FALSE variants. torch 2.9 is covered only by the two aarch64 wheels, for a single Python version. If your combination is not in this table, skip to Build from source.
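For scripting, the coverage table can be encoded as a small lookup. This is a sketch, not part of flash-attn: WHEEL_COVERAGE and has_prebuilt_wheel are hypothetical names, the mapping is transcribed from the v2.8.3 table above, and the torch 2.9 aarch64 wheels are omitted.

```python
# Hypothetical lookup transcribed from the v2.8.3 coverage table above.
# Not part of flash-attn; torch 2.9 aarch64 wheels are omitted.
WHEEL_COVERAGE = {
    "2.4": {"cp39", "cp310", "cp311", "cp312"},
    "2.5": {"cp39", "cp310", "cp311", "cp312", "cp313"},
    "2.6": {"cp39", "cp310", "cp311", "cp312", "cp313"},
    "2.7": {"cp39", "cp310", "cp311", "cp312", "cp313"},
    "2.8": {"cp39", "cp310", "cp311", "cp312", "cp313"},
}

def has_prebuilt_wheel(torch_mm: str, py_tag: str) -> bool:
    """True if a prebuilt x86_64 wheel exists for this torch/Python pair."""
    return py_tag in WHEEL_COVERAGE.get(torch_mm, set())

print(has_prebuilt_wheel("2.7", "cp312"))  # True
print(has_prebuilt_wheel("2.4", "cp313"))  # False: no 3.13 wheel for torch 2.4
```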

Warning

Prebuilt wheels lag behind PyTorch releases. If you are on a newer version of PyTorch than what is listed above (e.g. torch 2.10 or 2.11), no prebuilt wheel exists and the install will fall back to compiling from source. Either pin a supported PyTorch version or follow the Build from source instructions.
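For example, a pyproject fragment pinning torch to a wheel-covered version might look like this (the version number is illustrative; pick any row from the table above):

```toml
# Illustrative pin: keep torch on a release that has prebuilt
# flash-attn wheels (any 2.4-2.8 row from the coverage table).
[project]
dependencies = ["torch==2.8.*", "flash-attn"]
```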

Add to a uv project

For projects managed with uv using uv add and uv sync, the --no-build-isolation approach above does not apply. Instead, uv provides extra-build-dependencies to inject torch into the isolated build environment. The match-runtime = true option ensures the build uses the same torch version the project resolves at runtime:

[project]
dependencies = ["flash-attn", "torch"]

[tool.uv.extra-build-dependencies]
flash-attn = ["packaging", { requirement = "torch", match-runtime = true }]

Then run uv sync as normal. uv handles the build isolation, torch injection, and version matching automatically. The build will compile CUDA extensions from source, which requires nvcc on PATH and takes several minutes.

To control the build, pass environment variables with extra-build-variables. For example, to limit parallel compilation jobs on low-memory machines:

[tool.uv.extra-build-variables]
flash-attn = { MAX_JOBS = "4" }

Install from a direct wheel URL

If automatic download fails, or to pin a specific wheel in a requirements file, construct the URL yourself and install it directly. The URL pattern is:

https://github.com/Dao-AILab/flash-attention/releases/download/v{version}/flash_attn-{version}+cu{cuda}torch{torch}cxx11abi{abi}-cp{py}-cp{py}-linux_x86_64.whl

To fill in the blanks, check your environment:

python -c "import torch; print('torch:', '.'.join(torch.__version__.split('+')[0].split('.')[:2]))"
python -c "import torch; print('cxx11abi:', torch._C._GLIBCXX_USE_CXX11_ABI)"
python -c "import sys; print('python: cp' + ''.join(map(str, sys.version_info[:2])))"
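The three checks can be folded into a helper that assembles the full URL. This is a sketch: flash_attn_wheel_url is a hypothetical function, and it assumes the CUDA 12 / linux_x86_64 naming shown in the pattern above.

```python
# Sketch: assemble the prebuilt-wheel URL from the pieces above.
# flash_attn_wheel_url is a hypothetical helper, not part of flash-attn.

def flash_attn_wheel_url(version: str, torch_version: str,
                         cxx11_abi: bool, py_tag: str, cuda: str = "12") -> str:
    """Build the GitHub Releases URL for a prebuilt flash-attn wheel."""
    # Strip any local suffix like "+cu128" and keep major.minor only.
    torch_mm = ".".join(torch_version.split("+")[0].split(".")[:2])
    abi = "TRUE" if cxx11_abi else "FALSE"
    return (
        "https://github.com/Dao-AILab/flash-attention/releases/download/"
        f"v{version}/flash_attn-{version}+cu{cuda}torch{torch_mm}"
        f"cxx11abi{abi}-{py_tag}-{py_tag}-linux_x86_64.whl"
    )

# Example: Python 3.12, torch 2.7.1+cu128, CXX11 ABI True
print(flash_attn_wheel_url("2.8.3", "2.7.1+cu128", True, "cp312"))
```

Note that the helper drops the +cu128 local version suffix before building the URL, which sidesteps the auto-download failure described in Troubleshooting below.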

Then install the matching wheel. For example, with Python 3.12, PyTorch 2.7, and CXX11 ABI True:

uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl

Build from source

When no prebuilt wheel matches your environment, install the build dependencies into the same environment as PyTorch and then compile with a constrained job count:

uv pip install ninja packaging
MAX_JOBS=4 uv pip install flash-attn --no-build-isolation

ninja is critical. Without it, the build uses a single-threaded fallback that takes roughly two hours instead of minutes. MAX_JOBS=4 prevents out-of-memory kills on machines with less than 96GB RAM; increase the number on machines with more memory to speed up the build.

See How to Install PyTorch with uv for getting a compatible PyTorch installation in place first.

Install with conda-forge or pixi

conda and pixi users can skip the wheel and compilation complexity entirely. The conda-forge build handles the CUDA toolkit dependency as part of the solver, so there is no need to manage nvcc, ABI variants, or --no-build-isolation.

pixi add flash-attn

Packages are available for linux-64 and linux-aarch64. For more on when conda-based tools are the better choice for GPU workloads, see uv vs pixi vs conda for Scientific Python.

Verify the installation

After installing, confirm flash-attention imports and that PyTorch can see the GPU:

python -c "import flash_attn; print(flash_attn.__version__)"
python -c "import torch; print('cuda available:', torch.cuda.is_available())"

If this fails with ModuleNotFoundError, the installation did not complete. Check the install output for errors. If it fails with a CUDA-related error at import time, the installed build may not match your driver or GPU architecture.

Troubleshooting

ModuleNotFoundError: No module named 'packaging' during install. The setup.py imports packaging before it does anything else, including downloading prebuilt wheels. Run pip install packaging first, then retry.

ModuleNotFoundError: No module named 'torch' during install. PyTorch must be installed in the environment before running pip install flash-attn. The --no-build-isolation flag tells pip to use the current environment’s packages during the build, and setup.py imports torch immediately. Install PyTorch first, then retry.

Build starts compiling instead of downloading a wheel. No prebuilt wheel exists for your combination of Python version, PyTorch version, and CXX11 ABI. Check the compatibility table above. If your combination is listed, check what torch.__version__ reports: PyTorch installed from a CUDA-specific index (e.g. download.pytorch.org/whl/cu128) reports a version like 2.7.1+cu128, and the +cu128 local version suffix can cause the auto-download to construct the wrong wheel URL. If this happens, use the direct wheel URL method instead.

Build killed by OOM. Set MAX_JOBS=2 or MAX_JOBS=1 to reduce parallel compilation. Each compilation job can consume several gigabytes of memory.
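As a rough sizing aid, available memory can be turned into a MAX_JOBS value. This sketch assumes about 4 GB per parallel nvcc job, which is a rule of thumb rather than an official figure, and relies on the Linux free command from procps:

```shell
# Derive MAX_JOBS from available RAM, assuming ~4 GB per compile job.
avail_gb=$(free -g | awk '/^Mem:/ {print $7}')
jobs=$(( avail_gb / 4 ))
if [ "$jobs" -lt 1 ]; then jobs=1; fi
echo "MAX_JOBS=$jobs"
```

Then pass the value through to the install, e.g. MAX_JOBS=$jobs uv pip install flash-attn --no-build-isolation.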

nvcc not found. The CUDA toolkit is not on PATH. Install it from NVIDIA’s CUDA toolkit archive or use a Docker image with CUDA pre-installed (such as nvidia/cuda:12.8.0-devel-ubuntu22.04).

Build takes hours. Install ninja (pip install ninja) and retry. Without ninja, the CUDA extensions compile one file at a time.

Note

FlashAttention-4 is a separate package (pip install --pre flash-attn-4) that uses JIT compilation and ships as a pure Python wheel. No CUDA compiler or --no-build-isolation flag needed. It requires a Hopper or Blackwell GPU and CUDA >=12.3.
