flashattention

Here are 35 public repositories matching this topic...

egaoharu-kensei / flash-attention-triton

Cross-platform FlashAttention-2 Triton implementation for Turing+ GPUs with custom configuration mode

Updated Jan 12, 2026
Python

lavawolfiee / mini-flash-attention

Minimal FlashAttention in CUDA C++/CuTe: readable WMMA/CuTe kernels, no NxN workspace, up to 4.5x faster than naive PyTorch

cuda attention cutlass cute gpu-kernels pytorch-extension tensor-cores llm flash-attention flashattention wmma

Updated Jun 2, 2026
Cuda

MaxLSB / flash-attn2

FlashAttention for sliding window attention in Triton (fwd + bwd pass)

python deep-learning pytorch triton sliding-window flash-attention-2 flashattention

Updated Jun 25, 2025
Python

lyj20071013 / Triton-FlashAttention

This repository contains multiple implementations of Flash Attention optimized with Triton kernels, showcasing progressive performance improvements through hardware-aware optimizations. The implementations range from basic block-wise processing to advanced techniques such as FP8 quantization and prefetching.

pytorch triton attention flashattention

Updated Mar 26, 2026
Python

llcuda / llcuda

CUDA 12-first backend inference for Unsloth on Kaggle — Optimized for small GGUF models (1B-5B) on dual Tesla T4 GPUs (15GB each, SM 7.5)

python machine-learning ai deep-learning jupyter gpu cuda inference pytorch nvidia cuda-kernels google-colab tensor-cores tesla-t4 llm gguf unsloth flashattention

Updated Feb 1, 2026
Jupyter Notebook

JustVugg / nanoeuler

GPT-2-style LLM built from scratch in C/CUDA with hand-written backprop, BPE tokenizer, FlashAttention, pretraining, and SFT.

c nlp training machine-learning deep-learning neural-network openmp cuda cublas language-model from-scratch byte-pair-encoding gpt2 llm bpe-tokenizer flashattention trasformer

Updated Jun 18, 2026
Cuda

Wulfic / AI-OS

HRM-sMoE LLM training toolkit.

Updated May 31, 2026
Python

Any-Winter-4079 / Nano-GPT-Speedrun-Track

This repo is my Nano-GPT speedrun playground, which started as a code-along to "Let's reproduce GPT-2 (124M)" and then moved on to further improvements.

decoder transformers speedrun muon rope gpt-2 gpt3 decoder-model nanogpt flexattention flashattention

Updated May 12, 2026
Python

Raptor-1772791874 / CudaFlashAttention

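A repository in which a beginner implements Varlen Flash Attention V2 from scratch over three months; I believe it can help learners who want to get started.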

cuda artificial-intelligence cuda-kernels forward aiinfra flashattention

Updated Jun 27, 2026
Cuda

XiaomingFun233 / flash_attn_cuda

A simple, naive Flash Attention implementation without optimizations, based on the original paper.

decode attention cuda-kernels flashattention

Updated Nov 14, 2025
Cuda

kennedy-kitoko / yolov12-sdpa-flashattention-pytorch

PyTorch implementation of YOLOv12 with Scaled Dot-Product Attention (SDPA) optimized by FlashAttention for fast and efficient object detection.

pytorch yolo object-detection sdpa ultralytics yolov12 flashattention

Updated Jun 20, 2025
HTML

kalyani-25 / Reimplementation_flash-attention-from-scratch

16-step CUDA optimization of FlashAttention-2 achieving 99.2% of official performance on A100 — Ampere architecture

deep-learning cuda pytorch ampere gpu-kernels nsight llm-inference flashattention

Updated Mar 6, 2026
Cuda

aidendorian / Marcella-66M-SLM

A 66M parameter decoder-only transformer language model implemented from scratch in PyTorch. Features a custom SentencePiece tokenizer, RoPE positional embeddings, SwiGLU feed-forward network, per-layer KV cache for efficient autoregressive inference, and a Svelte-based streaming chat interface.

transformers torch pytorch pretrained-models slm language-model rope alpaca sdpa finetuning sentencepiece transformer-models kv-cache small-language-models fineweb flashattention marcella

Updated May 13, 2026
Python

Saurabh-66 / Triton-optimized-ASR-Pipeline

ASR Pipeline (GLM-ASR) optimized using custom Triton kernels (achieving a 72.2% improvement in speed)

triton asr asr-pipeline asr-model kv-cache triton-kernels flashattention

Updated May 12, 2026
Python

manishklach / mlx-metal-kernels

Experimental MLX custom Metal kernels for Apple Silicon — fast attention, decode, KV-cache, and future Mac GPU inference primitives.

python macos machine-learning deep-learning metal transformers inference attention mps mlx gpu-kernels kv-cache apple-silicon custom-kernels apple-gpu llm llm-inference flashattention metal-kernels

Updated Jun 21, 2026
Python

rogerchang1108 / FlashAttention-with-CUDA

Flash Attention in 200 lines of CUDA (forward pass only).

cuda forward-pass flashattention

Updated Feb 23, 2025
Cuda

manishklach / ghostkv-lab

Research harness for evaluating query-time bounded elimination of reconstructable KV-cache witnesses in long-context transformer inference workloads. Related provisional filing: IN 202641062451.

transformer gpu-memory memory-systems kv-cache cxl long-context llm-inference transformer-memory ai-infrastructure flashattention transformer-optimization systems-research long-context-inference attention-optimization

Updated May 18, 2026
Python

shrvan30 / flash-attention-cuda

FlashAttention-style CUDA implementation with shared-memory tiling, online softmax fusion, IO-aware optimization, and GPU benchmarking.

machine-learning hpc gpu parallel-computing cuda transformer attention cuda-kernels shared-memory gpu-programming flashattention cuda-optimization flashattention2

Updated May 29, 2026
Cuda

dheeren-tejani / smol-llm

Experimental GPT-2 scale (~124M param) LLM trained from scratch on 22B tokens of the Cosmopedia dataset. Includes the full training pipeline with SFT fine-tuning, log analysis tools, a backend and frontend, and deployment.

nlp tokenizer pytorch transformer llama language-model nlp-machine-learning sft gpt2 train-from-scratch llm bitsandbytes openhermes flashattention cosmopedia

Updated May 15, 2026
Python

MrAnayDongre / Inference-Kernels

LLM inference kernels from scratch in Triton: KV cache, FlashAttention, PagedAttention, RMSNorm, RoPE, SwiGLU, and benchmarks.

cuda pytorch triton gpu-kernels machine-learning-systems inference-optimization kv-cache llm-inference pagedattention flashattention

Updated Jun 18, 2026
Python
