Advancing SVD-based LLM Compression via Layer-Wise Error Model Search

ICML 2026

¹Technical University of Munich, ²BMW Group, ³University of Glasgow, ⁴TU Wien, *Equal contribution

Technical TL;DR

We propose LEMS, a global rank allocator driven by a calibrated layer-wise error model of spectral shape, scale, and bias, and KFAC-SVD, which estimates the Fisher matrix with a token-wise approximation to address the rank bottleneck of prior Fisher-based SVD methods. Across the Mistral, Qwen3, and Llama 3 families, LEMS delivers zero-shot accuracy improvements of up to 4.8 p.p. and generalizes to 70B parameters, while KFAC-SVD achieves an average perplexity improvement of 15%.
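The rank bottleneck mentioned above can be illustrated with plain linear algebra: a second-moment matrix accumulated from N vectors has rank at most N, so an estimate built from one aggregated vector per calibration sequence is far more rank-limited than one built from one vector per token. The sketch below only demonstrates that general fact. It is not the KFAC-SVD estimator, and the helper `matrix_rank`, the dimensions, and the random stand-in data are all illustrative assumptions.

```python
import random

def matrix_rank(rows, tol=1e-9):
    """Rank of a list of row vectors via Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    n_cols = len(m[0])
    rank = 0
    for col in range(n_cols):
        # Choose the remaining row with the largest magnitude in this column.
        pivot = max(range(rank, len(m)), key=lambda i: abs(m[i][col]), default=None)
        if pivot is None or abs(m[pivot][col]) < tol:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, len(m)):
            factor = m[i][col] / m[rank][col]
            for j in range(col, n_cols):
                m[i][j] -= factor * m[rank][j]
        rank += 1
        if rank == len(m):
            break
    return rank

# Hypothetical calibration data: random vectors standing in for per-token statistics.
rng = random.Random(0)
dim, n_seqs, seq_len = 8, 2, 6
tokens = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
          for _ in range(n_seqs)]

# Sequence-level statistics: one summed vector per sequence.
seq_vectors = [[sum(tok[k] for tok in seq) for k in range(dim)] for seq in tokens]
# Token-level statistics: one vector per token.
token_vectors = [tok for seq in tokens for tok in seq]

# The matrix sum_i v_i v_i^T has the same rank as the stacked vectors v_i.
seq_rank = matrix_rank(seq_vectors)      # bounded by n_seqs
token_rank = matrix_rank(token_vectors)  # bounded by min(n_seqs * seq_len, dim)
print(seq_rank, token_rank)
```

With only two sequences, the sequence-level estimate cannot exceed rank 2 regardless of the layer dimension, whereas the twelve token-level vectors can span the full 8-dimensional space.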

LEMS Mechanics: the two main operations that drive the global rank allocation, (a) calibrating layer sensitivity and (b) solving error propagation.

Abstract

Low-rank SVD-based compression offers a powerful strategy to reduce the computational costs of LLMs. However, existing methods face two key limitations: (i) global rank allocation, where uncalibrated error proxies fail to capture complex error propagation, and (ii) decomposition quality, where Fisher-based estimators suffer from severe rank collapse. In this work, we address these limitations by introducing Layer-wise Error Modeling Search (LEMS) and KFAC-SVD. LEMS advances rank allocation by introducing a layer-wise error surrogate that integrates local and global layer importance alongside a propagation bias, enabling effective global rank allocation via an Integer Linear Program (ILP) formulation. KFAC-SVD improves decomposition quality by utilizing token-wise statistics, mitigating the rank deficiency observed in prior Fisher-based SVD approaches. Across the Mistral, Qwen3, and Llama 3 model families, we show that LEMS consistently outperforms existing search strategies, delivering significant zero-shot accuracy improvements of up to 4.8 p.p. that generalize to model sizes of 70B parameters, while KFAC-SVD achieves an average perplexity improvement of 15%.
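To make the idea of global rank allocation as an integer program concrete, the following minimal sketch chooses one candidate rank per layer so that the summed predicted error is minimal under a parameter budget. It assumes that per-layer errors simply add up and it enumerates all combinations instead of calling an ILP solver. The function `allocate_ranks`, the layer names, and the cost and error tables are hypothetical and do not reproduce the error model described in the abstract.

```python
from itertools import product

def allocate_ranks(costs, errors, budget):
    """Pick one candidate rank per layer, minimising summed predicted error
    subject to a total parameter budget (exhaustive search, toy sizes only)."""
    layers = list(costs)
    best_choice, best_error = None, float("inf")
    for choice in product(*(sorted(costs[layer]) for layer in layers)):
        total_cost = sum(costs[layer][r] for layer, r in zip(layers, choice))
        if total_cost > budget:
            continue  # violates the parameter budget
        total_error = sum(errors[layer][r] for layer, r in zip(layers, choice))
        if total_error < best_error:
            best_choice, best_error = dict(zip(layers, choice)), total_error
    return best_choice, best_error

# Hypothetical 16x16 layers: a rank-r factorisation stores r * (16 + 16) parameters.
candidate_ranks = (2, 4, 8)
costs = {name: {r: r * 32 for r in candidate_ranks}
         for name in ("layer_a", "layer_b", "layer_c")}

# Hypothetical predicted error for each (layer, rank) pair.
errors = {
    "layer_a": {2: 9.0, 4: 4.0, 8: 1.0},
    "layer_b": {2: 3.0, 4: 2.0, 8: 1.5},
    "layer_c": {2: 6.0, 4: 2.5, 8: 0.5},
}

allocation, total_error = allocate_ranks(costs, errors, budget=14 * 32)
uniform_error = sum(errors[name][4] for name in errors)
print(allocation, total_error, uniform_error)
```

In this toy instance the non-uniform allocation gives the sensitive `layer_a` the highest rank and reaches a lower summed error than assigning rank 4 everywhere. At realistic scale the same objective and budget constraint would be passed to an ILP solver rather than enumerated.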

Results

Search Method Comparison

Comparison of LEMS against state-of-the-art search baselines across modern LLMs and two compression rates.
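| Rate | Search Method | Mistral-7B Wiki | Mistral-7B C4 | Mistral-7B PTB | Mistral-7B Acc | Llama3-8B Wiki | Llama3-8B C4 | Llama3-8B PTB | Llama3-8B Acc | Qwen3-8B Wiki | Qwen3-8B C4 | Qwen3-8B PTB | Qwen3-8B Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| - | Baseline | 5.25 | 8.10 | 27.72 | 63.95 | 6.14 | 9.47 | 9.90 | 63.34 | 9.71 | 15.52 | 15.43 | 62.03 |
| 0.8 | uniform | 7.14 | 19.43 | 64.84 | 52.35 | 11.44 | 52.00 | 80.06 | 47.69 | 12.52 | 31.85 | 47.25 | 53.56 |
| 0.8 | ASVD | 7.20 | 22.41 | 78.51 | 47.81 | 12.92 | 63.83 | 99.64 | 45.95 | 15.70 | 43.79 | 72.61 | 47.72 |
| 0.8 | SVD-LLM v2 | 7.13 | 19.35 | 64.84 | 52.68 | 11.40 | 52.00 | 80.38 | 47.70 | 12.52 | 31.72 | 47.44 | 53.36 |
| 0.8 | MRCS | 7.11 | 19.13 | 68.21 | 51.75 | 11.99 | 55.89 | 83.91 | 46.63 | 14.52 | 37.38 | 53.75 | 52.04 |
| 0.8 | ARS | 7.26 | 11.74 | 47.71 | 54.58 | 11.81 | 20.44 | 28.83 | 54.31 | 11.58 | 22.41 | 25.05 | 55.66 |
| 0.8 | ATP | 7.14 | 19.43 | 64.84 | 52.35 | 11.07 | 35.32 | 81.32 | 51.56 | 12.52 | 31.85 | 47.25 | 53.56 |
| 0.8 | LEMS | 5.98 | 11.24 | 37.09 | 57.67 | 8.16 | 17.52 | 22.80 | 55.99 | 10.38 | 18.32 | 18.32 | 59.28 |
| 0.6 | uniform | 14.38 | 72.05 | 208.48 | 39.10 | 48.56 | 475.36 | 1312.50 | 34.39 | 21.68 | 101.61 | 195.85 | 39.63 |
| 0.6 | ASVD | 16.78 | 102.00 | 317.89 | 35.56 | 75.21 | 571.15 | 1577.00 | 33.98 | 29.22 | 122.56 | 228.97 | 36.09 |
| 0.6 | SVD-LLM v2 | 14.90 | 73.18 | 209.30 | 39.29 | 47.53 | 467.99 | 1242.65 | 34.43 | 21.42 | 96.95 | 189.82 | 39.74 |
| 0.6 | MRCS | 14.21 | 74.05 | 253.45 | 37.44 | 67.95 | 610.37 | 1815.12 | 33.41 | 38.72 | 161.10 | 322.90 | 36.42 |
| 0.6 | ARS | 19.43 | 37.09 | 166.87 | 40.83 | 28.77 | 56.55 | 112.03 | 40.57 | 29.51 | 83.91 | 161.10 | 40.72 |
| 0.6 | ATP | 13.59 | 60.43 | 189.08 | 40.24 | 25.14 | 173.51 | 467.99 | 39.95 | 19.47 | 84.89 | 186.88 | 40.20 |
| 0.6 | LEMS | 10.58 | 34.37 | 124.00 | 45.60 | 17.86 | 81.01 | 201.28 | 43.09 | 15.70 | 57.22 | 120.66 | 45.51 |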
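For orientation on the "uniform" rows, the sketch below shows one common way a single rate is turned into a per-layer rank. It assumes the rate denotes the fraction of a layer's parameters that is retained, and that a rank-r factorisation of a `rows x cols` weight stores `r * (rows + cols)` parameters. Neither assumption is stated on this page, and the function `uniform_rank` and the example shapes are illustrative only.

```python
def uniform_rank(rows, cols, keep_ratio):
    """Largest rank r such that r * (rows + cols) <= keep_ratio * rows * cols."""
    dense_params = rows * cols      # parameters of the uncompressed weight
    params_per_rank = rows + cols   # parameters added by each unit of rank
    return int(keep_ratio * dense_params // params_per_rank)

# Example shapes (assumed): a square projection and a wider MLP projection.
for shape in [(4096, 4096), (4096, 14336)]:
    for keep_ratio in (0.8, 0.6):
        print(shape, keep_ratio, uniform_rank(*shape, keep_ratio))
```

Under these assumptions every layer receives the same fraction of its original parameters, which is the baseline that a global search method would redistribute across layers.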

SVD Method Comparison

Wiki performance and added execution time of individual parts of our method on Llama3-8B and Qwen3-8B.
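| Rate | SVD Method | Mistral-7B Wiki | Mistral-7B C4 | Mistral-7B PTB | Mistral-7B Acc | Llama3-8B Wiki | Llama3-8B C4 | Llama3-8B PTB | Llama3-8B Acc | Qwen3-8B Wiki | Qwen3-8B C4 | Qwen3-8B PTB | Qwen3-8B Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| - | Baseline | 5.25 | 8.10 | 27.72 | 63.96 | 6.14 | 9.47 | 9.90 | 63.33 | 9.71 | 15.52 | 15.43 | 62.04 |
| 0.9 | FWSVD | 9.47 | 13.75 | 58.35 | 56.93 | 42.11 | 59.96 | 69.83 | 48.88 | 16.95 | 24.75 | 31.54 | 53.82 |
| 0.9 | ASVD | 9.14 | 13.41 | 49.13 | 57.30 | 65.09 | 77.60 | 207.67 | 47.89 | 20.36 | 29.98 | 36.37 | 51.19 |
| 0.9 | SVD-LLM | 6.46 | 15.07 | 66.12 | 56.81 | 10.14 | 39.63 | 49.91 | 52.27 | 12.52 | 27.40 | 36.66 | 55.96 |
| 0.9 | SVD-LLM v2 | 6.46 | 15.07 | 66.12 | 56.78 | 10.18 | 39.71 | 50.40 | 52.17 | 12.52 | 27.45 | 36.80 | 55.80 |
| 0.9 | DoBi-SVD | 7.11 | 15.19 | 63.58 | 55.17 | 11.22 | 34.77 | 42.36 | 52.98 | 13.46 | 28.88 | 35.81 | 55.48 |
| 0.9 | GFWSVD | 65.86 | 98.86 | 371.66 | 37.03 | 7612.14 | 13359.73 | 25953.52 | 32.39 | 659.97 | 691.64 | 1646.24 | 43.30 |
| 0.9 | KFAC-SVD | 6.22 | 13.10 | 44.04 | 56.92 | 8.84 | 25.69 | 32.60 | 54.45 | 11.51 | 23.76 | 28.00 | 57.07 |
| 0.9 | +LEMS | 5.37 | 8.49 | 29.86 | 62.77 | 6.58 | 11.11 | 11.29 | 61.99 | 9.85 | 16.04 | 15.77 | 62.89 |
| 0.7 | FWSVD | 34.84 | 50.30 | 215.10 | 44.24 | 716.39 | 837.54 | 1939.75 | 33.40 | 41.37 | 68.75 | 90.72 | 42.68 |
| 0.7 | ASVD | 28.16 | 42.94 | 163.64 | 46.09 | 10989.41 | 9399.74 | 36032.89 | 33.03 | 70.66 | 96.95 | 131.49 | 43.10 |
| 0.7 | SVD-LLM | 10.96 | 42.60 | 169.49 | 44.39 | 34.64 | 376.04 | 600.91 | 37.69 | 17.11 | 63.09 | 136.19 | 45.76 |
| 0.7 | SVD-LLM v2 | 10.96 | 42.60 | 169.49 | 44.35 | 34.98 | 378.99 | 615.16 | 37.70 | 17.15 | 63.33 | 136.72 | 45.62 |
| 0.7 | DoBi-SVD | 12.37 | 39.71 | 170.82 | 42.94 | 31.54 | 500.12 | 223.67 | 38.48 | 18.54 | 63.83 | 94.34 | 46.24 |
| 0.7 | GFWSVD | 3325.50 | 3738.96 | 4042.78 | 31.39 | 104264.63 | 199411.34 | 106737.20 | 31.68 | 173251.57 | 49251.13 | 61294.02 | 32.74 |
| 0.7 | KFAC-SVD | 9.29 | 34.64 | 110.72 | 45.76 | 19.43 | 151.34 | 253.45 | 40.73 | 14.69 | 50.00 | 93.97 | 46.78 |
| 0.7 | +LEMS | 7.42 | 18.25 | 67.95 | 51.52 | 11.07 | 33.51 | 65.09 | 49.39 | 11.81 | 28.33 | 37.97 | 53.80 |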

BibTeX

@inproceedings{thoma_advancing_2026,
      location = {Seoul, South Korea},
      title = {Advancing {SVD}-based {LLM} Compression via Layer-Wise Error Model Search},
      url = {https://openreview.net/forum?id=IjIgNPFuCt},
      abstract = {Low-rank {SVD}-based compression offers a powerful strategy to reduce the computational costs of large language models ({LLMs}); however, existing methods commonly encounter two recurring obstacles: (i) global rank allocation, where uncalibrated error proxies fail to account for complex error propagation, and (ii) decomposition quality, where Fisher-based estimators suffer from severe rank collapse. In this work, we address these limitations by presenting Layer-wise Error Modeling Search ({LEMS}) and {KFAC}-{SVD}. {LEMS} advances rank allocation by introducing a layer-wise error surrogate that integrates both local and global layer importance alongside a propagation bias, allowing us to determine global rank configurations efficiently as an Integer Linear Program ({ILP}). Simultaneously, {KFAC}-{SVD} improves decomposition quality by utilizing token-wise statistics, preventing the rank deficiency observed in prior Fisher-based {SVD}. We demonstrate across Mistral, Qwen3, and Llama-3 families that {KFAC}-{SVD} achieves an average perplexity improvement of 15\%, while {LEMS} consistently outperforms existing search strategies, delivering significant zero-shot accuracy improvements of up to 4.7 p.p. that generalize to scales of 70B parameters. Code is made available in the Supplement.},
      eventtitle = {Forty-third International Conference on Machine Learning},
      author = {Thoma, Moritz and Groezinger, Maximilian and Forstenhäusler, Maximilian and Aghajanzadeh, Emad and Vemparala, Manoj Rohit and Anagnostopoulos, Christos and Mori, Pierpaolo and Fasfous, Nael and Frickenstein, Alexander and Mueller-Gritschneder, Daniel and Schlichtmann, Ulf},
      urldate = {2026-05-27},
      date = {2026-04-30},
      langid = {english},
    }