| Meno: | Jitka |
|---|---|
| Priezvisko: | Muravská |
| Názov: | Permuting Matrix Columns for Improved Block-Sparse Pruning |
| Vedúci: | Mgr. Vladimír Boľa, PhD. |
| Rok: | 2026 |
| Kµúčové slová: | block pruning, linear sum assignment, TETRIS, permutation |
| Abstrakt: | Modern transformer models, from large language models to vision transformers, achieve impressive task performance but require substantial compute and memory at inference. Pruning reduces this cost by zeroing out a fraction of its weights, ideally at little or no loss in task quality. Block pruning makes the resulting sparsity pattern hardwarefriendly, but at a cost: grouping weights into blocks before pruning can trap important weights inside otherwise-low-value blocks. The TETRIS algorithm of [16] addresses this by permuting the elements along the dimensions of the weight matrix before pruning, using a greedy local search to find good permutations. We propose Global Assignment TETRIS (GA-TETRIS), a variant that replaces the local search with an exact solution to a minimum-weight perfect matching in a bipartite graph problem at each iteration, augmented with noise injection and random-swap refinement to escape local optima. We evaluate GA-TETRIS against the Original TETRIS, three simpler heuristics (Sort-by-Norm, Random-Swaps, and Random-Swaps with sorted initialization), and the no-permutation Block-Wanda baseline on two Vision Transformers (0.9M and 13.4M parameters) and SmolLM2-360M. On the per-layer pruning objective, GATETRIS achieves the highest score improvement at wide block widths across all three models. On whole-model evaluation, performance collapses rapidly with growing block width on every model; only the Larger ViT at block 1 × 2 preserves substantial accuracy. In that single regime, GA-TETRIS is the best algorithm by 1.85 percentage points over the next best. On SmolLM2 at the same block width, however, the simple Sort-by-Norm heuristic produces the lowest perplexity — though still roughly seven times worse than the unpruned baseline — suggesting that the best algorithm depends on both block width and model. |
Súbory diplomovej práce:
| main.pdf |
| code.zip |
Súbory prezentácie na obhajobe: