First name:Gabriela
Surname:Chutňáková
Title:Adaptive layer skipping during large language model inference
Supervisor:Mgr. Vladimír Boľa, PhD.
Year:2026
Keywords:large language model, inference efficiency, model pruning, layer skipping, adaptive computation
Abstract:Large language model inference demands substantial computation, yet not all of it may be necessary. We hypothesize that tokens vary in the amount of processing they need, so skipping some feed-forward network (FFN) layers may reduce computation without significantly degrading model performance. In this work, we investigate adaptive layer skipping during inference, where the decision to skip is based on the current token representation. We focus on layer importance and on the relationship between the magnitude of a layer's output and its impact on model quality. We compare token-level routing with static layer removal. We propose two routing methods: (1) skipping layers whose output is estimated to be small by separately trained classifiers, and (2) using gating layers trained jointly so that the outputs of the modified model approximate those of the original. Token routing allows us to omit one third of FFN computation, at the cost of an increase in perplexity from 7.3 to 15.3 on WikiText-2.

Thesis files:

Diplomova_praca_Chutnakova.pdf

Defense presentation files:
