Thesis
Abstract: In this thesis, we implement a Ghidra extension that improves the names assigned to variables in decompiled code. The extension is based on a state-of-the-art machine learning technique for text-to-text problems, the sequence-to-sequence encoder-decoder Transformer model, introduced in "Attention Is All You Need" by Vaswani et al. (2017).

Before the model can be trained, we have to decide what data to use. We collect some of the most popular C repositories on GitHub and, from each repository, build two binaries: one with debug information and one without. This gives us a dataset for supervised machine learning.

The model is also designed to be easy to retrain on a custom dataset. A security researcher can, for example, use our tools to build a dataset from French source code and thereby predict French variable names.

Finally, we implement and use a custom scoring metric that accounts for the difficulties of comparing natural-language text, such as synonyms, abbreviations, and word order.
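As an illustrative sketch of the architecture named above (not the thesis implementation), the following PyTorch snippet wires token embeddings around an encoder-decoder `nn.Transformer` with a linear head over the target vocabulary. The class name `VarNamer`, the layer sizes, and the omission of positional encodings are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class VarNamer(nn.Module):
    """Toy sequence-to-sequence Transformer: decompiler tokens in, name tokens out.
    Positional encodings and the training loop are omitted for brevity."""

    def __init__(self, src_vocab: int, tgt_vocab: int, d_model: int = 256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # src: (src_len, batch), tgt: (tgt_len, batch) of token ids.
        # Causal mask so the decoder cannot attend to future name tokens.
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(0))
        h = self.transformer(self.src_emb(src), self.tgt_emb(tgt), tgt_mask=mask)
        return self.out(h)  # logits: (tgt_len, batch, tgt_vocab)
```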
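The two-binary build step might look like the minimal sketch below. The compiler flags, file names, and use of `strip` are assumptions; the thesis pipeline's exact build configuration is not specified here.

```python
import subprocess

# Hypothetical input file; the real pipeline walks cloned GitHub repositories.
SRC = "example.c"

# Binary with DWARF debug information (retains the original variable names).
subprocess.run(["gcc", "-g", "-O0", SRC, "-o", "example_debug"], check=True)

# Binary without debug information, stripped, as an analyst would receive it.
subprocess.run(["gcc", "-O0", SRC, "-o", "example_plain"], check=True)
subprocess.run(["strip", "example_plain"], check=True)
```

Decompiling both binaries and aligning the variables then yields (placeholder name, original name) training pairs for the supervised setting described above.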
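One possible shape for such a metric is an order-insensitive token F1 over normalised name tokens, as in the sketch below. The `NORMALIZE` table and the splitting rules are hypothetical stand-ins for the thesis's actual synonym and abbreviation handling.

```python
import re

# Hypothetical normalisation table; the metric's actual lexicon may differ.
NORMALIZE = {"len": "length", "idx": "index", "cnt": "count", "buf": "buffer"}

def tokens(name: str) -> set:
    """Split a variable name on underscores and camelCase, then normalise."""
    parts = re.findall(r"[A-Za-z][a-z]*|\d+", name)
    return {NORMALIZE.get(p.lower(), p.lower()) for p in parts}

def name_score(predicted: str, reference: str) -> float:
    """Order-insensitive token F1 between a predicted and a reference name."""
    p, r = tokens(predicted), tokens(reference)
    if not p or not r:
        return 0.0
    overlap = len(p & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

# e.g. name_score("buf_len", "bufferLength") == 1.0,
# despite the abbreviations and the different token order.
```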