Thesis
Abstract: In this thesis, we implement a Ghidra extension that improves the names assigned to variables in decompiled code. The extension is based on a state-of-the-art machine learning technique for text-to-text problems, the sequence-to-sequence encoder-decoder Transformer model, introduced in "Attention Is All You Need" by Vaswani et al. (2017).

Before the model can be trained, we have to decide what data to use. We collect some of the most popular C repositories on GitHub and, from each repository, build two binaries: one with debug information and one without. This gives us a dataset for supervised machine learning.

The model is also designed to be easy to retrain on a custom dataset. A security researcher can, for example, use our tools to build a dataset from French source code and thereby predict French variable names.

Finally, we implement and use a custom scoring metric that accounts for the difficulties of comparing natural-language text, such as synonyms, abbreviations, and word order.
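As an illustrative sketch of the architecture named above (not the thesis implementation), the following PyTorch snippet wires token embeddings around an encoder-decoder `nn.Transformer` with a linear head over the target vocabulary. The class name `VarNamer`, the layer sizes, and the omission of positional encodings are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class VarNamer(nn.Module):
    """Toy sequence-to-sequence Transformer: decompiler tokens in, name tokens out.
    Positional encodings and the training loop are omitted for brevity."""

    def __init__(self, src_vocab: int, tgt_vocab: int, d_model: int = 256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # src: (src_len, batch), tgt: (tgt_len, batch) of token ids.
        # Causal mask so the decoder cannot attend to future name tokens.
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(0))
        h = self.transformer(self.src_emb(src), self.tgt_emb(tgt), tgt_mask=mask)
        return self.out(h)  # logits: (tgt_len, batch, tgt_vocab)
```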
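The two-binary build step might look like the minimal sketch below. The compiler flags, file names, and use of `strip` are assumptions; the thesis pipeline's exact build configuration is not specified here.

```python
import subprocess

# Hypothetical input file; the real pipeline walks cloned GitHub repositories.
SRC = "example.c"

# Binary with DWARF debug information (retains the original variable names).
subprocess.run(["gcc", "-g", "-O0", SRC, "-o", "example_debug"], check=True)

# Binary without debug information, stripped, as an analyst would receive it.
subprocess.run(["gcc", "-O0", SRC, "-o", "example_plain"], check=True)
subprocess.run(["strip", "example_plain"], check=True)
```

Decompiling both binaries and aligning the variables then yields (placeholder name, original name) training pairs for the supervised setting described above.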
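One possible shape for such a metric is an order-insensitive token F1 over normalised name tokens, as in the sketch below. The `NORMALIZE` table and the splitting rules are hypothetical stand-ins for the thesis's actual synonym and abbreviation handling.

```python
import re

# Hypothetical normalisation table; the metric's actual lexicon may differ.
NORMALIZE = {"len": "length", "idx": "index", "cnt": "count", "buf": "buffer"}

def tokens(name: str) -> set:
    """Split a variable name on underscores and camelCase, then normalise."""
    parts = re.findall(r"[A-Za-z][a-z]*|\d+", name)
    return {NORMALIZE.get(p.lower(), p.lower()) for p in parts}

def name_score(predicted: str, reference: str) -> float:
    """Order-insensitive token F1 between a predicted and a reference name."""
    p, r = tokens(predicted), tokens(reference)
    if not p or not r:
        return 0.0
    overlap = len(p & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

# e.g. name_score("buf_len", "bufferLength") == 1.0,
# despite the abbreviations and the different token order.
```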