2022-09-18

Read

反编译-代码混淆-程序分析与编译优化

这个暑假参加了那个编译系统设计赛，经历“磨难”后，不得不说确实视野开阔了很多。曾经CTF比赛中拿各种代码混淆完全没有办法，曾经下定决心想要学反编译却完全不知道从何下手。曾经自己搞代码混淆，看见LLVM IR也是一头雾水。

而现在，视野终于开阔了许多，能够有一些这几个领域的全局的视野了。

其他学习资源

最开始有一段时间真的是一点资源都没有。。但是现在看来，怎么着我也应该想想是不是又逆向相关的会议。确实，有了下面这些会议，起码逆向方面的论文是不愁看不完了。

Working Conference on Reverse Engineering (WCRE) https://ieeexplore.ieee.org/xpl/conhome/1000635/all-proceedings WCRE Working Conference on Reverse Engineering PPREW-5: Proceedings of the 5th Program Protection and Reverse Engineering Workshop 这个期刊好啊。 https://dl.acm.org/conference/pprew SSPREW: Software Security, Protection, and Reverse Engineering Workshop https://dl.acm.org/conference/ssprew

其他我收藏的链接

Github的两个list https://github.com/yasong/Awesome-Info-Inferring-Binary https://github.com/SystemSecurityStorm/Awesome-Binary-Rewriting

https://news.ycombinator.com/item?id=11218138 两个人的讨论。里面推荐对两篇文章的逆向引用搜索：https://scholar.google.com/scholar?as_ylo=2018&hl=en&as_sdt=2005&sciodt=0,5&cites=1148004013363547510&scipsc=

https://scholar.google.com/scholar?cites=7322807636381891759&as_sdt=2005&sciodt=0,5&hl=en

https://github.com/toor-de-force/Ghidra-to-LLVM https://uwspace.uwaterloo.ca/bitstream/handle/10012/17976/Toor_Tejvinder.pdf?sequence=3&isAllowed=y Ghidra Pcode编译到IR。代码太简单了。。栈内存好像是alloca出来的，可能还是想保持语义想运行。

https://github.com/decomp/decomp 这人也想基于LLVM IR然后去优化。https://github.com/decomp/doc 相关文档 https://github.com/avast/retdec/blob/05c9b11351d3e82012d823fa3709f940033768cf/publications/README.md RetDec的publication

The Decompilation Wiki. http://www.program-transformation.org/Transform/DeCompilation

dagger主要讲的是反编译到IR上，找到语义等价的LLVM IR的指令的过程。感觉有点像编译器后端的Instruction Selection，可能能用上利用DAG（有向无环图）选择指令的技术。它是作为llvm的fork编写的，2017后就没有维护了。和llvm耦合好严重啊，都不知道哪里是它的代码。好像好复杂。

https://github.com/repzret/dagger 反编译到LLVM IR。aarch64还在开发过程中。https://llvm.org/devmtg/2013-04/bougacha-slides.pdf 介绍的slides

https://github.com/JuliaComputingOSS/llvm-cbe 曾经IR到C有一个backend，2016年被移除了。现在有人接手

https://corescholar.libraries.wright.edu/cgi/viewcontent.cgi?article=3277&context=etd_all LLVM IR based decompilation。

https://github.com/lifting-bits/sleigh sleigh作为Ghidra的反编译器，是用C++写的，而且汇编到pcode的lift部分也是它负责的。所以用Ghidra可能也只要用这个就可以了。

https://github.com/cmu-sei/pharos 涉及到很多反编译技术

看论文的一些笔记

2022年9月18日

很多都是借用现有的type recovery，重点去讲structure recovery。

C Decompilation : Is It Possible ? 2009的一个: http://web.archive.org/web/20180517094139/http://decompilation.info/sites/all/files/articles/C%20decompilation.pdf 第二章相关工作里面有不少引用 structural analysis：[4–6]，这个也用在了编译器：[8]。 unification-based algorithm for recovery of types：Mycroft [9]

现有反编译器：DCC decompiler [7]. Boomerang [11], REC [12] and Hex-Rays plug-in [13]

【Phoenix】Native x86 Decompilation Using Semantics-Preserving Structural Analysis and Iterative Control-Flow Structuring https://kapravelos.com/teaching/csc591-s20/readings/decompilation.pdf Edward Schwartz's PhD thesis (https://users.ece.cmu.edu/~ejschwar/papers/arthesis14.pdf) covers in detail the Phoenix decompiler, which is another good jumping-off point. 这篇论文关注控制结构的恢复。控制结构的恢复最早是基于interval analysis的？这是什么得学一学。后面才被细化为structural analysis

【Dream】No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantics-Preserving Transformations https://www.ndss-symposium.org/wp-content/uploads/2017/09/11_4_2.pdf slides： https://www.ndss-symposium.org/wp-content/uploads/2017/09/11NoMoreGotos.slide_.pdf code?： https://github.com/USECAP/dream

【DecFuzzer】How far we have come: testing decompilation correctness of C decompilers https://dl.acm.org/doi/abs/10.1145/3395363.3397370 香港科技大学的综述代码在：https://github.com/monkbai/DecFuzzer 论文下载不到，SCI hub太强了。https://sci-hubtw.hkvisa.net/https://dl.acm.org/doi/10.1145/3395363.3397370 functionality-preserving disassembling and C style control structure recovery [17, 31, 47, 64, 65, 67] 变量恢复static analysis and inference techniques [10, 12, 13, 30, 54]. fool-proof techniques for binary disassembling and decompilation [17, 31, 64-67]. EMI编译器测试看了下是插入了不影响语义的代码之后去开编译优化，发现优化器做出的错误决定而导致的crash。比如把一个不该执行的循环内操作提到外面。错误判断一些分支恒为真或假。是设置Csmith的输出使得只生成一个函数？？本来Csmith生成的代码很多全局变量的使用。如果全局变量改变了，很难手动找到是哪个函数？它是生成了局部变量，然后把对全局变量的使用全替换成了局部变量，函数结束的时候把局部变量的值update到全局变量，这样如果全局变量变了，就肯定是在最后update的时候改变的。那手动看的时候不要继续找内部怎么使用？这样做有什么好处吗。。。可能是方便找到这个函数到底涉及到了哪些全局变量，然后方便只提取这些到反编译结果的全局变量？？这篇论文可以研究一下重编译的技术。怎么单独提取出一个文件。怎么合并两个C语言文件。这样我想要用别人的汇编代码也可以用了。

【RetDec】没有论文好像。slides: https://2018.pass-the-salt.org/files/talks/04-retdec.pdf 中端用到了LLVM IR，但是最开始生成的IR也是那种全局变量表示寄存器的形式，不知道最后的时候有没有好一点。也许只是用LLVM IR去做一些控制流的pattern matching？不过有变量识别的Pass。有机会研究一下IR和变量识别的情况。

Evolving Exact Decompilation https://www.cs.unm.edu/~eschulte/data/bed.pdf

【rev.ng】rev.ng: A Multi-Architecture Framework for Reverse Engineering and Vulnerability Discovery. https://www.rev.ng/downloads/iccst-18-paper.pdf 这个反编译器开源了lifter：先转到Qemu IR然后转到LLVM IR。这个好像也不太和反编译相关，也只是搞插桩、fuzzing的。

类型恢复【TIE】Principled Reverse Engineering of Types in Binary Programs. http://users.ece.cmu.edu/~aavgerin/papers/tie-ndss-2011.pdf 这篇搞了自己的DVSA，主要区别是SI里可以放除esp外的变量符号？。重点主要在后面的约束求解部分。后面的类型系统和求解部分也非常复杂TODO。

【SecondWrite】https://user.eng.umd.edu/~barua/elwazeer-PLDI-2013.pdf

【DIVINE】: DIscovering Variables IN Executables 这篇还是VSA系列的那些人写的。讲先用最简单的semi naïve方法鉴别变量，跑VSA，然后拿VSA结果去生成约束跑ASI。迭代几次得到最好的结果。里面说如果变量是8字节大小，那VSA直接无法处理，值总是Top（32位程序）。那就不能直接把内存最大切分粒度搞成4字节？？

【REWARDS】Automatic Reverse Engineering of Data Structures from Binary Execution https://www.cs.purdue.edu/homes/xyzhang/Comp/ndss10.pdf TODO

【retypd】https://arxiv.org/pdf/1603.05495.pdf 需要进一步学习subtyping TODO。它不仅开源，而且不需要VSA的指针信息。可以与之前需要VSA的结合？

推荐的学习路线

其他学习资源

看论文的一些笔记