Sun Security Lab

Software Security

Open source software (OSS) is widely used in both free and proprietary applications. Black Duck reports that 96% of the applications it scanned contain open source components, which account for 57% of the code base on average. At the same time, vulnerabilities embedded in upstream OSS propagate quickly to the applications that depend on it. Moreover, cloning or reusing OSS without explicit reference makes it hard for maintainers to track and mitigate vulnerabilities. Our research develops practical techniques for detecting such vulnerabilities, helping to build a more reliable and secure information system infrastructure.

Read more about our work:
GraphSPD
PatchDB
Security Patch Classification
Security Patch Identification
BinProv

GraphSPD: Graph-Based Security Patch Detection with Enriched Code Semantics

With the increasing popularity of open-source software, embedded vulnerabilities have been propagating widely to downstream software. Due to differing maintenance policies, software vendors may silently release security patches without providing sufficient advisories (e.g., CVE). This leaves users unaware of security patches and gives attackers ample opportunity to exploit unpatched vulnerabilities. Detecting these silent security patches is therefore imperative for secure software maintenance. In our paper, we propose a graph neural network based security patch detection system named GraphSPD, which represents patches as graphs with richer semantics and uses a patch-tailored graph model for detection. We first develop a novel graph structure called PatchCPG to represent software patches. It merges the two code property graphs (CPGs) of the pre-patch and post-patch source code while retaining the context, deleted, and added components of the patch. By applying a slicing technique, we retain the most relevant context and reduce the size of PatchCPG. Then, we develop the first end-to-end deep learning model, called PatchGNN, to determine whether a patch is security-related directly from its graph-structured PatchCPG. PatchGNN includes a new embedding process that converts PatchCPG into a numeric format and a new multi-attributed graph convolution mechanism that adapts to the diverse relationships in PatchCPG.
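The merge-and-slice idea can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the PatchCPG construction itself: statements stand in for CPG nodes, plain string pairs stand in for edges, and the function names `merge_patch_graphs` and `slice_context` as well as the `max_hops` parameter are hypothetical.

```python
from collections import deque

def merge_patch_graphs(pre_nodes, pre_edges, post_nodes, post_edges):
    """Merge two graphs and tag each node as context, deleted, or added."""
    labels = {}
    for node in pre_nodes | post_nodes:
        if node in pre_nodes and node in post_nodes:
            labels[node] = "context"   # present before and after the patch
        elif node in pre_nodes:
            labels[node] = "deleted"   # only in the pre-patch graph
        else:
            labels[node] = "added"     # only in the post-patch graph
    return labels, pre_edges | post_edges

def slice_context(labels, edges, max_hops=1):
    """Keep changed nodes plus context nodes within max_hops of a change."""
    neighbors = {node: set() for node in labels}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    # Multi-source breadth-first search starting from every changed node.
    changed = {n for n, tag in labels.items() if tag != "context"}
    distance = {n: 0 for n in changed}
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    kept_labels = {n: labels[n] for n in distance}
    kept_edges = {(s, d) for s, d in edges if s in distance and d in distance}
    return kept_labels, kept_edges

# Hypothetical patch that replaces strcpy with a bounded strncpy.
pre_nodes = {"n = read()", "strcpy(dst, src)", "log()", "return 0"}
pre_edges = {("n = read()", "strcpy(dst, src)"),
             ("strcpy(dst, src)", "log()"),
             ("log()", "return 0")}
post_nodes = {"n = read()", "strncpy(dst, src, SIZE)", "log()", "return 0"}
post_edges = {("n = read()", "strncpy(dst, src, SIZE)"),
              ("strncpy(dst, src, SIZE)", "log()"),
              ("log()", "return 0")}

labels, edges = merge_patch_graphs(pre_nodes, pre_edges, post_nodes, post_edges)
sliced_labels, sliced_edges = slice_context(labels, edges, max_hops=1)
for node, tag in sorted(sliced_labels.items()):
    print(f"{tag:8s} {node}")
```

In this toy, slicing with `max_hops=1` drops `return 0`, which is two hops from the nearest changed statement, while keeping the two context statements adjacent to the change.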

BinProv: Binary Code Provenance Identification without Disassembly

Provenance identification, which is essential for binary analysis, aims to uncover the specific compiler and configuration used to generate an executable. Existing solutions extract syntactic, structural, and semantic features from disassembled programs and employ machine learning techniques to identify the compilation provenance of binaries. However, their effectiveness relies heavily on disassembly tools (e.g., IDA Pro) and tedious feature engineering, and accurate assembly code is hard to obtain, particularly from stripped or obfuscated binaries. In addition, the features in these machine learning approaches are manually selected based on domain knowledge of one specific architecture and cannot be applied to other architectures. In this paper, we develop an end-to-end provenance identification system, BinProv, which leverages a BERT (Bidirectional Encoder Representations from Transformers) based embedding model to learn and represent contextual semantics and syntax directly from the binary code. BinProv therefore avoids both the disassembly step and manual feature selection. Moreover, BinProv can distinguish compilers and the four optimization levels (O0/O1/O2/O3) by fine-tuning the classifier model with the embedding inputs for specific provenance identification tasks. Experimental results show that BinProv achieves 92.14%, 99.4%, and 99.8% accuracy at the byte sequence, function, and binary levels, respectively. We further demonstrate that BinProv works well on obfuscated binary code, suggesting that it is a viable approach to substantially reducing disassembler dependence in future provenance identification tasks. Finally, our case studies show that BinProv can better identify compiler helper functions and improve the performance of binary code similarity detection.
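To make the "without disassembly" idea concrete, the sketch below shows one way raw machine code could be turned into token sequences for an embedding model, and how per-sequence predictions could be combined into a function-level or binary-level label. Everything here is an assumption for illustration: the hex-byte tokens, the window and stride sizes, the majority vote, and the names `bytes_to_token_windows` and `majority_vote` are hypothetical and are not taken from the BinProv paper.

```python
from collections import Counter

def bytes_to_token_windows(code: bytes, window: int = 8, stride: int = 8):
    """Split raw machine code into fixed-length sequences of hex byte tokens.

    No disassembler is involved: each byte becomes one token such as "55".
    The final window may be shorter than `window`.
    """
    tokens = [f"{b:02x}" for b in code]
    return [tokens[i:i + window] for i in range(0, len(tokens), stride)]

def majority_vote(labels):
    """Combine per-sequence labels into one label (assumed aggregation rule)."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical 17-byte x86-64 function body, given as raw bytes.
function_bytes = bytes.fromhex("554889e54883ec10c745fc00000000c9c3")
windows = bytes_to_token_windows(function_bytes)

# Placeholder predictions standing in for a classifier's per-window output.
window_predictions = ["O2", "O2", "O0"]
function_label = majority_vote(window_predictions)

print(len(windows), windows[0])
print(function_label)
```

A real system would feed each token window to the fine-tuned model instead of using the placeholder predictions shown here.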

Published in the International Symposium on Research in Attacks, Intrusions and Defenses (RAID), 2022.


@inproceedings{xu2022binprov,
	author = {He, Xu and Wang, Shu and Xing, Yunlong and Feng, Pengbin and Wang, Haining and Li, Qi and Chen, Songqing and Sun, Kun},
	title = {BinProv: Binary Code Provenance Identification without Disassembly},
	year = {2022},
	publisher = {Association for Computing Machinery},
	address = {New York, NY, USA},
	booktitle = {Proceedings of the 25th International Symposium on Research in Attacks, Intrusions and Defenses},
	pages = {350--363},
	numpages = {14},
	location = {Limassol, Cyprus},
	series = {RAID '22}
	}

Security Patch Classification

With the increasing use of open source software (OSS) in both free and proprietary applications, vulnerabilities embedded in OSS also propagate to the applications that depend on it. It is critical to find the security patches that fix these vulnerabilities, especially those essential for reducing security risk. Unfortunately, given a security patch, there is currently no way to automatically recognize the vulnerability that it fixes. In our paper, we first conduct an empirical study of security patches by type (i.e., the corresponding vulnerability type), using a large-scale dataset collected from the National Vulnerability Database (NVD). Based on the analysis results, we develop a machine learning-based system that helps identify the vulnerability type of a given security patch. The evaluation results show that our system achieves good performance.

Published in the IEEE Conference on Communications and Network Security (CNS) 2020.


@INPROCEEDINGS{wang2020cns,
author={X. {Wang} and S. {Wang} and K. {Sun} and A. {Batcheller} and S. {Jajodia}},
booktitle={2020 IEEE Conference on Communications and Network Security (CNS)}, 
title={A Machine Learning Approach to Classify Security Patches into Vulnerability Types}, 
year={2020},
volume={},
number={},
pages={1-9},
doi={10.1109/CNS48642.2020.9162237}
}

Security Patch Identification

In our paper, we develop a defense system and implement a toolset to automatically identify secret security patches in OSS. To distinguish security patches from other patches, we first build a security patch database that contains more than 4700 security patches that map to records in the CVE list. Next, we identify a set of features that help distinguish security patches from non-security ones using machine learning approaches. Finally, we use code clone identification mechanisms to discover similar patches or vulnerabilities in similar types of OSS. The experimental results show that our approach achieves good detection performance. A case study on OpenSSL, LibreSSL, and BoringSSL discovers 12 secret security patches.
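As an illustration of what patch-level features might look like, the following Python sketch computes a few simple counts from a unified diff. The specific features, the regular expressions, and the function name `patch_features` are hypothetical choices made for this example. They are not the feature set used in the paper, and a real pipeline would pass such features to a trained classifier.

```python
import re

IF_CHECK = re.compile(r"\bif\s*\(")
MEMORY_CALL = re.compile(r"\b(?:memcpy|strcpy|malloc|free)\s*\(")

def count_matches(pattern, lines):
    """Count all matches of a compiled pattern across a list of lines."""
    return sum(len(pattern.findall(line)) for line in lines)

def patch_features(diff_text: str) -> dict:
    """Extract a few illustrative numeric features from a unified diff."""
    added, removed = [], []
    for line in diff_text.splitlines():
        if line.startswith(("+++ ", "--- ")):
            continue                    # file header lines, not code changes
        if line.startswith("+"):
            added.append(line[1:])
        elif line.startswith("-"):
            removed.append(line[1:])
    return {
        "added_lines": len(added),
        "removed_lines": len(removed),
        # Security fixes often add sanity checks, so track the net change.
        "net_if_checks": count_matches(IF_CHECK, added)
                         - count_matches(IF_CHECK, removed),
        # Changed lines that call memory-related functions.
        "memory_calls_touched": count_matches(MEMORY_CALL, added)
                                + count_matches(MEMORY_CALL, removed),
    }

# Hypothetical patch that adds a bounds check before a memcpy call.
example_diff = "\n".join([
    "--- a/parse.c",
    "+++ b/parse.c",
    "@@ -10,2 +10,4 @@",
    " len = read_len(buf);",
    "-memcpy(dst, src, len);",
    "+if (len > sizeof(dst))",
    "+    return -1;",
    "+memcpy(dst, src, len);",
])

features = patch_features(example_diff)
print(features)
```

Feature vectors of this kind could also support the clone-identification step, for example by comparing patches whose vectors are close to each other, although that comparison is not shown here.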

Software Security

GraphSPD: Graph-Based Security Patch Detection with Enriched Code Semantics

BinProv: Binary Code Provenance Identification without Disassembly

PatchDB: A Large-Scale Security Patch Dataset

Security Patch Classification

Security Patch Identification