XIE Yinpeng1, ZHOU Qingbo1, HE Jindong2, XIE Xinzhi2, ZHOU Song1*
(1 National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, Jiangsu, China; 2 Electric Power Research Institute, State Grid Fujian Electric Power Company, Fuzhou 350007, Fujian, China)
Abstract:
Supervised learning methods for SQL injection detection are unsuitable in some scenarios. For example, when expert labeling is scarce or an application has only just been launched, only a small number of labeled SQL statements are available, whereas unlabeled SQL statements are easy to obtain in large numbers. This paper proposes a self-training based semi-supervised SQL injection detection method, abbreviated as S4ID. First, features are extracted from the SQL statements: patterns are extracted based on the syntax tree, and feature vectors are built with the bag-of-words model. A semi-supervised model is then trained within the self-training framework. Specifically, the model selects some samples from the unlabeled data and assigns pseudo-labels to them to expand the labeled data, on which the model is then retrained. The results show that, with limited labeled samples, S4ID can exploit unlabeled samples to achieve higher SQL injection detection accuracy than supervised learning methods.
Keywords:
SQL injection detection; self-training; machine learning; semi-supervised learning
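The self-training loop summarized in the abstract can be illustrated with a minimal sketch. This is not the S4ID implementation: the regex tokenizer is a crude stand-in for syntax-tree-based pattern extraction, the multinomial naive Bayes classifier is an assumed base learner, and the confidence threshold used to select pseudo-labeled samples is an assumed selection rule, since the abstract does not specify one. All names (`tokenize`, `NaiveBayes`, `self_train`, `predict`) and the toy SQL statements are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sql):
    """Assumed stand-in for syntax-tree pattern extraction: normalize
    literals to placeholders, then split into a bag of word/symbol tokens."""
    sql = sql.lower()
    sql = re.sub(r"'[^']*'", " str ", sql)   # quoted string literals
    sql = re.sub(r"\b\d+\b", " num ", sql)   # standalone numeric literals
    return re.findall(r"[a-z_]+|[^\sa-z_]", sql)


class NaiveBayes:
    """Multinomial naive Bayes over token bags with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.priors = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc)
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, doc):
        n_docs = sum(self.priors.values())
        log_p = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            lp = math.log(self.priors[c] / n_docs)
            for tok in doc:
                if tok in self.vocab:  # tokens never seen in training are ignored
                    lp += math.log((self.counts[c][tok] + 1) / total)
            log_p[c] = lp
        top = max(log_p.values())
        norm = sum(math.exp(v - top) for v in log_p.values())
        return {c: math.exp(v - top) / norm for c, v in log_p.items()}


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Self-training: pseudo-label confident unlabeled samples, then retrain.
    `threshold` and `max_rounds` are assumed hyperparameters."""
    docs = [tokenize(sql) for sql, _ in labeled]
    labels = [label for _, label in labeled]
    pool = [tokenize(sql) for sql in unlabeled]
    model = NaiveBayes().fit(docs, labels)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for doc in pool:
            proba = model.predict_proba(doc)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                docs.append(doc)       # expand the labeled set
                labels.append(label)   # with the pseudo-label
                added += 1
            else:
                remaining.append(doc)
        if added == 0:                 # nothing confident enough: stop
            break
        pool = remaining
        model.fit(docs, labels)        # retrain on labeled + pseudo-labeled data
    return model


def predict(model, sql):
    proba = model.predict_proba(tokenize(sql))
    return max(proba, key=proba.get)


# Toy data (hypothetical): label 0 = benign, 1 = injection.
labeled = [
    ("SELECT name FROM users WHERE id = 5", 0),
    ("SELECT * FROM users WHERE id = 1 OR 1=1", 1),
]
unlabeled = ["SELECT * FROM orders WHERE id = 2 OR 2=2"]
model = self_train(labeled, unlabeled)
print(predict(model, "SELECT * FROM t WHERE id = 9 OR 9=9"))
```

A fixed confidence threshold is only one way to choose which unlabeled samples receive pseudo-labels. A threshold that is too low lets early misclassifications reinforce themselves in later rounds, so in practice it would be tuned on held-out labeled data.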