Journal of Shaanxi Normal University (Natural Science Edition)

Special Topic: Data Mining

Semi-supervised SQL injection detection based on self-training

XIE Yinpeng1, ZHOU Qingbo1, HE Jindong2, XIE Xinzhi2, ZHOU Song1*

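(1 National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, Jiangsu, China; 2 Electric Power Research Institute, State Grid Fujian Electric Power Company, Fuzhou 350007, Fujian, China)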

ZHOU Song, male, engineer. Main research interests: big data, databases, and artificial intelligence. E-mail: zhousong@nju.edu.cn

Abstract:

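To address the problem that SQL injection detection methods based on supervised learning are unsuitable in some scenarios, this paper proposes a self-training based semi-supervised SQL injection detection method (S4ID). S4ID first extracts features from SQL statements, including pattern extraction based on the syntax tree and feature vector representation based on the bag-of-words model. It then trains a model with a self-training based semi-supervised algorithm: some samples are selected from the unlabeled data and assigned pseudo-labels, which expands the training set and thereby improves the machine learning model. Experimental results show that, when labeled samples are limited, S4ID can exploit unlabeled samples to achieve better SQL injection detection performance than supervised learning methods.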

Keywords:

SQL injection detection; self-training; machine learning; semi-supervised learning

Received:

2020-09-16

CLC number:

TP391

Document code:

Article ID:

1672-4291(2021)01-0037-07

Funding:

Science and Technology Project of State Grid Corporation of China Headquarters (SGGR0000XTJS1900448)

Doi:

Semi-supervised SQL injection detection based on self-training

XIE Yinpeng1, ZHOU Qingbo1, HE Jindong2, XIE Xinzhi2, ZHOU Song1*

(1 National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, Jiangsu, China; 2 Electric Power Research Institute, State Grid Fujian Electric Power Company, Fuzhou 350007, Fujian, China)

Abstract:

SQL injection detection methods based on supervised learning are unsuitable in some scenarios. For example, when expert labeling capacity is limited or a new application has only just been launched, only a small number of labeled SQL statements can be obtained, whereas a large number of unlabeled SQL statements are easy to collect. This paper proposes a self-training based semi-supervised SQL injection detection method, abbreviated as S4ID. First, features are extracted from the SQL statements, including pattern extraction based on the syntax tree and feature vector representation based on the bag-of-words model. Then a semi-supervised model is trained within the self-training framework. More specifically, the model selects some samples from the unlabeled data and assigns pseudo-labels to them to expand the labeled data, on which the model is then trained further. The results show that, when labeled samples are limited, S4ID can use unlabeled samples to achieve better SQL injection detection accuracy than supervised learning methods.

Keywords:

SQL injection detection; self-training; machine learning; semi-supervised learning
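
The self-training loop outlined in the abstract can be illustrated with a minimal Python sketch. It is not the S4ID implementation: the abstract does not specify the base classifier, the sample selection rule, or the syntax-tree patterns, so everything below is an assumption made for illustration. In particular, `tokenize` is a crude regex stand-in for syntax-tree pattern extraction, `NaiveBayes` is a hypothetical bag-of-words classifier chosen for brevity, and the confidence threshold of 0.8 and the toy statements are invented.

```python
import math
import re
from collections import Counter

# Assumed keyword list; a real system would use a SQL parser instead.
KEYWORDS = {"select", "from", "where", "and", "or", "union",
            "insert", "update", "delete", "drop"}
PLACEHOLDERS = {"strlit", "numlit"}

def tokenize(sql):
    """Reduce a statement to a token pattern: literals and identifiers are abstracted."""
    s = sql.lower()
    s = re.sub(r"'[^']*'", " strlit ", s)   # quoted strings -> placeholder
    s = re.sub(r"\b\d+\b", " numlit ", s)   # standalone numbers -> placeholder
    raw = re.findall(r"--|[a-z_][a-z0-9_]*|\S", s)
    tokens = []
    for t in raw:
        is_word = t[0].isalpha() or t[0] == "_"
        keep = (not is_word) or (t in KEYWORDS) or (t in PLACEHOLDERS)
        tokens.append(t if keep else "ident")
    return tokens

class NaiveBayes:
    """Multinomial naive Bayes over token counts (bag of words), Laplace-smoothed."""
    def fit(self, docs, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.n_docs = Counter(labels)        # both classes must be present
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def prob_malicious(self, tokens):
        log_odds = math.log(self.n_docs[1] / self.n_docs[0])
        total = {c: sum(self.counts[c].values()) + len(self.vocab) for c in (0, 1)}
        for t in tokens:
            if t in self.vocab:              # unseen tokens are ignored
                log_odds += math.log((self.counts[1][t] + 1) / total[1])
                log_odds -= math.log((self.counts[0][t] + 1) / total[0])
        log_odds = max(-30.0, min(30.0, log_odds))   # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-log_odds))

def self_train(labeled, unlabeled, threshold=0.8, max_rounds=10):
    """Repeatedly pseudo-label confident unlabeled samples and refit the model."""
    docs = [tokenize(s) for s, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    pseudo = []                              # (statement, pseudo-label) pairs
    model = NaiveBayes().fit(docs, labels)
    for _ in range(max_rounds):
        confident, rest = [], []
        for sql in pool:
            p = model.prob_malicious(tokenize(sql))
            if p >= threshold:
                confident.append((sql, 1))
            elif p <= 1 - threshold:
                confident.append((sql, 0))
            else:
                rest.append(sql)
        if not confident:                    # nothing confident left: stop
            break
        pseudo += confident
        docs += [tokenize(s) for s, _ in confident]
        labels += [y for _, y in confident]
        pool = rest
        model = NaiveBayes().fit(docs, labels)
    return model, pseudo

# Toy data, invented for illustration (0 = benign, 1 = injection).
LABELED = [
    ("SELECT name FROM users WHERE id = 5", 0),
    ("SELECT price FROM items WHERE id = 12", 0),
    ("SELECT title FROM posts WHERE id = 7", 0),
    ("SELECT name FROM users WHERE id = 5 OR 1 = 1 --", 1),
    ("SELECT price FROM items WHERE id = 2 OR 3 = 3 --", 1),
    ("SELECT title FROM posts WHERE id = 9 OR 4 = 4 --", 1),
]
UNLABELED = [
    "SELECT email FROM accounts WHERE id = 42",
    "SELECT email FROM accounts WHERE id = 42 OR 8 = 8 --",
    "SELECT email FROM accounts WHERE id = 42 OR id = 43",
]

model, pseudo = self_train(LABELED, UNLABELED)
for sql, label in pseudo:
    print(label, sql)
```

On this toy data the first two unlabeled statements are pseudo-labeled in the first round, while the third stays below the confidence threshold and is left unlabeled. A threshold is only one possible selection rule; picking the top-k most confident samples per round is a common alternative.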