In view of the serious damage caused by ICS vulnerabilities, researchers have proposed many vulnerability discovery techniques, such as dynamic analysis, symbolic execution, fuzz testing and so on [15][16]. Compared with other techniques, fuzz testing requires only a little knowledge of the target, yet scales easily to large application programs with good reusability; it has therefore become the most popular vulnerability discovery approach at present, especially for ICS.
For network protocols, the most representative fuzzer is SPIKE [17]. Its principle is to describe the protocol as a block-sequence model and mutate the data within blocks, automatically partitioning the message's data structure and recalculating field lengths after mutation, which improves the effectiveness of test cases significantly; however, SPIKE's descriptive power for the constraint relationships within protocol messages is inadequate. Later, Sulley [18] and Peach [19] extended SPIKE's data model and added richer descriptions of the dependence relationships between data blocks. To provide a more flexible and accurate fuzzing framework, AFL [20] tracks the path coverage of each input through lightweight instrumentation of the source program: it randomly assigns an ID to each basic block and hashes the IDs along the path to decide whether a new path has been exercised, then uses any input that generates a new path as a seed. By exploiting this detailed in-program information, the method improves code coverage, but the hashing is prone to collisions, which can cause false negatives even when an input has actually reached a new path. Gan et al. proposed CollAFL [21], which assigns ID values to basic blocks using a greedy algorithm, among other methods, to ensure that every edge receives a distinct hash value, thereby avoiding collisions and judging path coverage more accurately. In addition, CollAFL sorts seeds by the number of untouched neighbor branches along a path, steering the fuzzer toward unexplored paths. Sanjay et al. proposed VUzzer [22].
VUzzer requires no prior knowledge of the application program or input format; it mainly uses control-flow and data-flow features obtained from static and dynamic analysis to assist mutation and seed-file selection, and drives the program to execute deeper by assigning a specific weight to each basic block.
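The collision problem described above can be made concrete with a minimal sketch of AFL-style edge hashing. The function and the block IDs below are hypothetical illustrations, not AFL's actual code; only the indexing scheme (current block ID XOR previous ID shifted right by one, in a fixed-size map) follows AFL's documented design.

```python
MAP_SIZE = 1 << 16  # AFL's default 64 KiB coverage bitmap

def edge_coverage(path, block_ids):
    """Record executed edges the way AFL's instrumentation does: every basic
    block carries a (randomly assigned) ID, and an edge is indexed in the
    bitmap as cur_id XOR (prev_id >> 1)."""
    hits = set()
    prev = 0
    for block in path:
        cur = block_ids[block] % MAP_SIZE
        hits.add(cur ^ (prev >> 1))
        prev = cur
    return hits

# With these (hypothetical) unlucky IDs, the edges a->b and b->c both hash
# to index 2, so a path exercising the new edge b->c is indistinguishable
# from one that only took a->b: exactly the false-negative case in the text.
ids = {"a": 1, "b": 2, "c": 3}
```

CollAFL's remedy is to choose the IDs so that every edge's hash is unique, which this random scheme cannot guarantee.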
For Industrial Internet protocols, most research has been realized by improving and extending existing network-protocol fuzz testing methods or tools. For instance, Roland et al. proposed ProFuzz [23], specific to the Profinet protocol stack. ProFuzz is mainly based on fuzzing tools built with Scapy [24] and also contains Sulley's fuzzer module, supporting five frame types: cyclic real-time, device discovery, configuration acquisition, application request, and precision time control. In Wurldtech's BlackPeer test framework [25], the key to constructing anomalous data is, given initial sample data, to define protocol messages in extended Backus-Naur form, generate a corresponding test-data grammar, and then mutate protocol data units using interactive semantics. On this basis, Yafeng Zhang et al. [9] improved the BNF grammar: they parsed industrial protocol samples into a mutation tree by introducing a tree-shaped structure and traversed all nodes of the tree for mutation. McCorkle et al. [26] fuzz tested the host (upper computer) software of ICS, constructing anomalous data by mutating response messages and keeping the protocol's communication state alive by forging check codes and counter values. SecuriTeam added the DNP3 protocol to the test tool beSTORM [27] and, during fuzz testing of Wireshark, found a denial-of-service vulnerability specific to DNP3. The Mu test suite from Mu Dynamics [28] is applicable to the IEC 61850, Modbus/TCP, and DNP3 protocols; it constructs abnormal message data with a structured grammar-analysis method and can also be extended to Industrial Internet protocols with unknown specifications through its Studio Fuzz add-on.
LZfuzz [29] targets SCADA protocols with unknown grammar. Its basic idea is to infer message tokens using the Lempel-Ziv compression algorithm and, combined with the token information, mutate the samples; however, the method can hardly handle protocols with complicated input grammar.
From the research above, we summarize the problems of current fuzz testing methods for Industrial Internet protocols as follows. (a) Low code coverage. Many bugs can be triggered only under a high path coverage rate. Although some methods have used the dynamic multi-modal sensor communication data processed in the program, it is not reflected directly in test-case generation, so most cases repeatedly execute the same paths as the input data and cover only a few paths. (b) No unified description model. Even protocols of the same type have different message forms, and current methods require building a separate data model for each, increasing the model-construction workload. (c) Insufficiently targeted test cases. Test cases that cannot pass the program's validation lead to many invalid tests, and a mutation strategy designed without fully considering the characteristics of Industrial Internet protocols can produce redundant cases when sample data are plentiful, hurting test efficiency.
A. Dynamic Taint Analysis
Dynamic taint analysis, proposed by J. Newsome [30], is a method that tracks information flow during program execution: it follows the propagation of the data under analysis through the system and obtains the target program's detailed processing of those data. In the proposed method, we use this technique to obtain the dynamic multi-modal sensor communication data in the program, providing the basis for the subsequent fuzz testing.
Dynamic taint analysis involves two parts, namely taint data identification and taint propagation monitoring [31]. The core of taint data identification is defining the taint data: if a data source is untrusted, the data it generates are taint data, which, as input, must be tracked and analyzed while the binary code runs. The message data constructed during fuzz testing can therefore be considered untrusted.
During program execution, taint data propagate through instructions: they may participate in operations as the source operands of arithmetic instructions or as the parameters of data-movement instructions, and the output of such an operation is generally derived from the tainted data and should itself be marked tainted. Figure 3 shows a simple example of the dynamic analysis process. In the figure, a1 and a2 are taint data; each arrow denotes the execution of an operation, with its tail at the source operand and its head at the operation's output. During execution, variables V1 ~ V8 are affected by taint data a1 and a2. If a variable's set of taint sources A is empty, the variable is considered untainted. As the figure shows, the taint-source set of the final output V8 is the union of the taint-source sets of its operands V3 and V6.
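The union rule above can be sketched in a few lines. The variable names mirror the fragment of Figure 3 described in the text (a1 taints V3, a2 taints V6, and V8 is computed from V3 and V6); the exact operand graph of the figure is an assumption here.

```python
def propagate(taint, output, operands):
    """Data-flow taint rule: the taint-source set of an operation's output
    is the union of its source operands' taint-source sets (an absent or
    empty set means the operand is untainted)."""
    taint[output] = set().union(*(taint.get(op, set()) for op in operands))
    return taint[output]

# Hypothetical fragment of Figure 3: taint sources a1 and a2 reach V3 and
# V6, so V8 = op(V3, V6) inherits both sources.
taint = {"V3": {"a1"}, "V6": {"a2"}}
propagate(taint, "V8", ["V3", "V6"])
```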
With dynamic taint analysis, we mainly aim to analyze the data flow and the control flow, also known as the explicit flow and the implicit flow, during the execution of binary code. The former refers to data dependence: variable V1's taint information flows directly to V2 through an assignment or arithmetic operation, as shown in Figure 3. The latter corresponds to the control dependence of conditional branches: V1's taint information flows to V2 indirectly through an associated condition expression. For clarity, we analyze the sample code shown in Figure 4.
In this example, the code reads the input variable x, generates the message msg, and submits it via post. Here x is identified as taint data. Under data-flow analysis alone, msg is assigned only the constants 'a' or 'b', and constants are never tainted, so msg would be marked untainted. However, the value of msg also depends on whether the conditional branches x=='a' and x=='b' are true or false; that is, msg is control-dependent on x, which constitutes implicit taint flow. In practice, such code poses a security risk when used to implement network-protocol communication.
When msg is submitted, an attacker may intercept it and deduce the value of the input x, so msg is definitely taint data and must be tracked and monitored; ignoring its control dependence would produce false negatives. By contrast, the value of url is independent of branches L3 and L7, so no harm results from marking it untainted.
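Figure 4 itself is not reproduced in the text; the following is a hypothetical reconstruction consistent with its description (tainted input x, constant assignments to msg inside branches on x, a url unrelated to those branches), showing why pure data-flow tracking misses the leak.

```python
def build_request(x):
    """msg is only ever assigned constants, so data-flow-only taint tracking
    marks it untainted; yet which constant it holds is decided by branches
    on the tainted input x -- the implicit (control) flow from the text."""
    msg = ""
    if x == "a":        # control dependence on tainted x (branch "L3")
        msg = "a"       # constant assignment: no explicit data flow from x
    elif x == "b":      # branch "L7"
        msg = "b"
    url = "http://host.invalid/post"  # independent of the branches on x
    return url, msg     # an observer of msg can recover the value of x
```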
B. Method and Principle
Given the Industrial Internet protocol implementation program, this section obtains the related dynamic multi-modal sensor communication data through dynamic taint analysis to guide test-case generation. Figure 5 shows the overall test process. The proposed method takes many protocol messages as input to avoid the shortage of test cases caused by a single sample.
Definition 1 (Dynamic interactive fields): In a given program execution, for conditional branch xi (the i-th execution of conditional branch x), DIF(xi)={Fj | Fj is a protocol field affecting the execution of xi}, where DIF(xi) is a set of protocol fields.
Definition 2 (Control-dependence relationship): In a given program execution, if a conditional branch yj decides whether xi is executed, we say that xi dynamically depends on yj's control and write CDC(xi)={⟨yj, T|F⟩}, where CDC(xi) is the set of branches meeting this condition and T|F records which outcome of yj leads to xi; Constraint(xi) denotes the constraint of conditional branch xi.
Definition 3 (Dynamic control flow diagram): For each program input, the execution path can be expressed as a dynamic control flow diagram in which each conditional branch xi is a node, DIF(xi) is the node's set of dynamic interactive fields, and CDC(xi) supplies the edges representing the dependence relationships between conditional branches.
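The three definitions can be collected into one illustrative node structure. All names below (the class, its attributes, and the sample values) are assumptions for exposition, not part of the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class BranchNode:
    """One node x_i of the dynamic control flow diagram of Definition 3."""
    branch_id: str                            # x_i: i-th execution of branch x
    dif: set = field(default_factory=set)     # DIF(x_i): fields tainting x_i
    cdc: list = field(default_factory=list)   # CDC(x_i): (parent, True/False) edges
    constraint: str = ""                      # Constraint(x_i)

# A branch reached via y_1's true edge and influenced by a "length" field:
x1 = BranchNode("x1", dif={"length"}, cdc=[("y1", True)],
                constraint="length <= 255")
```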
We determine the input protocol fields affecting conditional branches through dynamic taint analysis: each protocol input field is marked as tainted, and the tainted data flow is tracked during program execution. The control-dependence relationships of the program are captured with the algorithm proposed in ref. [32]. Note that Industrial Internet protocol messages generally carry a checksum (such as a CRC), which would cause taint explosion if taint were propagated along the control flow: because every field feeds into the checksum, most conditional branches would be marked tainted, making it hard to pinpoint the specific fields that affect each branch. Moreover, the proposed fuzzing method already accounts for control-flow taint indirectly through the control-dependence relationships. Therefore, when applying dynamic taint analysis, we track taint propagation only in the data flow rather than in the control flow.
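The taint-explosion argument is easy to see in a standard checksum. CRC-16/MODBUS, widely used in industrial protocols, folds every input byte into the running value, so any branch that compares the received checksum would appear control-dependent on all message fields if taint flowed through it. (The implementation below is a textbook sketch of that CRC variant, not code from the paper.)

```python
def crc16_modbus(data: bytes) -> int:
    """CRC-16/MODBUS (reflected poly 0xA001, init 0xFFFF): every byte of
    `data` influences the result, so propagating taint through this loop
    would taint the checksum with the union of ALL field taints."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc
```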
Algorithm 1 details the proposed fuzzing method. The main function dynamicFuzz takes program P, protocol grammar G, and protocol message I as input, can process multiple protocol inputs, and outputs the exception information triggered. Lines 1~2 initialize the data structures. Generated test cases, which also serve as new inputs, are stored in a queue for convenient recursive execution. Line 5 applies dynamic taint analysis to the program and protocol input to obtain the sets DIF(xi) and CDC(xi) of each conditional branch, and then constructs the corresponding dynamic control flow diagram with DIF(xi) as nodes and CDC(xi) as edges. Line 6 starts the fuzzing operation from the beginning of an executable path in the control flow diagram. Line 8 derives Constraint(xi), the constraint condition of node xi, as a parameter of the test-case generation algorithm.
Line 16 means that after a test case is generated, we use it as the new input: the values of the protocol fields in DIF(xi) of the original input are replaced, and all fields related to Constraint(xi) are modified to satisfy the constraint. To ensure the test case executes deeper into the program and improves the probability of finding new execution paths, the remaining protocol fields, unrelated to DIF(xi) and Constraint(xi), are also kept valid according to the protocol grammar. For instance, suppose the two fields start address and end address change from their original values 1 and 2 to 3 and 6. Even if there is no constraint condition, the corresponding address-data fields are regenerated to match the new range. If instead fuzzification turns the two addresses into the invalid values 6 and 3 (violating start address < end address), the fields unrelated to DIF(xi) and Constraint(xi) still keep their normal values.
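The field-substitution-plus-repair step can be sketched as follows. This is a hedged illustration of what makeInput in Algorithm 1 does; the dict encoding and the `repairs` callbacks are assumptions, as is the exact shape of the address example.

```python
def make_input(original, test_case, repairs):
    """Sketch of Algorithm 1's makeInput: overwrite the fuzzed DIF(x_i)
    fields, then recompute every field tied to Constraint(x_i) so that the
    fields unrelated to the mutation stay valid under the protocol grammar."""
    msg = dict(original)
    msg.update(test_case)           # substitute the fuzzed field values
    for fld, fix in repairs.items():
        msg[fld] = fix(msg)         # e.g. regrow the address payload
    return msg

# Addresses grow from (1, 2) to (3, 6), so the dependent "data" field is
# regenerated to cover the new range while everything else is untouched.
base = {"start": 1, "end": 2, "data": [0, 0]}
fuzzed = make_input(base, {"start": 3, "end": 6},
                    {"data": lambda m: [0] * (m["end"] - m["start"] + 1)})
```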
Lines 17~20 monitor whether any abnormal state, such as a crash or memory leak, occurs while the test case executes. Although each test case is used as a new input, to limit the cost of program execution the proposed method does not repeatedly re-extract the dynamic multi-modal sensor communication data of every conditional branch. During test-case execution, lines 21~24 store the code paths traversed, push the unparsed conditional branches into the input queue, and rank them by the number of new branches. Line 25 stores each node's dynamic multi-modal sensor communication data sequence so that lines 9~13 can judge whether the current node has already generated test cases: when the list already holds a data sequence and generated test cases corresponding to the current node xi's DIF(xi) and Constraint(xi), the algorithm skips test-case generation for that node. Line 29 stores the dynamic control flow diagram in the queue as a probed path after the program's code paths complete.
Algorithm 1. A Fuzz Test Algorithm Combined with Dynamic Multi-modal Sensor Communication Data
|
Input: Program P, Protocol Grammar G and Protocol Message I
Output: Abnormal Information
Function dynamicFuzz(P, G, I)
- probedPath, inputQueue, checklist ← empty()
- inputQueue.push(I)
- while inputQueue.notEmpty() do
-   input ← inputQueue.pop()
-   dcfg ← executionAnalysis(P, input)
-   node ← dcfg.start(probedPath)
-   while node ≠ NULL do
-     c ← dcfg.getConstraint(node.CDC)
-     if checklist.find(node.DIF, c) && node.true, node.false ≠ NULL then
-       node ← dcfg.next()
-       continue
-     end
-     tcList ← makeTestCases(node.DIF, c, G)
-     for each tc in tcList do
-       input' ← makeInput(input, tc.value, c)
-       res ← executionMonitor(P, input')
-       if res.exception then
-         return getExceptionInfo(res)
-       end
-       if probedPath.isNew(res.pathInfo) then
-         inputQueue.push(input')
-       end
-     end
-     checklist.add(node.DIF, c)
-     node ← dcfg.next()
-   end
- end
|
To reflect the dynamic multi-modal sensor communication data extracted from the program directly in test-case construction, the proposed method focuses on mutation-assisted generation, guiding the test-case grammar for node xi with the node's corresponding data. Algorithm 2 describes the process. The generation function makeTestCases takes as input the protocol fields (the elements of set DIF(xi) in Definition 1), the rule constraint c, and the protocol grammar G, and outputs the set of test cases tcList. The key is to generate test cases from each node's valid grammar and its reversed grammar. Line 2 derives the valid grammar: the grammar of each protocol field deduced from the constraint condition serves as the node's valid grammar. This is feasible because the valid grammar of node xi is necessarily a subset of protocol grammar G; by Definition 3.2, under Constraint(xi) the valid grammar applies to all protocol fields in DIF(xi). Line 3 handles fields with multiple valid grammars by constructing their combinations and storing them in the set of test cases. For instance, for node xi with DIF(xi)={Fa,Fb} and valid grammars Fa=(0|1) and Fb=2 deduced from Constraint(xi), the combined grammars are (Fa=0, Fb=2) and (Fa=1, Fb=2). Lines 4~9 then reverse the valid grammar of each field while still satisfying Constraint(xi), combine the result with the other fields' valid grammars to form fuzzified grammars, and add them to the set of test cases.
When no applicable test grammar is found for DIF(xi), the case data are generated by mutation, using methods such as random bit flips, extreme-value substitution, and boundary-value substitution.
Algorithm 2. A Test Case Generation Algorithm Combined with Dynamic Multi-modal Sensor Communication Data
|
Input: Protocol Field fields, Rule Constraint c and Protocol Grammar G
Output: Set of Test Cases tcList
Function makeTestCases(fields, c, G)
- tcList ← empty()
- validGrams ← extractGrams(fields, c, G)
- tcList ← combination(validGrams) ˄ c
- for each g in validGrams do
-   fg ← ~g ˄ c
-   fuzzyGram ← fg ˄ (other g' in validGrams)
-   tcList.add(fuzzyGram)
- end
- return tcList
|
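The combination and reversal steps of Algorithm 2 can be sketched concretely. Here a "grammar" is simplified to a set of allowed values per field, and `domains` (each field's full value range, which the listing leaves implicit in ~g) is an assumption of this sketch.

```python
from itertools import product

def combine(valid_grams):
    """Combination step: valid_grams maps each field in DIF(x_i) to the
    values its valid grammar allows under Constraint(x_i); emit one test
    case per element of the cross product."""
    names = sorted(valid_grams)
    return [dict(zip(names, vals))
            for vals in product(*(sorted(valid_grams[n]) for n in names))]

def fuzz_cases(valid_grams, domains):
    """Reversal step (~g in the listing): invert one field's valid grammar
    at a time while the remaining fields keep valid values."""
    cases = []
    for name in sorted(valid_grams):
        others = {n: v for n, v in valid_grams.items() if n != name}
        for bad in sorted(domains[name] - valid_grams[name]):
            for base in combine(others):
                cases.append({**base, name: bad})
    return cases
```

For the worked example in the text, valid grammars Fa=(0|1) and Fb=2 yield exactly the combinations (Fa=0, Fb=2) and (Fa=1, Fb=2), while the reversal step pairs out-of-grammar values of one field with valid values of the other.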