REmatch: High-performance Regular Expression
Matching for Network Security
Petabi, Inc.
Contact: Victor Valgenti ([email protected])
May 6, 2015
2082 Business Center Drive #170 Irvine CA 92606 ([email protected])
1 Regular Expression Matching in Network Security
Regular expression matching is an important part of most network security solutions.
Unfortunately, regular expressions can be resource-intensive to match, and most regular
expression engines do not scale with the number of regular expressions. This creates a problem
for a network security system if an attacker is able to target the regular expression engine with
worst-case traffic [4, 8], which can place an excessive burden on the matching system and
potentially even cause lost packets. Further, because regular expressions are deemed
resource-intensive, many network security systems use arbitrary, redundant filters just to reduce
the amount of traffic forwarded to regular expression matching. Our philosophy, however, is to
make regular expression matching efficient enough to run at line speed, so that filtering to avoid
regular expressions becomes obsolete and regular expressions can be used for all pattern
matching. This is what we call total inspection with regular expressions.
Our patented regular expression engine, REmatch, can match at speeds up to 2,000 times
faster than other matching libraries for large regular expression sets. REmatch takes advantage
of the parallelism common in general purpose processors and uses a traversal-friendly layout of
the matcher, which makes these matching speeds possible on general purpose processors.
We believe in total inspection of network traffic using regular expressions, eliminating the need
for many of the complex filters used in the pipeline of many network security applications, as
those filters can often be eliminated or replicated with regular expressions. Even better,
REmatch consists of a C++ automata construction library and a C matching library that can
easily be used in place of other regular expression matching libraries for immediate
improvements in matching speed. REmatch maintains line-speed matching for thousands of rules
and scales linearly with the number of cores used. REmatch makes total inspection of network
traffic against a set of hundreds, or thousands, of regular expressions a reality without the need
for expensive hardware or specialized chipsets.
2 REmatch Performance
Figure 1 illustrates the basic strengths of the REmatch regular expression matching engine. For
this example, three sets of regular expressions were used: ClamAV, Petabi, and Snort (please
see Section 4.3 for specifics concerning test setup and execution). First, Figure 1 shows that
REmatch scales linearly with the number of regular expressions. When the number of regular
expressions is small (fewer than 3), PCRE matches exceptionally fast. However, even at only 5
regular expressions REmatch is 70-500% faster, while for 1,245 rules REmatch is from 243 to
1,343 times faster than PCRE, and at 2,264 rules REmatch is nearly 2,000 times faster. Further,
even with more than two thousand regular expressions, REmatch maintains greater than 1 Gbps
on a single core. In this manner, REmatch can maintain total inspection of traffic and still meet
line-speed requirements.
To demonstrate the versatility and power of REmatch we created a test where we substituted PCRE calls in Snort with calls to the REmatch libraries (please refer to Section 4.4 for
details). The performance comparison is summarized in Table 1 and the trends mirror those
from Figure 1.
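The speedup reported in Table 1 is simply the ratio of the two throughput columns. As a quick check, the short Python sketch below recomputes that ratio from the throughput values reported in Table 1; the variable and function names are ours and purely illustrative.

```python
# Throughput values (Mbps) copied from Table 1.
regex_counts = [1, 100, 200, 300, 500, 700, 779]
rematch_mbps = [440, 422, 422, 410, 390, 350, 334]
pcre_mbps = [666, 28, 12.2, 8, 4.8, 2.6, 1.1]


def speedup(rematch, pcre):
    """Return how many times faster REmatch is than PCRE."""
    return rematch / pcre


speedups = [speedup(r, p) for r, p in zip(rematch_mbps, pcre_mbps)]

for n, s in zip(regex_counts, speedups):
    print(f"{n:>4} regex: {s:.2f}x")
```

A ratio below 1, as with a single regular expression, means PCRE was the faster engine for that configuration.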
[Figure 1 plot omitted: Throughput in Mbps (log scale, ticks from 0.1 to 10,000) versus Total RE
(log scale, ticks from 1 to 10,000) for ClamAV-REmatch, Petabi-REmatch, Snort-REmatch,
ClamAV-PCRE, Petabi-PCRE, and Snort-PCRE, with annotated speedups of 4x, 37x, 206x, 382x,
and 1,900x.]
Figure 1: REmatch performance vs PCRE performance as the number of regular expressions increases.
Table 1: Drop-in Replacement of PCRE with REmatch in Snort.
# Regex  Snort with REmatch (Mbps)  Snort with PCRE (Mbps)  Speedup
1        440                        666                     0.66
100      422                        28                      15
200      422                        12.2                    35
300      410                        8                       51
500      390                        4.8                     81
700      350                        2.6                     135
779      334                        1.1                     304
3 REmatch to enhance your products
As we have demonstrated, REmatch can boost performance anywhere you match multiple
regular expressions against a set of data. REmatch is versatile enough that it can be added to
current products with minimal changes to the code base. Alternatively, it could serve as a new
core matching engine. If you are interested in REmatch, please contact us and we will set up
further evaluations and answer any questions you may have concerning our products. Finally,
we are flexible with licensing terms and willing to work with partners. We encourage you to
seriously consider REmatch to improve the regular expression matching in your products.
4 Test Details
This section explains the details of the test environment, test methodology, and data used.
4.1 Explanation of REmatch
REmatch [7] is our patented regular expression matching engine. REmatch makes use of the
parallelism inherent in most commodity general purpose processors while using a traversal-
and architecture-friendly memory layout. One of the primary goals of REmatch is to shrink
the working set so that it can reside entirely in cache memory. Thus, the memory alignment
of the REmatch matching automata is designed to make the best use of cache lines and locality
during matching.
REmatch further improves performance by using Non-deterministic Finite Automata
(NFA) rather than Deterministic Finite Automata (DFA) during matching. Without delving
into the complexities involved, DFA simply grow too large to be practical for any set of regular
expressions larger than a couple hundred. NFA, however, scale linearly with the number of
regular expressions. This is one of the reasons why REmatch scales far better with the number
of regular expressions.
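The difference in growth is easy to reproduce on a toy example. The Python sketch below is our illustration and not REmatch's construction: it assumes k hypothetical patterns of the form x_i.*y_i searched anywhere in the input, steps the combined NFA implicitly, and counts the DFA states reachable through the standard subset construction. The combined NFA needs 2k + 1 states, while the DFA has to remember which prefixes x_i have already been seen and therefore has at least 2^k states.

```python
from collections import deque


def nfa_state_count(k):
    """One shared start state plus two states (prefix seen, accept) per pattern."""
    return 1 + 2 * k


def nfa_step(active, symbol):
    """Advance the combined NFA for the patterns x_i.*y_i by one input symbol.

    States are "start", ("seen", i) once x_i was read, and ("accept", i)
    once a later y_i was read. Symbols are ("x", i) or ("y", i).
    """
    kind, idx = symbol
    nxt = {"start"}  # unanchored search: the start state is always active
    for state in active:
        if state == "start":
            if kind == "x":
                nxt.add(("seen", idx))
        elif state[0] == "seen":
            nxt.add(state)  # the '.*' loop keeps this state alive
            if kind == "y" and state[1] == idx:
                nxt.add(("accept", idx))
        # accept states have no outgoing transitions
    return frozenset(nxt)


def dfa_state_count(k):
    """Count DFA states reachable via subset construction for k patterns."""
    alphabet = [(kind, i) for kind in ("x", "y") for i in range(k)]
    start = frozenset({"start"})
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            nxt = nfa_step(current, symbol)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)


if __name__ == "__main__":
    for k in (1, 2, 4, 8, 10):
        print(f"k={k:>2}  NFA states={nfa_state_count(k):>3}  "
              f"DFA states={dfa_state_count(k)}")
```

Real rule-sets are far less regular than this toy family, so the sketch shows only the shape of the problem: linear growth for the NFA against exponential growth for the DFA.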
4.2 Explanation of PCRE
The Perl Compatible Regular Expression (PCRE) [2] library is a common, full-featured regular
expression matching library. We compare against this library as it is one of the best libraries
freely available. We note that all of the regular expressions are compiled prior to matching.
Thus, the times for PCRE only consider matching time, not construction or destruction time.
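To make the measurement boundary concrete, the sketch below times only the matching loop, after all patterns have been compiled, and converts the result to Mbps. It uses Python's standard `re` module purely as a stand-in for a matcher that handles one pattern at a time; the function name `match_only_throughput_mbps` and the sample patterns are illustrative assumptions, not part of the PCRE or REmatch APIs.

```python
import re
import time


def match_only_throughput_mbps(patterns, packets):
    """Time matching only; compilation happens before the clock starts."""
    compiled = [re.compile(p, re.DOTALL) for p in patterns]  # not timed

    total_bits = 8 * sum(len(pkt) for pkt in packets)
    start = time.perf_counter()
    for pkt in packets:
        for rx in compiled:  # each pattern is matched separately
            rx.search(pkt)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero

    return total_bits / elapsed / 1e6  # megabits per second


if __name__ == "__main__":
    sample_patterns = [rb"\x4d\x5a.{2}\x90", rb"GET\s+/\S*\.php"]
    sample_packets = [bytes(79) for _ in range(10_000)]
    mbps = match_only_throughput_mbps(sample_patterns, sample_packets)
    print(f"{mbps:.1f} Mbps")
```

Because every pattern is applied separately, the work per packet in this stand-in grows with the number of patterns, which is the behaviour the comparison in this paper is about.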
4.3 Throughput and Scalability of REmatch Evaluation
Figure 1 illustrates both the high performance and scalability of REmatch. To generate the
data for this graph, we matched against sample-sets of regular expressions containing 1, 2, 5,
10, 50, 100, 200, 400, 600, 800, 1,000, 1,200, 1,245, 2,264, and 30,596 regular expressions
respectively. To create a sample-set of regular expressions for each point, x rules were randomly
selected (without replacement) from the targeted rule-set. This sample rule-set was then used
by the respective matching engine to match against the traffic three separate times to compute
an average throughput for that sample rule-set. This process was repeated 10 times to determine
the average throughput of all ten sample rule-sets, and the result is displayed in Figure 1 and
reiterated in Table 2. This process was repeated once for each regular expression set for each
matching engine (REmatch or PCRE). Note that if the sample size was larger than the total
number of rules in the rule-set, then that data point is set to N/A. Further, no sampling is done
when the size of the sample is equal to the size of the rule-set.
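The sampling procedure can be summarized in a few lines. The following Python sketch is our illustration of the steps described above, not the actual test harness: `measure_mbps` is a hypothetical callback standing in for one full run of a matching engine over the traffic capture, and the function and parameter names are assumptions.

```python
import random
from statistics import mean


def average_throughput(rule_set, sample_size, measure_mbps,
                       num_samples=10, runs_per_sample=3, rng=None):
    """Average throughput over random sample rule-sets, or None for N/A."""
    rng = rng or random.Random()
    if sample_size > len(rule_set):
        return None  # reported as N/A
    if sample_size == len(rule_set):
        # No sampling: the full rule-set is the only possible sample.
        samples = [list(rule_set)]
    else:
        # random.sample draws without replacement.
        samples = [rng.sample(rule_set, sample_size) for _ in range(num_samples)]

    per_sample = []
    for sample in samples:
        # Match the same sample several times and average the runs.
        runs = [measure_mbps(sample) for _ in range(runs_per_sample)]
        per_sample.append(mean(runs))
    return mean(per_sample)
```

With the default arguments, one data point costs 30 matching runs (10 samples, 3 runs each), or 3 runs when the sample is the whole rule-set.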
Table 2: REmatch vs PCRE throughput comparison in tabular format (values in Mbps)
# Regex  ClamAV-REmatch  Petabi-REmatch  Snort-REmatch  ClamAV-PCRE  Petabi-PCRE  Snort-PCRE
1        1,736.81        2,054.31        2,831.48       3,406.39     4,633.56     5,002.04
2        1,921.04        2,201.65        2,037.06       2,124.09     1,697.65     3,911.48
5        1,991.06        2,325.42        2,280.55       972.58       410.14       1,436.80
10       2,312.89        2,311.58        2,316.74       517.42       193.78       504.21
50       2,191.82        2,273.08        2,268.02       109.19       28.40        92.37
100      2,043.93        2,202.18        2,208.63       54.91        15.55        38.08
200      1,748.67        2,162.94        2,132.88       27.47        7.51         17.13
400      1,442.60        1,972.59        2,008.29       13.63        3.70         8.78
600      1,291.76        1,805.33        1,811.03       9.09         2.52         6.01
800      1,218.31        1,758.72        1,805.36       6.80         1.88         4.47
1,000    1,122.65        1,711.88        1,668.91       5.45         1.44         3.54
1,200    1,069.19        1,622.91        1,622.50       4.53         1.24         3.02
1,245    1,063.23        1,590.23        1,587.03       4.36         1.18         2.88
2,264    918.57          1,287.03        N/A            2.40         0.65         N/A
30,596   437.09          N/A             N/A            0.17         N/A          N/A
4.3.1 Regular Expression Sets for this Test
Table 3 illustrates specific statistics concerning each of the rule-sets. Aside from the Average
Length of the rules and the name of the rule-set, each entry is the total occurrence of the
investigated feature within the set. The greater the number of these features the more complex
the rules. These data provide a statistical view of the regular expressions involved. Each rule-set
is described in more detail below. The ClamAV regular expression set and Snort regular
expression set can be provided upon request. The Petabi rules require a Non-disclosure
Agreement prior to release.
1. ClamAV: The ClamAV [5] regular expression rule-set represents the ClamAV database as
of January 8, 2015. The ClamAV format of regular expressions is very close to normal
PCRE. Thus, we created a script that converted them from the ClamAV format to a standard PCRE format. The large average length of the regular expressions stems from the fact
that the rules are denoted as binary strings and the conversion process then uses PCRE’s
‘\x’ notation. Another interesting note is that ClamAV rules represent mostly fixed binary
strings.
2. Petabi: These rules are part of our business and were crafted specifically for REmatch.
They are currently used in Network Intrusion Detection Systems deployed in client
businesses and handle line speeds without issue. These rules were crafted in-house and/or
gathered from many sources, some of which are covered by strict Non-disclosure
Agreements. As such, these rules cannot be provided until such an agreement has been
signed by all parties privy to the rules.
3. Snort: The Snort regular expressions were harvested from the Sourcefire Vulnerability
Research Team Snort [6] 2.9.7.2 Registered Users rule-set for April 22, 2015. Where
possible, the rules represent the merging of content and PCRE tags into a single regular
expression. This was done since Snort rules employ a pipeline approach such that later
tags rely on earlier tags to filter content to reduce workload. For rules where there were
no such tags, other Snort rule features were converted to regex (where possible) and
prepended to the regular expression to better simulate the intent of the Snort rule.
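The ClamAV conversion script mentioned in item 1 is not reproduced in this paper. As a rough illustration only, the Python sketch below converts a small subset of ClamAV's hexadecimal signature syntax (hex byte pairs, the `??` any-byte wildcard, `*`, and byte-count ranges such as `{n}` and `{n-m}`) into PCRE-style `\x` notation. The function name and the choice to map wildcards to `[\x00-\xFF]` are our assumptions; nibble wildcards, alternation, and other ClamAV features are deliberately left out.

```python
import re

ANY_BYTE = r"[\x00-\xFF]"

# One token of the supported subset: ??, *, {n}, {n-}, {-m}, {n-m}, or a hex pair.
_TOKEN = re.compile(r"\?\?|\*|\{(\d*)(-?)(\d*)\}|[0-9A-Fa-f]{2}")


def clamav_hex_to_pcre(signature):
    """Convert a simple ClamAV hex signature body to a PCRE-style pattern."""
    parts = []
    pos = 0
    while pos < len(signature):
        m = _TOKEN.match(signature, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at offset {pos}")
        token = m.group(0)
        if token == "??":
            parts.append(ANY_BYTE)            # exactly one arbitrary byte
        elif token == "*":
            parts.append(ANY_BYTE + "*")      # any number of arbitrary bytes
        elif token.startswith("{"):
            low, dash, high = m.group(1), m.group(2), m.group(3)
            if dash:                          # {n-m}, {n-} or {-m}
                parts.append(f"{ANY_BYTE}{{{low or '0'},{high}}}")
            elif low:                         # {n}
                parts.append(f"{ANY_BYTE}{{{low}}}")
            else:
                raise ValueError(f"empty range at offset {pos}")
        else:
            parts.append("\\x" + token.lower())  # literal byte
        pos = m.end()
    return "".join(parts)


if __name__ == "__main__":
    print(clamav_hex_to_pcre("4d5a??90*00{2-4}ff"))
```

Since each signature byte becomes a four-character `\xHH` escape, a conversion along these lines is consistent with the large average rule length reported for the ClamAV set.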
Table 3: Regular Expression Set Breakdown. AVG length is the average length in number
of characters of all of the regular expressions. Wildcard characters are the ‘.’ character and
the character class [\x00-\xFF]. Repetition represents ‘?’ (zero or one), ‘*’ (zero or many),
and ‘+’ (one or many). Counting represents any counted repetition like ‘a{1,5}’. Alternation
indicates the number of times alternate branches are used like ‘a(b|c)d’. Types represent one
of the following PCRE types: \d (any digit), \w (any word char), \s (any whitespace), \D
(any non-digit), \W (any non-word char), and \S (any non-whitespace). Classes represent any
character class like: [abc].
RE Set  Total RE  Avg Length  Alternation  Wildcards  Types  Classes  Repetition  Counting
ClamAV  30,596    270.1       3            2,349      0      0        1,226       212
Petabi  2,264     89.6        177          1,828      3,821  1,525    7,836       890
Snort   1,245     100.8       400          845        3,774  973      6,090       738
4.3.2 Test Traffic Data for this Test
For traffic we used a synthetic traffic capture with 100,000 packets, all with random data. Each
packet carries a fixed 79 bytes of data. This size was taken as the average data size from
several publicly available packet captures used in Intrusion Detection System evaluation [1, 3].
The count of 100,000 packets was chosen arbitrarily as sufficient to test the system and small
enough to do so quickly. Figure 2 shows that the distribution of bytes is even across the total
values possible for each byte, illustrating the uniformly random nature of the data. Tests were
performed by reading the packet capture through the libpcap library. All tests were performed
with the same traffic capture.
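A capture with these properties is easy to approximate. The sketch below, which is our illustration rather than the tool used for the tests, generates fixed-size payloads of uniformly random bytes and computes the empirical CDF of byte values, the quantity plotted in Figure 2. The function names and the fixed seed are assumptions; writing the payloads into a pcap file is omitted.

```python
import random
from collections import Counter


def random_payloads(num_packets=100_000, payload_len=79, seed=0):
    """Return a list of fixed-length payloads filled with random bytes."""
    rng = random.Random(seed)
    return [
        rng.getrandbits(8 * payload_len).to_bytes(payload_len, "little")
        for _ in range(num_packets)
    ]


def byte_value_cdf(payloads):
    """Empirical CDF: cdf[v] is the fraction of bytes with value <= v."""
    counts = Counter()
    for payload in payloads:
        counts.update(payload)  # iterating bytes yields integers 0..255
    total = sum(counts.values())
    cdf, running = [], 0
    for value in range(256):
        running += counts[value]
        cdf.append(running / total)
    return cdf


if __name__ == "__main__":
    cdf = byte_value_cdf(random_payloads())
    for value in (63, 127, 191, 255):
        print(f"P(byte <= {value}) = {cdf[value]:.3f}")
```

For uniformly random data the CDF is close to a straight line from 0 to 1 across the 256 byte values.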
4.3.3 Hardware Setup
All tests were run on a single core in a single thread.
1. Operating System: FreeBSD 10.1 v2-RELEASE
2. CPU: Intel Core i5-3570K Ivy Bridge Quad-Core 3.4GHz (6MB L3 Cache)
3. RAM: 16GB
[Figure 2 plot omitted: CDF (0 to 1) versus Byte Value (0 to 250) for the random traffic data.]
Figure 2: Diversity of data byte values.
4.4 Drop-in Replacement of PCRE in Snort with REmatch Evaluation
For this evaluation we examined the impact of using REmatch in Snort rather than PCRE
matching. The idea was to demonstrate a drop-in replacement of REmatch into Snort with
minimal changes to Snort overall. It was a simple matter to replace pcre_compile calls in Snort
with the REmatch automata construction calls. Then, during matching, rather than using
pcre_search we used the library call to the REmatch matcher. The primary difference with this
approach is that all regular expressions are matched at once, with any matches cached and
returned as Snort’s list of possible matches is traversed. As the data in Table 1 show, if
there are more than one or two regular expressions to match then REmatch can provide vastly
superior matching speed.
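The difference between the two call patterns can be sketched as follows. This is an illustration in Python, not Snort or REmatch code: the class `CachedMultiMatcher`, its methods, and the use of the standard `re` module as a stand-in for a multi-pattern matcher are all assumptions made for the example. The point is only the control flow: one matching pass per packet, followed by cheap cache lookups as the rule list is traversed.

```python
import re


class CachedMultiMatcher:
    """Match every rule once per packet, then answer per-rule queries from a cache."""

    def __init__(self, rules):
        # rules maps a rule id to a bytes pattern; everything is compiled up front.
        self._compiled = {
            rid: re.compile(pat, re.DOTALL) for rid, pat in rules.items()
        }
        self._cached_payload = None
        self._matched_ids = frozenset()
        self.scans = 0  # number of full matching passes, kept for illustration

    def _scan(self, payload):
        # Stand-in for a single multi-pattern pass over the payload.
        self._matched_ids = frozenset(
            rid for rid, rx in self._compiled.items() if rx.search(payload)
        )
        self._cached_payload = payload
        self.scans += 1

    def rule_matches(self, rule_id, payload):
        """Per-rule query, analogous to one regex call during rule traversal."""
        if payload is not self._cached_payload:
            self._scan(payload)  # first query for this packet: match all rules
        return rule_id in self._matched_ids


if __name__ == "__main__":
    matcher = CachedMultiMatcher({1: rb"ab.*cd", 2: rb"xyz"})
    packet = b"..ab..cd.."
    print(matcher.rule_matches(1, packet), matcher.rule_matches(2, packet))
    print("full passes:", matcher.scans)
```

The per-rule call keeps the shape of a single-pattern API, which is what lets the calling code stay largely unchanged.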
4.4.1 Regular Expression Set for this Test
For this evaluation 779 regular expressions were selected from the Sourcefire Vulnerability
Research Team Snort [6] 2.9.7.0 Registered Users rule-set. A regular expression was selected
when it defined the intent of the entire rule. For example, a Snort rule might
have a content option of ‘ab’ and another of ‘cd’. If the PCRE option for this rule contained
both ‘ab’ and ‘cd’, then that regular expression was included in the regular expression set.
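The selection criterion can be expressed as a simple check. The sketch below is our simplified reading of it: a rule's regular expression is kept when every content string of the rule appears literally in the PCRE pattern. The function name is hypothetical, and real Snort rules would need additional handling (escaped characters, hexadecimal content, and option modifiers) that is not attempted here.

```python
def pcre_covers_contents(pcre_pattern, contents):
    """Return True if every content string appears literally in the pattern."""
    return all(content in pcre_pattern for content in contents)


# Example rule options, mirroring the 'ab' / 'cd' case in the text.
selected = pcre_covers_contents(r"ab.*cd", ["ab", "cd"])  # regex is kept
rejected = pcre_covers_contents(r"ab\d+", ["ab", "cd"])   # 'cd' is missing

print(selected, rejected)
```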
4.4.2 Test Traffic Data for this Test
For this evaluation, traffic was generated using a Spirent SPT-2000 traffic generator. TCP
packets of 1,400 bytes with random payloads were generated and sent across network links to
the target box at 1 Gbps.
4.4.3 Hardware Setup
All tests show numbers for a single core and single thread.
1. Operating System: Ubuntu 11.04 (Kernel 2.6.38-16)
2. CPU: Intel dual-Core i3-3220 3.30GHz (3MB L3 Cache)
3. RAM: 2GB
References
[1] National cyberwatch center mid-atlantic collegiate cyber defense competition, 2012.
[2] P. Hazel. Perl compatible regular expressions. http://www.pcre.org/.
[3] B. Sangster, T. J. O’Connor, T. Cook, R. Fanelli, E. Dean, W. J. Adams, C. Morrell, and
G. Conti. Towards instrumenting network warfare competitions to generate labeled datasets.
In Proceedings of USENIX Security Workshop on Cyber Security Experimentation and Test,
2009.
[4] R. Smith, C. Estan, and S. Jha. Backtracking algorithmic complexity attacks against a NIDS.
In Proceedings of the 22nd Annual Computer Security Applications Conference, 2006.
[5] Sourcefire. ClamAV Open Source Antivirus Engine 0.98.6, 2015. Available at
http://www.clamav.net/.
[6] Sourcefire Vulnerability Research Team. Sourcefire Vulnerability Research Team (VRT)
Snort Rule-set, Apr. 2015. Available at http://www.snort.org/vrt.
[7] V. C. Valgenti, J. Chhugani, Y. Sun, N. Satish, M. S. Kim, C. Kim, and P. Dubey.
GPP-grep: High-speed regular expression processing engine on general purpose processors. In
International Symposium on Research in Attacks, Intrusions, and Defense. Springer, 2012.
[8] V. C. Valgenti, H. Sun, and M. S. Kim. Protecting run-time filters for network intrusion
detection systems. In Advanced Information Networking and Applications (AINA), 2014
IEEE 28th International Conference on, pages 116–122, May 2014.