Notice: Stat is currently in private beta. This documentation is incomplete and subject to change.

Regular Expressions

A regular expression is a pattern that is used to match parts of a string or binary value. It can also be used to extract parts of a string or binary value, or to test whether a value matches a pattern. If you've ever programmed in JavaScript, PHP, Python, or any other popular programming language, then you may be familiar with regular expressions. In Stat, the syntax for a regular expression is much like that in JavaScript, which itself was inspired by the regular expression syntax in Perl.

Stat implements most of the regular expression functionality available in JavaScript, so in most cases you can copy a JavaScript regular expression, paste it into Stat, and it'll work... probably. There are a few differences between the two syntaxes, but for the most part they are the same.

Regular Expression Literals

To create a regular expression literal, enclose a pattern in forward slashes: /pattern/. Here is a simple example:

MAIN
	// This is how you create a regular expression literal
	// This statement doesn't do anything
	// other than create the regular expression literal, but this
	// is a valid statement, albeit a bit useless
	/john/

The code above creates a regular expression literal and then immediately discards it. You'd be better served by storing the literal in a variable so that you can refer to it later or pass it to a function or a stream. Here's a more useful example:

MAIN
	let pattern = /john/

Writing Regular Expressions

A regular expression is made up of basic characters and special character sequences that tell the regular expression engine what to match and how to traverse through the string / binary value as it searches for matches.

Matching simple characters

Matching simple characters is simple. Just specify the character directly in the regular expression to match that character. To match a sequence of simple characters just specify that sequence.

MAIN
	// Matches the word "hello" anywhere in a string or binary value
	let pattern = /hello/

Regular expression escape characters

There are some characters that won't work as simple characters in a regular expression because they have special meaning. The regular expression syntax makes heavy use of these characters in order to perform advanced pattern matches. In order to match these characters, they must be escaped with a back slash \. Here is a quick list of the escape characters available:

\0 - The null character (Not to be confused with the empty value... that's different)
\\ - A literal back slash. To match a single back slash, prefix it with another back slash.
\t - A tab character
\f - A form feed character. Otherwise known as the new page character.
\v - A vertical tab character.
\n - A newline character.
\r - A carriage return character.
\e - An escape character. Same as \x1b in binary
\/ - A forward slash character
\( - A left parenthesis character
\) - A right parenthesis character
\{ - A left curly brace character
\} - A right curly brace character
\[ - A left bracket character
\] - A right bracket character
\. - A dot (period) character
\| - A vertical pipe character
\* - An asterisk (star) character
\+ - A plus character
\? - A question mark character
\^ - A caret character
\$ - A dollar sign character

MAIN
	// Matches a literal dot followed by a literal question mark
	let pattern = /\.\?/

Matching unicode characters

You can match a unicode character by simply adding the character to the regular expression. Alternatively, you can use a unicode character escape just like in a string. Here is an example:

MAIN
	// These both match a watermelon emoji character
	let pattern = /🍉/
	pattern = /\u{1f349}/

Matching arbitrary binary bytes

You can match a binary byte by using a hexadecimal byte escape, just like in a binary value. The syntax is simple: use \x followed by the 2-digit hexadecimal value of the byte you want to match, which ranges from \x00 to \xff.

MAIN
	// This matches abcn because "n" in hex is 6e
	let pattern = /abc\x6e/

	// Matches abc followed by a binary x98 (152) byte
	pattern = /abc\x98/

Matching common character classes

The regular expression syntax includes a way to match common character classes like numbers, spaces, and more. Here is a quick list of all character classes available:

\d - Matches any digit character 0-9
\D - Matches any non digit character. That is any character that is not 0-9
\s - Matches any whitespace character including space, tab, vertical tab, return, and new line characters
\S - Matches any non whitespace character which is everything except space, tab, vertical tab, return, and new line
\w - Matches any word character which includes A-Z, a-z, 0-9 and the underscore character _
\W - Matches any non word character which is everything except A-Z, a-z, 0-9, and _

MAIN
	// Matches a digit character followed by any white space character
	let pattern = /\d\s/

Matching word boundaries

You can match a word boundary with the following escape: \b. This escape checks that the current position in the string or binary value is a word boundary. A word boundary is where a word character (a-z, A-Z, 0-9, or _) is followed by a non word character or vice versa. It also matches at the start of the string if the first character is a word character, and at the end of the string if the last character is a word character. You can negate the match by using \B, which matches if there isn't a word boundary at the current position.

It's important to note that matching a word boundary with \b or a non word boundary with \B doesn't move the match position. It doesn't actually consume any characters in the value you are searching. Rather, it matches the boundary itself, not any actual characters. Here are some examples:

MAIN
	// Matches abc but only if the next character is
	// not a word character. Like abc@. Also matches
	// abc at the end of the string
	let pattern = /abc\b/

	// Matches abc but only if the next character is
	// also a word character. Like abcd
	pattern = /abc\B/
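
Stat can't be run here, so the following is only a sketch in Python's re module, assuming Stat's \b and \B behave as described above. It shows that the boundary itself is zero-width: every match is exactly 3 characters long.

```python
import re

text = "abc@ abcd abc"

# abc followed by a word boundary: the first abc (before @)
# and the last abc (end of string), but not the one inside abcd.
boundary_spans = [m.span() for m in re.finditer(r"abc\b", text)]

# abc NOT followed by a word boundary: only the abc inside abcd.
non_boundary_spans = [m.span() for m in re.finditer(r"abc\B", text)]

print(boundary_spans)      # [(0, 3), (10, 13)]
print(non_boundary_spans)  # [(5, 8)]
```

Each span covers only the three letters, so neither \b nor \B consumed a character.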

Matching any character

The dot character (.) in a regular expression matches any character except a new line character by default. You can make the dot character also match new line characters by providing the "dot all" flag (s) after the regular expression literal. (More on that below)

MAIN
	// Matches abc followed by any character except a new line character
	let pattern = /abc./

	// Matches abc followed by any character, even a new line character
	pattern = /abc./s

Matching a specified set of characters

You can specify a character set which matches any single character in that set. The syntax looks like this: [abc]. It's a set of characters to match enclosed in square brackets.

You can also add character ranges within the set by separating two characters with a dash, like so: a-z. This means any lowercase letter. 0-9 means any digit character. Note that ranges aren't limited to letters or numbers. You can specify any range in the ASCII character set, for example: !-~ which matches any printable ASCII character other than the space. Or 4-d, which matches the digits 4-9, any uppercase letter, the lowercase letters a-d, and the characters whose ASCII codes fall in between: :;<=>?@[\]^_`

You can also negate a character set by setting the first character as a caret character ^. For example, this matches any character except a, b, or c: [^abc]. Check out these examples:

MAIN
	// Matches abc followed by any digit
	let pattern = /abc[0-9]/

	// Matches abc followed by one of x, y, or z
	pattern = /abc[xyz]/

	// Matches abc followed by any character
	// except a lower case letter
	pattern = /abc[^a-z]/
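
If you're ever unsure what a range covers, you can enumerate it. This is a sketch in Python's re module rather than Stat, on the assumption that both order ranges by ASCII code. It lists every ASCII character matched by the range 4-d and then tries a negated set.

```python
import re

# Collect every ASCII character that the range 4-d matches.
in_range = "".join(
    chr(code) for code in range(128) if re.fullmatch(r"[4-d]", chr(code))
)
print(in_range)  # 456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcd

# A negated set: abc followed by anything except a lowercase letter.
negated = re.search(r"abc[^a-z]", "abcd abc7")
print(negated.group())  # abc7
```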

Alternate matches

You can specify alternate matches by separating them with the vertical pipe character |. The pattern on the left side of the pipe is tried first, and if it fails, the pattern on the right side of the pipe is tried. You can add as many alternative matches as necessary. Here are some examples:

MAIN
	// Matches abc or def
	let pattern = /abc|def/

	// Matches abc, def, or xyz
	pattern = /abc|def|xyz/

Regular expression groups

You can enclose parts of a regular expression in parentheses to specify a capture group. A capture group can then be referenced later on in the regular expression to match the same sequence of characters that the capture group matched previously. Capture groups are also returned as part of a match when using the "matchOnce" operator ~> or the "matchAll" operator ~>>. (More on that below).

Groups can also be used along with alternative matches in order to separate 2 or more alternative matches from the rest of the match. For example:

MAIN
	// Matches abc followed by either def or ghi
	let pattern = /abc(def|ghi)/

In addition to capture groups, you can specify other group types that behave differently. Here is a list of the possible group types:

(pattern) - Capture group - can be referenced later and is returned along with the full match when using the "matchOnce" or "matchAll" operators
(?:pattern) - Non-capture group - cannot be referenced later and is not returned
(?=pattern) - Look ahead group - looks ahead and attempts to match the pattern, but does not consume any of the value being searched nor is it included in the match
(?!pattern) - Negative look ahead group - looks ahead and matches if the pattern doesn't match. It also doesn't consume any of the value being searched nor is it included in the match

Here are some examples:

MAIN
	// Captures the sequence abc
	let pattern = /(abc)/

	// Matches abc in a group, followed by def,
	// followed by a 2nd group of ghi
	pattern = /(abc)def(ghi)/

	// Groups can be nested
	pattern = /(a group that (contains another group) suffix)/

	// Matches abc but doesn't store the match
	pattern = /(?:abc)/

	// Matches abc only if it is followed by def
	pattern = /abc(?=def)/

	// Matches abc only if it is not followed by def
	pattern = /abc(?!def)/
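
To make the difference between the group types concrete, here is a sketch in Python's re module, which uses the same group syntax. It assumes Stat's groups behave as listed above and is not Stat code.

```python
import re

# Look ahead: abc only when def follows; def is not part of the match.
ahead = re.search(r"abc(?=def)", "abcdef")

# Negative look ahead: abc only when def does NOT follow.
not_ahead = re.search(r"abc(?!def)", "abcdef abcxyz")

# Non-capture group: (?:abc) groups without storing a capture,
# so only (def) shows up in the captures.
non_capture = re.search(r"(?:abc)(def)", "abcdef")

print(ahead.group(), ahead.span())          # abc (0, 3)
print(not_ahead.group(), not_ahead.span())  # abc (7, 10)
print(non_capture.groups())                 # ('def',)
```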

Referencing capture groups using back references

You can reference a previous capture group match by using a back reference which is a back slash followed by a single digit character which is the 1 based index of the capture group. To reference the first capture group use \1, to reference the 2nd capture group, use \2 and so on. Here is a quick example:

MAIN
	// Matches "hello there hello" or "bye there bye"
	let pattern = /(hello|bye) there \1/

	// Matches abc yy or abc zz
	pattern = /(abc) ([yz])\2/
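
A back reference matches the text the group actually captured, not the group's pattern. The following sketch uses Python's re module (same \1 syntax) to show this, assuming Stat behaves the same way.

```python
import re

greeting = re.compile(r"(hello|bye) there \1")
same_word = greeting.search("hello there hello")
mixed_words = greeting.search("hello there bye")

doubled = re.compile(r"(abc) ([yz])\2")
same_letter = doubled.search("abc yy")
mixed_letters = doubled.search("abc yz")

print(same_word is not None)    # True
print(mixed_words is not None)  # False: \1 must repeat "hello"
print(same_letter is not None)    # True
print(mixed_letters is not None)  # False: \2 must repeat "y"
```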

Regular Expression Quantifiers

Regular expression quantifiers allow you to match more or less than 1 of something. Up until now, we've only shown examples where you match a single character or sequence of characters. However, if you need to match say 3 or more of something, then that's where quantifiers come in. To match more or less than one of something, simply follow any match with the appropriate quantifier. Here is a list of available quantifiers and how they work:

? - Match 0 or 1 of the previous match
+ - Match 1 or more of the previous match
* - Match 0 or more of the previous match
{,n} - Where n is a number, match between 0 and n of the previous match
{n,} - Where n is a number, match at least n or more of the previous match
{n,m} - Where n and m are numbers, match at least n but no more than m of the previous match

Here are some examples of using quantifiers:

MAIN
	// Matches ab followed by an optional c
	let pattern = /abc?/

	// Matches ab followed by 1 or more c characters
	pattern = /abc+/

	// Matches ab followed by 0 or more c characters
	pattern = /abc*/

	// Matches abc followed by 1 or more
	// white-space characters, followed by def
	pattern = /abc\s+def/

	// Matches abc followed by 2 or more groups
	// of def. So abcdefdef or abcdefdefdef, etc...
	pattern = /abc(def){2,}/

	// Matches abcdddef or abcddddef or abcdddddef
	pattern = /abcd{3,5}ef/
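
A quick way to check a quantifier's bounds is to try every repeat count. This sketch does that with Python's re module, assuming Stat's ?, {n,} and {n,m} quantifiers have the meanings listed above.

```python
import re

# Which numbers of d characters does abcd{3,5}ef accept?
bounded = re.compile(r"abcd{3,5}ef")
accepted_counts = [
    n for n in range(8) if bounded.fullmatch("abc" + "d" * n + "ef")
]
print(accepted_counts)  # [3, 4, 5]

# abc? makes only the c optional, not the whole sequence.
optional_c = [s for s in ("ab", "abc", "abcc") if re.fullmatch(r"abc?", s)]
print(optional_c)  # ['ab', 'abc']

# (def){2,} needs the whole group at least twice.
one_group = re.fullmatch(r"abc(def){2,}", "abcdef")
two_groups = re.fullmatch(r"abc(def){2,}", "abcdefdef")
print(one_group is None, two_groups is not None)  # True True
```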

Regular expression quantifier greediness

By default all regular expression quantifiers are greedy. This means that they will try to match as much of the string as possible that still satisfies the match. For example, if you are matching the following string "abcdddddef" with the following regular expression /abcd{2,}/ then it'll match abcddddd. Notice that it matched as many of the d character as it could. However, there is another way... we could make our quantifier non-greedy so that it matches as little of the previous match as possible to still satisfy the match. To make a quantifier non-greedy, append a question mark character to the quantifier. Here are some examples that outline this behavior:

MAIN
	// Matches abc followed by 2 or more d characters
	// Since it is greedy, it'll match as many as possible.
	let pattern = /abcd{2,}/

	// This makes the quantifier non-greedy which means
	// it'll match as few as 2 d characters
	pattern = /abcd{2,}?/

	// Matches abc followed by 0 or 1 d characters
	// (preferring 0), followed by e
	pattern = /abcd??e/

	// Matches a dash followed by as many characters
	// as possible followed by another dash.
	// Given this string: -abcd-efgh-
	// It'll match the entire string
	pattern = /-.*-/

	// Matches a dash followed by as few characters
	// as possible followed by another dash.
	// Given this string: -abcd-efgh-
	// It'll match -abcd-
	pattern = /-.*?-/
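
Here is the same greedy versus non-greedy comparison as a runnable sketch in Python's re module. It assumes Stat's greediness rules match the description above.

```python
import re

text = "-abcd-efgh-"

# Greedy: .* grabs as much as it can, then backs off to the last dash.
greedy = re.search(r"-.*-", text).group()

# Non-greedy: .*? grabs as little as it can, stopping at the next dash.
lazy = re.search(r"-.*?-", text).group()

# The same idea with a counted quantifier.
greedy_d = re.search(r"abcd{2,}", "abcdddddef").group()
lazy_d = re.search(r"abcd{2,}?", "abcdddddef").group()

print(greedy)    # -abcd-efgh-
print(lazy)      # -abcd-
print(greedy_d)  # abcddddd
print(lazy_d)    # abcdd
```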

Anchoring regular expression matches

Anchoring a regular expression means to specify the position in a string or binary value where a match should begin or end. To anchor a regular expression to the beginning of the target, put a caret character ^ at the beginning of the regular expression. To anchor a regular expression to the end of the target put a dollar sign character $ at the end of the regular expression. These anchors don't match any characters, but rather they assert that the match starts at the beginning or ends at the end of the target. Here are some examples:

MAIN
	// Matches abc, but only at the start
	let pattern = /^abc/

	// Matches abc, but only at the end
	pattern = /abc$/

	// Matches only abc, not abcd or 123abc
	pattern = /^abc$/

You can anchor not only at the start or end of the target, but also at the start or end of lines as well. To make an anchor match at the start or end of a line, use the m "multi-line" flag after the regular expression like so:

MAIN
	// Matches abc, but only at the start
	// of the string or start of a line
	let pattern = /^abc/m

	// Matches abc, but only at the end
	// of a line or end of the string
	pattern = /abc$/m

	// Matches any line in the string
	// that is abc
	pattern = /^abc$/m
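
The sketch below uses Python's re module, whose re.MULTILINE flag plays the role of the m flag, to show where the anchors match with and without it. It assumes Stat's anchors follow the rules described above.

```python
import re

text = "abc\nxabc\nabc"

# Without the multi-line flag, ^ only matches at the start of the string.
start_only = [m.start() for m in re.finditer(r"^abc", text)]

# With it, ^ also matches at the start of each line.
line_starts = [m.start() for m in re.finditer(r"^abc", text, re.MULTILINE)]

# $ matches at the end of each line, so xabc counts too.
line_ends = [m.start() for m in re.finditer(r"abc$", text, re.MULTILINE)]

# Both anchors: only lines that are exactly abc.
whole_lines = [m.start() for m in re.finditer(r"^abc$", text, re.MULTILINE)]

print(start_only)   # [0]
print(line_starts)  # [0, 9]
print(line_ends)    # [0, 5, 9]
print(whole_lines)  # [0, 9]
```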

To match the start of a line but also match a pattern on the previous line, place your anchor at the beginning of a group. Likewise, to match the end of a line, but also match a pattern on the next line, place your anchor at the end of a group. Here are some examples:

MAIN
	// Matches abc followed by 1 or more new
	// line characters, followed by def at the
	// beginning of a line
	let pattern = /abc\n+(^def)/m

	// Matches abc at the end of a line
	// followed by 1 or more new line characters,
	// followed by def
	pattern = /(abc$)\n+def/m

	// Matches 123 followed by 1 or more
	// new line characters, followed by abc on
	// a single line, followed by 1 or more newline
	// characters, followed by def
	pattern = /123\n+(^abc$)\n+def/m

Regular expression flags

There are a few flags that you can add after a regular expression to change its behavior when performing matches. We've briefly mentioned a few of them above. Here is a list of all the available regular expression flags:

i - Case insensitive flag. This flag causes any letter match to be case insensitive. This means that any letter matches both its lowercase and uppercase versions.
s - Dot all flag. This flag makes the dot character match all characters including new line characters.
m - Multi-line flag. This flag makes the start anchor ^ match at the beginning of a line in addition to the beginning of the string and it makes the end anchor $ match at the end of a line in addition to matching at the end of the string.

MAIN
	// Matches any letter either upper or lower case
	let pattern = /[a-z]/i

	// Matches ab, aB, Ab, or AB on any line
	pattern = /^ab$/mi

	// Matches literally everything
	pattern = /^.*$/s
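
For comparison, here is how the same three flags behave in Python's re module, where they are spelled re.IGNORECASE, re.DOTALL and re.MULTILINE. This is a sketch that assumes Stat's i, s and m flags work as listed above.

```python
import re

# i: letters match regardless of case.
letters = re.findall(r"[a-z]", "aB1", re.IGNORECASE)

# s: without it the dot refuses a new line, with it the dot accepts one.
dot_default = re.search(r"abc.", "abc\n")
dot_all = re.search(r"abc.", "abc\n", re.DOTALL)

# m and i combined: ab in any case, alone on a line.
lines = re.findall(r"^ab$", "AB\nxx\naB", re.MULTILINE | re.IGNORECASE)

print(letters)              # ['a', 'B']
print(dot_default is None)  # True
print(repr(dot_all.group()))  # 'abc\n'
print(lines)                # ['AB', 'aB']
```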

Something to note

In Stat, there is no such thing as the g global match flag like there is in JavaScript and other languages. When performing matches, the operator determines how many matches are made instead of a flag in the regular expression.

Multi-line Regular Expressions

All regular expression literals can be multi-line. There is no special way to code a multi-line regular expression, no "here doc" or triple quotes... nope, just code it like you'd expect. The only thing to be aware of is that multi-line regular expressions must still follow the indentation rules: each subsequent line must be indented one level deeper than the first line of the regular expression. It's better to just show you some examples:

MAIN
	// Matches abc followed by a new line character
	// followed by def followed by another new
	// line character followed by ghi
	// abc\ndef\nghi
	let pattern = /abc
		def
		ghi/

	// The same thing but with escapes
	pattern = /abc\ndef\nghi/

	// This is here to show how indentation works
	// with multi-line regular expressions
	if true
		// Another regular expression but with
		// a trailing new line character
		pattern = /abc
			def
			ghi
			/

		// The same thing but with escapes
		pattern = /abc\ndef\nghi\n/

	// One more example with indentation
	pattern = /
		This is actually the 2nd line
		    This is the 3rd line and it begins with 4 spaces
		This is the 4th line and does not start with any white-space/

Here's a quick illustration of how multi-line indentation works.

[Figure: Multi-line regular expressions with indentation]

Regular Expression Interpolation

Regular Expression interpolation is a technique used to embed string or binary values into regular expression literals. This makes it easier to assemble complex regular expressions without having to use concatenation and convert a string or binary value to a regular expression. Stat has a very powerful interpolation interpreter that allows you to not only embed variables, but complex expressions including function calls and even nested values with their own interpolations.

To embed a value inside a regular expression, use the following syntax: \{value}. Here are some examples:

IMPORTS
	myFunction

MAIN
	let name = "John"
	let pattern = /abc \{name} def/

	// You can use any value that evaluates to a string or
	// binary, not just variables. This uses a function call
	pattern = /abc \{myFunction()} def/

	// You can even use other string or binary literals
	// Obviously, you could just use John in the regular expression itself
	// but the purpose of this example is to show you what's possible
	pattern = /Hello \{"John"}/

	// You can even nest interpolations like this
	pattern = /Hello \{name + ". sub \{myFunction()}"}/

Something to note

Regular expression interpolation matches only literal characters. You cannot use regular expression interpolation to build a dynamic regular expression. To build a dynamic regular expression, convert a string or binary value to the optional type regExp? instead.

MAIN
	let value = ".*"

	// Matches abc followed by a literal dot
	// and a literal asterisk character.
	let pattern = /abc\{value}/
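
If you know Python, the closest equivalent to literal interpolation is escaping a value with re.escape before adding it to a pattern. This is only an analogy for the behavior described above, not a description of how Stat implements it.

```python
import re

value = ".*"

# Escaping the value first means its characters are matched literally,
# which is the behavior described for \{value} inside a literal.
literal = re.compile("abc" + re.escape(value))

print(literal.pattern)                        # abc\.\*
print(literal.search("abc.*") is not None)    # True
print(literal.search("abcdef") is not None)   # False
```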

Creating Dynamic Regular Expressions

You can't use a regular expression literal to create a dynamic regular expression. While you can interpolate string values into your regular expressions, those interpolations are matched literally and any special characters within those interpolations have no special meaning outside of simple character matching.

In order to create a dynamic regular expression, that is, one that is pieced together using other values, you have to create the pattern as a string, then convert that string to an optional regular expression regExp? using the "to" operator. Notice that we aren't converting the string to just a regExp. The reason for this is that the pattern could be invalid. For example, converting the string "abc[def" to a regular expression would not work because it's missing a closing bracket. If the conversion fails, then empty is returned.

MAIN
	let all = ".*"
	let stringPattern = "abc\{all}"

	// Converts the string to an optional regular expression
	// This will end up being /abc.*/
	let maybeRegExp = stringPattern to: regExp?

	// You can force the conversion if you're sure that it won't fail
	// If it does fail, then it'll evaluate to an empty regular expression that matches nothing
	let forcedRegExp = (stringPattern to: regExp?)!
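
The idea of a conversion that can fail can be sketched in Python like this. The to_regexp helper is hypothetical (it is not part of Stat or of Python's re module), and None stands in for Stat's empty value.

```python
import re
from typing import Optional


def to_regexp(source: str) -> Optional[re.Pattern]:
    """Hypothetical helper: compile source, or return None if it is invalid."""
    try:
        return re.compile(source)
    except re.error:
        return None


everything = ".*"
dynamic = to_regexp("abc" + everything)  # special characters keep their meaning
invalid = to_regexp("abc[def")           # missing closing bracket

print(dynamic.search("abcxyz").group())  # abcxyz
print(invalid)                           # None
```

Returning an optional value forces the caller to decide what happens when the pattern is bad, which is the same reason the conversion above targets regExp? rather than regExp.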

Testing For Matches

To see if a string or binary value matches a regular expression, you can use the "matches" operator which looks like this: ~=. The return value is a bool value which will be either true or false. Here's what the syntax looks like:

MAIN
	let stringVal = "Hello World"
	let pattern = /l+o\b/

	// This will be true
	let doesMatch = stringVal ~= pattern

	// You can match against binary values too
	let binVal = `\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64`
	// This will also be true
	doesMatch = binVal ~= pattern
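
For readers who want to try the same test outside Stat, here is a rough Python equivalent. Unlike Stat as described above, Python needs a separate bytes pattern to search binary data, so this is an analogy rather than a translation.

```python
import re

text = "Hello World"
data = bytes.fromhex("48656c6c6f20576f726c64")  # the bytes of "Hello World"

# A search that finds anything counts as "matches".
matches_text = re.search(r"l+o\b", text) is not None
matches_bytes = re.search(rb"l+o\b", data) is not None

print(matches_text)   # True
print(matches_bytes)  # True
```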

Getting a Match

You can get the first match in a string or binary value by using the "matchOnce" operator which looks like this: ~>. The return value will be of this type if you match against a string:

META
	type typeDef
	typeDef: {
		match: string,
		range: range,
		captures: [
			{
				match: string,
				range: range
			}
		]
	}?

And it'll be this type if you match against a binary value:

META
	type typeDef
	typeDef: {
		match: binary,
		range: range,
		captures: [
			{
				match: binary,
				range: range
			}
		]
	}?

Here are some examples of how to use the "matchOnce" ~> operator:

MAIN
	// This will be {
	//		match: "Hello",
	//		range: 1..6,
	//		captures: []
	//	}
	let match = "Hello World" ~> /\w+/

	// This will be {
	//		match: "1.254.987",
	//		range: 1..10,
	//		captures: [{match: "987", range: 7..10}]
	//	}
	match = "1.254.987 864.0.11210" ~> /(?:(\d+)\.?)+/
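
To see where the values in those results come from, here is a sketch that rebuilds the same shape with Python's re module. The match_once helper is hypothetical, None stands in for empty, and the ranges assume a 1-based start with an exclusive end, which is what the sample ranges above suggest but the text doesn't state.

```python
import re


def match_once(pattern, text):
    """Hypothetical helper mimicking the result shape of a single match."""
    found = re.search(pattern, text)
    if found is None:
        return None  # stands in for Stat's empty value

    def to_range(span):
        # Assumption: 1-based start, exclusive end (Python spans are 0-based).
        return (span[0] + 1, span[1] + 1)

    # Note: groups that took no part in the match are not handled here.
    captures = [
        {"match": found.group(i), "range": to_range(found.span(i))}
        for i in range(1, found.re.groups + 1)
    ]
    return {
        "match": found.group(),
        "range": to_range(found.span()),
        "captures": captures,
    }


first = match_once(r"\w+", "Hello World")
second = match_once(r"(?:(\d+)\.?)+", "1.254.987 864.0.11210")
print(first)
print(second)
```

A capture group inside a repeated group keeps only its last repetition, which is why the single capture is 987 rather than 1 or 254.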

You can swap the order of the value and regular expression as well as the direction of the operator. No matter how you write it, Stat understands what you mean. Here are some examples:

MAIN
	// All these statements are the same
	"Hello World" ~> /\w+/
	"Hello World" <~ /\w+/
	/\w+/ ~> "Hello World"
	/\w+/ <~ "Hello World"

Getting a List of All Matches

You can get a list of all matches in a string or binary value by using the "matchAll" operator which looks like this: ~>>. The return value will be of this type if you match against a string:

META
	type typeDef
	typeDef: [
		{
			match: string,
			range: range,
			captures: [
				{
					match: string,
					range: range
				}
			]
		}
	]

And it'll be this type if you match against a binary value:

META
	type typeDef
	typeDef: [
		{
			match: binary,
			range: range,
			captures: [
				{
					match: binary,
					range: range
				}
			]
		}
	]

Here are some examples of how to use the "matchAll" ~>> operator:

MAIN
	// This will be [
	// 	{match: "Hello", range: 1..6, captures: []},
	// 	{match: "World", range: 7..12, captures: []}
	// ]
	let matches = "Hello World" ~>> /\w+/

	// This will be [
	// 	{
	//		match: "1.254.987",
	//		range: 1..10,
	//		captures: [{match: "987", range: 7..10}]
	// 	},
	// 	{
	//		match: "864.0.11210",
	//		range: 11..22,
	//		captures: [{match: "11210", range: 17..22}]
	// 	}
	// ]
	matches = "1.254.987 864.0.11210" ~>> /(?:(\d+)\.?)+/
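
The list form can be sketched the same way with Python's re.finditer. As before, match_all is a hypothetical helper, and the 1-based, end-exclusive ranges are an assumption inferred from the sample output above.

```python
import re


def match_all(pattern, text):
    """Hypothetical helper: one result per non-overlapping match, in order."""
    results = []
    for found in re.finditer(pattern, text):
        # Assumption: ranges are 1-based with an exclusive end.
        captures = [
            {"match": found.group(i),
             "range": (found.start(i) + 1, found.end(i) + 1)}
            for i in range(1, found.re.groups + 1)
        ]
        results.append({
            "match": found.group(),
            "range": (found.start() + 1, found.end() + 1),
            "captures": captures,
        })
    return results


words = match_all(r"\w+", "Hello World")
numbers = match_all(r"(?:(\d+)\.?)+", "1.254.987 864.0.11210")
print([w["match"] for w in words])    # ['Hello', 'World']
print([n["range"] for n in numbers])  # [(1, 10), (11, 22)]
```

When nothing matches, the helper returns an empty list rather than None, mirroring the non-optional list type shown above.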

You can swap the order of the value and regular expression as well as the direction of the operator. No matter how you write it, Stat understands what you mean. Here are some examples:

MAIN
	// All these statements are the same
	"Hello World" ~>> /\w+/
	"Hello World" <<~ /\w+/
	/\w+/ ~>> "Hello World"
	/\w+/ <<~ "Hello World"

Something to note

In Stat, there is no such thing as the g global match flag like there is in JavaScript and other languages. To get all matches, use the "matchAll" ~>> operator instead.