Regular Expression Part-I

What is a regular expression?

A regular expression is a string of characters which tells the searcher which string (or strings) you are looking for. The following explains the format of regular expressions in detail.

As with most human languages the regex language has many dialects; regexes written for perl aren't automatically suited for sed, awk or grep, to name just a few standard UNIX tools.

I've chosen to write all the regexes in this tutorial in the POSIX dialect. This is because POSIX is slowly gaining ground in the world of regexes, and because a fair number of dialects are similar to it (well, actually it's the other way around). But this doesn't mean I'll be covering all the features of the POSIX 1003.2 regular expression standard. Another reason for using the POSIX dialect as opposed to the Perl dialect is that the Perl documentation does a much better job of explaining the Perl dialect than I ever will. Also, this way you won't be locked into any particular tool's regex extensions. In a way, the POSIX dialect can be considered the lowest common denominator.

Simple Regular Expressions

In its simplest form, a regular expression is just a word or phrase to search for. For example,

Verilog

would match any subject with the string "Verilog" in it, or which mentioned the word "Verilog" in the subject line. Thus, subjects with "Verilog", "SystemVerilog" or "iVerilog" would all be matched, as would a subject containing the phrases "Verilog Simulator" or "SystemVerilog Book." Here are some more examples:

sim : Finds any subject with the string "sim" in its name, or which mentions "sim" (or "simulation", "gatesim" or "Verilog simulation") in the subject line.
gate : Finds any subject with the string "gate" in its name or contents. Subjects with "gate", "gatelevel" or "and_gate" are found, as well as subjects containing the words "gate sim" or "udp gates".
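
Here is a minimal sketch of these plain-word searches. It uses Python's re module purely for illustration: Python speaks a Perl-style dialect, but a literal word behaves the same way there as in POSIX. The subject lines and the helper name find_subjects are hypothetical, and "VHDL tutorial" is an extra entry added only to show a non-match.

```python
import re

# Hypothetical subject lines, mostly taken from the examples above.
subjects = [
    "Verilog Simulator",
    "SystemVerilog Book",
    "iVerilog",
    "udp gates",
    "VHDL tutorial",
]

def find_subjects(pattern, lines):
    """Return the lines that contain a match for pattern anywhere."""
    return [line for line in lines if re.search(pattern, line)]

verilog_hits = find_subjects("Verilog", subjects)
gate_hits = find_subjects("gate", subjects)
print(verilog_hits)
print(gate_hits)
```

Note that the search is case sensitive, so the pattern "verilog" would find none of these subjects.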

Metacharacters

Some characters have a special meaning to the searcher. These characters are called metacharacters. Although they may seem confusing at first, they add a great deal of flexibility and convenience to the searcher.

Character	Description
^	Start of line
$	End of line
.	Any single character
*	Match 0 or more occurrences of the preceding character.
?	Match 0 or 1 occurrence of the preceding character.
+	Match 1 or more occurrences of the preceding character.
\t	The tab character
\n	The newline (line feed) character
\s	Match any whitespace character (space, tab, etc.). This is a Perl-style shorthand, not part of POSIX.
\d	Match any digit (same as [0-9]). This is a Perl-style shorthand, not part of POSIX.
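
The table can be tried out with a short sketch like the one below. It is written for Python's re module, which is an assumption on my part (any Perl-style engine would do), and the helper whole_match and the sample strings are made up for the illustration. Tools restricted to POSIX syntax may not accept \d and \s.

```python
import re

def whole_match(pattern, text):
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text) pairs chosen to exercise one metacharacter each.
checks = [
    (r"ab*c", "ac"),           # * : zero b's is allowed
    (r"ab*c", "abbbc"),        # * : several b's are allowed
    (r"ab?c", "abbc"),         # ? : at most one b, so this fails
    (r"ab+c", "ac"),           # + : at least one b, so this fails
    (r"a.c", "a-c"),           # . : any single character
    (r"\d+\s\d+", "12\t34"),   # digits, one whitespace character, digits
]

results = [whole_match(p, t) for p, t in checks]
for (p, t), ok in zip(checks, results):
    print(f"{p!r} against {t!r}: {ok}")
```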

Structure of a Regular Expression

There are three important parts to a regular expression. Anchors specify the position of the pattern in relation to a line of text. Character sets specify which characters may appear in a single position. Modifiers specify how many times the previous character set is repeated. A simple example that demonstrates all three parts is the regular expression "^#*". The caret is an anchor that indicates the beginning of the line. The character "#" is a simple character set that matches the single character "#". The asterisk is a modifier: it specifies that the previous character set can appear any number of times, including zero.
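
To see the three parts working together, here is a small sketch in Python (my choice of language, not something the pattern depends on). The helper name leading_hashes is hypothetical. Because the modifier allows zero repetitions, the pattern matches every line, and the match is simply empty when the line does not start with "#".

```python
import re

# Anchor ^, character set #, modifier *.
pattern = re.compile(r"^#*")

def leading_hashes(line):
    """Return the run of '#' characters at the start of line (may be empty)."""
    return pattern.match(line).group()

print(repr(leading_hashes("### title")))
print(repr(leading_hashes("title # x")))
```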

Unix regular expressions

The basic Unix regular expression syntax is now considered obsolete next to the POSIX extended syntax, but is still widely used for the purposes of backwards compatibility. Most regular-expression-aware Unix utilities, for example grep and sed, use it by default.

In this syntax, most characters are treated as literals—they match only themselves ("a" matches "a", "(bc" matches "(bc", etc). The exceptions are called metacharacters:

.	Matches any single character
[ ]	Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] matches any lowercase letter. These can be mixed, [abcq-z] matches a, b, c, q, r, s, t, u, v, w, x, y, z, and so does [a-cq-z].
'-'	Inside brackets, this character is literal only if it is the first or the last character within the brackets: [abc-] or [-abc]. To match a '[' or ']' character, the easiest way is to make sure the closing bracket comes first in the enclosing square brackets: [][ab] matches ']', '[', 'a' or 'b'.
[^ ]	Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter. As above, these can be mixed.
^	Matches the start of the line (or any line, when applied in multiline mode)
$	Matches the end of the line (or any line, when applied in multiline mode)
\( \)	Defines a "marked subexpression". What the enclosed expression matched can be recalled later. See the next entry, \n. A "marked subexpression" is also a "block". Note that this is not found in some instances of regex.
\n	Where n is a digit from 1 to 9; matches what the nth marked subexpression matched. This construct is theoretically irregular and has not been adopted in the extended regular expression syntax.
*	A single-character expression followed by "*" matches zero or more copies of the expression. For example, "[xyz]*" matches "", "x", "y", "zx", "zyx", and so on.
\n*	Where n is a digit from 1 to 9, matches zero or more iterations of what the nth marked subexpression matched. For example, "\(a.\)c\1*" matches "abcab" and "abcabab" but not "abcac".
\{x,y\}	Match the last "block" at least x and not more than y times. For example, "a\{3,5\}" matches "aaa", "aaaa" or "aaaaa". Note that this is not found in some instances of regex.
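
The entries above can be checked with the sketch below. One caveat: it uses Python's re module, which is an assumption of this example and not the basic Unix syntax itself. In Python, marked subexpressions and repeat counts are written without backslashes, so \(a.\)c\1* becomes (a.)c\1* and a\{3,5\} becomes a{3,5}. The bracket expressions are written identically. The helper whole_match is hypothetical.

```python
import re

def whole_match(pattern, text):
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

bracket_results = {
    "range": whole_match(r"[abcq-z]", "r"),          # r lies in q-z
    "outside range": whole_match(r"[abcq-z]", "d"),  # d is not listed
    "literal hyphen": whole_match(r"[abc-]", "-"),   # '-' placed last
    "literal bracket": whole_match(r"[][ab]", "]"),  # ']' placed first
    "negated": whole_match(r"[^abc]", "d"),          # anything but a, b, c
}

# Back-reference: \1 repeats whatever the first group matched.
backref = re.compile(r"(a.)c\1*")
backref_results = [
    backref.fullmatch(s) is not None for s in ("abcab", "abcabab", "abcac")
]

# Repeat count: between 3 and 5 copies of "a", tried on 2 to 6 copies.
interval_results = [whole_match(r"a{3,5}", "a" * n) for n in range(2, 7)]

print(bracket_results)
print(backref_results)
print(interval_results)
```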

Note that particular implementations of regular expressions interpret backslash differently in front of some of the metacharacters. For example, egrep and Perl interpret unbackslashed parentheses and vertical bars as metacharacters, reserving the backslashed versions to mean the literal characters themselves. Old versions of grep did not support the alternation operator "|".

Examples

".at" matches any three-character string ending in "at", such as hat, cat or bat
"[hc]at" matches hat and cat
"[^b]at" matches every string matched by ".at" except bat
"^[hc]at" matches hat and cat, but only at the beginning of a line
"[hc]at$" matches hat and cat, but only at the end of a line

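The five example patterns can be run against a handful of lines to see which ones are picked up. This is only a sketch: the sample lines and the grep-like helper are invented for the illustration, and Python's re module stands in for a Unix tool (these particular patterns are written the same way in both).

```python
import re

# Hypothetical input lines.
lines = ["hat trick", "the cat", "bat", "a chat"]

def grep(pattern, lines):
    """Return the lines in which pattern matches somewhere."""
    return [line for line in lines if re.search(pattern, line)]

any_at = grep(r".at", lines)       # any character followed by "at"
h_or_c = grep(r"[hc]at", lines)    # "hat" or "cat" anywhere in the line
not_b = grep(r"[^b]at", lines)     # "at" preceded by anything except "b"
at_start = grep(r"^[hc]at", lines) # "hat" or "cat" at the start of the line
at_end = grep(r"[hc]at$", lines)   # "hat" or "cat" at the end of the line

for name, hits in [("any_at", any_at), ("h_or_c", h_or_c), ("not_b", not_b),
                   ("at_start", at_start), ("at_end", at_end)]:
    print(name, hits)
```

Note that "a chat" is found by "[hc]at" because the search looks for the pattern anywhere in the line, and "chat" contains "hat".
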
Since many character ranges depend on the chosen locale setting (e.g., in some settings letters are ordered as abc..yzABC..YZ, while in others as aAbBcC..yYzZ), the POSIX standard defines some classes or categories of characters, as shown in the following table.

POSIX class	Similar to	Meaning
[:upper:]	[A-Z]	uppercase letters
[:lower:]	[a-z]	lowercase letters
[:alpha:]	[A-Za-z]	upper- and lowercase letters
[:alnum:]	[A-Za-z0-9]	digits, upper- and lowercase letters
[:digit:]	[0-9]	digits
[:xdigit:]	[0-9A-Fa-f]	hexadecimal digits
[:blank:]	[ \t]	space and TAB
[:space:]	[ \t\n\r\f\v]	whitespace characters
[:cntrl:]		control characters
[:graph:]	[^ \t\n\r\f\v]	printable characters, excluding space
[:print:]	[^\t\n\r\f\v]	printable characters, including space

Example: [[:upper:]ab] should only match the uppercase letters and lowercase 'a' and 'b'.
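
Not every engine understands these class names. Python's re module, for one, does not, so the sketch below translates a few of them into the ASCII ranges from the "similar to" column before matching. This is an approximation that ignores the locale, which is exactly what the POSIX classes are meant to handle, and the helper expand_posix_classes is a hypothetical, deliberately naive function: it replaces the class name wherever it appears and raises KeyError for names it does not know.

```python
import re

# ASCII approximations taken from the table above (locale is ignored).
POSIX_ASCII = {
    "upper": "A-Z",
    "lower": "a-z",
    "alpha": "A-Za-z",
    "alnum": "A-Za-z0-9",
    "digit": "0-9",
    "xdigit": "0-9A-Fa-f",
    "blank": r" \t",
}

def expand_posix_classes(pattern):
    """Replace each [:name:] with its ASCII range text (hypothetical helper)."""
    return re.sub(r"\[:([a-z]+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

translated = expand_posix_classes("[[:upper:]ab]")
found = re.findall(translated, "abcXYZ")
print(translated)
print(found)
```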

It is generally agreed that [:print:] consists of [:graph:] plus the space character. However, in Perl regular expressions [:print:] matches [:graph:] union [:space:].

An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This reflects the fact that in many programming languages these are the characters that may be used in identifiers. The editor vim further distinguishes word and word-head classes (using the notation \w and \h) since in many programming languages the characters that can begin an identifier are not the same as those that can occur in other positions.

Anchor Characters: ^ and $

The character "^" is the starting anchor, and the character "$" is the end anchor. The regular expression "^A" will match all lines that start with a capital A. The expression "A$" will match all lines that end with a capital A.
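
When the text being searched holds several lines at once, whether the anchors apply to each line depends on the tool. The sketch below assumes Python's re module, where the flag is called re.MULTILINE. The sample text is made up.

```python
import re

# Hypothetical four-line input.
text = "Alpha\nbeta A\nAnna\ngamma"

# With re.MULTILINE, ^ and $ apply at the start and end of every line.
starts_with_a = re.findall(r"^A.*", text, re.MULTILINE)
ends_with_a = re.findall(r".*A$", text, re.MULTILINE)

# Without the flag, ^ only applies at the very start of the text.
first_only = re.findall(r"^A.*", text)

print(starts_with_a)
print(ends_with_a)
print(first_only)
```

Line-oriented tools such as grep already handle one line at a time, so they need no such flag.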

Do you have any comments? Mail me at: deepak@asic-world.com