Malcolm Max DeRungs

Regular expressions (regex) for a variety of situations.

^at.* remove all lines that begin with "at"
.*exception.* remove all lines that contain the string "exception"
target="basefrm" id="itemTextLink[0-9]+" $1 remove the number sequence from the string

^ (.*)$ $1 return capture group without beginning space
^,(.*)$ $1 return capture group without beginning comma
^(.*),$ $1 return capture group without ending comma
^...(.*) $1 return capture group without beginning three characters
^(.*)\?.* $1 return capture group without everything after the ?
^(.*)#[0-9].* $1 return capture group without number(s) at end of a string
^(.*)#[0-9]+;#.* $1 return capture group without hash, number(s), semicolon and hash pattern in string
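
As a rough illustration of how several of the capture-group patterns above behave, the sketch below applies them with String.prototype.replace in TypeScript. The helper name keepGroup1 and all sample strings are made up for this example; only the patterns and the $1 replacement come from the list.

```typescript
// Patterns copied from the list above; the constant names are hypothetical.
const leadingSpace = /^ (.*)$/;
const leadingComma = /^,(.*)$/;
const trailingComma = /^(.*),$/;
const afterQuestionMark = /^(.*)\?.*/;
const hashNumberSuffix = /^(.*)#[0-9].*/;
const hashNumberSemicolonHash = /^(.*)#[0-9]+;#.*/;

// Replace the whole match with the contents of capture group 1.
function keepGroup1(input: string, pattern: RegExp): string {
    return input.replace(pattern, "$1");
}

// Made-up sample inputs, one per pattern.
console.log(keepGroup1(" hello", leadingSpace));                    // hello
console.log(keepGroup1(",a,b", leadingComma));                      // a,b
console.log(keepGroup1("a,b,", trailingComma));                     // a,b
console.log(keepGroup1("page.html?x=1", afterQuestionMark));        // page.html
console.log(keepGroup1("Report#12", hashNumberSuffix));             // Report
console.log(keepGroup1("Smith#42;#Jones", hashNumberSemicolonHash)); // Smith
```

Because (.*) is greedy, a string with several ? or # characters is cut at the last one that still lets the rest of the pattern match.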

Examples

This regex pattern contains two capture groups: ([^/?])$|/?([?].*)
The parentheses ( ) are capture groups, referred to in the replacement as backreferences $1 and $2. The pattern also contains two character classes [ ]. The replacement works even when one capture group is empty, because a non-participating capture group is filled with an empty string after a match. ([^/?])$ means match and capture into Group 1 any character other than / or ? at the end of the string. The | means or. The /? means an optional (0 or 1) forward slash. And ([?].*) means match and capture into Group 2 a literal ? followed by zero or more characters other than a newline.
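
The notes do not say which replacement string goes with this pattern. The sketch below assumes $1/$2, which would insert a trailing slash before the end of a URL or before its query string. The function name addTrailingSlash and the example.com URLs are hypothetical. It is meant only to show that the group that did not take part in the match contributes an empty string.

```typescript
// Pattern taken verbatim from the text above.
const slashPattern = new RegExp("([^/?])$|/?([?].*)");

// ASSUMPTION: "$1/$2" is an illustrative replacement, not stated in the notes.
function addTrailingSlash(url: string): string {
    return url.replace(slashPattern, "$1/$2");
}

// Group 1 matches the final "e"; group 2 does not participate and is empty.
console.log(addTrailingSlash("http://example.com/page"));
// http://example.com/page/

// Group 2 matches "?x=1"; group 1 does not participate and is empty.
console.log(addTrailingSlash("http://example.com/page?x=1"));
// http://example.com/page/?x=1

// The optional slash is consumed and written back, so nothing changes.
console.log(addTrailingSlash("http://example.com/page/?x=1"));
// http://example.com/page/?x=1
```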

Return a capture group that removes a bunch of HTML empty strings

.*(<body |body>|<div |div>|head>|html>|script>|style>|<table |table>|title>|<tr |tr>|(A {|BODY {|TD {|content=|initializeDocument|meta name)|<td valign=|</a>|</td>|.*src="Support.*|alt="mauyong">|.*alt=.*/>$|.*href='javascript:clickOnNode.*alt="|" target="basefrm").* $1

Solr Query

Find and escape all special characters in a user query before it is sent to Solr. Queries may otherwise return zero or unexpected results, for example:

wellness: my
Heritage Metrics Model - 7380166 [1]

// Term is search query string
term = this.replaceSpecialCharacters(term)

// Find and escape all special characters in Solr query string
public static replaceSpecialCharacters(term: string) {

    // Create a map of special characters with their replacements
    let pairMap: { [key: string]: string } = {
        '[': '\\[',
        ']': '\\]',
        '(': '\\(',
        ')': '\\)',
        '{': '\\{',
        '}': '\\}',
        '*': '\\*',
        '+': '\\+',
        '?': '\\?',
        '|': '\\|',
        '^': '\\^',
        '$': '\\$',
        '\\': '\\\\',
        '-': '\\-',
        '&': '\\&',
        '!': '\\!',
        '~': '\\~',
        ':': '\\:',
        ';': '\\;'
    }
    let findMap = Object.keys(pairMap)
    let replaceMap = Object.values(pairMap)
    let cleanRegexArray: string[] = []
    let matchMap: { [key: string]: string } = {}
    for (let i = 0, len = findMap.length; i < len; i++) {
        // Create an array of regex escaped characters using 
        // a character class inside a capture group
        cleanRegexArray.push(findMap[i].replace(/([\[\](){}*+?|^$.\\])/g, '\\$1'))

        // Create an object with proper key:value property pairs for matching
        matchMap[findMap[i]] = replaceMap[i]
    }

    // Create regex-ready OR string of characters to find
    let cleanRegexString = cleanRegexArray.join('|')

    // Replace characters in term with characters matching map key
    term = term.replace(
        new RegExp(cleanRegexString, 'g'),
        function (matchKey) {
            return matchMap[matchKey]
        }
    )
    return term
}

The character class [ ] inside the capture group ( ) contains all the characters that need to be escaped before they can be used in a regular expression. These include:

Brackets: []
Parentheses: ()
Curly braces: {}
Operators: * + ? |
Anchors: ^ $
Others: . \

Escaping guidelines for items in the character class:

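Escape [ ] and \ literals inside the character class
Avoid ^ at the beginning, where it negates the class
Place a literal - at the beginning or end of the class, or escape it, so it is not read as a range
Double the backslash in the replacement string ('\\$1') so that a literal backslash precedes the captured character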
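
The guidelines above can be sketched with a single character class that covers the same 19 characters as pairMap. This is a compact alternative, not the map-based method shown earlier: it uses the $& replacement token (the whole match) instead of a lookup function. The names solrSpecialChars and escapeSolrTerm are hypothetical.

```typescript
// One character class holding the 19 keys of pairMap.
// [ ] and \ are escaped, ^ is not the first character, and - is escaped.
const solrSpecialChars = /[\[\](){}*+?|^$\\\-&!~:;]/g;

function escapeSolrTerm(term: string): string {
    // "\\$&" is a backslash followed by the matched character
    return term.replace(solrSpecialChars, "\\$&");
}

// The two example queries from the Solr Query section
console.log(escapeSolrTerm("wellness: my"));
// wellness\: my
console.log(escapeSolrTerm("Heritage Metrics Model - 7380166 [1]"));
// Heritage Metrics Model \- 7380166 \[1\]
```

A lookup map is still the better fit when some characters need a replacement other than a backslash prefix.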

Fusion JavaScript Pipeline

Escape special characters in a Solr Partial Update Indexer ID field

function(doc) {
	if (doc.getId() !== null) {

		// get the ID
		var new_id = doc.getId();

		// escape dashes ("\\-" yields a literal backslash followed by a dash)
		new_id = new_id.replace(/-/g, "\\-");

		// change the id field
		doc.setField("id", new_id);
	}
	return doc;
}
;
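
One detail worth checking outside the pipeline is the replacement string. In a JavaScript or TypeScript string literal, "\-" evaluates to a plain "-", so the backslash has to be doubled to survive. The standalone sketch below shows this with a hypothetical helper, escapeDashes, and a made-up ID. It does not use the Fusion document API.

```typescript
// Hypothetical helper; in the pipeline stage the value would come from doc.getId().
function escapeDashes(id: string): string {
    // "\\-" is two characters: a backslash and a dash
    return id.replace(/-/g, "\\-");
}

const sampleId = "doc-2024-001"; // made-up ID for illustration
console.log(escapeDashes(sampleId)); // doc\-2024\-001
```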

Regex Field Extraction Index Stage