Regex in Search Console
Google Search Console now lets you use regex to filter both queries and pages, so the time has come for an in-depth introduction to RE2 (the library used in Search Console).
01. A short introduction to regular expressions
02. Google RE2 regex
03. Search Console regex cheatsheet
A little background to understand regular expressions
I really like gathering as much information as I can about this kind of topic to fully understand all the intricacies. Using regex in Search Console turned out to be a very interesting one. So if you are like me, the following lines should be worth your time.
A short introduction to regular expressions
Regular expressions are used for text processing tasks such as web scraping and string searching. If you are not yet aware of it, regular expressions come in different syntaxes, notably POSIX (Portable Operating System Interface) and Perl (the language developed by Larry Wall in 1987). The story behind regex is definitely worth a read, especially the history section on Wikipedia and the complete tutorial by Princeton University.
There are a lot of regular expression engines (28) and, as you have noticed or soon will, the library used in Search Console is slightly different from the ones we are used to. On Wikipedia, the tables named Part 1 and Part 2 compare the features of these libraries: they will help you understand what you can and can’t do using RE2.
I stated above that regex can be useful for web scraping, but we generally prefer XPath, as regex can be really expensive in terms of CPU (Central Processing Unit). XPath is not (or is less so), as you can ask precisely for a pattern in a very specific part of a web page.
Google RE2 regex
RE2 was developed by Google. As stated by the Google Software Engineering Team in a 2010 blog post (RE2: a principled approach to regular expression matching):
The feature-rich regular expression implementations of today are based on a backtracking search with a potential for exponential run time and unbounded stack usage. At Google, we use regular expressions as part of the interface to many external and internal systems, including Code Search, Sawzall, and Bigtable. Those systems process large amounts of data; exponential run time would be a serious problem. On a more practical note, these are multithreaded C++ programs with fixed-size stacks: the unbounded stack usage in typical regular expression implementations leads to stack overflows and server crashes. To solve both problems, we’ve built a new regular expression engine, called RE2, which is based on automata theory and guarantees that searches complete in linear time with respect to the size of the input and in a fixed amount of stack space.
By Russ Cox, Software Engineering Team
The sentence below sums it all up:
RE2 was designed and implemented with an explicit goal of being able to handle regular expressions from untrusted users without risk.
Paul Wankadia, on the RE2 repository
In short, using RE2 in Search Console makes perfect sense considering how widely users’ knowledge of regular expressions varies (from beginners to advanced users) and the large amount of data to process.
RE2 in Search Console is really powerful and fast considering (again) the amount of data. We should be grateful for that! While working at Oncrawl, I remember seeing a lot of users write problematic regexes that broke the crawl analysis or consumed too much CPU!
Search Console regex cheatsheet list
To write my regexes, I have a simple mnemonic technique: “I want everything that is / that is not [here comes the regular expression]”
Look for results beginning with a specific pattern
^google
^larry page
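As a rough way to try these two patterns outside Search Console, here is a minimal sketch using Python's standard `re` module. Note that `re` is not RE2; it is only used as a stand-in here because both patterns stick to syntax the two engines share. The sample queries and the `starts_with` helper are made up for illustration.

```python
import re

# Hypothetical sample queries, not real Search Console data.
queries = [
    "google search console",
    "larry page net worth",
    "who founded google",
]


def starts_with(pattern, items):
    """Return the items in which the (anchored) pattern is found."""
    compiled = re.compile(pattern)
    return [item for item in items if compiled.search(item)]


# ^ anchors the pattern to the beginning of the query.
google_first = starts_with(r"^google", queries)
larry_first = starts_with(r"^larry page", queries)

print(google_first)
print(larry_first)
```

The query "who founded google" is left out of both lists because `^` only accepts a match at the very beginning.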
Look for results beginning with specific patterns
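One way to express "begins with this or that" is to put the alternatives in a group after the `^` anchor. The pattern below is an illustrative assumption, not necessarily the one originally shown here, and it is tested with Python's `re` module as a stand-in for RE2 (grouping and `|` behave the same in both for this case).

```python
import re

# Hypothetical sample queries.
queries = [
    "google search console",
    "larry page net worth",
    "who founded google",
    "sergey brin",
]

# The group makes ^ apply to every alternative, not only the first one.
pattern = re.compile(r"^(google|larry page)")

matches = [q for q in queries if pattern.search(q)]
print(matches)
```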
Look for results that contain numbers
In a pattern such as [0-9]+, the + means that we look for queries or pages containing one or more digits in a row.
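A small sketch of the "one or more digits" idea, assuming the pattern `[0-9]+` (the same character class that appears in the `seo[0-9]+` example of this cheatsheet). Python's `re` module stands in for RE2, and the sample strings are hypothetical.

```python
import re

# Hypothetical mix of queries and page paths.
items = [
    "seo 2021",
    "top 10 seo tools",
    "seo tips",
    "/blog/2021/regex",
]

# [0-9] is any single digit; + asks for one or more of them in a row.
has_number = re.compile(r"[0-9]+")

matches = [item for item in items if has_number.search(item)]
print(matches)
```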
We can also look for more precise results, like: “I want everything that contains strictly two digits”:
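"Strictly two digits" can be read in two ways, so the sketch below shows both under stated assumptions: `[0-9]{2}` matches anything containing at least two digits in a row (so a four-digit year also matches), while the longer pattern requires that the two digits are not surrounded by other digits. Both are written without lookarounds so they stay within RE2 syntax; Python's `re` is used here only as a stand-in, and the sample queries are invented.

```python
import re

# Hypothetical sample queries.
queries = [
    "top 10 seo tools",
    "seo 2021",
    "iphone 12 pro",
    "5 seo tips",
]

# Loose reading: two digits in a row, possibly part of a longer number.
loose = re.compile(r"[0-9]{2}")

# Strict reading: exactly two digits, bounded by a non-digit or by the
# start/end of the string (no lookbehind/lookahead needed).
strict = re.compile(r"(^|[^0-9])[0-9]{2}([^0-9]|$)")

loose_matches = [q for q in queries if loose.search(q)]
strict_matches = [q for q in queries if strict.search(q)]

print(loose_matches)
print(strict_matches)
```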
We can go further and ask for queries or pages containing a specific pattern before (#1) or after (#2) the digits:
#1 > seo[0-9]+
#2 > [0-9]+seo
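To see how the two patterns differ, here is a minimal sketch run with Python's `re` module (a stand-in for RE2, which treats these two patterns the same way). The sample items are hypothetical.

```python
import re

# Hypothetical queries and page paths.
items = [
    "seo2021 checklist",
    "/blog/10seo-tips",
    "seo tips",
    "seo 2021",
]

digits_after = re.compile(r"seo[0-9]+")   # pattern #1: "seo" then digits
digits_before = re.compile(r"[0-9]+seo")  # pattern #2: digits then "seo"

after_matches = [i for i in items if digits_after.search(i)]
before_matches = [i for i in items if digits_before.search(i)]

print(after_matches)
print(before_matches)
```

Note that "seo 2021" matches neither pattern, because the space sits between "seo" and the digits.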
Look for queries or pages matching one of several conditions
I want everything that contains this or that or that…
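The "this or that or that" idea maps to the pipe (`|`) operator. The terms below are placeholders chosen for illustration, and Python's `re` is again only a stand-in for RE2.

```python
import re

# Hypothetical sample queries.
queries = [
    "seo audit",
    "sem agency",
    "ppc budget",
    "content marketing",
]

# Each | separates one acceptable alternative.
pattern = re.compile(r"seo|sem|ppc")

matches = [q for q in queries if pattern.search(q)]
print(matches)
```

Keep in mind that this is substring matching: an alternative such as `sem` would also be found inside a longer word like "semantic".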
Exclude a pattern from results
As we have seen, RE2 doesn’t give us much liberty when it comes to excluding patterns from results. Still, we can exclude the first word(s) of a query or page:
We can also use the pipe to exclude this or that, and so on.
Remember that RE2 does not support lookaheads, so we can’t exclude patterns with a negative lookahead.
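Since negative lookaheads such as `^(?!google)` are not available in RE2, one workaround is to write a positive pattern and invert the result of the match instead of encoding the exclusion in the pattern itself. The sketch below illustrates that idea with Python's `re` module; the queries and brand terms are hypothetical, and the sketch assumes the tool you use offers some form of "does not match" filter that plays the role of the `not` below.

```python
import re

# Hypothetical sample queries.
queries = [
    "google search console",
    "bing webmaster tools",
    "google analytics",
    "regex tutorial",
]

# Positive, RE2-compatible pattern: first word is "google" or "bing".
brand_first = re.compile(r"^(google|bing)")

# The exclusion happens outside the pattern, by negating the match.
non_brand = [q for q in queries if not brand_first.search(q)]
print(non_brand)
```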
Look for queries followed by a specific number of characters
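One possible reading of this heading is "a given term followed by a bounded number of characters". Under that assumption, a counted repetition such as `.{1,10}` does the job. The term, the bounds and the sample queries below are illustrative only, and Python's `re` stands in for RE2.

```python
import re

# Hypothetical sample queries.
queries = [
    "seo tips",
    "seo audit checklist template",
    "seo",
]

# "seo", a space, then between 1 and 10 characters up to the end of the query.
pattern = re.compile(r"^seo .{1,10}$")

matches = [q for q in queries if pattern.search(q)]
print(matches)
```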
Look for long tail queries
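A long tail query can be approximated as a query with many words. The sketch below assumes "five words or more" as the threshold, which is an arbitrary choice for illustration, and counts words as runs of non-space characters. It is tested with Python's `re` module as a stand-in; `\S`, `\s` and `{4,}` are also part of RE2 syntax.

```python
import re

# Hypothetical sample queries.
queries = [
    "seo",
    "regex search console",
    "how to use regex in search console",
]

# At least four "word + whitespace" groups, then one final word:
# five words or more in total.
long_tail = re.compile(r"^(\S+\s+){4,}\S+$")

long_matches = [q for q in queries if long_tail.search(q)]
print(long_matches)
```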
Look for short tail queries
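Conversely, a short tail query can be approximated as a query with only a few words. The limit of three words below is an assumption made for the example, and Python's `re` is only a stand-in for RE2.

```python
import re

# Hypothetical sample queries.
queries = [
    "seo",
    "regex search console",
    "how to use regex in search console",
]

# One word, optionally followed by up to two more words: three words at most.
short_tail = re.compile(r"^\S+(\s+\S+){0,2}$")

short_matches = [q for q in queries if short_tail.search(q)]
print(short_matches)
```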
The full syntax you can use with RE2 is available on its GitHub page.
Last but not least, consider testing your regex on regex101. Also, if you plan to run your regex in a tool other than Search Console, I recommend keeping an eye on the number of steps and the time in ms displayed in the top right corner to make sure your regex won’t be too expensive.
If this article saved you some time on learning regular expressions for Google Search Console, consider adding caffeine in my Chemex!