Assorted others | Cyber Pontification

A couple more things I wanted to mention

Markdown

While HTML is a powerful language for creating documents, there are times when a simpler syntax is sufficient.

Markdown uses symbols to offer a subset of HTML. This has the benefit of being more concise and readable, even when the symbols haven’t been converted to a display format.

Markdown covers headers, links, lists, and emphasis. People have expanded the syntax to include new features, like tables, call-outs and definition lists.

Browsers don’t support Markdown directly, but it can be converted to HTML, and an increasing number of websites use it for user-provided content.

Here are a couple of sites showing the syntax:

RegEx

Regular Expression (RegEx) is a syntax that allows people to search for a pattern. Rather than looking for a particular term, you can specify what characters you want and how many times they should appear. In other words, you describe the shape the data should have.

The simplest search would be a series of letters, numbers, and spaces. These characters don’t have any special meaning, so they would work like a basic search. Just be aware that it is case-sensitive by default, so result is not the same as RESULT.
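As a rough sketch of that, here is a plain search in Python. The language is my pick for illustration and the sample sentence is made up; `re.IGNORECASE` is Python’s way of turning case sensitivity off.

```python
import re

# Made-up sample text containing both spellings.
text = "The RESULT was logged, but the final result is pending."

# A plain pattern behaves like an ordinary, case-sensitive search.
lowercase_hits = re.findall(r"result", text)

# Passing re.IGNORECASE switches case sensitivity off.
all_hits = re.findall(r"result", text, flags=re.IGNORECASE)

print(lowercase_hits)  # ['result']
print(all_hits)        # ['RESULT', 'result']
```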

Multiple choices?

If you’re looking for multiple things, you can tell the regex to look for this or that. Just use a vertical bar | to connect the choices. E.g. cat|dog. This works well by itself if there are a finite number of options.
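A minimal Python sketch of the vertical bar, using an invented sentence:

```python
import re

# Made-up sample sentence.
sentence = "My cat chased the dog, then the bird."

# The vertical bar means "this or that".
pets = re.findall(r"cat|dog", sentence)

print(pets)  # ['cat', 'dog']
```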

Where are you looking?

By default, the regex will take the text you give it and look everywhere for the pattern. This is a sensible default, but you might want to narrow it down. You can refer to the start or end of the input.

Use the caret (^) for the start and the dollar sign ($) for the end. Taking the URL structure from Infrastructure, we can look for secure URLs by checking the protocol with ^https and look for PDFs with pdf$. However, the PDF check won’t match URLs that end in a query string.
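To see the anchors in action, here is a small Python sketch. The URLs are placeholders I made up, and the third one shows the query string problem.

```python
import re

# Hypothetical URLs for illustration only.
urls = [
    "https://example.com/report.pdf",
    "http://example.com/guide.pdf",
    "https://example.com/report.pdf?download=1",
]

# ^https only matches when the URL starts with https.
secure = [u for u in urls if re.search(r"^https", u)]

# pdf$ only matches when the URL ends with pdf.
pdfs = [u for u in urls if re.search(r"pdf$", u)]

print(secure)  # the first and third URLs
print(pdfs)    # the first and second URLs; the query string hides the third
```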

Choosing characters

If you want to capture any character, you can use a full stop (.). This will match a single instance of any character. In most regex engines, that excludes line breaks by default.
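A quick Python sketch with invented words shows that the full stop stands for exactly one character, no more and no fewer:

```python
import re

# Made-up list of words separated by spaces.
words = "cat cot cut ct coat"

# c.t means: a "c", then any single character, then a "t".
hits = re.findall(r"c.t", words)

print(hits)  # ['cat', 'cot', 'cut'] ("ct" and "coat" don't fit)
```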

Make your own group

You can specify particular characters by listing them inside square brackets. This will cover all the vowels: [aeiou].

You can include any character. Just don’t put a caret (^) as the first character, as this changes the group to mean anything except these characters. [^aeiou] matches anything but vowels.

You can also use ranges. So you can get all letters with [A-Za-z]. This works as computers represent all characters with numbers, so groups like letters and numbers are stored sequentially. While you could look for less intuitive sequences to use as a range, it wouldn’t be recommended.
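Here is how the three kinds of group behave in a short Python sketch (the sample string is made up):

```python
import re

sample = "Regex 101"

# [aeiou] matches any one lowercase vowel.
vowels = re.findall(r"[aeiou]", sample)

# A leading caret flips the group: anything except these vowels.
non_vowels = re.findall(r"[^aeiou]", sample)

# Ranges cover runs of characters, here all upper and lower case letters.
letters = re.findall(r"[A-Za-z]", sample)

print(vowels)      # ['e', 'e']
print(non_vowels)  # ['R', 'g', 'x', ' ', '1', '0', '1']
print(letters)     # ['R', 'e', 'g', 'e', 'x']
```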

Premade groups

These are shorter representations of common groups, so they don’t need to be redefined each time. As with the caret, you can get the opposite group: use the capital version of the letter. For example, \D matches anything that isn’t a digit.

Digits \d

This covers the numbers [0-9].

Spaces \s

This represents spaces, tabs and new lines.

Words \w

A bit misleading: this covers letters, numbers, and underscores, not just letters.
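A small Python sketch of the premade groups and their capital opposites, run against a made-up string. One caveat: on ordinary Python strings these shorthands are Unicode-aware, so they can match more than the ASCII characters shown here unless you pass `re.ASCII`.

```python
import re

sample = "a_1 b2"

digits = re.findall(r"\d", sample)       # digits only
spaces = re.findall(r"\s", sample)       # whitespace only
word_chars = re.findall(r"\w", sample)   # letters, digits and underscore

# The capital versions match the opposite group.
non_digits = re.findall(r"\D", sample)
non_word_chars = re.findall(r"\W", sample)

print(digits)          # ['1', '2']
print(spaces)          # [' ']
print(word_chars)      # ['a', '_', '1', 'b', '2']
print(non_digits)      # ['a', '_', ' ', 'b']
print(non_word_chars)  # [' ']
```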

How many?

The previous section shows you how to refer to particular characters, but it only looks for a single instance. How do we get more?

Quantifiers can be explicit numbers or vaguer categories. By default, quantifiers are greedy: they take the largest possible match first, then give characters back until the rest of the regex can match.

There are a few characters that capture some rough quantities.

Asterisk *

This covers zero or more occurrences, so any number of matches, including none.

Plus +

This matches one or more occurrences. The thing I’m looking for must exist.

Question mark ?

This represents zero or one matches. This item is optional.

If you want more precise quantities, you can use curly brackets. A single value can be given {5}, or it can take a minimum and a maximum {4,6}.
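The quantifiers above can be sketched in Python like this. The `matches` helper is something I’ve invented for the example; it just checks whether the whole string fits the pattern. The last line shows the greedy behaviour.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Hypothetical helper: True when the whole of `text` fits `pattern`."""
    return re.fullmatch(pattern, text) is not None

# * : zero or more, so the "b" may be missing entirely.
print(matches(r"ab*", "a"), matches(r"ab*", "abbb"))                # True True

# + : one or more, so at least one "b" must exist.
print(matches(r"ab+", "a"), matches(r"ab+", "abbb"))                # False True

# ? : zero or one, so the "u" is optional.
print(matches(r"colou?r", "color"), matches(r"colou?r", "colour"))  # True True

# {5} : exactly five digits. {4,6} : between four and six digits.
print(matches(r"\d{5}", "12345"), matches(r"\d{5}", "1234"))        # True False
print(matches(r"\d{4,6}", "1234567"))                               # False

# Greedy by default: .+ takes as much as it can while still ending on ">".
greedy = re.search(r"<.+>", "<b>bold</b>").group()
print(greedy)  # <b>bold</b>
```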

Capture groups

Sometimes you just use a regex to find a match. Other times, you need to do something with parts of the match. Wrap the parts of interest in brackets () and then you can reference them in a substitution using a dollar and the ordinal number of the capture group. E.g. $2 for the second group.
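Here is a rough Python sketch of capture groups, using a made-up date. Note that Python’s `re` module writes the reference in a substitution as `\2` rather than `$2`; the idea is the same.

```python
import re

# Made-up date in year-month-day order.
date = "2024-05-01"

# Three capture groups: year, month, day.
pattern = r"(\d{4})-(\d{2})-(\d{2})"

# Read one group out of a match.
match = re.search(pattern, date)
month = match.group(2)

# Reorder the groups in a substitution (Python uses \1, \2, \3).
reordered = re.sub(pattern, r"\3/\2/\1", date)

print(month)      # 05
print(reordered)  # 01/05/2024
```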

Examples

One of the simplest regexes is .*, which means any character, any number of times. This is useful if you need to capture everything between other specific items.
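For instance, in this Python sketch .* sits inside a capture group to grab whatever falls between two fixed markers. The input line and its field names are invented.

```python
import re

# Hypothetical key=value line.
line = "name=Ada Lovelace;role=engineer"

# Everything between "name=" and ";role" lands in the capture group.
between = re.search(r"name=(.*);role", line).group(1)

print(between)  # Ada Lovelace
```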

This regex will match a full, four-character IPv6 segment: [0-9A-Fa-f]{4}. The square brackets describe the characters you want. In this case, the digits 0 to 9 and the letters A to F in both upper and lower case. The 4 in the braces specifies how many times you want the preceding group.
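Trying that in Python against a sample address written out in full (the address is just example data) pulls out all eight segments. A segment with its leading zero dropped is too short for {4}, as the last check shows.

```python
import re

# Sample IPv6 address with every segment written in full.
address = "2001:0db8:85a3:0000:0000:8a2e:0370:7334"

segment = r"[0-9A-Fa-f]{4}"

segments = re.findall(segment, address)
print(segments)
# ['2001', '0db8', '85a3', '0000', '0000', '8a2e', '0370', '7334']

# A shortened segment such as "db8" has only three characters.
short_ok = re.fullmatch(segment, "db8") is not None
print(short_ok)  # False
```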

If you need to work with regexes, a site like regex101 is useful, as it explains the regex and lets you test it against examples. Unless you’re working with one of the other languages given in the “Flavor” list, choose ECMAScript (JavaScript). Screaming Frog is written in Java, so that’s one exception.