Regex: Tokenize repeating characters in a string
TL;DR: Use (\w)\1* to tokenize the repeating characters in a string.
Some coding challenges (and the occasional real-world problem) require you to act on the runs of repeating characters in a string. Two examples are finding binary gaps, such as the run of six zeroes in 1101000000101, and run-length encoding (RLE).
The regular expression (\w)\1* quickly tokenizes a string's repeating characters. It has two parts:
- (\w) captures any single word character (a letter, digit, or underscore)
- \1* matches zero or more further repetitions of that captured character
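As a quick check of how the two parts behave, here is a minimal Python sketch. The sample string is made up for illustration; it includes an underscore and hyphens to show that \w treats the underscore as a word character and that non-word characters are skipped rather than tokenized. The (.)\1* variant at the end is an assumption about what you might want if every character should count, not part of the original pattern.

```python
import re

pattern = re.compile(r"(\w)\1*")

# Hypothetical input containing word and non-word characters.
sample = "aa_b--cc"

# group(1) is the single captured character; group(0) is the whole run
# matched by (\w) followed by \1*.
runs = [(m.group(1), m.group(0)) for m in pattern.finditer(sample)]
print(runs)
# => [('a', 'aa'), ('_', '_'), ('b', 'b'), ('c', 'cc')]

# Swapping \w for . also tokenizes runs of non-word characters
# (. does not match newlines unless re.DOTALL is set).
all_runs = [m.group(0) for m in re.finditer(r"(.)\1*", sample)]
print(all_runs)
# => ['aa', '_', 'b', '--', 'cc']
```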
Here's how it looks in JavaScript:
const test = "1101000000101";
const regex = /(\w)\1*/g;
const tokens = test.match(regex);
console.log(tokens);
// => [ '11', '0', '1', '000000', '1', '0', '1' ]
And here's how it looks in Python:
import re
test = "1101000000101"
regex = r"(\w)\1*"
# re.findall would return only the captured group, so take group 0 of each match
tokens = [m.group(0) for m in re.finditer(regex, test)]
print(tokens)
# => ['11', '0', '1', '000000', '1', '0', '1']
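Once the runs are tokenized, the two challenges mentioned earlier take only a few lines each. The following is a rough sketch, not a definitive solution: the function names run_length_encode and longest_binary_gap are hypothetical, the binary gap is assumed to mean a run of zeroes with a 1 on both sides, and the input to longest_binary_gap is assumed to contain only the characters 0 and 1.

```python
import re

RUN = re.compile(r"(\w)\1*")

def run_length_encode(text):
    """Return (character, run length) pairs for each run of word characters."""
    return [(m.group(1), len(m.group(0))) for m in RUN.finditer(text)]

def longest_binary_gap(bits):
    """Length of the longest run of zeroes enclosed by ones (assumed definition).

    Assumes bits contains only '0' and '1', so tokens alternate between
    runs of ones and runs of zeroes.
    """
    tokens = [m.group(0) for m in RUN.finditer(bits)]
    # Drop the first and last tokens: a run of zeroes there is not enclosed.
    inner = tokens[1:-1]
    return max((len(t) for t in inner if t[0] == "0"), default=0)

encoded = run_length_encode("1101000000101")
gap = longest_binary_gap("1101000000101")
print(encoded)
# => [('1', 2), ('0', 1), ('1', 1), ('0', 6), ('1', 1), ('0', 1), ('1', 1)]
print(gap)
# => 6
```

Leading or trailing zeroes are not counted as a gap under this definition, so longest_binary_gap("1000") returns 0.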
Regular expression disclaimer and checklist:
- Test it at regex101.com
- Visualize it at Regexper
- Check for catastrophic backtracking and denial of service vulnerabilities.