Regex: Tokenize repeating characters in a string

TL;DR: Tokenize the repeating characters in a string with (\w)\1*.


Some coding challenges (and the occasional real-world problem) require you to act on the runs of repeating characters in a string. Two examples are finding binary gaps, like the run of six zeroes in 1101000000101, and run-length encoding (RLE).

You can quickly tokenize a string's repeating characters using the regular expression (\w)\1*.

  • (\w) captures a single word character: a letter, a digit, or an underscore
  • \1* is a backreference that matches zero or more further repetitions of the character captured by the first group
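To see both parts of the pattern at work, here is a minimal Python sketch. The sample string and variable names are made up for illustration. It pairs each full run (group 0) with the character the capture group caught (group 1), and it shows that \w accepts underscores but skips spaces and punctuation.

```python
import re

pattern = re.compile(r"(\w)\1*")

# Illustrative input: letters, a run of underscores, a space, and punctuation.
sample = "aaa__b !!"

# group(0) is the whole run; group(1) is the single captured character.
runs = [(m.group(0), m.group(1)) for m in pattern.finditer(sample)]

print(runs)
# => [('aaa', 'a'), ('__', '_'), ('b', 'b')]
```

If you also need runs of spaces or punctuation, swapping \w for a broader class such as . is one option, though that goes beyond the pattern discussed here.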

Here's how it looks in JavaScript:

const test = "1101000000101";
const regex = /(\w)\1*/g;

const tokens = test.match(regex);

console.log(tokens); 
// => [ '11', '0', '1', '000000', '1', '0', '1' ]

And here's how it looks in Python:

import re

test = "1101000000101"
regex = r"(\w)\1*"

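# re.findall would return only the captured group (one character per run),
# so iterate over the match objects and take the full match instead.
tokens = [m.group(0) for m in re.finditer(regex, test)]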

print(tokens)
# => ['11', '0', '1', '000000', '1', '0', '1']
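Once the string is tokenized, the two challenges mentioned at the start become short list operations. The following is a rough sketch, not a definitive solution: the function names (tokenize_runs, run_length_encode, longest_binary_gap) and the count-then-character RLE format are assumptions chosen for illustration, and the input is assumed to contain only word characters.

```python
import re

RUN = re.compile(r"(\w)\1*")


def tokenize_runs(text):
    """Split text into runs of identical word characters (hypothetical helper)."""
    return [m.group(0) for m in RUN.finditer(text)]


def run_length_encode(text):
    """Encode each run as <count><character>, e.g. 'aaab' -> '3a1b'.

    Assumed format; it is ambiguous if the input itself contains digits.
    """
    return "".join(f"{len(run)}{run[0]}" for run in tokenize_runs(text))


def longest_binary_gap(bits):
    """Length of the longest run of zeroes that has a 1 on both sides."""
    runs = tokenize_runs(bits)
    # Drop the first and last runs: a zero run there is not enclosed by ones.
    inner = runs[1:-1]
    return max((len(run) for run in inner if run[0] == "0"), default=0)


print(run_length_encode("aaabccdddd"))
# => 3a1b2c4d
print(longest_binary_gap("1101000000101"))
# => 6
```

Because consecutive runs always alternate characters, every zero run that is neither first nor last is necessarily surrounded by ones, which is why slicing off the two ends is enough for the binary gap.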

