How to Use Regular Expressions to Solve Common Data Validation Tasks
Regular expressions (regex) are powerful tools for searching, matching, and manipulating text. Whether you’re validating user input, extracting data from logs, or performing complex text replacements, regex can save you time and code. In this guide, we cover a wide range of regex patterns for common programming challenges such as validating emails, dates, IP addresses, numbers, and more. While the examples provided here are tailored for JavaScript, the concepts are applicable across many programming languages.
Regex Basics: Common Patterns and Their Meanings
Below is a table summarizing essential regex tokens along with their descriptions and examples. Understanding these building blocks is key to mastering more advanced patterns.
Expression | Description | Example |
---|---|---|
^ | Matches the start of a string. | ^Hello matches "Hello" in "Hello World". |
$ | Matches the end of a string. | World$ matches "World" in "Hello World". |
. | Matches any single character except newline. | a.c matches "abc" or "adc", but not "ac". |
* | Matches 0 or more occurrences of the preceding element. | ab* matches "a", "ab", "abbb", etc. |
+ | Matches 1 or more occurrences of the preceding element. | ab+ matches "ab", "abb", but not just "a". |
? | Matches 0 or 1 occurrence of the preceding element. | ab? matches "a" or "ab". |
{n} | Matches exactly n occurrences of the preceding element. | a{3} matches "aaa". |
{n,} | Matches n or more occurrences of the preceding element. | a{2,} matches "aa", "aaa", etc. |
{n,m} | Matches between n and m occurrences of the preceding element. | a{1,3} matches "a", "aa", or "aaa". |
[abc] | Matches any character listed inside the brackets. | [abc] matches "a", "b", or "c". |
[^abc] | Matches any character not listed inside the brackets. | [^abc] matches "d" but not "a", "b", or "c". |
(xyz) | Groups and captures the sequence of characters "xyz". | (abc)+ matches "abc", "abcabc", etc. |
\| | Acts as a logical OR between expressions. | cat\|dog matches either "cat" or "dog". |
\d | Matches any digit (0-9). | \d+ matches "123", "456", etc. |
\D | Matches any non-digit character. | \D+ matches "abc", "XYZ", etc. |
\w | Matches any word character (letters, digits, or underscore). | \w+ matches "hello123" or "word_". |
\W | Matches any non-word character. | \W+ matches punctuation like "!", "@#$", etc. |
\s | Matches any whitespace character (spaces, tabs, line breaks). | \s+ matches spaces, tabs, etc. |
\S | Matches any non-whitespace character. | \S+ matches "hello", "world", etc. |
\b | Matches a word boundary (the position between a word and a non-word character). | \bcat\b matches "cat" as a whole word but not within "scatter". |
\B | Matches a non-word boundary. | \Bcat\B matches "cat" within "scatter" but not as an isolated word. |
1. Emulating DOTALL in JavaScript
Many regex engines provide a DOTALL flag, which makes the . metacharacter match newline characters as well. JavaScript has supported this as the s flag since ES2018, but older engines do not. To emulate DOTALL behavior there, replace each . with [\S\s], which matches any character (whitespace or non-whitespace).
/[\S\s]*/
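To make the difference concrete, here is a minimal sketch comparing a plain dot, the [\S\s] emulation, and the native s flag. The sample string and variable names are illustrative assumptions, and the last line assumes an engine with ES2018 support.

```javascript
const text = "first line\nsecond line";

// A plain "." stops at the newline, so this does not match across lines.
const plainDot = /first.*second/.test(text);

// Emulation: [\S\s] matches any character, including the newline.
const emulated = /first[\S\s]*second/.test(text);

// Native dotAll flag (assumes an ES2018+ engine).
const dotAll = /first.*second/s.test(text);

console.log(plainDot, emulated, dotAll); // false true true
```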
2. Validating Email Addresses
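Email validation via regex is notoriously challenging due to the complexity of the RFC specifications. The following pattern handles most everyday addresses, with two caveats: it accepts only the listed top-level domains plus any two-letter country code, so addresses on newer, longer TLDs are rejected, and it also allows an optional :port suffix. For more details on the limitations of email regex patterns, refer to this comprehensive comparison.
Tip: Always match emails case-insensitively (the i flag).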
/^[-a-z0-9~!$%^&*_=+}{'?]+(\.[-a-z0-9~!$%^&*_=+}{'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+)*\.(aero|arpa|biz|com|coop|edu|gov|info|int|mil|museum|name|net|org|pro|travel|mobi|[a-z][a-z])|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/i
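As a rough usage sketch, the pattern can be wrapped in a small helper. The name isValidEmail and the sample addresses are assumptions made for illustration only; the last call shows that a TLD outside the pattern's list is rejected.

```javascript
// Pattern copied from above; the i flag makes matching case-insensitive.
const emailPattern = /^[-a-z0-9~!$%^&*_=+}{'?]+(\.[-a-z0-9~!$%^&*_=+}{'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+)*\.(aero|arpa|biz|com|coop|edu|gov|info|int|mil|museum|name|net|org|pro|travel|mobi|[a-z][a-z])|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/i;

// Hypothetical helper: trims surrounding whitespace, then tests the format.
function isValidEmail(input) {
  return emailPattern.test(input.trim());
}

console.log(isValidEmail("user@example.com"));             // true
console.log(isValidEmail("First.Last@sub.example.co.uk")); // true
console.log(isValidEmail("no-at-sign.example.com"));       // false
console.log(isValidEmail("user@example.technology"));      // false (TLD not in the list)
```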
3. Validating IPv4 Addresses
A proper IPv4 regex ensures that each octet is between 0 and 255. This pattern does exactly that. If you plan to match an IP address within a larger string, consider using word boundaries (\b) instead of ^ and $.
/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/
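The sketch below builds both variants from one shared octet fragment: an anchored pattern for whole-string validation and a \b-delimited pattern for finding addresses inside text. The variable names and sample strings are assumptions for illustration.

```javascript
// One octet, 0-255, as in the pattern above.
const octet = "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";

// Anchored: the whole string must be an IPv4 address.
const ipv4Exact = new RegExp(`^(?:${octet}\\.){3}${octet}$`);

// Word boundaries: find addresses embedded in a longer string.
const ipv4InText = new RegExp(`\\b(?:${octet}\\.){3}${octet}\\b`, "g");

console.log(ipv4Exact.test("192.168.1.255")); // true
console.log(ipv4Exact.test("256.1.1.1"));     // false

const found = "Ping from 10.0.0.7 to 172.16.254.1 ok".match(ipv4InText);
console.log(found); // [ '10.0.0.7', '172.16.254.1' ]
```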
4. Validating Dates
Regular expressions can confirm the format of a date, but they can’t fully validate its authenticity (e.g., checking for leap years). The patterns below verify the structure and the valid range for days in each month, though leap years are not accounted for.
4.1 ISO Date Format (yyyy-mm-dd)
/^[0-9]{4}-(?:(?:(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01]))|(02-(0[1-9]|[12][0-9]))|((0[469]|11)-(0[1-9]|[12][0-9]|30)))$/
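Because the pattern accepts February 29 in every year, one way to close the gap is a small leap-year check after the regex passes. This is a sketch under that assumption; isLeapYear and isRealIsoDate are hypothetical helper names, and the Gregorian leap-year rule is applied to all years.

```javascript
const isoDate = /^[0-9]{4}-(?:(?:(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01]))|(02-(0[1-9]|[12][0-9]))|((0[469]|11)-(0[1-9]|[12][0-9]|30)))$/;

// Gregorian rule: divisible by 4, except centuries not divisible by 400.
function isLeapYear(year) {
  return (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;
}

function isRealIsoDate(value) {
  // The regex already enforces format and per-month day ranges.
  if (!isoDate.test(value)) return false;
  const year = Number(value.slice(0, 4));
  const monthDay = value.slice(5); // "mm-dd"
  // The only case the regex lets through incorrectly is Feb 29 in a common year.
  return !(monthDay === "02-29" && !isLeapYear(year));
}

console.log(isRealIsoDate("2024-02-29")); // true
console.log(isRealIsoDate("2023-02-29")); // false
console.log(isRealIsoDate("2023-04-31")); // false (rejected by the regex)
```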
4.2 ISO Date with Flexible Separators
This regex accepts -, /, ., or a space as the separator, ensuring the same separator is used throughout the date.
/^[0-9]{4}([- /.])(?:(?:(0[13578]|1[02])\1(0[1-9]|[12][0-9]|3[01]))|(02\1(0[1-9]|[12][0-9]))|((0[469]|11)\1(0[1-9]|[12][0-9]|30)))$/
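The consistency check comes from the back-reference \1, which must repeat whatever separator the first group captured. A short sketch with assumed sample inputs:

```javascript
const flexibleIsoDate = /^[0-9]{4}([- /.])(?:(?:(0[13578]|1[02])\1(0[1-9]|[12][0-9]|3[01]))|(02\1(0[1-9]|[12][0-9]))|((0[469]|11)\1(0[1-9]|[12][0-9]|30)))$/;

console.log(flexibleIsoDate.test("2024/03/15")); // true
console.log(flexibleIsoDate.test("2024.03.15")); // true
console.log(flexibleIsoDate.test("2024 03 15")); // true
console.log(flexibleIsoDate.test("2024-03/15")); // false (mixed separators)

// Group 1 holds the separator that was actually used.
const separator = "2024/03/15".match(flexibleIsoDate)[1];
console.log(separator); // "/"
```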
4.3 United States Date Format (mm/dd/yyyy)
/^(?:(?:(0[13578]|1[02])\/(0[1-9]|[12][0-9]|3[01]))|(02\/(0[1-9]|[12][0-9]))|((0[469]|11)\/(0[1-9]|[12][0-9]|30)))\/[0-9]{4}$/
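4.4 24-Hour Time Format (HH:MM)
Note that this pattern also accepts a single-digit hour (e.g., 9:05) and an optional seconds field (HH:MM:SS).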
/^(20|21|22|23|[01]\d|\d)(:[0-5]\d){1,2}$/
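As an illustration, a validated time string can be converted to seconds since midnight. The helper name toSeconds is an assumption, and the seconds field defaults to 0 when it is absent.

```javascript
const time24 = /^(20|21|22|23|[01]\d|\d)(:[0-5]\d){1,2}$/;

// Hypothetical helper: returns seconds since midnight, or null if invalid.
function toSeconds(value) {
  if (!time24.test(value)) return null;
  const [hours, minutes, seconds = 0] = value.split(":").map(Number);
  return hours * 3600 + minutes * 60 + seconds;
}

console.log(toSeconds("9:05"));     // 32700
console.log(toSeconds("23:59:59")); // 86399
console.log(toSeconds("24:00"));    // null
console.log(toSeconds("12:60"));    // null
```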
5. Validating Numbers
Number validation can vary greatly depending on your requirements—whether you need to validate integers, decimals, currency formats, or numbers within a specific range. Below are several examples addressing common cases.
5.1 Positive Integers (Any Length)
/^\d+$/
5.2 Positive Integers (Up to 10 Digits)
/^\d{1,10}$/
5.3 Positive Integers (Fixed 5 Digits)
/^\d{5}$/
5.4 Negative Integers (Any Length)
/^-\d+$/
5.5 Negative Integers (Up to 10 Digits)
/^-\d{1,10}$/
5.6 Negative Integers (Fixed 5 Digits)
/^-\d{5}$/
5.7 Integers (Optional Negative Sign)
/^-?\d+$/
5.8 Integers (Up to 10 Digits)
/^-?\d{1,10}$/
5.9 Integers (Fixed 5 Digits)
/^-?\d{5}$/
5.10 Numbers with Optional Decimals
/^-?\d*\.?\d+$/
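Validating before conversion matters because JavaScript's Number() is more permissive than this pattern (it accepts input such as "1e3", for example). A minimal sketch, where parseDecimal is a hypothetical helper name:

```javascript
const decimalNumber = /^-?\d*\.?\d+$/;

// Hypothetical helper: returns a number, or null when the format is rejected.
function parseDecimal(input) {
  const trimmed = input.trim();
  return decimalNumber.test(trimmed) ? Number(trimmed) : null;
}

console.log(parseDecimal("42"));     // 42
console.log(parseDecimal("-.5"));    // -0.5
console.log(parseDecimal(" 7.25 ")); // 7.25
console.log(parseDecimal("3."));     // null (a digit must follow the dot)
console.log(parseDecimal("1e3"));    // null (exponent notation is rejected)
```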
5.11 Numbers with Exactly 2 Decimal Places
/^-?\d*\.\d{2}$/
5.12 Currency Format
This regex validates currency numbers with an optional dollar sign, optional negative sign, thousand separators, and up to two decimal places. It accepts formats like "$1,000,000.00", "10000.12", and "0.00".
/^\$?-?([1-9]\d{0,2}(,\d{3})*(\.\d{1,2})?|[1-9]\d*(\.\d{1,2})?|0(\.\d{1,2})?|\.\d{1,2})$/
5.13 Numbers from 0 to 100 with Optional Decimals
/^(100(\.0+)?|[1-9]?\d(\.\d+)?)$/
6. Validating Feet and Inches Notation
To validate measurements in the format F'I" (e.g., 6'11"), use the following pattern. This regex ensures that inches are less than 12.
/^\d+'(0|[1-9]|1[0-1])"$/
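If the measurement is needed as a number, capture groups can pull out the two parts. The sketch below adds a capture group around the feet digits, which the pattern above does not have; toInches is a hypothetical helper name.

```javascript
// Same pattern as above, with an extra capture group around the feet value.
const feetInches = /^(\d+)'(0|[1-9]|1[0-1])"$/;

// Hypothetical helper: returns total inches, or null if the format is invalid.
function toInches(value) {
  const match = feetInches.exec(value);
  if (!match) return null;
  return Number(match[1]) * 12 + Number(match[2]);
}

console.log(toInches("6'11\"")); // 83
console.log(toInches("5'0\""));  // 60
console.log(toInches("5'12\"")); // null (inches must be below 12)
```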
7. Validating Hexadecimal Color Codes
Hex color codes may optionally begin with a "#" and be either 3 or 6 hexadecimal digits. This regex handles both formats.
/^#?([a-f0-9]{6}|[a-f0-9]{3})$/i
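After validation it is often convenient to normalize both accepted forms to one canonical value. The following sketch expands 3-digit codes to 6 digits and lowercases the result; normalizeHex is a hypothetical helper name.

```javascript
const hexColor = /^#?([a-f0-9]{6}|[a-f0-9]{3})$/i;

// Hypothetical helper: returns "#rrggbb" in lowercase, or null if invalid.
function normalizeHex(value) {
  const match = hexColor.exec(value);
  if (!match) return null;
  let digits = match[1].toLowerCase();
  if (digits.length === 3) {
    // Shorthand: each digit is doubled, so "fa0" becomes "ffaa00".
    digits = [...digits].map((c) => c + c).join("");
  }
  return "#" + digits;
}

console.log(normalizeHex("FA0"));     // "#ffaa00"
console.log(normalizeHex("#1a2b3c")); // "#1a2b3c"
console.log(normalizeHex("#12345"));  // null
```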
8. Checking for Alphanumeric Values
While the \w character class matches letters, digits, and underscores, you may want to allow only letters and numbers. Use the following regex:
/^[a-zA-Z0-9]+$/
9. Validating Social Security Numbers (SSN)
SSNs in the United States consist of nine digits, often separated by hyphens. Note that this regex validates the format only and does not check for authenticity.
/^\d{3}-?\d{2}-?\d{4}$/
10. Validating Canadian Social Insurance Numbers (SIN)
Canadian SINs consist of nine digits and may include spaces or hyphens as separators. This regex enforces consistent use of the separator. For full validation, the checksum digit should be computed separately.
/^\d{3}([\s-])?\d{3}\1\d{3}$/
11. Validating US Zip Codes
US zip codes can be either 5 digits or in the Zip+4 format. Use one of the following regex patterns:
Zip Code (5 digits):
/^\d{5}$/
Zip+4 Format (5 digits, optionally followed by a hyphen and 4 digits):
/^\d{5}(-\d{4})?$/
12. Validating Canadian Postal Codes
Canadian postal codes follow the pattern A9A 9A9. This regex accepts an optional space between the two groups.
/^[ABCEGHJKLMNPRSTVXY]\d[A-Z] ?\d[A-Z]\d$/i
13. Extracting Filenames from Windows Paths
Windows paths are separated by backslashes. This regex extracts the filename (or last directory) from the path. Note that without additional context, it cannot distinguish between a file and a folder.
/[^\\]+$/
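A short sketch of the extraction in use. The helper name baseName and the sample paths are assumptions; a path that ends in a backslash yields an empty string because nothing follows the final separator.

```javascript
const lastSegment = /[^\\]+$/;

// Hypothetical helper: returns the final path segment, or "" if there is none.
function baseName(path) {
  const match = path.match(lastSegment);
  return match ? match[0] : "";
}

console.log(baseName("C:\\Users\\alice\\report.pdf")); // "report.pdf"
console.log(baseName("notes.txt"));                    // "notes.txt"
console.log(baseName("C:\\Users\\alice\\"));           // ""
```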
14. Validating US or Canadian Telephone Numbers
Telephone numbers can have various formats. The regex below accepts numbers like "999-999-9999", "9999999999", and "(999) 999-9999". Adjust your UI to enforce formatting, if necessary.
/^(?:\d{10}|\(?[0-9]{3}\)?[ .-]?[0-9]{3}[ .-][0-9]{4})$/
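Once a number passes validation, it can be reduced to its digits and re-rendered in a single display format. This sketch assumes the alternation is wrapped in one group so that ^ and $ apply to both branches; normalizePhone is a hypothetical helper name.

```javascript
// Both alternatives sit inside one group, so the anchors cover each of them.
const phone = /^(?:\d{10}|\(?[0-9]{3}\)?[ .-]?[0-9]{3}[ .-][0-9]{4})$/;

// Hypothetical helper: returns "(NNN) NNN-NNNN", or null if the input is rejected.
function normalizePhone(input) {
  if (!phone.test(input)) return null;
  const digits = input.replace(/\D/g, ""); // keep digits only
  return `(${digits.slice(0, 3)}) ${digits.slice(3, 6)}-${digits.slice(6)}`;
}

console.log(normalizePhone("999-999-9999"));   // "(999) 999-9999"
console.log(normalizePhone("9999999999"));     // "(999) 999-9999"
console.log(normalizePhone("(999) 999-9999")); // "(999) 999-9999"
console.log(normalizePhone("9999999999 ext")); // null
```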
15. Validating Credit Card Numbers
While regex can ensure that a credit card number conforms to expected lengths and starting digits, always use the Luhn algorithm to validate the checksum.
VISA:
/^4[0-9]{12}(?:[0-9]{3})?$/
MasterCard:
/^5[1-5][0-9]{14}$/
American Express:
/^3[47][0-9]{13}$/
Diners Club:
/^3(?:0[0-5]|[68][0-9])[0-9]{11}$/
Discover:
/^6(?:011|5[0-9]{2})[0-9]{12}$/
JCB:
/^(?:2131|1800|35\d{3})\d{11}$/
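The sketch below combines the brand patterns with a Luhn checksum, as recommended above. The function names, the decision to strip spaces and hyphens first, and the sample numbers (widely published test numbers, not real cards) are all assumptions for illustration.

```javascript
const cardPatterns = {
  visa: /^4[0-9]{12}(?:[0-9]{3})?$/,
  mastercard: /^5[1-5][0-9]{14}$/,
  amex: /^3[47][0-9]{13}$/,
  dinersClub: /^3(?:0[0-5]|[68][0-9])[0-9]{11}$/,
  discover: /^6(?:011|5[0-9]{2})[0-9]{12}$/,
  jcb: /^(?:2131|1800|35\d{3})\d{11}$/,
};

// Luhn: from the right, double every second digit (subtracting 9 when the
// result exceeds 9), then check that the total is divisible by 10.
function passesLuhn(digits) {
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// Hypothetical helper: returns the brand name, or null if format or checksum fails.
function identifyCard(number) {
  const digits = number.replace(/[ -]/g, "");
  const brand = Object.keys(cardPatterns).find((name) => cardPatterns[name].test(digits));
  return brand && passesLuhn(digits) ? brand : null;
}

console.log(identifyCard("4111 1111 1111 1111")); // "visa"
console.log(identifyCard("378282246310005"));     // "amex"
console.log(identifyCard("4111 1111 1111 1112")); // null (checksum fails)
```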
16. Stripping All HTML Tags from a String
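When you need to remove all HTML tags from a string, use the regex below. Be sure to include the global (g) flag so that every tag is replaced. The case-insensitive (i) flag is harmless but has no effect here, because the pattern contains no letters.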
/<[^>]+>/gi
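In use, the pattern is passed to replace with an empty replacement string. This is only a sketch for plain-text extraction: it is not an HTML sanitizer, and it also removes any stray "<" ... ">" span that is not really a tag. The helper name stripTags is an assumption.

```javascript
const htmlTag = /<[^>]+>/g;

// Hypothetical helper: removes anything that looks like a tag.
function stripTags(html) {
  return html.replace(htmlTag, "");
}

console.log(stripTags("<p>Hello <b>world</b>!</p>"));           // "Hello world!"
console.log(stripTags('<a href="https://example.com">link</a>')); // "link"
```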
17. Removing Blank Lines from a String
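Use this regex in global and multiline modes to remove blank lines from text. Replace matches with an empty string. In JavaScript's multiline mode, ^ also matches immediately after a carriage return, so on text with \r\n line endings the pattern can strip the \n of a line break; normalize line endings to \n first.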
/^\s*\r?\n/gm
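A minimal usage sketch follows. It converts \r\n to \n before applying the pattern, because in JavaScript's multiline mode ^ can also match between the \r and \n of a Windows line ending; the helper name removeBlankLines and the sample text are assumptions.

```javascript
const blankLines = /^\s*\r?\n/gm;

// Hypothetical helper: normalizes line endings, then drops blank lines.
function removeBlankLines(text) {
  return text.replace(/\r\n/g, "\n").replace(blankLines, "");
}

const cleaned = removeBlankLines("alpha\n\n  \nbeta\r\n\r\n  gamma\n");
console.log(JSON.stringify(cleaned)); // "alpha\nbeta\n  gamma\n"
```

Lines that contain only spaces or tabs are removed as well, while the indentation of non-blank lines is preserved.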
18. Conclusion
Regular expressions are an indispensable tool for text processing and data validation. They can greatly simplify tasks like validating emails, IP addresses, dates, numbers, and more. However, remember that regex is best used for format validation—not for verifying data accuracy (e.g., ensuring a date is real or a credit card number is valid). Always complement regex validation with additional logic or built-in libraries where necessary. Practice and testing are key to mastering regex and handling edge cases in production systems.