How to Use Regular Expressions to Solve Common Data Validation Tasks

Regular expressions (regex) are powerful tools for searching, matching, and manipulating text. Whether you’re validating user input, extracting data from logs, or performing complex text replacements, regex can save you time and code. In this guide, we cover a wide range of regex patterns for common programming challenges such as validating emails, dates, IP addresses, numbers, and more. While the examples provided here are tailored for JavaScript, the concepts are applicable across many programming languages.


Regex Basics: Common Patterns and Their Meanings

Below is a table summarizing essential regex tokens along with their descriptions and examples. Understanding these building blocks is key to mastering more advanced patterns.

Expression Description Example
^ Matches the start of a string. ^Hello matches "Hello" in "Hello World".
$ Matches the end of a string. World$ matches "World" in "Hello World".
. Matches any single character except newline. a.c matches "abc" or "adc", but not "ac".
* Matches 0 or more occurrences of the preceding element. ab* matches "a", "ab", "abbb", etc.
+ Matches 1 or more occurrences of the preceding element. ab+ matches "ab", "abb", but not just "a".
? Matches 0 or 1 occurrence of the preceding element. ab? matches "a" or "ab".
{n} Matches exactly n occurrences of the preceding element. a{3} matches "aaa".
{n,} Matches n or more occurrences of the preceding element. a{2,} matches "aa", "aaa", etc.
{n,m} Matches between n and m occurrences of the preceding element. a{1,3} matches "a", "aa", or "aaa".
[abc] Matches any character listed inside the brackets. [abc] matches "a", "b", or "c".
[^abc] Matches any character not listed inside the brackets. [^abc] matches "d" but not "a", "b", or "c".
(xyz) Groups and captures the sequence of characters "xyz". (abc)+ matches "abc", "abcabc", etc.
| Acts as a logical OR between expressions. cat|dog matches either "cat" or "dog".
\d Matches any digit (0-9). \d+ matches "123", "456", etc.
\D Matches any non-digit character. \D+ matches "abc", "XYZ", etc.
\w Matches any word character (letters, digits, or underscore). \w+ matches "hello123" or "word_".
\W Matches any non-word character. \W+ matches punctuation like "!", "@#$", etc.
\s Matches any whitespace character (spaces, tabs, line breaks). \s+ matches spaces, tabs, etc.
\S Matches any non-whitespace character. \S+ matches "hello", "world", etc.
\b Matches a word boundary (the position between a word and a non-word character). \bcat\b matches "cat" as a whole word but not within "scatter".
\B Matches a non-word boundary. \Bcat\B matches "cat" within "scatter" but not as an isolated word.
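
As a quick check of a few tokens from the table, the sketch below runs them through RegExp.prototype.test and String.prototype.match. The variable names are illustrative only.

```javascript
// Anchors: ^ ties the match to the start of the string, $ to the end.
const startsWithHello = /^Hello/.test("Hello World"); // true
const endsWithWorld = /World$/.test("Hello World");   // true

// "." needs exactly one character between "a" and "c".
const dotMatchesAbc = /a.c/.test("abc"); // true
const dotMatchesAc = /a.c/.test("ac");   // false

// \b only matches at word edges, \B only inside a word.
const wholeWord = /\bcat\b/.test("the cat sat"); // true
const insideWord = /\bcat\b/.test("scatter");    // false
const nonBoundary = /\Bcat\B/.test("scatter");   // true

// \d+ with the g flag collects every run of digits.
const digitRuns = "2024-01-15".match(/\d+/g); // ["2024", "01", "15"]

console.log(startsWithHello, endsWithWorld, dotMatchesAbc, dotMatchesAc);
console.log(wholeWord, insideWord, nonBoundary, digitRuns);
```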

1. Emulating DOTALL in JavaScript

Many regex engines provide a DOTALL flag, which makes the . metacharacter match newline characters as well. JavaScript added an equivalent s (dotAll) flag in ES2018, but older engines do not support it. To emulate DOTALL behavior there, replace each . with [\S\s], which matches any character (whitespace or non-whitespace).

/[\S\s]*/
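
The difference is easiest to see on a string that contains a line break. This is a minimal sketch with an illustrative sample string; the last check assumes an engine that implements the ES2018 s flag.

```javascript
// Sample text spanning two lines.
const text = "first line\nsecond line";

// A plain "." stops at the newline, so the pattern cannot reach "second".
const plainDot = /first.*second/.test(text); // false

// [\S\s] matches any character, including line breaks.
const emulated = /first[\S\s]*second/.test(text); // true

// Engines that implement ES2018 also accept the "s" (dotAll) flag.
const withFlag = new RegExp("first.*second", "s").test(text); // true

console.log(plainDot, emulated, withFlag);
```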

2. Validating Email Addresses

Email validation via regex is notoriously challenging due to the complexity of the RFC specifications. The following pattern handles most everyday addresses, but it hard-codes a list of top-level domains, so it rejects newer ones such as .dev or .app unless you extend the list. For more details on the limitations of email regex patterns, refer to this comprehensive comparison.

Tip: Always match emails case-insensitively by using the i flag.

 /^[-a-z0-9~!$%^&*_=+}{'?]+(\.[-a-z0-9~!$%^&*_=+}{'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+)*\.(aero|arpa|biz|com|coop|edu|gov|info|int|mil|museum|name|net|org|pro|travel|mobi|[a-z][a-z])|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/i 
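
A minimal sketch of how the pattern might be wrapped for use in a form handler. The names EMAIL_RE and isEmail are illustrative, and the sample addresses are made up. The last call shows the pattern's main limitation: top-level domains longer than two letters that are not in its list are rejected.

```javascript
// The email pattern from this section, stored once so it is compiled a single time.
const EMAIL_RE = /^[-a-z0-9~!$%^&*_=+}{'?]+(\.[-a-z0-9~!$%^&*_=+}{'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+)*\.(aero|arpa|biz|com|coop|edu|gov|info|int|mil|museum|name|net|org|pro|travel|mobi|[a-z][a-z])|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/i;

// Hypothetical helper: returns true when the value has the shape of an email address.
function isEmail(value) {
  return EMAIL_RE.test(value);
}

console.log(isEmail("Jane.Doe@example.com"));    // true
console.log(isEmail("user@mail.example.co.uk")); // true
console.log(isEmail("no-at-sign.example.com"));  // false
console.log(isEmail("user@example.dev"));        // false: "dev" is not in the TLD list
```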

3. Validating IPv4 Addresses

A proper IPv4 regex ensures that each octet is between 0 and 255. This pattern does exactly that. If you plan to match an IP address within a larger string, consider using word boundaries (\b) instead of ^ and $.

 /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/ 
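
The sketch below shows both uses: the anchored pattern for validating a whole string, and a word-boundary variant with the g flag for pulling addresses out of longer text. The constant names and the sample log line are illustrative.

```javascript
// Anchored version: the whole string must be an IPv4 address.
const IPV4_RE = /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/;

// Word-boundary version: finds addresses embedded in a larger string.
const IPV4_IN_TEXT = /\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b/g;

const validAddress = IPV4_RE.test("192.168.1.254"); // true
const octetTooLarge = IPV4_RE.test("256.1.1.1");    // false

// Made-up log line used only for illustration.
const logLine = "Accepted connection from 10.0.0.7, rejected 172.16.254.1";
const found = logLine.match(IPV4_IN_TEXT); // ["10.0.0.7", "172.16.254.1"]

console.log(validAddress, octetTooLarge, found);
```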

4. Validating Dates

Regular expressions can confirm the format of a date, but they can’t fully validate its authenticity (e.g., checking for leap years). The patterns below verify the structure and the valid range for days in each month, though leap years are not accounted for.

4.1 ISO Date Format (yyyy-mm-dd)

 /^[0-9]{4}-(?:(?:(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01]))|(02-(0[1-9]|[12][0-9]))|((0[469]|11)-(0[1-9]|[12][0-9]|30)))$/ 
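
Because the pattern accepts February 29 in every year, one way to close the gap is to let the built-in Date object confirm that the day really exists. This is a sketch under that assumption; isRealIsoDate is a hypothetical helper name, not part of any library.

```javascript
// The yyyy-mm-dd pattern from this section.
const ISO_DATE_RE = /^[0-9]{4}-(?:(?:(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01]))|(02-(0[1-9]|[12][0-9]))|((0[469]|11)-(0[1-9]|[12][0-9]|30)))$/;

// Hypothetical helper: format check first, then a calendar check for leap years.
function isRealIsoDate(value) {
  if (!ISO_DATE_RE.test(value)) return false;
  const [year, month, day] = value.split("-").map(Number);
  // setUTCFullYear rolls an impossible day over into the next month
  // (Feb 29 in a non-leap year becomes Mar 1), so comparing the parts
  // afterwards reveals whether the date exists.
  const date = new Date(0);
  date.setUTCFullYear(year, month - 1, day);
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}

console.log(isRealIsoDate("2024-02-29")); // true (leap year)
console.log(isRealIsoDate("2023-02-29")); // false (passes the regex, fails the calendar check)
console.log(isRealIsoDate("2023-04-31")); // false (rejected by the regex)
```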

4.2 ISO Date with Flexible Separators

This regex accepts -, /, ., or a space as the separator, ensuring the same separator is used throughout the date.

 /^[0-9]{4}([- /.])(?:(?:(0[13578]|1[02])\1(0[1-9]|[12][0-9]|3[01]))|(02\1(0[1-9]|[12][0-9]))|((0[469]|11)\1(0[1-9]|[12][0-9]|30)))$/ 
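
The backreference \1 is what rejects mixed separators. The sketch below also reuses the captured separator to rewrite an accepted date with hyphens; normalizeDate is a hypothetical helper name.

```javascript
// The flexible-separator pattern from this section.
const FLEX_DATE_RE = /^[0-9]{4}([- /.])(?:(?:(0[13578]|1[02])\1(0[1-9]|[12][0-9]|3[01]))|(02\1(0[1-9]|[12][0-9]))|((0[469]|11)\1(0[1-9]|[12][0-9]|30)))$/;

// Hypothetical helper: returns the date with "-" separators, or null if it does not match.
function normalizeDate(value) {
  const match = FLEX_DATE_RE.exec(value);
  if (!match) return null;
  // match[1] holds the separator captured by the first group.
  return value.split(match[1]).join("-");
}

console.log(normalizeDate("2024/03/15")); // "2024-03-15"
console.log(normalizeDate("2024.03.15")); // "2024-03-15"
console.log(normalizeDate("2024 03 15")); // "2024-03-15"
console.log(normalizeDate("2024-03/15")); // null (mixed separators)
```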

4.3 United States Date Format (mm/dd/yyyy)

 /^(?:(?:(0[13578]|1[02])\/(0[1-9]|[12][0-9]|3[01]))|(02\/(0[1-9]|[12][0-9]))|((0[469]|11)\/(0[1-9]|[12][0-9]|30)))\/[0-9]{4}$/ 

4.4 24-Hour Time Format (HH:MM or HH:MM:SS)

 /^(20|21|22|23|[01]\d|\d)(:[0-5]\d){1,2}$/ 
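
A short sketch of what this pattern accepts. Note that it allows a single-digit hour and an optional seconds part. The helper toMinutes is hypothetical and simply ignores any seconds.

```javascript
// The 24-hour time pattern from this section.
const TIME_RE = /^(20|21|22|23|[01]\d|\d)(:[0-5]\d){1,2}$/;

// Hypothetical helper: minutes since midnight, or null for an invalid time.
function toMinutes(value) {
  if (!TIME_RE.test(value)) return null;
  const [hours, minutes] = value.split(":").map(Number);
  return hours * 60 + minutes;
}

console.log(toMinutes("23:59"));    // 1439
console.log(toMinutes("9:05"));     // 545
console.log(toMinutes("07:30:15")); // 450 (seconds are accepted and ignored)
console.log(toMinutes("24:00"));    // null
console.log(toMinutes("12:60"));    // null
```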

5. Validating Numbers

Number validation can vary greatly depending on your requirements—whether you need to validate integers, decimals, currency formats, or numbers within a specific range. Below are several examples addressing common cases.

5.1 Positive Integers (Any Length)

 /^\d+$/ 

5.2 Positive Integers (Up to 10 Digits)

 /^\d{1,10}$/ 

5.3 Positive Integers (Fixed 5 Digits)

 /^\d{5}$/ 

5.4 Negative Integers (Any Length)

 /^-\d+$/ 

5.5 Negative Integers (Up to 10 Digits)

 /^-\d{1,10}$/ 

5.6 Negative Integers (Fixed 5 Digits)

 /^-\d{5}$/ 

5.7 Integers (Optional Negative Sign)

 /^-?\d+$/ 

5.8 Integers (Up to 10 Digits)

 /^-?\d{1,10}$/ 

5.9 Integers (Fixed 5 Digits)

 /^-?\d{5}$/ 

5.10 Numbers with Optional Decimals

 /^-?\d*\.?\d+$/ 

5.11 Numbers with Exactly 2 Decimal Places

 /^-?\d*\.\d{2}$/ 

5.12 Currency Format

This regex validates currency numbers with an optional dollar sign, optional negative sign, optional thousand separators, and up to two decimal places. It accepts formats like "$1,000,000.00", "10000.12", and "0.00". The dollar sign must be escaped as \$ because an unescaped $ is the end-of-string anchor.

 /^\$?-?([1-9]\d{0,2}(,\d{3})*(\.\d{1,2})?|[1-9]\d*(\.\d{1,2})?|0(\.\d{1,2})?|(\.\d{1,2}))$/ 
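
The sketch below uses the pattern with the dollar sign escaped (\$) and an alternative for digits written without thousand separators, so that all three formats listed above are accepted. parseCurrency is a hypothetical helper that converts an accepted string to a number.

```javascript
// Currency pattern with "\$" escaped and ungrouped digits allowed.
const CURRENCY_RE = /^\$?-?([1-9]\d{0,2}(,\d{3})*(\.\d{1,2})?|[1-9]\d*(\.\d{1,2})?|0(\.\d{1,2})?|(\.\d{1,2}))$/;

// Hypothetical helper: returns the numeric value, or null if the format is invalid.
function parseCurrency(value) {
  if (!CURRENCY_RE.test(value)) return null;
  // Drop the dollar sign and thousand separators before converting.
  return Number(value.replace(/[$,]/g, ""));
}

console.log(parseCurrency("$1,000,000.00")); // 1000000
console.log(parseCurrency("10000.12"));      // 10000.12
console.log(parseCurrency("0.00"));          // 0
console.log(parseCurrency("$-42.5"));        // -42.5
console.log(parseCurrency("1,00.00"));       // null (misplaced separator)
```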

5.13 Numbers from 0 to 100 with Optional Decimals

 /^(100(\.0+)?|[1-9]?\d(\.\d+)?)$/ 

6. Validating Feet and Inches Notation

To validate measurements in the format F'I" (e.g., 6'11"), use the following pattern. This regex ensures that inches are less than 12.

 /^\d+'(0|[1-9]|1[0-1])"$/ 

7. Validating Hexadecimal Color Codes

Hex color codes may optionally begin with a "#" and be either 3 or 6 hexadecimal digits. This regex handles both formats.

 /^#?([a-f0-9]{6}|[a-f0-9]{3})$/i 

8. Checking for Alphanumeric Values

While the \w character class matches letters, digits, and underscores, you may want to allow only letters and numbers. Use the following regex:

 /^[a-zA-Z0-9]+$/ 

9. Validating Social Security Numbers (SSN)

SSNs in the United States consist of nine digits, often separated by hyphens. Note that this regex validates the format only and does not check for authenticity.

 /^\d{3}-?\d{2}-?\d{4}$/ 

10. Validating Canadian Social Insurance Numbers (SIN)

Canadian SINs consist of nine digits and may include spaces or hyphens as separators. This regex enforces consistent use of the separator. For full validation, the checksum digit should be computed separately.

 /^\d{3}([\s-])?\d{3}\1\d{3}$/ 

11. Validating US Zip Codes

US zip codes can be either 5 digits or in the Zip+4 format. Use one of the following regex patterns:

Zip Code (5 digits):

 /^\d{5}$/ 

Zip+4 Format (5 digits, optionally followed by a hyphen and 4 digits):

 /^\d{5}(-\d{4})?$/ 

12. Validating Canadian Postal Codes

Canadian postal codes follow the pattern A9A 9A9. This regex accepts an optional space between the two groups.

 /^[ABCEGHJKLMNPRSTVXY]\d[A-Z] ?\d[A-Z]\d$/i 

13. Extracting Filenames from Windows Paths

Windows paths are separated by backslashes. This regex extracts the filename (or last directory) from the path. Note that without additional context, it cannot distinguish between a file and a folder.

 /[^\\]+$/ 
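
A minimal usage sketch. Remember that backslashes must be doubled inside JavaScript string literals. The helper name lastSegment and the sample paths are illustrative.

```javascript
// Hypothetical helper: returns the text after the last backslash, or "" if there is none.
function lastSegment(path) {
  const match = path.match(/[^\\]+$/);
  return match ? match[0] : "";
}

console.log(lastSegment("C:\\Users\\alice\\Documents\\report.final.pdf")); // "report.final.pdf"
console.log(lastSegment("C:\\Users\\alice"));                              // "alice"
console.log(lastSegment("C:\\Temp\\"));                                    // "" (path ends with a backslash)
```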

14. Validating US or Canadian Telephone Numbers

Telephone numbers can have various formats. The regex below accepts numbers like "999-999-9999", "9999999999", and "(999) 999-9999". Adjust your UI to enforce formatting, if necessary.

 /^(?:(\d{10})|(([\(]?[0-9]{3}[\)]?)[ \.\-]?[0-9]{3}[ \.\-][0-9]{4}))$/ 
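
The sketch below assumes the two alternatives are wrapped in a single group so that ^ and $ apply to both; without that grouping, a string such as "1234567890abc" would pass. normalizePhone is a hypothetical helper that strips the formatting from an accepted number.

```javascript
// Both alternatives sit inside one group so the anchors cover each of them.
const PHONE_RE = /^(?:\d{10}|\(?[0-9]{3}\)?[ .-]?[0-9]{3}[ .-][0-9]{4})$/;

// Hypothetical helper: returns the ten digits, or null if the format is not accepted.
function normalizePhone(value) {
  if (!PHONE_RE.test(value)) return null;
  return value.replace(/\D/g, "");
}

console.log(normalizePhone("999-999-9999"));   // "9999999999"
console.log(normalizePhone("9999999999"));     // "9999999999"
console.log(normalizePhone("(999) 999-9999")); // "9999999999"
console.log(normalizePhone("1234567890abc"));  // null
```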

15. Validating Credit Card Numbers

While regex can ensure that a credit card number conforms to expected lengths and starting digits, always use the Luhn algorithm to validate the checksum.

VISA:

 /^4[0-9]{12}(?:[0-9]{3})?$/ 

MasterCard:

 /^5[1-5][0-9]{14}$/ 

American Express:

 /^3[47][0-9]{13}$/ 

Diners Club:

 /^3(?:0[0-5]|[68][0-9])[0-9]{11}$/ 

Discover:

 /^6(?:011|5[0-9]{2})[0-9]{12}$/ 

JCB:

 /^(?:2131|1800|35\d{3})\d{11}$/ 
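
One way to combine the patterns above with the Luhn check is sketched below. The names CARD_PATTERNS, passesLuhn, and identifyCard are illustrative, and the sample numbers are widely published test numbers, not real accounts.

```javascript
// The brand patterns from this section, keyed by an illustrative brand name.
const CARD_PATTERNS = {
  visa: /^4[0-9]{12}(?:[0-9]{3})?$/,
  mastercard: /^5[1-5][0-9]{14}$/,
  amex: /^3[47][0-9]{13}$/,
  dinersClub: /^3(?:0[0-5]|[68][0-9])[0-9]{11}$/,
  discover: /^6(?:011|5[0-9]{2})[0-9]{12}$/,
  jcb: /^(?:2131|1800|35\d{3})\d{11}$/,
};

// Luhn checksum: starting from the right, double every second digit,
// subtract 9 from results above 9, and require the total to end in 0.
function passesLuhn(digits) {
  let sum = 0;
  let doubleNext = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let digit = digits.charCodeAt(i) - 48; // "0" is char code 48
    if (doubleNext) {
      digit *= 2;
      if (digit > 9) digit -= 9;
    }
    sum += digit;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

// Hypothetical helper: returns the brand name, or null if the format or checksum fails.
function identifyCard(input) {
  const digits = input.replace(/[ -]/g, ""); // tolerate spaces and hyphens
  for (const [brand, pattern] of Object.entries(CARD_PATTERNS)) {
    if (pattern.test(digits)) return passesLuhn(digits) ? brand : null;
  }
  return null;
}

console.log(identifyCard("4111 1111 1111 1111")); // "visa"
console.log(identifyCard("378282246310005"));     // "amex"
console.log(identifyCard("4111111111111112"));    // null (checksum fails)
```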

16. Stripping All HTML Tags from a String

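When you need to remove all HTML tags from a string, use the regex below. Be sure to include the global (g) flag so that every tag is replaced, not just the first. The case-insensitive (i) flag is harmless here but has no effect, because the pattern contains no letters.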

 /<[^>]+>/gi 
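
A minimal sketch using String.prototype.replace with made-up markup. The second call illustrates a limitation: a ">" inside an attribute value ends the match early, so this approach should not be relied on for sanitizing untrusted HTML.

```javascript
// Hypothetical helper: removes anything that looks like a tag.
function stripTags(html) {
  return html.replace(/<[^>]+>/gi, "");
}

console.log(stripTags('<p class="intro">Hello, <b>world</b>!</p>')); // "Hello, world!"

// A ">" inside an attribute value cuts the tag short and leaves debris behind.
console.log(stripTags('<a title="a > b">link</a>')); // ' b">link'
```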

17. Removing Blank Lines from a String

Use this regex in global and multiline modes to remove blank lines from text. Replace matches with an empty string.

 /^\s*\r?\n/gm 
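
A usage sketch follows. It rests on one assumption that is not part of the pattern itself: line endings are first normalized to \n, because in JavaScript's multiline mode ^ also matches directly after a \r, which can leave stray carriage returns in Windows-style text. removeBlankLines is a hypothetical helper name.

```javascript
// Hypothetical helper: drops empty and whitespace-only lines.
function removeBlankLines(text) {
  // Assumption: convert \r\n and lone \r to \n so each line break is one character.
  const normalized = text.replace(/\r\n?/g, "\n");
  // The pattern from this section, applied in global and multiline modes.
  return normalized.replace(/^\s*\r?\n/gm, "");
}

const input = "alpha\n\n  \nbeta\r\n\r\ngamma\n";
console.log(JSON.stringify(removeBlankLines(input))); // "alpha\nbeta\ngamma\n"
```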

18. Conclusion

Regular expressions are an indispensable tool for text processing and data validation. They can greatly simplify tasks like validating emails, IP addresses, dates, numbers, and more. However, remember that regex is best used for format validation—not for verifying data accuracy (e.g., ensuring a date is real or a credit card number is valid). Always complement regex validation with additional logic or built-in libraries where necessary. Practice and testing are key to mastering regex and handling edge cases in production systems.

