Saturday, 13 June 2009

Regex dollars can't buy me love

Short note about regex dollars.
I always thought that the regex dollar sign $ in single-line mode (that is, with the multiline option off) means the end of the character string (I thought of it like the 0 <zero> at the end of a C string, some kind of delimiter that ends the string). However, I found out that a regex like ^[0-9]+$ matches not only strings of digits but also strings like "123\n". In other words, it matches the strings described by the pattern and also those same strings followed by a single trailing \n.
Lately, I had to validate user input from a web page, and I ended up checking that the string matches the regex and also that it doesn't end with \n. I didn't like that solution, and with great help from Stack Overflow users I found out that the only way to mark the very end of the string is to use \z. After a bit more research I was able to find some more explanation:
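
A quick sketch of that surprise, assuming Python's re module (the post itself is not tied to Python, and the helper name accepts_with_dollar is made up for this illustration). Python's $ behaves the same way when multiline mode is off:

```python
import re

# The pattern from the post: one or more digits between ^ and $.
DIGITS_DOLLAR = re.compile(r"^[0-9]+$")


def accepts_with_dollar(text: str) -> bool:
    """Return True if the $-anchored pattern matches the text (hypothetical helper)."""
    return DIGITS_DOLLAR.search(text) is not None


print(accepts_with_dollar("123"))      # True
print(accepts_with_dollar("123\n"))    # True: $ also matches just before one final \n
print(accepts_with_dollar("123\n\n"))  # False: only a single trailing \n is tolerated
print(accepts_with_dollar("12a"))      # False
```
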

The difference between ‹\Z› and ‹\z› comes into play when the last character in your subject text is a line break. In that case, ‹\Z› can match at the very end of the subject text, after the final line break, as well as immediately before that line break. The benefit is that you can search for ‹omega\Z› without having to worry about stripping off a trailing line break at the end of your subject text. When reading a file line by line, some tools include the line break at the end of the line, whereas others don’t; ‹\Z› masks this difference. ‹\z› matches only at the very end of the subject text, so it will not match text if a trailing line break follows. The anchor ‹$› is equivalent to ‹\Z›, as long as you do not turn on the “^ and $ match at line breaks” option. This option is off by default for all regex flavors except Ruby. Ruby does not offer a way to turn this option off. Just like ‹\Z›, ‹$› matches at the very end of the subject text, as well as before the final line break, if any.

Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan. Copyright 2009 Jan Goyvaerts and Steven Levithan, 978-0-596-52068-7

To sum up, the only safe way to mark the end of a character string is to use \z instead of $, at least when the input comes from a web page or any other source where the user doesn't press Enter to confirm their input.
Weird... I was really used to $.
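
For illustration, here is a minimal sketch of the strict check, again assuming Python's re module. One caveat that matters here: Python's re spells the strict anchor \Z, and its \Z matches only at the very end of the string, so it plays the role of the \z described in the quote rather than the lenient \Z. The helper name is_all_digits is hypothetical.

```python
import re

# Assumption: Python's re module, where \Z is the strict end-of-string anchor
# (the counterpart of \z in the flavors discussed in this post).
DIGITS_STRICT = re.compile(r"^[0-9]+\Z")


def is_all_digits(text: str) -> bool:
    """Return True only if the whole string is digits, with no trailing newline."""
    return DIGITS_STRICT.search(text) is not None


print(is_all_digits("123"))    # True
print(is_all_digits("123\n"))  # False: the trailing \n is no longer ignored
print(is_all_digits(""))       # False: at least one digit is required

# re.fullmatch gives the same strictness without any anchors in the pattern.
print(re.fullmatch(r"[0-9]+", "123\n") is None)  # True
```

In a flavor that supports \z directly, the pattern would simply be ^[0-9]+\z.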
