You fell into the "Unicode trap" (https://unicode.org/)! But let's take it step by step.

First, I copied and pasted the word Olá from your code and did the following:

from unicodedata import name
for s in 'Olá':  # for each character of the string
    print(f'{s} {ord(s):4X} {name(s)}')

That is, for each character of the string, I print the character itself, its code point in hexadecimal (https://docs.python.org/3/library/functions.html#ord ) and its name (https://docs.python.org/3/library/unicodedata.html#unicodedata.name ). The result was:

O 4F LATIN CAPITAL LETTER O
l 6C LATIN SMALL LETTER L
a 61 LATIN SMALL LETTER A
́ 301 COMBINING ACUTE ACCENT
Yes, 4 characters, and the last one is the combining acute accent (https://www.fileformat.info/info/unicode/char/301/index.htm ).

What happens is that this string is in NFD form (one of the Unicode normalization forms, see https://unicode.org/reports/tr15/ ). To understand this in detail, I suggest reading /q/406545/112052 , /a/396555/112052 and /a/345946/112052 , but in short, the character á (the letter "a" with an acute accent) can be represented in two ways:

- as a single character: á ( https://www.fileformat.info/info/unicode/char/e1/index.htm )
- as a combination of 2 characters: the letter a ( https://www.fileformat.info/info/unicode/char/61/index.htm ), without an accent, followed by the acute accent (the fourth character in the example above)

The first form is known as NFC, and the second as NFD (read the links suggested above to learn more).

The problem is that both forms, when rendered, look the same on screen (in the vast majority of fonts, though not in all of them), and you only notice the difference if you dig into the bits and check what the string actually contains. Therefore, the regex will not match this string, because the accent was not included in the list of valid characters.

One way to solve this is to convert the string to NFC, using https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize . This way the letter "a" and the accent are combined into the single character á:

from unicodedata import name, normalize
for s in normalize('NFC', 'Olá'):
    print(f'{s} {ord(s):4X} {name(s)}')

See the difference:

O 4F LATIN CAPITAL LETTER O
l 6C LATIN SMALL LETTER L
á E1 LATIN SMALL LETTER A WITH ACUTE
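To make the difference between the two forms easier to see, here is a minimal sketch that builds both versions of the word with explicit escapes (the variable names `nfc` and `nfd` are only illustrative) and compares them:

```python
from unicodedata import normalize

nfc = "Ol\u00e1"    # 'á' as the single code point U+00E1
nfd = "Ola\u0301"   # 'a' followed by U+0301 COMBINING ACUTE ACCENT

print(nfc == nfd)                    # False: different code point sequences
print(len(nfc), len(nfd))            # 3 4
print(normalize("NFC", nfd) == nfc)  # True: composing gives the single 'á'
print(normalize("NFD", nfc) == nfd)  # True: decomposing gives 'a' + accent
```

Both strings look the same when printed, but they only compare equal after being normalized to the same form.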
Another detail is that inside brackets, most characters do not need to be escaped (see https://www.regular-expressions.info/charclass.html#special ). And since you used the https://docs.python.org/3/library/re.html#re.IGNORECASE flag, you do not need to put both uppercase and lowercase letters in the expression, because the flag already considers both (i.e. you can leave only the lowercase letters in the regex, or only the uppercase ones).

The compiled expression (returned by https://docs.python.org/3/library/re.html#re.compile ) also has a match method ( https://docs.python.org/3/library/re.html#re.Pattern.match ), which you can use directly: instead of re.match(good_chars_regexp, etc), you can just call good_chars_regexp.match(etc):

import re
from unicodedata import normalize

lista = ['Olá amigos da Internet!', 'Dúvida sobre Python', '@StackOverFlow']
# \- keeps the hyphen literal (unescaped, .-? would be a range from '.' to '?')
# \" keeps the double quote from ending the string literal
good_chars_regexp = re.compile(r"^[a-záéíóúâêîôãõç0-9,.\-?\"'’!“\s;:“”\–‘’’/]+$", re.IGNORECASE)
for l in lista:
    # normalize to NFC before matching, so that 'a' + accent becomes 'á'
    print(good_chars_regexp.match(normalize('NFC', l)) is not None)

The output is:

True
True
False
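Two details of character classes are easy to miss, so here is a small sketch of them (the pattern names are hypothetical and the classes are reduced to the minimum): with re.IGNORECASE a lowercase-only class also accepts uppercase letters, including accented ones, and a hyphen placed between two characters inside brackets defines a range instead of a literal hyphen.

```python
import re

# Lowercase-only class: IGNORECASE makes it accept uppercase as well.
letters = re.compile(r"^[a-z\u00e1]+$", re.IGNORECASE)
print(letters.match("OL\u00c1") is not None)  # True ('Á' matches 'á')

# A hyphen between two characters is a range: [.-?] means U+002E to U+003F.
as_range = re.compile(r"^[.-?]+$")
print(as_range.match("-") is not None)  # False: '-' is U+002D, outside the range
print(as_range.match("5") is not None)  # True: digits fall inside the range

# Escaped (or placed first or last in the class), the hyphen is literal.
literal = re.compile(r"^[.\-?]+$")
print(literal.match("-") is not None)   # True
print(literal.match("5") is not None)   # False
```

So if the hyphen is meant to be one of the accepted characters, it is safer to escape it or move it to the end of the class.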
If you are willing to install an external module, an alternative is the regex module ( https://pypi.org/project/regex/ ), which has some features beyond those of the re module. One that can help in this case is Unicode properties ( https://pypi.org/project/regex/#unicode-codepoint-properties-including-scripts-and-blocks ):

import regex

# same escaping as before: \- for a literal hyphen, \" for the double quote
good_chars_regexp = regex.compile(r"^([0-9,.\-?\"'’!“\s;:“”\–‘’’/]|\p{Script=Latin}\p{M}?)+$", regex.IGNORECASE)
for l in lista:  # no need to normalize anymore
    print(good_chars_regexp.match(l) is not None)
Thus, the regex accepts either digits and the other characters (period, comma, hyphen, quotes, etc.) or \p{Script=Latin}\p{M}?.

\p{Script=Latin} covers all characters of the Latin script ( https://en.wikipedia.org/wiki/Latin_script_in_Unicode ), which may be too broad if you only want Portuguese text, and \p{M} covers the "Mark" categories (all those starting with "M" in https://www.fileformat.info/info/unicode/category/index.htm ), which include the acute accent. The ? right after it makes the mark optional ( https://www.regular-expressions.info/optional.html ), i.e. we can have just the letter, or the letter followed by the accent, in case the string is in NFD.

Note: it is also worth remembering that this regex does not check whether the string actually contains words. For example, if the string is !!!,,," ", it is considered valid. Of course, this goes a bit beyond the scope of the question, but if the idea is, for example, to check that there is at least one letter or something similar, it may help to take a look at /q/342605/112052 , /q/337924/112052 and /q/377860/112052 .

Finally, an option (a little more complicated) that works regardless of whether the string is in NFC or NFD, and that does not require normalization, would be:

good_chars_regexp = re.compile(r"^([a-záéíóúâêîôãõç0-9,.\-?\"'’!“\s;:“”\–‘’’/]|[aeiou]\u0301|[aeio]\u0302|[ao]\u0303|c\u0327)+$", re.IGNORECASE)

In this case, I accept either the accented letters (áéí...) or the base letters followed by the respective combining accent. For this I used Unicode escapes (\u followed by the hexadecimal code of each character), with the codes of the acute accent ( https://www.fileformat.info/info/unicode/char/301/index.htm ), the circumflex ( https://www.fileformat.info/info/unicode/char/302/index.htm ), the tilde ( https://www.fileformat.info/info/unicode/char/303/index.htm ) and the cedilla ( https://www.fileformat.info/info/unicode/char/327/index.htm ), each preceded by the letters that can carry it. Thus, the regex handles both the NFC and the NFD cases.
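To check that this last option really behaves the same for both forms, here is a small sketch that runs each sample string through the pattern in NFC and in NFD. The helper `is_valid` and the names `samples` and `results` are hypothetical, and the pattern assumes the hyphen and the double quote are meant as literal characters, so they are escaped:

```python
import re
from unicodedata import normalize

# Precomposed letters, or a base letter followed by its combining mark.
# Assumption: hyphen and double quote are literal, hence \- and \".
pattern = re.compile(
    r"^([a-záéíóúâêîôãõç0-9,.\-?\"'’!“\s;:“”–‘/]"
    r"|[aeiou]\u0301|[aeio]\u0302|[ao]\u0303|c\u0327)+$",
    re.IGNORECASE,
)

def is_valid(text: str) -> bool:
    """Return True if the whole text uses only the accepted characters."""
    return pattern.match(text) is not None

samples = ["Olá amigos da Internet!", "Dúvida sobre Python", "@StackOverFlow"]

# For each sample, test the NFC version and the NFD version.
results = [
    (is_valid(normalize("NFC", s)), is_valid(normalize("NFD", s)))
    for s in samples
]
for s, (nfc_ok, nfd_ok) in zip(samples, results):
    print(s, nfc_ok, nfd_ok)

# Punctuation alone is also accepted, as noted above.
print(is_valid('!!!,,," "'))
```

The first two strings are accepted in both forms and the third is rejected in both because of the @, so no normalization step is needed before matching.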