# Regular expression different from what set of lines

• There are two sets of lines.
White list:

``````bhdgffhfa
aggjjaaga
hbefebhfi
dbceeabih
ihifdbhhf
djacfjbae
ibjbeigac
aaaaaaaaj
dgifecchi
dcbhgjaaf
eedfbcedb
aaaaaaabf
beaajbihg
aaaaaaaag
eacccdjeh
hehcijfhc
aaaaaaaaa
aaaaaaabc
jfdhdeice
``````

And the blacklist:

``````aaaaaaaai
ehigcbgjd
hgjdeajec
aaaaaaabe
haebgiggc
aaaaaaaba
aaaaaaaaf
bfjhehbcf
diieabefh
ifbjdgibf
dddbgcgai
bbcafhcif
hfjciccbi
fdbgidjdj
aaaaaaabb
jhdibgbfj
bhgjfacgi
ehhieihhh
abijjabda
``````

We need to write a regular expression, not longer than 60 symbols, which can distinguish from which set the line is taken.
For example, if the programme entrance is available `aaaaaaaaa`the program has to decide it's a line from the white list.

UPDATE: https://www.hackerrank.com/contests/regular-expresso/challenges/match-maker

• Two decisive pathters may be received: on the white list (whitelist) and on the black list (blacklist).

White list: `/ce|ag|fe|ff|aaj|cd|dc|aad|abc|abf|acf|bei|bhh|cij|aaaaaaaaa/`
Black list: `/bg|ai|bb|ie|jc|jd|bd|cg|jh|aba|abe|aaaaaaaaf/`

The Black List Pattern is shorter and allows a compression of up to 37 symbols: `/ai|b[bdg]|cg|ie|j[cdh]|ab[ae]|a{8}f/`
But using two patters makes it possible to diagnose.

Resolving patterns are drawn from a unique list of tokens on the optimised greed algorithm.

Two- and three-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-s-seven-s.

PHP:

``````\$whitelist = "
bhdgffhfa
aggjjaaga
hbefebhfi
dbceeabih
ihifdbhhf
djacfjbae
ibjbeigac
aaaaaaaaj
dgifecchi
dcbhgjaaf
eedfbcedb
aaaaaaabf
beaajbihg
aaaaaaaag
eacccdjeh
hehcijfhc
aaaaaaaaa
aaaaaaabc
jfdhdeice
";
\$blacklist = "
aaaaaaaai
ehigcbgjd
hgjdeajec
aaaaaaabe
haebgiggc
aaaaaaaba
aaaaaaaaf
bfjhehbcf
diieabefh
ifbjdgibf
dddbgcgai
bbcafhcif
hfjciccbi
fdbgidjdj
aaaaaaabb
jhdibgbfj
bhgjfacgi
ehhieihhh
abijjabda
";
//Генератор решающих токенов
function tokens_gen(\$whitelist, \$blacklist){
\$white_tokens = [];
\$black_tokens = [];
for(\$i=-1; \$i< 12; \$i++){
\$c = (\$i<0) ? "" : chr(ord("a")+\$i);    // сначала двухсимвольные токены
\$c = (\$i==10) ? "aaaaaaa" : \$c;         // последними - токены c "aaaaaaa"
for(\$ord2=ord("a"); \$ord2 <= ord("j"); \$ord2++){
for(\$ord3=ord("a"); \$ord3 <= ord("j"); \$ord3++){
\$ccc = \$c.chr(\$ord2).chr(\$ord3);
\$pm_white = preg_match("/\$ccc/", \$whitelist);
\$pm_black = preg_match("/\$ccc/", \$blacklist);
if((\$pm_white==1) && (\$pm_black==0)) array_push(\$white_tokens, \$ccc);
if((\$pm_black==1) && (\$pm_white==0)) array_push(\$black_tokens, \$ccc);
}
}
}
return [\$white_tokens, \$black_tokens];
}
// "Жадный" рекурсивный отбор токенов
function greedy(\$tokens, \$list, \$pat=""){
\$tokens_new = [];
\$ppp_max = "";
foreach(\$tokens as \$ppp){
\$p = (\$pat=="")? \$ppp: "\$pat|\$ppp";
\$sum = 0;
foreach(\$list as \$str) \$sum += preg_match("/\$p/", \$str);
if(\$sum > \$sum_max){
\$ppp_max = \$ppp;
\$sum_max = \$sum;
}
if(\$sum > 0) array_push(\$tokens_new, \$ppp);
}
if(\$ppp_max == "") return \$pat;
\$pat = (\$pat=="")? \$ppp_max: "\$pat|\$ppp_max";
\$left = 'display: block; float: left; ';
printf("<span style='\$left'>Токенов:</span><span style='\$left width:30px;' align='right'>%d</span>", count(\$tokens));
printf("<span style='\$left'>&emsp;Строк,&emsp;всего:</span><span style='\$left width: 24px;' align='right'>%d</span>&emsp;по токену: \$sum_max&emsp;", count(\$list));
print("Токен: \$ppp_max &emsp; Паттерн: \$pat<br>");
\$list = preg_grep("/\$ppp_max/", \$list, PREG_GREP_INVERT);
\$pat = greedy(\$tokens_new, \$list, \$pat);
return \$pat;
}
// Перевод контрольных строк в массивы
\$white = explode(chr(10), trim(\$whitelist));
foreach(\$white as &\$w) \$w=trim(\$w);
\$black = explode(chr(10), trim(\$blacklist));
foreach(\$black as &\$b) \$b=trim(\$b);
\$white_black = array_merge(\$white, \$black);
// Получение решающих токенов
list(\$white_tokens, \$black_tokens) = tokens_gen(\$whitelist, \$blacklist);
// Получение жадных паттернов
print("Жадные паттерны:<br><br>");
\$white_pattern = greedy(\$white_tokens, \$white);
\$black_pattern = greedy(\$black_tokens, \$black);
//\$black_pattern = 'ai|b[bdg]|cg|ie|j[cdh]|ab[ae]|a{8}f';
print("<br><br>white_pattern = \$white_pattern<br>black_pattern = \$black_pattern<br>");
\$left = 'display: block; float: left; ';
print("<br>Проверка:<br>");
foreach(\$white_black as \$str){
\$in_white = preg_match("/\$white_pattern/",\$str);
\$in_black = preg_match("/\$black_pattern/",\$str);
print("<br><span style='\$left width:80px'>\$str:</span>\$in_white \$in_black");
}
``````

Results:

```Greedy patters:
Tokens: 108 Stroke, total token:
Tokens:108 Strock, total:18 token: 2 Token: ag Pattern: ce:ag
Tokens: 88 Stroke, total token: 2 Token: fe Pattern: ce:ag:fe
Tokens:81 Stroke, total:14 token: 2 Token: ff Pattern: ce:ag:fe,ff
Tokens:67 Stroke, total:12 token: 2 Token: aaj Pattern: ce,ag,fe,ff,aaj
Tokens:49 Stroke, total:10 token: 1 Token: cd Pattern: ce:ag:fe,ff,aaj|cd
Tokens:41 Stroke, total:9 token: 1 Token: dc Pattern: ce:ag,fe,ff,aaj,cd,dc
Tokens:28 Stroke, total:7 token: 1 Token: abc Pattern: ce:ag:feajff,cd,dc,aadcabc
Tokens:26 Stroke, total:6 token: 1 Token: abf Pattern: ceagag:feajff,cd,dc,aadabf
Tokens:24 Stroke, total:5 token: 1 Token: acf Pattern: ceagag:feajffajcd,dc,aadabf:acf
Tokens:22 Stroke, total:4 token: 1 Token: bei Pattern: ceagag:fe:ffajcd,dc,aadabc:acfacacacbei
Tokens:18 Stroke, total:3 token: 1 Token: bh Pattern: ceagag:feffff|cd,dc,aadabc:acfacacfacbei|bhh
Tokens:11 Stroke, total:2 token: 1 Token: cij Pattern: ceagag:fe:ffajcdc,aadabcacacfacacbeibebh:cij
Tokens:5 Stroke, total:1 token: 1 Token: aaaaaaaaa Pattern: ceagag:fe:d:cdc|abcacabcacacfacacf:acbei:cij:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Tokens:118 Stroke, total token: 5 Token: bg Pattern: bg
Tokens:118 Stroke, total:16 Token: 2 Token: ai Pattern: bg:ai
Tokens:82 Stroke, total:14 token: 2 Token: bb Pattern: bg:ai,bb
Tokens:75 Stroke, total:12 token: 2 Token: ie Pattern: bg:ai:bb:ie
Tokens:67 Stroke, total:10 token: 2 Token: jc Pattern: bg:ai,bb,ie,jc
Tokens:53 Stroke, total:8 token: 2 Token: jd Pattern: bg:ai:bb:ie:jc|jd
Tokens:37 Stroke, total:6 per token: 1 Token: bd Pattern: bg:ai:bb:ie:jc,jd:bd:d
Tokens:24 Stroke, total:5 thicken: 1 Token: cg Pattern: bg:ai:bb:ie|jc,jd,bd:cg
Tokens:18 Stroke, total:4 token: 1 Token: jh Pattern: bg:ai:bb:ie:jc,jd,c:cg:jh
Tokens:12 Stroke, total:3 token: 1 Token: aba Pattern: bg:ai:bb:ie:jc,jd,c,cg|jh:aba
Tokens:5 Stroke, total:2 token: 1 Token: abe Pattern: bg:ai:bb:ie|jc,jd,c,cg:jh:aba:abe
Tokens:3 Stroke, total:1 token: 1 Token: aaaaaaaaaaf Pattern: bg:ai:bb:ie:jc:jd:cg:jh:aba:abe:aaaaaaaaaaaf
white_pattern = ceagagagfefeff|aaj|cd|dc|abcacabcacabfacacbebeibebhcicijciaaaaaaaaaaa
black_pattern = bg|aiaibb|ie|jc|jdccg|jh|aba|abeaaaaaaaaaaaaf
Verification:
bhdgffhfa:1 0
aggjjaaga:1 0
hbefebhfi:1 0
dbceeabih:1 0
ihifdbhf:1 0
djacfjbae:1 0
ibjbeigac:1 0
aaaaaaaj:1 0
dgifecchi:1 0
dcbhgjaaf:1 0
eedfbcedb:1 0
aaaaaaabf:1 0
beaajbihg:1 0
aaaaaaag:1 0
eacccdjeh:1 0
hehcijfhc:1 0
aaaaaa:1 0
aaaaaabc:1 0
jfdhdeice:1 0
aaaaaaaai:0 1
ehigcbgjd:0 1
hgjdeajec:0 1
aaaaaabe:0 1
haebgiggc:0 1
aaaaaaba:0 1
aaaaaaaf:0 1
bfjhehbcf:0 1
diieabefh:0 1
ifbjdgibf:0 1
dddbgcgai:0 1
bbcafhcif:0 1
hfjciccbi:0 1
fdbgidjdj:0 1
aaaaaabb:0 1
jhdibgbfj:0 1
bhgjfacgi:0 1
ehhieihhh:0 1
abijjabda:0 1
```

The test showed that all the patters were working.
A schema with a whole "great" overhead gave the same results.

2

2

2

2

2

2

2

2

2

2

2

2

2

2

2