Regular expression different from what set of lines
-
There are two sets of lines.
White list:bhdgffhfa aggjjaaga hbefebhfi dbceeabih ihifdbhhf djacfjbae ibjbeigac jaffigdad aaaaaaaaj dgifecchi dcbhgjaaf eedfbcedb aaaaaaabf beaajbihg aaaaaaaag eacccdjeh aaaaaaaad hehcijfhc aaaaaaaaa aaaaaaabc jfdhdeice
And the blacklist:
aaaaaaaai ehigcbgjd hgjdeajec aaaaaaabe haebgiggc aaaaaaaba aaaaaaaaf bfjhehbcf diieabefh jhjcaeead ifbjdgibf dddbgcgai bbcafhcif hfjciccbi fdbgidjdj aaaaaaabb jhdibgbfj cfjbaijad bhgjfacgi ehhieihhh abijjabda
We need to write a regular expression, not longer than 60 symbols, which can distinguish from which set the line is taken.
For example, if the programme entrance is availableaaaaaaaaa
the program has to decide it's a line from the white list.UPDATE: https://www.hackerrank.com/contests/regular-expresso/challenges/match-maker
-
Two decisive pathters may be received: on the white list (whitelist) and on the black list (blacklist).
White list:
/ce|ag|fe|ff|aaj|cd|dc|aad|abc|abf|acf|bei|bhh|cij|aaaaaaaaa/
Black list:/bg|ai|bb|ie|jc|jd|bd|cg|jh|aba|abe|aaaaaaaaf/
The Black List Pattern is shorter and allows a compression of up to 37 symbols:
/ai|b[bdg]|cg|ie|j[cdh]|ab[ae]|a{8}f/
But using two patters makes it possible to diagnose.Resolving patterns are drawn from a unique list of tokens on the optimised greed algorithm.
Two- and three-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-s-seven-s.
PHP:
$whitelist = " bhdgffhfa aggjjaaga hbefebhfi dbceeabih ihifdbhhf djacfjbae ibjbeigac jaffigdad aaaaaaaaj dgifecchi dcbhgjaaf eedfbcedb aaaaaaabf beaajbihg aaaaaaaag eacccdjeh aaaaaaaad hehcijfhc aaaaaaaaa aaaaaaabc jfdhdeice ";
$blacklist = "
aaaaaaaai
ehigcbgjd
hgjdeajec
aaaaaaabe
haebgiggc
aaaaaaaba
aaaaaaaaf
bfjhehbcf
diieabefh
jhjcaeead
ifbjdgibf
dddbgcgai
bbcafhcif
hfjciccbi
fdbgidjdj
aaaaaaabb
jhdibgbfj
cfjbaijad
bhgjfacgi
ehhieihhh
abijjabda
";//Генератор решающих токенов
function tokens_gen($whitelist, $blacklist){
$white_tokens = [];
$black_tokens = [];
for($i=-1; $i< 12; $i++){
$c = ($i<0) ? "" : chr(ord("a")+$i); // сначала двухсимвольные токены
$c = ($i==10) ? "aaaaaaa" : $c; // последними - токены c "aaaaaaa"
for($ord2=ord("a"); $ord2 <= ord("j"); $ord2++){
for($ord3=ord("a"); $ord3 <= ord("j"); $ord3++){
$ccc = $c.chr($ord2).chr($ord3);
$pm_white = preg_match("/$ccc/", $whitelist);
$pm_black = preg_match("/$ccc/", $blacklist);
if(($pm_white==1) && ($pm_black==0)) array_push($white_tokens, $ccc);
if(($pm_black==1) && ($pm_white==0)) array_push($black_tokens, $ccc);
}
}
}
return [$white_tokens, $black_tokens];
}// "Жадный" рекурсивный отбор токенов
function greedy($tokens, $list, $pat=""){
$tokens_new = [];
$ppp_max = "";
foreach($tokens as $ppp){
$p = ($pat=="")? $ppp: "$pat|$ppp";
$sum = 0;
foreach($list as $str) $sum += preg_match("/$p/", $str);
if($sum > $sum_max){
$ppp_max = $ppp;
$sum_max = $sum;
}
if($sum > 0) array_push($tokens_new, $ppp);
}
if($ppp_max == "") return $pat;
$pat = ($pat=="")? $ppp_max: "$pat|$ppp_max";
$left = 'display: block; float: left; ';
printf("<span style='$left'>Токенов:</span><span style='$left width:30px;' align='right'>%d</span>", count($tokens));
printf("<span style='$left'> Строк, всего:</span><span style='$left width: 24px;' align='right'>%d</span> по токену: $sum_max ", count($list));
print("Токен: $ppp_max   Паттерн: $pat<br>");
$list = preg_grep("/$ppp_max/", $list, PREG_GREP_INVERT);
$pat = greedy($tokens_new, $list, $pat);
return $pat;
}// Перевод контрольных строк в массивы
$white = explode(chr(10), trim($whitelist));
foreach($white as &$w) $w=trim($w);
$black = explode(chr(10), trim($blacklist));
foreach($black as &$b) $b=trim($b);
$white_black = array_merge($white, $black);// Получение решающих токенов
list($white_tokens, $black_tokens) = tokens_gen($whitelist, $blacklist);// Получение жадных паттернов
print("Жадные паттерны:<br><br>");
$white_pattern = greedy($white_tokens, $white);
$black_pattern = greedy($black_tokens, $black);
//$black_pattern = 'ai|b[bdg]|cg|ie|j[cdh]|ab[ae]|a{8}f';
print("<br><br>white_pattern = $white_pattern<br>black_pattern = $black_pattern<br>");$left = 'display: block; float: left; ';
print("<br>Проверка:<br>");
foreach($white_black as $str){
$in_white = preg_match("/$white_pattern/",$str);
$in_black = preg_match("/$black_pattern/",$str);
print("<br><span style='$left width:80px'>$str:</span>$in_white $in_black");
}
Results:
Greedy patters:
Tokens: 108 Stroke, total token:
Tokens:108 Strock, total:18 token: 2 Token: ag Pattern: ce:ag
Tokens: 88 Stroke, total token: 2 Token: fe Pattern: ce:ag:fe
Tokens:81 Stroke, total:14 token: 2 Token: ff Pattern: ce:ag:fe,ff
Tokens:67 Stroke, total:12 token: 2 Token: aaj Pattern: ce,ag,fe,ff,aaj
Tokens:49 Stroke, total:10 token: 1 Token: cd Pattern: ce:ag:fe,ff,aaj|cd
Tokens:41 Stroke, total:9 token: 1 Token: dc Pattern: ce:ag,fe,ff,aaj,cd,dc
Tokens:33 Stroke, total:8 token: 1 Token: aad Pattern: ceagag:fe,ff,cd,dc|aad
Tokens:28 Stroke, total:7 token: 1 Token: abc Pattern: ce:ag:feajff,cd,dc,aadcabc
Tokens:26 Stroke, total:6 token: 1 Token: abf Pattern: ceagag:feajff,cd,dc,aadabf
Tokens:24 Stroke, total:5 token: 1 Token: acf Pattern: ceagag:feajffajcd,dc,aadabf:acf
Tokens:22 Stroke, total:4 token: 1 Token: bei Pattern: ceagag:fe:ffajcd,dc,aadabc:acfacacacbei
Tokens:18 Stroke, total:3 token: 1 Token: bh Pattern: ceagag:feffff|cd,dc,aadabc:acfacacfacbei|bhh
Tokens:11 Stroke, total:2 token: 1 Token: cij Pattern: ceagag:fe:ffajcdc,aadabcacacfacacbeibebh:cij
Tokens:5 Stroke, total:1 token: 1 Token: aaaaaaaaa Pattern: ceagag:fe:d:cdc|abcacabcacacfacacf:acbei:cij:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Tokens:118 Stroke, total token: 5 Token: bg Pattern: bg
Tokens:118 Stroke, total:16 Token: 2 Token: ai Pattern: bg:ai
Tokens:82 Stroke, total:14 token: 2 Token: bb Pattern: bg:ai,bb
Tokens:75 Stroke, total:12 token: 2 Token: ie Pattern: bg:ai:bb:ie
Tokens:67 Stroke, total:10 token: 2 Token: jc Pattern: bg:ai,bb,ie,jc
Tokens:53 Stroke, total:8 token: 2 Token: jd Pattern: bg:ai:bb:ie:jc|jd
Tokens:37 Stroke, total:6 per token: 1 Token: bd Pattern: bg:ai:bb:ie:jc,jd:bd:d
Tokens:24 Stroke, total:5 thicken: 1 Token: cg Pattern: bg:ai:bb:ie|jc,jd,bd:cg
Tokens:18 Stroke, total:4 token: 1 Token: jh Pattern: bg:ai:bb:ie:jc,jd,c:cg:jh
Tokens:12 Stroke, total:3 token: 1 Token: aba Pattern: bg:ai:bb:ie:jc,jd,c,cg|jh:aba
Tokens:5 Stroke, total:2 token: 1 Token: abe Pattern: bg:ai:bb:ie|jc,jd,c,cg:jh:aba:abe
Tokens:3 Stroke, total:1 token: 1 Token: aaaaaaaaaaf Pattern: bg:ai:bb:ie:jc:jd:cg:jh:aba:abe:aaaaaaaaaaafwhite_pattern = ceagagagfefeff|aaj|cd|dc|abcacabcacabfacacbebeibebhcicijciaaaaaaaaaaa
black_pattern = bg|aiaibb|ie|jc|jdccg|jh|aba|abeaaaaaaaaaaaafVerification:
bhdgffhfa:1 0
aggjjaaga:1 0
hbefebhfi:1 0
dbceeabih:1 0
ihifdbhf:1 0
djacfjbae:1 0
ibjbeigac:1 0
jaffigdad:1 0
aaaaaaaj:1 0
dgifecchi:1 0
dcbhgjaaf:1 0
eedfbcedb:1 0
aaaaaaabf:1 0
beaajbihg:1 0
aaaaaaag:1 0
eacccdjeh:1 0
aaaaaaad:1 0
hehcijfhc:1 0
aaaaaa:1 0
aaaaaabc:1 0
jfdhdeice:1 0
aaaaaaaai:0 1
ehigcbgjd:0 1
hgjdeajec:0 1
aaaaaabe:0 1
haebgiggc:0 1
aaaaaaba:0 1
aaaaaaaf:0 1
bfjhehbcf:0 1
diieabefh:0 1
jhjcaeead:0 1
ifbjdgibf:0 1
dddbgcgai:0 1
bbcafhcif:0 1
hfjciccbi:0 1
fdbgidjdj:0 1
aaaaaabb:0 1
jhdibgbfj:0 1
cfjbaijad:0 1
bhgjfacgi:0 1
ehhieihhh:0 1
abijjabda:0 1The test showed that all the patters were working.
A schema with a whole "great" overhead gave the same results.