Regular expression different from what set of lines



  • There are two sets of lines.
    White list:

    bhdgffhfa
    aggjjaaga
    hbefebhfi
    dbceeabih
    ihifdbhhf
    djacfjbae
    ibjbeigac
    jaffigdad
    aaaaaaaaj
    dgifecchi
    dcbhgjaaf
    eedfbcedb
    aaaaaaabf
    beaajbihg
    aaaaaaaag
    eacccdjeh
    aaaaaaaad
    hehcijfhc
    aaaaaaaaa
    aaaaaaabc
    jfdhdeice
    

    And the blacklist:

    aaaaaaaai
    ehigcbgjd
    hgjdeajec
    aaaaaaabe
    haebgiggc
    aaaaaaaba
    aaaaaaaaf
    bfjhehbcf
    diieabefh
    jhjcaeead
    ifbjdgibf
    dddbgcgai
    bbcafhcif
    hfjciccbi
    fdbgidjdj
    aaaaaaabb
    jhdibgbfj
    cfjbaijad
    bhgjfacgi
    ehhieihhh
    abijjabda
    

    We need to write a regular expression, not longer than 60 symbols, which can distinguish from which set the line is taken.
    For example, if the programme entrance is available aaaaaaaaathe program has to decide it's a line from the white list.

    UPDATE: https://www.hackerrank.com/contests/regular-expresso/challenges/match-maker



  • Two decisive pathters may be received: on the white list (whitelist) and on the black list (blacklist).

    White list: /ce|ag|fe|ff|aaj|cd|dc|aad|abc|abf|acf|bei|bhh|cij|aaaaaaaaa/
    Black list: /bg|ai|bb|ie|jc|jd|bd|cg|jh|aba|abe|aaaaaaaaf/

    The Black List Pattern is shorter and allows a compression of up to 37 symbols: /ai|b[bdg]|cg|ie|j[cdh]|ab[ae]|a{8}f/
    But using two patters makes it possible to diagnose.

    Resolving patterns are drawn from a unique list of tokens on the optimised greed algorithm.

    Two- and three-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-seven-s-seven-s.

    PHP:

    $whitelist = "
        bhdgffhfa
        aggjjaaga
        hbefebhfi
        dbceeabih
        ihifdbhhf
        djacfjbae
        ibjbeigac
        jaffigdad
        aaaaaaaaj
        dgifecchi
        dcbhgjaaf
        eedfbcedb
        aaaaaaabf
        beaajbihg
        aaaaaaaag
        eacccdjeh
        aaaaaaaad
        hehcijfhc
        aaaaaaaaa
        aaaaaaabc
        jfdhdeice
    ";
    

    $blacklist = "
    aaaaaaaai
    ehigcbgjd
    hgjdeajec
    aaaaaaabe
    haebgiggc
    aaaaaaaba
    aaaaaaaaf
    bfjhehbcf
    diieabefh
    jhjcaeead
    ifbjdgibf
    dddbgcgai
    bbcafhcif
    hfjciccbi
    fdbgidjdj
    aaaaaaabb
    jhdibgbfj
    cfjbaijad
    bhgjfacgi
    ehhieihhh
    abijjabda
    ";

    //Генератор решающих токенов
    function tokens_gen($whitelist, $blacklist){
    $white_tokens = [];
    $black_tokens = [];
    for($i=-1; $i< 12; $i++){
    $c = ($i<0) ? "" : chr(ord("a")+$i); // сначала двухсимвольные токены
    $c = ($i==10) ? "aaaaaaa" : $c; // последними - токены c "aaaaaaa"
    for($ord2=ord("a"); $ord2 <= ord("j"); $ord2++){
    for($ord3=ord("a"); $ord3 <= ord("j"); $ord3++){
    $ccc = $c.chr($ord2).chr($ord3);
    $pm_white = preg_match("/$ccc/", $whitelist);
    $pm_black = preg_match("/$ccc/", $blacklist);
    if(($pm_white==1) && ($pm_black==0)) array_push($white_tokens, $ccc);
    if(($pm_black==1) && ($pm_white==0)) array_push($black_tokens, $ccc);
    }
    }
    }
    return [$white_tokens, $black_tokens];
    }

    // "Жадный" рекурсивный отбор токенов
    function greedy($tokens, $list, $pat=""){
    $tokens_new = [];
    $ppp_max = "";
    foreach($tokens as $ppp){
    $p = ($pat=="")? $ppp: "$pat|$ppp";
    $sum = 0;
    foreach($list as $str) $sum += preg_match("/$p/", $str);
    if($sum > $sum_max){
    $ppp_max = $ppp;
    $sum_max = $sum;
    }
    if($sum > 0) array_push($tokens_new, $ppp);
    }
    if($ppp_max == "") return $pat;
    $pat = ($pat=="")? $ppp_max: "$pat|$ppp_max";
    $left = 'display: block; float: left; ';
    printf("<span style='$left'>Токенов:</span><span style='$left width:30px;' align='right'>%d</span>", count($tokens));
    printf("<span style='$left'>&emsp;Строк,&emsp;всего:</span><span style='$left width: 24px;' align='right'>%d</span>&emsp;по токену: $sum_max&emsp;", count($list));
    print("Токен: $ppp_max &emsp; Паттерн: $pat<br>");
    $list = preg_grep("/$ppp_max/", $list, PREG_GREP_INVERT);
    $pat = greedy($tokens_new, $list, $pat);
    return $pat;
    }

    // Перевод контрольных строк в массивы
    $white = explode(chr(10), trim($whitelist));
    foreach($white as &$w) $w=trim($w);
    $black = explode(chr(10), trim($blacklist));
    foreach($black as &$b) $b=trim($b);
    $white_black = array_merge($white, $black);

    // Получение решающих токенов
    list($white_tokens, $black_tokens) = tokens_gen($whitelist, $blacklist);

    // Получение жадных паттернов
    print("Жадные паттерны:<br><br>");
    $white_pattern = greedy($white_tokens, $white);
    $black_pattern = greedy($black_tokens, $black);
    //$black_pattern = 'ai|b[bdg]|cg|ie|j[cdh]|ab[ae]|a{8}f';
    print("<br><br>white_pattern = $white_pattern<br>black_pattern = $black_pattern<br>");

    $left = 'display: block; float: left; ';
    print("<br>Проверка:<br>");
    foreach($white_black as $str){
    $in_white = preg_match("/$white_pattern/",$str);
    $in_black = preg_match("/$black_pattern/",$str);
    print("<br><span style='$left width:80px'>$str:</span>$in_white $in_black");
    }

    Results:

    Greedy patters:

    Tokens: 108 Stroke, total token:
    Tokens:108 Strock, total:18 token: 2 Token: ag Pattern: ce:ag
    Tokens: 88 Stroke, total token: 2 Token: fe Pattern: ce:ag:fe
    Tokens:81 Stroke, total:14 token: 2 Token: ff Pattern: ce:ag:fe,ff
    Tokens:67 Stroke, total:12 token: 2 Token: aaj Pattern: ce,ag,fe,ff,aaj
    Tokens:49 Stroke, total:10 token: 1 Token: cd Pattern: ce:ag:fe,ff,aaj|cd
    Tokens:41 Stroke, total:9 token: 1 Token: dc Pattern: ce:ag,fe,ff,aaj,cd,dc
    Tokens:33 Stroke, total:8 token: 1 Token: aad Pattern: ceagag:fe,ff,cd,dc|aad
    Tokens:28 Stroke, total:7 token: 1 Token: abc Pattern: ce:ag:feajff,cd,dc,aadcabc
    Tokens:26 Stroke, total:6 token: 1 Token: abf Pattern: ceagag:feajff,cd,dc,aad🔤abf
    Tokens:24 Stroke, total:5 token: 1 Token: acf Pattern: ceagag:feajffajcd,dc,aad🔤abf:acf
    Tokens:22 Stroke, total:4 token: 1 Token: bei Pattern: ceagag:fe:ffajcd,dc,aad🔤abc:acfacacacbei
    Tokens:18 Stroke, total:3 token: 1 Token: bh Pattern: ceagag:feffff|cd,dc,aad🔤abc:acfacacfacbei|bhh
    Tokens:11 Stroke, total:2 token: 1 Token: cij Pattern: ceagag:fe:ffajcdc,aad🔤abcacacfacacbeibebh:cij
    Tokens:5 Stroke, total:1 token: 1 Token: aaaaaaaaa Pattern: ceagag:fe:d:cdc|abcacabcacacfacacf:acbei:cij:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    Tokens:118 Stroke, total token: 5 Token: bg Pattern: bg
    Tokens:118 Stroke, total:16 Token: 2 Token: ai Pattern: bg:ai
    Tokens:82 Stroke, total:14 token: 2 Token: bb Pattern: bg:ai,bb
    Tokens:75 Stroke, total:12 token: 2 Token: ie Pattern: bg:ai:bb:ie
    Tokens:67 Stroke, total:10 token: 2 Token: jc Pattern: bg:ai,bb,ie,jc
    Tokens:53 Stroke, total:8 token: 2 Token: jd Pattern: bg:ai:bb:ie:jc|jd
    Tokens:37 Stroke, total:6 per token: 1 Token: bd Pattern: bg:ai:bb:ie:jc,jd:bd:d
    Tokens:24 Stroke, total:5 thicken: 1 Token: cg Pattern: bg:ai:bb:ie|jc,jd,bd:cg
    Tokens:18 Stroke, total:4 token: 1 Token: jh Pattern: bg:ai:bb:ie:jc,jd,c:cg:jh
    Tokens:12 Stroke, total:3 token: 1 Token: aba Pattern: bg:ai:bb:ie:jc,jd,c,cg|jh:aba
    Tokens:5 Stroke, total:2 token: 1 Token: abe Pattern: bg:ai:bb:ie|jc,jd,c,cg:jh:aba:abe
    Tokens:3 Stroke, total:1 token: 1 Token: aaaaaaaaaaf Pattern: bg:ai:bb:ie:jc:jd:cg:jh:aba:abe:aaaaaaaaaaaf

    white_pattern = ceagagagfefeff|aaj|cd|dc|abcacabcacabfacacbebeibebhcicijciaaaaaaaaaaa
    black_pattern = bg|aiaibb|ie|jc|jdccg|jh|aba|abeaaaaaaaaaaaaf

    Verification:

    bhdgffhfa:1 0
    aggjjaaga:1 0
    hbefebhfi:1 0
    dbceeabih:1 0
    ihifdbhf:1 0
    djacfjbae:1 0
    ibjbeigac:1 0
    jaffigdad:1 0
    aaaaaaaj:1 0
    dgifecchi:1 0
    dcbhgjaaf:1 0
    eedfbcedb:1 0
    aaaaaaabf:1 0
    beaajbihg:1 0
    aaaaaaag:1 0
    eacccdjeh:1 0
    aaaaaaad:1 0
    hehcijfhc:1 0
    aaaaaa:1 0
    aaaaaabc:1 0
    jfdhdeice:1 0
    aaaaaaaai:0 1
    ehigcbgjd:0 1
    hgjdeajec:0 1
    aaaaaabe:0 1
    haebgiggc:0 1
    aaaaaaba:0 1
    aaaaaaaf:0 1
    bfjhehbcf:0 1
    diieabefh:0 1
    jhjcaeead:0 1
    ifbjdgibf:0 1
    dddbgcgai:0 1
    bbcafhcif:0 1
    hfjciccbi:0 1
    fdbgidjdj:0 1
    aaaaaabb:0 1
    jhdibgbfj:0 1
    cfjbaijad:0 1
    bhgjfacgi:0 1
    ehhieihhh:0 1
    abijjabda:0 1

    The test showed that all the patters were working.
    A schema with a whole "great" overhead gave the same results.




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2