Before answering

To process HTML, you should really use a DOM (https://es.wikipedia.org/wiki/Document_Object_Model). In JavaScript, the language's own functions let you do it properly. The way to do it would be to walk each text node of your HTML and check whether consecutive nodes run together for more than 30 characters.

Besides, not all tags behave the same. An <i>, as in your example, is not the same as a <div> or a <p>: those will break up the text. Covering that would make for a considerably longer answer, however, so I will answer your regex question specifically, if and only if the risks are clear.

What are the risks?

Apart from not identifying the semantics of each tag, regex is not the tool for parsing HTML. For example, you are using something quite simplistic in your regex:

/<[^>]+>/

but you will always find some exception in HTML syntax. For example:

<input type="text" value="esto > te rompe el patrón">
<!-- o esto > también -->
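To see the problem concretely, here is a minimal sketch that runs the simplistic pattern against the two snippets above. The variable names are only for illustration; nothing here is part of the final solution.

```javascript
// Simplistic tag pattern: "<", then anything that is not ">", then ">".
const naiveTag = /<[^>]+>/;

const input = '<input type="text" value="esto > te rompe el patrón">';
const comment = '<!-- o esto > también -->';

// In both cases the match stops at the first ">", which is NOT the end of the tag.
const inputMatch = input.match(naiveTag)[0];
const commentMatch = comment.match(naiveTag)[0];

console.log(inputMatch);   // <input type="text" value="esto >
console.log(commentMatch); // <!-- o esto >
```

The rest of each tag is left behind and would be treated as ordinary text.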
Those cases can be incorporated into the regex logic, but then another one appears, such as CDATA sections (https://www.w3.org/TR/REC-xml/#sec-cdata-sect). And then another exception, and another, and another...

Not even a 1000-character regex will let you parse HTML properly. For HTML, therefore, regex is not the solution; that is what the JavaScript functions that treat it as a DOM are for. I think you will get a glimpse of the complexity by the time you reach the end of this answer.

We could assume that all your HTML is simple and that it will not break if we use <[^>]+>. But it has to be very clear that this is a very simplistic solution, one in which we are assuming too much. If we allow ourselves to assume that the HTML is basic, then we can see how to answer, but you would be taking on that risk.

Solution for simple tags, which do not have a > inside

Base structure. The idea is to match 30 consecutive characters that are not whitespace (an NBSP is allowed), and replace them following some logic:

html.replace( /(^|[^\S\xa0])([\S\xa0]{30,})/g, function (m, carPrevio, textoLargo) {
    textoLargo = tuLogicaParaCorregir( textoLargo );
    return carPrevio + textoLargo;
});
which matches:

(^|[^\S\xa0]) - Group 1 (previous character) - The beginning of the text, or a whitespace character (except \xa0, which is an NBSP).

Why that character class? Because an NBSP is included in \s in most modern JavaScript implementations (it depends on the browser or version), but we have to exclude it from \s.

The logic of [^\S\xa0] is a double negative. It matches a character that:
- is NOT a non-whitespace character (\S), that is, it is whitespace;
- is NOT an NBSP (\xa0).

([\S\xa0]{30,}) - Group 2 (long text) - 30 or more consecutive characters that are not whitespace, or that are an NBSP.

Then, inside replace(), we use a function to apply whatever logic you want to that long text: have the NBSPs replaced by spaces, or whatever you prefer.

You may wonder why I match the previous character if there is no need to compare it. It is purely a matter of efficiency. By forcing the match to be anchored to the whitespace before the word, we reduce the number of starting positions from which failed matches can be attempted. In fact, with the text in the code below, it drops from 15224 steps to 6068 steps when that character is used.

Ignore tags. Now, inside that [\S\xa0]{30,}, we have to make sure tags are not counted. To begin with, we exclude the characters < and [:

(?:[^\s<[]|\xa0){30,}
And then we allow any number of tags to match before each character:

(?:(?:<[^>]+>|\[[^\]]+])*(?:[^\s<[]|\xa0)){30,}
This way we build a loop in which there can be any number of tags, but always exactly 1 character outside a tag, and that loop repeats 30 or more times.

Count HTML entities as 1 character. In your question, you are counting &nbsp; as if it were 6 characters, but in HTML it is rendered as a single character. To count it as 1, we use logic similar to what we used for tags. We exclude & from the character class and add &\w+; so that an entity is counted once per iteration. But we also have to add &(?!\w+;) to count any & that is not part of an HTML entity:

(?:(?:<[^>]+>|\[[^\]]+])*(?:[^\s<[&]|\xa0|&(?:\w+;|(?!\w+;)))){30,}
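A small sketch of how that loop counts, in case it helps. The helper longRun(n) is hypothetical: it only rebuilds the pattern above with a configurable minimum and a leading ^ so that short strings can be checked by hand; both are assumptions made for the demo, not part of the solution.

```javascript
// Hypothetical helper: the counting loop above, requiring at least `n` visible characters.
function longRun(n) {
  // Any number of <tags> or [tags]; they add 0 to the count.
  const tags = String.raw`(?:<[^>]+>|\[[^\]]+])*`;
  // Exactly 1 visible character: a non-space, an NBSP, an entity, or a lone "&".
  const char = String.raw`(?:[^\s<[&]|\xa0|&(?:\w+;|(?!\w+;)))`;
  // The ^ anchor is only for this demo, so every check starts at position 0.
  return new RegExp(`^(?:${tags}${char}){${n},}`);
}

console.log(longRun(5).test('ab<i>c</i>de')); // true:  a b c d e       -> 5 characters
console.log(longRun(6).test('ab<i>c</i>de')); // false: tags are not counted
console.log(longRun(3).test('a&amp;b'));      // true:  a, &amp;, b     -> 3 characters
console.log(longRun(4).test('a&amp;b'));      // false: &amp; counts once
console.log(longRun(3).test('a&b'));          // true:  a lone & still counts as 1
console.log(longRun(3).test('a\xa0b'));       // true:  a literal NBSP counts as 1
console.log(longRun(3).test('a b c'));        // false: an ordinary space ends the run
```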
One more case. It might seem that the previous regex already covers everything. However, we still have to take into account that, when it does not find a long word, it can match inside a tag:

<tag porque_el_motor_de_RegExp_puede_iniciar_la_coincidencia_aca="aahhh!">
//   |
//   |
//   +---> nothing prevents it from trying from here if it did not match another long word
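This can be checked with a quick sketch. The pattern below is the one built so far (previous character plus the counting loop), without any protection for tags; the variable names are only illustrative.

```javascript
// Regex built so far: group 1 = previous character, group 2 = run of 30+ visible characters.
const unguarded = /(^|[^\S\xa0])((?:(?:<[^>]+>|\[[^\]]+])*(?:[^\s<[&]|\xa0|&(?:\w+;|(?!\w+;)))){30,})/;

const tag = '<tag porque_el_motor_de_RegExp_puede_iniciar_la_coincidencia_aca="aahhh!">';
const m = tag.match(unguarded);

// The attempt at position 0 fails, so the engine retries from the space inside the tag.
console.log(m.index); // 4 (the space right after "<tag")
console.log(m[2]);    // the attribute and the closing ">", taken as if it were a long word
```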
Ideally we would use the /y modifier (sticky, https://developer.mozilla.org/es/docs/Web/JavaScript/Reference/Global_Objects/RegExp/sticky). However, IE does not support it (http://kangax.github.io/compat-table/es6/#test-RegExp_y_and_u_flags); bad IE, ruining everything again.

To avoid that behavior, we are going to use a trick. RexEgg.com calls it the best regex trick (http://www.rexegg.com/regex-best-trick.html). It consists of also matching what we do not want to match, so that we can put it back in the replacement without modifying anything. That is:

/(what we do not want)|our regex/
and returning $1 unchanged in the replacement.

To ignore tags, it would be:

/(<[^>]+>|\[[^\]]+])|our regex/g
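As a toy illustration of the trick, here is a sketch where the "real" pattern is just the letter a, chosen only to keep the example small. Group 1 captures tags and is returned untouched; only matches outside tags are changed.

```javascript
// Group 1 = what must be left alone (tags); second alternative = the actual target.
const pattern = /(<[^>]+>)|a/g;

const sample = '<a class="a">banana</a>';

// If group 1 took part in the match, return it as is; otherwise apply the change.
const guarded = sample.replace(pattern, (match, tag) => (tag ? tag : 'A'));

// For comparison: the same replacement without the trick also rewrites the tags.
const naive = sample.replace(/a/g, 'A');

console.log(guarded); // <a class="a">bAnAnA</a>
console.log(naive);   // <A clAss="A">bAnAnA</A>
```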
Putting the parts together. Finally, we have the regex:

/(<[^>]+>|\[[^\]]+])|(^|[^\S\xa0])((?:(?:<[^>]+>|\[[^\]]+])*(?:[^\s<[&]|\xa0|&(?:\w+;|(?!\w+;)))){30,})/g
which will match either a tag or a text of 30 or more characters, not counting tags.

So we already have the match for the 30+ characters, with any number of tags in between, but...

And how do we replace the &nbsp; or the literal NBSP with a space?

Inside the function we pass to replace(), we also need to make sure that the characters we replace are not inside a tag. That is, we are going to have one replace inside another replace:

html = html.replace( regex1, function (m, carPrevio, textoLargo) {
    // Here we have the selected text of 30+ characters, tags included
    textoLargo = textoLargo.replace( regex2, function (m, grupo1) {
        // Here we replace the NBSPs that are not inside a tag
        return /* ... */;
    });
    return /* ... */;
});
And to replace only those that are not inside a tag, we apply the same trick again, that is:

/(<[^>]+>|\[[^\]]+])|&nbsp;|\xa0/ig
Code:

// HTML text
let html = [
    'laaaaaaaaaar&nbsp;gooooooooooooooo&nbsp;oooooooo',
    '',
    'would not match in this case',
    '(it actually matches, but it is returned unmodified):',
    'abcdfghijklmnño<i>pqrstuvxyzqwerty',
    '',
    'Some examples, besides the previous one, would be the following',
    '(I stretched it to count the &nbsp; as 1):',
    '&nbsp;<a href="http://elpais.com/diario/1978/02/22/sociedad/256950016_850215.html" style="outline-style: none; outline-width: initial; outline-color: initial; display: inline; ">"Tienda&nbsp;<i>País</i>&nbsp;21.1.56abcdef890',
    '',
    'The same, but with 29 characters:',
    '&nbsp;<a href="http://elpais.com/diario/1978/02/22/sociedad/256950016_850215.html" style="outline-style: none; outline-width: initial; outline-color: initial; display: inline; ">"Tienda&nbsp;<i>País</i>&nbsp;21.1.56abcdef89',
    '',
    'if we do not count the tags, its length is only 20:',
    'padre&nbsp;<b>Terreros:</b></span></div>'
].join('\n');
let regex = /(<[^>]+>|\[[^\]]+])|(^|[^\S\xa0])((?:(?:<[^>]+>|\[[^\]]+])*(?:[^\s<[&]|\xa0|&(?:\w+;|(?!\w+;)))){30,})/g;
html = html.replace(
    regex,
    function (m, tagExcluido, carPrevio, textoLargo) {
        // if it matched a tag, return it unmodified
        if (tagExcluido) return tagExcluido;
        // replace &nbsp; and NBSP with spaces, except inside tags
        textoLargo = textoLargo.replace(
            /(<[^>]+>|\[[^\]]+])|&nbsp;|\xa0/ig,
            function (m, tagExcluido) {
                // again, leave tags alone or return a space
                return (tagExcluido ? tagExcluido : ' ');
            }
        );
        return carPrevio + textoLargo;
    }
);
// print the result in a <pre> to see it
document
    .getElementById('resultado')
    .innerText = html;

<pre id="resultado" />
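If you want to try the same logic outside a browser (in Node, for instance), a possible sketch is to wrap it in a function that takes and returns a string. The name breakLongRuns and the sample strings are hypothetical; the two patterns are the ones from this answer, and the same caveats about simple HTML apply.

```javascript
// Hypothetical wrapper: replaces &nbsp;/NBSP with spaces inside runs of 30+ visible characters.
function breakLongRuns(html) {
  // Group 1: a tag to leave alone. Group 2: previous character. Group 3: the long run.
  const longRuns = /(<[^>]+>|\[[^\]]+])|(^|[^\S\xa0])((?:(?:<[^>]+>|\[[^\]]+])*(?:[^\s<[&]|\xa0|&(?:\w+;|(?!\w+;)))){30,})/g;
  // Same trick again: group 1 protects tags while &nbsp; and NBSP are replaced.
  const nbspOutsideTags = /(<[^>]+>|\[[^\]]+])|&nbsp;|\xa0/gi;

  return html.replace(longRuns, (m, tagExcluido, carPrevio, textoLargo) => {
    if (tagExcluido) return tagExcluido; // matched a tag: return it unmodified
    const fixed = textoLargo.replace(nbspOutsideTags, (m2, tag) => (tag ? tag : ' '));
    return carPrevio + fixed;
  });
}

// 15 + 1 + 15 = 31 visible characters: the &nbsp; becomes a space.
const longSample = 'a'.repeat(15) + '&nbsp;' + 'b'.repeat(15);
// 14 + 1 + 14 = 29 visible characters: left as is.
const shortSample = 'a'.repeat(14) + '&nbsp;' + 'b'.repeat(14);
// Tags are not counted, and the &nbsp; inside the attribute is not touched.
const taggedSample = 'x <a title="a&nbsp;b">' + 'a'.repeat(15) + '<i>\xa0</i>' + 'b'.repeat(15);

console.log(breakLongRuns(longSample));
console.log(breakLongRuns(shortSample));
console.log(breakLongRuns(taggedSample));
```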