Split text into sentences while keeping the delimiter



  • Can the end of a sentence be restricted to a pattern such as:

    "start letter" or "or"?

    For example:

    "Hi! I'm a simple text. Can you split me?"

    ['Hi!', "I'm a simple text.", 'Can you split me?']

    I made an attempt, but it does not work:

    re.split(r'\w[.!?]+\s+[А-Я]', "Hello! I'm John. Are you OK? fine... and so")
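A short sketch of why this attempt fails, using the same pattern on the two sample strings from this page. The variable names are illustrative only. The likely cause is twofold: [А-Я] contains only Cyrillic capitals, so nothing matches in English text, and re.split removes everything the pattern matches, including the letters on both sides of the punctuation.

```python
import re

# The pattern from the attempt above, unchanged.
attempt = r'\w[.!?]+\s+[А-Я]'

# English text: [А-Я] never matches a Latin capital, so nothing is split.
latin = re.split(attempt, "Hello! I'm John. Are you OK? fine... and so")
print(latin)

# Russian text: the pattern matches, but the matched characters
# (last letter, punctuation, space, first capital) are consumed.
cyrillic = re.split(attempt, "Привет! Я простой текст. Ты сможешь разделить меня?")
print(cyrillic)
# ['Приве', ' простой текс', 'ы сможешь разделить меня?']
```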
    


  • Split on the space, but use a lookbehind (https://ru.wikipedia.org/wiki/%D0%A0%D0%B5%D0%B3%D1%83%D0%BB%D1%8F%D1%80%D0%BD%D1%8B%D0%B5_%D0%B2%D1%8B%D1%80%D0%B0%D0%B6%D0%B5%D0%BD%D0%B8%D1%8F#.D0.9F.D1.80.D0.BE.D1.81.D0.BC.D0.BE.D1.82.D1.80_.D0.B2.D0.BF.D0.B5.D1.80.D1.91.D0.B4_.D0.B8_.D0.BD.D0.B0.D0.B7.D0.B0.D0.B4) to make sure the space is preceded by a letter and a sentence-ending mark (., ! or ?):

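    import re

    # Split on a space that is preceded by a word character and one of . ! ?
    # The lookbehind is zero-width, so the punctuation stays in the result.
    result = re.split(r'(?<=\w[.!?]) ', "Hello! I'm John. Are you OK? fine... and so")
    print(result)

    result = re.split(r'(?<=\w[.!?]) ', "Привет! Я простой текст. Ты сможешь разделить меня?")
    print(result)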

    Result:

    ['Hello!', "I'm John.", 'Are you OK?', 'fine... and so']
    ['Привет!', 'Я простой текст.', 'Ты сможешь разделить меня?']

    P.S. I did not test this on Unicode. Tested at https://repl.it/languages/python3

    UPD: \w should perhaps be replaced by an explicit list of allowed characters, since it matches letters, digits and the underscore.
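One possible way to follow the UPD note, as a sketch rather than a definitive solution: the lookbehind uses an explicit character class instead of \w, and a lookahead additionally requires a capital letter after the whitespace. The name split_sentences and the chosen character ranges (Latin and Cyrillic letters) are assumptions made for this example.

```python
import re

# Assumed character classes: Latin and Cyrillic letters only.
# The lookbehind has a fixed width (one letter plus one mark),
# which Python's re module requires.
SENTENCE_BOUNDARY = re.compile(
    r'(?<=[A-Za-zА-Яа-яЁё][.!?])\s+(?=[A-ZА-ЯЁ])'
)

def split_sentences(text):
    """Split text at whitespace that follows letter + . ! ? and precedes a capital."""
    return SENTENCE_BOUNDARY.split(text)

print(split_sentences("Hello! I'm John. Are you OK? fine... and so"))
print(split_sentences("Привет! Я простой текст. Ты сможешь разделить меня?"))
```

With the capital-letter lookahead, "Are you OK? fine... and so" stays in one piece, because "fine" starts with a lowercase letter. Drop the lookahead if a split there is wanted.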



