How do I strip the tags from a parsed page?



  • My task is to collect information from a site and, eventually, to build a search engine on top of it that searches the site for the best match to a query. Right now I need to clean the parsed pages, leaving only the text without the tags. The code also has to check each time whether a page has been updated and, if so, parse it again.

    For some reason, the resulting page still contains garbage in the form of tags, although I expected clean text. What should I do?

    from bs4 import BeautifulSoup
    from os import walk
    # the difflib module provides classes and functions for comparing sequences (texts)
    import difflib
    import re
    import codecs  # character encoding/decoding module
    

    pages_list = []

    Collect the list of available pages by walking raw_pages (the folders on disk that hold the pages):

    for dirpath, dirnames, filenames in walk('data/raw_pages'):
        if '.ipynb_checkpoints' in dirpath:
            continue

        dirpath = dirpath.replace('\\', '/')  # for Windows
        for fn in filenames:
            if '.DS_Store' in fn:
                continue
            fp = f'{dirpath}/{fn}'
            pages_list.append(fp)
    

    def remove_script(file):
        """Make difflib's job easier by removing scripts, footers and headers."""
        soup = BeautifulSoup(''.join(file), 'html.parser')
        for s in soup.select('script'):
            # extract() removes a tag or string from the tree
            # and returns the tag or string that was extracted
            s.extract()

        for f in soup.select('footer'):
            f.extract()

        for f in soup.select('header'):
            f.extract()

        return str(soup).split('\n')
    

    Now take the first two files from the resulting list and clean them with remove_script:

    fp_1 = 'data/raw_pages/zrg74.ru/obshhestvo/item/26920-chistovoe-vyrazhenie-v-zlatouste-oglasili-sroki-sdachi-10-jetazhki-dlja-vethoavarijshhikov.html'
    fp_2 = 'data/raw_pages/zrg74.ru/obshhestvo/item/26924-verh-masterstva-v-zlatouste-blagoustrojstvo-jekotropy-urenga-zavershajut-rabotami-na-vysote.html'

    with codecs.open(fp_1, 'r', 'utf_8_sig') as f:
        file1 = remove_script(f.readlines())
    with codecs.open(fp_2, 'r', 'utf_8_sig') as f:
        file2 = remove_script(f.readlines())

    def clean_diff(diff):
        """Cleanup function (will be useful later)."""
        diff = re.sub(r'<[^<>]+>', ' ', diff)  # drop complete tags
        diff = re.sub('&nbsp;', ' ', diff)
        diff = re.sub('\xa0', ' ', diff)
        diff = re.sub(r'\s\s+', ' ', diff)
        diff = re.sub(r'^[+-] ', '', diff)

        return diff
    

    Now compare all pages with the "reference" page: we take the very first link as the reference.

    page_lines = []
    for diff in difflib.ndiff(file1, file2):  # with data preprocessing
        if re.search(r'^\+ ', diff) is None:
            continue
        diff = clean_diff(diff)
        if len(diff) == 0:
            continue
        page_lines.append(diff)

    print(diff)

    page_text = ' '.join(page_lines)

    print(page_text)

    print(page_lines)
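
One possible cause of the leftover tags, shown as a minimal sketch on made-up data: `difflib.ndiff` compares line by line, so a tag that spans several lines reaches the cleaning regex in pieces, and `<[^<>]+>` cannot match a piece that has no closing `>` on the same line. The sample lines and the `strip_tags_per_line` helper are hypothetical and are not taken from the pages above.

```python
import difflib
import re

TAG_RE = re.compile(r'<[^<>]+>')

def strip_tags_per_line(line):
    """Remove complete tags from a single line (hypothetical helper)."""
    return TAG_RE.sub(' ', line).strip()

old_page = ['<p>Old text</p>']
# made-up sample: the <a> tag is split across two lines
new_page = ['<p>New text</p>', '<a href="/news"', 'class="link">More</a>']

# keep only the lines that ndiff marks as added, without the '+ ' prefix
added = [d[2:] for d in difflib.ndiff(old_page, new_page) if d.startswith('+ ')]

# cleaning each line separately leaves the halves of the split tag behind
per_line = [strip_tags_per_line(line) for line in added]

# joining the lines first lets the regex see the whole tag
joined = re.sub(r'\s+', ' ', TAG_RE.sub(' ', ' '.join(added))).strip()

print(per_line)
print(joined)
```

If this is what happens on the real pages, joining the added lines before stripping the tags, or extracting the text from the parsed tree before diffing, may avoid the problem.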



  • Try using regular expressions from the re module.

    Example code:

    import re
    

    new_string = re.sub('<[^>]*>', '', your_string)
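
A slightly fuller sketch of the same idea, assuming the page is already in memory as a string: drop `<script>` and `<style>` blocks together with their content, strip the remaining tags, decode entities with `html.unescape`, and collapse the whitespace. The `strip_tags` name and the sample HTML are made up for illustration. A regex is a pragmatic shortcut here, not a full HTML parser, so unusual markup may still slip through.

```python
import html
import re

def strip_tags(markup):
    """Return the visible text of an HTML string (illustrative helper)."""
    # remove script/style blocks including their content
    text = re.sub(r'(?is)<(script|style)\b.*?</\1\s*>', ' ', markup)
    # remove every remaining tag, even one that spans several lines
    text = re.sub(r'(?s)<[^>]*>', ' ', text)
    # turn entities such as &nbsp; or &amp; into characters
    text = html.unescape(text)
    # collapse runs of whitespace (including non-breaking spaces)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

sample = '<div class="x">Hello,&nbsp;<b>world</b><script>var a = 1;</script></div>'
print(strip_tags(sample))
```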



