How do I strip the tags from a parsed page?
-
My task is to collect information from a site and, as far as possible, build a search engine over it. It will then search the site for the pages most relevant to a search phrase. Right now I need to parse the pages, leaving only the text without the tags. The code also has to check each time whether a page has been updated and, if so, compare the versions.
For some reason, the resulting page still contains garbage ("musor") in the form of tags, which is not what I expected. What should I do?
from bs4 import BeautifulSoup
from os import walk
# the difflib module contains classes and functions for comparing sequences (texts)
import difflib
import re
import codecs  # character encoding/decoding module
pages_list = []
We collect the list of available pages by walking through raw_pages (the folders on the hard drive that hold the pages):
for dirpath, dirnames, filenames in walk('data/raw_pages'):
    if '.ipynb_checkpoints' in dirpath:
        continue
    dirpath = dirpath.replace('\\', '/')  # for Windows
    for fn in filenames:
        if '.DS_Store' in fn:
            continue
        fp = f'{dirpath}/{fn}'
        pages_list.append(fp)
def remove_script(file):
    """Make difflib's life easier by removing scripts, footers and headers."""
    soup = BeautifulSoup(''.join(file), 'html.parser')
    for s in soup.select('script'):
        # extract() removes a tag or string from the tree
        # and returns the tag or string that was extracted
        s.extract()
    for f in soup.select('footer'):
        f.extract()
    for f in soup.select('header'):
        f.extract()
    return str(soup).split('\n')
Now we take the first 2 files from the resulting list and clean them with the remove_script function:
fp_1 = 'data/raw_pages/zrg74.ru/obshhestvo/item/26920-chistovoe-vyrazhenie-v-zlatouste-oglasili-sroki-sdachi-10-jetazhki-dlja-vethoavarijshhikov.html'
fp_2 = 'data/raw_pages/zrg74.ru/obshhestvo/item/26924-verh-masterstva-v-zlatouste-blagoustrojstvo-jekotropy-urenga-zavershajut-rabotami-na-vysote.html'
with codecs.open(fp_1, 'r', 'utf_8_sig') as f:
    file1 = remove_script(f.readlines())
with codecs.open(fp_2, 'r', 'utf_8_sig') as f:
    file2 = remove_script(f.readlines())

def clean_diff(diff):
    """Cleaning function (will come in handy later)."""
    diff = re.sub('<[^<>]+>', ' ', diff)
    diff = re.sub(' ', ' ', diff)
    diff = re.sub('\xa0', ' ', diff)
    diff = re.sub(r'\s\s+', ' ', diff)
    diff = re.sub('^[+-] ', '', diff)
    return diff
Now we compare all the pages against a "reference" page: for the reference we take the very first link.
page_lines = []
for diff in difflib.ndiff(file1, file2):  # with data preprocessing
    # the plus sign must be escaped, otherwise re raises "nothing to repeat"
    if re.search(r'^\+ ', diff) is None:
        continue
    diff = clean_diff(diff)
    if len(diff) == 0:
        continue
    page_lines.append(diff)
    print(diff)
page_text = ' '.join(page_lines)
print(page_text)
print(page_lines)
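To make the comparison step easier to check by eye, here is a minimal sketch of the same filtering on two hypothetical in-memory pages (old_page and new_page are made-up stand-ins for file1 and file2, not the real files). difflib.ndiff prefixes lines that occur only in the second sequence with "+ ", and those are the lines the loop keeps.

```python
import difflib

# Hypothetical stand-ins for file1 and file2 (lists of lines).
old_page = ['<h1>Site</h1>', '<p>First article</p>', '<div>Menu</div>']
new_page = ['<h1>Site</h1>', '<p>Second article</p>', '<div>Menu</div>']

# Keep only the lines that ndiff marks as present in the second page only,
# dropping the two-character "+ " prefix.
added = [d[2:] for d in difflib.ndiff(old_page, new_page) if d.startswith('+ ')]
print(added)
```

The shared header and menu lines are dropped, so only the content unique to the second page is left for tag cleaning.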
-
Try using regular expressions from the re module.
Example code:
import re
new_string = re.sub('<[^>]*>', '', your_string)
Example of running the code:
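A minimal sketch of the substitution on a hypothetical sample string (the HTML fragment is made up for illustration). It also shows one possible source of leftover tag garbage, which is only an assumption since the actual page is not shown: when the text is split into lines first, a tag that spans a line break is no longer matched by the pattern.

```python
import re

# Hypothetical sample fragment; the first tag spans a line break.
your_string = '<a href="/news"\n   class="link">Hello</a>, <b>world</b>!'

# Applied to the whole string, [^>]* also matches the newline inside the tag.
new_string = re.sub('<[^>]*>', '', your_string)
print(new_string)  # Hello, world!

# Applied line by line, the two halves of the split tag are not matched
# and stay in the output as garbage.
per_line = [re.sub('<[^>]*>', '', line) for line in your_string.split('\n')]
print(per_line)
```

If this is what happens on the real pages, stripping the tags before splitting the text into lines should avoid the leftovers.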