How do I extract article URLs from a page with BeautifulSoup?
-
I want to make a simple search engine and, as a first stage, collect data from a page that will later be searched. However, when I try to get a link to every news item on the page, I get an error (the debugger points at the hockey news item). The error is:
ConnectionError: HTTPConnectionPool(host='www.zrg74.ruhttp', port=80): Max retries exceeded with url: //zrg74.ru/sport/item object982-dorogoj-bolshoj-hokkej-v-zconnectdouste-namereny-sozrate
Here is a piece of the code. It has a function get_page_text() that fetches the page at the given URL:
...
response = requests.get(url, headers=headers, allow_redirects=True)
if response.status_code == 200:
    page_text = response.text
    return page_text
...
The code that processes the URLs:
soup = BeautifulSoup(page_text)
posts_list = soup.find_all('div', {'class': 'jeg_post_excerpt'})
for p in posts_list:
    lnk = p.find('a').attrs['href']
    title = re.sub('[^А-ЯЁа-яё0-9\s]', ' ', p.text)
    title = re.sub('\s\s+', ' ', title)
    page_url = 'http://www.zrg74.ru' + lnk
    clean_path = '/'.join([d for d in page_url.split('/')[2:] if len(d) > 0])

    page_text = get_page_text(page_url, USER_AGENT)
    if page_text is None:
        continue
    dir_path = 'data/raw_pages/' + '/'.join(clean_path.split('/')[:-1])
    makedirs(dir_path, exist_ok=True)
    with open(dir_path + '/' + clean_path.split('/')[-1] + '.html', 'w', encoding='utf-8') as f:
        f.write(page_text)
At this stage I need a result like this:
{'http://zrg74.ru/obshhestvo/item/26959-rabota-ne-dlja-galochki-zlatoustovec-povedal-o-njuansah-raboty-perepischika.html',
'http://zrg74.ru/obshhestvo/item/26954-vzjalis-vmeste-dve-semi-iz-zlatousta-prinjali-uchastie-v-oblastnom-festivale-dlja-zameshhajushhih-semej.html'}
Note: USER_AGENT is a string with the name of the browser we work with.
-
I don't know where exactly you take the URL from, but it is clearly being mangled somewhere. Use the standard library: https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse — it correctly parses a URL into its components and correctly joins components back into a valid URL.
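For example, a minimal sketch with urljoin (the base URL is taken from the question; the article paths are made up for illustration). urljoin leaves an already-absolute href alone, which is exactly the bug here: blindly prepending 'http://www.zrg74.ru' to an absolute href produces the broken host 'www.zrg74.ruhttp' from the error message.

```python
from urllib.parse import urljoin

base = 'http://www.zrg74.ru'

# href that is already absolute -- urljoin returns it unchanged
print(urljoin(base, 'http://zrg74.ru/sport/item/1.html'))
# -> http://zrg74.ru/sport/item/1.html

# relative href -- urljoin prepends the base correctly
print(urljoin(base, '/sport/item/1.html'))
# -> http://www.zrg74.ru/sport/item/1.html
```

So instead of `page_url = 'http://www.zrg74.ru' + lnk`, use `page_url = urljoin('http://www.zrg74.ru', lnk)` and both kinds of href work.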
Also check that .find actually found something and did not return None before you take the attribute. Alternatively, wrap those code fragments in try / except.
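A small sketch of that check (the HTML snippet is invented; the class name is taken from the question's code). An excerpt without an `<a>` tag is skipped instead of raising AttributeError:

```python
from bs4 import BeautifulSoup

# hypothetical page: one excerpt without a link, one with a link
html = ('<div class="jeg_post_excerpt"><p>no link here</p></div>'
        '<div class="jeg_post_excerpt"><a href="/item/1.html">post</a></div>')
soup = BeautifulSoup(html, 'html.parser')

links = []
for p in soup.find_all('div', {'class': 'jeg_post_excerpt'}):
    a = p.find('a')                        # may be None
    if a is None or 'href' not in a.attrs:
        continue                           # skip instead of crashing
    links.append(a['href'])

print(links)  # -> ['/item/1.html']
```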
Addendum: directories in the file system should be handled through os.path.split and os.path.join. That is, you split the URL with urllib, get the path and the file name, and then build the correct file-system path with os.path.join. Don't join or split your paths manually: urllib.parse and os.path have split and join functions that divide and combine paths much better.
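The two-step split described above can be sketched like this (the URL is hypothetical but follows the question's format, and 'data/raw_pages' is the directory from the question's code):

```python
import os
from urllib.parse import urlsplit

page_url = 'http://zrg74.ru/sport/item/26900-example.html'  # made-up article URL

# 1. urllib splits the URL: .path gives '/sport/item/26900-example.html'
url_path = urlsplit(page_url).path.lstrip('/')

# 2. os.path splits off the file name and rebuilds a file-system path
dir_part, file_name = os.path.split(url_path)   # ('sport/item', '26900-example.html')
dir_path = os.path.join('data', 'raw_pages', dir_part)
file_path = os.path.join(dir_path, file_name)

# os.makedirs(dir_path, exist_ok=True)  # create the tree before writing the file
print(file_path)  # on POSIX: data/raw_pages/sport/item/26900-example.html
```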