How do you extract the article links from a page with BeautifulSoup?



  • I want to build a simple search engine and, at the first stage, collect data from a page that will later be searched. However, when I try to get a link to each news item from the page, I get an error (the traceback points at the hockey news item). The error is:

    ConnectionError: HTTPConnectionPool(host='www.zrg74.ruhttp', port=80): Max retries exceeded with url: //zrg74.ru/sport/item object982-dorogoj-bolshoj-hokkej-v-zconnectdouste-namereny-sozrate

    Here is a fragment of the code. It has a function get_page_text() that fetches a page and looks like this:

    ...
    response = requests.get(url, headers=headers, allow_redirects=True)
    if response.status_code == 200:
        page_text = response.text
        return page_text
    ...
    

    The code that processes the URLs:

    soup = BeautifulSoup(page_text, 'html.parser')
    posts_list = soup.find_all('div', {'class': 'jeg_post_excerpt'})
    for p in posts_list:
        lnk = p.find('a').attrs['href']
        title = re.sub(r'[^А-ЯЁа-яё0-9\s]', ' ', p.text)
        title = re.sub(r'\s\s+', ' ', title)
        page_url = 'http://www.zrg74.ru' + lnk
        clean_path = '/'.join([d for d in page_url.split('/')[2:] if len(d) > 0])

        page_text = get_page_text(page_url, USER_AGENT)
        if page_text is None:
            continue
        dir_path = 'data/raw_pages/' + '/'.join(clean_path.split('/')[:-1])
        makedirs(dir_path, exist_ok=True)
        with open(dir_path + '/' + clean_path.split('/')[-1] + '.html', 'w', encoding='utf-8') as f:
            f.write(page_text)
    

    At this stage I need a result like this:

    {'http://zrg74.ru/obshhestvo/item/26959-rabota-ne-dlja-galochki-zlatoustovec-povedal-o-njuansah-raboty-perepischika.html',
    'http://zrg74.ru/obshhestvo/item/26954-vzjalis-vmeste-dve-semi-iz-zlatousta-prinjali-uchastie-v-oblastnom-festivale-dlja-zameshhajushhih-semej.html'}

    Note: USER_AGENT is a string with the name of the browser we identify ourselves as.



  • I can't say for sure, but your URLs seem to be a mess. Use the standard library: https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse It correctly splits a URL into its components and correctly assembles a proper site URL.
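A minimal sketch of what urljoin from that module does (the URLs here are only illustrative): it resolves a relative href against a base, but leaves an already absolute href alone, which is exactly the case that plain string concatenation gets wrong:

```python
from urllib.parse import urljoin

base = 'http://www.zrg74.ru'

# A relative href is resolved against the base host:
print(urljoin(base, '/sport/item/12345-example.html'))
# -> http://www.zrg74.ru/sport/item/12345-example.html

# An href that is already absolute is returned as-is,
# instead of being glued onto the base by string concatenation:
print(urljoin(base, 'http://zrg74.ru/sport/item/12345-example.html'))
# -> http://zrg74.ru/sport/item/12345-example.html
```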

    Also check that .find() actually found something and didn't return None before you read the attribute. Or wrap those code fragments in try/except.
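For example, the None check could look like this (a sketch; safe_post_link is a hypothetical helper name, and `post` stands for one entry of posts_list from the question):

```python
def safe_post_link(post):
    """Return the href of the first <a> inside `post`, or None.

    `post` is expected to be a BeautifulSoup Tag, e.g. one entry
    of posts_list from the question.
    """
    link_tag = post.find('a')
    if link_tag is None:           # .find() returns None when nothing matches
        return None
    return link_tag.get('href')    # .get() returns None instead of raising KeyError
```

In the loop you would then write `lnk = safe_post_link(p)` and `continue` when it is None, instead of letting an exception abort the whole crawl.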

    Supplement:

    Directories in the file system should be handled through os.path.split and os.path.join.

    That is, the approach is: with urllib you take the URL apart, split it, and get the path and the file name; then with os.path.join you build the correct path in the file system. Don't glue together or slice up your paths manually; use urllib.parse and os.path. They have split and join, which divide and combine paths much more reliably.
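Putting the two together (a sketch; local_path_for is a hypothetical helper name, and the data/raw_pages root is taken from the question's code):

```python
import os
from urllib.parse import urlsplit

def local_path_for(url, root='data/raw_pages'):
    # urlsplit takes the URL apart; .path is everything after the host.
    parts = [p for p in urlsplit(url).path.split('/') if p]  # drop empty segments
    # os.path.join reassembles the pieces with the right separator for the OS.
    return os.path.join(root, *parts)

print(local_path_for('http://zrg74.ru/sport/item/12345-example.html'))
# on POSIX: data/raw_pages/sport/item/12345-example.html
```

You can then pass `os.path.dirname(...)` of the result to `makedirs(..., exist_ok=True)` before writing the file, as in the question.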


