lexansk · Feb. 12, 2020 08:37:59

Good afternoon! I'm still learning, so please excuse the silly question.
In my parser script, the line

'brand' : item.find('div', class_='brand font-bold text-uppercase')

returns, besides the elements I need, two entries with the value None because of the site's messy markup. How can I drop them from the output so that I can call .get_text()?

import requests
from bs4 import BeautifulSoup
URL = 'https://www.yoox.com/ru/%D0%B4%D0%BB%D1%8F%20%D0%BC%D1%83%D0%B6%D1%87%D0%B8%D0%BD/%D0%BE%D0%B4%D0%B5%D0%B6%D0%B4%D0%B0/shoponline#/dept=clothingmen&gender=U&page=1&season=X'
HEADERS = {'user-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0', 'accept' : '*/*'}
def get_html(url, params = None):
    r = requests.get(url, headers = HEADERS, params = params)
    return r
def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_ = 'col-8-24')
    clothes = []
    for item in items:
        clothes.append({
            'brand' : item.find('div', class_='brand font-bold text-uppercase')
        })
    print(clothes)

Striver · Feb. 12, 2020 09:58:35

    for item in items:
        item_data = item.find('div', class_='brand font-bold text-uppercase')
        if item_data:
            clothes.append({
                'brand': item_data
            })

lexansk · Feb. 12, 2020 10:45:44

Striver

Thanks, but the output still contains two elements with the values {'brand': None}, {'brand': None}

Striver · Feb. 12, 2020 11:31:04

Thanks, but the output still contains two elements with the values {'brand': None}, {'brand': None}

Here is the full get_content function:

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_ = 'col-8-24')
    clothes = []
    for item in items:
        item_data = item.find('div', class_='brand font-bold text-uppercase')
        if item_data:
            clothes.append({
                'brand': item_data
            })
    print(clothes)

For me it prints the data without any None values.

lexansk · Feb. 18, 2020 12:12:28

Striver
item_dat

Thanks, that was my own silly mistake.

lexansk · Feb. 19, 2020 05:58:03

I had some free time and kept learning Python through practice. As a result, the parser has evolved into the following:

import requests
from bs4 import BeautifulSoup as bs
headers = {'accept': '*/*', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0'}
base_url = 'https://www.yoox.com/ru/%D0%B4%D0%BB%D1%8F%20%D0%BC%D1%83%D0%B6%D1%87%D0%B8%D0%BD/%D0%BE%D0%B4%D0%B5%D0%B6%D0%B4%D0%B0/shoponline#/dept=clothingmen&gender=U&season=X'
def yoox_parse(base_url, headers):
    session = requests.Session()
    request = session.get(base_url, headers=headers)
    clothes = []
    if request.status_code == 200:
        soup = bs(request.content, 'html.parser')
        divs = soup.find_all('div', attrs={'class': 'col-8-24'})
        for div in divs:
            brand = div.find('div', attrs={'class': 'brand font-bold text-uppercase'})
            group = div.find('div', attrs={'class': 'microcategory font-sans'})
            old_price = div.find('span', attrs={'class': 'oldprice text-linethrough text-light'})
            new_price = div.find('span', attrs={'class': 'newprice font-bold'})
            price = div.find('span', attrs={'class': 'fullprice font-bold'})
            size = div.find('span', attrs={'class': 'aSize'})
            href = div.find('div', attrs={'a': 'href'})
            art = div.find('div', attrs={'div': 'data-current-cod10'})
            print(art)
            if brand and group:
                clothes.append({
                    'art': art,
                    'href': href,
                    'size': size,
                    'brand': brand.get_text(),
                    'group': group.get_text(strip=True),
                    'old_price': old_price,
                    'new_price': new_price,
                    'price': price,
                })
        print(clothes)
    else:
        print('ERROR')
yoox_parse(base_url, headers)

The check if brand and group: let me filter out the elements caused by the messy markup. But new problems came up that I can't solve on my own yet.

- The price, old_price and new_price elements are not present for every item, so I can't call get_text() on them to strip the HTML tags. I also don't know how to strip the currency “руб” from the value so that I can do arithmetic with the data later.

- The size element can have several values, but only the first one gets stored. As I understand it, I need a for loop to iterate over the elements, but my experiments gave no results.

- For reasons that are a mystery to me, I can't collect the href and art elements.

I have no choice but to ask the collective mind for help again. Thanks in advance.
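
The href and art question is not answered later in the thread, so here is a hedged sketch of one possible cause. `attrs={'a': 'href'}` asks BeautifulSoup for a `div` that has an attribute named `a` with the value `href`, which is unlikely to exist. To read an attribute value, find the tag first and then index it by attribute name. The sample markup below is hypothetical (the real yoox.com structure was not shown here); only the class `col-8-24` and the attribute name `data-current-cod10` come from the posted code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page may be structured differently.
SAMPLE_HTML = """
<div class="col-8-24">
  <div class="itemContainer" data-current-cod10="12345678AB">
    <a class="itemlink" href="/ru/12345678AB/item">
      <div class="brand font-bold text-uppercase">ACME</div>
    </a>
  </div>
</div>
<div class="col-8-24"></div>
"""

def extract_link_and_code(card):
    """Return (href, art) for one product card, using None for missing parts."""
    # href=True matches any <a> tag that has an href attribute at all.
    link_tag = card.find('a', href=True)
    # A True value in attrs matches any <div> carrying that attribute.
    code_tag = card.find('div', attrs={'data-current-cod10': True})
    href = link_tag['href'] if link_tag else None
    art = code_tag['data-current-cod10'] if code_tag else None
    return href, art

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
results = [extract_link_and_code(card)
           for card in soup.find_all('div', class_='col-8-24')]
print(results)
```

The second, empty card yields `(None, None)` instead of raising an error, which is the same guard used for the prices.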

Striver · Feb. 19, 2020 08:09:52

The price, old_price and new_price elements are not present for every item, so I can't call get_text() on them to strip the HTML tags.

 'price': price.get_text() if price else None

I also don't know how to strip the currency “руб” from the value so that I can do arithmetic with the data later.

 'price': price.get_text().replace(' ', '').replace('руб', '') if price else None

Do the same for old_price and new_price.
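
A possible refinement, given as a sketch: the `replace` chain still leaves a string, and `replace(' ', '')` removes only ordinary spaces, so a non-breaking space used as a thousands separator would survive. The hypothetical helper `parse_price` below keeps only the digits and returns an `int`, assuming prices are whole rubles with no decimal part.

```python
import re

def parse_price(text):
    """Convert a price string such as '12 990 руб' to an int, or None if absent.

    Assumes whole-ruble prices: every non-digit character is discarded,
    so a decimal part would be merged into the number.
    """
    if not text:
        return None
    digits = re.sub(r'\D', '', text)  # drops spaces, non-breaking spaces, 'руб'
    return int(digits) if digits else None

old_price = parse_price('12 990 руб')
new_price = parse_price('7\xa0450 руб')   # \xa0 is a non-breaking space
missing = parse_price(None)               # the tag was not found on the page

if old_price is not None and new_price is not None:
    print(old_price - new_price)          # arithmetic now works on numbers
```

In the parser it could be called as `parse_price(price.get_text()) if price else None`.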

Edited by Striver (Feb. 19, 2020 08:20:03)

Striver · Feb. 19, 2020 08:17:07

The size element can have several values, but only the first one gets stored. As I understand it, I need a for loop to iterate over the elements, but my experiments gave no results.

...
sizes = div.find_all('span', attrs={'class': 'aSize'})
...
    'sizes': [size.get_text() for size in sizes],
...

lexansk · Feb. 19, 2020 09:08:39

Striver

Thank you very much. That helped a lot, I didn't know that syntax.
I've already added pagination. What's left is to sort out the product links and article numbers, and then I'll think about how to load all of this into SQL instead of a file.
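
For the SQL idea, a minimal sketch using the standard-library `sqlite3` module. The table name `clothes`, the column names and the sample row are assumptions made for illustration; the column is called `category` because `group` is a reserved word in SQL.

```python
import sqlite3

# Sample row shaped like the dictionaries built by the parser (values are made up).
rows = [{'art': 'A1', 'brand': 'ACME', 'group': 'Jeans',
         'new_price': 7450, 'sizes': ['48', '50']}]

def save_items(conn, items):
    """Create the table if needed and insert the parsed items."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clothes ("
        "art TEXT, brand TEXT, category TEXT, new_price INTEGER, sizes TEXT)"
    )
    # Placeholders (?) let sqlite3 handle quoting; sizes are stored as one string.
    conn.executemany(
        "INSERT INTO clothes (art, brand, category, new_price, sizes) "
        "VALUES (?, ?, ?, ?, ?)",
        [(i['art'], i['brand'], i['group'], i['new_price'], ','.join(i['sizes']))
         for i in items],
    )
    conn.commit()

conn = sqlite3.connect(':memory:')  # use a file path to keep the data on disk
save_items(conn, rows)
stored = conn.execute("SELECT brand, new_price, sizes FROM clothes").fetchall()
print(stored)
```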

lexansk · Feb. 19, 2020 09:33:14

The code now looks like this:

import requests
import csv
from bs4 import BeautifulSoup as bs
headers = {'accept': '*/*', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0'}
base_url = 'https://www.yoox.com/ru/%D0%B4%D0%BB%D1%8F%20%D0%BC%D1%83%D0%B6%D1%87%D0%B8%D0%BD/%D0%BE%D0%B4%D0%B5%D0%B6%D0%B4%D0%B0/shoponline#/dept=clothingmen&gender=U&season=X&page=1'
def yoox_parse(base_url, headers):
    session = requests.Session()
    request = session.get(base_url, headers=headers)
    clothes = []
    urls = []
    urls.append(base_url)
    if request.status_code == 200:
        soup = bs(request.content, 'html.parser')
        try:
            pagination = soup.find_all('li', attrs={'class': 'text-light'})
            count = int(pagination[-1].text)
            for i in range(1,count):
            #for i in range(1, 3):
                url = f'https://www.yoox.com/ru/%D0%B4%D0%BB%D1%8F%20%D0%BC%D1%83%D0%B6%D1%87%D0%B8%D0%BD/%D0%BE%D0%B4%D0%B5%D0%B6%D0%B4%D0%B0/shoponline#/dept=clothingmen&gender=U&season=X&page={i}'
                if url not in urls:
                    urls.append(url)
            
        except:
            pass
    for url in urls:
        request = session.get(url, headers=headers)
        soup = bs(request.content, 'html.parser')
        divs = soup.find_all('div', attrs={'class': 'col-8-24'})
        for div in divs:
            brand = div.find('div', attrs={'class': 'brand font-bold text-uppercase'})
            group = div.find('div', attrs={'class': 'microcategory font-sans'})
            old_price = div.find('span', attrs={'class': 'oldprice text-linethrough text-light'})
            new_price = div.find('span', attrs={'class': 'newprice font-bold'})
            price = div.find('span', attrs={'class': 'fullprice font-bold'})
            sizes = div.find_all('span', attrs={'class': 'aSize'})
            href = div.find('div', attrs={'a': 'href'})
            art = div.find('div', attrs={'div': 'data-current-cod10'})
            if brand and group and new_price: # requiring new_price keeps only discounted items
                clothes.append({
                    'art': art,
                    'href': href,
                    'sizes': [size.get_text() for size in sizes],
                    'brand': brand.get_text(),
                    'group': group.get_text(strip=True),
                    'old_price': old_price.get_text().replace(' ', '').replace('руб', '') if old_price else None,
                    'new_price': new_price.get_text().replace(' ', '').replace('руб', '') if new_price else None,
                    'price': price.get_text().replace(' ', '').replace('руб', '') if price else None
                })
        print(len(clothes))
    else:
        print('ERROR or Done')
    return clothes
def files_writer(clothes):
    with open('parsed_yoox_man_clotes.csv', 'w') as file:
        a_pen = csv.writer(file)
        a_pen.writerow(('Артикул', 'Ссылка', 'Размер', 'Марка', 'Категория', 'Старая цена', 'Новая цена', 'Цена'))
        for clothe in clothes:
            a_pen.writerow((clothe['art'], clothe['href'], clothe['sizes'], clothe['brand'], clothe['group'], clothe['old_price'], clothe['new_price'], clothe['price']))
clothes = yoox_parse(base_url, headers)
files_writer(clothes)

Pagination works correctly and returns the right URLs. But when I ran a real test, I realized that items are collected only from the first page, repeated in a loop as many times as there are pages detected in the code. I've stared at it until my eyes hurt and still can't see where the error is.
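
One likely cause, offered as a sketch rather than a confirmed diagnosis: everything after `#` in a URL is a fragment, which is handled on the client side and is not sent to the server in the HTTP request. In the URLs above, `page={i}` sits inside the fragment, so every request asks the server for the same address. The snippet below shows this with the standard library only; the shortened path in `TEMPLATE` is a stand-in for the real one.

```python
from urllib.parse import urldefrag

# Path shortened for readability; the fragment mirrors the one in the thread.
TEMPLATE = ('https://www.yoox.com/ru/shoponline'
            '#/dept=clothingmen&gender=U&season=X&page={page}')

urls = [TEMPLATE.format(page=i) for i in range(1, 4)]

# urldefrag splits a URL into the part that is requested and the fragment.
requested = {urldefrag(url).url for url in urls}
fragments = {urldefrag(url).fragment for url in urls}

print(requested)       # one address shared by all three "pages"
print(len(fragments))  # the page number differs only inside the fragment
```

How the site really exposes pagination (a query parameter, a separate API) was not established in this thread and would need to be checked in the browser's network tab. Two smaller points in the posted code: `range(1, count)` stops at `count - 1`, so the last page is skipped, and the `else` attached to the `for` loop runs whenever the loop ends without `break`, so 'ERROR or Done' is printed every time.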

#1 Feb. 12, 2020 08:37:59

Remove None values when parsing a website
