Latest posts on the topic "Writing scraped data to a database"
http://python.su/forum/topic/35106/
2018-04-18T12:12:02+03:00
General :: Data Mining :: Writing scraped data to a database
2018-04-18T12:12:02+03:00 Santik 191361<blockquote><em>ZerG</em><br/>In your loop the data is generated and only afterwards written to the table, which of course overwrites the previous data, so you always end up with only the last row.<br/>Move the lines responsible for writing to the database into the for loop.</blockquote>Thank you very much for the answer, but moving the database-write lines into the for loop would not have helped: they would still run on every pass, and I would get +4 rows in the database instead of 1. During the night it finally dawned on me, though: I removed the goodcontent function and moved a slightly modified version of it into maincontent :)<br/>Thanks again!<br/>upd: I also created a separate function for writing the data to the database :)
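A minimal sketch of the approach described in this reply: collect all parameters of one listing first, then write a single row from a separate function. The helper name `save_house` and the sample values are hypothetical and are not the poster's actual code. An in-memory `sqlite3` database stands in for the MySQL `parser` database so the snippet runs without a server; with `pymysql` the placeholders would be `%s` instead of `?`, as in the original `INSERT`.

```python
import sqlite3

# Column names taken from the INSERT statement in the original question.
COLUMNS = ("floor", "room", "housefloor", "typeofwalls", "squarehouse")


def save_house(conn, params):
    """Insert exactly one row for one listing; missing keys become NULL."""
    values = tuple(params.get(col) for col in COLUMNS)
    conn.execute(
        "INSERT INTO appartaments "
        "(floor, room, housefloor, typeofwalls, squarehouse) "
        "VALUES (?, ?, ?, ?, ?)",
        values,
    )
    conn.commit()


# Assumption: an in-memory SQLite table replaces the MySQL table for this demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE appartaments "
    "(floor TEXT, room TEXT, housefloor TEXT, typeofwalls TEXT, squarehouse TEXT)"
)

# Made-up sample data: every parameter of the listing is gathered first
# (the floor is not specified for this one) ...
house = {"room": "1", "housefloor": "9", "typeofwalls": "panel", "squarehouse": "43"}
# ... and only then written to the table, once.
save_house(conn, house)

rows = conn.execute("SELECT * FROM appartaments").fetchall()
print(rows)
```

Keeping the write in its own function means it is called once per listing, no matter how many parameters the parsing loop has found.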
2018-04-18T12:02:48+03:00 ZerG 191359<br/>In your loop the data is generated and only afterwards written to the table, which of course overwrites the previous data, so you always end up with only the last row.<br/>Move the lines responsible for writing to the database into the for loop.<br/>
2018-04-17T15:07:19+03:00 Santik 191345<br/>Hello, dear Pythonistas!<br/>I have been struggling for two days now with loading scraped data into a database.<br/>My code is below (I am a beginner). As a result of this code, the table is filled incrementally, one iteration at a time (i.e. first a row with 1 value filled in, then a row with 2, and so on). I do not understand how to fix this; I have tried many variants, and adding the class was already close to an act of desperation.<br/>I am interested in two things:<br/>1) Am I right that I chose the wrong algorithm from the start and that my structure is inconvenient in principle? If so, in which direction should I dig? :(<br/>2) Is it actually possible to get the desired result with my code, and if so, how?<br/>P.S. The extra libraries that are imported are not actually superfluous; the code that uses them is commented out for now, and I left it out here so it would not get in the way. For the example, the code parses only one page, to keep things clear.<br/>P.P.S. Another catch is that each URL lists its own set of data (the floor may or may not be specified).<br/>Thank you for your attention!<br/><div class="code"><pre>
import requests, time, re, pymysql
from bs4 import BeautifulSoup
from selenium import webdriver
from urllib.request import urlretrieve
import subprocess

url = 'https://www.avito.ru/chelyabinsk/kvartiry/prodam'
allpages = []
# main program flow: get the list of links
r = requests.get(url)
response = r.content
soup = BeautifulSoup(response, 'html.parser')
houses = soup.find('div', class_='item-highlight')
ahref = houses.find('a').get('href')
listofparams = []
houses = soup.find_all('div', class_='item-highlight')


class House:
    def __init__(self, floor=None, rooms=None, floorhouse=None, typeofhouse=None, allsquare=None):
        self.floor = floor
        self.rooms = rooms
        self.floorhouse = floorhouse
        self.typeofhouse = typeofhouse
        self.allsquare = allsquare

    def __str__(self):
        print('we have a '+ str(self.floor))


a = House()


def goodcontent(objReturn, i=0):
    global a
    for letter in objReturn:
        if letter == ':':
            category = objReturn[1:i]
            objReturn = objReturn[i+2:]
        else:
            i += 1
            continue
        if category == 'Этаж':  # "Floor"
            a.floor = objReturn
            # floor = objReturn
        elif category.startswith('Количество'):  # "Number (of rooms)"
            a.rooms = objReturn[0]
            # rooms = objReturn[0]
        elif category.startswith('Этажей'):  # "Floors (in the building)"
            a.floorhouse = objReturn
            # floorhouse = objReturn
        elif category.startswith('Тип'):  # "Type (of building)"
            a.typeofhouse = objReturn
            # typeofhouse = objReturn
        elif category.startswith('Общая'):  # "Total (area)"
            a.allsquare = objReturn
            # allsquare = objReturn
    a = House(a.floor, a.rooms, a.floorhouse, a.typeofhouse, a.allsquare)
    i = 0
    conn = pymysql.connect(host='127.0.0.1', user='root', password=None, db='parser', charset='utf8mb4')
    cur = conn.cursor()
    cur.execute('USE parser')
    cur.execute('INSERT INTO appartaments (floor, room, housefloor, typeofwalls, squarehouse) VALUES (%s, %s, %s, %s, %s)', (a.floor, a.rooms, a.floorhouse, a.typeofhouse, a.allsquare))
    conn.commit()
    cur.close()
    conn.close()


def maincontent(ahref, i=0):
    global titleAnouncement
    r = requests.get('https://www.avito.ru/chelyabinsk/kvartiry/1-k_kvartira_43_m_79_et._139234913')
    respDescr = r.content
    soup = BeautifulSoup(respDescr, 'html.parser')
    titleAnouncement = soup.find('span', class_='title-info-title-text').get_text()
    print(titleAnouncement)
    countRoom = soup.find('li', class_='item-params-list-item')
    countRoomReturn = countRoom.get_text()
    countRoomReturn = goodcontent(countRoomReturn)
    listofparams.append(countRoomReturn)
    while i == 0:
        try:
            countRoom = countRoom.findNextSibling()
            countRoomReturn = countRoom.get_text()
            countRoomReturn = goodcontent(countRoomReturn)
        except:
            print('No more elements')
            break


maincontent('dw')
</pre></div>
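One possible direction for the parsing step, offered only as a sketch. Each parameter in the code above arrives as text of the form "label: value", so `str.partition(':')` can split it without a manual character loop, and a dict can hold whichever parameters a given listing has. The names `parse_params` and `PREFIX_TO_FIELD` and the sample strings are assumptions made for illustration; the real page text may differ.

```python
# Label prefixes from the original code, mapped to the House attribute names.
# 'Этажей' ("floors in the building") is listed before 'Этаж' ("floor")
# because the longer label also starts with the shorter one.
PREFIX_TO_FIELD = {
    "Этажей": "floorhouse",
    "Этаж": "floor",
    "Количество": "rooms",
    "Тип": "typeofhouse",
    "Общая": "allsquare",
}


def parse_params(lines):
    """Turn 'label: value' strings into a dict keyed by field name."""
    result = {}
    for line in lines:
        label, sep, value = line.strip().partition(":")
        if not sep:
            continue  # no colon, so this is not a parameter line
        for prefix, field in PREFIX_TO_FIELD.items():
            if label.strip().startswith(prefix):
                result[field] = value.strip()
                break
    return result


# Hypothetical sample text; this listing does not specify the floor.
sample = [" Количество комнат: 1-комнатные ", "Этажей в доме: 9", "Общая площадь: 43"]
params = parse_params(sample)
print(params)
```

A parameter that is absent from the page is simply missing from the dict, so `params.get('floor')` gives `None` and the record can be written as one row once every line has been processed.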