Full version: Scrapy. Correct way to parse product attributes table with attribute groups and save result to 4 MySQL tables
mikhaild
I need advice: what is the correct way to parse a product attributes HTML table with attribute groups and save the results to 4 MySQL tables: attribute, attribute_description, attribute_group, attribute_group_description?

The attribute group names and attribute names are not known in advance and differ from product to product. The number of attribute groups in the HTML table is also unknown, but we can count it with:

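# Count only the group header cells: attribute names also sit in <th> cells.
# XPath count() returns a float, so the extracted string looks like '3.0'.
product_attribute_group_number = int(float(response.xpath('count(//th[@class="tech-specs-category"])').extract()[0]))
print('###product_attribute_group_number###', product_attribute_group_number)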

We can loop over every attribute group with:

# XPath matching the header row of every attribute group.
header = '//tr[th[@class="tech-specs-category"]]'
# Visits groups 1..n-1 only: the selector below needs a following group
# header, so it cannot match the rows of the last group.
for x in range(1, product_attribute_group_number):
    group_row = response.xpath('(%s)[%d]' % (header, x))
    product_attribute_group_name = group_row.xpath('th/text()').extract_first()
    print('###product_attribute_group_name###', product_attribute_group_name)
    # Rows that follow header x and also precede header x+1.
    product_attributes = response.xpath(
        '(%s)[%d]/following-sibling::tr'
        '[count(.|(%s)[%d]/preceding-sibling::tr)=count((%s)[%d]/preceding-sibling::tr)]'
        % (header, x, header, x + 1, header, x + 1))
    item = {}
    for prop_row in product_attributes:
        try:
            prop = prop_row.xpath('th/text()').extract()[0].strip()
            val = prop_row.xpath('td/text()').extract()[0].strip()
        except IndexError:
            continue  # the row lacks a name or a value cell, so skip it
        item[prop] = val
    yield item

Is this the correct approach, and is the XPath selector correct? Next question: what is the correct XPath selector for the last attribute group? (No group header row follows it, so there is no following-sibling::tr to stop at.) Are there more elegant ways to parse an HTML table whose product attributes are grouped into attribute groups?

Table example:
Operating System (attribute group name)
OS (attribute name) Windows 8 (attribute value)
OS Language (attribute name) English (attribute value)
Audio (attribute group name)
Speakers (attribute name) Stereo Speakers (attribute value)
Mic In (attribute name) Yes (attribute value)
Headphone (attribute name) Yes (attribute value)
Battery (attribute group name)
Battery Type (attribute name) 4 Cell Li-ion (attribute value)
Battery life (attribute name) 41 WHr (attribute value)

<div class="parameters-wrapper">
  <table class="techSpecs">
    <tr>
      <th class="tech-specs-category" colspan="2">Operating System:</th>
    </tr>
    <tr>
      <th>OS</th>
      <td>Windows 8</td>
    </tr>
    <tr>
      <th>OS Language</th>
      <td>English</td>
    </tr>
    <tr>
      <th class="tech-specs-category" colspan="2">Audio:</th>
    </tr>
    <tr>
      <th>Speakers</th>
      <td>Stereo Speakers</td>
    </tr>
    <tr>
      <th>Mic In</th>
      <td>Yes</td>
    </tr>
    <tr>
      <th>Headphone</th>
      <td>Yes</td>
    </tr>
    <tr>
      <th class="tech-specs-category" colspan="2">Battery:</th>
    </tr>
    <tr>
      <th>Battery Type</th>
      <td>4 Cell Li-ion</td>
    </tr>
    <tr>
      <th>Battery life</th>
      <td>41 WHr</td>
    </tr>
  </table>
</div>
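
One possible alternative for a table shaped like the sample above is a single pass over the rows: a header row opens a group, and every later row belongs to that group until the next header, so the last group needs no special selector. This is only a sketch. The names group_attributes and parse_specs are hypothetical, and it assumes the group headers carry the tech-specs-category class as in the sample.

```python
from collections import OrderedDict


def group_attributes(rows):
    """Group (is_group_header, name, value) tuples into {group: {name: value}}.

    A header row starts a new group; each following row is added to that
    group until the next header appears.
    """
    groups = OrderedDict()
    current = None
    for is_group_header, name, value in rows:
        if is_group_header:
            # "Audio:" -> "Audio"; reuse the group if the name repeats.
            current = groups.setdefault(name.strip().rstrip(':'), OrderedDict())
        elif current is not None and name.strip():
            current[name.strip()] = value.strip()
    return groups


def parse_specs(response):
    """Hypothetical helper for a Scrapy callback: read each row once."""
    rows = []
    for tr in response.xpath('//table[@class="techSpecs"]//tr'):
        # A non-empty selector list means this row is a group header.
        is_header = bool(tr.xpath('th[@class="tech-specs-category"]'))
        name = tr.xpath('string(th)').extract_first(default='')
        value = tr.xpath('string(td)').extract_first(default='')
        rows.append((is_header, name, value))
    return group_attributes(rows)
```

Keeping the grouping logic separate from the selectors means it can be checked with plain tuples, without a live response.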
lorien
I am afraid you've chosen the wrong place to ask questions in English :) This is a Russian board. Try the official mailing list of the Scrapy framework.
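
As for saving into the four tables named in the question, a rough get-or-create sketch might look like the following. Everything about the schema here is an assumption: the thread gives only the table names, so the column names are made up for illustration, and the stdlib sqlite3 module stands in for MySQL so that the example runs on its own. With a MySQL driver the placeholder would normally be %s rather than ?, and the statement that inserts an empty row would need MySQL syntax. The input is assumed to be a mapping of group name to a mapping of attribute name to value.

```python
import sqlite3

# Assumed minimal schema; only the table names come from the question.
SCHEMA = """
CREATE TABLE attribute_group (attribute_group_id INTEGER PRIMARY KEY);
CREATE TABLE attribute_group_description (attribute_group_id INTEGER, name TEXT);
CREATE TABLE attribute (attribute_id INTEGER PRIMARY KEY, attribute_group_id INTEGER);
CREATE TABLE attribute_description (attribute_id INTEGER, name TEXT);
"""


def get_or_create_group(cur, name):
    """Return the id of the group called `name`, inserting it if missing."""
    cur.execute(
        "SELECT attribute_group_id FROM attribute_group_description WHERE name = ?",
        (name,))
    row = cur.fetchone()
    if row:
        return row[0]
    cur.execute("INSERT INTO attribute_group DEFAULT VALUES")
    group_id = cur.lastrowid
    cur.execute(
        "INSERT INTO attribute_group_description (attribute_group_id, name) VALUES (?, ?)",
        (group_id, name))
    return group_id


def get_or_create_attribute(cur, group_id, name):
    """Return the id of attribute `name` inside the group, inserting if missing."""
    cur.execute(
        "SELECT a.attribute_id FROM attribute a "
        "JOIN attribute_description d ON d.attribute_id = a.attribute_id "
        "WHERE a.attribute_group_id = ? AND d.name = ?",
        (group_id, name))
    row = cur.fetchone()
    if row:
        return row[0]
    cur.execute("INSERT INTO attribute (attribute_group_id) VALUES (?)", (group_id,))
    attribute_id = cur.lastrowid
    cur.execute(
        "INSERT INTO attribute_description (attribute_id, name) VALUES (?, ?)",
        (attribute_id, name))
    return attribute_id


def save_groups(conn, groups):
    """Store every group and attribute name once; values are not stored here."""
    cur = conn.cursor()
    for group_name, attributes in groups.items():
        group_id = get_or_create_group(cur, group_name)
        for attribute_name in attributes:
            get_or_create_attribute(cur, group_id, attribute_name)
    conn.commit()
```

Looking names up before inserting keeps repeated crawls from duplicating groups or attributes that appear on many products. In a Scrapy project this kind of code would usually live in an item pipeline rather than in the spider.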