Product attribute group names and attribute names are unknown and differ from product to product. The number of attribute groups in the HTML table is also unknown, but we can count them with:
# count() comes back as a float string such as '3.0', so convert via float before int
product_attribute_group_number = int(float(response.xpath('count(//th)').extract()[0]))
print('###product_attribute_group_number### %d' % product_attribute_group_number)
We can loop over every attribute group with:
for x in range(1, product_attribute_group_number):
    # Rows between the x-th and (x+1)-th header rows, plus the x-th header row itself
    for sel in response.xpath('//tr[th][%d]/following-sibling::tr[count(.|//tr[th][%d]/preceding-sibling::tr)=count(//tr[th][%d]/preceding-sibling::tr)]|//tr[th][%d]' % (x, x + 1, x + 1, x)):
        product_attribute_group_name = sel.xpath('th/text()').extract()
        print('###product_attribute_group_name### %s' % product_attribute_group_name)
        item = {}
        # product_attributes is not defined in this snippet; it stands for the rows of the current group
        for prop_row in product_attributes:
            try:
                prop = prop_row.xpath('th/text()').extract()[0]
            except IndexError as e:
                print(e)  # or pass: do nothing, just ignore that row
                continue
            prop = prop.strip()
            try:
                val = prop_row.xpath('td/text()').extract()[0]
            except IndexError as e:
                print(e)  # or pass: do nothing, just ignore that row
                continue
            val = val.strip()
            item[prop] = val
        yield item
Is this the correct approach, with the correct XPath selector? Next question: what is the correct XPath selector for the last attribute group? (It has no following-sibling::tr group header after it.) Are there more elegant ways to parse an HTML table whose product attributes are grouped into attribute groups?
Table example:
Operating System
    OS (attribute name): Windows 8 (attribute value)
    OS Language (attribute name): English (attribute value)
Audio
    Speakers (attribute name): Stereo Speakers (attribute value)
    Mic In (attribute name): Yes (attribute value)
    Headphone (attribute name): Yes (attribute value)
Battery
    Battery Type (attribute name): 4 Cell Li-ion (attribute value)
    Battery life (attribute name): 41 WHr (attribute value)
<div class="parameters-wrapper">
<table class="techSpecs">
<tr>
<th class="tech-specs-category" colspan="2">Operating System:</th>
</tr>
<tr>
<th>OS</th>
<td>Windows 8</td>
</tr>
<tr>
<th>OS Language</th>
<td>English</td>
</tr>
<tr>
<th class="tech-specs-category" colspan="2">Audio:</th>
</tr>
<tr>
<th>Speakers</th>
<td>Stereo Speakers</td>
</tr>
<tr>
<th>Mic In</th>
<td>Yes</td>
</tr>
<tr>
<th>Headphone</th>
<td>Yes</td>
</tr>
<tr>
<th class="tech-specs-category" colspan="2">Battery:</th>
</tr>
<tr>
<th>Battery Type</th>
<td>4 Cell Li-ion</td>
</tr>
<tr>
<th>Battery life</th>
<td>41 WHr</td>
</tr>
</table>
</div>
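
One possible simpler direction, sketched here under assumptions rather than offered as the definitive answer: instead of selecting the rows between two header rows with XPath, walk all rows once in document order and remember the most recent group header. The last group then needs no special case, because nothing depends on a following header row. The sketch below uses only the standard library's xml.etree.ElementTree in place of Scrapy selectors, so it relies on the markup being well-formed like the sample above. The names SAMPLE and group_attributes are hypothetical, and treating any row without a td as a group header is an assumption based on the sample table.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample table from the question (assumed well-formed).
SAMPLE = """
<div class="parameters-wrapper">
<table class="techSpecs">
<tr><th class="tech-specs-category" colspan="2">Operating System:</th></tr>
<tr><th>OS</th><td>Windows 8</td></tr>
<tr><th>OS Language</th><td>English</td></tr>
<tr><th class="tech-specs-category" colspan="2">Battery:</th></tr>
<tr><th>Battery Type</th><td>4 Cell Li-ion</td></tr>
<tr><th>Battery life</th><td>41 WHr</td></tr>
</table>
</div>
"""


def group_attributes(markup):
    """Hypothetical helper: map each group name to its {attribute: value} dict."""
    root = ET.fromstring(markup.strip())
    groups = {}
    current = None  # dict of the group currently being filled
    for row in root.iter("tr"):  # rows are visited in document order
        th = row.find("th")
        td = row.find("td")
        if th is None:
            continue  # not a header row and not an attribute row
        name = (th.text or "").strip()
        if td is None:
            # Assumption: a row with a th but no td starts a new group.
            # The trailing colon of the header text is dropped.
            current = groups.setdefault(name.rstrip(":"), {})
        elif current is not None:
            current[name] = (td.text or "").strip()
    return groups


specs = group_attributes(SAMPLE)
for group_name, attributes in specs.items():
    print(group_name, attributes)
```

Inside a Scrapy callback the same single pass could presumably be run over response.xpath('//table[@class="techSpecs"]//tr'), testing each row with row.xpath('th') and row.xpath('td') instead of find().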