>>> ar = []
>>> ar.append(r.findall(url))
>>> ar
[['http://www.www.site.local']]
>>> ar.append(r.findall(url))
>>> ar
[['http://www.www.site.local'], ['http://www.www.site.local']]
Egor2010, it is better to process pages with lxml.html.
And if this is being done with re, then why is html5lib needed?
>>> import lxml.html
>>> doc = lxml.html.fromstring('<a href="http://www.domain.com">text</a>')
>>> lst = doc.xpath(r'//a/@href')
>>> lst
['http://www.domain.com']
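If lxml is not installed, the same idea (parse the HTML instead of matching it with re) can be sketched with the standard library. This is only an illustration using html.parser, not lxml; the class name HrefCollector is made up for the example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


collector = HrefCollector()
collector.feed('<a href="http://www.domain.com">text</a>')
collector.close()
print(collector.hrefs)
```

Unlike the regular expression, this only looks at href inside <a> tags, so href attributes of <link> and other tags are ignored.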
sypper-pit
>>> ar[0]
>>> ar[1]
>>> import re
>>> url='<a href="http://www.site2.local" target="_blank">http://www.site2.local</a><a href="http://www.site1.local" target="_blank">http://www.site1.local</a>'
>>> r = re.compile('(?<=href=").*?(?=")')
>>> ar=[]
>>> ar.append(r.findall(url))
>>> ar
[['http://www.site2.local', 'http://www.site1.local']]
>>> ar[0]
['http://www.site2.local', 'http://www.site1.local']
>>> ar[1]
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    ar[1]
IndexError: list index out of range
ar[0]
>>> ar
[['http://www.site2.local', 'http://www.site1.local']]
>>> ar[0][0]
'http://www.site2.local'
>>> ar[0][1]
'http://www.site1.local'
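The nesting comes from append, which stores the whole list returned by findall as a single element, so ar[1] does not exist. A minimal sketch, assuming a flat list of links is what is wanted: extend adds the found items one by one. The names html, href_re and links are just for the example.

```python
import re

html = ('<a href="http://www.site2.local" target="_blank">http://www.site2.local</a>'
        '<a href="http://www.site1.local" target="_blank">http://www.site1.local</a>')

# Same pattern as in the thread: everything between href=" and the next quote
href_re = re.compile(r'(?<=href=").*?(?=")')

links = []
links.extend(href_re.findall(html))  # extend, not append: no nested list

print(links[0])
print(links[1])
```

With extend, links[0] and links[1] are the two URLs themselves, and calling extend again for another page keeps the list flat.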
sypper-pit, could you tell me how to select only the "normal" URLs (the ones that can actually be followed to another page)?
http://stackoverflow.com/questions/499345/regular-expression-to-extract-url-from-an-html-link
Extract them by analogy with that.
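One possible reading of "normal" is: an http or https link with a host, after relative links are resolved against the page address. A rough sketch under that assumption, using urllib.parse from the standard library; the function name followable_links and the sample hrefs are invented for the example.

```python
from urllib.parse import urljoin, urlsplit


def followable_links(hrefs, base_url):
    """Keep only hrefs that resolve to an http(s) URL with a host."""
    result = []
    for href in hrefs:
        href = href.strip()
        # Skip empty values and in-page anchors such as "#top"
        if not href or href.startswith('#'):
            continue
        # Resolve relative links against the address of the page
        absolute = urljoin(base_url, href)
        parts = urlsplit(absolute)
        # mailto:, javascript: and similar schemes are dropped here
        if parts.scheme in ('http', 'https') and parts.netloc:
            result.append(absolute)
    return result


hrefs = ['http://www.site2.local', '/about', '#top',
         'mailto:admin@site.local', 'javascript:void(0)']
print(followable_links(hrefs, 'http://www.site1.local/index.html'))
# ['http://www.site2.local', 'http://www.site1.local/about']
```

This only checks the form of the URL. Whether the page actually opens can only be found out by requesting it.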