The task: scrape the data from http://b2binform.ru/result/?c=203
and save it to CSV.
As you can guess, each salon has to be opened, its contact details read, and the result saved to a separate file.
Scrapy was chosen for the job:
https://scrapy.org/
The how-to came from here, which is why the old modules are used:
http://gis-lab.info/qa/scrapy.html
Here is the spider code itself:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader.processor import TakeFirst
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector

from b2b.items import OrphanageItem


class OrphanLoader(XPathItemLoader):
    # This definition was apparently lost when pasting; the gis-lab
    # how-to defines the loader this way: keep only the first value
    # each XPath expression yields.
    default_output_processor = TakeFirst()


class b2bSpider(CrawlSpider):
    name = 'b2b'
    allowed_domains = ['b2binform.ru']
    start_urls = ['http://b2binform.ru/result/?c=203/']

    rules = (
        # Follow pagination links (no callback, so follow=True by default).
        Rule(SgmlLinkExtractor(allow=('result?c=203&page',)), follow=True),
        # Parse individual salon pages.
        Rule(SgmlLinkExtractor(allow=('c/.html',)), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        l = OrphanLoader(OrphanageItem(), hxs)
        l.add_xpath('nazva', ".//*[@id='tabs-1']/div[4]/div[1]")
        l.add_xpath('adres', ".//*[@id='tabs-1']/div[4]/div[2]")
        l.add_xpath('email', ".//*[@id='tabs-1']/div[4]/div[3]")
        l.add_xpath('tel', ".//*[@id='tabs-1']/div[4]/div[4]")
        l.add_xpath('website', ".//*[@id='tabs-1']/div[4]/div[5]")
        l.add_xpath('razdel', ".//*[@id='tabs-1']/div[4]/div[6]")
        l.add_xpath('dodat2', ".//*[@id='tabs-1']/div[4]/div[7]")
        l.add_xpath('dodat3', ".//*[@id='tabs-1']/div[4]/div[8]")
        l.add_value('url', response.url)
        return l.load_item()
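As an aside, the "save to a separate file" part does not have to go through Scrapy's feed exporter at all. Here is a minimal stdlib-only sketch; the field names are taken from the loader above, while the `save_items` helper, the sample item, and the `salons.csv` filename are purely illustrative and not part of the spider:

```python
import csv

# Column order matches the fields filled in parse_item above.
FIELDS = ["nazva", "adres", "email", "tel", "website",
          "razdel", "dodat2", "dodat3", "url"]

def save_items(items, path="salons.csv"):
    """Write scraped item dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for item in items:
            # Missing fields become empty cells rather than raising.
            writer.writerow({k: item.get(k, "") for k in FIELDS})

# Hypothetical item, just to show the resulting file layout.
save_items([{"nazva": "Test salon", "url": "http://b2binform.ru/c/1.html"}])
```

With `-o ... -t csv` the feed exporter does essentially the same thing, so this is only useful if you want one file per salon or custom column ordering.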
The rest is plain Python; here is the console output of the run:
scrapy crawl b2b -o scarped_data_utf8.csv -t csv
/home/bozon/b2b/b2b/spiders/b2b_spider.py:1: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
from scrapy.contrib.spiders import CrawlSpider, Rule
/home/bozon/b2b/b2b/spiders/b2b_spider.py:2: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
/home/bozon/b2b/b2b/spiders/b2b_spider.py:2: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.sgml` is deprecated, use `scrapy.linkextractors.sgml` instead
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
/home/bozon/b2b/b2b/spiders/b2b_spider.py:3: ScrapyDeprecationWarning: Module `scrapy.contrib.loader` is deprecated, use `scrapy.loader` instead
from scrapy.contrib.loader.processor import TakeFirst
/home/bozon/b2b/b2b/spiders/b2b_spider.py:3: ScrapyDeprecationWarning: Module `scrapy.contrib.loader.processor` is deprecated, use `scrapy.loader.processors` instead
from scrapy.contrib.loader.processor import TakeFirst
/home/bozon/b2b/b2b/spiders/b2b_spider.py:16: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
Rule(SgmlLinkExtractor(allow=('result?c=203&page')), follow=True),
/home/bozon/b2b/b2b/spiders/b2b_spider.py:17: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
Rule(SgmlLinkExtractor(allow=('c/.html', )), callback='parse_item'),
2016-09-26 18:40:49 [scrapy] INFO: Scrapy 1.0.3 started (bot: b2b)
2016-09-26 18:40:49 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-09-26 18:40:49 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'b2b.spiders', 'FEED_URI': 'scarped_data_utf8.csv', 'SPIDER_MODULES': ['b2b.spiders'], 'BOT_NAME': 'b2b', 'TELNETCONSOLE_ENABLED': False, 'FEED_FORMAT': 'csv'}
2016-09-26 18:40:50 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, LogStats, CoreStats, SpiderState
2016-09-26 18:40:50 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-09-26 18:40:50 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-09-26 18:40:50 [scrapy] INFO: Enabled item pipelines:
2016-09-26 18:40:50 [scrapy] INFO: Spider opened
2016-09-26 18:40:50 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-09-26 18:40:50 [scrapy] DEBUG: Crawled (200) <GET http://b2binform.ru/result/?c=203/ (referer: None)
2016-09-26 18:40:50 [scrapy] INFO: Closing spider (finished)
2016-09-26 18:40:50 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 224,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 5648,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 9, 26, 15, 40, 50, 329729),
'log_count/DEBUG': 1,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 9, 26, 15, 40, 50, 107101)}
2016-09-26 18:40:50 [scrapy] INFO: Spider closed (finished)
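The log shows exactly one request and zero extracted links, which points at the `allow=` patterns: they are regular expressions, and the metacharacters in them are not escaped. Assuming pagination URLs look like `.../result/?c=203&page=2` (which is what the Rule's own pattern suggests), the first pattern can never match. A standalone check with the stdlib `re` module:

```python
import re

# Pagination URL as the site presumably emits it (inferred from
# start_urls plus the "page" fragment in the Rule; an assumption).
page_url = "http://b2binform.ru/result/?c=203&page=2"

# The pattern from the first Rule. In a regex "?" makes the preceding
# "t" optional, so this matches "resultc=203..." or "resulc=203..." --
# never the real URL, which has "/?" between "result" and "c=203".
broken = r"result?c=203&page"
print(re.search(broken, page_url))  # None: the Rule extracts nothing

# Escaping the "?" and matching the "/" fixes it:
fixed = r"result/\?c=203&page"
print(re.search(fixed, page_url) is not None)  # True

# The item-page pattern "c/.html" has a related problem: the unescaped
# "." matches exactly one arbitrary character, so e.g. "c/12.html"
# would not match. Something like r"c/\d+\.html" is likely intended,
# but verify against the real salon URLs.
```

So the spider crawls the start page, finds no link matching either rule, and finishes with 0 items; the deprecation warnings are noise, not the cause.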