exited: scrapy (exit status 0; not expected)

I am trying to run a bash script that starts several spiders inside my Docker container. My supervisor.conf, placed in /etc/supervisor/conf.d/, looks like this:

    [program:scrapy]
    command=/tmp/start_spider.sh
    autorestart=false
    startretries=0
    stderr_logfile=/tmp/start_spider.err.log
    stdout_logfile=/tmp/start_spider.out.log
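For reference, a new file dropped into /etc/supervisor/conf.d/ can be applied and inspected with the standard supervisorctl commands, roughly like this (a minimal sketch, assuming supervisord is already running):

    # re-read the config files and apply any changes
    supervisorctl reread
    supervisorctl update
    # check the current state of the program and follow its stderr log
    supervisorctl status scrapy
    supervisorctl tail scrapy stderr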

But supervisord returns these errors:

    2015-08-21 10:50:30,466 CRIT Supervisor running as root (no user in config file)
    2015-08-21 10:50:30,466 WARN Included extra file "/etc/supervisor/conf.d/tor.conf" during parsing
    2015-08-21 10:50:30,478 INFO RPC interface 'supervisor' initialized
    2015-08-21 10:50:30,478 CRIT Server 'unix_http_server' running without any HTTP authentication checking
    2015-08-21 10:50:30,478 INFO supervisord started with pid 5
    2015-08-21 10:50:31,481 INFO spawned: 'scrapy' with pid 8
    2015-08-21 10:50:31,555 INFO exited: scrapy (exit status 0; not expected)
    2015-08-21 10:50:32,557 INFO gave up: scrapy entered FATAL state, too many start retries too quickly

Then my program stops running. But if I run the script manually, it works fine…

How can I fix this? Any ideas?

I found the solution to my problem. In supervisor.conf, change:

    [program:scrapy]
    command=/tmp/start_spider.sh
    autorestart=false
    startretries=0

to:

    [program:scrapy]
    command=/bin/bash -c "exec /tmp/start_spider.sh > /dev/null 2>&1 -DFOREGROUND"
    autostart=true
    autorestart=false
    startretries=0

Here is my code:

start_spider.sh

    #!/bin/bash

    # list letter
    parseLetter=('a' 'b')

    # change path
    cd $path/scrapy/scrapyTodo/scrapyTodo

    tLen=${#parseLetter[@]}

    # launch one spider per letter, all in the background
    for (( i=0; i<${tLen}; i++ ));
    do
        scrapy crawl root -a alpha=${parseLetter[$i]} &
    done
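My guess at what was going wrong: this script only launches the spiders in the background and then returns immediately, so the command supervisord manages exits within its default startsecs window, which is why it logs "exit status 0; not expected" and eventually gives up. A variant that stays in the foreground and waits for the background spiders would also keep supervisord happy (a minimal sketch, not the fix I ended up using):

    #!/bin/bash
    # same idea as start_spider.sh, but block until every spider finishes
    # so supervisord sees a long-running foreground process
    parseLetter=('a' 'b')
    cd $path/scrapy/scrapyTodo/scrapyTodo

    for letter in "${parseLetter[@]}"; do
        scrapy crawl root -a alpha="$letter" &
    done

    # wait for all background scrapy processes before exiting
    wait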

Here is my scrapy code:

    #!/usr/bin/python -tt
    # -*- coding: utf-8 -*-
    from scrapy.selector import Selector
    from elasticsearch import Elasticsearch
    from scrapy.contrib.spiders import CrawlSpider
    from scrapy.http import Request
    from urlparse import urljoin
    from bs4 import BeautifulSoup
    from scrapy.spider import BaseSpider
    from tools import sendEmail
    from tools import ElasticAction
    from tools import runlog
    from scrapy import signals
    from scrapy.xlib.pydispatch import dispatcher
    from datetime import datetime
    import re


    class studentCrawler(BaseSpider):
        # Crawling Start
        CrawlSpider.started_on = datetime.now()

        name = "root"
        DOWNLOAD_DELAY = 0

        allowed_domains = ['website.com']

        ES_Index = "website"
        ES_Type = "root"
        ES_Ip = "127.0.0.1"

        child_type = "level1"

        handle_httpstatus_list = [404, 302, 503, 999, 200]

        # add any other code you need
        es = ElasticAction(ES_Index, ES_Type, ES_Ip)

        # Init
        def __init__(self, alpha=''):
            base_domain = 'https://www.website.com/directory/student-' + str(alpha) + "/"
            self.start_urls = [base_domain]
            super(studentCrawler, self).__init__(self.start_urls)

        def is_empty(self, any_structure):
            """
            Check whether the given data is empty or not.
            :arg any_structure: any data
            """
            if any_structure:
                return 1
            else:
                return 0

        def parse(self, response):
            """
            Main method that parses the web page.
            :param response:
            :return:
            """
            if response.status == 404:
                self.es.insertIntoES(response.url, "False")
            if str(response.status) == "503":
                self.es.insertIntoES(response.url, "False")
            if response.status == 999:
                self.es.insertIntoES(response.url, "False")
            if str(response.status) == "200":
                # Selector
                sel = Selector(response)
                self.es.insertIntoES(response.url, "True")
                body = self.getAllTheUrl('u'.join(sel.xpath(".//*[@id='seo-dir']/div/div[3]").extract()).strip(), response.url)

        def getAllTheUrl(self, data, parent_id):
            dictCompany = dict()
            soup = BeautifulSoup(data, 'html.parser')
            for a in soup.find_all('a', href=True):
                self.es.insertChildAndParent(self.child_type, str(a['href']), "False", parent_id)

I also found that BeautifulSoup does not work when the spiders are launched by supervisord….
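When something behaves differently under supervisord than in a manual shell, the usual suspect is the environment (PATH, locale, virtualenv, and so on), since supervisord starts processes with a much smaller environment than a login shell. One way to compare the two is a small wrapper that dumps the environment before starting the spiders (a hypothetical debugging sketch; the file paths are just examples):

    #!/bin/bash
    # dump the environment supervisord gives us, then run the real script;
    # diff the resulting file against `env | sort` from a manual shell
    env | sort > /tmp/supervisord_env.txt
    exec /tmp/start_spider.sh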