Ubuntu – Cannot create a crontab job for the scrapy program

command linecrongoogle-chromepythonselenium

I have written a small Python scraper (using Scrapy framework). The scraper requires a headless browse… I am using ChromeDriver.

As I am running this code on an Ubuntu server which does not have any GUI, I had to install Xvfb in order to run ChromeDriver on my Ubuntu server (I followed this guide)

This is my code:

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self):
        # self.driver = webdriver.Chrome(ChromeDriverManager().install())
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        self.driver = webdriver.Chrome('/usr/bin/chromedriver', chrome_options=chrome_options)

I can run the above code from Ubuntu shell and it execute without any errors:

ubuntu@ip-1-2-3-4:~/scrapers/my_scraper$ scrapy crawl my_spider

Now I want to setup a cron job to run the above command everyday:

# m h  dom mon dow   command
PATH=/usr/local/bin:/home/ubuntu/.local/bin/
05 12 * * * cd /home/ubuntu/scrapers/my_scraper && scrapy crawl my_spider >> /tmp/scraper.log 2>&1

but the crontab job gives me the following error:

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 192, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 196, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "/home/ubuntu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 86, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 98, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/spiders/__init__.py", line 19, in from_crawler
    spider = cls(*args, **kwargs)
  File "/home/ubuntu/scrapers/my_scraper/my_scraper/spiders/spider.py", line 27, in __init__
    self.driver = webdriver.Chrome('/usr/bin/chromedriver', chrome_options=chrome_options)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
    desired_capabilities=desired_capabilities)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
  (unknown error: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
  (Driver info: chromedriver=2.41.578700 (2f1ed5f9343c13f73144538f15c00b370eda6706),platform=Linux 5.4.0-1029-aws x86_64)

Update

This answer help me solve the issue (but I don't quite understand why)

I ran echo $PATH on my Ubuntu shell and copied the value into the crontab:

PATH=/home/ubuntu/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
05 12 * * * cd /home/ubuntu/scrapers/my_scraper && scrapy crawl my_spider >> /tmp/scraper.log 2>&1

Note: As I have created a bounty for this question, I am happy to award it to any answer which explains why changing the PATH solved the issue.

Best Answer

This is the reason of almost all the cases where cron doesn't seem to run.

Cron always runs with a mostly empty environment. HOME, LOGNAME, and SHELL are set; and a very limited PATH. It is therefore advisable to use complete paths to executables, and export any variables you need in your script when using cron.

Also, you can use the environment variables you use on your shell.

Note that you can't use variable substitution as in shell, so a declaration like PATH=/usr/local/bin:$PATH is interpreted literally.

Related Question