One of the most useful features of Scrapy is that it can download and process images. For example, in the ecommerce world, retail companies use web scraping to collect product data from the web, and scraping images is necessary to match competitors' products with their own. With Scrapy, you can easily download images from websites using the ImagesPipeline.
Download images with Scrapy
The process of downloading images:
- First, install Pillow, an imaging library, because Scrapy uses it.
- Enable the ImagesPipeline: go to settings.py and include ImagesPipeline as an item pipeline.
- Again in settings.py, define IMAGES_STORE, the path where the images should be downloaded.
- In your item class, create two fields: image_urls and images.
- Inside your spider, scrape the URLs of the images you want to download and put them into the image_urls field (it has to be a list). Your job in the spider is done here: Scrapy's ImagesPipeline downloads the image(s) and the scraper waits until downloading finishes (or fails).
- After the images are downloaded, the images field will be populated with the results. It will contain a list of dictionaries with information about each image, such as the download path, the URL, and the checksum of the file.
So let’s do it step by step:
1. Install Pillow with pip:
pip install Pillow
2. Enable ImagesPipeline:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
3. Define a path for the images:
IMAGES_STORE = '/home/user/SA/ImageScraper/ImageScraper/product_images'
4. Create the fields in the item class:
from scrapy import Item, Field

class BookItem(Item):
    # other fields...
    images = Field()
    image_urls = Field()
    # other fields...
5. Scrape URLs in your spider:
def parse_book(self, response):
    book = BookItem()
    # relative URLs of the images on the page
    relative_img_urls = response.css("div.item.active > img::attr(src)").extract()
    # ImagesPipeline needs absolute URLs
    book["image_urls"] = self.url_join(relative_img_urls, response)
    return book

def url_join(self, urls, response):
    joined_urls = []
    for url in urls:
        joined_urls.append(response.urljoin(url))
    return joined_urls
image_urls needs to be a list and needs to contain ABSOLUTE URLs, which is why you sometimes have to write a function that transforms relative URLs into absolute ones.
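If you prefer, the helper function can be replaced with a list comprehension, since response.urljoin resolves each relative URL against the page URL directly. A minimal sketch of the same step:

def parse_book(self, response):
    book = BookItem()
    relative_img_urls = response.css("div.item.active > img::attr(src)").extract()
    # urljoin() turns each relative URL into an absolute one
    book["image_urls"] = [response.urljoin(url) for url in relative_img_urls]
    return book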
6. If everything works correctly, you will see output something like this:
{
 'image_name': u'1,000 Places to See Before You Die',
 'image_urls': [u'http://books.toscrape.com/media/cache/9e/10/9e106f81f65b293e488718a4f54a6a3f.jpg'],
 'images': [{'checksum': '6a732297bf33405b3b161f33edb2bcd5',
             'path': 'full/e81aa83c5f23d396bc2953c061af6a5c454cd3e4.jpg',
             'url': 'http://books.toscrape.com/media/cache/9e/10/9e106f81f65b293e488718a4f54a6a3f.jpg'}]
}
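The images field is ordinary item data, so later pipelines can make use of it. As an illustration, here is a minimal (hypothetical) pipeline that logs where each downloaded file ended up; it assumes it runs after the ImagesPipeline, i.e. with a higher priority number in ITEM_PIPELINES:

import os

class ImagePathLoggerPipeline:
    def process_item(self, item, spider):
        for image in item.get('images', []):
            # 'path' is relative to IMAGES_STORE
            full_path = os.path.join(spider.settings.get('IMAGES_STORE'), image['path'])
            spider.logger.info("Image saved to %s", full_path)
        return item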
Custom Names for the Image Download Fields
You can define your own field names instead of image_urls and images. For the ImagesPipeline, set these in the settings file (the FILES_URLS_FIELD and FILES_RESULT_FIELD settings are the equivalents for the FilesPipeline):
IMAGES_URLS_FIELD = 'instead_of_image_urls_field_name'
IMAGES_RESULT_FIELD = 'instead_of_images_field_name'
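A minimal sketch of how the pieces fit together, with hypothetical field names (book_cover_urls and book_covers are just examples):

# settings.py
IMAGES_URLS_FIELD = 'book_cover_urls'
IMAGES_RESULT_FIELD = 'book_covers'

# items.py
class BookItem(Item):
    book_cover_urls = Field()
    book_covers = Field()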
Create Thumbnails of the Images
The ImagesPipeline can do this for you. You just have to include the dimensions of the desired thumbnails in settings.py and it creates them automatically. Like this:
IMAGES_THUMBS = {
'small': (50, 50),
'big': (260, 260),
}
It generates two kinds of thumbnails (a smaller and a bigger one) for each image, saving them into two different folders. The aspect ratio is kept.
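The thumbnails are saved under thumbs/<size_name>/ inside IMAGES_STORE, while the original image stays under full/. A small sketch (assuming the item layout shown in step 6) of building the path to the small thumbnail of the first image:

import os

def small_thumb_path(item, images_store):
    # 'full/<sha1>.jpg' -> 'thumbs/small/<sha1>.jpg'
    original_path = item['images'][0]['path']
    thumb_path = original_path.replace('full/', 'thumbs/small/', 1)
    return os.path.join(images_store, thumb_path)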
File Expiration
Scrapy checks whether an image has already been downloaded recently, so it won't download it again unnecessarily. You can define how long Scrapy should avoid re-downloading the same image in the settings:
IMAGES_EXPIRES = 15  # 15 days of delay for image expiration (default: 90 days)
Custom Filenames for Images
The default filenames of the downloaded images are based on a SHA1 hash of their URLs. But in the real world that doesn't help you know what's in the image without opening it. You can use whatever filenames you want for the images. You have to extend the ImagesPipeline and override two functions: get_media_requests and file_path. In the first function you return Request objects carrying the desired filename as meta information. In the second function you simply use that meta information to override the default file path. Sample code:
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class CustomImageNamePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        return [Request(x, meta={'image_name': item["image_name"]})
                for x in item.get('image_urls', [])]

    def file_path(self, request, response=None, info=None):
        return '%s.jpg' % request.meta['image_name']
You create a field called image_name in the item class. Then you populate it with the desired data in your spider.
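Also remember to register your custom pipeline in settings.py in place of the default ImagesPipeline, otherwise your file_path will never be called. A sketch, assuming the pipeline lives in the project's pipelines.py and the spider uses the book title as the image name (the selector is only an example):

# settings.py
ITEM_PIPELINES = {'ImageScraper.pipelines.CustomImageNamePipeline': 1}

# in the spider's parse_book()
book["image_name"] = response.css("div.product_main > h1::text").extract_first()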