Parsing Links for an Image - 09 May 2012
Here’s a generic image parser I wrote to find resource images for my project management application. It’s definitely a process you’ll want to run in the background, but it should handle most links.
I’ve tried to provide some brief comments in the code example to give you an idea of what I’m trying to do. This is a very simple solution to a complex problem and may not scale to your specific needs. You’ll also need to install gem 'nokogiri'.
require 'open-uri'
require 'nokogiri'

module ResourceParser
  # Pass the page link you want to grab an image from
  #   ResourceParser.link your_url
  def self.link(url)
    # Need to format the URL for Nokogiri
    url = format_link url

    # Pass the URL to Nokogiri to get the image link
    img = get_image url

    # Return nil if we couldn't find an image
    return nil if img.nil?

    img = img['src']

    # Strip the path and parameters from the URL, keeping scheme and host
    url = url.slice(/\Ahttps?:\/\/[a-z0-9\-.]+/i)

    # Format the URL based on the type of link in the img src attribute
    format_image(img, url)
  end

  def self.get_image(url)
    # URI.open replaces the bare open call, which no longer accepts URLs in Ruby 3
    doc = Nokogiri::HTML(URI.open(url))
    doc.css('img').first
  # Return nil if Nokogiri runs into an error with the link
  rescue StandardError
    nil
  end

  def self.format_link(url)
    # Add a scheme unless the link already starts with http:// or https://
    url = "http://#{url}" unless url =~ /\Ahttps?:\/\//i
    url
  end

  def self.format_image(img, url)
    # Make sure we have the correct image format
    return nil unless img =~ /\.(png|jpg|gif|bmp|tif|jpeg)/i

    if img =~ /\A\//
      # Absolute path
      url + img
    elsif img =~ /\Ahttps?:\/\//i
      # Complete link
      img
    else
      # Relative path
      url + '/' + img
    end
  end
end
The module uses Nokogiri for the parsing, and Nokogiri only sees the HTML the server returns, so it won’t work on websites like Twitter or those that render their pages using JavaScript. You pass the link method a URL, and it returns either a link to the first image it finds or nil. It also returns nil if Nokogiri runs into an error looking up the page, or if the first image’s src doesn’t have one of the recognized image extensions.
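Since nil is a normal return value, callers need a fallback. Here is a minimal sketch of one way to handle that. The helper name image_or_default and the placeholder path are hypothetical and not part of the module, and a stub lambda stands in for the parser so the snippet runs without network access.

```ruby
# Hypothetical helper, not part of ResourceParser: returns the parsed image
# link, or a fallback when the parser returns nil.
DEFAULT_IMAGE = '/images/placeholder.png' # assumed path, adjust for your app

def image_or_default(url, parser:, default: DEFAULT_IMAGE)
  parser.call(url) || default
end

# Stand-in for ResourceParser.method(:link) so this runs offline.
stub_parser = lambda do |url|
  url.include?('with-image') ? 'http://example.com/a.png' : nil
end

puts image_or_default('example.com/with-image', parser: stub_parser)
puts image_or_default('example.com/no-image', parser: stub_parser)
```

In an application you would pass parser: ResourceParser.method(:link) instead of the stub.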
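The module was written to format the image src that it finds, so in most cases it returns a usable absolute link for the image. This was needed to handle websites that include a complete URL in their img src, as well as sites that provide an absolute or relative path to the image. Relative paths are resolved against the site root rather than the page’s own directory, so they can come out wrong for pages nested in subdirectories.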
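The three src cases can also be handled by Ruby’s standard library. The sketch below is an alternative, not what the module does: it uses URI.join to resolve a src against the full page URL, which also covers relative paths on nested pages and protocol-relative links. The helper name resolve_src and the example URLs are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: resolve an img src against the page it was found on.
# Returns nil if either value can't be parsed as a URI.
def resolve_src(page_url, src)
  URI.join(page_url, src).to_s
rescue URI::Error
  nil
end

page = 'http://example.com/blog/post.html'

puts resolve_src(page, 'http://cdn.example.org/a.png') # complete link
# → http://cdn.example.org/a.png
puts resolve_src(page, '/images/a.png')                # absolute path
# → http://example.com/images/a.png
puts resolve_src(page, 'images/a.png')                 # relative path
# → http://example.com/blog/images/a.png
puts resolve_src(page, '//cdn.example.org/a.png')      # protocol-relative
# → http://cdn.example.org/a.png
```

This still needs the extension check from format_image if you only want image files.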