Welcome to RubyCrawler, a simple web crawler written in Ruby! 😀
Clone this repo:
$ git clone https://github.com/DaniG2k/RubyCrawler.git ruby_crawler
$ cd ruby_crawler
Install the dependencies:
$ bundle install
And install the gem itself:
$ rake install
Require the gem:
require 'ruby_crawler'
Configure the start URLs and the include/exclude patterns:
RubyCrawler.configure do |conf|
  conf.start_urls       = ['https://gocardless.com/']
  conf.include_patterns = [/https:\/\/gocardless\.com/]
  conf.exclude_patterns = []
end
Both include_patterns and exclude_patterns take arrays of regular expressions.
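For instance, a configuration that crawls gocardless.com but skips PDF links and press pages might look like this (the exclude patterns here are purely illustrative):

RubyCrawler.configure do |conf|
  conf.start_urls       = ['https://gocardless.com/']
  conf.include_patterns = [/https:\/\/gocardless\.com/]
  conf.exclude_patterns = [/\.pdf\z/, /\/press\//]  # illustrative only
end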
If you want to see all the URLs under the gocardless.com domain, change the include pattern to:
RubyCrawler.configure do |conf|
  conf.include_patterns = [/gocardless\.com/]
end
This will also match subdomains such as https://blog.gocardless.com/.
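You can check a pattern against a candidate URL directly in irb; Ruby's =~ returns the position of the match, or nil when there is none:

/gocardless\.com/ =~ 'https://blog.gocardless.com/'            # => 13 (matches)
/https:\/\/gocardless\.com/ =~ 'https://blog.gocardless.com/'  # => nil (no match)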
Then kick off a crawl:
RubyCrawler.crawl
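Putting the pieces together, a minimal end-to-end session (using the broader include pattern from above) might look like:

require 'ruby_crawler'

RubyCrawler.configure do |conf|
  conf.start_urls       = ['https://gocardless.com/']
  conf.include_patterns = [/gocardless\.com/]
  conf.exclude_patterns = []
end

RubyCrawler.crawl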
By default, RubyCrawler is polite (i.e. it respects a website's robots.txt file). However, you can change this by setting:
RubyCrawler.configure do |conf|
  conf.polite = false
end
When you kick off a new crawl, you will see the include and exclude patterns update accordingly.
To see the sitemap (i.e. the stored URLs), just type:
RubyCrawler.stored
# =>
# ["https://gocardless.com/",
# "https://gocardless.com/features/",
# "https://gocardless.com/pricing/",
# "https://gocardless.com/accountants/",
# "https://gocardless.com/charities/",
# "https://gocardless.com/agencies/",
# "https://gocardless.com/education/",
# "https://gocardless.com/finance/",
# "https://gocardless.com/local-government/",
# "https://gocardless.com/saas/",
# "https://gocardless.com/telcos/",
# "https://gocardless.com/utilities/"]
To view the assets (CSS, images, and JavaScript) on the crawled pages, you can run:
RubyCrawler.assets
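Assuming assets likewise returns an array of URL strings (an assumption, not verified against the source), you could group them by file extension:

require 'uri'

# Group asset URLs by extension, e.g. {".css" => [...], ".js" => [...]}
RubyCrawler.assets.group_by { |url| File.extname(URI(url).path) }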
To reset the RubyCrawler's configuration, simply execute:
RubyCrawler.reset
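After a reset, the next configure block starts from the defaults again, so a fresh crawl can be set up like this:

RubyCrawler.reset
RubyCrawler.configure do |conf|
  conf.start_urls = ['https://example.com/']  # hypothetical new target
end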
- Currently no flushing of stored URLs or assets to a database; everything is in-memory (see the sketch after this list).
- Canonical links in page source are not taken into account.
- Currently, only a global configuration is supported, although it would be possible to implement configuration on a per-spider basis.
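Until database flushing is implemented, one workaround is to snapshot the in-memory results yourself after each crawl. A minimal sketch using only the standard library (persist_crawl_results is a hypothetical helper, not part of the gem):

require 'json'

# Hypothetical helper: write the in-memory crawl state to disk
# before it is lost on reset or process exit.
def persist_crawl_results(path = 'crawl_results.json')
  File.write(path, JSON.pretty_generate(
    urls:   RubyCrawler.stored,
    assets: RubyCrawler.assets
  ))
end

persist_crawl_results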
After checking out the repo, run bin/setup to install dependencies. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Bug reports and pull requests are welcome on GitHub at https://github.com/DaniG2k/ruby_crawler.