Check different data sources for content and trigger the download of new elements.
This script is designed to be run periodically (e.g. with a cronjob) to regularly check and automatically download new content when it comes out.
I wrote this as an exercise to experiment with some design patterns in Python. There is no particular focus on performance or reliability (at least not yet).
This script only triggers the download of given URLs. I am not responsible for how you use this software or for any damage that might result from using it, and I certainly do not encourage piracy in any form.
- Handles different types of data sources and can perform downloads in different ways
- An optional caching system ensures that the same content is not downloaded multiple times
- Easy to add support for new ways to fetch, download, and cache URLs
- Pre/post-download hooks for quick tweaking/extension of the default behavior via scripts
- Python >= 3.8
- Clone the repo
- Install dependencies: `cd autoDownloader && pip install -r requirements.txt`
- Create a `config.json` file in the main repo folder (see the example below)
- Check `python main.py -h` for the available command line options
- Try it out: `python main.py`. The command itself won't output anything, but you can follow the execution by checking the log with `tail -f autoDownloader.log` in another shell.
- Once you have verified that everything works, schedule a periodic execution, e.g. every day at 01:00 AM
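For example, the periodic execution could be scheduled with a crontab entry like the one below. The repo path and the Python binary are placeholders; adjust them to your setup:

```
# run autoDownloader every day at 01:00
0 1 * * * cd /path/to/autoDownloader && python3 main.py
```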
Example of a `config.json` file:
```json
{
  "items": [
    {
      "name": "My favorite content",
      "dest_dir": "/home/user/downloads/auto_downloads",
      "provider": {
        "type": "RssProvider",
        "url": "https://example-rss.com/rss.xml",
        "namespaces": {
          "ns0": "https://example-rss.com"
        },
        "xpaths": {
          "title": "/title",
          "items": "//item",
          "url": "/link"
        },
        "patterns": [
          "filtering-regexp"
        ]
      },
      "cache": {
        "type": "FileCache",
        "path": "/home/user/downloads/auto_downloads/cache.txt"
      },
      "downloader": {
        "type": "HttpDownloader",
        "method": "GET"
      },
      "global_pre_script": "/home/user/downloads/auto_downloads/pre_script.sh",
      "post_downloads_script": "/home/user/downloads/auto_downloads/post_download.sh"
    }
  ]
}
```
Items describe your download definitions. Each item contains the configuration telling the script how to check and download content from a specific source.
The main configuration contains a list of item definitions, which are executed in order. The ordering can be important if you want to use scripts to build a complex download mechanism for a certain data source.
Each item contains at least a `name`, a destination folder (`dest_dir`), and one of each:
- `provider`
- `cache`
- `downloader`
Check the reference for the complete list of fields.
The supported providers, caches and downloaders can be easily extended in the code; take a look at the existing ones for reference.
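As a rough illustration, a new provider could look something like the sketch below. The class and method names here (`StaticListProvider`, `get_urls`) are assumptions made for this example, not the project's actual interface; check the existing providers in the repo for the real one.

```python
class StaticListProvider:
    """Illustrative provider that yields a fixed set of URLs.

    Useful for exercising the rest of the pipeline without a
    remote source; a real provider would fetch and parse one.
    """

    def __init__(self, urls):
        self.urls = list(urls)

    def get_urls(self):
        # A real provider would fetch a feed/page here and extract links.
        return self.urls
```

A provider's only job is to produce the list of candidate URLs; deduplication and downloading are handled by the cache and downloader components.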
You can specify scripts to be run at certain hooks. Currently the available hooks are:

- `global_pre_script`: executed before the download of any URL from the current item has started.
- `post_downloads_script`: executed after the download of all the URLs from the current item has finished.
- A pre-download script, executed before the download of each URL from the current item. Available environment variables:
  - `AUTODOWNLOADER_URL`: URL about to be downloaded
- A post-download script, executed after the download of each URL from the current item. Available environment variables:
  - `AUTODOWNLOADER_URL`: URL of the downloaded content
  - `AUTODOWNLOADER_FILENAME`: destination file name
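For instance, a per-URL post-download hook could be a small shell script like the sketch below. It only reads the environment variables listed above; the script name and what it does with the values are up to you:

```shell
#!/bin/sh
# Hypothetical post-download hook: print a one-line summary of the
# finished download. AUTODOWNLOADER_URL and AUTODOWNLOADER_FILENAME
# are exported by autoDownloader before the script is invoked.
echo "downloaded ${AUTODOWNLOADER_FILENAME} from ${AUTODOWNLOADER_URL}"
```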
The provider description in each item defines how to fetch information about the content to download. Ultimately this results in a list of URLs, which will get passed to the cache and finally to the downloader.
As this script is meant to be run periodically, it can happen that the URL provider will return the same set of available URLs over and over.
To avoid triggering the download of URLs that were already fetched on a previous run, a cache can be used. The downloader will skip any URL which is stored in the cache.
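The mechanism can be sketched roughly as follows. This is an illustrative stand-in (one URL per line in a text file), not the project's actual `FileCache` implementation:

```python
import os


class SimpleFileCache:
    """Illustrative one-URL-per-line file cache (not the real FileCache)."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        # Read the set of already-seen URLs, if the cache file exists.
        if not os.path.exists(self.path):
            return set()
        with open(self.path) as f:
            return {line.strip() for line in f if line.strip()}

    def contains(self, url):
        return url in self._load()

    def add(self, url):
        # Append the URL so future runs will skip it.
        with open(self.path, "a") as f:
            f.write(url + "\n")


def filter_new(urls, cache):
    """Keep only the URLs that are not yet in the cache."""
    return [u for u in urls if not cache.contains(u)]
```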
The downloader is the component that is effectively taking care of downloading the content from a URL.
- Direct download over HTTP
- Torrent (via rTorrent client)
- NullDownloader (skip download)
Please add new tests if you want to add new functionalities or fix bugs.
The code is unit tested with the `unittest` module, so no extra libraries are needed to run the tests. Just make sure that the repo root is in your `PYTHONPATH`, then run:

`python -m unittest`

Note that Python 3.8 or higher is required to run the tests.
The configuration file format is defined with jsonschema. The definition files are under the `schemas` folder.
The jsonschema definitions are checked against the config at runtime when it is loaded, so they must be updated whenever the configuration format is extended.
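The runtime check can be pictured with a minimal sketch using the `jsonschema` library. The schema below is a simplified stand-in invented for this example; the real definitions live under the `schemas/` folder:

```python
import json

from jsonschema import validate  # raises ValidationError on mismatch

# Simplified stand-in schema, NOT the real definitions from schemas/.
CONFIG_SCHEMA = {
    "type": "object",
    "required": ["items"],
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "dest_dir", "provider", "cache", "downloader"],
            },
        },
    },
}


def load_config(path):
    """Load the config file and fail early if it does not match the schema."""
    with open(path) as f:
        config = json.load(f)
    validate(instance=config, schema=CONFIG_SCHEMA)
    return config
```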
To generate the markdown description out of the jsonschema definitions, use `jsonschema2md` with the following options:

`jsonschema2md -d schemas -e json -o doc`