Check different data sources for content and trigger the download of new elements.
This script is designed to be run periodically (e.g. with a cronjob) to regularly check and automatically download new content when it comes out.
I wrote this as an exercise to experiment with some design patterns in Python. There is no particular focus on performance or reliability (at least not yet).
This script only triggers the download of given URLs. I am not responsible for how you use this software or for any damage that might result from using it, and I certainly do not encourage piracy in any form.
- Handles different types of data sources and can perform downloads in different ways
- An optional caching system ensures that the same content is not downloaded multiple times
- Easy to add support for new ways to fetch, download, and cache URLs
- Pre/post-download hooks for quick tweaking/extension of the default behavior via scripts
- Python >= 3.8
- Clone the repo
- Install dependencies: `cd autoDownloader && pip install -r requirements.txt`
- Create a `config.json` file in the main repo folder (see the example below)
- Check `python main.py -h` for the available command line options
- Try it out: `python main.py`. The command itself won't output anything, but you can follow the execution by checking the log with `tail -f autoDownloader.log` in another shell.
- Once you have verified that everything works, schedule a periodic execution, e.g. every day at 01:00 AM
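For example, the periodic execution could be scheduled with a crontab entry like the one below. The repo path and the Python binary are placeholders; adjust them to your setup:

```
# run autoDownloader every day at 01:00
0 1 * * * cd /path/to/autoDownloader && python3 main.py
```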
Example of a `config.json` file:
```json
{
  "items": [
    {
      "name": "My favorite content",
      "dest_dir": "/home/user/downloads/auto_downloads",
      "provider": {
        "type": "RssProvider",
        "url": "https://example-rss.com/rss.xml",
        "namespaces": {
          "ns0": "https://example-rss.com"
        },
        "xpaths": {
          "title": "/title",
          "items": "//item",
          "url": "/link"
        },
        "patterns": [
          "filtering-regexp"
        ]
      },
      "cache": {
        "type": "FileCache",
        "path": "/home/user/downloads/auto_downloads/cache.txt"
      },
      "downloader": {
        "type": "HttpDownloader",
        "method": "GET"
      },
      "global_pre_script": "/home/user/downloads/auto_downloads/pre_script.sh",
      "post_downloads_script": "/home/user/downloads/auto_downloads/post_download.sh"
    }
  ]
}
```
Items describe your download definitions. Each item contains the configuration telling the script how to check and download content from a specific source.
The main configuration contains a list of item definitions, which are executed in order. The ordering can be important if you want to use scripts to build a complex download mechanism for a certain data source.
Each item contains at least a `name`, a destination folder (`dest_dir`), and one of each:
- `provider`
- `cache`
- `downloader`
Check the reference for the complete list of fields.
The supported providers, caches and downloaders can be easily extended in the code; take a look at the existing ones for reference.
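As a rough illustration, a new provider could look something like the sketch below. The class and method names here (`StaticListProvider`, `get_urls`) are assumptions made for this example, not the project's actual interface; check the existing providers in the repo for the real one.

```python
class StaticListProvider:
    """Illustrative provider that yields a fixed set of URLs.

    Useful for exercising the rest of the pipeline without a
    remote source; a real provider would fetch and parse one.
    """

    def __init__(self, urls):
        self.urls = list(urls)

    def get_urls(self):
        # A real provider would fetch a feed/page here and extract links.
        return self.urls
```

A provider's only job is to produce the list of candidate URLs; deduplication and downloading are handled by the cache and downloader components.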
You can specify scripts to be run at certain hooks. Currently the available hooks are:

- `global_pre_script`: executed before the download of any URL from the current item has started.
- `post_downloads_script`: executed after the download of all the URLs from the current item has finished.
- A pre-download script, executed before the download of each URL from the current item. Available environment variables:
  - `AUTODOWNLOADER_URL`: URL about to be downloaded
- A post-download script, executed after the download of each URL from the current item. Available environment variables:
  - `AUTODOWNLOADER_URL`: URL of the downloaded content
  - `AUTODOWNLOADER_FILENAME`: destination file name
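For instance, a per-URL post-download hook could be a small shell script like the sketch below. It only reads the environment variables listed above; the script name and what it does with the values are up to you:

```shell
#!/bin/sh
# Hypothetical post-download hook: print a one-line summary of the
# finished download. AUTODOWNLOADER_URL and AUTODOWNLOADER_FILENAME
# are exported by autoDownloader before the script is invoked.
echo "downloaded ${AUTODOWNLOADER_FILENAME} from ${AUTODOWNLOADER_URL}"
```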
The provider description in each item defines how to fetch information about the content to download. Ultimately this results in a list of URLs, which will get passed to the cache and finally to the downloader.
As this script is meant to be run periodically, it can happen that the URL provider will return the same set of available URLs over and over.
To avoid triggering the download of URLs that were already fetched on a previous run, a cache can be used. The downloader will skip any URL which is stored in the cache.
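The mechanism can be sketched roughly as follows. This is an illustrative stand-in (one URL per line in a text file), not the project's actual `FileCache` implementation:

```python
import os


class SimpleFileCache:
    """Illustrative one-URL-per-line file cache (not the real FileCache)."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        # Read the set of already-seen URLs, if the cache file exists.
        if not os.path.exists(self.path):
            return set()
        with open(self.path) as f:
            return {line.strip() for line in f if line.strip()}

    def contains(self, url):
        return url in self._load()

    def add(self, url):
        # Append the URL so future runs will skip it.
        with open(self.path, "a") as f:
            f.write(url + "\n")


def filter_new(urls, cache):
    """Keep only the URLs that are not yet in the cache."""
    return [u for u in urls if not cache.contains(u)]
```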
The downloader is the component that is effectively taking care of downloading the content from a URL.
- Direct download over HTTP
- Torrent (via rTorrent client)
- NullDownloader (skip download)
Please add new tests if you want to add new functionalities or fix bugs.
The code is unit tested with the `unittest` module, so no extra libraries are needed to run the tests. Just make sure that the repo root is in your `PYTHONPATH`, then run:

`python -m unittest`

Note that Python 3.8 or higher is required to run the tests.
The configuration file format is defined with jsonschema. The definition files are under the `schemas` folder.
The jsonschema definitions are checked against the config at runtime when it is loaded, so they must be updated whenever the configuration format is extended.
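The runtime check can be pictured with a minimal sketch using the `jsonschema` library. The schema below is a simplified stand-in invented for this example; the real definitions live under the `schemas/` folder:

```python
import json

from jsonschema import validate  # raises ValidationError on mismatch

# Simplified stand-in schema, NOT the real definitions from schemas/.
CONFIG_SCHEMA = {
    "type": "object",
    "required": ["items"],
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "dest_dir", "provider", "cache", "downloader"],
            },
        },
    },
}


def load_config(path):
    """Load the config file and fail early if it does not match the schema."""
    with open(path) as f:
        config = json.load(f)
    validate(instance=config, schema=CONFIG_SCHEMA)
    return config
```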
To generate the markdown description out of the jsonschema definitions, use `jsonschema2md` with the following options:

`jsonschema2md -d schemas -e json -o doc`