Design Notes for an automated Offsite Backup facility/toolkit

Design Goals

  • provide a shareable resource that can be leveraged by multiple projects
    • dual-licensed under Mozilla and GPL?
      • or under a license compatible with both Mozilla and GPL eg BSD?
  • aim to follow (if not lead) "best practice" in custodianship of health data
  • make use of the new generation of Internet accessible data storage facilities, particularly Amazon S3
  • include in both NetEpi and GNUmed and/or offer as a stand-alone module

Design Parameters

  • written in Python (aids alignment with NetEpi and GNUmed)
  • a very small portion would be PostgreSQL-specific (just the pg_dump command)
    • a class to handle database-specific commands and options, so that alternative classes can easily be added to handle other databases (eg MySQL, Interbase) - but PostgreSQL should be the initial back-end target.
  • provide a command-line/shell scripting interface as well as exposing a Python API.
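The database-specific class mentioned above might be sketched as follows. This is a hedged illustration only: the class names (`DatabaseBackend`, `PostgresBackend`), the method names, and the connection defaults are assumptions, not part of any existing codebase; only the use of `pg_dump` itself comes from the notes.

```python
# Illustrative sketch of the database-backend abstraction: a base class
# plus one PostgreSQL implementation wrapping pg_dump. All names here
# are assumptions for the purpose of the example.
import subprocess


class DatabaseBackend:
    """Base class: each backend supplies its own dump command."""

    def dump_command(self, dbname, outfile):
        raise NotImplementedError


class PostgresBackend(DatabaseBackend):
    """Initial target back-end: wraps the pg_dump utility."""

    def __init__(self, host="localhost", port=5432, user="postgres"):
        self.host, self.port, self.user = host, port, user

    def dump_command(self, dbname, outfile):
        # Custom-format dump (-Fc) is compressed and restorable via pg_restore.
        return ["pg_dump", "-h", self.host, "-p", str(self.port),
                "-U", self.user, "-Fc", "-f", outfile, dbname]

    def dump(self, dbname, outfile):
        subprocess.check_call(self.dump_command(dbname, outfile))
```

A MySQL or Interbase backend would then only need to override `dump_command` with the equivalent native dump invocation.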


  • checksum validation (Amazon S3 provides this anyway) to ensure integrity of uploaded/retrieved files
  • option to have the dump "gnotarized" (digitally notarized, eg via the GNotary service)
  • database dumps to be encrypted using GPG before being uploaded
    • mechanism to ensure that encryption has run to completion and completed correctly before the database dump file is uploaded; otherwise there is a small risk that unencrypted patient data may be exposed to the operators of the Amazon S3 servers and/or snooped in transit (since HTTP, not HTTPS, is used as the transfer/retrieval protocol to/from Amazon S3)
  • mechanism to automatically retrieve selected encrypted dumps, decrypt them, and then do a test restore to a new PostgreSQL database (on a test server)
  • mechanism to prune uploaded files from the Amazon S3 repository on a defined basis eg GFS (grandfather/father/son) with settable parameters.
    • mechanism to estimate annual cost of storage/upload based on back-up upload and pruning schedule and supplied charge rates (for Amazon S3 or similar)
  • scheduling to be provided by the operating system (eg cron), or scheduling to be part of the package (using the Python schedule library?), or both - eg a utility called by cron once a day (or even once an hour), which decides whether or what to do on each invocation based on its internal schedule and the current date/time.
  • email notification of success and/or failure
  • logging via standard Python logging library
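The checksum validation mentioned above can be done locally by comparing the file's MD5 digest against the ETag that S3 returns (for simple, non-multipart PUTs the ETag is the hex MD5 of the object). A minimal sketch, with assumed function names:

```python
# Hedged sketch: verify an upload by comparing a locally computed MD5
# digest with the ETag returned by S3. Function names are assumptions.
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Stream the file in chunks so large dumps need not fit in memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
    return md5.hexdigest()


def upload_verified(path, returned_etag):
    """True if the ETag S3 sent back matches our local digest."""
    return file_md5(path) == returned_etag.strip('"')
```

The same `file_md5` helper can be reused on the retrieval side to check downloaded dumps before attempting decryption.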

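One way the GFS pruning rule with settable parameters could look, as a sketch: keep the last `sons` days of dumps, the last `fathers` weekly (Sunday) dumps, and the last `grandfathers` monthly (first-of-month) dumps. All names, defaults, and the Sunday/first-of-month conventions are illustrative assumptions.

```python
# Hedged sketch of a GFS (grandfather/father/son) retention decision.
# Everything not retained by the policy is a candidate for pruning
# from the S3 repository.
import datetime


def keep_dates(dates, today, sons=7, fathers=4, grandfathers=12):
    """Return the subset of dump dates that the GFS policy retains."""
    keep = set()
    dates = sorted(dates, reverse=True)                        # newest first
    keep.update(d for d in dates if (today - d).days < sons)   # daily "sons"
    weekly = [d for d in dates if d.weekday() == 6]            # Sundays
    keep.update(weekly[:fathers])                              # "fathers"
    monthly = [d for d in dates if d.day == 1]                 # month starts
    keep.update(monthly[:grandfathers])                        # "grandfathers"
    return keep
```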

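The annual cost estimate could be a simple calculation from the retention schedule and the supplied charge rates. A sketch, where the parameter names and the split into steady-state storage plus upload transfer are assumptions (real S3 billing also has per-request charges not modelled here):

```python
# Back-of-envelope annual cost estimate for an S3-style service, driven
# by caller-supplied charge rates. Parameter names are assumptions.


def annual_cost(dump_size_gb, dumps_per_year, retained_dumps,
                storage_rate_gb_month, transfer_rate_gb):
    """Estimate yearly charges: steady-state storage plus upload transfer."""
    storage = retained_dumps * dump_size_gb * storage_rate_gb_month * 12
    transfer = dumps_per_year * dump_size_gb * transfer_rate_gb
    return storage + transfer
```

For example, daily 2 GB dumps with 23 dumps retained under GFS would be `annual_cost(2.0, 365, 23, storage_rate_gb_month, transfer_rate_gb)` at whatever rates apply.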
Implementation details

  • Possibly use the PycURL library (a wrapper around libcurl) to handle HTTP transactions robustly, rather than having to write error handling for HTTP in Python
    • simpler than using TwistedMatrix?

Functional Tests


Topic revision: 08 May 2006, TimChurches