Uploading Datasets#
Two DUFT Server installations can seamlessly exchange data, enabling facility-based instances to push datasets to a centralised DUFT Server. This process is managed by the Upload Engine, a specialised component designed for high-volume, batch-based data transfers.
Unlike traditional JSON or FHIR-based APIs that transmit individual records, the Upload Engine processes structured batch files (e.g., JSON, CSV) to ensure efficient and reliable data transmission. This approach mitigates common challenges associated with individual record uploads, such as slow performance, incomplete data transfers, and difficulties in tracking which records have been sent.
The Upload Engine focuses solely on file-based data transmission, ensuring that locally generated datasets are efficiently transferred, stored, and made available for integration into the central repository.
NOTE: The Upload Engine does not integrate received datasets into a repository. This is deliberate: it preserves the separation of concerns between uploading and integration. A server-based Data Task can process the received files, including integrating their data into a repository.
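To illustrate that separation, the receiving server might run an integration Data Task along the following lines. This is a minimal sketch only: the uploads directory, repository database file, table name, and task name are assumptions, and the DUFT helpers used here simply mirror the sample task shown later on this page.

# Minimal sketch of a server-side integration Data Task.
# The uploads directory, repository file, and table name below are assumed, not part of the DUFT API.
import glob
import os
import sqlite3

import pandas as pd

from api_data_task_executioner.data_task_tools import initialise_data_task, get_data_dir

environment = initialise_data_task("Integrate Uploaded Files", params={})

uploads_dir = os.path.join(get_data_dir(), "uploads")                         # assumed location of received files
repository = sqlite3.connect(os.path.join(get_data_dir(), "repository.db"))  # hypothetical repository database

# Append each received client-list CSV to a staging table in the repository
for path in glob.glob(os.path.join(uploads_dir, "*client-list*.csv")):
    frame = pd.read_csv(path)
    frame.to_sql("dim_client_staging", repository, if_exists="append", index=False)
    environment.log_message(f"Integrated {len(frame)} rows from {os.path.basename(path)}")

repository.close()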
Use Cases#
The Upload Engine supports a variety of use cases, including:
Routine Report Submission – Facilities submit monthly or weekly summary reports to a national data repository.
Bulk Case Data Upload – Individual-level records (e.g., patient encounters, lab results) are batched and transmitted as structured files.
Synchronisation with a Central DUFT Server – Local DUFT instances push their data to a central DUFT Server for aggregation.
Key Engineering Decisions#
The Upload Engine is optimised for bulk data uploads rather than transmitting individual records. The key benefits of this design include:
Efficiency – Uploading structured files in bulk is significantly faster than sending individual records.
Data Integrity – Ensures complete dataset transfers, reducing the risk of partial updates.
Tracking and Auditing – Each uploaded file is uniquely named and stored with a timestamp and sender’s username for traceability.
Fault Tolerance – Failed uploads can be retried without introducing inconsistencies or data duplication.
Decoupled Processing – Uploaded files are stored in an Uploads folder on the DUFT Server, allowing asynchronous processing by other system components.
Uploads require authentication on the receiving DUFT Server to ensure secure storage and association with a valid user account.
Implementation of Uploads#
Uploads are implemented as Data Tasks, benefiting from all standard Data Task features, including parameter configuration and progress logging for end users. For example, the following data_task.json defines an upload task:
{
  "id": "UPLOAD",
  "name": "DUFT Report Uploader",
  "description": "Send data to DUFT Server.",
  "pythonFile": "sample_data_upload.ipynb",
  "hideInUI": false,
  "executeFromRoot": false,
  "dataConnections": [
    "ANA",
    "DUFT-SERVER-API"
  ]
}
This task references data connections, defined in data_connections.json, which specify parameters such as the upload destination:
{
  "id": "DUFT-SERVER-API",
  "name": "DUFT Server Location",
  "description": "The central DUFT Server where reports are uploaded.",
  "params": {
    "serverURL": "https://duft-central.example.com/api/upload",
    "username": "facility_user",
    "password": "securepassword"
  }
}
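The ANA connection referenced by the task is not shown above. Judging from how the sample code below reads it (resolved_parameters['sqlite3file']), a minimal entry could look like the following; the name, description, and file name are purely illustrative:

{
  "id": "ANA",
  "name": "Analysis Database",
  "description": "Local analysis database used as the upload source.",
  "params": {
    "sqlite3file": "analysis.sqlite3"
  }
}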
DUFT simplifies the upload process by exposing APIs that Data Tasks can leverage. The post_file_to_duft_server DUFT Python function provides a straightforward mechanism to upload files programmatically. The following code sample shows this:
# Parameters for this task (may be overridden when the Data Task is executed)
params = {}

import time
import json
import sys

from api_data_task_executioner.data_task_tools import \
    assert_dte_tools_available, get_resolved_parameters_for_connection, initialise_data_task, \
    find_json_arg, get_temp_dir, purge_temp_dir, get_data_dir, post_file_to_duft_server

# Initialise the Data Task environment, which also provides progress logging
environment = initialise_data_task("Jupyter Sample Data Upload Task", params=params)

if not params:
    environment.log_error("No parameters given!")
params["name"] = params.get("customname", params.get("name", "No parameters given!"))

# Make sure the Data Task Executioner tools are available before continuing
assert_dte_tools_available()

# Resolve the data connections declared in data_task.json
resolved_parameters = get_resolved_parameters_for_connection("ANA")
resolved_server_api_parameters = get_resolved_parameters_for_connection("DUFT-SERVER-API")

tmp_dir = get_temp_dir()

environment.log_message('Data Upload Starting!')
environment.log_message("Using parameters: %s" % resolved_parameters)
environment.log_message("Using temp directory: %s" % tmp_dir)
print(resolved_server_api_parameters)

# Load a sample dataset from the SQLite file referenced by the ANA connection
import pandas as pd
import sqlite3
import os

conn = sqlite3.connect(os.path.join(get_data_dir(), resolved_parameters['sqlite3file']))
data = pd.read_sql_query("SELECT * FROM dim_client LIMIT 20", conn)
conn.close()

# Write the dataset to a CSV file in the task's temp directory
file_to_save = os.path.join(tmp_dir, 'dim_client_sample.csv')
data.to_csv(file_to_save, index=False)
data.head()

# Upload the file to the central DUFT Server defined in the DUFT-SERVER-API connection
try:
    post_file_to_duft_server(
        resolved_server_api_parameters["serverURL"],
        resolved_server_api_parameters["username"],
        resolved_server_api_parameters["password"],
        file_to_save,
        "client-list")
    environment.log_message('Data Upload Complete!')
except Exception as e:
    environment.log_error(f"Failed to upload file: {e}")
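Because uploads are file-based, a failed transfer can simply be retried with the same file, as noted under Key Engineering Decisions. The sketch below extends the sample above, reusing its variables, by wrapping the call in a small retry loop and cleaning the temp directory afterwards. The retry count and delay are illustrative, and purge_temp_dir() is assumed to take no arguments, mirroring get_temp_dir() above.

# Sketch only: retry the upload a few times before giving up, then clean up the temp directory.
# Retry count, delay, and the no-argument call to purge_temp_dir() are assumptions, not documented DUFT behaviour.
MAX_ATTEMPTS = 3

for attempt in range(1, MAX_ATTEMPTS + 1):
    try:
        post_file_to_duft_server(
            resolved_server_api_parameters["serverURL"],
            resolved_server_api_parameters["username"],
            resolved_server_api_parameters["password"],
            file_to_save,
            "client-list")
        environment.log_message('Data Upload Complete!')
        break
    except Exception as e:
        environment.log_error(f"Upload attempt {attempt} failed: {e}")
        if attempt < MAX_ATTEMPTS:
            time.sleep(5)  # brief pause before retrying

purge_temp_dir()  # assumed to clear this task's temp directory once the upload is finished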