Introducing Juturna

Juturna is a tool that lets you create and manage data processing pipelines from JSON files. It is meant as a fast prototyping tool for consuming, manipulating and transforming input data through a number of customisable nodes.

Main entities

Let’s start by defining what is what in Juturna.

A pipeline can simply be defined as a collection of nodes. Each node acquires a piece of data from its parent and, after performing a single task, provides its output to its children. In this sense, a Juturna pipeline is nothing but a rooted tree, a particular kind of DAG where there is a single root node with an in-degree of 0 (this is not technically the case, but we’ll skip it for now), and every other node has an in-degree of 1.

Main Juturna entities

A node is a pipeline component that should, ideally, do one and only one thing. Depending on the task they are programmed to perform, nodes can be:

  • source: they either consume external data (obtained from real-time streams, remote or local files, databases…) or generate data to push into the pipeline

  • processing: they either transform, annotate or tag input data, or generate completely new data based on their input

  • sink: they deliver the input data to a configured destination, either local or remote

To recap, the key points to keep in mind about a pipeline are:

  • a pipeline has a single source node, consuming data from an external source, either remote or local

  • each node in a pipeline, be it source, processing, or sink, can read its input data from a single node, but can produce its output data for multiple nodes
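The rooted-tree constraints above can be checked mechanically. The following is a minimal sketch in plain Python, not part of Juturna: the function name `is_rooted_tree` and the representation of links as `(parent, child)` name pairs are assumptions made for illustration only, and say nothing about how Juturna stores links internally.

```python
def is_rooted_tree(nodes, links):
    """Check the simplified pipeline shape: one root, in-degree 1 elsewhere.

    `nodes` is a list of node names and `links` a list of (parent, child)
    pairs. Both shapes are illustrative assumptions, not Juturna's format.
    """
    in_degree = {name: 0 for name in nodes}
    children = {name: [] for name in nodes}

    for parent, child in links:
        if parent not in in_degree or child not in in_degree:
            return False  # a link refers to a node that does not exist
        in_degree[child] += 1
        children[parent].append(child)

    # Exactly one node may have no parent: the source (root) node.
    roots = [name for name, degree in in_degree.items() if degree == 0]
    if len(roots) != 1:
        return False

    # Every node reads its input from at most one other node.
    if any(degree > 1 for degree in in_degree.values()):
        return False

    # Every node must be reachable from the root (rules out detached cycles).
    seen, stack = set(), [roots[0]]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children[node])
    return len(seen) == len(in_degree)


nodes = ["source", "transform", "sink_a", "sink_b"]
links = [("source", "transform"), ("transform", "sink_a"), ("transform", "sink_b")]

print(is_rooted_tree(nodes, links))                           # True
print(is_rooted_tree(nodes, links + [("source", "sink_a")]))  # False: two parents
```

The second call fails because `sink_a` would read from two nodes, which violates the single-input rule, while `transform` feeding two sinks is allowed.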

Creating a pipeline

Pipelines can be created starting from a simple JSON file. A basic configuration file for an empty pipeline looks like this:

{
  "version": "0.1.0",
  "plugins": ["./plugins"],
  "pipeline": {
    "name": "my_awesome_pipeline",
    "id": "1234567890",
    "folder": "./running_pipelines",
    "nodes": [ ],
    "links": [ ]
  }
}

The actual items concerning the pipeline are contained within the pipeline object. Namely, they are:

  • name: a symbolic pipeline name

  • id: a unique pipeline identifier

  • folder: a path to the folder where the pipeline directory tree will be created (used to store artifacts, temporary files, or any other pipeline product that needs to persist)

  • nodes: a list of all the nodes in the pipeline

  • links: a list of the links connecting the nodes in the pipeline

A couple of things to notice here:

  • We have to specify the library version we are currently using. This field is likely to be removed in the future, but as of now it is important to keep it to make sure the API version is correct and nothing breaks.

  • We assign both name and id fields in the pipeline configuration. Whilst there might be an overlap between them, the id field is required by other tools that might wrap Juturna and manage multiple pipelines at once. With those tools, pipeline ids are likely to be assigned automatically in order to prevent overlaps.
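If the configuration is assembled programmatically rather than written by hand, the standard `json` module is enough. The sketch below rebuilds the empty-pipeline configuration shown above and writes it to disk. The field check only mirrors the five keys listed in this section (whether Juturna itself treats all of them as mandatory is an assumption), and the temporary directory is used purely so the example leaves no files behind.

```python
import json
import tempfile
from pathlib import Path

# Same content as the empty-pipeline configuration shown above.
config = {
    "version": "0.1.0",
    "plugins": ["./plugins"],
    "pipeline": {
        "name": "my_awesome_pipeline",
        "id": "1234567890",
        "folder": "./running_pipelines",
        "nodes": [],
        "links": [],
    },
}

# Sanity check against the fields described in this section
# (assumed here to be required; the library may be more lenient).
expected_fields = {"name", "id", "folder", "nodes", "links"}
missing = expected_fields - config["pipeline"].keys()
if missing:
    raise ValueError(f"pipeline config is missing fields: {sorted(missing)}")

# Write the file to a throwaway location; use any path you like in practice.
config_path = Path(tempfile.mkdtemp()) / "config.json"
config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
print(f"configuration written to {config_path}")
```

The resulting path is what gets passed to the pipeline constructor in the next step.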

Once this is ready, we can go on and instantiate the pipeline itself:

import juturna as jt


pipe = jt.components.Pipeline.from_json('path/to/config.json')

The pipeline we just created was configured to dump all its produced files and artifacts in the directory specified in the folder field. Here, Juturna will create a subdirectory assigned to the pipeline, and within it as many subdirectories as there are nodes in the pipeline. In our example, we did not have any nodes, so the pipeline subdirectory will only contain the configuration file provided at instantiation. Keep in mind that all these directories are only created when the pipeline is warmed up, but we will get to that later.

At this point, we can get back all the basic creation information from the pipeline:

>>> pipe.name
'my_awesome_pipeline'
>>> pipe.pipe_id
'1234567890'
>>> pipe.created_at
1743494475.7670584
>>> pipe.pipe_folder
'./running_pipelines/my_awesome_pipeline'
>>> pipe.status
{'pipe_id': '1234567890',
 'folder': './running_pipelines/my_awesome_pipeline',
 'self': <PipelineStatus.NEW: 'pipeline_created'>,
 'nodes': {}}
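The created_at value looks like a Unix timestamp (seconds since the epoch, as returned by `time.time()`). That interpretation is an assumption based on the value's shape, not something stated by the library. Under that assumption, a short sketch for turning it into a readable date is:

```python
from datetime import datetime, timezone

# Value copied from the pipe.created_at output above; assumed to be
# seconds since the Unix epoch.
created_at = 1743494475.7670584

created = datetime.fromtimestamp(created_at, tz=timezone.utc)
print(created.isoformat(timespec="seconds"))  # 2025-04-01T08:01:15+00:00
```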

This is great and all, but for now we have just created an empty pipeline with no nodes in it. A real pipeline is supposed to achieve something, and our example does nothing. Let’s go on and create a real pipeline.