Part 1: Let's get started!
==========================
To get started with |bonobo|, you need to install it in a working python 3.5+ environment (you should use a
`virtualenv `_).
.. code-block:: shell-session
$ pip install bonobo
Check that the installation worked, and that you're using a version that matches this tutorial (written for bonobo
|longversion|).
.. code-block:: shell-session
$ bonobo version
See :doc:`/install` for more options.
Create an ETL job
:::::::::::::::::
Since Bonobo 0.6, it's easy to bootstrap a simple ETL job using just one file.
We'll start here, and the later stages of the tutorial will guide you toward refactoring this to a python package.
.. code-block:: shell-session
$ bonobo init tutorial.py
This will create a simple job in a `tutorial.py` file. Let's run it:
.. code-block:: shell-session
$ python tutorial.py
Hello
World
- extract in=1 out=2 [done]
- transform in=2 out=2 [done]
- load in=2 [done]
Congratulations! You just ran your first |bonobo| ETL job.
Inspect your graph
::::::::::::::::::
The basic building blocks of |bonobo| are **transformations** and **graphs**.
**Transformations** are simple python callables (like functions) that handle a transformation step for a line of data.
**Graphs** are a set of transformations, with directional links between them to define the data-flow that will happen
at runtime.
To inspect the graph of your first transformation:
.. note::
You must `install the graphviz software first `_. It is _not_ the python's graphviz
package, you must install it using your system's package manager (apt, brew, ...).
For Windows users: you might need to add an entry to the Path environment variable for the `dot` command to be recognized
.. code-block:: shell-session
$ bonobo inspect --graph tutorial.py | dot -Tpng -o tutorial.png
Open the generated `tutorial.png` file to have a quick look at the graph.
.. graphviz::
digraph {
rankdir = LR;
"BEGIN" [shape="point"];
"BEGIN" -> {0 [label="extract"]};
{0 [label="extract"]} -> {1 [label="transform"]};
{1 [label="transform"]} -> {2 [label="load"]};
}
You can easily understand here the structure of your graph. For such a simple graph, it's pretty much useless, but as
you'll write more complex transformations, it will be helpful.
Read the Code
:::::::::::::
Before we write our own job, let's look at the code we have in `tutorial.py`.
Import
------
.. code-block:: python
import bonobo
The highest level APIs of |bonobo| are all contained within the top level **bonobo** namespace.
If you're a beginner with the library, stick to using only those APIs (they also are the most stable APIs).
If you're an advanced user (and you'll be one quite soon), you can safely use second level APIs.
The third level APIs are considered private, and you should not use them unless you're hacking on |bonobo| directly.
Extract
-------
.. code-block:: python
def extract():
yield 'hello'
yield 'world'
This is a first transformation, written as a `python generator `_, that will send some strings, one after the other, to its
output.
Transformations that take no input and yields a variable number of outputs are usually called **extractors**. You'll
encounter a few different types, either purely generating the data (like here), using an external service (a
database, for example) or using some filesystem (which is considered an external service too).
Extractors do not need to have its input connected to anything, and will be called exactly once when the graph is
executed.
Transform
---------
.. code-block:: python
def transform(*args):
yield tuple(
map(str.title, args)
)
This is a second transformation. It will get called a bunch of times, once for each input row it gets, and apply some
logic on the input to generate the output.
This is the most **generic** case. For each input row, you can generate zero, one or many lines of output for each line
of input.
Load
----
.. code-block:: python
def load(*args):
print(*args)
This is the third and last transformation in our "hello world" example. It will apply some logic to each row, and have
absolutely no output.
Transformations that take input and yields nothing are also called **loaders**. Like extractors, you'll encounter
different types, to work with various external systems.
Please note that as a convenience mean and because the cost is marginal, most builtin `loaders` will send their
inputs to their output unmodified, so you can easily chain more than one loader, or apply more transformations after a
given loader.
Graph Factory
-------------
.. code-block:: python
def get_graph(**options):
graph = bonobo.Graph()
graph.add_chain(extract, transform, load)
return graph
All our transformations were defined above, but nothing ties them together, for now.
This "graph factory" function is in charge of the creation and configuration of a :class:`bonobo.Graph` instance, that
will be executed later.
By no mean is |bonobo| limited to simple graphs like this one. You can add as many chains as you want, and each chain
can contain as many nodes as you want.
Services Factory
----------------
.. code-block:: python
def get_services(**options):
return {}
This is the "services factory", that we'll use later to connect to external systems. Let's skip this one, for now.
(we'll dive into this topic in :doc:`4-services`)
Main Block
----------
.. code-block:: python
if __name__ == '__main__':
parser = bonobo.get_argument_parser()
with bonobo.parse_args(parser) as options:
bonobo.run(
get_graph(**options),
services=get_services(**options)
)
Here, the real thing happens.
Without diving into too much details for now, using the :func:`bonobo.parse_args` context manager will allow our job to
be configurable, later, and although we don't really need it right now, it does not harm neither.
.. note::
This is intended to run in a console terminal. If you're working in a jupyter notebook, you need to adapt the thing to
avoid trying to parse arguments, or you'll get into trouble.
Reading the output
::::::::::::::::::
Let's run this job once again:
.. code-block:: shell-session
$ python tutorial.py
Hello
World
- extract in=1 out=2 [done]
- transform in=2 out=2 [done]
- load in=2 [done]
The console output contains two things.
* First, it contains the real output of your job (what was :func:`print`-ed to `sys.stdout`).
* Second, it displays the execution status (on `sys.stderr`). Each line contains a "status" character, the node name,
numbers and a human readable status. This status will evolve in real time, and allows to understand a job's progress
while it's running.
* Status character:
* “ ” means that the node was not yet started.
* “`-`” means that the node finished its execution.
* “`+`” means that the node is currently running.
* “`!`” means that the node had problems running.
* Numerical statistics:
* “`in=...`” shows the input lines count, also known as the amount of calls to your transformation.
* “`out=...`” shows the output lines count.
* “`read=...`” shows the count of reads applied to an external system, if the transformation supports it.
* “`write=...`” shows the count of writes applied to an external system, if the transformation supports it.
* “`err=...`” shows the count of exceptions that happened while running the transformation. Note that exception will abort
a call, but the execution will move to the next row.
However, if you run the tutorial.py it happens too fast and you can't see the status change. Let's add some delays to your code.
At the top of tutorial.py add a new import and add some delays to the 3 stages:
.. code-block:: python
import time
def extract():
"""Placeholder, change, rename, remove... """
time.sleep(5)
yield 'hello'
time.sleep(5)
yield 'world'
def transform(*args):
"""Placeholder, change, rename, remove... """
time.sleep(5)
yield tuple(
map(str.title, args)
)
def load(*args):
"""Placeholder, change, rename, remove... """
time.sleep(5)
print(*args)
Now run tutorial.py again, and you can see the status change during the process.
Wrap up
:::::::
That's all for this first step.
You now know:
* How to create a new job (using a single file).
* How to inspect the content of a job.
* What should go in a job file.
* How to execute a job file.
* How to read the console output.
It's now time to jump to :doc:`2-jobs`.