Serverless Backend for Data Transformation (Part 1)

This is the first part of my series on Serverless. In this article, I will introduce the fundamental concepts and features of Serverless. Once the fundamentals are covered, we will implement a serverless backend for transforming bank statements using Azure Functions. In the following posts we will grow the solution into a full pipeline for processing bank statements and importing them into online accounting software.

If you are already familiar with Serverless, please feel free to skip ahead to the main part of the article.

Introduction to serverless computing

It is Serverless because you think about servers less.

— Chris Anderson, Senior Program Manager @ Microsoft

What is Serverless? Simply put, Serverless is the execution of code in the cloud. The term itself can be a bit misleading since Serverless involves a ton of servers, but one of its main value propositions is that you do not have to think about those servers. With Serverless you focus on your code and data and leave the rest to the cloud environment, where things like scaling, updates, and the OS are handled for you.

Chronologically, Serverless is the latest evolution of the cloud.

Evolution of the cloud

So, what is Serverless?

Let’s try to come up with a definition without going into unnecessary details. Usually, when we talk about Serverless we have the following features in mind:

  • Abstraction of servers
  • Quick / automatic scaling
  • Sub-second billing

Abstraction of servers means that someone else takes care of the infrastructure - creating, maintaining, and disposing of servers. This includes setting up networking, applying OS updates, patching frameworks, and so on. We do not care how our code is installed on a particular server, nor how that server functions.

Automatic scaling means that the platform will determine when and how many additional resources to bring online to handle the load.

Sub-second billing is probably one of the main reasons to embrace Serverless. With Serverless, cloud providers will only charge for the time serverless code runs. For example, imagine that we have a serverless backend for resizing images. If our backend has been invoked 10 times and each invocation took 1 second, the cloud provider will only charge for the 10 seconds that the code has been running.

To put this into perspective, let's compare that to a more traditional approach where a virtual machine hosts our image-resizing code. With 10 invocations of 1 second each, the provider will charge us a lot more because in this scenario we are paying for the VM to run 24 hours a day. To the provider it does not matter if the code ran 10 or 10 billion times.
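
To make the difference concrete, here is a back-of-the-envelope comparison. The prices below are made-up round numbers used purely for illustration - real rates vary by provider, plan, and memory size.

// Illustrative cost comparison (hypothetical prices, not real provider rates).
const invocations = 10;
const secondsPerInvocation = 1;

// Serverless: pay only for execution time, say $0.00002 per second of execution.
const serverlessCost = invocations * secondsPerInvocation * 0.00002;

// VM: pay for every hour the machine exists, say $0.01 per hour, for a whole month.
const vmCost = 24 * 30 * 0.01;

console.log(serverlessCost.toFixed(5)); // 0.00020
console.log(vmCost.toFixed(2));         // 7.20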

To summarize, Serverless offers the following benefits:

  • Reduced DevOps
  • Cost saving
  • Improved productivity by focusing on the business

How serverless platforms work

I have already mentioned the serverless backend for resizing images. Let's compare such a backend with a traditional one that runs on a regular virtual machine.

Traditional vs. Serverless infrastructure

As we can see from the illustration, with Serverless we do not own a server - it is created whenever it is needed and disposed of when requests are no longer being received. Of course, there are a lot of details around the creation and termination of servers which I will leave for another article.

Use cases for Serverless

Let's now take a look at scenarios where Serverless is a natural fit. As a rule of thumb, Serverless fits nicely anywhere we need to respond to events, although it is certainly not limited to that. Here are some common examples:

Serverless scenarios

The business problem

The backbone of this post is a business need that I encountered where developing a serverless solution was a great fit.

Companies usually use some sort of software for financial reporting to keep an eye on sales, revenue, costs, and other important metrics. To do that, businesses usually connect their bank accounts to one of the readily available online accounting solutions. There are cases, though, where the software does not integrate with a particular bank.

When integration is lacking, you have two options - enter data manually or somehow bridge the gap between your bank and the software. Option two is the preferred one and it usually looks like this:

  1. Export statements from the bank accounts to files (Excel or another format).
  2. Format the data to match the format expected by the accounting software.
  3. Import the data into the software.

The steps are not that many, but they are definitely time-consuming and error-prone. Our task is to automate this manual process to save time and eliminate any chance of human error.

The solution

Obviously, there are many viable solutions here, one of them being a serverless pipeline that deals with exporting, formatting, and importing bank statements.

Serverless data transformation workflows

Serverless is particularly fitting here because the pipeline needs to run only sporadically - whenever a new statement is exported which usually happens at the end or at the beginning of each month.

Our focus for the remainder of this article will be the process in the middle which is the most important one. As I mentioned in the beginning, we are going to utilize Azure Functions to run our serverless code. Let’s get started.

Introduction to Azure Functions

In a nutshell, Azure Functions is a platform for executing code in the cloud.

Alternatives to Azure Functions are available from all other major cloud providers - Google Cloud Functions from Google and AWS Lambda from Amazon. The good news is that many of the concepts that we are discussing in this article are applicable across vendors.

Azure Functions provides the following features:

  • Choice of language (C#, JavaScript, F#, Java, etc.)
  • Flexible development
  • Automatic scaling
  • Pay-per-use pricing model
  • Open source
  • Based on Azure App Service

When working with Azure Functions we often talk about the so-called Function App. A Function App is a logical grouping of code, deployment, configuration, and everything in between needed to build, deploy, and run serverless code.

Since we are just starting with Azure Functions, we will limit ourselves to working locally. Once we are ready with the code, we will move it to the cloud.

Creating a local Function App project

The first thing we need to do is install Azure Functions Core Tools - a CLI that enables creation, testing, debugging, and deployment of Azure Functions. There are two versions of the CLI and we are going to use the newer one - v2. Let's install the CLI using npm.

npm i -g azure-functions-core-tools@core --unsafe-perm true

Alternative methods for installing the CLI are listed on Github.
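
To verify that the installation succeeded, you can ask the CLI for its version - any 2.x version will do for this article.

func --version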

Once installed, we can use the CLI to create a project which will host our serverless code. Here is how:

func init DataTransform

You will be prompted to choose a language/framework - please choose Node as we will be using JavaScript for the rest of the article. This will create a folder named DataTransform to host the project. The next thing to do is to create a function.

cd DataTransform
func new --template "Http Trigger" --name TransformBankStatement

This command will create a function named TransformBankStatement which uses an HTTP trigger. Triggers are a fundamental concept in Azure Functions since they determine how functions are invoked. By choosing the HTTP trigger we enable our function to be invoked by an HTTP request. Many other triggers are available as well. Just to give an example, there is a Timer trigger which is used to create a function that runs on a schedule - for example, every hour.
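
To make the idea more tangible, here is a minimal sketch of what a timer-triggered function could look like - it is not part of our project, and the binding name myTimer and the hourly schedule are just illustrative assumptions.

// Hypothetical timer-triggered function (not used in this project).
// Its function.json would declare a "timerTrigger" binding named "myTimer"
// with a CRON-style schedule such as "0 0 * * * *" (once every hour).
module.exports = function (context, myTimer) {
    context.log('Timer fired at', new Date().toISOString());
    context.done();
};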

Now that we have a function, let’s run it.

# In folder DataTransform
func start

func start launches the so-called Functions host, which allows us to run our code locally as if it were running in the cloud. Once you run func start, you should see output very similar to this:

Listening on http://0.0.0.0:7071/
Hit CTRL-C to exit…

Http Functions:

TransformBankStatement: http://localhost:7071/api/TransformBankStatement

[8/2/2018 2:38:45 PM] Host started (29487ms)
[8/2/2018 2:38:45 PM] Job host started

In order to test our first serverless function just open a browser and send a request to http://localhost:7071/api/TransformBankStatement?name=World.

Project Structure

The project that was just created should look like this:

DataTransform
 | - host.json
 | - local.settings.json
 | - TransformBankStatement
 | | - function.json
 | | - index.js

The file host.json contains global configuration options that affect all serverless code in a Function App project. You can find detailed information about all available options in the official documentation.

The file local.settings.json hosts connection strings and other settings that are used only for local development and have no effect once the code is deployed to the cloud.

Every Function App project can contain one or many functions - in our case we have the TransformBankStatement function. At a minimum, every function is composed of function.json and index.js.

The index.js file hosts the code that will be executed whenever the serverless function is invoked, while the function.json file contains configuration metadata for the respective function - for instance, this is where the trigger of the function is specified.
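
For reference, the function.json of our HTTP-triggered function should look roughly like the following - the exact file generated by the CLI may differ slightly between versions.

{
  "disabled": false,
  "bindings": [
    {
      "authLevel": "function",
      "type": "httpTrigger",
      "direction": "in",
      "name": "req"
    },
    {
      "type": "http",
      "direction": "out",
      "name": "res"
    }
  ]
}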

Let’s now take a look at the code itself.

Index.js

As we already know, this file is the entry point of our serverless code.

module.exports = function (context, req) {
    context.log('JavaScript HTTP trigger function processed a request.');

    if (req.query.name || (req.body && req.body.name)) {
        context.res = {
            // status: 200, /** Defaults to 200 **/
            body: "Hello " + (req.query.name || req.body.name)
        };
    }
    else {
        context.res = {
            status: 400,
            body: "Please pass a name on the query string or in the request body"
        };
    }
    context.done();
};

Let's now go over the most important aspects of this code.

In order to have a working function, it must be exported using module.exports.

Every function has a required parameter called context. This is the so-called context object, which acts as a gateway to the runtime environment. Through the context we can access the input and output parameters of our function.

There are two important methods on the context object - context.done() and context.log(). The first one tells the runtime that our function has completed its execution, and we want to call it in every function. The second one speaks for itself.

Apart from the required context parameter, our function can have additional parameters that represent input data. In our case there is one named req, which represents the HTTP request that invoked our function. As the code suggests, we can use req to check whether the request contains a parameter called name and, if it does, read its value.

The last important aspect is how serverless functions return data. Since we are dealing with HTTP, our function is expected to return an HTTP response. Using context.res we create a response object representing that HTTP response; it has body, headers, status, and isRaw fields. The most commonly used ones are body (the payload we are returning) and status (the HTTP status code).
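
For instance, a fully spelled-out response could look like the snippet below - a small illustration rather than code from our project.

// Illustrative response object; status defaults to 200 if omitted.
context.res = {
    status: 200,
    headers: { 'Content-Type': 'text/plain' },
    body: "Hello"
};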

I think we are now ready to start implementing our serverless backend for transforming bank statements. For in-depth information on using Azure Functions with JavaScript feel free to check out the Azure Functions JavaScript developer guide.

Creating a backend for data transformation

Let's quickly recap what we set out to do at the beginning of the article. The goal is to create a service that can transform bank statements into a specific format which can then be imported into the accounting software.

Here is a sample bank statement (some fields have been removed for brevity):

| datetime            | reference        | amount  | dtkt | trname                          |
|---------------------|------------------|---------|------|---------------------------------|
| 11/08/2018 11:01:07 | 001FTЗЗ182121504 | 4000.00 | K    | Credit received - invoice 12345 |
| 15/08/2018 00:00:00 | 001ЗЗMRBGNL55001 | 0.60    | D    | Account maintenance fee         |
| 15/08/2018 00:00:00 | 001ЗЗMRBGNL55001 | 1.40    | D    | Account maintenance fee         |
| 16/08/2018 12:31:20 | 001AЗЗF182149938 | 25.96   | D    | POS merchant sale - Visa        |
| 18/08/2018 11:46:57 | 001AЗЗ4182122509 | 200.00  | D    | ATM withdrawal                  |
| 18/08/2018 11:46:57 | 001AЗЗ4182122509 | 1.06    | D    | ATM withdrawal fee              |

This is the raw data we get from the bank. The format it must be transformed to is quite similar, with the following notable differences:

  • The datetime field must be in "YYYY-MM-DD" format.
  • The amount field must have a minus sign if the entry represents a debit. In the original format, debit vs. credit is determined by the dtkt field.

We are aiming at the following format after the transformation:

| date       | amount_signed | reference        | details                         |
|------------|---------------|------------------|---------------------------------|
| 2018-08-11 | 4000          | 001FTЗЗ182121504 | Credit received - invoice 12345 |
| 2018-08-15 | -0.6          | 001ЗЗMRBGNL55001 | Account maintenance fee         |
| 2018-08-15 | -1.4          | 001ЗЗMRBGNL55001 | Account maintenance fee         |
| 2018-08-16 | -25.96        | 001AЗЗF182149938 | POS merchant sale - Visa        |
| 2018-08-18 | -200          | 001AЗЗ4182122509 | ATM withdrawal                  |
| 2018-08-18 | -1.06         | 001AЗЗ4182122509 | ATM withdrawal fee              |

Let’s now turn to the code.

Using NPM packages

There is no problem whatsoever with using npm packages in our serverless code. For the time being we will use only one such package, which will help us deal with dates.

# In folder DataTransform  
npm init
npm install moment --save
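
To get a feel for what moment will do for us, here is the exact kind of conversion we will rely on (a quick illustration, assuming moment is installed as shown above).

const moment = require('moment');

// Parse a date in the bank's DD/MM/YYYY format and re-format it.
moment("11/08/2018", "DD/MM/YYYY").format("YYYY-MM-DD"); // "2018-08-11"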

Module for transforming bank statements

Here is the code that actually performs the transformation we need. Let's put it in a module by creating a file called "bank-statement-transform.js" in the folder "TransformBankStatement".

const moment = require('moment');

const columnIndices = {
    "datetime": 0,
    "amount": 2,
    "dtkt": 3,
    "reference": 1,
    "transactionName": 4,
}

function transformStatement(bankStatement) {
    let newLines = bankStatement
        .split("\n")
        .slice(1)
        .filter(line => line != "")
        .map(line => transformStatementLine(line));

    let transformedStatement = createHeader() + newLines.join("\n"); 

    return transformedStatement;
}

function transformStatementLine(line) {
    let columns = line.split(",");
    let dateTime = parseDateTime(columns[columnIndices.datetime]);
    let debitCredit = columns[columnIndices.dtkt];
    let amount = parseFloat(columns[columnIndices.amount]);
    let amountSigned = debitCredit.toLowerCase() == "d" ? amount * -1.0 : amount;
    let transactionName = columns[columnIndices.transactionName];
    let reference = columns[columnIndices.reference];

    let newLine = [
        dateTime.format("YYYY-MM-DD"),
        amountSigned,
        reference,
        transactionName
    ].join(",");

    return newLine;
}

function parseDateTime(dateTime) {
    var datePart = dateTime.split(" ")[0];
    return moment(datePart, "DD/MM/YYYY");
}

function createHeader() {
    return "date,amount_signed,reference,details\n";
}

module.exports = transformStatement;

This code does not depend in any way on Azure Functions and can be used in any other service - it is our business logic. The main function, transformStatement, accepts a bank statement as a string and outputs the transformed statement as a string. moment is used to convert the dates into the required format.
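
Because the module does not depend on Azure Functions, it is easy to sanity-check it in isolation. Here is a quick example of how it could be exercised from a small Node script placed in the project root (the script itself and its location are just an assumption for illustration).

// check.js - hypothetical standalone test of the transformation module.
const transformStatement = require('./TransformBankStatement/bank-statement-transform');

const sample =
    "datetime,reference,amount,dtkt,trname\n" +
    "15/08/2018 00:00:00,001ЗЗMRBGNL55001,0.60,D,Account maintenance fee\n";

console.log(transformStatement(sample));
// date,amount_signed,reference,details
// 2018-08-15,-0.6,001ЗЗMRBGNL55001,Account maintenance fee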

Using the business logic in our serverless function

If you remember, index.js is the entry point of our serverless code. Here is how we can integrate the business logic into the function.

const transformStatement = require('./bank-statement-transform');

module.exports = function (context, req) {
    if (req.body) {
        let statement = req.body.toString();
        let transformedStatement = transformStatement(statement);

        context.res = {
            body: transformedStatement,
            headers: {
                'Content-Type': 'text/csv'
            }
        };
    }
    else {
        context.res = {
            status: 400,
            body: "Please pass a file for processing"
        };
    }

    context.done();
};

First, we must import the code that does the data transformation - this is the transformStatement() function from bank-statement-transform.js.

const transformStatement = require('./bank-statement-transform');

Secondly, our serverless code must check whether the HTTP request contains a body. If it does, we expect it to contain the bank statement that must be transformed, and we pass it to transformStatement().

Finally, we produce an output using the response object, which contains the transformed bank statement.

context.res = {
    body: transformedStatement,
    headers: {
        'Content-Type': 'text/csv'
    }
};

In case the HTTP request does not have a body, we return an error response. That is it! The serverless code is ready to be tested and deployed to the cloud.

Running the code

We can run the code locally by executing func start. Just as before, the function is available at http://localhost:7071/api/TransformBankStatement, but this time we need a more complex request. We can use various tools such as curl, Postman, or something else. I will use Postman to simulate an HTTP POST request with a sample bank statement; a curl alternative is shown after the expected results below.

Steps for running the code

Here is some test data:

datetime,reference,amount,dtkt,trname
11/08/2018 11:01:07,001FTЗЗ182121504,4000.00,K,Credit received - invoice 12345
15/08/2018 00:00:00,001ЗЗMRBGNL55001,0.60,D,Account maintenance fee
15/08/2018 00:00:00,001ЗЗMRBGNL55001,1.40,D,Account maintenance fee
16/08/2018 12:31:20,001AЗЗF182149938,25.96,D,POS merchant sale - Visa
18/08/2018 11:46:57,001AЗЗ4182122509,200.00,D,ATM withdrawal
18/08/2018 11:46:57,001AЗЗ4182122509,1.06,D,ATM withdrawal fee

If everything is working fine, the results should be as follows:

date,amount_signed,reference,details
2018-08-11,4000,001FTЗЗ182121504,Credit received - invoice 12345
2018-08-15,-0.6,001ЗЗMRBGNL55001,Account maintenance fee
2018-08-15,-1.4,001ЗЗMRBGNL55001,Account maintenance fee
2018-08-16,-25.96,001AЗЗF182149938,POS merchant sale - Visa
2018-08-18,-200,001AЗЗ4182122509,ATM withdrawal
2018-08-18,-1.06,001AЗЗ4182122509,ATM withdrawal fee
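
If you prefer the command line over Postman, a plain curl call does the job as well - assuming the test data above is saved in a file called statement.csv next to where you run the command.

# Send the sample statement to the locally running function
curl -X POST -H "Content-Type: text/csv" --data-binary @statement.csv http://localhost:7071/api/TransformBankStatement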

Seems like the function is running as expected - let’s publish it to the cloud.

Deployment

In order to publish the code to Microsoft Azure, we need access to the platform.

Creating Microsoft Azure account

To continue beyond this point, an active account is needed. Luckily, there are trial accounts that can do the job without costing you a dime. You can go for either a Student or a Standard free account.

Creating a Function App

Once we have access, we need a Function App resource that will host and execute the code. If you are just starting with Azure Functions, I would recommend checking this article, which explains the whole process of creating a Function App.

The next step is to publish the code. There are many options here, but I will showcase the most straightforward one - using a Git repository. I have already published the latest version of our code at https://github.com/milannankov/serverless-data-transformation-part1.

Publishing the code

Let’s navigate to the Function App that was created in the previous step. This is what your screen should look like:

Function App Home

In order to connect the Function App to the Git repository, navigate to Platform Features. There you will find a link to Deployment Center. There are a lot of options there - VSTS, GitHub, Bitbucket, etc. Scroll down and choose External. Continue and enter https://github.com/milannankov/serverless-data-transformation-part1.git as the repository and master as the branch.

Function App Deployment Setup

Once the deployment is configured, Azure will fetch and build your code. Soon after, there should be an entry with a success status confirming that the code has been deployed.

Deployment Ready

Our data transformation code should be running in the cloud now. If we go back to the main screen of our Function App, it should be evident from the tree structure on the left that our TransformBankStatement function has been deployed.

To test the function, copy its URL by selecting the function and then clicking Get function URL. Paste the cloud URL into Postman to verify that everything is working fine.

Function Address

Wrapping up and next steps

Let’s briefly summarize what we have achieved. We have created a Serverless backend that handles HTTP requests containing bank statements and transforms them into a format that can be consumed by a specific accounting software.

Of course, what we set out to do is automate the whole process of transferring data to the accounting software. We should not expect the person taking care of the accounting to know how to call our serverless API. In the follow-up articles we will build on today's effort and make the calls to the data transformation API happen transparently to the business users.

Something very important to keep in mind is that software solutions built around the Serverless architecture favor creating small, self-contained functions which are later combined to create more sophisticated workflows that model business processes. We aim to achieve the same thing with the pipeline we are building.

Thank you for your attention. As always I am available to answer any questions or comments.

Published 6 Nov 2018
