My Lock-down Project: A Data Management Thing

Visit the GitHub repo for this project

I tend to write code when I get bored.

I figure it’s been two years since my last blog post so maybe I ought to actually finish one of the several half-finished drafts I have sitting around and publish it. Or maybe I’ll just write a new blog post talking about the little personal project I’ve been working on for the past few months.

Yep, that sounds more fun, I’ll do that.

So, like most of us I’ve had some extra time on my hands since this COVID situation started, and I happened to get an idea for a new project. I’m terrible at naming things (see: Contabulo), so I just gave up on trying to be clever and went with a boring, generic, somewhat descriptive name for this project: Open Data Management Platform (OpenDMP). Open, because it’s open source, and “Data Management Platform” because it’s a platform that… manages data.

Motivation

The motivation for this little project was to create a generic data management system that can serve as a common starting point for new data projects and help curb the proliferation of bespoke solutions being implemented for every single organization, project, and/or program. In short: to help engineers stop reinventing the wheel for every project involving data processing and management. I think there's a lot of common functionality, at least for small to medium-scale needs. (Obviously, there's no getting around the need for custom solutions for organizations dealing with highly specialized data/use cases or truly massive volumes of data, but such organizations probably have the time, inclination, and in-house talent necessary to build whatever they need.)

I’d like to think I’m targeting users that have been manually processing data with scripts on their own workstations and are in need of something a little bit more automated and robust.

How Does it Work?

The idea is to allow users to design simple data flows that ingest data, perform transformations and/or run data through different scripts and external tools, and export the data elsewhere (could be the file system, object storage, a database, etc).

A basic dataflow

A user constructs a flow made out of processors organized into "phases." Processors can be of various types, for example: Ingest, for bringing data into the system, Script, for executing basic transformations on data, Collect, for collecting results and exporting them to external data stores, and more. Phases are meant to help the user lay out the dependencies between processors (i.e., a flow should run in phase order), although it should be noted the system doesn't necessarily obey this order if a processor lacks a hard dependency on a previous phase.
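To make the structure concrete, a flow of phased processors might be modeled roughly like this. This is a hedged sketch in Java; the record names and fields are my own invention for illustration, not OpenDMP's actual data model:

```java
import java.util.List;

public class FlowModel {
    // Illustrative processor types; OpenDMP's actual set differs.
    enum ProcessorType { INGEST, SCRIPT, COLLECT }

    // A processor lives in a phase and may depend on earlier processors.
    record Processor(String id, ProcessorType type, int phase, List<String> dependsOn) {}

    record Dataflow(String name, List<Processor> processors) {}

    public static void main(String[] args) {
        Dataflow flow = new Dataflow("example", List.of(
            new Processor("ingest-1", ProcessorType.INGEST, 1, List.of()),
            new Processor("upcase", ProcessorType.SCRIPT, 2, List.of("ingest-1")),
            new Processor("collect-1", ProcessorType.COLLECT, 3, List.of("upcase"))));
        System.out.println(flow.processors().size()); // prints 3
    }
}
```

The key idea is that the phase number is a layout hint, while `dependsOn` carries the hard edges the scheduler actually honors.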

Here’s a basic example of a Script Processor:

Script processor using Python

This is a super-simple example that just takes in a string and converts it to upper case. As you can see, this processor allows you to write an arbitrary script in Python that modifies your data and passes it along to the next processor in the flow (As an aside, the Script Processor supports Clojure as well – I couldn’t resist adding support for my favorite language). The flow interface is also nice enough to let you know when something goes wrong:

Something has gone wrong in a Script Processor

When data is collected, it is exported to an external data source, and an entry is made into a “Collection” of the user’s choice:

A basic Collection
An individual dataset

Currently only a few rudimentary types of processors are available, but I definitely plan to expand this. It’s a design goal that OpenDMP be expandable with Plugin processors, for example.

Technology

OpenDMP architecture (Note: not 100% implemented yet)

If I haven’t lost you yet, read on as I delve into the architecture and technical details of OpenDMP:

First, OpenDMP is a system made up of various services running in a containerized environment. It does not make any assumptions about your container orchestration solution (there are sample docker-compose files in the repo), since that choice should be dictated by your workload, infrastructure, and ops expertise. Also, beyond the core services required to run a basic system, users will probably want to run additional services (databases, external tools, etc.) that interact with OpenDMP within their environment.
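To make that concrete, a compose file for the core services might look roughly like this. This is a hypothetical sketch with assumed image names and versions; see the repo's actual sample docker-compose files for the real configuration:

```yaml
# Hypothetical sketch only; the repo's sample compose files are authoritative.
version: "3.8"
services:
  frontend:
    image: opendmp/frontend            # re-frame UI
  dataflow-service:
    image: opendmp/dataflow-service    # API + owns the MongoDB data
    depends_on: [mongo, pulsar]
  processor-service:
    image: opendmp/processor-service   # runs dataflows via Camel
    depends_on: [redis, pulsar]
  mongo:
    image: mongo:4
  redis:
    image: redis:6
  pulsar:
    image: apachepulsar/pulsar:2.6.0
    command: bin/pulsar standalone
  keycloak:
    image: jboss/keycloak              # SSO / identity management
  haproxy:
    image: haproxy:2.2                 # reverse proxy + SSL termination
    ports: ["443:443"]
```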

Core Services

The frontend consists of a re-frame application written in ClojureScript. This was my first project using re-frame and ClojureScript, and I found it to be a huge win over JavaScript. State management in particular was much less of a problem than I'd experienced on previous projects. The astute observer will probably notice the use of Material-UI as well. I found integrating Material-UI components with re-frame to be completely painless.

The backend services are written in Kotlin using Spring Boot. The Dataflow Service exposes an API for the frontend and owns the MongoDB database that stores dataflow, processor, and collection information. The Processor Service (also Kotlin + Spring Boot) is where the data flows are actually “run.” It relies heavily on Apache Camel to control the flow of data through the system. It also uses a Redis cache to keep track of things as it executes dataflows. If you’re wondering how the scripting support works, Clojure can be evaluated by the Clojure interpreter right on the JVM, whilst Python support is achieved using Jep, which embeds CPython in the JVM via JNI (I thought this a better solution than Jython).

Communication between the various backend services utilizes Apache Pulsar, which is starting to get a lot of attention as a Kafka alternative. Before I came across Pulsar in my research, I was torn between Kafka and using a traditional message broker such as RabbitMQ. Pulsar makes an attempt at combining the functionality of both, and at least so far appears to do a fairly decent job of it.

OpenDMP uses OpenID Connect for auth, with Keycloak, a widely used SSO and identity management solution, handling identity and access management. In fact, user management is completely absent from the rest of the OpenDMP system as well as the frontend. From a development perspective, this simplified things tremendously. The use of an industry-standard auth mechanism should also help in situations where OpenDMP needs to be integrated into an existing enterprise architecture.

Everything sits behind HAProxy, which handles reverse proxying and SSL termination.

How a Dataflow Runs

So, what happens when you build a dataflow in the UI and click the “enable” button? Well, after the request is sent to the Dataflow API, the following sequence is executed:

  1. The Dataflow Service loads the Dataflow and creates a "Run Plan." It does this by analyzing the dataflow and mapping out the dependencies between processors to determine the order in which different processors need to be executed, and which can be run in parallel. Note: "Phases" are ignored here in favor of finding the most efficient way to execute the dataflow – phases are a device of the UI/UX to help users lay out their flows.
  2. The Run Plan is sent to the Processor Service. It contains everything the Processor Service needs to run the dataflow.
  3. The Processor Service goes through the Run Plan and generates Camel routes and begins executing them immediately.
  4. Errors and/or Collection results are sent back to the Dataflow Service via Pulsar.
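Step 1 amounts to a topological sort of the processor dependency graph. Here is a minimal sketch of the idea in plain Java (my own illustration, not OpenDMP's actual code): processors whose dependencies are all satisfied get grouped into "waves" that can run in parallel.

```java
import java.util.*;

public class RunPlanner {
    // Given processor -> set of processors it depends on, emit "waves":
    // every processor in a wave has all of its dependencies satisfied by
    // earlier waves, so the members of one wave can run in parallel.
    static List<List<String>> plan(Map<String, Set<String>> deps) {
        Map<String, Set<String>> remaining = new HashMap<>();
        deps.forEach((k, v) -> remaining.put(k, new HashSet<>(v)));
        List<List<String>> waves = new ArrayList<>();
        while (!remaining.isEmpty()) {
            List<String> wave = new ArrayList<>();
            for (var e : remaining.entrySet()) {
                if (e.getValue().isEmpty()) wave.add(e.getKey());
            }
            if (wave.isEmpty()) throw new IllegalStateException("cycle in dataflow");
            for (String p : wave) remaining.remove(p);
            for (Set<String> s : remaining.values()) wave.forEach(s::remove);
            Collections.sort(wave); // stable output for the example
            waves.add(wave);
        }
        return waves;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> deps = Map.of(
            "ingest-a", Set.of(),
            "ingest-b", Set.of(),
            "script-1", Set.of("ingest-a", "ingest-b"),
            "collect-1", Set.of("script-1"));
        System.out.println(plan(deps));
        // [[ingest-a, ingest-b], [script-1], [collect-1]]
    }
}
```

The real Run Plan carries much more (processor configs, routing info), but the dependency-analysis core is essentially Kahn's algorithm like this.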

It sounds simpler than it is, I promise 🙂

Future Plans

Quite a bit of the functionality I've got planned hasn't been done yet, of course, and OpenDMP is definitely not production-ready at the time of this writing. There are still critical features to be added and much proving out to do on various workloads. Big features still to be added include:

  1. Aggregation Processors – bringing divergent flow paths together again.
  2. Plugin processors – I’d like to see common Data Science/ML tools wrapped in OpenDMP API-aware services for clean integration in dataflows.
  3. UI improvements – In particular, better searching, sorting, etc. of Data sets
  4. Notification support – Send notifications to people when things go wrong (email, et al).
  5. Better scaling – While multiple instances of the processor services can be run to better support running many dataflows, within a dataflow it would be helpful to create worker nodes for the Processor service to distribute individual processor tasks using Camel and Pulsar.
  6. And of course, more ingest and export data stores need to be supported.

Final Thoughts

This has been a challenging and fun project to build so far; I hope others will eventually find it useful. Also, it's open source, so contributions are definitely welcome!

Golang – The Okayest Programming Language

Note: This is a repost (and expansion) of my Quora answer here.

Someone on Quora recently posted a question asking for developers to discuss their negative experiences with Golang.  That got me thinking about my past (and current) experiences with the language, and I ended up replying with a list of pros and cons:

It’s hard to hate Golang. It’s also hard to love Golang. It has got to be the okayest programming language I’ve ever used.

Things I like about Go:

  • The development environment: Simple to get set up and use. I usually work in Javaland and have had to suffer through Ant/Ivy, Maven, sbt, et al., so Go's environment feels much more thought out and coherent by comparison.
  • No need for an IDE – Java development practically requires the use of an IDE to be productive (as much as I like IntelliJ, I resent having to keep a heavyweight IDE running all the time). I routinely edit Go source files in vim and compile/run/test all from the command line. Which is nice.
  • It isn't Object Oriented (OOP has its uses, but I hate, and I mean hate, the everything-is-an-object paradigm in Java). Object-Oriented Design (the raison d'être of Java) in particular needs to die.
  • Channels and goroutines. Not a terrible way to do concurrency. Nicer than managing threads in Java. Not as powerful as Clojure or Akka (but definitely simpler, at least in the case of Akka).

Things I don’t like about Go:

  • I'm a big fan of functional programming (I use Clojure or Scala on the JVM whenever I can get away with it). Golang isn't a functional programming language. This is more a problem with me than Golang, I suppose 🙂
  • Lack of some advanced programming constructs. Generics, in particular, are something I miss, especially given my next point:
  • The type system. It's clunky, limiting, and just feels so antiquated. Easily the worst type system of any language I've used in the last 20 years. Ada, for example, has the best static type system I've ever seen; Scala's is pretty decent as well. This is the main reason I only use Go for small utilities and can't stand programming large projects with it.

I don't have a lot of optimism anymore about the future of Golang as a general-purpose language. It seems to be carving out a niche for itself in network services (it's become especially popular with fans of 'microservices'), but I can't see it expanding much beyond that. It's not really suitable for use as a systems programming language (better to use something like Rust or even good ol' C), and I certainly don't see it displacing Java in enterprise environments. As for myself, I'll continue to use it here and there when it suits the problem I'm working on, but it's not going to be my "go to" language for general use (Clojure fills that role for me at the moment).

Building a SaaS product with Clojure

The Idea

Like many software developers, I spend a lot of my spare time on side projects. Some side projects are purely for my own edification: learning a new language (such as my current favorite, Clojure), or exploring a new technology or problem domain. However, a great many of the project ideas that pop into my head at random intervals tend more toward the entrepreneurial. These sorts of ideas either originate internally (usually the most insane or impractical kind, since I tend to live inside my own head a little too much) or result from observing (or experiencing) problems for which a good solution is not immediately evident. The former can usually be disposed of quickly by subjecting them to a few thought experiments, while the latter tend to be more persistent, begging to be investigated further.

So, enter my current obsession with creating an enterprise wiki that isn't painful to use. I've used many expensive "Knowledge Management" and corporate wiki tools over my career. Some are better than others, but I have yet to find one that I actually like. Either they have serious problems in the information search and discoverability department (e.g., SharePoint), or they seem to be designed around the idea that you want to spend your time interacting with the application itself rather than finding the information you need before going back to your actual job-related tasks (e.g., various tools with overwrought interfaces sporting hundreds of knobs and buttons that consume half of the available screen real estate).

I want a KM tool in which content is front-and-center and the interface stays out of the way. I also want a functional search capability and a way to browse information and content when I don't already know exactly what I'm looking for. My first attempt at this ended up being basically a traditional wiki with perhaps a nicer interface. The voices in my head laughed at me, saying "So, what's different about this thing? You've created nothing new. Forget it and try again." Eventually a new concept presented itself to me, which I've named Contabulo: a wiki with a much more visual interface, inspired by card-based PM tools such as Trello, but geared toward being a knowledge management and collaboration tool. The basic idea for the interface came to me when I observed my wife using Pinterest: "There should be a productivity tool that looks kind of like that." Earlier versions of Contabulo even used a masonry-like layout, but I quickly realized that was a poor fit and switched to a grid layout based on CSS Grid (which happened to gain majority browser support just in time to be usable).

Given that I was tackling this project alone, in my spare time, I needed to select tools and technologies that would help me maximize my personal productivity. While I'm paid good money to develop in Java at my day job, I have to say Java is absolutely atrocious for individual productivity. I could write at length about my problems with Java, but suffice it to say that a high-ceremony language that forces you into designing your software architecture in terms of class hierarchies (i.e., OOD) does a pretty good job of sucking the joy (and speed) out of software development. When I'm beginning a new personal project where I have total freedom of choice, I generally gravitate toward a functional programming language. Being both a huge Lisp fan and also very (albeit somewhat unwillingly) familiar with the Java ecosystem, I chose Clojure to implement the backend services.

Architecture

Contabulo System Architecture

My background in enterprise software development is clearly evident in the system architecture 🙂 My favorite relational database, PostgreSQL, serves as the primary data store. Most of the CRUD (Create, Read, Update, Delete) functionality is taken care of by the "API Service", which is built around the excellent compojure-api. Now, when it comes to working with a relational database, I'm generally used to being forced to do things the Java way and use an ORM like Hibernate or EclipseLink. With a functional programming language, however, an ORM doesn't make sense (not that an ORM makes sense in any case, but I digress), and I found HugSQL to be an awesome way to work with a SQL database. I'll take writing SQL queries any day over composing JPA/Hibernate criteria queries. Additionally, rows are written to and read from the database as simple Clojure maps and vectors, rather than having to define domain objects for every table in the database.
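To give a flavor of the HugSQL approach (a hypothetical query for illustration, not from Contabulo's actual source): queries live in plain .sql files annotated with special comments, from which HugSQL generates ordinary Clojure functions.

```sql
-- :name get-board-by-id :? :1
-- :doc Fetch a single board as a Clojure map
SELECT * FROM boards
WHERE id = :id
```

A `(hugsql.core/def-db-fns "boards.sql")` call then defines `get-board-by-id` as a regular Clojure function taking a db spec and a params map such as `{:id 42}`, returning the row as a plain map.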

The contabulo-auth service is responsible for verifying user login credentials and issuing tokens. Tokens are generated and verified using the buddy-sign library.

Elasticsearch powers the search functionality, which is provided by the "Search Service" and kept in sync with the main database via an "Index Service." I actually opted to use Elastic's own Java SDK for interacting with Elasticsearch, writing Clojure wrappers as necessary. (One of the many benefits of running on the JVM is the wealth of pre-existing Java libraries for just about anything you can think of; the only downside is that they're written in Java and you'll have to wrap them in your language of choice.)

Synchronous communication between services is accomplished using RESTful HTTP interfaces, while asynchronous tasks are dispatched via RabbitMQ. The contabulo-mail service, for example, is responsible for sending email notifications to users and receives all of its tasks via a RabbitMQ queue. For communicating with RabbitMQ from Clojure services, the Langohr library is used.

The frontend is a rich JavaScript web application (I decided to stick with plain JavaScript/ES6 rather than go with ClojureScript) written using the mithril.js framework. Mithril is a great little client-side framework that encourages a very functional style of JavaScript programming (truth be told, you can be quite functional with JavaScript, but it requires a lot of discipline).

Clojure (and FP) Benefits

Thanks to Clojure, I have backend services composed of side-effect-free functions and immutable data structures. Whole classes of runtime errors I'm used to seeing in large Java applications are simply absent from Contabulo. This makes iterating on the product and adding new features *so* much easier. Further, 'statelessness' is considerably easier to achieve in Clojure, partly due to its non-OO-ness and partly due to its immutability-by-default design. This means that adding additional instances of each service will be much easier.

I've also found the unit testing experience in Clojure to be vastly superior to Java's. Though I doubt this is due to any inherent superiority of clojure.test over JUnit; rather, it's simply a result of the fact that side-effect-free, deterministic functions are easier to unit test. In any case, it's still a win. As an aside, I've also found Scala to have a much more pleasant testing experience than Java (probably tied with Clojure in that regard).

While the frontend web application was written in ES6, the mithril.js framework's functional style has resulted in a rich web application constructed from modular, de-coupled components that can be reused easily. This makes regressions on the frontend much less likely as well (which is good in my case, because I'm really much more of a backend developer).

I will have to admit to writing one small service in a non-FP language. A small program written in Golang is triggered periodically by a systemd timer to perform housekeeping tasks (moving files around, doing some basic database ops, etc.). This was one case where I didn't want to suffer the overhead of spinning up a JVM every ten minutes, and since the housekeeping service does nothing but I/O operations (which tend to be inherently procedural anyway), I decided to go (pun sort of intended) with something simple.

Conclusion

I've heard some critics say that Clojure is unsuitable for big projects with large teams, usually due to some combination of dynamic typing and difficulty in finding Clojure developers. I'm not sure I'd agree with that assessment. I think liberal use of type hints and tools like schema (and more recently, clojure.spec) can largely mitigate most of the problems caused by dynamic typing, and I'm not convinced the latter issue even exists. Contabulo is the largest project I've done using Clojure (or any Lisp, for that matter), and now that all of the basic architecture and core functionality is in place, I'm quite pleased with how quickly I was able to build it and how easy it is to add new features or make modifications. I think that's a win. Now it remains to be seen how Contabulo succeeds as a product.

On Boards and Cards

Unless you've been living under a rock, you've doubtless noticed the current trend in project management tools involving "boards full of cards." Trello is probably the most well known (and cheapest, being free, which might have something to do with its popularity) of this archetype. Other, more elaborate (and expensive) enterprise tools have implemented "agile" boards as part of their offerings.

Personally, I think most of these tools are well suited for their intended use case: Agile project management. However, as someone who's tried to use Trello as a more, erm, general purpose tool, I've found it lacking (shopping lists? No. Notes? No). Which is fine; it's a specialized tool designed for a particular set of use cases, which just so happen to not align with my needs. Most of these PM tools assume the following:

1.) You are managing projects (duh), which are temporary efforts with a definite start and end date. Thus, boards are meant to be used temporarily.

2.) You're using some sort of Agile (or perhaps Kanban) project management methodology. You may not like Agile, or it may not suit your particular effort.

3.) Long-term knowledge management is out of scope; you'll need a completely separate tool (such as a wiki) for that.

Issues 1 and 3 are related. You want to store lessons learned, institutional knowledge, and other "knowledge" artifacts created during the course of a project. The traditional answer is to make your engineers enter this knowledge into a wiki, as the project board that was so nice for organizing and tracking tasks turns out not to be so great when it comes to organizing and archiving information for long-term use. This, of course, introduces yet another 'chore' for engineers who would rather be building things. It boils down to the fact that taking the time to enter information into a wiki takes people out of their normal workflow. Thus, it is frequently forgotten and/or skipped when deadlines loom. I, for one, have never seen a corporate wiki that was well maintained and kept up-to-date 🙂

So, I think a worthy goal should be to keep information together as much as possible. Place notes, design artifacts, project planning documents, meeting minutes, miscellaneous documentation, etc. in the same system you also place your tasking "cards," thus folding "updating the wiki" into your normal workflow. The trick, of course, is to keep this vast pile of information from turning into an unmanageable, unsearchable, and unusable mess.

That's the idea I've been pursuing with my latest project, which I've named Contabulo.

Half wiki, half board

An example Contabulo project board

From the screenshot, at first blush, Contabulo looks similar to an agile board. Unlike Trello and friends, however, these cards aren't organized into lists. In fact, there's no set ordering or arrangement at all, as the interface assumes (at least on a well-populated board) that you're going to be making use of the search bar to find the cards you seek. The search feature is powered by Elasticsearch on the backend, so it does a fairly decent job (the 'main' database is PostgreSQL). As you can see from the graphic, cards can be assigned different colors, and you can assign a board background image, which I know is pure vanity.

Cards

A Contabulo card with a header image

Contabulo cards are designed to be flexible and can contain anything from a quick note to an entire wiki-esque article. The editing format is Markdown (specifically, CommonMark with a few extensions), and I don't know about you, but I much prefer Markdown to the Wikitext format you'll find in MediaWiki-based wikis 🙂 As you can see from the image above, you can also attach files (including images) and specify a card image to appear in the header.

A card being edited using Markdown – at some point I’ll add a rich editor

In addition to the card body and attachments, you can add special 'content' blocks to a card. As of this writing, checklists are supported, but it is easy to imagine adding many other types of useful content: maps (think directions to events), discussion/comment sections, integrations with third-party apps, etc. You can also add 'tags' to cards; these are optional, but the search index does use them to improve results.

Groups

Contabulo has the concept of 'groups' in order to control access to different boards. Each user can have 'personal' boards, which are either totally private or can be given public read-only access for outsiders. "Group" boards can likewise be visible only to the group, or additionally be read-only to non-group members (one notional use of a public read-only board could be a job board). Each group member can be assigned read-only, editor, or admin-level access.

Managing groups in Contabulo

Tech

For the curious, I'll give a brief overview of the Contabulo "tech stack" (I think that's what the kids are calling it these days). I've been building Contabulo in my spare time (evenings and weekends), since right now I have to keep a day job in order to eat and pay the mortgage. Thus, the need to maximize my personal productivity was definitely a major factor in selecting tools and technologies. The backend is written primarily in Clojure, which is in the Lisp family of languages and runs on the Java Virtual Machine. I chose Clojure because:

  1. I like Lisp and think it's an amazingly productive language
  2. I have a large amount of familiarity with the JVM and the Java ecosystem in general (and there are a *huge* number of helpful Java libraries which you can easily call from Clojure), even if I may not care much for the Java language itself 🙂

I briefly flirted with the idea of writing the backend services in Haskell after working through Learn You a Haskell (great book, btw), but then the fever broke and I got better.

So, I've split up my backend into several Clojure services (not quite as fine-grained as 'microservices' would be, but I don't care), and one little database maintenance/housekeeping utility I wrote in Golang that runs once an hour or so. Interprocess communication largely uses RabbitMQ, though some services call each other over HTTP when synchronous communication is needed. I'm using my favorite webserver, Nginx, as a load balancer (and also for SSL termination), and all of my backend services are stateless, so running multiple instances of each shouldn't be a problem. I've tried to lay the foundation for a scalable architecture, which will hopefully prove necessary. I suppose I could have probably just spun up a Ruby on Rails app much more quickly, but where is the fun in that? 🙂

As I mentioned earlier, the main database is PostgreSQL, and search is powered by Elasticsearch (though a Clojure ‘search’ service sits in front of the Elasticsearch service, since ES, as with any database, shouldn’t be directly exposed to the Internet).

For the web application, I used ES6 with the Mithril microframework and Bootstrap, along with a plethora of smaller JavaScript libraries, of course. I'm not the most experienced or skilled frontend developer in the world, but I think I do OK (when properly incentivized, like when there's no one else to do it).

Conclusion

Contabulo isn't my first attempt at a Software-as-a-Service (SaaS) offering, but I feel pretty good about this one. I find it useful, in any case, and will continue to use it (and keep it updated) for my own needs. My focus right now is to complete testing and squash any show-stopping bugs so I can launch this thing and hopefully start getting some paying users. I have a long list of features I want to implement, but I think Contabulo is far enough along that it's time to start getting user feedback in order to guide its future direction. If you'd like to receive updates (i.e., to be notified when Contabulo is live), please sign up for the mailing list!

Implementing Search-as-you-type with Mithril.js

I've been working on a new project, the front-end of which I'm coding up in ES6 with Mithril.js (using 1.0.x now, after spending the better part of a day migrating from 0.2.x). I wanted to implement "search as you type" functionality, since I'm using Elasticsearch on the back-end for its full-text search capability.

search_box
A simple user search box

It took me a bit of trial-and-error, but I came up with this Mithril component that provides a simple text field and fires a callback once the user stops typing (the delay is configurable):

import m from "mithril";

export let SearchInputComponent = {

    /**
     * attributes:
     * helpText - small text to display under search box
     * callback - callback to be fired when timeout/minchar conditions are met
     * minchars - min number of characters that must be present to begin searching
     * timeout  - idle time in milliseconds before the callback fires (default 1000)
     * inputId  - DOM ID for input field
     * @param vnode
     */
    oninit: function(vnode) {
        vnode.attrs = vnode.attrs || {};
        // Search box caption
        vnode.attrs.helpText = vnode.attrs.helpText || "Enter search text here";
        //Callback to fire when user is done typing
        vnode.attrs.callback = vnode.attrs.callback || function(){};
        //Minimum number of characters that must be present to fire callback
        this.minchars = vnode.attrs.minchars || 3;
        this.timeout = vnode.attrs.timeout || 1e3; //default 1 second for input timeout
        this.inputId = vnode.attrs.inputId || "searchTextInput";
        this.timeref = null;

        this.doneTyping = function(){
            // Read the input value directly from the DOM (no jQuery needed)
            let searchString = document.getElementById(this.inputId).value;
            if (this.timeref && searchString.length >= this.minchars){
                this.timeref = null;
                vnode.attrs.callback(searchString);
            }
        }.bind(this);

        this.oninput = function(event){
            if (this.timeref){
                clearTimeout(this.timeref);
            }
            this.timeref = setTimeout(this.doneTyping, this.timeout);
        }.bind(this);

    },

    view: function(vnode) {
        return m("fieldset", {class: "search-input-box"}, [
            m("input", {type: "text", class: "form-control", id: vnode.state.inputId,
                autofocus: true, "aria-describedby": "searchHelp", oninput: vnode.state.oninput,
                onblur: vnode.state.doneTyping}),
            m("small", {id:"searchHelp", class: "form-text text-muted"}, vnode.attrs.helpText)
        ]);
    }

};

Pretty basic.  In my case, I wanted to avoid firing my search callback (which makes a request to my back-end search service) until a certain number of characters had been entered.

Thoughts On Java 8 Functional Programming (and also Clojure)

After working with Java 8 for the better part of a year, I have to say I find its new "functional" features both useful and aggravating at the same time. If you've used true functional programming languages (or you use them on your own personal projects while being forced to use Java at work), you'll find yourself comparing each of Java 8's new functional constructs with those in, say, Clojure, Scala, or Haskell, and you'll inevitably be disappointed.

Case in point: Java 8 includes an "Optional" container which wants you to treat it like a monad. Now, I find it useful, but of course it's nowhere near as nice as Scala's Option or Haskell's Maybe.

For example, at work I found myself having to refactor an older part of the code base. Attempting to think “functionally”, I decomposed this block of fairly typical Java (simplified, with class and variable names changed, of course):

Map<Truck, TimeAndLocation> arrivals = new HashMap<>();

for (Delivery delivery : route.getDeliveries()) {
    Truck deliveryTruck;
    try {
        deliveryTruck = delivery.getPackage().getTruck();
    } catch (NullPointerException ex) {
        // Log the error
        continue;
    }
    if (deliveryTruck == null || arrivals.containsKey(deliveryTruck)) {
        continue;
    }

    Location truckLoc;
    try {
        truckLoc = deliveryTruck.getLocation();
    } catch (NullPointerException ex) {
        //Log error
        continue;
    }
    arrivals.put(deliveryTruck, new TimeAndLocation(delivery.getDeliveryTime(), truckLoc));
}

Into this:

public Optional<TimeAndLocation> getArrival(Delivery delivery) {
    return delivery.getPackageOpt()
            .flatMap(Package::getTruckOpt)
            .map((t) -> new TimeAndLocation(delivery.getDeliveryTime(), t.getLocation()));
}


public List<TimeAndLocation> getArrivalsForRoute(Route route) {
    Map<Truck, TimeAndLocation> arrivals = new HashMap<>();
    for (Delivery delivery : route.getDeliveries()) {
        Optional<Truck> truck = delivery.getPackageOpt().flatMap(Package::getTruckOpt);
        truck.flatMap((t) -> getArrival(delivery)).ifPresent((a) -> arrivals.put(truck.get(), a));
    }
    return new ArrayList<>(arrivals.values());
}

These two functions have the advantage of being easier to reason about and trivial to unit test individually. Java 8’s Optional monad eliminates the possibility of NullPointerExceptions, so there is no longer an excuse for the evil `return null;` when you can just as easily do `return Optional.empty();` 😉

Now, the bad parts: Optional inherits from Object. That’s it. It’s really just a Java container class. Much nicer (and more useful) would have been a hierarchy of monadic types (Either, an “Exception” monad, etc.) all implementing a common Monad interface. But alas, no.

That said, this code is fairly functional, if ugly.  True functional programming languages make things like this easier to express (and easier to express elegantly).

Off the top of my head, in Clojure you could do something like this:

(defn arrival [delivery] {:datetime (get-dtime-M delivery) :location (get-truck-M delivery)}) 

(defn arrivals [route] (cat-maybes (map #(arrival %) (get-deliveries route))))

Which I think everyone would agree is easier to read. Now, I’m making use of monads here, which aren’t really “built-in” to the language, but Clojure is extensible enough (macros, etc.) that adding them is easy. Personally, I rather like this library: http://funcool.github.io/cats/latest/. So, if we assume our get-dtime-M and get-truck-M functions return Maybe monads (which contain either a ‘Just x’ or a ‘Nothing’), we get the same advantage as Java 8’s Optionals (no null checks littering our code). We can also wait to evaluate the value of our Maybe monads till the end of our processing (the cat-maybes function does that, pulling only the ‘Just’ values from the array of Maybe monads generated by the map function).

In addition to Maybe, The cats library exposes many other useful monadic types (and you can create your own as well) which have no Java 8 counterpart.

Of course, if you really want to explore the full power of monads, I’d suggest learning some Haskell — then come back to Clojure (or Scala, I suppose) when you need to write some real-world software 🙂

C and Go – Dealing with void* parameters in cgo

Wrapping C libraries with cgo is usually a pretty straightforward process. However, one problematic situation I’ve come across recently is dealing with C functions which take a void* type as a parameter. In C, a void* is a pointer to an arbitrary data type.

I’ve been using one C library which allows you to store arbitrary data in a custom tree-like data structure. So naturally, you want to be able to feed a raw interface{} to your wrapper function. Suppose your C library has two functions:

void set_data(void *the_data);
void *get_data();

What should our Go wrapper pass to set_data?  The easy solution is to do this:

func SetData(theData interface{}) {
    C.set_data(unsafe.Pointer(&theData))
}

And this would work, prior to Go 1.6 (and is in fact the solution I had been using in my wrapper — though I don’t really use this feature of the library in question myself). This, however, is dangerous (and also has the obvious problem that calling C.get_data is going to give you an unsafe.Pointer — hope you remember what kind of data you stored). You’re passing a Go pointer to a C function which then stores said pointer, which means the Go garbage collector can no longer manage it. This was considered so inadvisable that it is completely disallowed in Go 1.6 (try it and you’ll be rewarded with a nice runtime error: panic: runtime error: cgo argument has Go pointer to Go pointer).
Since I was busy building stuff using Go instead of following all of the discussions about this issue on the golang mailing list, I missed this little gem and had to spend some time figuring out why my builds started failing :/

So, is there a solution to this?  Not really.  Well, nothing good.  You could do something like this, I suppose:

func SetData(theData interface{}) error {
	// Get bytes from the raw interface
	var buf bytes.Buffer
	enc := gob.NewEncoder(&buf)
	if err := enc.Encode(theData); err != nil {
		return err
	}
	data := buf.Bytes()
	C.set_data(unsafe.Pointer(&data[0]))
	return nil
}

What’s going on here? Well, we’re making use of the encoding/gob package to convert the Go interface{} into a byte slice. Then we’re passing a pointer to the first element of that slice to C.set_data.

I don’t like it much. I mean, sure – the data is there, but in order to extract the data later and gob-decode it, you’re going to need to remember the size of the byte array you fed into C.set_data.

So, let’s just disallow passing a raw interface{} to our wrapper function. We could force the user to give us a byte array. This simplifies things a bit:

func SetData(data []byte) {
	C.set_data(unsafe.Pointer(&data[0]))
}

But we still have the problem of remembering the length of our byte array so we can extract it later using C.GoBytes.

So, unfortunately, passing arbitrary data from a Go interface{} into a C void* is just not practical. For my particular C library wrapper, I just decided to require the user to supply a string rather than an interface{}. This has the benefit of being easier to convert back and forth, and you also don’t have to stash things like array sizes or references to C.free later. Strings are reasonably versatile in that you could, say, convert a struct to JSON or even base64-encode some binary data and shove it into a string if you really need to. You could implement this solution thusly:

func SetData(theData string) {
	cstr := C.CString(theData) // allocated in C memory
	C.set_data(unsafe.Pointer(cstr))
}

func GetData() string {
	return C.GoString((*C.char)(C.get_data()))
}

Converting the data back from a void* (unsafe.Pointer) to a string still requires an explicit cast, but it works well enough.

Making an extensible wiki system with Go

My little side project for roughly the last year has been a wiki system intended for enterprise use.

This certainly isn’t a new idea — off the top of my head, I can think of several of these “enterprise” wiki systems, with Confluence and Liferay being the most obvious (and widely used) examples. The problem with the current solutions, at least in my mind, is that they have become so bloated and overloaded with features that they are difficult to use, difficult to administer, and difficult to extend. I’ve been forced to use these (and other) collaboration/wiki systems, and while I see the value in them, the sort of system I want to use just doesn’t seem to exist.

My goals were/are thus:

  1. Provide basic wiki functionality in a simple and clean UI as part of the core system
  2. Use markdown as the markup language (never liked wikicode) for editing
  3. Be horizontally scalable (I’ve suffered through overburdened enterprise software)
  4. Be extensible, both in the frontend UI (plugins) and the backend services
  5. Don’t run on the JVM (because reasons) 😉

Ok, that last one is kind of a joke (but not really). At least the core system ought to be native, while of course additional services can be written in just about any programming language.

After a year of working a few hours a week on this, I’ve come up with something I call Wikifeat (hey, the .com was available). It’s not finished, of course, and likely won’t be for a while yet. Building an enterprise wiki/collaboration platform is proving a daunting task (especially goal #4 above), but the basics are done and the system is at least usable:

Screenshot of the Wikifeat interface, showing a map plugin.

Technology

The Wikifeat backend consists of one or more CouchDB servers and several small (you might almost call them ‘micro’) services written in Go. These services have names like ‘users’, ‘wikis’, etc., denoting their function. Multiple instances of these services can be run across multiple machines in a network, in order to provide scalability. The ‘frontend’ service acts as a router for user requests from the Wikifeat web application, directing them to the appropriate service to fulfil each request. Backend services also communicate directly with one another via RESTful APIs when needed.

The service ‘instances’ find each other via a service registry. I decided on the excellent etcd to serve as my registry. Each service instance maintains a cache of all of the other services registered with etcd that it can pull from when it needs to send a request to another service.

Extending the backend is a simple matter of developing a new service, registering it with etcd, and making use of the other services’ REST APIs. The frontend service also has a facility for routing requests to these custom services (in the anticipated common use case of pairing a front-end JavaScript ‘plugin’ with a custom backend service). Frontend plugins are written in JavaScript and placed in a directory in the frontend service area.

The web app itself is written as a single-page application with Backbone and Marionette. The wiki system is mostly complete. Pages can be edited using CommonMark, a fairly new effort to standardize Markdown. I personally like Markdown for its simplicity, and I always hated the various WYSIWYG HTML editors commonly included with CMS/collaboration software. Most developers already know Markdown, and most non-techies should be able to pick it up quickly (being that it *is* meant to be a human-readable markup language):

Wikifeat markdown editor

If that text editor looks familiar, it’s basically a tweaked version of the WMD editor used on Stack Exchange 🙂

Future Plans

I’m looking forward to continuing to evolve this thing. The frontend probably needs the most help right now. I actually hope to replace the WMD markdown editor with something more ‘custom’ that can make inserting plugins easier. I’d also like to allow plugins to add custom ‘insertion’ helper buttons to the editor, along with a plugin ‘editor’ view, rather than requiring the user to enter a custom div block.

My mind is also overflowing with ideas for future plugins. Calendars, blogs, integration with third-party services/applications, etc. Hopefully I can get to those eventually. It would be *really* swell if someone else took on some of that work as well. The project is open source (GPLv2), so that’s certainly possible…

UPDATE: After some feedback and reflection, I’ve decided to change the license from GPL to BSD 🙂

Creating RPMs from python packages

While I can use pip to install additional Python packages on my development box, sometimes I need to deploy an application into an environment where this isn’t possible. The best solution, if the target box is an RPM-based Linux distro, is to install any necessary Python dependencies as RPMs. However, not all Python packages are available as RPMs.

To build them yourself, you’ll need a package called py2pack. Install it thusly:

pip install py2pack

Let’s say you need to RPM-ify the fastkml package. On CentOS/Fedora/RHEL, do the following:

mkdir -p ~/rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS} # If you don't already have this

cd ~/rpmbuild/SOURCES
py2pack fetch fastkml 0.9  # Version number is optional

This will download the fastkml tarball into ~/rpmbuild/SOURCES. Next, you’ll need to create the RPM spec file. py2pack has a few templates; we’ll use the ‘fedora’ one:

cd ~/rpmbuild/SPECS
py2pack generate fastkml 0.9 -t fedora.spec -f python-fastkml.spec

This will generate a spec file, which you may then feed into rpmbuild:

rpmbuild -bb python-fastkml.spec #Use -bs to build a source rpm

This should hopefully work, and will dump an RPM file into the ~/rpmbuild/RPMS directory. Note: this isn’t perfect; I’ve already encountered a few Python packages for which this procedure doesn’t work cleanly.

CouchDB – check your default settings

While running a pkg upgrade on my FreeBSD box, I noticed the following message scroll by while couchdb was being updated to the latest version:

CONFIGURATION NOTES:

PERFORMANCE
For best response (minimal delay) most sites will wish to uncomment this line from /usr/local/etc/couchdb/local.ini:

socket_options = [{recbuf, 262144}, {sndbuf, 262144}, {nodelay, true}]

Otherwise you'll see a large delay when establishing connections to the DB.

And just like that, my response times for simple fetch requests to CouchDB went from ~200ms to ~4ms. This was something that had been seriously bothering me, and it had me contemplating dropping CouchDB in favor of something else. If ‘most sites’ will want to do this, why isn’t it the default setting, I wonder? Oh well.