In-House Distributed File System for faster IO and optimized Compute
NLPCORE DFS 1.0
NLPCORE DFS is a native implementation of WebDAV protocol over a relatively new but stable distributed file system, glusterfs - primarily implemented with a goal to store large blobs on a cluster in an efficient manner. Our current design involves Cassandra which is essentially a key-value store. Though Cassandra is very well optimized distributed storage technology, but a relatively poor design on top could quickly take up a lot of resources in maintaining the Cassandra cluster.
o Data Description & Problem
Data that we handle is typically in the form of graph, logically representing nodes and edges. Most well established graph based databases that use Cassandra as their back-end, implement a bunch of sophisticated methods such as hashed keys to ensure data localization in a cluster. But since our models are still very experimental and change very often, a little trade-off between speed and convenience makes a huge difference in saving R&D time.
o Bottleneck in the process
Graph traversals are one of the major activities that the platform does periodically on per user query request. While people would imagine a fast matrices based algorithm to quickly discover interacting nodes in certain neighborhood, our graphs are mostly extremely sparse and creating a large entity map for each document to simulate dense matrix in the computer, requires creating another blob per document. These blobs are not often problematic but accumulate into dense key maps. While one would argue that most technologies like Cassandra would optimize blob data and only maintain indexes to file reference in the memory. This is exactly the behavior we are trying to achieve from this Distributed File System implementation.
While key value stores provide advanced query capabilities and sometimes ability to filter, such capabilities might not be required for blobs, rather a smartly articulated key in the form of a path can do the magic and only the bare minimum metadata can be saved in the key value store.
o Configuring underlying file system
There are large number of distributed file systems available and other than some minor configuration differences, most provide similar capabilities. Our implementation uses Nginx's WebDAV module as the interaction layer. For the purpose of interacting with the file system, file system is mounted on to a directory, which is then eventually used as the web root directory.
Glusterfs can be configured in various modes. Default is the distributed mode in which each server maintains a part of the file system as a discreet file and then these are distributed evenly so that no individual node has all the data. Other way of setting up the cluster is to use striping mode, in which the data is tripped across nodes and works best if the file sizes varies a lot in the cluster, in which case each server ends up picking a portion of every file. Finally, cluster can be setup to create replica of data across different nodes. These configurations can be combined in conjunction as well, just like most RAID setups.
For example, data can be striped and replicated at the same time to make sure read/writes are faster.
Though we have not done a systematic study yet to compare both the implementations, some positive signs in the new system is the increased CPU utilization, which is a good indication that CPU is spending more time in computations. But this could be due to the new driver that we have written as well hence as part of future update to this blog, we plan on publishing profiling results of both the platforms.