Unplugging the public cloud!
Last Tuesday we finally hit the delete button and terminated all our frontend and backend instances at Google Cloud before switching our portal (https://beta.nlpcore.com) over to our private cloud stack sitting in a locked closet in the backyard. It was quite a momentous occasion for us, as for the past few months we had been experimenting with various options to find the best balance of price and performance for our continuously high compute requirements.
To put this into perspective, we are an enterprise search startup, experimenting heavily with various datasets, including many in the domain of health sciences. The NLPCORE website presents contextually relevant, color-coded bioentities and materials, along with their annotation references from published research, which users can easily filter or explore visually on a relationship graph at any depth. Under the covers, NLPCORE has a workflow pipeline managing Docker instances that crawl research articles, extract and parse text, gather linguistic vitals (parts-of-speech tags) and statistical vitals (term and collocation frequencies), update its neural network and text index, and compute the most likely results. All this happens in a completely automated and distributed environment spread across multiple nodes that are added or removed dynamically based on workload (including selecting node specs to match the size of the workload as and when required). This architecture requires a high number of compute instances available on demand to find contextually relevant results, as well as a reasonably good number of background instances running frequently to crawl the web and maintain the index cache.
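The dynamic node selection described above can be sketched roughly as follows. This is a minimal illustration only, not our actual scheduler: the `NodeSpec` catalogue, the spec names and sizes, and the `pick_spec` and `nodes_needed` helpers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    name: str      # hypothetical label, not a real instance type
    cores: int
    ram_gb: int


# Hypothetical catalogue, ordered from smallest to largest.
CATALOGUE = [
    NodeSpec("small", 2, 4),
    NodeSpec("medium", 4, 16),
    NodeSpec("large", 8, 64),
]


def pick_spec(workload_gb: float, catalogue=CATALOGUE) -> NodeSpec:
    """Return the smallest spec whose RAM holds the workload, else the largest."""
    for spec in catalogue:
        if spec.ram_gb >= workload_gb:
            return spec
    return catalogue[-1]


def nodes_needed(queued_tasks: int, tasks_per_node: int, max_nodes: int) -> int:
    """Number of nodes to keep running for the current queue, capped at max_nodes."""
    if queued_tasks <= 0:
        return 0
    wanted = -(-queued_tasks // tasks_per_node)  # ceiling division
    return min(wanted, max_nodes)
```

Calling `nodes_needed` on every scheduling tick and comparing the answer with the number of nodes currently running tells the pipeline whether to add or remove workers, while `pick_spec` chooses what kind of node to add.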
Our datasets are extracted from millions of articles; the average dataset is roughly 100GB, and the computed results amount to at least one third of the original dataset size. All the cloud vendors provide amazing infrastructure for high-volume storage at cheap rates. Computing once on an expensive high-compute instance and pushing all the computed data to the cloud is an ideal situation for most companies, and the cloud therefore remains a popular choice for them. But companies like ours that are still in the experimentation stage have to throw away computed data and often start from scratch. We are therefore different from, and more demanding than, a traditional transactional cloud application, where the application instances only come alive to query or update a limited set of results in a database instance. While we are not quite there yet, we are more like someone building Amazon Alexa on AWS, Bing Search on Azure or Google Search on Google App Engine and paying for each compute instance. The recent story of Dropbox moving off AWS is a good example on this front (http://www.wired.com/2016/03/epic-story-dropboxs-exodus-amazon-cloud-empire/).
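To make the sizes above concrete, here is a small back-of-the-envelope sketch. The 100GB dataset and the one-third ratio come from the figures above; the helper names and the number of reruns in the example are hypothetical, chosen only for illustration.

```python
def min_results_gb(dataset_gb: float, ratio: float = 1 / 3) -> float:
    """Lower bound on the size of the computed results for one dataset."""
    return dataset_gb * ratio


def discarded_gb(dataset_gb: float, runs: int) -> float:
    """Computed data thrown away when every run except the last is discarded."""
    return min_results_gb(dataset_gb) * max(runs - 1, 0)


# One 100GB dataset yields at least ~33GB of computed results per run.
per_run = min_results_gb(100)

# Hypothetical example: recomputing the same dataset 4 times from scratch
# throws away at least ~100GB of computed results along the way.
wasted = discarded_gb(100, runs=4)
```

Every discarded run is compute time that was paid for but never reused, which is why the "compute once, store cheaply" pattern does not fit an experimentation-stage workload.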
Being a self-funded startup has been a blessing in disguise for us, as it forces us to find the most efficient hardware and software combination possible to build an industrial-strength enterprise search platform that we feel offers a unique value proposition to our users. We experimented with all three major cloud vendors along the way - Amazon AWS, Microsoft Azure and Google App Engine. Each platform offers a variety of compute instances and software infrastructure components that could potentially help a third party like us optimize costs and slim down our own software stack. However, running on a virtual instance on any of these platforms, we could never accurately map our performance to the physical characteristics of the underlying hardware: CPU, GPU or RAM. So eventually we decided to build hardware of our own, ordering parts a few at a time from, where else, Amazon Prime! In a couple of days, we had a custom PC ready with an Intel Core i7-5820K Haswell-E 6-core processor, 16GB RAM and a high-performance liquid CPU cooler to keep our stuff literally cool! We added a second, off-the-shelf machine (we literally lifted it off the shelf at a reuse store here in Seattle) with an older Intel Core i5-750 quad-core processor to divvy up the workload. We have been pleasantly surprised, as the table below shows.
Our machines have been operating continuously for the past 3 months at full scale while we honed our new algorithms to find more meaningful relations, dig deeper into our graphs and surface meaningful results. We believe we now have a good reference platform in place that allows us to experiment on both the hardware front (we are thinking of adding Raspberry Pi or Banana Pi nodes to the mix next, as low-cost worker nodes for simple tasks such as crawling the web) and the software front (a bottomless list of ideas, requirements and feature requests). We also intend to use our hardware experiment as a reference platform that we could offer in the future as high-performance compute instances.
Verdict on Cloud
The cloud is here to stay, without a doubt. It offers ubiquitous access, resource consolidation, redundancy, scale and configurability at low cost. However, I believe the business models currently on offer need to evolve to truly pass on these savings to all customers. At present, I would say the model works really well for the class of vendors that can factor in their cloud costs as a fraction of their own service offering. Heroku is a great example in this sense. They offer a limited free instance that is based on Amazon's limited free instance, but if a customer goes over, they add a premium on top of what Amazon charges them, and that premium becomes their incremental revenue. For enterprise customers, this is primarily a TCO (Total Cost of Ownership) question. If they factor in their own hardware purchase, data center maintenance and staff costs and compare those against the cost of cloud subscriptions, they may find the point at which it makes sense to be on the cloud rather than maintaining their own data center, or vice versa. Finally, for companies (such as ourselves) where compute or storage requirements are mission critical and scale exponentially as the number of users increases, unless there is a business model that offers pricing based on number of users or volume of data, they are better off charting their own course - precisely what Dropbox has just achieved.
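The TCO comparison above boils down to a break-even calculation, which can be sketched as follows. This is a simplified model under stated assumptions: a one-time hardware purchase, flat monthly costs on both sides, and no depreciation or financing. The function name and all dollar figures are hypothetical and are not our actual costs.

```python
import math
from typing import Optional


def break_even_months(hardware_cost: float,
                      own_monthly: float,
                      cloud_monthly: float) -> Optional[int]:
    """First whole month at which owning costs no more than renting.

    own_monthly covers power, maintenance and staff for your own hardware;
    cloud_monthly is the subscription bill for an equivalent workload.
    Returns None when owning never catches up with the cloud.
    """
    saving = cloud_monthly - own_monthly
    if saving <= 0:
        return None
    return math.ceil(hardware_cost / saving)


# Hypothetical figures: $2000 of hardware, $50/month to run it,
# versus a $450/month cloud bill -> owning pays off after 5 months.
months = break_even_months(2000, 50, 450)
```

If the function returns `None`, or a number of months longer than the hardware is expected to last, staying on the cloud is the cheaper option under this model.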