NLPCORE: COVID-19 Data Visualizer!  (By Yos Wagenmans, High School Intern)


First, I would like to thank Naveen Garg (Co-founder and CEO) and Varun Mittal (Co-founder and CTO) for their constant support in this project, as it would not have been possible without them. We invite you to check back in the future for any updates.

Our objective was to give researchers a better way of understanding the contents of the COVID-19 Open Research Dataset and the connections that might exist among various biological concepts across one or more articles. Using Natural Language Processing and Machine Learning algorithms created by NLPCORE, we can find not only the most relevant documents for a given search term but also the biological concepts they mention and the connections among them. The programs we have created are constantly being updated and streamlined to accomplish these tasks more effectively.

In our approach, we explored using a node-based graph but ultimately decided on a concentric ring chart to represent the vast amount of data. The main drawback of node-based visualization is its lack of scalability when hundreds of nodes have to be shown. For my summer 2019 project, I had explored building such a node graph, where each node represents a relevant concept (a word or phrase) from an article; see below. We looked for a diagram in which the query term could be centrally located, the depth/degree of relevance could be shown, and topics/types could be represented. The D3.js Sunburst visualization enabled us to do just that, even displaying the path to the selected topic. An example is shown below:


[Images: sunburst chart (left) and path diagram (right)]

The chart on the left has Coronavirus as the central term and transmission as the default secondary term. From there, the algorithm sorts the data by degree of relevance and groups nodes of similar type together. Each colored bar is a single node, and its size is determined by the frequency of nodes of that type. The path diagram on the right provides an easy way to see the related parent categories.
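
The grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the NLPCORE algorithm: the record keys (name, type, relevance, count) and the sample values are made up for the example, and the output uses the nested name/children/value layout that D3 hierarchy examples such as the Sunburst commonly read.

```python
import json
from collections import defaultdict


def build_sunburst_data(query, concepts):
    """Nest flat concept records under the query term.

    `concepts` is a list of dicts with hypothetical keys:
    name, type, relevance (higher means more relevant), and count.
    """
    groups = defaultdict(list)
    for concept in concepts:
        groups[concept["type"]].append(concept)

    children = []
    for concept_type, members in groups.items():
        # Within each type group, the most relevant concepts come first.
        members = sorted(members, key=lambda c: c["relevance"], reverse=True)
        children.append({
            "name": concept_type,
            "children": [{"name": c["name"], "value": c["count"]} for c in members],
        })

    # Larger type groups first; a wedge's size follows the summed counts.
    children.sort(
        key=lambda group: sum(leaf["value"] for leaf in group["children"]),
        reverse=True,
    )
    return {"name": query, "children": children}


# Illustrative sample records only (not taken from the dataset).
sample = [
    {"name": "droplet", "type": "transmission", "relevance": 0.9, "count": 12},
    {"name": "aerosol", "type": "transmission", "relevance": 0.7, "count": 5},
    {"name": "ACE2", "type": "protein", "relevance": 0.8, "count": 9},
]
tree = build_sunburst_data("Coronavirus", sample)
print(json.dumps(tree, indent=2))
```

Keeping the query term at the root and the types one level below it means the path from the center to any wedge reads as query, then type, then concept.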

Once a user clicks a wedge, the new query term is sent to the program, which displays a node graph below with the updated central term (elaborated in the NLPCORE blog post: here). Shown in the visual below is the interactive force-based graph with Coronavirus as the default center. Color coding helps organize the terms; click functionality on the nodes/edges has been written but is not yet integrated into this version. A click would generally lead to a ranked list of scientific articles with the search terms highlighted in corresponding colors. Additionally, the physics settings can be adjusted using the sliders on the right.


NLPCORE

NLPCORE

To process the vast amounts of data, this program makes API calls to the NLPCORE server and rapidly receives the necessary outputs for a visualization to be drawn. More information on the APIs can be found on the API catalog.

Code can be found in and run from the NLPCORE GitLab:

    https://gitlab.nlpcore.com/yoswagenmans/cord-19-visualization

References: 

    D3.js - https://github.com/d3/d3/wiki

    Sunburst - https://observablehq.com/@d3/sunburst

Additional Links: 

    NLPCORE Blog on In-House DFS for faster IO and optimized Compute

    NLPCORE Blog Update

    NLPCORE Revamping

    NLPCORE algorithms - Corpus Search Methods Patents

        #10102274 https://patents.google.com/patent/US10102274B2

        #10372739 https://patents.google.com/patent/US10372739B2

Summer 2019 Internship - Node Visualizer

Introduction: 

I designed an improved data sorter/visualizer for data points (legal, medical, etc.) on the server. The overall goal of this project was to create an efficient visualizer that could easily be integrated into the existing product.

Research process: 

In the first several weeks, I familiarized myself with pandas DataFrames and the different attributes that could be used to sort them into the desired data. As I began to create visualizers for the data being gathered, some constraints had to be met: scalability (10 or 500 nodes), physics modeling, and compatibility with key data points. The first iteration used a library called NetworkX, drawn with Matplotlib; however, it had many limitations, as seen in the following:

It was not scalable and had no physics engine. From there I moved on to Pyvis, which captured several important features like physics-based modeling and scalability, yet it was not compatible with the data points we hoped to show in the graph. (The Pyvis model is shown on the right.)


NLPCORE


I decided on the D3.js library [3], which met all our criteria for an ideal visualizer. We now have several working features, many of which can be easily customized: force modeling, grouped nodes, fisheye zoom [1], right-click on nodes/links [2], and data-defined visual aspects of nodes/links (refer to the gallery at the bottom for an image of each feature).

NLPCORE


How the program is run:

Using Jupyter, a client can access data from the server using the inputs provided at the top, which are parameters such as the search term/query and the amount of data. After running the remainder of the code, you are given fully sorted JSON data, which is then loaded locally using the JavaScript available on the GitLab. Once a local web server is started, you will see all the data linked and grouped as shown here:


NLPCORE
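
As a rough sketch of the export step, the sorted terms and links could be written out in the `{"nodes": [...], "links": [...]}` shape that D3 force-layout examples commonly load. The function names, the input keys (name, type, count), the `weight` field, and the sample values are assumptions for illustration; the actual notebook on the GitLab may structure its output differently.

```python
import json


def to_force_graph(terms, links):
    """Build a D3-style force-graph dict from sorted term and link records.

    `terms` is a list of dicts with hypothetical keys name, type, count.
    Each distinct type gets a numeric group so the page can color by group.
    """
    types = sorted({term["type"] for term in terms})
    group_of = {type_name: index for index, type_name in enumerate(types)}
    nodes = [
        {"id": term["name"], "group": group_of[term["type"]], "count": term["count"]}
        for term in terms
    ]
    return {"nodes": nodes, "links": links}


def save_graph(path, graph):
    """Write the graph as JSON so a locally served page can fetch it."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(graph, handle, indent=2)


# Illustrative sample records only.
terms = [
    {"name": "Coronavirus", "type": "virus", "count": 40},
    {"name": "bat", "type": "species", "count": 12},
]
links = [{"source": "Coronavirus", "target": "bat", "weight": 5}]
graph = to_force_graph(terms, links)
print(json.dumps(graph, indent=2))
```

Calling `save_graph("graph.json", graph)` and then running `python -m http.server` in the same directory is one simple way to make the file reachable from the page's JavaScript.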

Code snippets - the important functions: 

Link finder - finds every connection for the desired nodes


NLPCORE
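
Since the original snippet was an image, here is a hedged sketch of what a link finder of this kind might look like. It assumes co-occurrence rows of the form (term_a, term_b, weight), which is a hypothetical input format, and it treats links as undirected; the sample rows are made up.

```python
def find_links(desired, cooccurrences):
    """Return every link whose two endpoints are both desired nodes.

    `cooccurrences` is a list of hypothetical (term_a, term_b, weight) rows.
    Links are treated as undirected, so (a, b) and (b, a) are merged and
    their weights are summed.
    """
    wanted = set(desired)
    merged = {}
    for term_a, term_b, weight in cooccurrences:
        # Skip self-links and links that touch a node we are not drawing.
        if term_a == term_b or term_a not in wanted or term_b not in wanted:
            continue
        key = tuple(sorted((term_a, term_b)))
        merged[key] = merged.get(key, 0) + weight
    return [
        {"source": source, "target": target, "weight": weight}
        for (source, target), weight in sorted(merged.items())
    ]


# Illustrative sample rows only.
rows = [
    ("virus", "bat", 3),
    ("bat", "virus", 2),
    ("virus", "fever", 4),
    ("fever", "cough", 1),
]
found = find_links(["virus", "bat", "fever"], rows)
print(found)
```

The summed weight is the kind of value that could drive a weighted-link display, where thicker links indicate stronger relevance.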


Data sorting methods - reduce a large DataFrame to a small list


NLPCORE
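
The original snippet was an image, so the following is only a sketch of the reduction idea. It uses plain Python as a stand-in for the pandas DataFrame operations, and the row keys (term, count), the `limit` and `min_count` parameters, and the sample rows are all assumptions for illustration.

```python
from collections import Counter


def top_terms(rows, limit=10, min_count=1):
    """Reduce many raw rows to a short, ranked list of (term, total) pairs.

    `rows` is a list of dicts with hypothetical keys term and count.
    Counts for the same term are summed, rare terms are dropped, and
    only the `limit` most frequent terms are kept.
    """
    totals = Counter()
    for row in rows:
        totals[row["term"]] += row["count"]
    ranked = [(term, total) for term, total in totals.most_common() if total >= min_count]
    return ranked[:limit]


# Illustrative sample rows only.
rows = [
    {"term": "virus", "count": 5},
    {"term": "bat", "count": 2},
    {"term": "virus", "count": 3},
    {"term": "cough", "count": 1},
]
shortlist = top_terms(rows, limit=2)
print(shortlist)
```

Capping the list with `limit` is what keeps the graph readable whether the query returns 10 terms or 500.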


Key takeaways:

While I had a few years of previous coding experience, it mainly consisted of building applications and games in Java, so many of the libraries and programs I encountered were brand new to me. Similarly, I had encountered databases before, but nowhere near the scale involved here. I gained valuable experience using Python and building efficient methods that could reduce thousands of data points to a simple, organized list of desired terms. I also learned the fundamentals of JavaScript and of working with large libraries such as D3. The key takeaway is the use of the engineering design process: a process of iteration and learning from failure that consists of brainstorming, sketching, coding, testing and trying again, refining, and communicating results. To create an efficient yet effective solution, I tried several different methods and studied my mistakes to create a program that achieved the desired goal.

If a slideshow format is preferred, here is the link to the Google Slides


Image Gallery

Fisheye-zoom: 

NLPCORE


Right-click on link:

NLPCORE

Right-click on node:

NLPCORE

Varying size of node based on count:

NLPCORE

Weighted links, showing relevance:

NLPCORE

