
Understanding the Grants Knowledge Graph: A Catalogue of (Node) Views

This publication is meant to serve as a reference for participants in Philanthrobotics and the Open Grants Commons who want to understand the underlying structure of our knowledge graph. There are also some slight updates to our other work maintained here.

Published on Jul 21, 2022

The following document serves as an update on the knowledge graph construction of the Open Grants Commons [OGC]. The OGC combines data from both Lever for Change [LFC] and the Massachusetts Institute of Technology’s Solve program. Here is a brief update on the progress being made in each of the two areas of the OGC: Philanthrobotics, the knowledge graph and graph data science side of the project, and Philanthrolytics, the analytics dashboard side of the project.

1. Knowledge Graph Construction

The knowledge graph is the central component of the Philanthrobotics side of OGC. It is where the Knowledge Futures Group, LFC, and Solve can all ask interesting questions about the universe of data that is open to us. Currently, the data lives in a Neo4j database backend and contains multiple node and relationship types that form sublayers. Each sublayer is built around a set of central nodes that can be analyzed for their connections or relationships to one another. These sublayers are:

  1. The Budget Sublayer. This sublayer contains financial information about each proposal in the universe of proposals. The proposals are “binned” into groups of budgetary characteristics like “$1 to $5 Million”, “Less than $1 Million” and so on (see attached for sample).

These nodes can be expanded to show the relationship “BUDGETED_PROJS” (“Budgeted Projects”) in each bin, exposing each of the individual proposal nodes for further analysis (see attached for sample).

Other metadata shown as properties include a unifying ID number, the organization’s name (which is shown on the node as well), the proposal name, projected costs, and a brief solution overview. 

With this sublayer, the team has a rudimentary view of where money is flowing for each competition. Cross-referenced with other competition data, it can, for instance, give us an idea of the influence a competition has on the bottom line of an organization (see “Future Directions”).
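The binning described for the Budget Sublayer can be sketched in a few lines of Python. This is a minimal sketch, not the project's actual code: only the “Less than $1 Million” and “$1 to $5 Million” labels come from the text above. The remaining bin edges and labels, the function name `budget_bin`, and the handling of values that fall exactly on an edge are assumptions made for illustration.

```python
from bisect import bisect_right

# Upper edges (in US dollars) separating the bins. Only the first two labels
# appear in this document; the others are hypothetical placeholders.
BIN_EDGES = [1_000_000, 5_000_000, 10_000_000]
BIN_LABELS = [
    "Less than $1 Million",
    "$1 to $5 Million",
    "$5 to $10 Million",      # assumed label
    "More than $10 Million",  # assumed label
]


def budget_bin(projected_cost: float) -> str:
    """Return the label of the budget bin that contains projected_cost."""
    if projected_cost < 0:
        raise ValueError("projected cost must be non-negative")
    # bisect_right counts how many edges are <= the cost, which is the bin index.
    # A cost exactly on an edge therefore falls into the higher bin (assumption).
    return BIN_LABELS[bisect_right(BIN_EDGES, projected_cost)]


print(budget_bin(2_500_000))
```

Keeping the edges and labels in two parallel lists means a new bin can be added without touching the lookup logic.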

  2. The Competition Sublayer. This sublayer contains information about specific competitions in the universe of grants and proposals. The proposals are “binned” by the competition to which they applied. They can also be subdivided by the organization providing the data, either as (a) a property of the competition node itself or (b) a parent node to the competition node (see attached for LFC competition nodes example).

The competition sublayer can also be expanded to show the individual proposals “underneath” each competition, exposing the relationship “COMPETED_IN”, which represents exactly what its name suggests: the competition to which the proposal was submitted (see attached example).

Other metadata shown as properties include a unifying ID number, the organization’s name (which is shown on the node as well), the proposal name, projected costs, and a brief solution overview. 
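Expanding a competition node to its proposals corresponds to a simple graph pattern match. The sketch below builds a parameterized Cypher query in Python. It is illustrative only: the `COMPETED_IN` relationship type and the “eoc2019” competition name come from this document, while the node labels `Proposal` and `Competition`, the `name` property, the direction of the arrow, and the helper name `competed_in_query` are assumptions about a schema that is not spelled out here.

```python
def competed_in_query(competition_name: str) -> tuple[str, dict]:
    """Build a parameterized Cypher query listing proposals in one competition."""
    # Node labels, the `name` property, and the arrow direction are assumptions;
    # only the COMPETED_IN relationship type is taken from the text.
    query = (
        "MATCH (p:Proposal)-[:COMPETED_IN]->(c:Competition {name: $competition}) "
        "RETURN p.name AS proposal "
        "ORDER BY proposal"
    )
    # Passing the value as a parameter avoids interpolating user input into Cypher.
    return query, {"competition": competition_name}


query, params = competed_in_query("eoc2019")
print(query)
print(params)
```

With the official `neo4j` Python driver, a pair like this could be passed to `session.run(query, params)` against a live database. The sketch stops short of that so it runs without a server.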

  3. The Employee Sublayer. Setting the stage for future metadata analyses, teams sometimes wonder about the resources, specifically the human capital, available to the organization submitting a proposal. For this, we created the employee sublayer. Organizations can be “binned” by the amount of human capital they maintain. Currently, we only take into account full-time employees, with contract and part-time employees to be implemented later (see attached example). The current bins range from fewer than 10 full-time employees to more than 1000 full-time employees.

The employee binned nodes can also be expanded to show the relationship “EMPLOYEE_COUNT”, which represents the count of employees the parent organization houses. The exact number of employees is not yet stored as a property of the organization’s node, but this is planned for the future.

In addition, other metadata shown as properties include a unifying ID number, the organization’s name (which is shown on the node as well), the proposal name, projected costs, and a brief solution overview.

  4. The Proposal Sublayer. Unlike the other layers of the knowledge graph, the proposal sublayer is strictly a “child” sublayer and is accessed as a layer of information under the other sublayers. However convoluted that may sound, it actually makes the proposal sublayer the strongest of all the sublayers, because each of the aforementioned relationships can be exposed between proposals in the graph, giving a real “feel” for what the universe of proposals looks like for a competition, an organization, or a specific group of employee counts.

For instance, let’s take the following example (see attached). The African Diaspora Council can expose its various relationships to other proposing organizations in the “eoc2019” competition. We see that, like the other organizations, it is small, with “up to 10 full-time employees”, but it is one of three organizations with a projected proposal budget of “$1 to $5 Million” (along with Montgomery College Foundation, Inc., and mRelief).
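The cross-sublayer lookup in this example can be illustrated with a small in-memory graph of triples. This is a sketch under stated assumptions rather than the Neo4j implementation: the three named organizations, the “eoc2019” competition, the bin labels, and the relationship names come from the text, while the edge directions, the helper names `build_index` and `shared`, and the “Example Org” entry (added only for contrast) are hypothetical.

```python
from collections import defaultdict

# (source node, relationship type, target node) triples. "Example Org" is a
# hypothetical entry added for contrast; edge directions are assumptions.
TRIPLES = [
    ("African Diaspora Council", "COMPETED_IN", "eoc2019"),
    ("Montgomery College Foundation, Inc.", "COMPETED_IN", "eoc2019"),
    ("mRelief", "COMPETED_IN", "eoc2019"),
    ("Example Org", "COMPETED_IN", "eoc2019"),
    ("African Diaspora Council", "BUDGETED_PROJS", "$1 to $5 Million"),
    ("Montgomery College Foundation, Inc.", "BUDGETED_PROJS", "$1 to $5 Million"),
    ("mRelief", "BUDGETED_PROJS", "$1 to $5 Million"),
    ("Example Org", "BUDGETED_PROJS", "Less than $1 Million"),
    ("African Diaspora Council", "EMPLOYEE_COUNT", "up to 10 full-time employees"),
]


def build_index(triples):
    """Map (relationship type, target) to the set of source nodes pointing at it."""
    index = defaultdict(set)
    for source, rel, target in triples:
        index[(rel, target)].add(source)
    return index


def shared(index, *conditions):
    """Return the sources that satisfy every (relationship, target) condition."""
    groups = [index.get(cond, set()) for cond in conditions]
    return sorted(set.intersection(*groups)) if groups else []


index = build_index(TRIPLES)
peers = shared(
    index,
    ("COMPETED_IN", "eoc2019"),
    ("BUDGETED_PROJS", "$1 to $5 Million"),
)
print(peers)
```

Intersecting the sets attached to each bin is the in-memory analogue of matching one proposal against several sublayers at once.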

Together, this data makes a unique universe of grants and proposals that is striking to look at. The current instance contains only cleaned data: 8 complete competitions, 3,387 grant proposals, and over 5 years of data. We uphold very strict cleaning standards for data loaded into Neo4j, so one of our current objectives is to establish proper cleaning pipelines over the next 30 days to take in more of the data that we currently have access to.

2. Neo4j Instance

Currently, the knowledge graph is maintained on a local server that exists on a Knowledge Futures computer. The current universe of data contains elements that cannot be released to the public without anonymization. There are efforts to define anonymized aggregates of this data for visualization.

In the near future, the knowledge graph will be deployed to Neo4j AuraDB, a fully managed, cloud-based graph database instance, so that multiple project participants can access the data for professional projects. This will lead to collaborative curation and cleansing of the data on various levels, as well as validation and ideation of projects across organizations.

3. Dash Application

Further efforts have been made to relate different proposals to each other across contest boundaries. This effort has been driven by a textual analysis of the executive summaries of each of the proposals. An example of the Dash application can be found here.

The Dash application is a culmination of many organizing efforts around the proposal data for LFC. Referenced in the repository for this map are a few other tools that will be outlined before we explore the map itself.

  1. The bulk downloader. First, there is a bulk downloader of all LFC data from the Torque API. It uses the Python Torque client to walk the various data objects in the API structure and pull the data into a tabular (.csv) format. It is maintained in a proposal_downloader.py script, which will be made available to the public on OGC’s GitHub site soon.

  2. Cleaner script. This script cleans the data from Torque. Working on the column that holds the textual information, it cleans the location data by removing bars and other unwanted characters.

  3. Preprocessor script. This script takes the document information and transforms it into TF-IDF vector matrices. Currently, it is hardcoded to take only the document/executive summary text, but we plan to extend it to other features in the future.
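The cleaning and preprocessing steps can be sketched with the Python standard library alone. This is a hedged illustration, not the project's scripts: the set of characters treated as unwanted, the reading of “bars” as the pipe character, the unsmoothed TF-IDF formula (term frequency times log(N / document frequency)), the function names, and the sample strings are all assumptions. Libraries such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their values would differ from the ones computed here.

```python
import math
import re
from collections import Counter

_BARS = re.compile(r"\s*\|+\s*")
_UNWANTED = re.compile(r"[^\w\s,.\-']")  # assumed set of characters to keep


def clean_location(raw: str) -> str:
    """Replace bar separators with ', ' and drop other unwanted characters."""
    text = _BARS.sub(", ", raw)
    text = _UNWANTED.sub("", text)
    return re.sub(r"\s+", " ", text).strip(", ")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(documents: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, rows) where each row is one document's TF-IDF vector."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        rows.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, rows


# Hypothetical inputs, for illustration only.
print(clean_location("Boston | Massachusetts || United States*"))
summaries = [
    "clean water for rural schools",
    "clean energy for rural clinics",
]
vocab, matrix = tfidf_matrix(summaries)
print(vocab)
```

In this formulation a term that appears in every document gets a weight of zero, which is why shared words such as “clean” do not help distinguish the two sample summaries.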

The main script of the Dash application builds a UMAP from the data and plots a 2D matrix of k-nearest neighbors. A third dimension is constructed from a downstream topic model matrix and made part of the UMAP.
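The neighbor relationships behind such a map can be illustrated with a plain k-nearest-neighbor lookup. UMAP starts from a k-nearest-neighbor graph over the input vectors before computing a low-dimensional layout, and the sketch below covers only that neighbor lookup, using cosine similarity. It does not perform the UMAP projection or the topic modeling, and the similarity measure, function names, and sample vectors are assumptions rather than details of the Dash application.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(vectors: list[list[float]], query_index: int, k: int) -> list[int]:
    """Return the indices of the k vectors most similar to vectors[query_index]."""
    query = vectors[query_index]
    scored = [
        (cosine_similarity(query, vec), idx)
        for idx, vec in enumerate(vectors)
        if idx != query_index
    ]
    # Sort by descending similarity, breaking ties by ascending index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Hypothetical document vectors (for example, rows of a TF-IDF matrix).
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(nearest_neighbors(vectors, query_index=0, k=2))
```

A brute-force scan like this is quadratic in the number of proposals, which is workable at the scale of a few thousand documents. Larger collections usually call for an approximate neighbor index.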

4. Future Directions

Below is a tabular view of some of the 30/60/90-day goals for each of the knowledge graph components of the Philanthrobotics arm of OGC. These goals are derived from one-on-one conversations Knowledge Futures has had with LFC and Solve. They are a sample of our larger to-do list and should help scale both the Dash application and our knowledge graph.

30 days

Data cleaning pipelines


  • Automate data cleaning efforts to incorporate data scraping of websites and remove HTML artifacts

  • Remove the reliance on API calls to make bulk data requests

Knowledge framework for de-biasing graph embeddings

  • Establish a conceptual framework, drawn from the literature, for de-biasing model embeddings

  • Run journal clubs with Carmen and Vincent (academics in the field) to get an idea of the knowledge framework

60 days

Extend TF-IDF model to take in more features

  • Move company metadata from the knowledge graph into tabular form for use in modeling the landscape

  • Build relationships to other organizations and create tabular forms of data that augment TF-IDF matrices of similar length and characteristics

90 days

Refine topic modeling to include other parts of speech and exclude parts of speech that organizations do not find useful

  • Survey organizations for a lexicon that they find useful for various contests and topics

    • Make this part of the “suggestion” UMAP model

