11 July 2021

A Knowledge Graph-Based Semantic Database for Biomedical Sciences

by Christian Jakenfelds, GRAKN.AI

  • 1. THE DATABASE FOR AI
  • 2. [Timeline, 1960 to 2030, with axes for scale and complexity. Database generations: punch cards & tapes, navigational databases, relational/SQL databases, NoSQL & NewSQL databases, and an open question mark. Workloads: record keeping, Business Intelligence (BI), web applications, Artificial Intelligence (AI).] AI systems process knowledge that is too complex for current databases. Follow us @GraknLabs
  • 3. [Timeline, 1960 to 2030, with axes for scale and complexity. Database generations: punch cards & tapes, navigational databases, relational/SQL databases, NoSQL & NewSQL databases. Workloads: record keeping, Business Intelligence (BI), web applications, Artificial Intelligence (AI).] What relational did for BI is what Grakn will do for AI.
  • 4. What is the problem with complex data?
       Too complex to model: current modelling techniques are based only on binary relationships, so complex domains could not be modelled.
       Too complex to query: current languages only let you query explicitly stored data, so verbose queries could not be simplified.
       Analytics too expensive: distributed algorithms (BSP) are expensive and not reusable, so analytics algorithms could not be reused.
       Database query languages are too low-level: they lack a strong abstraction over low-level constructs and complex relationships, which makes complex data difficult to work with.
  • 5. GRAKN.AI is a hyper-relational database for knowledge-oriented systems, i.e. GRAKN.AI is a knowledge base. Knowledge storage system: a novel knowledge representation system based on hypergraph theory. Knowledge inference: an OLTP reasoning engine. Knowledge analytics: OLAP distributed analytics.
  • 6. What is a hyper-relational database?
       Hyper-expressive schema: a flexible Entity-Relationship concept-level schema for building knowledge models. Result: model complex domains.
       Real-time inference: automated deductive reasoning over data points at runtime (OLTP). Result: derive implicit facts and simplify queries.
       Analytics as a language: automated distributed algorithms (BSP) exposed as a language (OLAP). Result: automated large-scale analytics.
       High-level query language: strong abstraction over low-level constructs and complex relationships. Result: easier to work with complex data.
  • 8. The Central Dogma (Francis Crick, 1958; Nobel Prize winner, 1962). Replication: DNA to DNA. Transcription: DNA to RNA. Translation: RNA to proteins.
  • 9. A small sample… https://www.ncbi.nlm.nih.gov, http://www.uniprot.org, http://www.geneontology.org, http://reactome.org, http://www.mirbase.org, http://mircancer.ecu.edu, http://bioinfo.life.hust.edu.cn/miRNASNP2/index.php, http://mirtarbase.mbc.nctu.edu.tw, http://www.genenames.org, http://www.microrna.org/microrna/home.do
  • 10. …and the challenge
  • 11. Schema Example: Basic Model. Person and Company each have a Name. The Employment relationship relates two roles, Employee and Employer; Person plays Employee and Company plays Employer.
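  • 12. Schema Example: Type-Hierarchy. As in the basic model, Person and Company each have a Name, and Employment relates the roles Employee and Employer. Customer is a subtype (sub) of Person and Startup is a subtype (sub) of Company. The diagram draws four plays edges from the entity types to the two roles.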
  • 13. The BioGrakn schema
  • 18. The Central Dogma: Inferred (Francis Crick, 1958; Nobel Prize winner, 1962). Replication: DNA to DNA. Transcription: DNA to RNA. Translation: RNA to proteins.
  • 19. What's next?
  • 20. THE DATABASE FOR AI
  • 21. Schema Example: Type-Hierarchy. The type hierarchy of slide 12 (Person with subtype Customer, Company with subtype Startup, Name, and Employment relating the roles Employee and Employer) is extended with a Marriage relationship that relates the roles Husband and Wife. Plays edges connect the entity types to the roles.
  • 22. Valid Data Insertion. Alice and Bob are joined by a marriage (mar) as wife and husband, and two employment (emp) relationships have IBM and Grakn as employers. Type labels shown in the diagram: customer, person, startup. ✓ Write commit success.
  • 23. Invalid Data Insertions: [Intelligent] Schema Constraints are Back! Charlie (person) and Apple (company) are inserted into a marriage (mar) as husband and wife. ❌ Invalid relationship. ❌ Write commit fails.
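  • 24. Hyper-Relationship Example: Nested Relationship. Alice and Bob (both person) are joined by a marriage (mar) as wife and husband; that marriage in turn takes part in a location (loc) relationship with Austin (City). A date, 07/01/2017, is attached with has.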
  • 25. Hyper-Relationship Example: N-ary Relationship. One cast relationship connects Titanic (movie), Jack (figure) and Leonardo (person, playing actor), and carries Billing-number 1.
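  • 26. Rule Example: Transitive Relationship. Kings Cross (ward) is located in (loc) London (city), and London is located in (loc) UK (country); the third loc relationship in the diagram places Kings Cross in the UK.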
  • 27. Rule Example: Simple Business Rule. The diagram shows Schedule A and Schedule B with their endpoints: A Start, A End, B Start, B End.
  • 28. The Inference OLTP Language. A knowledge-oriented query language should be able to retrieve not only explicitly stored data but also implicitly derived information.
  • 29. Complex Query Example. Who are all the drivers that will be arriving in the UK? People: Alice (Full-time Emp), Bob (Part-time Emp), Charlie (Temporary Emp). Vehicles: AB123 (Bus), BC234 (Van), CD345 (Truck). Places: Kings Cross (Ward), London (City), UK (Country). Relationships: three drive, three travel and two loc. The query would be very long and complex in SQL, NoSQL or even graph databases.
  • 30. Complex Query Example: Type and Relationship Inference. Who are all the drivers that will be arriving in the UK? People: Alice (Full-time Emp), Bob (Part-time Emp), Charlie (Temporary Emp). Vehicles: AB123 (Bus), BC234 (Van), CD345 (Truck). Places: Kings Cross (Ward), London (City), UK (Country). Relationships: three drive, three travel and two loc.
  • 31. The Analytics OLAP Language. Large-scale analytics is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it too. At the end of the day, very few people know how to code it.
  • 32. Example of a Distributed Analytics Algorithm. Connected Component: a clustering algorithm (pseudocode)
       For each vertex V:
         Superstep 1:
           V sets its own id as its cluster label.
           V sends its own id via both outgoing and incoming edges.
         Superstep n:
           For every message m received by V, compare m to V's current cluster label L; if m > L, set the label to m.
           If the cluster label has not changed in this superstep, vote to halt; else, send the new cluster label via all edges.
       Global operation: while not every vertex has voted to halt and n < N, do another superstep n + 1.
       An efficient implementation of this algorithm is about 200 lines of code in Java.
  • 34. Graql Distributed Analytics Queries. And we'll continue to add more algorithms to the language, such as PageRank, K-Core, Triangle Count, Density, Cliques, Centrality, and so on.
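
The schema constraints of slides 21 to 23 can be illustrated outside Grakn with a small Python sketch. This is not Graql and not the Grakn API; the `Schema` class, its field names and the assignment of the husband and wife roles to `person` are assumptions made for illustration. It shows how sub, plays and relates together decide whether an insertion is valid.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    """Hypothetical in-memory schema: sub, plays and relates edges."""
    parent: dict = field(default_factory=dict)   # type -> supertype (sub)
    plays: dict = field(default_factory=dict)    # type -> roles it may play
    relates: dict = field(default_factory=dict)  # relationship -> its roles

    def supertypes(self, type_name):
        """Yield the type itself and every ancestor along sub edges."""
        while type_name is not None:
            yield type_name
            type_name = self.parent.get(type_name)

    def can_play(self, type_name, role):
        """A type may play a role declared on itself or on any supertype."""
        return any(role in self.plays.get(t, set())
                   for t in self.supertypes(type_name))

    def validate(self, relationship, players):
        """players maps role -> entity type; raise ValueError if invalid."""
        for role, type_name in players.items():
            if role not in self.relates.get(relationship, set()):
                raise ValueError(f"{relationship} does not relate {role}")
            if not self.can_play(type_name, role):
                raise ValueError(f"{type_name} cannot play {role}")
        return True


# Assumed reading of the slide 21 diagram.
schema = Schema(
    parent={"customer": "person", "startup": "company"},
    plays={"person": {"employee", "husband", "wife"},
           "company": {"employer"}},
    relates={"employment": {"employee", "employer"},
             "marriage": {"husband", "wife"}},
)

# Valid: a customer is a person, a startup is a company.
schema.validate("employment", {"employee": "customer", "employer": "startup"})

# Invalid, as on slide 23: a company cannot play wife.
try:
    schema.validate("marriage", {"husband": "person", "wife": "company"})
except ValueError as err:
    print("Write commit fails:", err)
```

Keeping the role check on the type hierarchy rather than on individual instances is what lets subtypes such as customer and startup inherit the roles of their supertypes.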
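
The hyper-relationships of slides 24 and 25 can be sketched in Python as relationships whose role players may themselves be relationships. The class names, the role names `located-subject`, `location`, `production` and `character`, and the choice to attach the date to the marriage are all assumptions for illustration; the slides only name the roles wife, husband and actor.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Entity:
    type: str
    name: str


@dataclass
class Relationship:
    """A relationship with any number of role players (n-ary).

    A player may be an Entity or another Relationship (nested).
    """
    type: str
    players: Dict[str, Union[Entity, "Relationship"]]
    attributes: Dict[str, object] = field(default_factory=dict)


alice = Entity("person", "Alice")
bob = Entity("person", "Bob")
austin = Entity("city", "Austin")

# Nested: the marriage itself plays a role in a loc relationship.
# Attaching the date to the marriage is an assumption.
marriage = Relationship("marriage", {"wife": alice, "husband": bob},
                        {"date": "07/01/2017"})
located = Relationship("loc", {"located-subject": marriage,
                               "location": austin})

# N-ary: one cast relationship with three role players.
titanic = Entity("movie", "Titanic")
jack = Entity("figure", "Jack")
leonardo = Entity("person", "Leonardo")
cast = Relationship("cast", {"production": titanic, "character": jack,
                             "actor": leonardo}, {"billing-number": 1})

print(len(cast.players), located.players["located-subject"].type)
```

In a model restricted to binary edges, both examples would need an extra intermediate node to stand in for the relationship.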
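
The transitive rule of slide 26 and the query of slides 29 and 30 can be sketched in plain Python. This is a naive fixed-point computation, not how the Grakn reasoner is implemented, and it covers relationship inference only, not type inference. The pairing of drivers, vehicles and destinations below is an assumed reading of the diagram.

```python
def transitive_closure(pairs):
    """Return pairs plus every (a, c) implied by (a, b) and (b, c)."""
    closure = set(pairs)
    while True:
        derived = {(a, d)
                   for (a, b) in closure
                   for (c, d) in closure
                   if b == c}
        if derived <= closure:  # nothing new: fixed point reached
            return closure
        closure |= derived


# Explicitly stored facts (assumed pairing of the slide's labels).
loc = {("Kings Cross", "London"), ("London", "UK")}
drive = {("Alice", "AB123"), ("Bob", "BC234"), ("Charlie", "CD345")}
travel = {("AB123", "Kings Cross"), ("BC234", "London"), ("CD345", "UK")}


def drivers_arriving_in(place):
    """Drivers whose vehicle travels to the place or anywhere inside it."""
    inferred_loc = transitive_closure(loc)
    inside = {p for (p, q) in inferred_loc if q == place} | {place}
    vehicles = {v for (v, destination) in travel if destination in inside}
    return sorted(d for (d, v) in drive if v in vehicles)


print(drivers_arriving_in("UK"))  # ['Alice', 'Bob', 'Charlie']
```

Only Charlie's trip is stored with the UK as its destination; Alice and Bob appear in the answer because the inferred loc facts place Kings Cross and London inside the UK.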
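
The connected-component pseudocode of slide 32 can be simulated on a single machine as follows. This is a minimal sketch of the superstep logic only: a real BSP implementation distributes vertices across workers, and the function name and `max_supersteps` parameter (the slide's N) are assumptions. Vertex ids must be mutually comparable.

```python
def connected_components(vertices, edges, max_supersteps=100):
    """Label each vertex with the largest vertex id in its component."""
    vertices = list(vertices)
    neighbours = {v: set() for v in vertices}
    for a, b in edges:  # messages travel along both edge directions
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Superstep 1: own id becomes the label and is sent along all edges.
    label = {v: v for v in vertices}
    inbox = {v: [] for v in vertices}
    for v in vertices:
        for n in neighbours[v]:
            inbox[n].append(label[v])

    # Supersteps 2..N: adopt the largest label received, forward on change.
    for _ in range(max_supersteps - 1):
        outbox = {v: [] for v in vertices}
        any_active = False
        for v in vertices:
            best = max(inbox[v], default=label[v])
            if best > label[v]:
                label[v] = best
                any_active = True
                for n in neighbours[v]:
                    outbox[n].append(best)
            # otherwise the label is unchanged and v votes to halt
        if not any_active:  # every vertex voted to halt
            break
        inbox = outbox
    return label


labels = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
print(labels)  # {1: 3, 2: 3, 3: 3, 4: 5, 5: 5}
```

Vertices that end up with the same label belong to the same cluster, so grouping by label yields the connected components.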

 

Connected Data World 2021  All Rights Reserved.

