Text-to-Graph Conversion
An innovative way to improve search: convert documents into a knowledge graph, so related information scattered across sections is grouped in one place and reachable by traversing edges.
In documents, related information is often scattered across different sections. A graph groups that information around shared entities, so the same facts can be viewed from several angles. Because edges can be traversed in multiple directions, a search can start from any entity and follow its relations, which makes large, complex data easier to navigate and manage.
Converting natural language into a graph is ambitious. It requires a user-friendly graph schema capable of representing the variety of sentences found in the literature. Here we explore an approach that builds a schema on the same principles as natural language, aiming to address the complexity and scope of human language.
Project overview
- Knowledge-graph schema — a flexible schema to capture the vast array of information in scientific articles.
- Text–graph interface language — guidelines for converting text into its graph representation.
- Fine-tuning language models — leveraging LLMs to automate the transformation of text into graph format.
- Benchmarking the system — assessing how much the knowledge graph improves data retrieval and understanding.
Developing a knowledge-graph schema
The first step is a schema capable of representing the diverse information found across scientific studies. Written language manages this with only a simple set of grammatical rules, which has made it a universal tool for representing diverse information. We believe we can give our schema similar capabilities by carefully crafting a new syntax based on entities and relations instead of punctuation and spacing.
The versatility of language comes from its vast vocabulary, so the knowledge graph will also use a large vocabulary to stay intuitive. The schema, like language, is a basic syntax enhanced by an extensive vocabulary. It must be user-friendly, because it is intended to function as a search engine that requires no extensive training, and adaptable, to keep up with evolving information and practice. The design also aims to reduce redundancy and increase connectivity for efficient retrieval.
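As a rough illustration of the entity-and-relation idea, the sketch below models entities as shared nodes and relations as labelled edges whose endpoints may themselves be relations. All class names, field names, and the example facts are hypothetical choices made for this illustration, not the project's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A node: one vocabulary term, stored once and shared by all relations."""
    name: str
    kind: str  # hypothetical type label, e.g. "protein" or "process"


@dataclass(frozen=True)
class Relation:
    """A labelled edge. Source and target may be entities or other relations,
    which is one possible way to represent nested statements."""
    label: str
    source: "Entity | Relation"
    target: "Entity | Relation"


@dataclass
class KnowledgeGraph:
    entities: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def entity(self, name, kind):
        # Reuse the existing node if the name is already known (no duplicates).
        return self.entities.setdefault(name, Entity(name, kind))

    def relate(self, label, source, target):
        rel = Relation(label, source, target)
        self.relations.append(rel)
        return rel

    def touching(self, entity):
        """Relations that mention the entity directly, as source or target."""
        return [r for r in self.relations if entity in (r.source, r.target)]


# Illustrative content only: "p53 induces apoptosis under DNA damage".
kg = KnowledgeGraph()
p53 = kg.entity("p53", "protein")
apoptosis = kg.entity("apoptosis", "process")
stress = kg.entity("DNA damage", "condition")
induces = kg.relate("induces", p53, apoptosis)
nested = kg.relate("occurs_under", induces, stress)  # a relation about a relation
```

Storing each entity once and pointing every relation at it is one simple way to get the reduced redundancy and higher connectivity described above: all statements about an entity are reachable from a single node.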
Annotation guidelines, in tandem with the text-graph notation
Graph-database files are too technical to serve as a good annotation format. Since annotations are primarily done by humans, the format should be intuitive, resemble natural language, be easy to edit, and clearly show the graph structure. To bridge this gap we need an interface notation that mediates between natural-language text and the graph database: annotators write in the notation, and the notation is then converted into the graph-database format.
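To make the role of such an interface concrete, here is a minimal sketch of a converter from a human-editable notation to node and edge records that a graph database could import. The line format `subject -[label]-> object` and the record fields are assumptions invented for this example; the project's real notation may look quite different.

```python
import re

# Hypothetical line format: "subject -[label]-> object"
LINE = re.compile(r"^\s*(?P<src>.+?)\s*-\[(?P<label>[^\]]+)\]->\s*(?P<dst>.+?)\s*$")


def parse_notation(text):
    """Turn annotation lines into node and edge records."""
    nodes, edges = {}, []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: cannot parse {line!r}")
        src, label, dst = match["src"], match["label"].strip(), match["dst"]
        for name in (src, dst):
            # Each distinct name becomes exactly one node with a stable id.
            nodes.setdefault(name, {"id": len(nodes), "name": name})
        edges.append(
            {"source": nodes[src]["id"], "label": label, "target": nodes[dst]["id"]}
        )
    return list(nodes.values()), edges


sample = """
# one relation per line
p53 -[induces]-> apoptosis
DNA damage -[activates]-> p53
"""
nodes, edges = parse_notation(sample)
```

Rejecting unparseable lines with a line number, rather than skipping them silently, keeps annotation errors visible to the human editor.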
Just as one English sentence can have several valid French translations, there can be multiple correct graph representations. Too many options complicate the application, so clear guidelines are needed to standardize the labels. Creating those guidelines requires careful consideration of all system constraints, which is complex but crucial.
Fine-tuning a language model
Once the annotation guidelines are consistent and the graph works properly as a search engine, the next step is to automate the text-to-graph translation. The ideal candidates are LLMs, which are exceptionally good translators. The LLM's main role is to interpret and organize vast amounts of unstructured text into a clear, graph-based format. To make this work, the LLM is fine-tuned specifically for text-to-graph translation on the dataset created during annotation, extended with more sentence–graph pairs.
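The sentence–graph pairs could be prepared for fine-tuning along the following lines. This is only a sketch: the `prompt`/`completion` field names, the instruction wording, and the triple serialization are assumptions, and the format a given fine-tuning toolkit expects may differ.

```python
import json


def format_graph(triples):
    """Serialize (subject, relation, object) triples, one per line."""
    return "\n".join(f"{s} -[{r}]-> {o}" for s, r, o in triples)


def to_training_record(sentence, triples):
    """Pair one sentence with its annotated graph as a training example."""
    return {
        "prompt": f"Convert to graph:\n{sentence}",
        "completion": format_graph(triples),
    }


def to_jsonl(records):
    """One JSON object per line; newlines inside strings are escaped by json."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Illustrative annotated pair, not taken from the project's dataset.
annotated = [
    ("p53 induces apoptosis.", [("p53", "induces", "apoptosis")]),
]
records = [to_training_record(sentence, triples) for sentence, triples in annotated]
print(to_jsonl(records))
```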
Benchmarking the search function and the LLM
We evaluate the knowledge graph on its ability to make information hidden in text more accessible, measured by the overlap between KG search results and other databases. Next, we use precision, recall, and F1 to evaluate the LLM on nested-relation extraction, relation classification, coreference resolution, entity linking, and inter-sentence relations. Our information extraction differs from prior event extraction, which makes it hard to adopt existing benchmarks directly; we also sidestep inter-annotator agreement, which has limited relevance to our goals.
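For reference, precision, recall, and F1 over extracted relations can be computed as in the sketch below. It assumes exact matching of (subject, relation, object) triples against a gold annotation; the project may well score partial or nested matches differently, and the example triples are made up.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match scores over two collections of hashable items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = {
    ("p53", "induces", "apoptosis"),
    ("DNA damage", "activates", "p53"),
}
predicted = {
    ("p53", "induces", "apoptosis"),   # correct
    ("p53", "inhibits", "apoptosis"),  # wrong relation label
}
scores = precision_recall_f1(predicted, gold)
print(scores)  # (0.5, 0.5, 0.5)
```

Precision is the share of predicted triples that are correct, recall is the share of gold triples that were found, and F1 is their harmonic mean.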
By following these steps, the project aims to revolutionize how scientific data is stored, retrieved, and understood — turning raw text into structured, searchable, interconnected data.