Literature Review — knowledge graphs & information extraction

Historical context

In 1982, Japan launched a bid to become the global AI superpower: a 10-year initiative known as the Fifth Generation Computer Systems (FGCS) project. It was revolutionary, but it failed. The idea was to run a large number of processors in parallel, at a time when individual CPUs were far less powerful than they are today. Modern AI rests on a similar foundation: NVIDIA’s GPUs are essentially many simplified cores, each weak alone but powerful as a whole.

Apart from the hardware assumption, the project bet on logic rather than probability: symbolic reasoning instead of the probabilistic neural networks behind today’s next-token-prediction models. That logic-based approach was never implemented at the scale that today’s transformer-based models (GPT, DeepSeek) reach.

Douglas Lenat, founder of the Cyc project, gives about half a dozen reasons for the failure. In his words, “one of the reasons they failed was because they tried to represent knowledge as a tree, rather than as a graph.” Knowledge resists strictly hierarchical representation, because a single concept often belongs under several parents at once.

Douglas Lenat on why the Japanese Fifth Generation effort failed.
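
To make the tree-versus-graph point concrete, here is a minimal sketch. The toy taxonomy (tomato, fruit, vegetable) and the function names are invented for illustration and have nothing to do with how the Fifth Generation project or Cyc actually stored knowledge. It only shows that a one-parent-per-node structure silently loses a fact that a graph keeps.

```python
def tree_ancestors(node, parent_of):
    """Walk up a tree, where each node has at most one parent."""
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node)
    return result


def graph_ancestors(node, parents_of):
    """Collect every concept reachable by following parent edges in a graph."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Tree: one parent slot per concept (toy data, invented for illustration).
parent_of = {"tomato": "fruit", "fruit": "plant product"}
parent_of["vegetable"] = "plant product"
parent_of["tomato"] = "vegetable"  # overwrites the earlier "fruit" link

# Graph: a concept may have any number of parents.
parents_of = {
    "tomato": {"fruit", "vegetable"},
    "fruit": {"plant product"},
    "vegetable": {"plant product"},
}

tree_result = tree_ancestors("tomato", parent_of)
graph_result = graph_ancestors("tomato", parents_of)
print(tree_result)           # "fruit" is gone
print(sorted(graph_result))  # both parents survive
```

In the tree version, recording that a tomato is a vegetable erases the fact that it is also a fruit. The graph version keeps both, which is the kind of flexibility Lenat's remark points at.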

Lenat embarked on Cyc, a logic-based knowledge-representation project, in 1984, publishing the first Cyc results in 1985. The first stage relied on extensive manual axiom-writing, and the work eventually led to OpenCyc 4.0 (June 2012). The open project shut down in 2017, but Lenat continued the research at Cycorp, the company behind Cyc.

Topic

Popularity-based, keyword, and semantic searches are useful for everyday tasks, but they have significant limitations in scientific research, where researchers need highly specific details that are often mentioned only tangentially. Information Extraction (IE) offers two remedies: apply IE after the search to reduce manual skimming, or apply it beforehand to generate a data structure that makes the search itself more efficient. The first remains constrained by traditional techniques; the second, by building a knowledge graph from the extracted data, lets researchers bypass the documents and target the information directly.
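
As a rough illustration of the second option, the sketch below stores extracted facts as (subject, relation, object) triples and answers a question by pattern matching instead of document search. Everything in it is an assumption made for the example: the `Triple` and `KnowledgeGraph` names, the relation labels, and the facts themselves are placeholders, not output from any real IE model or the design of this project.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str
    source: str  # id of the document the fact was extracted from


class KnowledgeGraph:
    """A deliberately tiny in-memory triple store."""

    def __init__(self) -> None:
        self._triples: List[Triple] = []

    def add(self, subject: str, relation: str, obj: str, source: str) -> None:
        self._triples.append(Triple(subject, relation, obj, source))

    def query(
        self,
        subject: Optional[str] = None,
        relation: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> List[Triple]:
        """Return triples matching every field that is not None."""
        return [
            t
            for t in self._triples
            if (subject is None or t.subject == subject)
            and (relation is None or t.relation == relation)
            and (obj is None or t.obj == obj)
        ]


# Placeholder facts standing in for the output of an IE step
# that ran before any search took place.
kg = KnowledgeGraph()
kg.add("compound_A", "inhibits", "enzyme_B", "paper_1")
kg.add("compound_A", "measured_in", "mouse_model", "paper_1")
kg.add("compound_C", "inhibits", "enzyme_B", "paper_2")

# "What inhibits enzyme_B?" is answered without opening a document.
hits = kg.query(relation="inhibits", obj="enzyme_B")
for t in hits:
    print(f"{t.subject} (from {t.source})")
```

Keeping the `source` field on every triple means the researcher can still jump back to the original paper when a fact needs checking, even though the query itself never touches the text.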

The gap

State-of-the-art IE models predominantly rely on supervised fine-tuning of LLMs, so their effectiveness depends on the quality and scope of their training data. Creating such datasets is resource-intensive, which leads to datasets that focus on narrow topics and discard any information outside their scope. Each dataset becomes a distinct extraction task with its own annotation scheme. Carried into applications, this fragments information across separate databases, each with its own schema and ontology.

The relevant literature divides along the two elements of this project — navigate to whichever excites you more:

Information-extraction literature
Universal knowledge-representation literature
