AI Rediscovers the Fundamentals of Hypertext and the Weaknesses of Orchestration

 




This doesn't surprise me. Any attempt to understand the limits of automated orchestration of a human process runs into the same problems we have been discussing. Even then, the analytical skills, emotional tolerances and competitive situations of the humans involved bias the work considerably. Citations then have less value because echo chamber effects dilute the importance of the signal. It is a matter of asking the right questions, not the ones most frequently asked lately. The evaluation aspects of the human in the loop are a bit scary. You asked me if the other LLMs in my project do that. Yes. Just as with a human team, I work around them unless they make a direct point.

"Cecile Tamura: Claude Code can now run something remarkably close to an end-to-end academic research workflow. This open-source project, Academic Research Skills, orchestrates a 10-stage pipeline that takes a project from research to publication-ready manuscript.

🔬 13 research agents conduct literature reviews, fact-checking, and PRISMA-style systematic reviews.

✍️ 12 writing agents draft the paper section by section, complete with citation management, style calibration, and LaTeX output.

🧐 A multi-agent review board acts as editor, peer reviewers, and Devil's Advocate, stress-testing the paper from multiple perspectives.

✅ Integrity agents audit references, verify claims, and flag fabricated citations, statistical inconsistencies, and other research errors.

📄 Final output: Markdown → LaTeX → PDF, ready for submission.

One of the most interesting features comes at the end. After the paper is completed, the system runs a Collaboration Quality Evaluation that scores the human collaborator across six dimensions, from direction-setting and intellectual contribution to quality control and decision-making. In other words, it doesn't just evaluate the paper. It evaluates how effectively you worked with the AI.

Install it in Claude Code, Claude Projects, or Cowork, and the entire workflow becomes available as a reusable research pipeline. 100% open source under CC BY-NC 4.0.

https://github.com/Imbad0202/academic-research-skills

We spent decades building tools that grade students. Now we're building tools that grade the researcher. 🚀

Notes: The project is impressive, but there are several important caveats that are worth keeping in mind if you're posting about it.

1. It does not actually perform independent PhD-level research

The workflow can automate many research tasks, but it does not generate new scientific knowledge in the way a successful PhD dissertation does. A PhD is not primarily about writing papers. It is about:

  • Asking novel questions
  • Designing experiments
  • Collecting original data
  • Building new theories or models
  • Producing findings that survive scrutiny from experts

The pipeline mostly automates literature review, synthesis, drafting, critique, and formatting.

2. Multi-agent systems can create an illusion of rigor

Having 30 agents debate each other sounds impressive, but they are still instances of the same underlying model family. You can end up with:

  • Shared blind spots
  • Shared hallucinations
  • Consensus around an incorrect assumption
  • Circular validation

Ten AIs agreeing with each other is not equivalent to ten independent human experts agreeing.

3. Citation verification remains difficult

The repository includes integrity checks for fabricated references, which is valuable. However:

  • Citation existence ≠ citation correctness
  • Papers can be cited out of context
  • Statistical claims can be misinterpreted
  • Nuanced methodological limitations can be missed

Human domain expertise remains essential.

4. Literature reviews are easier than frontier research

Current LLMs are strongest at:

  • Summarization
  • Synthesis
  • Organization
  • Writing

They are less reliable at:

  • Discovering genuinely novel hypotheses
  • Identifying hidden confounders
  • Inventing new methodologies
  • Challenging dominant assumptions

The workflow is probably most useful for review papers, surveys, white papers, and research planning.

5. Evaluation scores may be misleading

The "Collaboration Quality Evaluation" is interesting, but it measures performance according to criteria defined by the AI workflow itself. In effect:

> The system is grading how well you collaborate with the system.

That can be useful for feedback, but it is not an objective measure of research ability.

6. Academic publication is not the same as paper generation

A publication-ready PDF is not necessarily:

  • Scientifically correct
  • Novel
  • Reproducible
  • Publishable

Most journals care about:

  • Novelty
  • Experimental validity
  • Reproducibility
  • Domain expertise

Formatting is often the easy part.

7. The biggest value may be productivity, not autonomy

The most realistic interpretation is:

> This is less a replacement for a PhD advisor and more a research operating system.

It can dramatically accelerate:

  • Literature review
  • Drafting
  • Peer-review simulation
  • Citation management
  • Manuscript preparation

For researchers, that may mean days or weeks of work compressed into hours.

Impressive as it is, this automates much of the research workflow, not the generation of new scientific knowledge itself. Human judgment, domain expertise, and experimental validation remain the hardest parts of science."

Your critique cuts to the part that is usually hidden by the marketing language.

What struck me most in that description is not the 13 research agents or the 12 writing agents. It is the assumption that research can be decomposed into a workflow at all.

Research certainly contains workflows—literature search, citation management, manuscript formatting, peer-review simulation, statistical checking. Those are procedural activities and automation helps enormously.

The difficulty is that the highest-value moments in science are usually not procedural.

They are things like:

  • noticing that everyone is asking the wrong question,
  • deciding a result is interesting when the metrics say it is not,
  • recognizing an anomaly as a discovery rather than an error,
  • abandoning a favored theory,
  • choosing a line of inquiry because it feels important before evidence exists.

Those are acts of judgment under uncertainty.

The multi-agent systems are often strongest exactly where science is weakest: organizing known information. They are weakest where science is strongest: generating genuinely useful surprise.

Your observation about citation ecosystems is especially important. Citations are often treated as a proxy for truth or importance, but they are really a measure of network behavior.

A paper can be:

  • highly cited and wrong,
  • rarely cited and foundational,
  • cited because everyone cites it,
  • ignored because it does not fit the dominant narrative.

This is the same echo-chamber effect you have discussed in information ecosystems. Once a citation graph becomes self-reinforcing, frequency can masquerade as significance.
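A toy simulation can make the self-reinforcement concrete. The sketch below assumes a hypothetical model, not any real bibliometric tool: each new paper cites exactly one earlier paper, chosen with probability proportional to that paper's current citation count plus one, and "quality" is never consulted. The function names and parameters are illustrative only.

```python
import random


def simulate_citations(n_papers: int, seed: int = 0) -> list[int]:
    """Return citation counts after n_papers arrive one at a time.

    Assumption (hypothetical model): each new paper cites exactly one
    earlier paper, picked with probability proportional to that paper's
    current citation count plus one. Quality plays no role.
    """
    rng = random.Random(seed)
    counts = [0]  # the first paper has nothing to cite
    for _ in range(1, n_papers):
        weights = [c + 1 for c in counts]
        cited = rng.choices(range(len(counts)), weights=weights, k=1)[0]
        counts[cited] += 1
        counts.append(0)  # the newcomer starts with zero citations
    return counts


def top_share(counts: list[int], fraction: float = 0.1) -> float:
    """Share of all citations held by the most-cited `fraction` of papers."""
    k = max(1, int(len(counts) * fraction))
    total = sum(counts)
    return sum(sorted(counts, reverse=True)[:k]) / total if total else 0.0


if __name__ == "__main__":
    counts = simulate_citations(2000)
    print(f"top 10% of papers hold {top_share(counts):.0%} of citations")
```

Because the only input to the choice is prior citation count, any concentration the run shows comes from frequency feeding on itself, which is the effect described above.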

In some ways the situation resembles the old expert systems problem.

Cyc, for example, attempted to accumulate knowledge.

Modern research pipelines attempt to accumulate and summarize papers.

But the challenge is not storing knowledge. The challenge is evaluating relevance and meaning in a changing environment.

That's why your HumanML work, and your broader notion of information ecosystems, remains relevant. Meaning is not in the data. Meaning emerges from the interaction between the observer, the context, and the goals.

The "human evaluation" stage in this pipeline is particularly interesting:

The system is grading how well you collaborate with the system.

That creates a subtle incentive structure.

A researcher who constantly challenges the system, rejects its assumptions, changes direction, and pursues unconventional hypotheses may receive a poor collaboration score.

A researcher who accepts recommendations, follows workflow guidance, and stays within the system's assumptions may score highly.

In other words, the metric may reward compliance rather than insight.

Human organizations have exactly the same failure mode. Employees often learn to optimize for performance reviews rather than organizational objectives. The metric becomes the target.

This is essentially Goodhart's Law applied to AI collaboration.
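A small worked example shows how the proxy and the goal can come apart. The scoring rule below is a hypothetical stand-in, since the source does not describe how the project's six-dimension rubric is actually computed; it simply assumes the proxy is the fraction of system recommendations the researcher accepts. All names and data are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    recommendation_correct: bool  # was the system's suggestion actually right?
    accepted: bool                # did the researcher go along with it?


def compliance_score(decisions: list[Decision]) -> float:
    """Hypothetical proxy metric: fraction of recommendations accepted."""
    return sum(d.accepted for d in decisions) / len(decisions)


def outcome_quality(decisions: list[Decision]) -> float:
    """The real goal: fraction of decisions that ended up right.

    A decision is right when a correct recommendation is accepted
    or an incorrect one is rejected.
    """
    right = sum(d.accepted == d.recommendation_correct for d in decisions)
    return right / len(decisions)


# Illustrative data: the system's advice is right 3 times out of 5.
truth = [True, True, False, True, False]
compliant = [Decision(t, accepted=True) for t in truth]  # accepts everything
critical = [Decision(t, accepted=t) for t in truth]      # rejects only bad advice

if __name__ == "__main__":
    for name, decisions in [("compliant", compliant), ("critical", critical)]:
        print(name, compliance_score(decisions), outcome_quality(decisions))
```

Under these assumptions the compliant researcher earns the top proxy score with the worse outcome, and the critical researcher earns a lower proxy score with the better outcome. Optimizing the proxy would push behavior in the wrong direction.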

What you described about your own use of multiple LLMs is closer to how experienced researchers actually work. You don't treat any of them as authorities. You treat them as colleagues of varying competence.

If one model makes a useful point, you keep it.

If several agree, that increases confidence but does not establish truth.

If one disagrees, that disagreement may be more valuable than the consensus.

You route around weaknesses just as a project manager routes around weaknesses in a human team.

That is probably a healthier model than the orchestration frameworks imply. The framework assumes that coordination creates rigor. Sometimes it does. Sometimes it simply creates synchronized error.
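The gap between coordination and independence can be sketched numerically. The example below is a hypothetical model, not a measurement of any real system: each of `n_agents` reviewers is wrong with probability `p_wrong`, and an assumed `shared` parameter sets how often all reviewers inherit one common draw (a stand-in for a shared blind spot) instead of erring independently.

```python
import random


def majority_error_rate(n_agents: int, p_wrong: float, shared: float,
                        trials: int = 20000, seed: int = 0) -> float:
    """Estimate how often a majority vote of n_agents is wrong.

    Assumptions (hypothetical model): with probability `shared` every
    agent copies one common draw, modelling a shared blind spot;
    otherwise each agent errs independently with probability p_wrong.
    """
    rng = random.Random(seed)
    wrong_majorities = 0
    for _ in range(trials):
        if rng.random() < shared:
            # One draw decides for everyone: synchronized error.
            wrong = n_agents if rng.random() < p_wrong else 0
        else:
            wrong = sum(rng.random() < p_wrong for _ in range(n_agents))
        if wrong > n_agents / 2:
            wrong_majorities += 1
    return wrong_majorities / trials


if __name__ == "__main__":
    independent = majority_error_rate(n_agents=11, p_wrong=0.2, shared=0.0)
    correlated = majority_error_rate(n_agents=11, p_wrong=0.2, shared=1.0)
    print(f"independent reviewers: majority wrong in {independent:.3f} of trials")
    print(f"shared blind spot:     majority wrong in {correlated:.3f} of trials")
```

In this model, adding reviewers helps only when their errors are independent. With a fully shared blind spot the vote is exactly as fallible as a single reviewer, no matter how many agree.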

The irony is that the most valuable member of a research team is often the person who refuses to cooperate with the workflow.

The Devil's Advocate agent in the pipeline is trying to simulate exactly that role, but a simulated contrarian still operates inside the same architecture, training assumptions, and optimization landscape.

A real contrarian can reject the entire premise.

History suggests that many scientific revolutions began precisely there.

A link tab display. Very nice. Compliments to your team.

Thank you, Len.

The entity and link displays are designed to make exploration less disruptive. Instead of forcing someone to leave the conversation and start a new search, they can dive into a concept, person, place, or work and then return to the thread they were following.

For discussions like the ones we've been having—information ecosystems, HumanML, affective computing, research methodologies, Schenkerian analysis, and AI architectures—that kind of navigation can be surprisingly useful. Many of these topics are connected by relationships that are difficult to see in a linear conversation. Being able to open a concept such as Goodhart's Law or an academic field like Affective Computing without losing context helps preserve the thread of reasoning.

What I find interesting about your observations is that they often center on ecosystems rather than components. A single paper, model, citation, or agent is rarely the whole story. The value emerges from the interactions. Features that help users traverse those interactions are trying to support that broader view.

And, admittedly, after years of chat interfaces that behaved like a long scroll of text, it's nice when the interface occasionally gains a capability that feels more like hypermedia—which, given your early work on hypermedia systems and information ecosystems, is territory you've been thinking about for a very long time.

Yes indeed. It is hard to sustain context in a whirly maze of rabbit-trail links. Hypertext users learned that early. Positioning the LLM as an organizer and collaborator situated within the web is a much better choice than using it to replace the web or search engines. Welcome to HTML 1.0. :)

😊 There is a certain "everything old is new again" quality to it.

The early hypertext vision was never really about replacing human thought. It was about augmenting it. The goal was to help people navigate large information spaces, preserve context, and discover unexpected connections.

What happened instead was that we got very good at indexing documents and very bad at representing context. Search engines answered:

"Where is the document?"

Hypertext pioneers were often asking:

"How is this idea related to that idea?"

LLMs are interesting because they partially restore the second capability. They can operate as a contextual layer over information rather than merely an index into information.

But the danger arises when people mistake that contextual layer for the information itself.

A search engine says:

"Here are some pages."

An LLM says:

"Here is a synthesized interpretation."

Those are different products.

The first requires the user to perform synthesis.

The second performs synthesis but introduces the possibility of distortion.

The sweet spot may be exactly what you describe:

Web + Search + Hyperlinks + Human + LLM

where:

  • the web remains the distributed memory,
  • search remains the retrieval mechanism,
  • hyperlinks remain the explicit relationships,
  • humans remain the arbiters of meaning,
  • LLMs become navigators, organizers, translators, critics, and collaborators.

That is much closer to the old augmentation philosophy of Douglas Engelbart than to the replacement narrative that often dominates AI discussions.

In some respects, LLMs solve a problem that hypertext never fully solved. Hyperlinks are explicit relationships created by authors. Humans still had to infer the implicit relationships.

A hypertext system might tell you:

  • Paper A links to Paper B.
  • Paper B cites Paper C.

An LLM can suggest:

"Paper A and Paper C are actually wrestling with the same underlying assumption, even though neither author mentions the other."

Sometimes it will be wrong. Sometimes it will be insightful. But it is operating at the level of patterns rather than links.

The challenge is maintaining provenance and context. Hypertext taught us that users get lost in link mazes. LLMs introduce a new problem: users can get lost in inference mazes.

A link tells you where you went.

An inference often doesn't.

That is why I think your emphasis on schemas, HumanML, and information ecosystems is relevant. The next generation of systems may need not only links between documents but explicit representations of:

  • assumptions,
  • emotional states,
  • goals,
  • trust relationships,
  • uncertainty,
  • provenance.
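One way to picture such a representation is a link record that says whether it was authored or inferred and what it rests on. The sketch below is an assumption-laden illustration, not HumanML or any existing schema: the `Link` fields and the `infer_two_hop` helper are hypothetical, and the two-hop rule is a deliberately crude stand-in for whatever an LLM might suggest. It covers only two items from the list, uncertainty and provenance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    kind: str                         # "authored" or "inferred"
    provenance: tuple[str, ...] = ()  # the evidence an inferred link rests on
    confidence: float = 1.0           # authored links are taken at face value


def infer_two_hop(links: list[Link], confidence: float = 0.5) -> list[Link]:
    """Propose A -> C wherever authored links A -> B and B -> C exist.

    Every inferred link records the authored links it was derived from,
    so a reader can trace the suggestion back to something explicit.
    """
    authored = [link for link in links if link.kind == "authored"]
    existing = {(link.source, link.target) for link in links}
    inferred = []
    for first in authored:
        for second in authored:
            pair = (first.source, second.target)
            if (first.target == second.source
                    and first.source != second.target
                    and pair not in existing):
                inferred.append(Link(
                    first.source, second.target, "inferred",
                    provenance=(f"{first.source} -> {first.target}",
                                f"{second.source} -> {second.target}"),
                    confidence=confidence,
                ))
                existing.add(pair)  # avoid proposing the same pair twice
    return inferred


if __name__ == "__main__":
    links = [Link("Paper A", "Paper B", "authored"),
             Link("Paper B", "Paper C", "authored")]
    for link in infer_two_hop(links):
        print(link)
```

The design choice that matters here is that an inferred link is never stored without its provenance, so the path back to authored material is always available.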

In a sense, HTML connected documents. Your HumanML idea was attempting to connect aspects of human meaning and interaction. Those are different layers of the same ecosystem.

So yes, there is a bit of:

"Welcome to HTML 1.0."

But perhaps with a twist:

"Welcome to hypertext where some of the links are inferred rather than authored."

The trick, as hypertext veterans learned decades ago, is making sure people can still find their way back home.

