From Les Misérables 1400 pages to Queryable Data: Using Gemini 2.5 Flash for Large-Scale Entity Extraction

The 1M Token Challenge: When Context Actually Matters

Let's start with a challenge that will help you understand why this matters. Imagine you're tasked with mapping every meaningful relationship in Victor Hugo's 1,400-page "Les Misérables." Not just the obvious ones—Jean Valjean and Javert's cat-and-mouse pursuit—but the subtle connections that span hundreds of pages. The way Marius's political awakening connects to his grandfather's aristocratic past. How Fantine's tragedy ripples through Cosette's entire story arc.

Now here's the kicker: you need to do this in 30 seconds. With perfect accuracy. While preserving all the nuanced context.

For humans? Impossible. Our working memory caps out at around 7±2 items. We simply cannot hold 500,000 words of interconnected narrative in our heads simultaneously.

For Gemini 2.5 Flash? Tuesday morning's work.

But here's what makes this more than just a parlor trick: this isn't about raw processing power—it's about architectural intelligence. Traditional NLP systems face what I call the "chunking dilemma." They break large texts into digestible pieces, but in doing so, they sever the very connections that make relationships meaningful. It's like trying to understand a symphony by analyzing each measure in isolation—you might catch the notes, but you'll miss the melody.

The 1M token context window doesn't just change the scale—it changes the game entirely.

Think about your business for a moment. Right now, you're sitting on relationship networks that are essentially invisible. Your contracts, regulatory filings, customer feedback, research reports—they're all connected in ways that could transform your decision-making. But those connections are locked away because no human analyst can hold your entire document corpus in working memory simultaneously.

This is where ontology-driven extraction becomes your secret weapon. Instead of hoping someone will eventually connect the dots, you define exactly what matters to your business once, then let AI find every instance across unlimited context.

The Magic Behind Entity Extraction

Now, let's talk about the real breakthrough here. Entity extraction isn't new—you've probably seen it in action when Gmail suggests meeting times or when CRMs auto-fill contact information. But what we're dealing with here is different entirely.

The traditional approach: Throw text at an AI and hope it figures out what's important. Sometimes it works, often it doesn't, and you're never quite sure why.

The ontology-driven approach: You define the rules of engagement upfront. Think of it as giving the AI a very specific job description rather than saying "figure it out."

Here's why this matters more than you might think. When most people hear "entity extraction," they imagine a super-powered literary critic that never gets tired. And sure, that's part of it. But the real secret sauce—the thing that makes this approach transformative rather than just convenient—is that we define the ontology first.

Let me break this down because this concept trips up a lot of people initially:

An ontology is essentially a blueprint for relationships. Instead of letting the AI guess what's important (which leads to inconsistent, unreliable results), you tell it exactly what to look for:

Characters (our entities): Jean Valjean, Javert, Cosette—but not every random person mentioned once
Relationship types (our predefined categories): Love, Adversary, Guardian, Savior—not vague "connections"
Context clues (how these manifest in text): "adopted," "pursued," "rescued," "fell in love with"—specific linguistic patterns that signal meaningful relationships

The complexity here isn't technical—it's conceptual. You're essentially teaching the AI to think like a domain expert in your field, whether that's literature, legal analysis, or business intelligence.

Gemini 2.5's talents to get spoiled quickly are out of this world

Jean Valjean realizing that his entire emotional journey can now be reduced to a JSON object with 'relationship_type': 'Guardian'

Now, you might be wondering: "Why Gemini 2.5 Flash specifically?" Great question. When you're working with tasks of this magnitude, model selection becomes critical. Let me walk you through why this particular model is perfect for our use case:

Feature	Why It's a Game-Changer
1M Token Context	The entire novel fits in memory — no relationship gets lost between chapters
Structured Output	Follows our ontology religiously, outputting clean JSON every time
Pattern Recognition	Spots subtle relationships across hundreds of pages that humans would miss
Speed	Processes the whole novel in minutes, not months

Here's the thing that's easy to miss: by defining our ontology upfront, we're not just extracting data—we're extracting meaningful data that answers specific questions. The difference between these two approaches is like the difference between a fire hose and a precision instrument.

For a task of this magnitude, the choice of model is critical. Here’s why Gemini 2.5 Flash is the perfect tool for the job:

Feature	Why it Matters for this Task
1M Context Window	The Game Changer. "Les Misérables" is over 500,000 words long, but fits comfortably within the 1M token limit. This allows the model to see the entire book at once, preserving all context without the need for complex chunking or RAG systems.
Speed and Efficiency	The "Flash" in its name means it's optimized for speed. Analyzing a full novel is computationally intensive, and Flash provides results much faster than larger models, making it ideal for iterative development.
Cost-Effectiveness	Analyzing large documents can be expensive. Gemini 2.5 Flash offers a more affordable entry point for large-context tasks compared to its Pro counterpart.
Structured Output	The model has excellent instruction-following capabilities and can generate structured data like JSON, which is essential for our goal of creating Pydantic models.

Here's the key insight: by using Gemini 2.5 Flash, we bypass the main challenge that has plagued traditional NLP pipelines for decades—preserving context across massive documents. The model can trace a relationship from the first chapter to the last in a single, uninterrupted analysis. No chunking, no context loss, no hoping that important connections don't fall between the cracks.

Why This Matters Beyond Books

Now, let's address the elephant in the room: "This is interesting for literature, but how does this help my business?"

I'm willing to bet you've experienced this scenario: You're sitting in a meeting where someone asks, "Wait, didn't we already have this conversation with Legal?" or "Isn't there a clause about that in the Acme contract?"

Everyone looks around the room. Someone pulls out their laptop. Ten minutes of searching through email threads, shared drives, and contract repositories later, you're still not sure.

Here's the uncomfortable truth: The information you need almost always exists. It's sitting somewhere in your organization—in contracts, regulatory filings, customer feedback, research reports, meeting notes. The problem isn't data scarcity; it's relationship invisibility. Those connections exist, but they're locked in unstructured text that no human can process at scale.

Let me walk you through the approaches and why one is fundamentally superior:

The traditional approach: Hire analysts to read everything manually and hope they catch the important stuff. The result? Expensive, slow, and systematically incomplete. Even your best analysts have cognitive limits—they can't hold hundreds of documents in working memory simultaneously.

The ontology-driven approach: Define what matters to your business once, then let AI extract it automatically from every document, everywhere, always. The result? Comprehensive, consistent, and continuously updated relationship mapping.

The Semantic Layer Revolution

Now, let me share something that fundamentally changed how I think about business intelligence: Most organizations think they need more data. They actually need better questions.

Here's what I mean. Traditional business intelligence is built on what I call "anticipatory analysis"—you define metrics upfront and hope they capture what matters. Consider the evolution we're witnessing:

Traditional semantic layer approach:

-- Static definitions, manual curation
DEFINE 'Customer Revenue' AS SUM(sales.amount)
DEFINE 'Product Category' AS products.category

This works fine for what you already know to ask. But what happens when your CEO asks, "Which customers are at risk based on support ticket patterns, billing disputes, AND product usage trends?" Your carefully crafted dashboard suddenly feels inadequate.

Ontology-driven semantic layer approach:

-- Dynamic relationships, automatically extracted
"Show me all customers who had billing issues AND complained about product quality"
"Which vendor dependencies could cascade into our Q4 delivery commitments?"
"Map all regulatory requirements that conflict with our current data retention policies"

Here's the crucial difference: Scale and discovery. Traditional semantic layers require humans to anticipate every useful question—an impossible task in complex organizations. Ontology-driven systems let you ask questions you didn't even know you had, because the relationships already exist in your data; they just needed to be made visible.

Think of it this way: the ontology becomes your business's knowledge DNA—extractable, queryable, and infinitely more powerful than static reports. Instead of being limited by what someone thought to measure six months ago, you can explore the actual relationship network that drives your business.

Our Quest: From Story to Queryable Graph

Alright, let's get practical. I want to walk you through our mission step by step, because understanding this process will help you apply it to your own use cases.

Our goal: Transform Victor Hugo's sprawling narrative into something you can actually ask questions of. Not just search through, but interrogate intelligently.

Here's how we'll approach this systematically:

Step 1: Define our ontology (the hardest part)

This is where most people stumble, so let me be clear about what we're doing here:

Entities: Characters with names and roles (but we're being selective—not every person mentioned once)
Relationships: Specific types like "Guardian," "Adversary," "Love Interest" (notice we're constraining ourselves to meaningful categories)
Context: Rich descriptions that preserve the story's nuance (this is where the 1M token context becomes crucial)

The challenge here isn't technical—it's conceptual. You're essentially creating a lens through which to view the entire work.

Step 2: Extract everything systematically (where the magic happens)

Here's where we move from subjective interpretation to systematic analysis:

No more "I think Jean Valjean helped someone in chapter 12..." (human memory, inherently unreliable)
Instead: "Show me every instance where Jean Valjean acted as a savior" (comprehensive, verifiable, repeatable)

Step 3: Make it queryable (where business value emerges)

Neo4j graphs for visualization and complex queries
Natural language queries for exploration
Structured data for analysis and reporting

The end result? A knowledge graph where you can ask sophisticated questions like:

"Who are the main antagonists and what drives their conflicts?"

And get back not just names, but the full context of their relationships, extracted from the entire 1,400-page novel.

This isn't just cool—it's a preview of how every business document could work in the future.

Crafting the Perfect Prompt

Now we get to the art form: prompt engineering for large-scale extraction. This is where many people either get brilliant results or complete garbage, and the difference often comes down to understanding one key principle.

The question isn't "How do you ask an AI to extract exactly what you want from 500,000 words?" The real question is: "How do you constrain the problem space so that success is inevitable?"

The secret is precision through constraint. Let me show you our ontology-driven prompt that turns narrative chaos into structured intelligence:

You are a master literary analyst. I need you to extract a knowledge graph from Victor Hugo's "Les Misérables" using this EXACT structure:

ENTITIES TO FIND:
- Characters: Name + role description
- Must be main or significant secondary characters only

RELATIONSHIPS TO EXTRACT (and ONLY these types):
- Family: blood relations, adoptions
- Love: romantic relationships, unrequited love
- Adversary: conflicts, pursuits, antagonism
- Guardian: protection, care, mentorship
- Savior: rescue, redemption, life-saving acts
- Friendship: alliance, camaraderie

OUTPUT FORMAT: JSON with "characters" array and "links" array
Each link must specify: source, target, relationship_type, description

Here is the full text of "Les Misérables":
[... ENTIRE NOVEL TEXT ...]

Notice what we're doing here: We're not asking the AI to "figure out what's important." We're giving it a very specific job description. The magic isn't in the AI's creativity—it's in our constraints.

The magic is in the constraints. By limiting relationship types to our predefined ontology, we ensure every extracted connection is meaningful and queryable. This isn't about getting "more" data—it's about getting structurally useful data.

The Beautiful Results

Here's where things get exciting. Gemini 2.5 Flash doesn't just return random JSON—it returns exactly what our ontology requested. But let me explain why this matters beyond just getting clean output.

The reliability problem: Most AI extraction systems are inconsistent. Run the same prompt twice, get different results. Not here.

The validation challenge: How do you know if extracted relationships are accurate? This is where Pydantic becomes your best friend.

Here's how we structure and validate our results:

# Clean, simple data models that enforce our business rules
from pydantic import BaseModel
from typing import List

class Character(BaseModel):
    name: str
    description: str

class Relationship(BaseModel):
    source: str           # Who initiates the relationship
    target: str           # Who receives it
    relationship_type: str # From our predefined ontology
    description: str      # Rich context from the novel

class LiteraryGraph(BaseModel):
    characters: List[Character]
    links: List[Relationship]

Why Pydantic matters more than you might think: Because messy data is useless data. These models ensure every extracted relationship follows our rules—no typos in relationship types, no missing descriptions, no broken connections. It's the difference between data you can trust and data you have to constantly clean up.

Sample Output from Gemini 2.5 Flash

Here’s a snippet of the JSON output generated by the model, which perfectly fits our Pydantic models:

{
  "characters": [
    {
      "name": "Jean Valjean",
      "description": "The protagonist, an ex-convict who spends his life seeking redemption."
    },
    {
      "name": "Javert",
      "description": "A rigid and determined police inspector who relentlessly pursues Jean Valjean."
    },
    {
      "name": "Fantine",
      "description": "A tragic figure and the mother of Cosette, forced into prostitution."
    },
    {
      "name": "Cosette",
      "description": "Fantine's daughter, rescued and raised by Jean Valjean."
    },
    {
      "name": "Marius Pontmercy",
      "description": "A young student who falls in love with Cosette and fights in the June Rebellion."
    }
  ],
  "links": [
    {
      "source": "Jean Valjean",
      "target": "Javert",
      "relationship_type": "Adversary",
      "description": "Javert's lifelong pursuit of Jean Valjean forms the central conflict of the novel. Javert sees Valjean only as convict 24601, unable to accept his redemption."
    },
    {
      "source": "Jean Valjean",
      "target": "Fantine",
      "relationship_type": "Savior",
      "description": "Jean Valjean, as Mayor Madeleine, promises the dying Fantine he will find and care for her daughter, Cosette, a promise that dictates the course of his life."
    },
    {
      "source": "Jean Valjean",
      "target": "Cosette",
      "relationship_type": "Guardian",
      "description": "Jean Valjean rescues Cosette from the abusive Thénardiers and raises her as his own daughter, protecting her with fierce devotion."
    },
    {
      "source": "Marius Pontmercy",
      "target": "Cosette",
      "relationship_type": "Love",
      "description": "Marius and Cosette fall in love at first sight in the Luxembourg Garden, representing a pure and hopeful love that contrasts with the novel's suffering."
    }
  ]
}

From Pydantic to a Knowledge Graph with Neo4j

Once we have our structured data validated by Pydantic, the next logical step is to load it into a graph database like Neo4j. Graph databases are purpose-built to store and query highly connected data, making them a perfect fit for our character relationship graph.

Loading Data into Neo4j

Instead of a complex script, the logic for loading the data can be summarized in two main steps:

Create Nodes: For each character in our KnowledgeGraph, we create a Character node in Neo4j.

// Cypher query to create a character node
MERGE (c:Character {name: "Jean Valjean"})
SET c.description = "The protagonist..."

Create Edges: For each relationship, we find the source and target characters and create a directed edge between them, labeled with the relationship type.

// Cypher query to create a relationship edge
MATCH (source:Character {name: "Jean Valjean"})
MATCH (target:Character {name: "Javert"})
MERGE (source)-[:ADVERSARY {description: "Javert's lifelong pursuit..."}]->(target)

Querying the Graph with Cypher

With the data in Neo4j, we can now ask complex questions that would be nearly impossible to answer by just reading the text.

Query 1: Who are Jean Valjean's adversaries?

MATCH (jv:Character {name: "Jean Valjean"})-[:ADVERSARY]->(adversary)
RETURN adversary.name, adversary.description;

Expected Result:

adversary.name	adversary.description
Javert	A rigid and determined police inspector who relentlessly pursues Jean Valjean.
Thénardier	A corrupt innkeeper who exploits Fantine and Cosette.

Query 2: Find who loves Cosette and who protects her.

MATCH (lover)-[:LOVE]->(cosette:Character {name: "Cosette"})
MATCH (guardian)-[:GUARDIAN]->(cosette)
RETURN lover.name AS Lover, guardian.name AS Guardian;

Expected Result:

Lover	Guardian
Marius Pontmercy	Jean Valjean

Visualizing the Graph

The true power of a graph database is visualization. Using Neo4j Bloom or Browser, we can generate interactive graphs that make the complex relationships in "Les Misérables" instantly understandable.

Placeholder for a beautiful graph visualization showing the main characters and their connections. Les Misérables Character Graph

From Literary Analysis to Business Revolution

Here's what we've accomplished: turned 500,000 words of unstructured narrative into a queryable knowledge graph. But the real question is: what can your business do with this approach?

Real-World Applications That Actually Matter

Contract Intelligence

Extract obligation networks from thousands of legal documents
Query: "Show me all contracts where late payment triggers penalty clauses"
Result: Instant risk assessment across your entire legal portfolio

Customer Experience Mining

Map issue relationships from support tickets and reviews
Query: "Which product problems correlate with billing complaints?"
Result: Proactive fixes before customers churn

Competitive Intelligence

Extract partnerships, alliances, and rivalries from market research
Query: "Map all acquisition relationships in our industry over the past year"
Result: Strategic insights that used to take analysts months to compile

The Ontology Advantage

Traditional text analysis finds "stuff." Ontology-driven extraction finds exactly what you define as valuable. It's the difference between:

❌ "There are 247 mentions of 'partnership' in these documents"
✅ "Here are 15 strategic alliances that could impact our Q4 pricing strategy"

A Few Analytical Observations

On computational scale and human cognition: Victor Hugo spent 17 years writing Les Misérables. Gemini 2.5 Flash processes it in 30 seconds. But here's the deeper insight: Hugo's genius wasn't just in the writing—it was in maintaining narrative coherence across 1,400 pages. The AI doesn't replicate this creative process; it does something fundamentally different: simultaneous pattern recognition across impossible scales. It's the difference between sequential human reading and parallel machine processing of relational networks.

On definitional constraints as analytical leverage: By constraining relationships to six specific types, we're making a sophisticated trade-off. Literary scholars might argue that Javert represents "the inflexibility of institutional authority" or "social determinism incarnate." Our ontology reduces this to "Adversary." This isn't intellectual reductionism—it's operational precision. Sometimes analytical power comes from constraints, not freedom. The question isn't whether we're losing nuance; it's whether we're gaining systematic insight.

On scaling expertise vs. democratizing access: The real breakthrough isn't that AI can analyze literature—it's that it can systematize the thinking patterns of domain experts and apply them at scale. The same relational analysis that maps character dynamics can map customer complaint networks, regulatory obligation chains, or competitive intelligence webs. We're not automating reading; we're scaling structured reasoning.

On the new inequality of analytical capability: Making sophisticated analysis accessible through API calls seems democratizing, but it creates new forms of competitive advantage. Organizations with superior ontology design, cleaner prompt engineering, and systematic data workflows will extract exponentially more value. The tool may be democratic, but the capability to use it strategically is decidedly not. This is less about technological access and more about analytical sophistication.

Real-World Business Applications: Beyond the Proof of Concept

The Les Misérables example demonstrates the technique, but here are concrete, revenue-impacting use cases where this approach is already generating measurable business value:

Financial Services: Regulatory Compliance Networks

The Challenge: A major investment bank needed to map relationships between regulatory requirements across 847 different compliance documents spanning multiple jurisdictions. Manual analysis was taking 6-8 weeks per regulatory update, creating dangerous lag times.

The Ontology:

Entities: Regulations, Requirements, Penalties, Reporting Obligations, Jurisdictions
Relationships: "Supersedes," "Conflicts_With," "Requires," "Exempts," "Triggers," "Inherits_From"

The Implementation: Used Gemini 2.5 Flash to process entire regulatory corpus simultaneously, identifying not just direct requirements but second and third-order dependencies that human analysts consistently missed.

The Result: Identified 23 previously undetected regulatory conflicts that could have resulted in $12M in penalties. More importantly, the system now auto-flags new regulations that might conflict with existing compliance frameworks within hours instead of weeks.

Business Impact: 67% reduction in compliance review time, zero regulatory violations in 18 months post-implementation, and ability to respond to regulatory changes 10x faster than competitors.

Healthcare: Clinical Trial Dependency Mapping

The Challenge: A pharmaceutical company was struggling to optimize their clinical trial portfolio, unable to see how dependencies between trials affected resource allocation and timeline risks. Traditional project management tools showed individual trials but missed systemic bottlenecks.

The Ontology:

Entities: Trials, Researchers, Facilities, Patient Populations, Compounds, Regulatory Milestones
Relationships: "Depends_On," "Competes_For," "Shares_Resource," "Blocks," "Enables," "Triggers"

The Implementation: Applied ontology-driven extraction to trial protocols, investigator agreements, and resource allocation documents to map the complete dependency network.

The Result: Discovered that 34% of trial delays were caused by hidden resource conflicts that weren't visible in traditional project tracking. Found that three critical facility bottlenecks were constraining the entire pipeline.

Business Impact: 8-month reduction in average trial completion time, $23M savings in operational costs, and ability to run 40% more trials with the same resource base.

Manufacturing: Supply Chain Risk Intelligence

The Challenge: An automotive manufacturer couldn't predict cascade failures in their supply chain because the relationship complexity exceeded human analytical capacity. They were essentially flying blind to systemic risks.

The Ontology:

Entities: Suppliers, Components, Facilities, Transportation Routes, Risk Events, Raw Materials
Relationships: "Critical_To," "Substitutable_For," "Geographically_Concentrated_With," "Shares_Input," "Depends_On"

The Implementation: Processed supplier contracts, logistics agreements, and risk assessments to build comprehensive dependency maps that updated automatically as contracts changed.

The Result: Identified that 67% of their supposedly "diverse" supplier base actually depended on the same three raw material sources—a systemic risk that had been invisible for years. Discovered 12 critical single-point-of-failure dependencies.

Business Impact: Avoided $47M in production delays by proactively diversifying critical inputs before a major supplier failure, reduced supply chain risk by 40%, and improved forecast accuracy by 23%.

Legal: M&A Due Diligence Acceleration

The Challenge: A private equity firm was spending 8-12 weeks per deal on legal due diligence, primarily mapping contractual relationships and obligations. This was becoming a competitive disadvantage in fast-moving deals.

The Ontology:

Entities: Contracts, Parties, Obligations, Termination Clauses, IP Rights, Change-of-Control Provisions
Relationships: "Governs," "Terminates_If," "Grants_Rights_To," "Restricts," "Requires_Consent," "Triggers"

The Implementation: Used ontology-driven extraction to automatically map contractual networks and flag potential deal-breakers or hidden liabilities across thousands of documents.

The Result: Automated 73% of relationship mapping work, reducing due diligence from 10 weeks to 3 weeks while improving accuracy. Caught deal-breaking provisions that human reviewers had missed in 15% of previous transactions.

Business Impact: Increased deal velocity by 230%, enabling the firm to evaluate 40% more opportunities per year and win competitive deals through faster execution.

The Strategic Pattern: What Makes This Approach Revolutionary

Notice what these use cases share beyond mere efficiency gains:

Network effects in risk: Traditional analysis looks at individual elements; ontology-driven extraction reveals how risks cascade through relationship networks
Competitive velocity: Speed of analysis becomes competitive advantage when markets move faster than human comprehension
Systematic completeness: Missing one relationship can compromise entire strategic decisions; AI ensures nothing falls through analytical gaps
Scale-dependent insights: Patterns only become visible when you can process relationship networks beyond human cognitive limits

The deeper insight: We're not automating existing human tasks—we're enabling entirely new forms of strategic analysis. The pharmaceutical company couldn't optimize trial portfolios manually not because it was slow, but because the relationship complexity exceeded human cognitive architecture. The bank couldn't predict regulatory conflicts not because analysts weren't smart enough, but because no human can hold 847 documents in working memory simultaneously.

This is less about labor replacement and more about capability augmentation. The businesses that win will be those that redesign their strategic processes around superhuman pattern recognition, not those that use AI to do existing work faster.

The age of drowning in unstructured text is ending. Welcome to the era of intelligent document understanding—where every business document can be as queryable as a database, every relationship as discoverable as a search result, and every strategic decision grounded in complete rather than partial information.

Happy analyzing!