I worked on a project for the Graph For All Million Dollar Challenge that was (basically) a real-life Pokedex.
My team ended up winning Most Applicable (1st place), which felt like the perfect category for something meant to be used outside a demo.
The headline idea was simple: point your phone at an animal or insect and get a useful identification back.
But we weren’t aiming for a “cat vs dog” demo. We wanted something that could keep narrowing the answer down to genus (and ideally beyond) while staying fast enough to feel interactive.
The tempting move is to throw one huge model at it. We didn’t. The whole system is basically an argument that, in the right shape of problem, bigger isn’t always better.
This post is about the machine learning approach. The graph database mattered, but mostly as a way to organize decisions, not as the story.
The problem with one big model
The obvious solution is a single large classifier that outputs thousands of classes.
It works, but it comes with tradeoffs:
- It gets heavy quickly (size, latency, memory).
- Retraining is expensive when you add categories.
- It becomes harder to reason about failures (“why did it think this beetle was a bird?”).
We wanted a system that could evolve over time, where adding a new branch doesn’t mean rebuilding the whole world.
A different approach: route through a taxonomy
Instead of one massive model, we designed a pipeline of smaller models.
Conceptually:
- Run a cheap, high-recall model to determine a coarse bucket.
- Use that result to pick the next, more specialized classifier.
- Repeat until confidence drops or we hit the desired resolution.
If you’re thinking “decision tree”, you’re not far off. The twist is that the tree mirrors biological taxonomy.
In practice, it looks like this:
- “is this an insect, bird, mammal, fish, reptile, amphibian?”
- if mammal: “four legs? hooves? paws?”
- if paws: “cat-family vs dog-family vs something else”
- then down into smaller and smaller subcategories
The goal isn’t to be perfect at each step. The goal is to be usefully right while staying cheap.
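To make the shape of that concrete, here's a minimal sketch of the cascade in Python. Everything in it is illustrative rather than the code we shipped: `TaxonNode`, the greedy walk, and the 0.6 threshold are stand-ins, and each "model" is just a callable that scores its node's children.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# A "model" here is anything that maps an image to scores over child taxa.
Classifier = Callable[[object], Dict[str, float]]

@dataclass
class TaxonNode:
    name: str
    classifier: Optional[Classifier] = None          # None = no finer model yet
    children: Dict[str, "TaxonNode"] = field(default_factory=dict)

def classify(image, node: TaxonNode, min_conf: float = 0.6) -> str:
    """Greedy walk: keep descending while the current model is confident."""
    while node.classifier is not None:
        scores = node.classifier(image)
        best, p = max(scores.items(), key=lambda kv: kv[1])
        if p < min_conf or best not in node.children:
            break                                     # stop and report this level
        node = node.children[best]
    return node.name
```

The useful property is that each node owns a small model trained only on its own children, so adding a new branch means training one new model, not rebuilding the whole world.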
Why the graph helped (without being the point)
We modeled the taxonomy and its relationships as a graph so the system could answer questions like:
- given a prediction, what are the valid next classifiers?
- if we have multiple plausible branches, which ones do we run next?
- if a branch doesn’t exist yet, what’s the nearest ancestor we can confidently return?
This let us treat “classification” as a traversal problem:
image -> coarse prediction -> candidate nodes -> pick next model(s) -> refine
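As a toy illustration of that traversal (the edges, coverage set, and function names below are made up, and in the real system this lived in the graph database rather than Python dicts), those three questions turn into very small lookups:

```python
# Toy taxonomy: parent -> children edges, plus the set of taxa that
# currently have a trained model attached.
CHILDREN = {
    "animal":    ["mammal", "bird", "insect"],
    "mammal":    ["carnivora", "ungulate"],
    "carnivora": ["felidae", "canidae"],
}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}
HAS_MODEL = {"animal", "mammal", "bird", "carnivora"}

def next_models(scores, min_p=0.2):
    """Given coarse scores over candidate taxa, which specialized models run next?"""
    return [taxon for taxon, p in scores.items() if p >= min_p and taxon in HAS_MODEL]

def nearest_covered(taxon):
    """If a branch doesn't exist yet, back up to the closest ancestor we do cover."""
    while taxon not in HAS_MODEL and taxon in PARENT:
        taxon = PARENT[taxon]
    return taxon

# next_models({"mammal": 0.52, "bird": 0.31, "reptile": 0.12}) -> ["mammal", "bird"]
# nearest_covered("felidae") -> "carnivora"
```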
“Bunch of vague models” (on purpose)
We intentionally leaned into the idea of many smaller, somewhat-vague models.
For example, a coarse model might say:
- 0.52 mammal
- 0.31 bird
- 0.12 reptile
That doesn’t need to be a final answer. It’s a routing hint.
From there, we can run the next-stage models for the top 1-2 branches and keep going until one of a few stop conditions is met (sketched in code after this list):
- confidence is high enough to label
- confidence is low enough that we should fall back to a broader category
- we hit a compute budget (latency)
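Here's roughly what that routing loop looks like. It's deliberately simplified: a flat registry of hypothetical per-taxon models stands in for the real graph lookups, and the thresholds, top-k, and 200 ms budget are placeholders, not tuned values.

```python
import time
from typing import Callable, Dict

# Hypothetical registry of small per-taxon models: each callable scores
# the children of its taxon.
Classifier = Callable[[object], Dict[str, float]]
CLASSIFIERS: Dict[str, Classifier] = {}
ROOT = "animal"

def route(image, explore: float = 0.30, accept: float = 0.60,
          top_k: int = 2, budget_s: float = 0.200) -> str:
    """Expand at most top_k plausible branches per level; stop on confidence,
    on a missing model, or when the latency budget runs out."""
    start = time.monotonic()
    frontier = [(1.0, 0, ROOT)]            # (path confidence, depth, taxon)
    best = (0, 1.0, ROOT)                  # deepest label we're willing to report
    while frontier and time.monotonic() - start < budget_s:
        conf, depth, taxon = frontier.pop(0)
        model = CLASSIFIERS.get(taxon)
        if model is None:
            continue                       # no finer model yet: this branch is done
        ranked = sorted(model(image).items(), key=lambda kv: -kv[1])[:top_k]
        for child, p in ranked:
            if p < explore:
                continue                   # too uncertain to even explore this branch
            frontier.append((conf * p, depth + 1, child))
            if p >= accept:                # confident enough to use as the label
                best = max(best, (depth + 1, conf * p, child))
    return best[2]                         # falls back to a broader category if needed
```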
This was the core bet: a chain of small models could be faster and easier to maintain than a single giant model, especially as the taxonomy grows.
Handling reality: uncertainty is a feature
In the real world, photos are messy:
- partial views
- motion blur
- lighting and shadows
- backgrounds doing half the classification by accident
So we planned for uncertainty as a first-class output.
If the system can’t confidently reach genus, it should say something like:
“Likely a ground beetle (Carabidae). Not confident enough to pick genus.”
Returning a high-quality “I don’t know” is better than hallucinating a specific label.
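One sketch of how that can be phrased mechanically (the rank and common-name lookups here are hypothetical; that kind of metadata would naturally live on the taxonomy graph):

```python
# Hypothetical metadata lookups, hard-coded for the example.
RANK = {"Carabidae": "family", "Harpalus": "genus"}
COMMON_NAME = {"Carabidae": "ground beetle"}

def describe(label: str, target_rank: str = "genus") -> str:
    """Phrase a stopped-early prediction as an honest answer instead of a guess."""
    rank = RANK.get(label, "unknown rank")
    common = COMMON_NAME.get(label, label)
    if rank == target_rank:
        return f"{common} ({label})."
    return f"Likely a {common} ({label}). Not confident enough to pick {target_rank}."

# describe("Carabidae") -> "Likely a ground beetle (Carabidae).
#                           Not confident enough to pick genus."
```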
The open-source angle: sightings + feedback loops
The part I was most excited about was open-sourcing it as a citizen-science tool:
- let users log sightings (photo, location, time)
- allow “suggested ID” vs “confirmed ID”
- surface low-confidence / high-novelty sightings to reviewers
- give the scientific community a workflow to validate and improve the dataset
Over time, this becomes a feedback loop:
- users submit sightings
- experts verify or correct
- the dataset improves
- the models get better
And if something looks truly new (or just rare), the system can elevate it instead of forcing it into the closest known bucket.
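As a sketch of the minimum a sighting record would need to drive that loop (all field names are hypothetical, and the novelty score is just a placeholder for "unlike anything we know"):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Sighting:
    photo_uri: str
    lat: float
    lon: float
    seen_at: datetime
    suggested_id: str                    # what the model pipeline proposed
    confidence: float                    # how sure the pipeline was
    novelty: float = 0.0                 # hypothetical "unlike anything known" score
    confirmed_id: Optional[str] = None   # filled in by a human reviewer

def needs_review(s: Sighting, low_conf: float = 0.5, high_novelty: float = 0.8) -> bool:
    """Surface low-confidence or high-novelty sightings nobody has confirmed yet."""
    return s.confirmed_id is None and (s.confidence < low_conf or s.novelty > high_novelty)
```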
What I’d build next
If I revisited this today, I’d push on three things:
- Calibration: make probabilities meaningful so routing decisions are stable (one lightweight option is sketched after this list).
- Active learning: spend training budget where the model is most uncertain.
- On-device constraints: aggressive quantization and a hard compute budget.
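On calibration: one lightweight option (named here as an example, not something we built) is temperature scaling: fit a single scalar on held-out data and divide the logits by it before the softmax, so the confidences that drive routing mean what they say. The snippet below only shows the effect of the temperature, not the fitting.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; T > 1 flattens over-confident score vectors."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Same logits, raw vs. softened by a (hypothetically fitted) temperature of 2.
logits = [4.0, 2.5, 0.5]
print([round(p, 2) for p in softmax(logits)])                   # ~[0.80, 0.18, 0.02]
print([round(p, 2) for p in softmax(logits, temperature=2.0)])  # ~[0.61, 0.29, 0.11]
```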
The point isn’t to “beat” large models. It’s to build something you can iterate on, ship, and keep improving without turning every update into a multi-week training event.