The AI models I use and what I think of them
I mostly rotate between three models: Claude, ChatGPT, and Gemini. I don’t treat them like interchangeable “AI assistants.” They have different personalities in practice: different failure modes, different strengths, and different areas where they save me real time.
This isn’t a benchmark post. It’s the way I compare them as an operator: Can it help me ship work faster without adding risk? Can it stay consistent across a longer task? Does it make up details when it doesn’t know? Does it help me think, or does it just produce text that sounds right?
What I’ll cover:
- The differences between each model, as I experience them
- How I use them in day-to-day work (writing, software, analysis, and workflows)
- A straightforward review of each one, including what I avoid using them for
Models covered and how I’ll compare them
I use Claude, ChatGPT, and Gemini
These are the three I keep coming back to:
- ChatGPT: my default for writing support and structured “take this messy thing and make it coherent” work, plus ad hoc SQL debugging and related tasks.
- Claude: my primary coding model when I want careful reasoning, refactors, and larger-codebase style changes. Also good for UI text and design critique.
- Gemini: I use it most when I’m already inside Google tooling, and for tasks that benefit from browser-based checking or cross-referencing.
What I’ll cover
Differences between each model
I pay attention to:
- Consistency under constraints: does it keep the rules straight over multiple turns?
- Error behavior: when it’s wrong, is it wrong confidently? Does it acknowledge uncertainty?
- Usefulness per minute: does it reduce my work, or just rephrase it?
- Code and query quality: does it propose changes I’d actually merge?
How I use them
I don’t use them as “do my job for me” systems. I use them to:
- Increase throughput on routine work (review, refactor, summarize)
- Reduce friction in analysis (SQL debugging, schema sketches)
- Improve clarity (writing edits, structure, tone)
- Catch obvious mistakes before a human review
My review of each one
I’ll be specific about what I reach for each model to do, and what I’ve learned not to ask it to do.
ChatGPT
ChatGPT is the one I use most for writing support and for “turn this into a usable artifact” work: outlines, drafts, rewrites, meeting notes turned into decisions, that kind of thing. On the coding side, I use it heavily for SQL because it’s fast and generally pragmatic.
Use cases
Writing
Review my writing
This is the core value for me: I can write in my normal voice, then ask it to clean up the grammar and tighten the structure without turning it into polished corporate filler.
The prompting that works best is specific. I don’t say “make this better.” I say things like:
- Correct grammar and punctuation.
- Keep my voice plain and direct.
- Remove filler and hedge words.
- If something is unclear, ask a question instead of guessing.
What it does well:
- Identifies clunky sentences and repetition
- Fixes grammar without changing meaning
- Suggests sharper verbs and removes unnecessary modifiers
Where it can go wrong:
- It sometimes “smooths” the writing into something generic.
- If you don’t constrain it, it may change your intent while improving flow.
Correct grammar / suggest edits (before/after example)
Here’s a representative example of the kind of editing pass I use it for.
Before
We basically had a bunch of automations running but nobody really knew which ones were still needed. Some of them were failing sometimes but we didn’t have alerts set up properly and it was kind of unclear who owned what. I think we should just clean it up and document it so it’s not tribal knowledge.
After (ChatGPT-style edit with constraints to stay plain-spoken)
We had several automations running without clear ownership. A few were failing intermittently, and our alerting wasn’t reliable, so issues often went unnoticed. The next step is to prune what we no longer need and document what remains so it isn’t tribal knowledge.
That’s a real win for me: same meaning, less drift, more readable.
Content generation
I use ChatGPT for content generation when I already know what I want to say, but I don’t want to start from a blank page. It’s good at producing a workable structure that I can then rewrite.
I don’t ask it to invent expertise or provide authoritative claims. I treat it like a drafting assistant: “give me a first pass that matches these constraints.”
Example: generated outline or draft
If I’m writing something like an internal memo about improving reporting reliability, I might give it a few bullet points and ask for a draft.
Input (what I’d provide)
- Audience: operators and analysts
- Problem: reports break due to upstream schema changes
- Constraints: no blame; propose process changes
- Requirements: add ownership, monitoring, and a change log
- Tone: direct, plain
Representative output draft (first pass I’d then edit)
Subject: Making reporting more reliable without slowing down changes
Our reporting breaks too often because upstream tables change without warning. The issue isn’t that anyone is careless—it’s that we don’t have a lightweight process for managing change across dependencies.
What’s happening
- Reports depend on tables and fields that aren’t versioned or owned.
- Schema changes land, and downstream queries fail silently or return bad data.
- When something breaks, it’s not obvious who should fix it or how urgent it is.
What we should do
1. Assign an owner for each production dataset used for reporting.
2. Add simple monitoring: freshness checks, row count deltas, and query failure alerts.
3. Keep a small change log for schema changes that affect downstream users.
4. Treat breaking changes as incidents: write down what changed, what broke, and how we prevent recurrence.
This doesn’t require heavy governance. It’s just enough structure to keep reports dependable while still allowing fast iteration upstream.
That kind of output saves me time because the skeleton is there. My job becomes editing, adding specifics, and making it accurate.
Software development
ChatGPT is useful for pragmatic coding assistance, but I treat it differently depending on the stakes. For high-stakes code paths, I still want Claude more often (more on that later). For SQL and reporting work, ChatGPT is consistently helpful.
Debug SQL queries
SQL debugging is where it earns its keep. The pattern is: I paste a query, describe the issue (“duplicates,” “missing rows,” “slow”), and tell it what database I’m using (Postgres, BigQuery, Snowflake, etc.). It usually finds the obvious mistake fast.
Example: query + fix
Here’s a representative example: a query intended to count weekly active users, but it double-counts due to a join.
Buggy query
SELECT
date_trunc('week', e.event_time) AS week,
COUNT(DISTINCT u.user_id) AS active_users
FROM events e
JOIN users u ON u.user_id = e.user_id
LEFT JOIN subscriptions s ON s.user_id = u.user_id
WHERE e.event_name = 'login'
AND e.event_time >= CURRENT_DATE - INTERVAL '90 days'
AND s.status = 'active'
GROUP BY 1
ORDER BY 1;
What’s wrong
The LEFT JOIN subscriptions turns into an inner join because of AND s.status = 'active' in the WHERE clause. It also risks duplication if a user has multiple subscription rows.
Fixed query (one clean approach)
SELECT
date_trunc('week', e.event_time) AS week,
COUNT(DISTINCT e.user_id) AS active_users
FROM events e
WHERE e.event_name = 'login'
AND e.event_time >= CURRENT_DATE - INTERVAL '90 days'
AND EXISTS (
SELECT 1
FROM subscriptions s
WHERE s.user_id = e.user_id
AND s.status = 'active'
)
GROUP BY 1
ORDER BY 1;
This is the kind of fix ChatGPT lands quickly, especially when you give it the intention (“active users who have an active subscription, without duplicates”).
Propose SQL schemas
I use it to sketch schemas when I’m moving fast: “I need to track jobs, runs, and run logs; I want to query success rate by workflow and time window; propose a minimal schema.”
It’s not the final answer, but it gets me a first-pass shape that I can adjust.
Example: schema sketch
Representative schema sketch for tracking automation runs:
-- workflows: one row per automation/workflow
CREATE TABLE workflows (
workflow_id UUID PRIMARY KEY,
name TEXT NOT NULL,
owner_team TEXT,
created_at TIMESTAMP NOT NULL DEFAULT now(),
is_active BOOLEAN NOT NULL DEFAULT true
);
-- workflow_runs: one row per execution
CREATE TABLE workflow_runs (
run_id UUID PRIMARY KEY,
workflow_id UUID NOT NULL REFERENCES workflows(workflow_id),
status TEXT NOT NULL CHECK (status IN ('queued','running','success','failed','canceled')),
started_at TIMESTAMP,
finished_at TIMESTAMP,
triggered_by TEXT, -- 'schedule', 'manual', 'webhook'
error_code TEXT,
error_message TEXT
);
-- run_metrics: optional key/value metrics (duration_ms, rows_processed, etc.)
CREATE TABLE run_metrics (
run_id UUID NOT NULL REFERENCES workflow_runs(run_id),
metric_key TEXT NOT NULL,
metric_value_num DOUBLE PRECISION,
metric_value_txt TEXT,
PRIMARY KEY (run_id, metric_key)
);
-- indexes that usually matter
CREATE INDEX idx_workflow_runs_workflow_time ON workflow_runs (workflow_id, started_at);
CREATE INDEX idx_workflow_runs_status_time ON workflow_runs (status, started_at);
What I like here is that it separates concerns: stable entities (workflows) vs. events (runs) vs. flexible metrics. What I don’t blindly accept is the data typing, naming, and checks—those depend on the database and conventions.
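When I get a schema sketch like this back, I’ll usually write the query I actually care about against it to see whether the shape holds up. Here’s a minimal sketch of the “success rate by workflow and time window” query I mentioned above, assuming the workflow_runs and workflows tables from the sketch and Postgres-style syntax:
-- success rate per workflow per week, against the schema sketched above (Postgres syntax)
SELECT
  w.name,
  date_trunc('week', r.started_at) AS week,
  COUNT(*) AS total_runs,
  ROUND(AVG(CASE WHEN r.status = 'success' THEN 1.0 ELSE 0 END), 3) AS success_rate
FROM workflow_runs r
JOIN workflows w ON w.workflow_id = r.workflow_id
WHERE r.started_at >= CURRENT_DATE - INTERVAL '90 days'
  AND r.status IN ('success', 'failed')  -- ignore runs still queued or running
GROUP BY 1, 2
ORDER BY 1, 2;
If that query is awkward to write, the schema usually needs another pass.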
Reporting
For reporting work, ChatGPT is good at turning questions into queries and turning query results into explanations. I might paste a table schema and ask:
- “What are three ways this metric can be wrong?”
- “What dimensions would you add to make this dashboard actionable?”
- “Given this event table, how would you compute retention without biasing toward heavy users?”
The value is less “write the perfect query” and more “help me think through edge cases.” It’s especially helpful when I’m trying to avoid the classic reporting traps: partial periods, late-arriving events, timezone issues, and join duplication.
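As a concrete illustration of two of those traps, here’s a minimal sketch, assuming a Postgres-style events table like the one in the earlier example: it pins events to one timezone and drops the current partial week so the last data point doesn’t look like a drop-off.
-- weekly login counts, pinned to UTC, excluding the current partial week
SELECT
  date_trunc('week', e.event_time AT TIME ZONE 'UTC') AS week,
  COUNT(DISTINCT e.user_id) AS active_users
FROM events e
WHERE e.event_name = 'login'
  AND date_trunc('week', e.event_time AT TIME ZONE 'UTC') < date_trunc('week', now() AT TIME ZONE 'UTC')
GROUP BY 1
ORDER BY 1;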
Ad hoc queries
This is the day-to-day grind: “Show me the top 20 workflows by failure rate last week,” “Find users with more than X errors in 24 hours,” “Estimate cost per run.”
ChatGPT accelerates the first draft. I still validate the logic and run EXPLAIN when performance matters, but it’s faster than building everything from memory every time.
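For example, the “top 20 workflows by failure rate last week” question, written against the schema sketched earlier (so the table and column names are assumptions carried over from that sketch, not a real system):
-- top 20 workflows by failure rate over the last 7 days, using the schema sketched above
SELECT
  w.name,
  COUNT(*) AS total_runs,
  COUNT(*) FILTER (WHERE r.status = 'failed') AS failed_runs,
  ROUND(COUNT(*) FILTER (WHERE r.status = 'failed')::numeric / COUNT(*), 3) AS failure_rate
FROM workflow_runs r
JOIN workflows w ON w.workflow_id = r.workflow_id
WHERE r.started_at >= now() - INTERVAL '7 days'
GROUP BY 1
HAVING COUNT(*) >= 10  -- ignore workflows with too few runs to be meaningful
ORDER BY failure_rate DESC
LIMIT 20;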
Claude
Claude is my primary model for coding when I expect the task to be multi-step, or when I need it to reason carefully through a change without breaking adjacent behavior. In my opinion, Claude is by far the best model for vibe coding. I use Claude Code in my terminal (Ghostty) to develop personal projects with Ruby on Rails, including the web application you’re viewing now.
Use cases
Coding (primary)
Claude is the one I go to for:
- Refactors across multiple files
- Designing a small module with clear interfaces
- Writing tests alongside implementation
- Explaining tradeoffs in a way that helps me choose
A representative example: I have a script that grew into a small service—some data fetching, some transformations, some side effects (writing to a database, sending a notification). I’ll ask Claude to propose a structure: separate pure transformation from IO, add a config object, define a couple of invariants, and then write tests for the pure pieces. Its suggestions tend to line up with how I’d actually maintain the code later.
Writing (some)
I use Claude for writing less often than ChatGPT, but it’s good when the writing needs to stay coherent across a long document or when I’m trying to keep consistent definitions.
For example, if I’m writing internal documentation for a workflow system—definitions like “job,” “run,” “replay,” “backfill”—Claude is good at keeping terminology straight and spotting places where I used the same word for two different concepts.
UI design
I use Claude Code’s ‘frontend-design’ skill to generate designs for my web applications, and it’s amazing how good it is. It can produce designs for an entire web application in minutes or hours, depending on the complexity of the application and the number of screens. A UI/UX designer would usually take at least a week, and often a month, to produce the same design.
Debugging
Claude is strong at debugging when I can give it:
- The error message
- The relevant code path
- The expected behavior
- What I already tried
It tends to ask the right clarifying questions rather than jumping straight into a confident but wrong fix. It also does better when the bug is caused by interaction effects (for example, caching plus concurrency plus unexpected input).
DevOps
For DevOps-type work, Claude is useful for:
- Drafting a deployment checklist that matches constraints (rollback plan, migrations, feature flags)
- Reviewing a Dockerfile or CI pipeline and suggesting simplifications
- Explaining why a particular configuration causes a particular failure mode
I’m careful here: I don’t paste secrets, and I don’t blindly run commands. But as a reviewer—“does this pipeline do what I think it does?”—it’s solid.
Gemini
Gemini is the model I end up using when the work is close to Google’s ecosystem or when browser-backed checking is helpful. It’s not that it’s always “better” at raw generation. It’s that it’s often closer to the context I’m already in.
Use cases
Testing using the browser
This is one of the more practical differences: when I need a model to sanity-check something against what’s actually on a page, Gemini is often the first one I try.
A representative scenario:
- I’m troubleshooting why a documented API parameter isn’t behaving as expected.
- I suspect the docs changed, or I’m looking at the wrong version.
- I want a quick verification: “Is this parameter still supported? What’s the current name?”
Having browser access (when available) turns that into a more grounded task. The model can still be wrong, but it’s less likely to hallucinate a confident answer when the page contradicts it.
Software development
Gemini is capable for dev work, but for my use I treat it as a secondary opinion:
- “Here’s my plan—what did I miss?”
- “Given this function signature, propose edge-case tests.”
- “Rewrite this code to be clearer without changing behavior.”
It’s also helpful when I’m already using Google Cloud services, because it tends to be comfortable with the vocabulary and the common patterns.
Writing
I use Gemini for writing mainly when the writing is connected to something I’m doing in Google Workspace (docs, notes, summaries). It’s fine for drafts and rewrites. I don’t find it noticeably better than ChatGPT for my style, but it’s “good enough” when it’s already in the tool.
Embedded in Google Workspace tools
This is the main reason it stays in my rotation. If I’m inside Gmail, Docs, or Sheets, having a model right there changes whether I use it at all. The best model is the one that reduces friction.
A practical example: in Sheets, I might want a formula that’s slightly annoying to write correctly (nested IF, date parsing, conditional formatting rules). Gemini can usually get me 80% of the way there in context, and then I adjust.
Overall review and final thoughts
Where each model is strongest/weakest for me
Here’s the simplest comparison, from actual usage.
Claude
- Strongest: careful coding help, refactors, reasoning through tradeoffs, debugging with multi-step causality, keeping terminology consistent in longer docs; it’s my go-to model
- Weakest: honestly, I haven't encountered many weaknesses with Claude.
ChatGPT
- Strongest: writing edits, drafting structured text, SQL debugging, quick ad hoc analysis, turning messy notes into something usable
- Weakest: long refactors where consistency matters; occasionally “over-edits” voice; can be too eager to provide an answer even when ambiguity exists; can produce cheesy over-the-top text
Gemini
- Strongest: tasks tied to Google Workspace; browser-backed checking/sanity tests; decent general dev and writing support when already in that environment
- Weakest: for my workflow, it’s less of a “default brain” and more situational; output can be uneven depending on the task and context
The important part: I don’t look for “the smartest model.” I look for predictable help with the kind of work I do repeatedly.
How I choose which model to use day-to-day
My selection logic is mostly about failure cost and context.
- If I’m writing and I care about staying in my voice: I start with ChatGPT or Gemini (in Google Docs) for editing or structure.
- If I’m coding or doing any technical work and the task is non-trivial (refactor, tests, design decisions): I start with Claude.
- If I’m inside Google tools or I need a quick check against what’s on a page: I use Gemini.
Then there’s a second layer: I’ll sometimes ask two models the same question when the cost of being wrong is high. If both agree, I move faster. If they disagree, that’s usually a sign the question is underspecified or the edge cases matter more than I initially thought.
One more practical rule: when a model starts producing confident output that I can’t verify quickly, that’s a stop sign. I either narrow the question, add constraints, or move back to first principles. These tools are most valuable when they shorten the path to something I can validate—code I can test, queries I can run, writing I can read and judge.
Used that way, all three are useful. The trick is not expecting one of them to be universally reliable. They’re tools with different shapes. I get the best results when I pick the shape that fits the job.