
AI Code Generation: What It Does Well and What It Doesn't

Realistic assessment of AI coding capabilities. Types of code AI generates well, where it struggles, when to trust AI output, and how experienced developers use these tools effectively.

January 6, 2025
17 min read
By Thalamus AI

The marketing pitch for AI code generation is compelling: developers writing code at superhuman speed, complex applications built in days instead of months, and technical barriers dissolved by natural language commands.

Here's what actually happens after six months of daily AI-assisted development across real business projects.

The Spectrum of AI Code Generation

Not all code is created equal, and AI tools handle different types very differently. Understanding this spectrum is critical to using these tools effectively.

As of January 6, 2025: Findings based on extensive testing with GitHub Copilot, Cursor, Claude, and ChatGPT across multiple production projects. AI capabilities evolve rapidly—expect gradual improvements but not fundamental shifts in these patterns.

Tier 1: AI Excels (Trust with Review)

These are the tasks where AI code generation genuinely saves time and produces reliable results:

CRUD Operations and Boilerplate

AI generates standard create-read-update-delete operations competently. Give it a data model and it produces working database operations in your framework of choice.

Example quality: 85-95% immediately usable.

Why it works: Patterns are consistent, well-documented in training data, and variations are limited.

What to review: Database connection handling, error cases, data validation edge cases.
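
For illustration, here is a minimal sketch of the kind of CRUD boilerplate AI produces reliably, assuming SQLAlchemy 2.x; the User model is a hypothetical example and the review comments are ours, not AI output.

```python
# Minimal CRUD sketch of the boilerplate AI generates well (SQLAlchemy 2.x assumed).
from sqlalchemy import String
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class User(Base):  # hypothetical model for illustration
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    email: Mapped[str] = mapped_column(String(255), unique=True)
    name: Mapped[str] = mapped_column(String(255))


def create_user(db: Session, email: str, name: str) -> User:
    user = User(email=email, name=name)
    db.add(user)
    db.commit()            # review: transaction scope and rollback on failure
    db.refresh(user)
    return user


def get_user(db: Session, user_id: int) -> User | None:
    return db.get(User, user_id)


def update_user(db: Session, user_id: int, **fields) -> User | None:
    user = db.get(User, user_id)
    if user is None:
        return None        # review: is "silently return None" the right contract?
    for key, value in fields.items():
        setattr(user, key, value)   # review: validate/whitelist updatable fields
    db.commit()
    return user


def delete_user(db: Session, user_id: int) -> bool:
    user = db.get(User, user_id)
    if user is None:
        return False
    db.delete(user)
    db.commit()
    return True
```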

API Endpoint Scaffolding

REST or GraphQL endpoints follow predictable patterns. AI generates route handlers, request validation, response formatting, and basic error handling effectively.

Example quality: 80-90% immediately usable.

Why it works: Framework conventions are well-established and extensively documented in training data.

What to review: Authentication/authorization logic, rate limiting, input sanitization, business rule implementation.
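
A hedged sketch of the endpoint scaffolding AI handles well, assuming FastAPI and Pydantic. The routes and models are hypothetical, and storage is an in-memory dict purely to keep the example self-contained.

```python
# Illustrative FastAPI scaffold; real code would swap the dict for a database layer.
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

router = APIRouter(prefix="/users", tags=["users"])


class UserCreate(BaseModel):
    email: str
    name: str


class UserOut(UserCreate):
    id: int


_users: dict[int, UserOut] = {}   # stand-in for a real database table
_next_id = 1


@router.post("/", response_model=UserOut, status_code=201)
def create_user(payload: UserCreate) -> UserOut:
    # review: authentication, rate limiting, duplicate-email checks, business rules
    global _next_id
    user = UserOut(id=_next_id, email=payload.email, name=payload.name)
    _users[user.id] = user
    _next_id += 1
    return user


@router.get("/{user_id}", response_model=UserOut)
def read_user(user_id: int) -> UserOut:
    user = _users.get(user_id)
    if user is None:
        raise HTTPException(status_code=404, detail="User not found")
    return user

# Attach with: app = FastAPI(); app.include_router(router)
```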

Type Definitions and Interfaces

TypeScript interfaces, Python type hints, and data structure definitions are AI's sweet spot. Given a description or example data, AI produces accurate type definitions.

Example quality: 90-95% immediately usable.

Why it works: Type systems have formal rules that AI models learn well.

What to review: Optional vs. required fields, union types accuracy, generic constraints.
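
As an illustration, here is the kind of type-definition work AI is good at, using a hypothetical order payload; the optional field and literal union are exactly the spots the review checklist above targets.

```python
# Sketch of typical AI-generated type definitions (hypothetical order domain).
from dataclasses import dataclass, field
from typing import Literal, TypedDict


class Address(TypedDict):
    street: str
    city: str
    postal_code: str
    country: str


class OrderItem(TypedDict):
    sku: str
    quantity: int
    unit_price_cents: int


@dataclass
class Order:
    id: str
    status: Literal["pending", "paid", "shipped", "cancelled"]  # review: union accuracy
    items: list[OrderItem]
    shipping_address: Address
    promo_code: str | None = None        # review: truly optional, or required?
    discount_cents: int = 0
    metadata: dict[str, str] = field(default_factory=dict)
```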

Unit Test Skeletons

AI generates test structure, setup/teardown, and basic assertions well. The test framework, naming conventions, and organization follow best practices.

Example quality: 70-85% immediately usable (needs assertion refinement).

Why it works: Test structure is formulaic and patterns are consistent.

What to review: Edge cases, assertion logic, test data quality, mock setup accuracy.
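
A sketch of the typical pytest skeleton AI produces, assuming a hypothetical calculate_discount function; the structure and naming are usually fine as generated, while the assertions and boundary tests are where the review effort goes.

```python
# Typical AI-generated test skeleton; the module under test is hypothetical.
import pytest

from myapp.pricing import calculate_discount  # hypothetical module


@pytest.fixture
def sample_cart():
    return {"items": [{"sku": "A1", "quantity": 10, "unit_price_cents": 500}]}


def test_discount_applied_for_large_quantity(sample_cart):
    result = calculate_discount(sample_cart)
    assert result["discounted_total_cents"] < result["original_total_cents"]


def test_no_discount_for_small_quantity():
    cart = {"items": [{"sku": "A1", "quantity": 1, "unit_price_cents": 500}]}
    result = calculate_discount(cart)
    assert result["discounted_total_cents"] == result["original_total_cents"]


def test_discount_never_exceeds_cap(sample_cart):
    # review: AI rarely adds boundary tests like this without being asked
    result = calculate_discount(sample_cart, promo_code="BIGSALE")
    assert result["discounted_total_cents"] >= result["original_total_cents"] * 0.6
```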

Configuration File Generation

Dockerfile, docker-compose.yml, CI/CD configurations, and framework config files are generated reliably from descriptions.

Example quality: 75-90% immediately usable.

Why it works: Configuration formats are structured and documentation is extensive.

What to review: Environment-specific settings, security configurations, resource allocations, version pins.

Documentation and Comments

AI generates clear documentation from code, explains complex functions, and writes helpful comments.

Example quality: 80-90% immediately usable.

Why it works: Explaining code is similar to the chat tasks AI is trained for.

What to review: Accuracy of explanations, completeness of edge case documentation, tone consistency.

Tier 2: AI Assists (Heavy Review Required)

These tasks produce useful starting points but require significant developer refinement:

Business Logic Implementation

AI can translate requirements into code, but business rules have nuances that get lost. The logic structure is often reasonable, but specifics require correction.

Example quality: 40-60% immediately usable.

Why it struggles: Business rules have context and exceptions that aren't captured in simple descriptions.

What to review: Everything. Treat the output as an initial draft, not a solution.

Real example from our testing:

  • Prompt: "Implement pricing calculation with tiered discounts and promotional codes"
  • AI output: Basic structure correct, discount stacking logic wrong, promo code validation missing, currency handling incomplete
  • Developer time: roughly 40% less than writing from scratch, but deep review and significant correction were still required

Database Query Optimization

AI generates working queries but often misses performance considerations. Indexes, join strategies, and query structure may be inefficient.

Example quality: 50-70% immediately usable.

Why it struggles: Performance optimization requires understanding data distribution and database internals.

What to review: Query plans, index usage, N+1 problems, join strategies, pagination approaches.
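
To make the N+1 issue concrete, here is a sketch contrasting the lazy-loading query AI often produces with an eager-loading version, assuming SQLAlchemy 2.x and hypothetical Author/Book models with an Author.books relationship.

```python
# N+1 query pattern vs. eager loading (SQLAlchemy 2.x assumed; models hypothetical).
from sqlalchemy import select
from sqlalchemy.orm import Session, selectinload

from myapp.models import Author  # hypothetical model with an Author.books relationship


def list_books_n_plus_one(db: Session) -> list[str]:
    # One query for authors, then one additional query per author for .books (N+1)
    authors = db.scalars(select(Author)).all()
    return [book.title for author in authors for book in author.books]


def list_books_eager(db: Session) -> list[str]:
    # Authors and their books loaded in two batched queries instead of N+1
    authors = db.scalars(
        select(Author).options(selectinload(Author.books))
    ).all()
    return [book.title for author in authors for book in author.books]
```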

Error Handling and Edge Cases

AI implements happy path well but edge case handling is often superficial or wrong. Error messages, retry logic, and failure recovery need attention.

Example quality: 40-60% immediately usable.

Why it struggles: Edge cases aren't well-represented in training data and require domain knowledge.

What to review: All error paths, validation logic, boundary conditions, race conditions, timeout handling.

Integration Code

AI generates API client code and integration logic, but authentication, rate limiting, retry strategies, and error handling often need refinement.

Example quality: 50-70% immediately usable.

Why it struggles: Each API has quirks and undocumented behaviors that require experience.

What to review: Authentication flows, error handling, rate limiting, webhook signatures, pagination, data transformation.
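
A sketch of the refinement integration code usually needs, using the requests library: explicit timeouts, bounded retries with backoff, and handling of 429 rate-limit responses. The URL and token handling are placeholders, not any specific API's requirements.

```python
# Illustrative HTTP client with the safeguards AI-generated clients often omit.
import time

import requests


def fetch_with_retry(url: str, token: str, max_attempts: int = 4) -> dict:
    """GET a JSON resource with timeouts, bounded retries, and backoff."""
    for attempt in range(1, max_attempts + 1):
        wait = 2 ** attempt                       # exponential backoff default
        try:
            response = requests.get(
                url,
                headers={"Authorization": f"Bearer {token}"},
                timeout=10,                       # AI-generated clients often omit this
            )
            if response.status_code == 429:
                # rate limited: honor a numeric Retry-After header if present
                retry_after = response.headers.get("Retry-After", "")
                if retry_after.isdigit():
                    wait = int(retry_after)
            elif response.status_code < 500:
                response.raise_for_status()       # other 4xx: fail fast, don't retry
                return response.json()
            # 5xx falls through to the retry below
        except (requests.Timeout, requests.ConnectionError):
            pass                                  # transient network failure: retry
        if attempt == max_attempts:
            raise RuntimeError(f"giving up on {url} after {max_attempts} attempts")
        time.sleep(min(wait, 30))
    raise RuntimeError("unreachable")             # loop always returns or raises
```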

Refactoring Suggestions

AI identifies code smells and suggests improvements, but recommendations vary in quality. Some are excellent, others break functionality.

Example quality: 40-70% immediately usable (highly variable).

Why it struggles: Refactoring requires understanding intent and constraints that aren't obvious from code alone.

What to review: Does refactoring preserve behavior? Does it improve maintainability? Does it introduce new problems?

Tier 3: AI Struggles (Use as Reference Only)

These tasks produce output that's more inspiration than solution:

Architecture and System Design

AI can describe architectural patterns and generate component structures, but system design requires understanding business constraints, team capabilities, future requirements, and trade-offs.

Example quality: 20-40% immediately usable.

Why it struggles: Architecture is about decisions under uncertainty with incomplete information.

What to review: Everything. Use AI suggestions as brainstorming, not blueprints.

Security-Critical Code

Authentication, authorization, cryptography, and sensitive data handling require expertise. AI generates code that looks secure but often has subtle vulnerabilities.

Example quality: 30-50% immediately usable (security issues common).

Why it struggles: Security requires adversarial thinking and knowledge of attack patterns.

What to review: Everything, preferably with a security specialist involved. Don't trust AI alone for security-critical code.

Performance-Critical Code

Algorithms requiring optimal performance, memory management, concurrent access patterns, and low-level optimizations are beyond reliable AI generation.

Example quality: 30-50% immediately usable.

Why it struggles: Performance optimization requires profiling, measurement, and deep understanding of execution environment.

What to review: Benchmark everything. Profile. Optimize based on data, not AI suggestions.

Complex State Management

State machines, complex workflows, distributed state coordination, and transaction management require careful design. AI produces code that works for simple cases but fails under complexity.

Example quality: 30-50% immediately usable.

Why it struggles: State management edge cases and race conditions are subtle and context-dependent.

What to review: All state transitions, concurrency handling, consistency guarantees, rollback logic.
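
One pattern that helps when reviewing AI-generated workflow code is making the transition table explicit, as in this sketch with hypothetical order states; it turns "which transitions are legal?" into something a reviewer can check directly.

```python
# Explicit state-transition table (states and transitions are hypothetical).
from enum import Enum


class OrderState(str, Enum):
    PENDING = "pending"
    PAID = "paid"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"


# Every allowed transition is written down; anything else is rejected.
ALLOWED_TRANSITIONS: dict[OrderState, set[OrderState]] = {
    OrderState.PENDING: {OrderState.PAID, OrderState.CANCELLED},
    OrderState.PAID: {OrderState.SHIPPED, OrderState.CANCELLED},
    OrderState.SHIPPED: set(),        # terminal
    OrderState.CANCELLED: set(),      # terminal
}


def transition(current: OrderState, target: OrderState) -> OrderState:
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    # review: persistence and concurrency (e.g. optimistic locking) still needed
    return target
```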

Domain-Specific Algorithms

Specialized algorithms for finance, scientific computing, machine learning pipelines, or industry-specific calculations require domain expertise AI lacks.

Example quality: 20-40% immediately usable.

Why it struggles: Training data lacks depth in specialized domains.

What to review: Verify against domain knowledge, existing implementations, or academic references.

The Quality Question

Code quality from AI isn't consistent. We measured quality across five dimensions:

Correctness

Does the code do what it's supposed to do?

Findings:

  • Tier 1 tasks: 85-95% correct initially
  • Tier 2 tasks: 50-70% correct initially
  • Tier 3 tasks: 30-50% correct initially

Correctness improves with:

  • More specific prompts
  • Context from existing codebase
  • Iterative refinement with AI
  • Clear requirement specifications

Readability

Is the code easy to understand?

Findings:

  • AI-generated code is generally readable
  • Naming conventions are good
  • Structure is logical
  • Comments can be overly verbose or absent entirely
  • Consistent style within generated blocks, but may not match project style

Improvement needed:

  • Style guide enforcement after generation
  • Comment refinement (too many or too few)
  • Project-specific naming conventions

Maintainability

Can the code be modified and extended?

Findings:

  • Simple code is maintainable
  • Complex code often has hidden coupling
  • Error handling paths may be incomplete
  • Test coverage for AI-generated code varies widely

Maintainability issues:

  • Insufficient error handling
  • Magic numbers and hardcoded values
  • Poor separation of concerns
  • Incomplete abstraction boundaries

Efficiency

Does the code perform well?

Findings:

  • AI optimizes for correctness, not performance
  • Database queries often inefficient
  • Memory usage sometimes excessive
  • Algorithmic complexity not always optimal

Performance problems:

  • N+1 query problems
  • Inefficient data structures
  • Unnecessary copying
  • Missing caching opportunities

Security

Does the code protect against attacks?

Findings:

  • Input validation is inconsistent
  • SQL injection protection often present but not guaranteed
  • XSS protection varies
  • Authentication/authorization logic requires careful review
  • Cryptography implementations should never be trusted without expert review

Security concerns:

  • Missing input sanitization
  • Inadequate authentication checks
  • Weak cryptography choices
  • Insufficient rate limiting
  • Information disclosure in errors

The Trust Calibration Problem

Learning when to trust AI-generated code is the critical skill. We found that experienced developers build this intuition over 2-3 months of regular use:

Trust More:

  • Boilerplate and standard patterns
  • Code in well-established frameworks
  • Type definitions and interfaces
  • Test structure (not assertions)
  • Documentation generation
  • Simple transformations

Trust Less:

  • Business logic implementation
  • Security-sensitive code
  • Performance-critical paths
  • Error handling
  • Edge cases
  • Integration logic

Never Trust Without Review:

  • Database operations
  • Authentication/authorization
  • Payment processing
  • Data validation
  • Cryptography
  • System design decisions

How Experienced Developers Use AI Code Generation

After observing our team and interviewing developers experienced with AI tools, we saw four patterns emerge:

Pattern 1: Scaffolding + Refinement

Generate structure with AI, implement critical logic manually.

Process:

  1. Use AI to generate API endpoints, models, basic CRUD
  2. Manually implement business logic
  3. Use AI to generate tests
  4. Manually refine test assertions and edge cases
  5. Use AI to generate documentation
  6. Manually review and improve

Time savings: 25-35% on typical feature development

Pattern 2: Explanation + Implementation

Use AI to explain approaches, implement yourself.

Process:

  1. Ask AI to explain how to solve the problem
  2. Review suggestions and alternatives
  3. Implement yourself with AI assistance for boilerplate
  4. Use AI to review your implementation
  5. Refine based on feedback

Time savings: 15-25%, better understanding of implementation

Pattern 3: Iteration + Correction

Generate with AI, iteratively fix issues through conversation.

Process:

  1. Generate initial implementation with AI
  2. Identify problems through testing or review
  3. Ask AI to fix specific issues
  4. Repeat until acceptable
  5. Final manual review and refinement

Time savings: 20-30%, works best for well-defined problems

Pattern 4: Reference + Adaptation

Use AI output as reference, adapt to specific needs.

Process:

  1. Generate multiple AI approaches to the problem
  2. Extract good ideas from each
  3. Implement yourself combining best elements
  4. Use AI for boilerplate parts
  5. Manual refinement throughout

Time savings: 15-20%, highest quality outcomes

The Prompting Skill Gap

Effective use of AI code generation requires learning how to prompt. We documented principles that improved output quality:

Principle 1: Context is Critical

Bad prompt: "Create a user authentication system"

Good prompt: "Create a JWT-based authentication system for a FastAPI application. Use PostgreSQL for user storage, bcrypt for password hashing, and include endpoints for registration, login, logout, and token refresh. Follow RESTful conventions and return appropriate HTTP status codes."

Context improves output quality dramatically.

Principle 2: Specify Constraints

Bad prompt: "Write a function to calculate discounts"

Good prompt: "Write a Python function to calculate discounts with these rules: 1) Tiered discounts based on quantity, 2) Promotional codes apply before tier discounts, 3) Maximum discount is 40%, 4) Return both original and discounted prices. Include type hints and docstring."

Constraints prevent ambiguity.
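
For reference, this is roughly the shape of output the constrained prompt above is aiming for. The tier breakpoints and promo codes here are hypothetical; only the 40% cap and the promo-before-tier ordering come from the prompt.

```python
# Illustrative implementation of the constrained discount prompt above.
PROMO_CODES = {"WELCOME10": 0.10, "VIP20": 0.20}            # hypothetical codes
QUANTITY_TIERS = [(100, 0.15), (50, 0.10), (10, 0.05)]      # (min quantity, rate), hypothetical
MAX_DISCOUNT = 0.40                                          # cap from the prompt


def calculate_price(unit_price: float, quantity: int, promo_code: str | None = None) -> dict:
    """Return original and discounted totals.

    Promotional codes apply before tier discounts; total discount is capped at 40%.
    """
    original = unit_price * quantity

    promo_rate = PROMO_CODES.get(promo_code, 0.0) if promo_code else 0.0
    tier_rate = next((rate for threshold, rate in QUANTITY_TIERS if quantity >= threshold), 0.0)

    price = original * (1 - promo_rate)    # promo first, per the stated rules
    price = price * (1 - tier_rate)        # then the quantity tier discount

    effective_rate = 1 - price / original if original else 0.0
    if effective_rate > MAX_DISCOUNT:      # enforce the 40% cap
        price = original * (1 - MAX_DISCOUNT)

    return {"original_price": round(original, 2), "discounted_price": round(price, 2)}
```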

Principle 3: Provide Examples

Bad prompt: "Parse this log file"

Good prompt: "Parse this Apache access log file. Example line: '192.168.1.1 - - [01/Jan/2025:12:00:00 +0000] "GET /api/users HTTP/1.1" 200 1234'. Extract IP, timestamp, HTTP method, path, status code, and response size. Return as list of dictionaries."

Examples clarify requirements better than descriptions.
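
As a sketch, here is the kind of parser that prompt should yield, built around the example line it provides; the regex is ours and would need review against real log variants.

```python
# Parser sketch for the Apache access log format in the prompt above.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_log(lines: list[str]) -> list[dict]:
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue                  # review: decide how to handle malformed lines
        record = match.groupdict()
        record["status"] = int(record["status"])
        record["size"] = 0 if record["size"] == "-" else int(record["size"])
        records.append(record)
    return records


example = '192.168.1.1 - - [01/Jan/2025:12:00:00 +0000] "GET /api/users HTTP/1.1" 200 1234'
print(parse_access_log([example]))
# [{'ip': '192.168.1.1', 'timestamp': '01/Jan/2025:12:00:00 +0000',
#   'method': 'GET', 'path': '/api/users', 'status': 200, 'size': 1234}]
```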

Principle 4: Iterate Explicitly

Bad approach: Generate once, accept or reject.

Good approach: "Generate initial implementation" → Review → "Fix error handling" → Review → "Add input validation" → Review → "Optimize database query" → Final review

Iterative refinement produces better results than one-shot generation.

Principle 5: Specify Code Style

Bad prompt: "Create a React component"

Good prompt: "Create a React component using TypeScript, functional components with hooks, following Airbnb style guide. Include prop types, error boundaries, and loading states. Use styled-components for styling."

Style specifications improve consistency with existing codebase.

The Testing Imperative

AI-generated code requires testing discipline:

Rule 1: Test AI code more thoroughly than human code

AI can be confidently wrong. Tests catch subtle errors that look correct on review.

Rule 2: Don't trust AI-generated tests

Test structure is fine, but assertions may be incomplete or wrong. Verify test logic independently.

Rule 3: Generate tests after code, not before

TDD with AI is problematic: AI-generated tests may pass even when the implementation is wrong. Write tests yourself, or generate them only after manually verifying the code.

Rule 4: Use AI for test data generation

AI excels at creating diverse test data, edge cases, and fixture generation. This is a high-value use case.
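
For example, AI-suggested edge cases drop naturally into a parametrized test; the validate_email function here is hypothetical.

```python
# AI-suggested edge-case inputs wired into a parametrized test (function hypothetical).
import pytest

from myapp.validation import validate_email  # hypothetical

EDGE_CASES = [
    ("user@example.com", True),
    ("user+tag@example.co.uk", True),
    ("", False),
    ("no-at-sign.example.com", False),
    ("user@", False),
    ("   user@example.com   ", False),   # leading/trailing whitespace
    ("user@exämple.com", False),         # review: is IDN support actually required?
]


@pytest.mark.parametrize("value,expected", EDGE_CASES)
def test_validate_email(value, expected):
    assert validate_email(value) == expected
```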

Rule 5: Manual integration testing is essential

AI unit tests may pass while integration fails. Don't skip manual testing even with comprehensive AI-generated tests.

Real Project Results

We tracked AI code generation impact across three real projects:

Project 1: Customer Management System

Project scope: Migration from legacy system to modern Python/FastAPI backend

AI usage: Heavy (60% of code had AI assistance)

Results:

  • Development time: 32% faster than estimated without AI
  • Initial bug density: 12% higher than historical average
  • Post-review bug density: Normal after adjustment period
  • Code quality: Acceptable after establishing review processes
  • Developer satisfaction: High (after learning curve)

Lessons:

  • AI excels at API endpoint generation
  • Business logic required significant refinement
  • Testing discipline prevented quality issues
  • Code review processes needed updating

Project 2: E-commerce Dashboard

Project scope: React/TypeScript frontend rebuild

AI usage: Moderate (40% of code had AI assistance)

Results:

  • Development time: 23% faster than estimated
  • Component quality: Good for simple components, mixed for complex
  • State management: Required heavy manual refinement
  • Styling: AI-generated styles needed design review
  • Developer satisfaction: Moderate (mixed value)

Lessons:

  • UI component scaffolding saved significant time
  • Complex state management better done manually
  • AI styling suggestions rarely matched design system
  • Type definitions were AI's strength

Project 3: Integration Microservice

Project scope: Node.js service connecting multiple external APIs

AI usage: Light (25% of code had AI assistance)

Results:

  • Development time: 15% faster than estimated
  • Integration quality: Required significant manual work
  • Error handling: AI suggestions were inadequate
  • Documentation: AI-generated docs were helpful
  • Developer satisfaction: Low (limited value for this project type)

Lessons:

  • Integration logic requires API-specific knowledge
  • Error handling and retry logic too complex for AI
  • Boilerplate savings were minimal for this project type
  • Documentation generation was the primary value

The Cost-Benefit Reality

Across all projects, we measured:

Average time savings: 20-25% (not 50%, not 10x)

Time investment required:

  • Learning effective prompting: 10-20 hours per developer
  • Establishing review processes: 5-10 hours for team
  • Tool setup and configuration: 2-5 hours per developer

Break-even point: 3-4 weeks of regular usage

ROI varies by:

  • Project type (backend CRUD: high value, complex integration: low value)
  • Developer experience (senior devs extract more value)
  • Codebase maturity (greenfield projects: easier, legacy: harder)
  • Code review discipline (good processes: high value, poor: negative value)

When AI Code Generation Doesn't Make Sense

Be honest about situations where AI tools provide minimal value:

Low value scenarios:

  • Highly specialized domain code (finance, scientific, healthcare)
  • Performance-critical systems (real-time, embedded, high-throughput)
  • Security-focused development (authentication, encryption, compliance)
  • Legacy codebase maintenance (AI lacks historical context)
  • Exploratory development (requirements unclear)

Negative value scenarios:

  • Junior developers learning fundamentals (AI interferes with learning)
  • Code review processes are weak (AI amplifies quality problems)
  • Team resistance to AI tools (adoption requires buy-in)
  • Strict regulatory environments (code provenance tracking required)

The Future Trajectory

Based on six months of daily use, here's our assessment of where AI code generation is heading:

Improving:

  • Context understanding across larger codebases
  • Framework-specific code quality
  • Multi-file refactoring capabilities
  • Test generation quality

Plateauing:

  • Algorithm implementation
  • Security-critical code
  • Performance optimization
  • Domain-specific logic

Still problematic:

  • System architecture decisions
  • Complex state management
  • Integration with poorly documented APIs
  • Code that requires deep domain expertise

Expect gradual improvement, not revolutionary leaps. The fundamental limitations (lack of true understanding, inability to reason about real-world constraints) remain.

Practical Recommendations

After six months of intensive AI code generation usage:

Do:

  • Use AI for boilerplate and scaffolding (high value, low risk)
  • Establish clear code review processes for AI output
  • Invest time learning effective prompting
  • Measure actual productivity impact in your context
  • Treat AI as a junior developer, not an expert
  • Generate tests after verifying code logic
  • Use AI for documentation and explanation

Don't:

  • Trust AI-generated security code without expert review
  • Accept complex business logic without thorough testing
  • Skip code review because "AI generated it"
  • Use AI as excuse for not understanding implementation
  • Generate production code without manual verification
  • Rely on AI for architecture decisions
  • Assume AI-generated tests are comprehensive

Consider carefully:

  • Is the code type suited to AI generation? (Tier 1: yes, Tier 3: no)
  • Do you have review processes to catch AI errors?
  • Is development speed the bottleneck? (Often it's not)
  • Will AI assistance help or hinder learning?
  • Can you afford the tool subscription costs?

The Bottom Line

AI code generation is neither revolution nor hype. It's a productivity tool that excels at specific tasks and struggles with others.

What it does well:

  • Boilerplate and scaffolding
  • Standard patterns in established frameworks
  • Documentation and explanation
  • Type definitions and interfaces
  • Test structure generation

What it doesn't do well:

  • Complex business logic
  • Security-critical code
  • Performance optimization
  • Architecture decisions
  • Domain-specific algorithms

Real productivity improvement: 20-25% for appropriate tasks

That's meaningful but not transformative. It makes good developers somewhat faster at some things.

The developers who benefit most are those who understand the limitations, develop effective prompting skills, maintain review discipline, and use AI strategically rather than universally.

AI code generation is a tool. Use it where it excels, avoid it where it struggles, and always verify the output.

```mermaid
%%{init: {'theme':'base', 'themeVariables': {
  'primaryColor':'#e3f2fd',
  'primaryTextColor':'#0d47a1',
  'primaryBorderColor':'#1976d2',
  'secondaryColor':'#e8f5e9',
  'secondaryTextColor':'#1b5e20',
  'tertiaryColor':'#fff3e0',
  'tertiaryTextColor':'#e65100',
  'lineColor':'#1976d2',
  'fontSize':'16px'
}}}%%
graph TD
    A[Code Generation Task] --> B{Task Type?}

    B -->|Boilerplate/CRUD| C[Tier 1: AI Excels]
    B -->|Business Logic| D[Tier 2: AI Assists]
    B -->|Architecture/Security| E[Tier 3: AI Struggles]

    C --> F[Generate with AI]
    F --> G[Light Review]
    G --> H[85-95% Usable]

    D --> I[Generate with AI]
    I --> J[Heavy Review & Refinement]
    J --> K[50-70% Usable]

    E --> L[Use AI as Reference Only]
    L --> M[Manual Implementation]
    M --> N[20-40% Helpful]

    H --> O[Test Thoroughly]
    K --> O
    N --> O

    O --> P{Quality OK?}
    P -->|Yes| Q[Ship]
    P -->|No| R[Refine/Rewrite]

    style C fill:#e8f5e9,stroke:#4caf50,color:#1b5e20
    style D fill:#fff3e0,stroke:#f57c00,color:#e65100
    style E fill:#ffebee,stroke:#f44336,color:#c62828
    style Q fill:#e8f5e9,stroke:#4caf50,color:#1b5e20
    style R fill:#ffebee,stroke:#f44336,color:#c62828
```

Pro Tip: The best AI code generation strategy is knowing when not to use it. Master the fundamentals first, then use AI to accelerate what you already understand.

AI code generation won't replace developers. It will, however, change what "being a good developer" means. Understanding when and how to use AI tools effectively is becoming part of the skillset.

The future isn't AI writing all the code. It's developers who know which code to let AI write, which to write themselves, and how to verify everything works correctly.

That's the reality. Not as sexy as "AI replaces programmers," but a lot more useful.
