
AI Code Generation: What It Does Well and What It Doesn't

Realistic assessment of AI coding capabilities. Types of code AI generates well, where it struggles, when to trust AI output, and how experienced developers use these tools effectively.

January 6, 2025
17 min read
By Thalamus AI

The marketing pitch for AI code generation is compelling: developers writing code at superhuman speed, complex applications built in days instead of months, and technical barriers dissolved by natural language commands.

Here's what actually happens after six months of daily AI-assisted development across real business projects.

The Spectrum of AI Code Generation

Not all code is created equal, and AI tools handle different types very differently. Understanding this spectrum is critical to using these tools effectively.

As of January 6, 2025: Findings based on extensive testing with GitHub Copilot, Cursor, Claude, and ChatGPT across multiple production projects. AI capabilities evolve rapidly—expect gradual improvements but not fundamental shifts in these patterns.

Tier 1: AI Excels (Trust with Review)

These are the tasks where AI code generation genuinely saves time and produces reliable results:

CRUD Operations and Boilerplate

AI generates standard create-read-update-delete operations competently. Give it a data model and it produces working database operations in your framework of choice.

Example quality: 85-95% immediately usable.

Why it works: Patterns are consistent, well-documented in training data, and variations are limited.

What to review: Database connection handling, error cases, data validation edge cases.
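
For illustration, here is a minimal sketch of the kind of CRUD boilerplate AI produces reliably, assuming SQLAlchemy 2.x; the User model is a hypothetical example and the review comments are ours, not AI output.

```python
# Minimal CRUD sketch of the boilerplate AI generates well (SQLAlchemy 2.x assumed).
from sqlalchemy import String
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class User(Base):  # hypothetical model for illustration
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    email: Mapped[str] = mapped_column(String(255), unique=True)
    name: Mapped[str] = mapped_column(String(255))


def create_user(db: Session, email: str, name: str) -> User:
    user = User(email=email, name=name)
    db.add(user)
    db.commit()            # review: transaction scope and rollback on failure
    db.refresh(user)
    return user


def get_user(db: Session, user_id: int) -> User | None:
    return db.get(User, user_id)


def update_user(db: Session, user_id: int, **fields) -> User | None:
    user = db.get(User, user_id)
    if user is None:
        return None        # review: is "silently return None" the right contract?
    for key, value in fields.items():
        setattr(user, key, value)   # review: validate/whitelist updatable fields
    db.commit()
    return user


def delete_user(db: Session, user_id: int) -> bool:
    user = db.get(User, user_id)
    if user is None:
        return False
    db.delete(user)
    db.commit()
    return True
```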

API Endpoint Scaffolding

REST or GraphQL endpoints follow predictable patterns. AI generates route handlers, request validation, response formatting, and basic error handling effectively.

Example quality: 80-90% immediately usable.

Why it works: Framework conventions are well-established and extensively documented in training data.

What to review: Authentication/authorization logic, rate limiting, input sanitization, business rule implementation.
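
A hedged sketch of the endpoint scaffolding AI handles well, assuming FastAPI and Pydantic. The routes and models are hypothetical, and storage is an in-memory dict purely to keep the example self-contained.

```python
# Illustrative FastAPI scaffold; real code would swap the dict for a database layer.
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

router = APIRouter(prefix="/users", tags=["users"])


class UserCreate(BaseModel):
    email: str
    name: str


class UserOut(UserCreate):
    id: int


_users: dict[int, UserOut] = {}   # stand-in for a real database table
_next_id = 1


@router.post("/", response_model=UserOut, status_code=201)
def create_user(payload: UserCreate) -> UserOut:
    # review: authentication, rate limiting, duplicate-email checks, business rules
    global _next_id
    user = UserOut(id=_next_id, email=payload.email, name=payload.name)
    _users[user.id] = user
    _next_id += 1
    return user


@router.get("/{user_id}", response_model=UserOut)
def read_user(user_id: int) -> UserOut:
    user = _users.get(user_id)
    if user is None:
        raise HTTPException(status_code=404, detail="User not found")
    return user

# Attach with: app = FastAPI(); app.include_router(router)
```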

Type Definitions and Interfaces

TypeScript interfaces, Python type hints, and data structure definitions are AI's sweet spot. Given a description or example data, AI produces accurate type definitions.

Example quality: 90-95% immediately usable.

Why it works: Type systems have formal rules that AI models learn well.

What to review: Optional vs. required fields, union types accuracy, generic constraints.
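
As an illustration, here is the kind of type-definition work AI is good at, using a hypothetical order payload; the optional field and literal union are exactly the spots the review checklist above targets.

```python
# Sketch of typical AI-generated type definitions (hypothetical order domain).
from dataclasses import dataclass, field
from typing import Literal, TypedDict


class Address(TypedDict):
    street: str
    city: str
    postal_code: str
    country: str


class OrderItem(TypedDict):
    sku: str
    quantity: int
    unit_price_cents: int


@dataclass
class Order:
    id: str
    status: Literal["pending", "paid", "shipped", "cancelled"]  # review: union accuracy
    items: list[OrderItem]
    shipping_address: Address
    promo_code: str | None = None        # review: truly optional, or required?
    discount_cents: int = 0
    metadata: dict[str, str] = field(default_factory=dict)
```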

Unit Test Skeletons

AI generates test structure, setup/teardown, and basic assertions well. The test framework, naming conventions, and organization follow best practices.

Example quality: 70-85% immediately usable (needs assertion refinement).

Why it works: Test structure is formulaic and patterns are consistent.

What to review: Edge cases, assertion logic, test data quality, mock setup accuracy.
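
A sketch of the typical pytest skeleton AI produces, assuming a hypothetical calculate_discount function; the structure and naming are usually fine as generated, while the assertions and boundary tests are where the review effort goes.

```python
# Typical AI-generated test skeleton; the module under test is hypothetical.
import pytest

from myapp.pricing import calculate_discount  # hypothetical module


@pytest.fixture
def sample_cart():
    return {"items": [{"sku": "A1", "quantity": 10, "unit_price_cents": 500}]}


def test_discount_applied_for_large_quantity(sample_cart):
    result = calculate_discount(sample_cart)
    assert result["discounted_total_cents"] < result["original_total_cents"]


def test_no_discount_for_small_quantity():
    cart = {"items": [{"sku": "A1", "quantity": 1, "unit_price_cents": 500}]}
    result = calculate_discount(cart)
    assert result["discounted_total_cents"] == result["original_total_cents"]


def test_discount_never_exceeds_cap(sample_cart):
    # review: AI rarely adds boundary tests like this without being asked
    result = calculate_discount(sample_cart, promo_code="BIGSALE")
    assert result["discounted_total_cents"] >= result["original_total_cents"] * 0.6
```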

Configuration File Generation

Dockerfile, docker-compose.yml, CI/CD configurations, and framework config files are generated reliably from descriptions.

Example quality: 75-90% immediately usable.

Why it works: Configuration formats are structured and documentation is extensive.

What to review: Environment-specific settings, security configurations, resource allocations, version pins.

Documentation and Comments

AI generates clear documentation from code, explains complex functions, and writes helpful comments.

Example quality: 80-90% immediately usable.

Why it works: Explaining code is similar to the chat tasks AI is trained for.

What to review: Accuracy of explanations, completeness of edge case documentation, tone consistency.

Tier 2: AI Assists (Heavy Review Required)

These tasks produce useful starting points but require significant developer refinement:

Business Logic Implementation

AI can translate requirements into code, but business rules have nuances that get lost. The logic structure is often reasonable, but specifics require correction.

Example quality: 40-60% immediately usable.

Why it struggles: Business rules have context and exceptions that aren't captured in simple descriptions.

What to review: Everything. Treat the output as an initial draft, not a solution.

Real example from our testing:

  • Prompt: "Implement pricing calculation with tiered discounts and promotional codes"
  • AI output: Basic structure correct, discount stacking logic wrong, promo code validation missing, currency handling incomplete
  • Developer time: roughly 40% less than writing from scratch, but deep review and significant correction were still required

Database Query Optimization

AI generates working queries but often misses performance considerations. Indexes, join strategies, and query structure may be inefficient.

Example quality: 50-70% immediately usable.

Why it struggles: Performance optimization requires understanding data distribution and database internals.

What to review: Query plans, index usage, N+1 problems, join strategies, pagination approaches.
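
To make the N+1 issue concrete, here is a sketch contrasting the lazy-loading query AI often produces with an eager-loading version, assuming SQLAlchemy 2.x and hypothetical Author/Book models with an Author.books relationship.

```python
# N+1 query pattern vs. eager loading (SQLAlchemy 2.x assumed; models hypothetical).
from sqlalchemy import select
from sqlalchemy.orm import Session, selectinload

from myapp.models import Author  # hypothetical model with an Author.books relationship


def list_books_n_plus_one(db: Session) -> list[str]:
    # One query for authors, then one additional query per author for .books (N+1)
    authors = db.scalars(select(Author)).all()
    return [book.title for author in authors for book in author.books]


def list_books_eager(db: Session) -> list[str]:
    # Authors and their books loaded in two batched queries instead of N+1
    authors = db.scalars(
        select(Author).options(selectinload(Author.books))
    ).all()
    return [book.title for author in authors for book in author.books]
```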

Error Handling and Edge Cases

AI implements happy path well but edge case handling is often superficial or wrong. Error messages, retry logic, and failure recovery need attention.

Example quality: 40-60% immediately usable.

Why it struggles: Edge cases aren't well-represented in training data and require domain knowledge.

What to review: All error paths, validation logic, boundary conditions, race conditions, timeout handling.

Integration Code

AI generates API client code and integration logic, but authentication, rate limiting, retry strategies, and error handling often need refinement.

Example quality: 50-70% immediately usable.

Why it struggles: Each API has quirks and undocumented behaviors that require experience.

What to review: Authentication flows, error handling, rate limiting, webhook signatures, pagination, data transformation.
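
A sketch of the refinement integration code usually needs, using the requests library: explicit timeouts, bounded retries with backoff, and handling of 429 rate-limit responses. The URL and token handling are placeholders, not any specific API's requirements.

```python
# Illustrative HTTP client with the safeguards AI-generated clients often omit.
import time

import requests


def fetch_with_retry(url: str, token: str, max_attempts: int = 4) -> dict:
    """GET a JSON resource with timeouts, bounded retries, and backoff."""
    for attempt in range(1, max_attempts + 1):
        wait = 2 ** attempt                       # exponential backoff default
        try:
            response = requests.get(
                url,
                headers={"Authorization": f"Bearer {token}"},
                timeout=10,                       # AI-generated clients often omit this
            )
            if response.status_code == 429:
                # rate limited: honor a numeric Retry-After header if present
                retry_after = response.headers.get("Retry-After", "")
                if retry_after.isdigit():
                    wait = int(retry_after)
            elif response.status_code < 500:
                response.raise_for_status()       # other 4xx: fail fast, don't retry
                return response.json()
            # 5xx falls through to the retry below
        except (requests.Timeout, requests.ConnectionError):
            pass                                  # transient network failure: retry
        if attempt == max_attempts:
            raise RuntimeError(f"giving up on {url} after {max_attempts} attempts")
        time.sleep(min(wait, 30))
    raise RuntimeError("unreachable")             # loop always returns or raises
```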

Refactoring Suggestions

AI identifies code smells and suggests improvements, but recommendations vary in quality. Some are excellent, others break functionality.

Example quality: 40-70% immediately usable (highly variable).

Why it struggles: Refactoring requires understanding intent and constraints that aren't obvious from code alone.

What to review: Does refactoring preserve behavior? Does it improve maintainability? Does it introduce new problems?

Tier 3: AI Struggles (Use as Reference Only)

These tasks produce output that's more inspiration than solution:

Architecture and System Design

AI can describe architectural patterns and generate component structures, but system design requires understanding business constraints, team capabilities, future requirements, and trade-offs.

Example quality: 20-40% immediately usable.

Why it struggles: Architecture is about decisions under uncertainty with incomplete information.

What to review: Everything. Use AI suggestions as brainstorming, not blueprints.

Security-Critical Code

Authentication, authorization, cryptography, and sensitive data handling require expertise. AI generates code that looks secure but often has subtle vulnerabilities.

Example quality: 30-50% immediately usable (security issues common).

Why it struggles: Security requires adversarial thinking and knowledge of attack patterns.

What to review: Everything, preferably with a security specialist involved. Don't trust AI alone for security-critical code.

Performance-Critical Code

Algorithms requiring optimal performance, memory management, concurrent access patterns, and low-level optimizations are beyond reliable AI generation.

Example quality: 30-50% immediately usable.

Why it struggles: Performance optimization requires profiling, measurement, and deep understanding of execution environment.

What to review: Benchmark everything. Profile. Optimize based on data, not AI suggestions.

Complex State Management

State machines, complex workflows, distributed state coordination, and transaction management require careful design. AI produces code that works for simple cases but fails under complexity.

Example quality: 30-50% immediately usable.

Why it struggles: State management edge cases and race conditions are subtle and context-dependent.

What to review: All state transitions, concurrency handling, consistency guarantees, rollback logic.
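
One pattern that helps when reviewing AI-generated workflow code is making the transition table explicit, as in this sketch with hypothetical order states; it turns "which transitions are legal?" into something a reviewer can check directly.

```python
# Explicit state-transition table (states and transitions are hypothetical).
from enum import Enum


class OrderState(str, Enum):
    PENDING = "pending"
    PAID = "paid"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"


# Every allowed transition is written down; anything else is rejected.
ALLOWED_TRANSITIONS: dict[OrderState, set[OrderState]] = {
    OrderState.PENDING: {OrderState.PAID, OrderState.CANCELLED},
    OrderState.PAID: {OrderState.SHIPPED, OrderState.CANCELLED},
    OrderState.SHIPPED: set(),        # terminal
    OrderState.CANCELLED: set(),      # terminal
}


def transition(current: OrderState, target: OrderState) -> OrderState:
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    # review: persistence and concurrency (e.g. optimistic locking) still needed
    return target
```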

Domain-Specific Algorithms

Specialized algorithms for finance, scientific computing, machine learning pipelines, or industry-specific calculations require domain expertise AI lacks.

Example quality: 20-40% immediately usable.

Why it struggles: Training data lacks depth in specialized domains.

What to review: Verify against domain knowledge, existing implementations, or academic references.

The Quality Question

Code quality from AI isn't consistent. We measured quality across five dimensions:

Correctness

Does the code do what it's supposed to do?

Findings:

  • Tier 1 tasks: 85-95% correct initially
  • Tier 2 tasks: 50-70% correct initially
  • Tier 3 tasks: 30-50% correct initially

Correctness improves with:

  • More specific prompts
  • Context from existing codebase
  • Iterative refinement with AI
  • Clear requirement specifications

Readability

Is the code easy to understand?

Findings:

  • AI-generated code is generally readable
  • Naming conventions are good
  • Structure is logical
  • Comments can be overly verbose or absent entirely
  • Consistent style within generated blocks, but may not match project style

Improvement needed:

  • Style guide enforcement after generation
  • Comment refinement (too many or too few)
  • Project-specific naming conventions

Maintainability

Can the code be modified and extended?

Findings:

  • Simple code is maintainable
  • Complex code often has hidden coupling
  • Error handling paths may be incomplete
  • Test coverage for AI-generated code varies widely

Maintainability issues:

  • Insufficient error handling
  • Magic numbers and hardcoded values
  • Poor separation of concerns
  • Incomplete abstraction boundaries

Efficiency

Does the code perform well?

Findings:

  • AI optimizes for correctness, not performance
  • Database queries often inefficient
  • Memory usage sometimes excessive
  • Algorithmic complexity not always optimal

Performance problems:

  • N+1 query problems
  • Inefficient data structures
  • Unnecessary copying
  • Missing caching opportunities

Security

Does the code protect against attacks?

Findings:

  • Input validation is inconsistent
  • SQL injection protection often present but not guaranteed
  • XSS protection varies
  • Authentication/authorization logic requires careful review
  • Cryptography implementations should never be trusted without expert review

Security concerns:

  • Missing input sanitization
  • Inadequate authentication checks
  • Weak cryptography choices
  • Insufficient rate limiting
  • Information disclosure in errors

The Trust Calibration Problem

Learning when to trust AI-generated code is the critical skill. We found that experienced developers build this intuition over 2-3 months of regular use:

Trust More:

  • Boilerplate and standard patterns
  • Code in well-established frameworks
  • Type definitions and interfaces
  • Test structure (not assertions)
  • Documentation generation
  • Simple transformations

Trust Less:

  • Business logic implementation
  • Security-sensitive code
  • Performance-critical paths
  • Error handling
  • Edge cases
  • Integration logic

Never Trust Without Review:

  • Database operations
  • Authentication/authorization
  • Payment processing
  • Data validation
  • Cryptography
  • System design decisions

How Experienced Developers Use AI Code Generation

After observing our team and interviewing developers experienced with AI tools, we saw four patterns emerge:

Pattern 1: Scaffolding + Refinement

Generate structure with AI, implement critical logic manually.

Process:

  1. Use AI to generate API endpoints, models, basic CRUD
  2. Manually implement business logic
  3. Use AI to generate tests
  4. Manually refine test assertions and edge cases
  5. Use AI to generate documentation
  6. Manually review and improve

Time savings: 25-35% on typical feature development

Pattern 2: Explanation + Implementation

Use AI to explain approaches, implement yourself.

Process:

  1. Ask AI to explain how to solve the problem
  2. Review suggestions and alternatives
  3. Implement yourself with AI assistance for boilerplate
  4. Use AI to review your implementation
  5. Refine based on feedback

Time savings: 15-25%, better understanding of implementation

Pattern 3: Iteration + Correction

Generate with AI, iteratively fix issues through conversation.

Process:

  1. Generate initial implementation with AI
  2. Identify problems through testing or review
  3. Ask AI to fix specific issues
  4. Repeat until acceptable
  5. Final manual review and refinement

Time savings: 20-30%, works best for well-defined problems

Pattern 4: Reference + Adaptation

Use AI output as reference, adapt to specific needs.

Process:

  1. Generate multiple AI approaches to the problem
  2. Extract good ideas from each
  3. Implement yourself combining best elements
  4. Use AI for boilerplate parts
  5. Manual refinement throughout

Time savings: 15-20%, highest quality outcomes

The Prompting Skill Gap

Effective use of AI code generation requires learning how to prompt. We documented principles that improved output quality:

Principle 1: Context is Critical

Bad prompt: "Create a user authentication system"

Good prompt: "Create a JWT-based authentication system for a FastAPI application. Use PostgreSQL for user storage, bcrypt for password hashing, and include endpoints for registration, login, logout, and token refresh. Follow RESTful conventions and return appropriate HTTP status codes."

Context improves output quality dramatically.

Principle 2: Specify Constraints

Bad prompt: "Write a function to calculate discounts"

Good prompt: "Write a Python function to calculate discounts with these rules: 1) Tiered discounts based on quantity, 2) Promotional codes apply before tier discounts, 3) Maximum discount is 40%, 4) Return both original and discounted prices. Include type hints and docstring."

Constraints prevent ambiguity.
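
For reference, this is roughly the shape of output the constrained prompt above is aiming for. The tier breakpoints and promo codes here are hypothetical; only the 40% cap and the promo-before-tier ordering come from the prompt.

```python
# Illustrative implementation of the constrained discount prompt above.
PROMO_CODES = {"WELCOME10": 0.10, "VIP20": 0.20}            # hypothetical codes
QUANTITY_TIERS = [(100, 0.15), (50, 0.10), (10, 0.05)]      # (min quantity, rate), hypothetical
MAX_DISCOUNT = 0.40                                          # cap from the prompt


def calculate_price(unit_price: float, quantity: int, promo_code: str | None = None) -> dict:
    """Return original and discounted totals.

    Promotional codes apply before tier discounts; total discount is capped at 40%.
    """
    original = unit_price * quantity

    promo_rate = PROMO_CODES.get(promo_code, 0.0) if promo_code else 0.0
    tier_rate = next((rate for threshold, rate in QUANTITY_TIERS if quantity >= threshold), 0.0)

    price = original * (1 - promo_rate)    # promo first, per the stated rules
    price = price * (1 - tier_rate)        # then the quantity tier discount

    effective_rate = 1 - price / original if original else 0.0
    if effective_rate > MAX_DISCOUNT:      # enforce the 40% cap
        price = original * (1 - MAX_DISCOUNT)

    return {"original_price": round(original, 2), "discounted_price": round(price, 2)}
```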

Principle 3: Provide Examples

Bad prompt: "Parse this log file"

Good prompt: "Parse this Apache access log file. Example line: '192.168.1.1 - - [01/Jan/2025:12:00:00 +0000] "GET /api/users HTTP/1.1" 200 1234'. Extract IP, timestamp, HTTP method, path, status code, and response size. Return as list of dictionaries."

Examples clarify requirements better than descriptions.
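
As a sketch, here is the kind of parser that prompt should yield, built around the example line it provides; the regex is ours and would need review against real log variants.

```python
# Parser sketch for the Apache access log format in the prompt above.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_log(lines: list[str]) -> list[dict]:
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue                  # review: decide how to handle malformed lines
        record = match.groupdict()
        record["status"] = int(record["status"])
        record["size"] = 0 if record["size"] == "-" else int(record["size"])
        records.append(record)
    return records


example = '192.168.1.1 - - [01/Jan/2025:12:00:00 +0000] "GET /api/users HTTP/1.1" 200 1234'
print(parse_access_log([example]))
# [{'ip': '192.168.1.1', 'timestamp': '01/Jan/2025:12:00:00 +0000',
#   'method': 'GET', 'path': '/api/users', 'status': 200, 'size': 1234}]
```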

Principle 4: Iterate Explicitly

Bad approach: Generate once, accept or reject.

Good approach: "Generate initial implementation" → Review → "Fix error handling" → Review → "Add input validation" → Review → "Optimize database query" → Final review

Iterative refinement produces better results than one-shot generation.

Principle 5: Specify Code Style

Bad prompt: "Create a React component"

Good prompt: "Create a React component using TypeScript, functional components with hooks, following Airbnb style guide. Include prop types, error boundaries, and loading states. Use styled-components for styling."

Style specifications improve consistency with existing codebase.

The Testing Imperative

AI-generated code requires testing discipline:

Rule 1: Test AI code more thoroughly than human code

AI can be confidently wrong. Tests catch subtle errors that look correct on review.

Rule 2: Don't trust AI-generated tests

Test structure is fine, but assertions may be incomplete or wrong. Verify test logic independently.

Rule 3: Generate tests after code, not before

TDD with AI is problematic: AI-generated tests may pass even when the implementation is wrong. Write tests yourself, or generate them only after manually verifying the code.

Rule 4: Use AI for test data generation

AI excels at creating diverse test data, edge cases, and fixture generation. This is a high-value use case.
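
For example, AI-suggested edge cases drop naturally into a parametrized test; the validate_email function here is hypothetical.

```python
# AI-suggested edge-case inputs wired into a parametrized test (function hypothetical).
import pytest

from myapp.validation import validate_email  # hypothetical

EDGE_CASES = [
    ("user@example.com", True),
    ("user+tag@example.co.uk", True),
    ("", False),
    ("no-at-sign.example.com", False),
    ("user@", False),
    ("   user@example.com   ", False),   # leading/trailing whitespace
    ("user@exämple.com", False),         # review: is IDN support actually required?
]


@pytest.mark.parametrize("value,expected", EDGE_CASES)
def test_validate_email(value, expected):
    assert validate_email(value) == expected
```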

Rule 5: Manual integration testing is essential

AI unit tests may pass while integration fails. Don't skip manual testing even with comprehensive AI-generated tests.

Real Project Results

We tracked AI code generation impact across three real projects:

Project 1: Customer Management System

Project scope: Migration from legacy system to modern Python/FastAPI backend

AI usage: Heavy (60% of code had AI assistance)

Results:

  • Development time: 32% faster than estimated without AI
  • Initial bug density: 12% higher than historical average
  • Post-review bug density: Normal after adjustment period
  • Code quality: Acceptable after establishing review processes
  • Developer satisfaction: High (after learning curve)

Lessons:

  • AI excels at API endpoint generation
  • Business logic required significant refinement
  • Testing discipline prevented quality issues
  • Code review processes needed updating

Project 2: E-commerce Dashboard

Project scope: React/TypeScript frontend rebuild

AI usage: Moderate (40% of code had AI assistance)

Results:

  • Development time: 23% faster than estimated
  • Component quality: Good for simple components, mixed for complex
  • State management: Required heavy manual refinement
  • Styling: AI-generated styles needed design review
  • Developer satisfaction: Moderate (mixed value)

Lessons:

  • UI component scaffolding saved significant time
  • Complex state management better done manually
  • AI styling suggestions rarely matched design system
  • Type definitions were AI's strength

Project 3: Integration Microservice

Project scope: Node.js service connecting multiple external APIs

AI usage: Light (25% of code had AI assistance)

Results:

  • Development time: 15% faster than estimated
  • Integration quality: Required significant manual work
  • Error handling: AI suggestions were inadequate
  • Documentation: AI-generated docs were helpful
  • Developer satisfaction: Low (limited value for this project type)

Lessons:

  • Integration logic requires API-specific knowledge
  • Error handling and retry logic too complex for AI
  • Boilerplate savings were minimal for this project type
  • Documentation generation was the primary value

The Cost-Benefit Reality

Across all projects, we measured:

Average time savings: 20-25% (not 50%, not 10x)

Time investment required:

  • Learning effective prompting: 10-20 hours per developer
  • Establishing review processes: 5-10 hours for team
  • Tool setup and configuration: 2-5 hours per developer

Break-even point: 3-4 weeks of regular usage

ROI varies by:

  • Project type (backend CRUD: high value, complex integration: low value)
  • Developer experience (senior devs extract more value)
  • Codebase maturity (greenfield projects: easier, legacy: harder)
  • Code review discipline (good processes: high value, poor: negative value)

When AI Code Generation Doesn't Make Sense

Be honest about situations where AI tools provide minimal value:

Low value scenarios:

  • Highly specialized domain code (finance, scientific, healthcare)
  • Performance-critical systems (real-time, embedded, high-throughput)
  • Security-focused development (authentication, encryption, compliance)
  • Legacy codebase maintenance (AI lacks historical context)
  • Exploratory development (requirements unclear)

Negative value scenarios:

  • Junior developers learning fundamentals (AI interferes with learning)
  • Code review processes are weak (AI amplifies quality problems)
  • Team resistance to AI tools (adoption requires buy-in)
  • Strict regulatory environments (code provenance tracking required)

The Future Trajectory

Based on six months of daily use, here's our assessment of where AI code generation is heading:

Improving:

  • Context understanding across larger codebases
  • Framework-specific code quality
  • Multi-file refactoring capabilities
  • Test generation quality

Plateauing:

  • Algorithm implementation
  • Security-critical code
  • Performance optimization
  • Domain-specific logic

Still problematic:

  • System architecture decisions
  • Complex state management
  • Integration with poorly documented APIs
  • Code that requires deep domain expertise

Expect gradual improvement, not revolutionary leaps. The fundamental limitations (lack of true understanding, inability to reason about real-world constraints) remain.

Practical Recommendations

After six months of intensive AI code generation usage:

Do:

  • Use AI for boilerplate and scaffolding (high value, low risk)
  • Establish clear code review processes for AI output
  • Invest time learning effective prompting
  • Measure actual productivity impact in your context
  • Treat AI as a junior developer, not an expert
  • Generate tests after verifying code logic
  • Use AI for documentation and explanation

Don't:

  • Trust AI-generated security code without expert review
  • Accept complex business logic without thorough testing
  • Skip code review because "AI generated it"
  • Use AI as excuse for not understanding implementation
  • Generate production code without manual verification
  • Rely on AI for architecture decisions
  • Assume AI-generated tests are comprehensive

Consider carefully:

  • Is the code type suited to AI generation? (Tier 1: yes, Tier 3: no)
  • Do you have review processes to catch AI errors?
  • Is development speed the bottleneck? (Often it's not)
  • Will AI assistance help or hinder learning?
  • Can you afford the tool subscription costs?

The Bottom Line

AI code generation is neither revolution nor hype. It's a productivity tool that excels at specific tasks and struggles with others.

What it does well:

  • Boilerplate and scaffolding
  • Standard patterns in established frameworks
  • Documentation and explanation
  • Type definitions and interfaces
  • Test structure generation

What it doesn't do well:

  • Complex business logic
  • Security-critical code
  • Performance optimization
  • Architecture decisions
  • Domain-specific algorithms

Real productivity improvement: 20-25% for appropriate tasks

That's meaningful but not transformative. It makes good developers somewhat faster at some things.

The developers who benefit most are those who understand the limitations, develop effective prompting skills, maintain review discipline, and use AI strategically rather than universally.

AI code generation is a tool. Use it where it excels, avoid it where it struggles, and always verify the output.

```mermaid
%%{init: {'theme':'base', 'themeVariables': {
  'primaryColor':'#e3f2fd',
  'primaryTextColor':'#0d47a1',
  'primaryBorderColor':'#1976d2',
  'secondaryColor':'#e8f5e9',
  'secondaryTextColor':'#1b5e20',
  'tertiaryColor':'#fff3e0',
  'tertiaryTextColor':'#e65100',
  'lineColor':'#1976d2',
  'fontSize':'16px'
}}}%%
graph TD
    A[Code Generation Task] --> B{Task Type?}

    B -->|Boilerplate/CRUD| C[Tier 1: AI Excels]
    B -->|Business Logic| D[Tier 2: AI Assists]
    B -->|Architecture/Security| E[Tier 3: AI Struggles]

    C --> F[Generate with AI]
    F --> G[Light Review]
    G --> H[85-95% Usable]

    D --> I[Generate with AI]
    I --> J[Heavy Review & Refinement]
    J --> K[50-70% Usable]

    E --> L[Use AI as Reference Only]
    L --> M[Manual Implementation]
    M --> N[20-40% Helpful]

    H --> O[Test Thoroughly]
    K --> O
    N --> O

    O --> P{Quality OK?}
    P -->|Yes| Q[Ship]
    P -->|No| R[Refine/Rewrite]

    style C fill:#e8f5e9,stroke:#4caf50,color:#1b5e20
    style D fill:#fff3e0,stroke:#f57c00,color:#e65100
    style E fill:#ffebee,stroke:#f44336,color:#c62828
    style Q fill:#e8f5e9,stroke:#4caf50,color:#1b5e20
    style R fill:#ffebee,stroke:#f44336,color:#c62828
```

Pro Tip: The best AI code generation strategy is knowing when not to use it. Master the fundamentals first, then use AI to accelerate what you already understand.

AI code generation won't replace developers. It will, however, change what "being a good developer" means. Understanding when and how to use AI tools effectively is becoming part of the skillset.

The future isn't AI writing all the code. It's developers who know which code to let AI write, which to write themselves, and how to verify everything works correctly.

That's the reality. Not as sexy as "AI replaces programmers," but a lot more useful.
