
How We Built 24 Microservices in 6 Months (For Under $100K): Complete Case Study

A transparent breakdown of how Thalamus built SYNAPTICA, our enterprise-grade AI infrastructure, including architecture decisions, team structure, lessons learned, and exact cost breakdowns.

Shawn Sloan

Co-founder & CTO

February 6, 2026 · 18 min read · Part 5 of 5


In January 2025, we set out to build SYNAPTICA: enterprise-grade AI infrastructure that could compete with platforms costing millions of dollars annually from vendors. Six months later, we had 24 production microservices handling thousands of requests per day, with multi-LLM orchestration, comprehensive observability, and enterprise security.

The total cost? Under $100,000.

This is not a theoretical case study. This is exactly how we did it: the architecture decisions we made, the mistakes we learned from, the team structure that worked, and the precise cost breakdowns that prove building is viable for any organization with competent engineers.

The Challenge: Building Enterprise AI Infrastructure on a Startup Budget

What We Needed to Build

Our requirements were ambitious:

  • Multi-LLM orchestration: Route requests across GPT-4, Claude, Gemini, and open-source models
  • Prompt management: Version control, A/B testing, dynamic composition
  • Safety layer: Input/output validation, PII detection, content filtering
  • Governance: Audit trails, policy enforcement, human-in-the-loop
  • Observability: Request tracing, cost attribution, performance analytics
  • Enterprise security: SOC 2 compliance, encryption, access controls
  • Scalability: Handle traffic spikes, multi-tenant isolation
  • Developer experience: Clean APIs, comprehensive documentation
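To make the safety-layer requirement concrete, a first-pass PII detector can be as simple as a set of regular expressions. The patterns below are an illustrative sketch, not the production SYNAPTICA detector, which also needs ML-based entity recognition for names and addresses:

```python
import re

# Illustrative patterns only -- a real detector would carry many more,
# plus model-based recognition for entities regexes cannot catch.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def detect_pii(text: str) -> dict[str, list[str]]:
    """Return all PII matches found in `text`, keyed by category."""
    hits = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

def redact(text: str) -> str:
    """Replace each match with a [CATEGORY] placeholder."""
    for name, pat in PII_PATTERNS.items():
        text = pat.sub(f"[{name.upper()}]", text)
    return text
```

In practice this regex pass runs first because it is cheap; anything it flags can be redacted before a prompt ever reaches a model.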

What Vendors Charge for This

| Vendor Category | Annual Cost | Implementation | 3-Year Total |
| --- | --- | --- | --- |
| AI Orchestration Platform | $300,000-500,000 | $200,000-400,000 | $1,100,000-1,900,000 |
| Prompt Management | $100,000-200,000 | $50,000-100,000 | $350,000-700,000 |
| Safety/Governance Layer | $150,000-300,000 | $100,000-200,000 | $550,000-1,100,000 |
| Observability Suite | $50,000-100,000 | $25,000-50,000 | $175,000-350,000 |
| Combined Estimate | $600,000-1,100,000 | $375,000-750,000 | $2,175,000-4,050,000 |

We needed to build equivalent capability for less than 5% of the vendor cost.

The Architecture: Designing for Speed and Scale

Core Architectural Principles

Before writing code, we established these principles:

  1. Cloud-native from day one: No legacy baggage, serverless where possible
  2. API-first design: Every service speaks HTTP/REST or gRPC
  3. Event-driven communication: Async for decoupling, sync where needed
  4. Microservices with bounded contexts: Clear service boundaries
  5. Infrastructure as code: Terraform for reproducible environments
  6. Observability built-in: Logging, metrics, tracing from the start

The SYNAPTICA Architecture

Our architecture follows a simple pattern:

  1. API Gateway handles authentication, rate limiting, and routing
  2. Router Service determines which LLM to use for each request
  3. Prompt Manager handles versioning and template composition
  4. Safety Service validates inputs and outputs
  5. LLM Adapters connect to OpenAI, Anthropic, and open-source models
  6. Response Processor handles caching and formatting

This modular design allowed us to build, test, and deploy each component independently.
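As a sketch of that flow, the routing and validation steps can be chained as plain functions. The model names and routing rules here are invented for illustration; the real Router Service applies considerably richer logic:

```python
def route(request: dict) -> str:
    """Router Service: pick a model from simple request attributes (illustrative)."""
    if request.get("needs_long_context"):
        return "claude-3"
    if request.get("latency_sensitive"):
        return "gpt-3.5"
    return "gpt-4"

def validate(request: dict) -> dict:
    """Safety Service: reject obviously bad input before spending tokens."""
    if not request.get("prompt", "").strip():
        raise ValueError("empty prompt")
    return request

def handle(request: dict) -> dict:
    """API Gateway -> Safety -> Router; an LLM adapter call would follow."""
    request = validate(request)
    model = route(request)
    # In the real system, the chosen LLM adapter and Response Processor run here.
    return {"model": model, "prompt": request["prompt"]}
```

The point is the shape, not the rules: each stage takes a request and either enriches it or rejects it, which is exactly what lets the stages live in separate services.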

The 24 Microservices

Here is what each service does:

| # | Service | Purpose | Complexity |
| --- | --- | --- | --- |
| 1 | API Gateway | Entry point, auth, rate limiting | Medium |
| 2 | Router Service | LLM selection logic | High |
| 3 | Prompt Manager | Version control, templates | Medium |
| 4 | Safety Service | Content validation | High |
| 5 | PII Detector | Personal information detection | Medium |
| 6 | Cache Service | Response caching | Low |
| 7 | Cost Tracker | Usage tracking, attribution | Medium |
| 8 | Audit Logger | Compliance logging | Medium |
| 9 | Policy Engine | Governance rules | High |
| 10-14 | LLM Adapters | OpenAI, Claude, Gemini, Llama, Mistral | Medium |
| 15-18 | Response Processors | Formatting, caching, streaming | Low-Medium |
| 19-21 | Observability | Metrics, logging, alerting | Low |
| 22-24 | Infrastructure | Config, secrets, health checks | Low |

Average per service: ~490 lines of code

This is not massive complexity—it is well-factored, focused services doing specific jobs.

The Team Structure: Who Did What

Team Composition

| Role | Background | Time Commitment |
| --- | --- | --- |
| Tech Lead / Architect (Shawn) | 20 years enterprise architecture | 6 months, 80% |
| Senior Engineer | 8 years backend, distributed systems | 6 months, 100% |
| ML Engineer | 5 years ML, previously at research lab | 4 months, 100% |
| DevOps Engineer | 6 years cloud infrastructure | 3 months, 100% |

Total engineering capacity: ~3 FTE average over 6 months ≈ 18 person-months

Work Distribution

Months 1-2: Foundation

  • Tech Lead: Architecture design, API specifications, infrastructure planning
  • Senior Engineer: Core services (Gateway, Router, Adapters)
  • ML Engineer: Model evaluation, selection criteria, fine-tuning pipeline
  • DevOps Engineer: CI/CD setup, cloud infrastructure, monitoring baseline

Months 3-4: Core Features

  • Tech Lead: Safety layer design, governance framework
  • Senior Engineer: Prompt Manager, Cache Service, Response processing
  • ML Engineer: PII detection, content classification, evaluation framework
  • DevOps Engineer: Security hardening, compliance preparation, scaling setup

Months 5-6: Polish and Scale

  • Tech Lead: Performance optimization, documentation, developer experience
  • Senior Engineer: Batch processing, webhooks, edge cases
  • ML Engineer: Model performance tuning, fallback strategies
  • DevOps Engineer: Load testing, disaster recovery, production readiness

Key Team Dynamics

What Worked:

  • Small team = minimal coordination overhead
  • Clear ownership = no ambiguity
  • Daily standups = quick problem resolution
  • Shared codebase = collective code ownership
  • Weekend prototyping = rapid experimentation

What Was Challenging:

  • Context switching across services
  • Wearing multiple hats (dev, ops, testing)
  • Limited time for comprehensive testing
  • Documentation lagged behind code

Technology Stack: What We Used

Programming Languages

| Language | Usage | Rationale |
| --- | --- | --- |
| Python | 70% of codebase | AI/ML libraries, rapid development |
| TypeScript | 25% of codebase | Type safety, developer experience |
| Go | 5% of codebase | Performance-critical paths |

Core Frameworks and Libraries

| Category | Technology | Cost |
| --- | --- | --- |
| Web Framework | FastAPI (Python), Express (Node) | Free |
| AI/ML | Transformers, LangChain, OpenAI SDK | Free |
| Database | PostgreSQL, Redis | Free |
| Message Queue | Redis Pub/Sub | Free (existing) |
| Observability | OpenTelemetry, Prometheus, Grafana | Free |
| Testing | pytest, Jest | Free |
| Documentation | MkDocs, Swagger/OpenAPI | Free |

Total software licensing cost: $0

Cloud Infrastructure (GCP)

| Service | Usage | Monthly Cost |
| --- | --- | --- |
| Cloud Run | Container hosting for all 24 services | $1,500 |
| Cloud SQL | PostgreSQL for persistence | $1,000 |
| Memorystore | Redis for caching/messaging | $500 |
| Cloud Storage | Model weights, logs, backups | $250 |
| Load Balancing | HTTPS termination | $400 |
| Cloud Monitoring | Logs, metrics, alerts | $200 |
| Secret Manager | Credential storage | $50 |
| Networking | Egress, NAT | $300 |
| Total | | $4,200/month |

Third-Party Services

| Service | Purpose | Monthly Cost |
| --- | --- | --- |
| OpenAI API | GPT-4, GPT-3.5 | $2,000 |
| Anthropic API | Claude 3 | $1,000 |
| Datadog | APM, advanced monitoring | $1,000 |
| GitHub Enterprise | Source control, CI/CD | $400 |
| Sentry | Error tracking | $200 |
| Total | | $4,600/month |

Development Methodology: How We Moved Fast

Sprint Structure

We used 1-week sprints with this rhythm:

| Day | Activity |
| --- | --- |
| Monday | Sprint planning (1 hour), feature development |
| Tuesday-Thursday | Feature development, pair programming |
| Friday | Demo, retrospective, deployment |

Key rule: Every Friday, something deployed to production.

Development Practices

1. Feature Flags

  • All new features behind flags
  • Deploy incomplete work safely
  • Gradual rollout to users

2. Trunk-Based Development

  • No long-lived feature branches
  • Merge to main daily
  • Feature flags control visibility

3. Automated Testing

  • Unit tests: ~70% coverage
  • Integration tests: Critical paths
  • Contract tests: Service boundaries

4. Infrastructure as Code

  • Terraform for all infrastructure
  • Code review for infra changes
  • Reproducible environments

5. Observability First

  • Structured logging from day one
  • Distributed tracing across services
  • Custom metrics for business logic
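The feature-flag practice above needs surprisingly little machinery. A minimal in-process version with deterministic per-user rollout might look like this (a sketch under our own assumptions, not the flag store we actually ran):

```python
import hashlib

FLAGS = {
    # flag name -> rollout fraction (0.0 = off, 1.0 = fully on)
    "new_router_v2": 0.25,
    "streaming_responses": 1.0,
}

def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministic per-user bucketing: the same user always gets the same answer."""
    fraction = FLAGS.get(flag, 0.0)  # unknown flags default to off
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < fraction * 100
```

Hashing flag and user together means each flag rolls out to an independent slice of users, and ramping a flag from 25% to 50% only ever adds users, never flips existing ones back.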

The Cost Breakdown: Exact Numbers

Labor Costs (Fully-Loaded)

| Role | Months | Monthly Cost | Total |
| --- | --- | --- | --- |
| CTO (Shawn) | 6 | $10,000* | $60,000 |
| Senior Engineer | 6 | $10,000 | $60,000 |
| ML Engineer | 4 | $12,500 | $50,000 |
| DevOps Engineer | 3 | $10,000 | $30,000 |
| Total Labor | | | $200,000 |

*Founder rate—actual cash outlay was lower

Infrastructure Costs (First 6 Months)

| Category | Monthly | 6 Months |
| --- | --- | --- |
| GCP Infrastructure | $4,200 | $25,200 |
| Third-party APIs | $3,000** | $18,000 |
| Monitoring/Tooling | $1,600 | $9,600 |
| Total Infrastructure | | $52,800 |

**API costs were lower during development

Other Costs

| Item | Cost |
| --- | --- |
| Domain registration, SSL certs | $200 |
| Security audit (basic) | $5,000 |
| Documentation tools | $500 |
| Development tools | $2,000 |
| Legal (terms of service, privacy) | $3,000 |
| Total Other | $10,700 |

Grand Total: $263,500

Wait—that is more than $100K. Here is the context:

If paying market rates for everything: $263,500

Actual cash outlay (founders + lean operations): ~$80,000

What external company would pay to replicate: $200,000-300,000

Even at full market rates, we built for <15% of vendor pricing.
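The percentages are easy to verify from the figures above (all vendor numbers are this article's own estimates):

```python
build_full_rate = 263_500   # build cost at full market rates
build_cash = 80_000         # actual cash outlay
vendor_3yr_low = 2_175_000  # low end of the combined 3-year vendor estimate

pct_full = build_full_rate / vendor_3yr_low * 100  # ~12% of vendor pricing
pct_cash = build_cash / vendor_3yr_low * 100       # ~3.7% of vendor pricing
```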

Lessons Learned: What Worked and What Didn't

What Worked Exceptionally Well

1. Microservices from Day One

  • Enabled parallel development
  • Clear boundaries reduced conflicts
  • Independent deployment reduced risk
  • Team ownership was clear

2. Serverless/Containerization

  • Cloud Run's pay-per-request model saved thousands
  • Auto-scaling handled traffic spikes without config
  • Zero server management overhead

3. API-First Design

  • Clear contracts between services
  • Easy to test independently
  • Frontend and backend developed in parallel
  • Documentation was automatic

4. Event-Driven Architecture

  • Decoupled services
  • Async processing for resilience
  • Easy to add new consumers
  • Natural audit trail

5. Open Source Everything

  • Zero licensing costs
  • Large community for support
  • No vendor lock-in
  • Could self-host if needed
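The event-driven pattern (point 4 above) is small enough to sketch. An in-memory publish/subscribe bus, standing in for the Redis Pub/Sub we actually used, shows why adding a new consumer costs almost nothing:

```python
from collections import defaultdict
from typing import Callable

subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def subscribe(topic: str, handler: Callable[[dict], None]) -> None:
    subscribers[topic].append(handler)

def publish(topic: str, event: dict) -> None:
    # The publisher never knows who is listening -- that is the decoupling.
    for handler in subscribers[topic]:
        handler(event)

# Two independent consumers of the same event: an audit trail and a
# (no-op here) cost tracker. The publisher is unchanged by either.
audit_log: list[dict] = []
subscribe("request.completed", audit_log.append)
subscribe("request.completed", lambda event: None)
publish("request.completed", {"model": "gpt-4", "tokens": 512})
```

Swap the dict for Redis channels and the handlers for services, and the audit trail falls out of the architecture for free.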

What We Would Do Differently

1. Start with Fewer Services

  • 24 was too many initially
  • Could have started with 8-10 larger services
  • Refactored to 24 later as needed

2. Invest More in Testing Early

  • Integration tests were underdeveloped
  • Caught issues in production that tests would have found

3. Better Documentation Culture

  • Docs lagged behind code
  • Onboarding new team members was harder

4. Local Development Environment

  • Running 24 services locally was challenging
  • Should have invested in better dev tooling

Mistakes That Cost Us Time

| Mistake | Impact | Lesson |
| --- | --- | --- |
| Over-engineered caching | 2 weeks wasted | Start simple, optimize when needed |
| Premature abstraction | 1 week refactoring | Concrete first, abstract later |
| Wrong database choice initially | 3 days migration | Evaluate more carefully upfront |
| Overly complex auth | 1 week simplification | Standard solutions first |

Performance and Scale: What We Achieved

Throughput Metrics

| Metric | Target | Achieved |
| --- | --- | --- |
| Requests per second | 100 | 500+ |
| Average latency (p50) | <500ms | 320ms |
| Average latency (p95) | <1000ms | 780ms |
| Error rate | <1% | 0.3% |
| Uptime | 99.9% | 99.97% |

Cost Efficiency

| Metric | Vendor Estimate | Our Cost | Savings |
| --- | --- | --- | --- |
| Per-request cost | $0.05 | $0.003 | 94% |
| Monthly infrastructure | $20,000 | $4,200 | 79% |
| Annual platform cost | $600,000 | $50,400 | 92% |

Can You Do This? Assessment Framework

Not every organization should build their own AI infrastructure. Here is how to decide:

Build If You Have:

| Requirement | Minimum Threshold |
| --- | --- |
| Engineering team | 2+ backend engineers |
| Timeline | 4-6 months available |
| Budget | $100K-300K for build |
| Strategic value | Core differentiator |
| Usage volume | >$50K/month projected |
| Customization needs | Significant |

Buy If You Have:

| Situation | Recommendation |
| --- | --- |
| No engineering team | Use managed APIs directly |
| Immediate need (<1 month) | Rent temporarily, build in parallel |
| Low volume (<$10K/month) | Direct API usage |
| Commodity use case | Standard SaaS solution |

Scaling After Build

Once built, ongoing staffing needs are modest:

Maintenance Team (Steady State)

| Role | FTE | Annual Cost |
| --- | --- | --- |
| Platform Engineer | 0.5 | $75,000 |
| ML Engineer | 0.25 | $50,000 |
| Total | 0.75 | $125,000 |

Compare to vendor platform:

  • Annual license: $300,000-600,000
  • Savings: $175,000-475,000/year

Plus: You own the IP, have internal capability, and can customize freely.

Conclusion: Building Is More Accessible Than Ever

Six months. Four people. Under $100,000 in actual cash outlay.

We built what vendors charge millions for. Not because we are exceptional—though our team is skilled—but because modern tools have democratized software development to an unprecedented degree.

What Made This Possible

  1. Cloud-native infrastructure - No servers to manage, pay for what you use
  2. Open-source ecosystem - World-class tools, freely available
  3. AI commoditization - Foundation models via simple APIs
  4. Modern frameworks - FastAPI, Next.js, etc. accelerate development
  5. Small team dynamics - Minimal overhead, maximum focus

The Real Lesson

The barrier to building enterprise-grade software is not technical complexity—it is the illusion that building is impossibly difficult. Vendors perpetuate this illusion because it justifies their pricing.

The truth: A small team of competent engineers can build extraordinary things in months, not years, for hundreds of thousands, not millions.

Your Next Steps

If you are considering building:

  1. Start with a proof of concept (2-4 weeks)
  2. Validate technical approach with your team
  3. Build incrementally - one service at a time
  4. Measure religiously - track costs, performance, value
  5. Document everything - future you will thank present you

Continue Your Education

This article is Part 5 of our Enterprise AI Illusion series.

Ready to explore building your own AI infrastructure? Contact our team to discuss how SYNAPTICA and our approach can accelerate your journey. Or explore the SYNAPTICA platform to see what we built.

Tags: #enterprise-ai #case-study #building-vs-buying #synaptica

Shawn Sloan

Co-founder & CTO

Building the future of enterprise AI at Thalamus. Passionate about making powerful technology accessible to businesses of all sizes.
