AI Integration Best Practices: From POC to Production
Quick Answer
Successful AI integration requires careful model selection, robust error handling, prompt engineering, cost management, and comprehensive monitoring. Start with clear use cases, implement fallbacks for failures, and monitor quality metrics continuously.
The AI Integration Landscape
Integrating AI into production applications has become more accessible than ever, but moving from a proof-of-concept to a reliable production system requires careful planning and execution.
Choosing the Right Model
Model Selection Criteria
Not all AI models are created equal. Consider:
- Task-specific performance - Classification vs generation vs embedding
- Latency requirements - Real-time vs batch processing
- Cost constraints - Token pricing and volume
- Context window size - How much data can you process?
- Privacy requirements - On-premise vs cloud
Popular Model Options
| Model | Best For | Context | Cost |
|---|---|---|---|
| GPT-4 Turbo | Complex reasoning, high quality | 128K | High |
| GPT-3.5-turbo | General purpose, fast | 16K | Low |
| Claude 3 | Long documents, coding | 200K | Medium |
| Llama 2 | Self-hosted, privacy | 4K | Infrastructure only |
| Mistral | Cost-effective, fast | 32K | Low |

Context windows and per-token pricing vary by model version and change frequently, so verify against provider documentation before committing.
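As a concrete starting point, the criteria above can be encoded as a simple routing table that maps task types to models. The task categories and model choices below are illustrative assumptions, not recommendations for your workload:

// A minimal model-routing sketch; tune the choices against your own benchmarks
type Task = 'classification' | 'summarization' | 'generation' | 'embedding'

interface ModelChoice {
  model: string
  maxTokens: number
  reason: string
}

const MODEL_ROUTING: Record<Task, ModelChoice> = {
  classification: { model: 'gpt-3.5-turbo', maxTokens: 50, reason: 'cheap and fast for short labels' },
  summarization: { model: 'claude-3-sonnet', maxTokens: 500, reason: 'large context window' },
  generation: { model: 'gpt-4-turbo', maxTokens: 1000, reason: 'strongest reasoning' },
  embedding: { model: 'text-embedding-3-small', maxTokens: 0, reason: 'purpose-built, low cost' },
}

function pickModel(task: Task): ModelChoice {
  return MODEL_ROUTING[task]
}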
Prompt Engineering
Good prompts are the foundation of reliable AI systems:
System Prompt Template
const SYSTEM_PROMPT = `
You are a helpful assistant for [SPECIFIC_USE_CASE].
Your responsibilities:
- [Responsibility 1]
- [Responsibility 2]
Guidelines:
- Be concise and accurate
- If unsure, say so
- Format output as [JSON/markdown/etc]
- Never make up information
Constraints:
- Response must be under [X] tokens
- Use [SPECIFIC_TONE]
`
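To make the template concrete, here is one way to fill it in and send it as the system message. The support-triage values (and the remaining placeholders, filled the same way) are our own examples, not from a real deployment:

import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

// Fill the template's placeholders for a hypothetical support-triage bot
const systemPrompt = SYSTEM_PROMPT
  .replace('[SPECIFIC_USE_CASE]', 'customer support ticket triage')
  .replace('[Responsibility 1]', 'Categorize each ticket by urgency')
  .replace('[Responsibility 2]', 'Draft a one-paragraph suggested reply')

const response = await openai.chat.completions.create({
  model: 'gpt-4-turbo',
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: 'My March invoice never arrived.' },
  ],
})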
User Prompt Best Practices
// ✗ Vague prompt
const badPrompt = "Summarize this document"
// ✓ Specific prompt with structure
const goodPrompt = `
Analyze the following document and provide:
1. Main Topic (1 sentence)
2. Key Points (3-5 bullet points)
3. Action Items (if any)
4. Sentiment (positive/neutral/negative)
Document:
${document}
Format your response as JSON.
`
Error Handling & Fallbacks
AI systems can fail in many ways. Always have fallbacks:
import OpenAI from 'openai'

const openai = new OpenAI()

async function generateWithFallback(prompt: string) {
  try {
    // Try the primary model first
    return await openai.chat.completions.create(
      {
        model: 'gpt-4',
        messages: [{ role: 'user', content: prompt }],
      },
      { timeout: 30000 }, // per-request options like timeout go in the second argument
    )
  } catch (error) {
    if (error instanceof OpenAI.APIError) {
      if (error.code === 'rate_limit_exceeded') {
        // Fall back to a cheaper model with its own rate limits
        return await openai.chat.completions.create({
          model: 'gpt-3.5-turbo',
          messages: [{ role: 'user', content: prompt }],
        })
      }
      if (error.code === 'context_length_exceeded') {
        // Truncate and retry; the truncated prompt fits, so the recursion terminates
        const truncated = truncateToTokens(prompt, 8000)
        return await generateWithFallback(truncated)
      }
    }
    // Log and return a graceful failure
    await logError(error)
    return getDefaultResponse()
  }
}
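Transient failures such as network blips and 5xx responses deserve a retry with backoff before any model fallback kicks in. A minimal sketch, assuming retryable errors expose an HTTP status code:

// Retry a flaky call with exponential backoff and jitter. The
// status-code check is an assumption about which errors are retryable.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (error) {
      lastError = error
      const status = (error as { status?: number }).status
      const retryable = status === undefined || status === 429 || status >= 500
      if (!retryable || attempt === maxAttempts - 1) throw error
      // 1s, 2s, 4s... plus jitter to avoid thundering herds
      const delay = 1000 * 2 ** attempt + Math.random() * 250
      await new Promise((resolve) => setTimeout(resolve, delay))
    }
  }
  throw lastError
}

// Usage: const response = await withRetry(() => generateWithFallback(prompt))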
Rate Limiting & Cost Control
Implement safeguards to prevent runaway costs:
import { Ratelimit } from '@upstash/ratelimit'
import { Redis } from '@upstash/redis'
const ratelimit = new Ratelimit({
redis: Redis.fromEnv(),
limiter: Ratelimit.slidingWindow(10, '1 m'), // 10 requests per minute
})
export async function POST(req: Request) {
  const userId = await getUserId(req)
  const { prompt } = await req.json()

  // Check the rate limit before doing any expensive work
  const { success } = await ratelimit.limit(userId)
  if (!success) {
    return new Response('Rate limit exceeded', { status: 429 })
  }

  // Generate, then track token usage and cost per user
  const response = await generateCompletion(prompt)
  await trackUsage({
    userId,
    tokens: response.usage.total_tokens,
    cost: calculateCost(response.usage),
  })
  return Response.json(response)
}
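calculateCost is left abstract above. One plausible implementation multiplies token counts by per-model prices; the rates below are placeholders (pricing changes often), so load real prices from configuration:

// A sketch of per-request cost accounting. The prices are illustrative
// placeholders, not current rates.
const PRICE_PER_1K_TOKENS: Record<string, { input: number; output: number }> = {
  'gpt-4': { input: 0.03, output: 0.06 },
  'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
}

interface Usage {
  prompt_tokens: number
  completion_tokens: number
  total_tokens: number
}

function calculateCost(usage: Usage, model = 'gpt-4'): number {
  const price = PRICE_PER_1K_TOKENS[model]
  if (!price) return 0
  return (
    (usage.prompt_tokens / 1000) * price.input +
    (usage.completion_tokens / 1000) * price.output
  )
}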
Streaming Responses
For better UX, stream long responses:
import { OpenAIStream, StreamingTextResponse } from 'ai'
export async function POST(req: Request) {
  const { prompt } = await req.json()

  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    stream: true,
    messages: [{ role: 'user', content: prompt }],
  })

  let generationId: string | undefined
  const stream = OpenAIStream(response, {
    onStart: async () => {
      // Record that generation started so abandoned streams are traceable
      const generation = await db.generations.create({ status: 'started' })
      generationId = generation.id
    },
    onCompletion: async (completion) => {
      // Save the completed response against the same record
      await db.generations.update(generationId, {
        status: 'completed',
        response: completion,
      })
    },
  })
  return new StreamingTextResponse(stream)
}
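On the client, the stream can be consumed incrementally with the standard fetch reader API. A framework-agnostic sketch; the /api/generate path is a placeholder for wherever the route above is mounted:

// Read the streamed response chunk-by-chunk in the browser
async function streamCompletion(prompt: string, onToken: (text: string) => void) {
  const res = await fetch('/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  })
  if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`)

  const reader = res.body.getReader()
  const decoder = new TextDecoder()
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    onToken(decoder.decode(value, { stream: true }))
  }
}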
Caching & Performance
Cache expensive AI calls:
import { unstable_cache } from 'next/cache'
const getCachedCompletion = unstable_cache(
async (prompt: string) => {
return await openai.chat.completions.create({
model: 'gpt-4',
messages: [{ role: 'user', content: prompt }],
})
},
['ai-completion'],
{
revalidate: 3600, // Cache for 1 hour
tags: ['ai'],
}
)
// Use semantic caching for similar prompts
async function semanticCache(prompt: string) {
  const embedding = await getEmbedding(prompt)
  // 0.95 is a cosine-similarity threshold; tune it per use case
  const similar = await findSimilarPrompts(embedding, 0.95)
  if (similar) {
    return similar.response // Return the cached response
  }
  const response = await generateCompletion(prompt)
  await cacheWithEmbedding(prompt, embedding, response)
  return response
}
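findSimilarPrompts is effectively a nearest-neighbor lookup. Production systems delegate it to a vector database, but the core comparison is plain cosine similarity:

// Cosine similarity between two embedding vectors: 1 means identical
// direction, 0 means orthogonal. Vector stores compute this at scale.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vector dimensions must match')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}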
Quality Monitoring
Track AI output quality over time:
interface QualityMetrics {
responseTime: number
tokenCount: number
success: boolean
userFeedback?: 'positive' | 'negative'
errorType?: string
}
async function trackQuality(metrics: QualityMetrics) {
await analytics.track('ai_completion', {
...metrics,
timestamp: Date.now(),
})
// Alert if quality degrades
const recentSuccess = await getSuccessRate('1h')
if (recentSuccess < 0.95) {
await sendAlert('AI success rate below 95%')
}
}
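getSuccessRate is also left abstract above. A hedged sketch that computes it over a rolling in-memory window; a real system would query the analytics store instead:

// A rolling success-rate sketch, for illustration only
const recentResults: { success: boolean; at: number }[] = []

function recordResult(success: boolean) {
  recentResults.push({ success, at: Date.now() })
}

function getSuccessRateLocal(windowMs = 60 * 60 * 1000): number {
  const cutoff = Date.now() - windowMs
  const window = recentResults.filter((r) => r.at >= cutoff)
  if (window.length === 0) return 1 // no data: do not page anyone
  return window.filter((r) => r.success).length / window.length
}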
Handling Sensitive Data
Protect user privacy:
// PII detection and removal. Regexes are a first line of defense;
// a dedicated PII-detection service will catch far more patterns.
function sanitizeInput(text: string): string {
  return text
    .replace(/\b[\w.-]+@[\w.-]+\.\w{2,}\b/gi, '[EMAIL]')
    .replace(/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, '[PHONE]')
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, '[SSN]')
}

// Encrypted storage with automatic expiry
async function storePromptSecurely(prompt: string) {
  const encrypted = await encrypt(prompt)
  await db.prompts.create({
    data: {
      encrypted,
      // Auto-delete after 30 days
      expiresAt: new Date(Date.now() + 30 * 24 * 60 * 60 * 1000),
    },
  })
}
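Tying the two together, a sketch of the order of operations, assuming the generateCompletion helper from earlier sections:

// Sanitize before the prompt ever leaves your infrastructure, and
// store only the encrypted original
async function handleUserPrompt(rawPrompt: string) {
  const sanitized = sanitizeInput(rawPrompt)
  await storePromptSecurely(rawPrompt) // keep the original encrypted, never in plaintext
  return await generateCompletion(sanitized)
}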
Testing AI Systems
import { describe, it, expect } from 'vitest'
describe('AI Summarization', () => {
it('should handle empty input', async () => {
const result = await summarize('')
expect(result).toBe(null)
})
it('should respect token limits', async () => {
const longText = generateText(10000)
const result = await summarize(longText)
expect(countTokens(result)).toBeLessThan(500)
})
it('should extract key points', async () => {
  // Exact-substring assertions on model output can be brittle; pin the
  // model version or assert on parsed structure where possible
  const text = 'The company grew 50% YoY. Revenue was $10M.'
  const result = await summarize(text)
  expect(result).toContain('50%')
  expect(result).toContain('$10M')
})
})
Production Checklist
Before going live:
- Implement rate limiting per user/IP
- Add cost monitoring and alerts
- Set up error tracking (Sentry)
- Implement response caching
- Add content moderation (see the sketch after this list)
- Test edge cases thoroughly
- Document prompt templates
- Set up quality metrics dashboard
- Implement graceful fallbacks
- Add PII detection
- Load test with expected traffic
- Create runbook for incidents
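The content moderation item is worth making concrete. OpenAI exposes a dedicated moderation endpoint; a minimal sketch that gates generation on it (the rejection handling is an illustrative choice):

// Screen user input with the moderation endpoint before spending
// tokens on generation
async function moderatedCompletion(prompt: string) {
  const moderation = await openai.moderations.create({ input: prompt })
  if (moderation.results[0].flagged) {
    return { error: 'Input rejected by content moderation' }
  }
  return await generateCompletion(prompt)
}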
Common Pitfalls
1. Not Handling Timeouts
// Always set a timeout; in the openai v4 SDK, per-request options
// like timeout belong in the second argument, not the request body
const response = await openai.chat.completions.create(
  {
    model: 'gpt-4',
    messages: [...],
    max_tokens: 1000,
  },
  { timeout: 30000 }, // 30 second timeout
)
2. Ignoring Token Limits
// Check the token count before sending
import { encode, decode } from 'gpt-tokenizer'

function ensureTokenLimit(text: string, limit: number) {
  const tokens = encode(text)
  if (tokens.length > limit) {
    return decode(tokens.slice(0, limit))
  }
  return text
}
3. No Output Validation
// Validate the AI output's structure before trusting it
import { z } from 'zod'

const OutputSchema = z.object({
  summary: z.string().min(10).max(500),
  sentiment: z.enum(['positive', 'neutral', 'negative']),
  keyPoints: z.array(z.string()).min(1).max(5),
})

// Note: JSON.parse itself throws on malformed output, so wrap it too
const parsed = OutputSchema.safeParse(JSON.parse(aiResponse))
if (!parsed.success) {
  // Retry with clearer instructions (see the sketch below)
}
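That retry comment deserves a body. A hedged sketch that re-prompts with the validation errors appended, assuming a generateCompletion helper that returns the raw response text; the corrective wording is our own:

// Retry once, feeding the validation errors back to the model
async function generateValidated(prompt: string, maxAttempts = 2) {
  let lastErrors = ''
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const instructions =
      attempt === 0
        ? prompt
        : `${prompt}\n\nYour previous response was invalid: ${lastErrors}\nRespond only with JSON matching the required schema.`
    const raw = await generateCompletion(instructions)
    try {
      const parsed = OutputSchema.safeParse(JSON.parse(raw))
      if (parsed.success) return parsed.data
      lastErrors = parsed.error.message
    } catch {
      lastErrors = 'the response was not valid JSON'
    }
  }
  throw new Error('AI output failed validation after retries')
}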
Conclusion
Successful AI integration requires:
- Careful model selection based on requirements
- Robust error handling with fallbacks
- Cost and rate limiting safeguards
- Quality monitoring and alerting
- Privacy protection for sensitive data
- Comprehensive testing including edge cases
Start small, monitor closely, and iterate based on real usage patterns.
Need help integrating AI into your product? Let's talk.