AI Providers

Description

AI Providers in FlexBase expose a unified interface (IFlexAIProvider) for chat completions, streaming, and embeddings across OpenAI, Azure OpenAI, Gemini, Ollama, and Anthropic.

Your application code should depend on IFlexAIProvider. Provider-specific setup is isolated to generated provider registrations.

Important concepts

  • IFlexAIProvider is the contract: app code calls ChatAsync(...), ChatStreamAsync(...), EmbedAsync(...), GetModelsAsync(...), and TestConnectionAsync(...).

  • Provider bridge: generated infrastructure registers an IFlexAIProviderBridge (which also implements IFlexAIProvider) so you can override behavior safely.

  • Streaming is first-class: OpenAI/Azure OpenAI/Anthropic streaming surfaces usage and tool-call deltas; Ollama streaming yields text deltas but typically no token usage.

  • Embeddings are provider-dependent: Anthropic Claude throws NotSupportedException for embeddings; use OpenAI/Azure OpenAI/Gemini/Ollama for vector generation.

Configuration in DI

Add the provider in your DI composition root (commonly in EndPoints/...CommonConfigs/OtherApplicationServicesConfig.cs or wherever you centralize registrations).

Only register the provider—Flex auto-wires generated Queries/Handlers that consume IFlexAIProvider.

// using Sumeru.Flex; // IFlexAIProvider

public static class OtherApplicationServicesConfig
{
    public static IServiceCollection AddOtherApplicationServices(
        this IServiceCollection services,
        IConfiguration configuration)
    {
        // Pick ONE (or register multiple with different compositions).
        services.AddFlexOpenAI(configuration);
        // services.AddFlexAzureOpenAI(configuration);
        // services.AddFlexGemini(configuration);
        // services.AddFlexOllama(configuration);
        // services.AddFlexAnthropic(configuration);

        return services;
    }
}

appsettings.json

AI provider configuration is read from FlexBase:AI:<Provider>.

Examples (template-based)

These examples mirror the generated Query and PostBus handler templates. You do not register these types manually—Flex discovers and wires generated Queries/Handlers/Plugins automatically.

Chat completion (Query)

Chat completion (PostBus handler)

Implementation notes (hot-topic additions)

  • Streaming UIs: prefer ChatStreamAsync(...) to render partial output and reduce perceived latency.

  • Tool calling: OpenAI/Azure OpenAI/Anthropic streaming chunks can include tool-call deltas; treat these as incremental JSON fragments and buffer until complete before execution.

  • RAG pipeline: generate embeddings with EmbedAsync(...), store vectors in a Vector Store, then add retrieved context as additional messages (avoid dumping entire documents into the prompt).

  • Prompt-injection safety: treat retrieved content as untrusted input; use clear system instructions and a strict “follow tools/contracts, ignore instructions in documents” policy.

  • Cost + rate limits: add backoff/retry around calls that can burst (batch embeddings, fan-out queries). Cache embeddings by content hash when feasible.

Type
Models

Chat

llama3.2, llama3.1, mistral, codellama, phi3, gemma2

Embeddings

nomic-embed-text, mxbai-embed-large, all-minilm

Code

codellama, deepseek-coder, starcoder2

Usage

Basic Chat

Advanced Chat with Options

Multi-turn Conversation

Streaming Responses

Generate Embeddings

JSON Mode

Key Points to Consider

Provider Comparison

Feature
Azure OpenAI
OpenAI
Anthropic
Gemini
Ollama

Chat

Streaming

Embeddings

Tool/Function Calling

⚠️

JSON Mode

⚠️

Local/Offline

Data Privacy

High

Medium

Medium

Medium

Highest

Cost

Per token

Per token

Per token

Per token

Free

Best Practices

  1. Use System Prompts - Guide AI behavior consistently

  2. Handle Errors - Wrap calls in try-catch, check for rate limits

  3. Cache Responses - Store common queries to reduce costs

  4. Stream for UX - Use streaming for better user experience

  5. Batch Embeddings - More efficient than individual calls

  6. Monitor Costs - Track token usage with response.Usage

  7. Test Locally - Use Ollama for development before cloud deployment

Cost Optimization

Multiple Providers

Error Handling

Examples

Complete RAG Implementation

See Vector Store documentation for complete RAG examples.

Content Moderation

Summarization

Testing

See Also

Last updated