Refactoring Blog Search from Pinecone to Claude

How we replaced vector search with AI-powered semantic search using Claude via AWS Bedrock

When I initially implemented search for this blog, I chose Pinecone's vector database for its promise of semantic search capabilities. After running it for several months, I decided to refactor the entire search implementation to use Claude via AWS Bedrock instead. This post walks through the technical journey of that migration.

The Original Architecture

The Pinecone implementation followed a standard vector search pattern. During deployment, a GitHub Actions workflow would read all blog posts, generate embeddings using Pinecone's multilingual-e5-large model, and store them in a Pinecone index. When users searched, the Lambda function would generate an embedding for their query, search the vector database, and use Pinecone's reranking model to improve results.
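For context, the old search Lambda boiled down to three calls against Pinecone. The sketch below is illustrative only: the index name is made up, and the embed and rerank calls are assumptions about the Pinecone TypeScript SDK rather than the exact code that ran.

import { Pinecone } from "@pinecone-database/pinecone";

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });

// Illustrative sketch of the old flow; signatures are assumptions, not the original code.
async function searchWithPinecone(query: string) {
  // 1. Embed the user's query with the same model used at indexing time
  const embeddings = await pc.inference.embed("multilingual-e5-large", [query], {
    inputType: "query",
  });

  // 2. Nearest-neighbour search against the vector index ("blog" is a made-up name)
  const { matches } = await pc.index("blog").query({
    vector: embeddings.data[0].values ?? [],
    topK: 10,
    includeMetadata: true,
  });

  // 3. Rerank the candidates against the raw query text
  return pc.inference.rerank(
    "bge-reranker-v2-m3",
    query,
    matches.map((m) => ({ id: m.id, text: String(m.metadata?.text ?? "") })),
  );
}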

The architecture looked like this:

Search path: Frontend → CloudFront → Edge Lambda → API Gateway → Search Lambda → Pinecone

Build process: GitHub Actions → Generate Embeddings → Pinecone index

This worked well enough, but came with some challenges. The deployment process required managing embeddings and keeping them synchronized with content changes. Most importantly, while vector search found semantically similar content, it sometimes missed obvious matches that a more context-aware system would catch.

The New Approach

The refactored architecture simplifies the search flow while leveraging Claude's natural language understanding:

Search path: Frontend → CloudFront → Edge Lambda → API Gateway → Search Lambda → S3 corpus + Bedrock Claude

Build process: GitHub Actions → Deploy CDK → S3 corpus

Instead of maintaining vector embeddings, the system now generates a single JSON file containing all blog content during the deployment process. The Next.js build process (via yarn build) generates the static site, RSS feeds, and llms.txt files for each post. Then during CDK deployment, the bundling process generates the search corpus and deploys everything to S3. When a search request comes in, the Lambda function fetches this corpus (with 5-minute caching), constructs a prompt for Claude, and returns the most relevant results.

Corpus Generation

The blog corpus generation happens during the CDK deployment's bundling process. When CDK bundles the website assets, it runs a TypeScript script that reads the Velite-generated blog data and creates a structured JSON file:

import { existsSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Shape of the entries in the Velite-generated blogs.json (only the fields used below)
interface VeliteBlogPost {
  slug: string;
  title: string;
  description: string;
  body: string;
  date: string;
  tags?: string[];
  categories?: string[];
  author: string;
  published: boolean;
}

interface BlogCorpusEntry {
  id: string;
  title: string;
  description: string;
  content: string;
  date: string;
  tags: string[];
  categories: string[];
  author: string;
}
 
export function generateBlogCorpus(outputPath: string): void {
  try {
    // Read the Velite-generated blogs.json
    const blogsPath = join(process.cwd(), "../web/.velite/blogs.json");
 
    if (!existsSync(blogsPath)) {
      console.error(`Blog data not found at ${blogsPath}`);
      console.log("Make sure you've built the web app first with 'yarn build'");
      return;
    }
 
    const blogsJson = readFileSync(blogsPath, "utf-8");
    const blogs: VeliteBlogPost[] = JSON.parse(blogsJson);
 
    // Filter published posts and transform to corpus format
    const corpus: BlogCorpusEntry[] = blogs
      .filter((post) => post.published)
      .map((post) => ({
        id: post.slug,
        title: post.title,
        description: post.description,
        content: post.body, // Full content for best search results
        date: post.date,
        tags: post.tags || [],
        categories: post.categories || [],
        author: post.author,
      }));
 
    // Write corpus to output file
    writeFileSync(outputPath, JSON.stringify(corpus, null, 2));
    console.log(
      `Blog corpus generated with ${corpus.length} posts at ${outputPath}`,
    );
 
    // Log size for monitoring
    const sizeInMB = (
      Buffer.byteLength(JSON.stringify(corpus)) /
      1024 /
      1024
    ).toFixed(2);
    console.log(`Corpus size: ${sizeInMB} MB`);
  } catch (error) {
    console.error("Error generating blog corpus:", error);
    throw error;
  }
}

Then, during the CDK deployment, generateBlogCorpus runs as part of the websiteSource bundling. Instead of running separate scripts to create indexes and generate embeddings, the corpus generation happens during local bundling:

import { execSync } from "node:child_process";
import * as path from "node:path";
import { DockerImage } from "aws-cdk-lib";
import { Source } from "aws-cdk-lib/aws-s3-deployment";
import { generateBlogCorpus } from "./generate-blog-corpus"; // illustrative path

// webOutPath points at the Next.js export produced by `yarn build` (defined elsewhere in the stack)
const websiteSource = Source.asset(webOutPath, {
  bundling: {
    // Required fallback image; the local bundler below does the actual work
    image: DockerImage.fromRegistry("alpine"),
    local: {
      tryBundle(outputDir: string) {
        // Copy the Next.js build output
        execSync(`cp -r ${webOutPath}/* ${outputDir}/`, {
          stdio: "inherit",
        });
 
        // Generate the blog corpus
        const corpusPath = path.join(outputDir, "blog-corpus.json");
        generateBlogCorpus(corpusPath);
 
        return true;
      },
    },
  },
});

This approach ensures that the blog corpus is always synchronized with the deployed content, as it's generated fresh during each deployment.

Search Lambda

The search Lambda underwent a complete rewrite. Instead of managing Pinecone clients and generating embeddings, it now focuses on prompt engineering and Claude integration:

import {
  BedrockRuntimeClient,
  InvokeModelCommand,
} from "@aws-sdk/client-bedrock-runtime";

const bedrockClient = new BedrockRuntimeClient({});

// Result shape returned to the frontend; mirrors the format requested in the prompt
interface SearchResult {
  id: string;
  text: string;
  score: number;
}

async function searchWithClaude(
  query: string,
  corpus: BlogCorpusEntry[],
): Promise<SearchResult[]> {
  const prompt = `You are a search engine for a technical blog. Given the following blog posts and a search query, return the 5 most relevant blog posts.
 
Blog posts:
${JSON.stringify(
  corpus.map((post) => ({
    id: post.id,
    title: post.title,
    description: post.description,
    tags: post.tags,
    categories: post.categories,
    content: post.content.substring(0, 500),
  })),
  null,
  2,
)}
 
Search query: "${query}"
 
Return ONLY a JSON array with the 5 most relevant blog posts in this exact format, with no additional text:
[
  {
    "id": "blog-post-slug",
    "text": "Title - Description",
    "score": 0.95
  }
]`;
 
  const command = new InvokeModelCommand({
    modelId: "us.anthropic.claude-3-5-haiku-20241022-v1:0",
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify({
      anthropic_version: "bedrock-2023-05-31",
      max_tokens: 1000,
      messages: [
        {
          role: "user",
          content: [{ type: "text", text: prompt }],
        },
      ],
      temperature: 0,
      top_p: 0.999,
    }),
  });
 
  const response = await bedrockClient.send(command);
  // Parse and return results...
}
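The parsing step is straightforward: Bedrock returns a JSON body in the Anthropic Messages format, and the first content block holds Claude's text response, which the prompt constrains to be a JSON array. A minimal sketch of that step (error handling kept deliberately simple; the real Lambda may handle failures differently):

// Sketch of parsing the Bedrock response body into SearchResult[]
function parseClaudeResults(responseBody: Uint8Array): SearchResult[] {
  // The response body is UTF-8 JSON in the Anthropic Messages format
  const payload = JSON.parse(new TextDecoder().decode(responseBody));

  // Claude's answer is the text of the first content block
  const text: string = payload.content?.[0]?.text ?? "[]";

  try {
    return JSON.parse(text) as SearchResult[];
  } catch {
    // If Claude returned anything other than pure JSON, fail gracefully
    console.error("Failed to parse Claude response:", text);
    return [];
  }
}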

The Lambda caches the blog corpus in memory for 5 minutes to avoid repeated S3 fetches. This provides a good balance between performance and ensuring fresh content after deployments.
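The cache itself is just module-level state in the Lambda, so it survives across warm invocations. A minimal sketch, assuming the corpus is fetched with the AWS SDK's GetObjectCommand (variable names here are illustrative):

import { GetObjectCommand, S3Client } from "@aws-sdk/client-s3";

const s3Client = new S3Client({});
const CACHE_TTL_MS = 5 * 60 * 1000; // 5 minutes

let cachedCorpus: BlogCorpusEntry[] | null = null;
let cachedAt = 0;

async function getCorpus(): Promise<BlogCorpusEntry[]> {
  // Reuse the in-memory copy while it is still fresh
  if (cachedCorpus && Date.now() - cachedAt < CACHE_TTL_MS) {
    return cachedCorpus;
  }

  // Otherwise fetch blog-corpus.json from the website bucket
  const response = await s3Client.send(
    new GetObjectCommand({
      Bucket: process.env.WEBSITE_BUCKET_NAME,
      Key: "blog-corpus.json",
    }),
  );

  const corpus: BlogCorpusEntry[] = JSON.parse(
    await response.Body!.transformToString(),
  );
  cachedCorpus = corpus;
  cachedAt = Date.now();
  return corpus;
}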

Here's the complete search flow:

1. The user enters a search query in the frontend.
2. The frontend sends an API request with the query to the search Lambda.
3. If the corpus is not cached, the Lambda fetches blog-corpus.json from S3 and caches it with a 5-minute TTL.
4. The Lambda sends Claude a prompt containing the corpus and the query.
5. Claude returns ranked results.
6. The Lambda returns the search results to the frontend, which displays them.

Frontend Enhancements

The refactoring provided an opportunity to improve the search user experience. The original implementation simply displayed a loading spinner, which felt inadequate for searches that now take 2-3 seconds instead of being nearly instant.

The new search component implements several enhancements:

import { useState } from "react";

type SearchState = "idle" | "typing" | "searching" | "complete" | "error";

export function Search() {
  const [searchState, setSearchState] = useState<SearchState>("idle");
  const [cachedResults, setCachedResults] = useState<
    Record<string, SearchResult[]>
  >({});
  const { recentSearches, addRecentSearch } = useRecentSearches();
 
  // Show "Searching with AI..." during search
  // Display recent searches in dropdown
  // Cache results for instant repeated searches
  // Progressive result display
}

Recent searches are stored in localStorage and displayed in a dropdown, making it easy to revisit previous queries. Results are cached in memory during the session, so clicking on a recent search displays results instantly without another API call.
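A minimal sketch of what such a hook can look like (the hook name comes from the component above; the storage key and the cap of ten entries are assumptions):

import { useState } from "react";

const STORAGE_KEY = "recent-searches"; // illustrative key
const MAX_RECENT = 10;

export function useRecentSearches() {
  const [recentSearches, setRecentSearches] = useState<string[]>(() => {
    try {
      return JSON.parse(localStorage.getItem(STORAGE_KEY) ?? "[]");
    } catch {
      return [];
    }
  });

  const addRecentSearch = (query: string) => {
    // Most recent first, no duplicates, capped at MAX_RECENT entries
    const updated = [query, ...recentSearches.filter((q) => q !== query)].slice(
      0,
      MAX_RECENT,
    );
    setRecentSearches(updated);
    localStorage.setItem(STORAGE_KEY, JSON.stringify(updated));
  };

  return { recentSearches, addRecentSearch };
}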

The search state management provides clear feedback throughout the process. Users see when they're typing, when the search is processing, and when results are ready. If searches take longer than expected, the interface communicates this rather than leaving users wondering if something is broken.

Infrastructure Changes

The CDK infrastructure required several modifications. The search construct no longer needs Pinecone secrets or API keys, but it does require read access to the website S3 bucket and permission to call InvokeModel on Amazon Bedrock.

import { join } from "node:path";
import { Duration, Stack } from "aws-cdk-lib";
import { PolicyStatement } from "aws-cdk-lib/aws-iam";
import { Runtime } from "aws-cdk-lib/aws-lambda";
import { NodejsFunction } from "aws-cdk-lib/aws-lambda-nodejs";
import { IBucket } from "aws-cdk-lib/aws-s3";
import { Construct } from "constructs";

export interface SearchConstructProps {
  websiteBucket: IBucket;
}

export class SearchConstruct extends Construct {
  constructor(scope: Construct, id: string, props: SearchConstructProps) {
    super(scope, id);
 
    const searchFunction = new NodejsFunction(this, "SearchFunction", {
      runtime: Runtime.NODEJS_22_X,
      handler: "handler",
      timeout: Duration.seconds(30),
      entry: join(__dirname, "../lambda/search/index.ts"),
      environment: {
        WEBSITE_BUCKET_NAME: props.websiteBucket.bucketName,
      },
    });
 
    // Grant S3 read permissions
    props.websiteBucket.grantRead(searchFunction);
 
    // Add Bedrock permissions
    searchFunction.addToRolePolicy(
      new PolicyStatement({
        actions: ["bedrock:InvokeModel"],
        resources: [
          `arn:aws:bedrock:${Stack.of(this).region}::foundation-model/*`,
        ],
      }),
    );
  }
}

Performance and Cost Analysis

The shift from Pinecone to Claude brought interesting tradeoffs. Search latency increased from near-instant to 2-3 seconds, but this feels acceptable given the improved search quality. The loading states and recent searches help mitigate the perception of slowness.

The search quality improved noticeably. Claude understands context and intent better than pure vector similarity. Searches for "CDK deployment" now surface posts about CDK pipelines, CDK constructs, and deployment strategies in a more intuitive order. The AI can infer relationships between concepts that vector embeddings might miss.

Conclusion

This refactoring taught me several valuable lessons about choosing the right tool for the job. Vector databases excel at scale and when millisecond latency matters, but for smaller applications, the complexity and cost might not justify their use.

The importance of user experience during longer operations became clear. The original implementation assumed fast responses, while the new one embraces the slower speed with better feedback and caching strategies.

The simplified deployment process reduces operational overhead significantly. No more managing embeddings, synchronizing vector databases, or handling index migrations. The entire search system rebuilds automatically with each deployment.

While the search is significantly slower, the results are dramatically better. Combined with not having to worry about embeddings, this makes the switch from a vector database to LLM-powered search a big win.