Operational architecture - AWS Prescriptive Guidance

Operational architecture

Solr and OpenSearch take different approaches to scaling, and each is optimized for distinct operational patterns and deployment scenarios. These differences reflect their design philosophies and target use cases in enterprise environments.

Scaling philosophy

Solr employs a scaling model that's centered on horizontal distribution through collection sharding, where data is partitioned across multiple nodes to distribute load and storage requirements. The Solr architecture maintains separate ingestion and query paths, which provide clear separation between data processing and retrieval operations.
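As a minimal sketch of this sharding model, the following Python snippet assembles the query parameters for a Solr Collections API CREATE call. The collection name, shard count, and replication factor are illustrative values, not recommendations.

```python
from urllib.parse import urlencode

# Sketch: parameters for the Solr Collections API CREATE action,
# which partitions a collection across shards in SolrCloud.
params = {
    "action": "CREATE",
    "name": "products",      # hypothetical collection name
    "numShards": 3,          # partition the index across 3 shards
    "replicationFactor": 2,  # keep 2 copies of each shard for availability
}

# The request would be issued against a running SolrCloud node, e.g.:
# GET http://localhost:8983/solr/admin/collections?<query>
query = urlencode(params)
print(query)
```

Each shard then holds a subset of the documents, so both storage and query load are spread across the nodes that host the shards.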

This approach positions Solr as a dedicated search service that's typically deployed as a specialized component within larger system architectures. The separation of concerns supports the targeted optimization of search functionality, and makes Solr particularly effective in environments where search performance is the primary concern.

OpenSearch implements a more dynamic scaling approach through specialized node roles, including data nodes, coordinator nodes, and master nodes, where each type of node is optimized for specific functions within the cluster. (OpenSearch also supports ingest nodes, which aren't yet supported in Amazon OpenSearch Service.) This node role-based architecture enables elastic scaling where different aspects of the system can be scaled independently based on workload demands. The platform is designed for horizontal scaling across these specialized nodes, allowing for granular resource allocation.
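To make the node-role model concrete, the sketch below maps hypothetical node names to the `node.roles` values each node would carry in its `opensearch.yml`. Note that an empty roles list designates a coordinating-only node; the node names and the exact role mix are assumptions for illustration.

```python
# Sketch: per-node role assignments for a cluster with dedicated roles.
node_roles = {
    "manager-1": ["cluster_manager"],  # dedicated cluster manager node
    "data-1": ["data", "ingest"],      # holds shards and runs ingest pipelines
    "coord-1": [],                     # empty list = coordinating-only node
}

# Render the node.roles line each node would carry in opensearch.yml.
for node, roles in node_roles.items():
    print(f'{node}: node.roles: [{", ".join(roles)}]')
```

Because each role is declared independently, you can add data nodes without adding cluster managers (and vice versa), which is what enables the independent scaling described above.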

The OpenSearch scaling model integrates naturally into multi-purpose data stacks that support diverse workloads beyond traditional search operations. This flexibility makes it particularly well-suited for rapid scaling in cloud environments, where resources can be dynamically allocated and deallocated based on demand. The elastic nature of the platform supports modern DevOps practices and cloud-native deployment patterns.

Both Solr and OpenSearch deliver high performance for search use cases, but their optimization strategies reflect different design priorities and target use cases.

Solr supports vector search capabilities through its DenseVectorField type and KnnVectorQuery functionality, and operates primarily as a self-managed solution. The Solr vector search implementation supports approximate nearest neighbor search but requires manual integration with external ML services for embedding generation. If you're running Solr on AWS, you would have to architect your own connections to Amazon SageMaker endpoints or other ML services to generate vectors, manage model versioning, and handle the operational complexity of maintaining both the search infrastructure and ML pipeline. Unlike the OpenSearch managed service approach, Solr deployments require significant operational overhead for scaling, patching, and integrating AI/ML workflows, which makes Solr less streamlined for modern vector search use cases within AWS.
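As a sketch of what that manual integration looks like on the query side, the snippet below assembles a Solr kNN query string. The field name (`vector`) and the embedding values are placeholders; in practice the embedding would come from an external ML service (for example, a SageMaker endpoint) that you wire up yourself.

```python
import json

# Toy 4-dimensional embedding; a real one would come from an ML model.
embedding = [0.12, -0.45, 0.91, 0.30]

# Solr's knn query parser takes the form:
#   {!knn f=<vector field> topK=<k>}<vector literal>
knn_query = f"{{!knn f=vector topK=10}}{json.dumps(embedding)}"
print(knn_query)
```

The resulting string is sent as the `q` parameter of an ordinary Solr query; everything upstream of it (embedding generation, model versioning) remains your responsibility.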

As a fully managed service for OpenSearch, Amazon OpenSearch Service provides AI/ML integration through its neural search capabilities and native vector engine. The service supports k-nearest neighbors (k-NN) search by using multiple algorithms, including Hierarchical Navigable Small World (HNSW), Inverted File Index (IVF), and brute force methods, which enable efficient similarity search across high-dimensional vector embeddings. Amazon OpenSearch Service integrates directly with Amazon SageMaker and Amazon Bedrock, so you can generate embeddings from text, images, or other data types by using pretrained large language models (LLMs) or custom ML models. The neural search plugin simplifies the ingestion-to-search pipeline by automatically vectorizing documents during indexing and queries during search time. OpenSearch also supports hybrid search approaches that combine traditional lexical search with semantic vector search by using score normalization and combination techniques. These hybrid searches provide more relevant results than either method used alone. 
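A minimal sketch of the k-NN setup follows: an index body that enables the vector engine and maps a field for HNSW-based similarity search. The field name, dimension, and engine choice are illustrative assumptions; with a client such as opensearch-py, this body would be passed to an index-creation call.

```python
import json

# Sketch: index body enabling the k-NN engine with an HNSW method.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "content_vector": {
                "type": "knn_vector",
                "dimension": 768,  # must match the embedding model's output size
                "method": {
                    "name": "hnsw",      # Hierarchical Navigable Small World
                    "space_type": "l2",  # Euclidean distance
                    "engine": "faiss",
                },
            }
        }
    },
}
print(json.dumps(index_body, indent=2))
```

With neural search configured, documents indexed into this field can be vectorized automatically at ingest time rather than in a separate pipeline.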

Plugin support

Solr provides a plugin architecture with extensibility across all core components. It supports custom request handlers, search components, update request processors, query parsers, tokenizers, and response writers through well-defined Java APIs. Solr modules include pre-built plugins for Learning to Rank (LTR), data import handlers, language detection, and clustering. You can deploy custom plugins by packaging them as JAR files and configuring them through solrconfig.xml or managed schemas. The plugin system lets you modify every stage in the request/response pipeline, from document indexing to query processing and result formatting. The extensibility of Solr supports its integration with external systems, implementation of custom scoring algorithms, and specialized text analysis chains for domain-specific requirements. This flexibility requires Java development expertise and careful version compatibility management during upgrades, but it provides extensive customization when you have search requirements that standard functionality cannot address.
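As a rough sketch of registering such a plugin, the payload below uses the Solr Config API to add a custom request handler whose class ships in a deployed JAR. The handler name and class are hypothetical.

```python
import json

# Sketch: Solr Config API payload registering a custom request handler.
payload = {
    "add-requesthandler": {
        "name": "/trending",                      # hypothetical endpoint path
        "class": "com.example.TrendingHandler",   # hypothetical class from your plugin JAR
        "defaults": {"rows": 10},                 # default request parameters
    }
}

# POSTed to /solr/<collection>/config on a running instance.
print(json.dumps(payload))
```

Changes made this way are persisted alongside solrconfig.xml, so the handler survives restarts without hand-editing configuration files.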

OpenSearch provides extensibility through APIs, ingest processors, and script processors by using the Painless scripting language. The ML Commons plugin enables model hosting and inference directly within OpenSearch clusters.
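A minimal sketch of the script-processor approach follows: an ingest pipeline definition whose Painless script derives a field at index time. The field names are illustrative assumptions.

```python
import json

# Sketch: ingest pipeline with a Painless script processor that
# concatenates two hypothetical fields into a new one.
pipeline = {
    "description": "Derive a full_name field at index time",
    "processors": [
        {
            "script": {
                "lang": "painless",
                "source": "ctx.full_name = ctx.first_name + ' ' + ctx.last_name",
            }
        }
    ],
}

# Registering it would be a PUT to _ingest/pipeline/<pipeline-id>.
print(json.dumps(pipeline))
```

Documents indexed with this pipeline attached get the derived field without any application-side preprocessing.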

As a fully managed service for OpenSearch, Amazon OpenSearch Service supports a set of plugins that are pre-installed and managed by AWS, including plugins for alerting, anomaly detection, asynchronous search, Index State Management (ISM), SQL and PPL query languages, and Performance Analyzer. The service restricts custom plugin installation to maintain security, stability, and compliance standards across the managed infrastructure. If you need custom functionality, you can submit a request for comments (RFC) to the OpenSearch project for evaluation and potential inclusion in future releases. This managed approach reduces operational burden but limits the ability to deploy proprietary or experimental plugins that organizations might develop independently.

Data ingestion

Solr provides data ingestion through the Data Import Handler (DIH) framework, update request processors, and streaming expressions for pipeline construction. DIH supports direct connections to relational databases through Java Database Connectivity (JDBC); note that DIH is deprecated in recent Solr releases and is maintained as a community package rather than in the Solr core. It runs SQL queries and transforms results into Solr documents without using intermediate ETL tools. Update request processors enable field manipulation, document cloning, language detection, and script-based transformations during indexing. Solr streaming expressions create computational graphs for aggregations, joins, and transformations across distributed collections. The platform accepts data through RESTful APIs in JSON, XML, and CSV formats, and SolrJ client libraries simplify application integration. Solr integrates with Apache NiFi for visual dataflow orchestration and with Apache Kafka through connectors that stream records directly into collections. The Tika integration extracts text and metadata from binary documents, including PDFs and Microsoft Office files, during ingestion.
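To illustrate streaming expressions, the sketch below composes one as a string: a `rollup` aggregation over a sorted `search` stream. The collection and field names are placeholders.

```python
# Sketch: composing a Solr streaming expression. The rollup operator
# requires its input stream to be sorted on the "over" field, so the
# inner search sorts by region and uses the /export handler for full
# result streaming.
search_expr = 'search(sales, q="*:*", fl="region,amount", sort="region asc", qt="/export")'
rollup_expr = f'rollup({search_expr}, over="region", sum(amount))'

# Sent to /solr/sales/stream?expr=<expression> on a SolrCloud cluster.
print(rollup_expr)
```

Because expressions compose by nesting, joins and transformations can be layered the same way to build the computational graphs described above.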

Amazon OpenSearch Service integrates with multiple AWS services for data ingestion without using traditional ETL processes. Amazon OpenSearch Ingestion (OSI) provides serverless, managed pipelines that automatically scale to handle variable data volumes from sources such as Amazon Simple Storage Service (Amazon S3), Amazon Kinesis Data Streams, Amazon DynamoDB, and Amazon Managed Streaming for Apache Kafka (Amazon MSK). These pipelines support data transformation, enrichment, and filtering by using processors such as grok, mutate, and date parsing before indexing. The service offers zero-ETL integration with Amazon S3 through direct querying capabilities, and allows federated searches across Amazon S3 data lakes without data movement. Amazon OpenSearch Service supports direct ingestion from Amazon CloudWatch Logs, AWS IoT Core, and application logs through Fluent Bit and Logstash integrations. Change data capture (CDC) from DynamoDB streams enables near real-time synchronization of database changes into OpenSearch indexes. Built-in connectors for Amazon Bedrock facilitate automatic embedding generation within the ingestion pipeline and eliminate separate vectorization workflows for AI-powered search applications.
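As a rough sketch of what an OSI pipeline definition looks like, the snippet below holds one as a YAML string following the Data Prepper conventions that OSI uses (source, processor, sink). The pipeline name, endpoint, and index are placeholders, and exact keys vary by source type, so check the service documentation before use; with boto3, a body like this would be passed to the OSIS CreatePipeline API.

```python
# Sketch: an OpenSearch Ingestion pipeline body. All names and
# endpoints below are hypothetical placeholders.
pipeline_body = """\
log-pipeline:
  source:
    http:
      path: "/logs"
  processor:
    - grok:
        match:
          message: ['%{COMMONAPACHELOG}']
  sink:
    - opensearch:
        hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
        index: "app-logs"
"""
print(pipeline_body)
```

The grok processor here parses raw log lines before indexing, which is the kind of in-pipeline transformation and enrichment the paragraph above describes.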