Development/Search Architecture

From MusicBrainz Wiki

Revision as of 07:53, 19 January 2021

Data flow

Indexing

dataflow-reindex.png


  1. When MusicBrainz updates its database,
  2. PostgreSQL triggers queue reindex messages in RabbitMQ;
  3. These are pulled from RabbitMQ by SIR,
  4. which then gathers data to be indexed from the database,
  5. and finally builds searchable documents and sends these to the Solr search server.
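The five steps above can be sketched in a few lines of Python. This is only an illustration of the data flow, not SIR's actual code: the queue, the database, and the Solr server are replaced by in-memory stand-ins, and every name below (`DATABASE`, `reindex_queue`, `solr_cores`, `on_database_update`, `build_document`, `process_pending_messages`) is hypothetical.

```python
import queue

# Stand-in for the PostgreSQL database: entity type -> id -> row (hypothetical data).
DATABASE = {"artist": {1: {"name": "Example Artist", "aliases": ["Example"]}}}

# Stand-in for the RabbitMQ queue that the PostgreSQL triggers feed.
reindex_queue = queue.Queue()

# Stand-in for the Solr search server: core name -> list of indexed documents.
solr_cores = {"artist": []}


def on_database_update(entity_type, entity_id):
    """Steps 1-2: after an update, a trigger queues a reindex message."""
    reindex_queue.put({"entity_type": entity_type, "id": entity_id})


def build_document(entity_type, entity_id, row):
    """Step 5 (first half): turn gathered data into a searchable document."""
    return {
        "id": entity_id,
        "type": entity_type,
        "name": row["name"],
        "alias": list(row["aliases"]),
    }


def process_pending_messages():
    """Steps 3-5: pull messages, gather data, build documents, send them."""
    sent = 0
    while not reindex_queue.empty():
        message = reindex_queue.get()
        # Step 4: gather the data to be indexed from the database.
        row = DATABASE[message["entity_type"]][message["id"]]
        document = build_document(message["entity_type"], message["id"], row)
        # Step 5 (second half): send the document to the matching core.
        solr_cores[message["entity_type"]].append(document)
        sent += 1
    return sent
```

Calling `on_database_update("artist", 1)` followed by `process_pending_messages()` leaves one document in `solr_cores["artist"]`. In the real system each stand-in is a separate service, so the steps run asynchronously rather than in one process.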

Searching

dataflow-search.png

Search can be accessed by website visitors, by editors, or by users of MusicBrainz API clients such as MusicBrainz Picard:

  • Search webpage (GET at https://musicbrainz.org/search):
    • Search form (POST to musicbrainz.org/search?query=…):
      This form is usually accessed from the search field in the website's top navigation bar.
    • Tag lookup form (POST to musicbrainz.org/taglookup?…):
      This form is available from the above Search webpage but makes more specific queries.
      It is probably a legacy from before advanced indexed search was available.
    • Other lookup forms (POST to musicbrainz.org/otherlookup/…?…):
      The same comments as for the tag lookup form apply.
  • Field completion (POST to musicbrainz.org/ws/js/…?query=…):
    Fields that match MusicBrainz entities (for example, the area of an artist)
    have an autocompletion feature that makes search queries behind the scenes.
  • API search query (POST to musicbrainz.org/ws/2/…?query=…):
    This kind of query is made by API clients such as MusicBrainz Picard.
    See “MusicBrainz_API/Search” for client developer documentation.

There are three search modes:

  • Direct database search: This is the legacy method, which searches the database directly using PostgreSQL. It is currently kept as a fallback for when indexed search is not working. It has limited capabilities (no searchable fields, name search only, etc.).
  • Indexed search: This is the simplest plain search mode, using Solr. It searches through both accented and unaccented names, aliases, and more. See the request-params.xml files in the mbsssss repository.
  • Advanced indexed search: This is the most versatile search mode, using Solr. It allows searching through specific fields using the Lucene query syntax. See the schema.xml files in the mbsssss repository. See also “Indexed_Search_Syntax” for user documentation.
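To illustrate what an advanced indexed search query looks like, here is a minimal sketch that assembles a Lucene-style `field:"value"` query string. The helper names are hypothetical, and the field names used in the usage note (`artist`, `country`) are only illustrative assumptions; the fields actually available for each entity are those defined in the schema.xml files mentioned above.

```python
def quote_phrase(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_advanced_query(fields, operator="AND"):
    """Join field:"value" clauses with a Lucene boolean operator.

    `fields` maps a searchable field name to the phrase to look for.
    Clause order follows the insertion order of the mapping.
    """
    clauses = [f"{name}:{quote_phrase(value)}" for name, value in fields.items()]
    return f" {operator} ".join(clauses)
```

For example, `build_advanced_query({"artist": "Fred", "country": "GB"})` produces `artist:"Fred" AND country:"GB"`. Quoting every value as a phrase keeps the escaping rules small, since only backslashes and double quotes need escaping inside a quoted phrase.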

These modes are used or are made available as follows:

Search access         Direct database search  (Simple) Indexed search  Advanced indexed search
Search webpage        Yes                     Yes (default)            Yes
Tag lookup webpage    No                      No                       Yes (limited)
Other lookup webpage                                                   Yes (limited)
Field autocompletion  Yes                     Yes (default)            No
API search query      No                      Yes                      Yes (default)
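The availability table above can also be written down as a small lookup structure. The sketch below is a plain transcription of the table, not code from MusicBrainz Server; all names are hypothetical. The `pick_mode` helper additionally assumes that falling back means switching to direct database search wherever the table marks it as available, which is an interpretation of the fallback remark in the list of search modes.

```python
DIRECT, INDEXED, ADVANCED = "direct", "indexed", "advanced"

# One row per search access; None marks a cell that the table leaves empty.
MODES = {
    "search webpage": {DIRECT: "yes", INDEXED: "default", ADVANCED: "yes"},
    "tag lookup webpage": {DIRECT: "no", INDEXED: "no", ADVANCED: "limited"},
    "other lookup webpage": {DIRECT: None, INDEXED: None, ADVANCED: "limited"},
    "field autocompletion": {DIRECT: "yes", INDEXED: "default", ADVANCED: "no"},
    "api search query": {DIRECT: "no", INDEXED: "yes", ADVANCED: "default"},
}


def available_modes(access):
    """Return the modes the table marks as usable for this access."""
    usable = ("yes", "default", "limited")
    return [mode for mode, status in MODES[access].items() if status in usable]


def default_mode(access):
    """Return the mode marked as default, or None if the table marks none."""
    for mode, status in MODES[access].items():
        if status == "default":
            return mode
    return None


def pick_mode(access, indexed_search_works=True):
    """Assumed policy: use the default, else fall back to direct search."""
    if indexed_search_works:
        return default_mode(access)
    return DIRECT if MODES[access][DIRECT] == "yes" else None
```

With this transcription, `default_mode("api search query")` returns `"advanced"`, and `pick_mode("field autocompletion", indexed_search_works=False)` returns `"direct"`.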

Searching from MusicBrainz mirror without local Solr server

dataflow-search-remote.png

MusicBrainz mirrors can be set up with or without a local Solr server, depending on the resources available. When a mirror runs its own local Solr server, search works as described in the above section. When it does not, it relies instead on the remote search.musicbrainz.org server.
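The choice between the two setups comes down to which search server a mirror sends its queries to. The following sketch shows that decision only; `MirrorConfig` and `local_solr_host` are hypothetical names and do not correspond to actual MusicBrainz Server configuration settings.

```python
from dataclasses import dataclass
from typing import Optional

# Remote search server named in the text above.
REMOTE_SEARCH_SERVER = "search.musicbrainz.org"


@dataclass
class MirrorConfig:
    # Host of the mirror's own Solr server, or None when it runs none
    # (hypothetical setting, for illustration only).
    local_solr_host: Optional[str] = None

    def search_server(self):
        """Return the local Solr host if one is set, else the remote server."""
        return self.local_solr_host or REMOTE_SEARCH_SERVER
```

A mirror created as `MirrorConfig()` would search through `search.musicbrainz.org`, while `MirrorConfig(local_solr_host="localhost:8983")` would use its own server (8983 being Solr's usual default port, assumed here).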

Components

dependencies-search.png

Services explained in the above section about data flows are composed of many components maintained across different repositories.

Here is a complete list of the components, with their repositories, that make indexed search work:

  • MusicBrainz database schema: See admin/sql/ directory in musicbrainz-server repository.
    It defines how data is stored in PostgreSQL by MusicBrainz Server.
    • Python bindings for the above MB DB schema: See SQLAlchemy Models in lalinsky/mbdata repository.
  • MusicBrainz XML metadata schema: See schema/ directory in mmd-schema repository.
    It defines data returned by MusicBrainz API which is handled by MusicBrainz Server for lookup and browse queries, and by Solr search server for search queries.
    • Java bindings for the above MB RELAX NG schema: See brainz-mmd2-jaxb/ directory in the same mmd-schema repository.
    • Python bindings for the above MB RELAX NG schema: See Python package in distinct mb-rngpy repository.
  • MusicBrainz Solr search schema: See cores defined in mbsssss repository.
    It mainly defines how searchable documents are structured and searched, that is, almost everything about searchable fields.
  • MusicBrainz Solr query response writer: See mb-solr directory in mb-solr repository.
    It defines how search results are formatted, using the Java bindings for the above MB RELAX NG schema.
  • MusicBrainz Solr standalone server: See the Dockerfile in the same mb-solr repository.
  • MusicBrainz Solr cloud deployment: See deployment scripts in private mb-solr-cloud repository.
  • Search index rebuilder (SIR): See sir repository and sir documentation.
    It uses both of the Python bindings above: the one for the MB DB schema and the one for the MB RELAX NG schema.
    It also uses pysolr to communicate with the Solr server and must comply with the MusicBrainz Solr search schema.
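The relationships stated in the list above can be summarized as a small dependency map. This is a reading aid built only from those statements, using short labels chosen here for illustration; it treats "must comply with" as a dependency, which is an assumption.

```python
# component -> components it directly builds on, as stated in the list above.
DEPENDS_ON = {
    "mbdata": {"musicbrainz-server database schema"},
    "brainz-mmd2-jaxb": {"mmd-schema"},
    "mb-rngpy": {"mmd-schema"},
    "mb-solr response writer": {"brainz-mmd2-jaxb"},
    "sir": {"mbdata", "mb-rngpy", "pysolr", "mbsssss"},
}


def transitive_dependencies(component):
    """Return everything a component relies on, directly or indirectly."""
    seen = set()
    stack = [component]
    while stack:
        for dependency in DEPENDS_ON.get(stack.pop(), ()):
            if dependency not in seen:
                seen.add(dependency)
                stack.append(dependency)
    return seen
```

For instance, `transitive_dependencies("sir")` includes both `"mmd-schema"` and `"musicbrainz-server database schema"`, which reflects why a change to either schema can affect the search index rebuilder.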