Building a Full-Text Search Application with Docker and Elasticsearch

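Adding fast, flexible full-text search to an application is not easy for anyone. Many mainstream databases, such as PostgreSQL and MongoDB, offer only basic text search because of the limits of their query and index structures. Delivering efficient full-text search usually calls for a separate datastore. Elasticsearch is exactly that: an open-source datastore built for flexible, fast full-text search.

This article uses Docker to set up the project's dependencies. Docker is currently the most common containerization engine, used by companies such as Uber, Spotify, ADP, and PayPal. Its main advantage is that it is independent of the operating system and runs on Windows, macOS, and Linux, which makes writing a tutorial like this one much easier. If you have never used Docker, that is fine: the full configuration files are provided below.

We will also use Node.js (with the Koa framework) and Vue.js to build the search API and the frontend web app, respectively.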

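1. What is Elasticsearch

Full-text search is a heavily requested feature in modern applications. It is also one of the hardest to do well: many popular websites have subpar search that either returns slowly or returns inaccurate results. Much of the blame lies with the underlying database. Many standard relational databases offer only basic string matching, with limited support for CONTAINS or LIKE SQL queries.

The search application built in this article should be:

Fast: results should come back in near real time to keep the user experience smooth.
Flexible: the search process can be tuned to different data and use cases.
Forgiving: for mistyped input, it returns the most likely intended results.
Full-text: beyond keywords and tags, it can search all of the matching text.

An application that meets these requirements is best served by a database optimized for full-text search, which is why this article uses Elasticsearch. Elasticsearch is an open-source in-memory datastore written in Java and originally built on the Apache Lucene library. Here are some of the use cases listed in the official documentation: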

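Wikipedia uses Elasticsearch for full-text search with highlighted snippets, search-as-you-type, and did-you-mean suggestions.
The Guardian uses Elasticsearch to combine visitor and social data and feed it back to its writers.
Stack Overflow combines full-text search with location information and more-like-this queries to find related questions and answers.
GitHub uses Elasticsearch to search 130 billion lines of code.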

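What Makes Elasticsearch Unique

At its core, Elasticsearch provides fast, flexible full-text search by using inverted indices.

An "index" is a data structure that lets a database look up and return data quickly. Databases typically index a data field together with the location of the matching rows in the table. By keeping that index information in a searchable data structure (often a B-Tree), the database can answer optimized queries, for example "Find the row with ID=5", in sub-linear time.

Think of a database index as a school library's card catalog: as long as you know the title and author, it tells you exactly where to find the entry you are looking for. Database tables usually have several indices to speed up queries (for example, an index on the name column greatly speeds up lookups by a specific name).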
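To make the benefit of an index concrete, here is a minimal sketch of an indexed lookup. It uses a sorted array with binary search as a simplified stand-in for a B-Tree; the sample rows and the names `buildIdIndex` and `findRowById` are assumptions made up for illustration, not part of any database API.

```javascript
// Hypothetical table rows; in a real database these would live on disk.
const rows = [
  { id: 9, name: 'Odysseus' },
  { id: 2, name: 'Juliet' },
  { id: 5, name: 'Phileas Fogg' },
  { id: 7, name: 'Marlow' }
]

// Build a sorted list of (key, row position) pairs: a simplified stand-in for a B-Tree.
function buildIdIndex (table) {
  return table
    .map((row, position) => ({ key: row.id, position }))
    .sort((a, b) => a.key - b.key)
}

// Binary search over the sorted index: O(log n) comparisons instead of scanning every row.
function findRowById (table, index, id) {
  let low = 0
  let high = index.length - 1
  while (low <= high) {
    const mid = Math.floor((low + high) / 2)
    if (index[mid].key === id) return table[index[mid].position]
    if (index[mid].key < id) low = mid + 1
    else high = mid - 1
  }
  return null // No row with this ID
}

const idIndex = buildIdIndex(rows)
console.log(findRowById(rows, idIndex, 5)) // { id: 5, name: 'Phileas Fogg' }
```

The index stores only keys and positions, so the table itself can stay in insertion order while lookups by ID remain fast.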

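An inverted index works in a fundamentally different way. The content of each row (or document) is split up, and each individual entry (in this case, each word) points back to the documents that contain it.

This structure makes it very fast to answer a question such as which documents contain the word "football". By using a memory-optimized inverted index, Elasticsearch supports powerful, customizable full-text search.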
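The idea can be sketched in a few lines of plain JavaScript. This is only a toy illustration under simple assumptions (whitespace and punctuation tokenization, made-up sample documents, hypothetical function names); the real analyzers and index structures in Elasticsearch are far more sophisticated.

```javascript
// Three tiny "documents", keyed by ID (sample data made up for illustration).
const documents = {
  1: 'I love football',
  2: 'Football is played with a ball',
  3: 'She plays tennis'
}

// Split text into lowercase word tokens. Real analyzers do much more (stemming, stop words).
function tokenize (text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || []
}

// Map each token to the set of document IDs that contain it.
function buildInvertedIndex (docs) {
  const index = new Map()
  for (const [id, text] of Object.entries(docs)) {
    for (const token of tokenize(text)) {
      if (!index.has(token)) index.set(token, new Set())
      index.get(token).add(Number(id))
    }
  }
  return index
}

// Look up a single word: one map access, no matter how many documents exist.
function lookup (index, word) {
  return [...(index.get(word.toLowerCase()) || [])]
}

const invertedIndex = buildInvertedIndex(documents)
console.log(lookup(invertedIndex, 'football')) // [ 1, 2 ]
```

The work of splitting the text happens once, at indexing time, which is what makes each later query cheap.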

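2. Project Setup

2.0 Docker

This project uses Docker as its development environment. Docker is a containerization engine that runs applications in isolated environments, independent of the local operating system and development setup. Because of the flexibility and customizability this brings, many internet companies now run their applications in containers.

For me as the author, Docker provides a consistent setup across platforms (it runs on Windows, macOS, and Linux). Node.js, Elasticsearch, and Nginx would normally each need their own installation steps; with Docker we only need to define a few configuration files, and the stack runs in any Docker environment. And because each application runs in its own isolated container, with little dependence on the host machine, the "but it works on my machine" kind of debugging problem rarely comes up.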

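2.1 Install Docker and Docker Compose

This project needs only Docker and Docker Compose. The latter is an official Docker tool for defining and orchestrating multiple container configurations as a single application stack.

Install Docker: https://docs.docker.com/engine/installation/
Install Docker Compose: https://docs.docker.com/compose/install/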

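2.2 Set Up the Project Directories

Create a root directory for the project (for example, guttenberg_search), and add two subdirectories inside it:

/public: holds the files for the frontend Vue.js web app.
/server: holds the server-side Node.js source files.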

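2.3 Add the Docker Compose Configuration File

Next, create a docker-compose.yml file that defines the configuration of each container in the application stack:

gs-api: the Node.js container for the backend application logic.
gs-frontend: the Nginx container that serves the frontend web app.
gs-search: the Elasticsearch container that stores the search data.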

version: '3'
services:
  api: # Node.js App
    container_name: gs-api
    build: .
    ports:
      - "3000:3000" # Expose API port
      - "9229:9229" # Expose Node process debug port (disable in production)
    environment: # Set ENV vars
      - NODE_ENV=local
      - ES_HOST=elasticsearch
      - PORT=3000
    volumes: # Attach local book data directory
      - ./books:/usr/src/app/books
  frontend: # Nginx Server For Frontend App
    container_name: gs-frontend
    image: nginx
    volumes: # Serve local "public" dir
      - ./public:/usr/share/nginx/html
    ports:
      - "8080:80" # Forward site to localhost:8080
  elasticsearch: # Elasticsearch Instance
    container_name: gs-search
    image: docker.elastic.co/elasticsearch/elasticsearch:6.1.1
    volumes: # Persist ES data in separate "esdata" volume
      - esdata:/usr/share/elasticsearch/data
    environment:
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.type=single-node
    ports: # Expose Elasticsearch ports
      - "9300:9300"
      - "9200:9200"
volumes: # Define separate volume for Elasticsearch data
  esdata:

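This file defines the whole application stack without requiring Elasticsearch, Node.js, or Nginx to be installed on the host machine. Each container publishes its ports to the host so that the Node API, the Elasticsearch instance, and the frontend app can be reached and debugged from the host.

2.4 Add the Dockerfile

We use the official Nginx and Elasticsearch images, but we need to build our own image for the Node.js app.

Define a simple Dockerfile in the application root directory.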

# Use Node v8.9.0 LTS
FROM node:carbon
# Setup app working directory
WORKDIR /usr/src/app
# Copy package.json and package-lock.json
COPY package*.json ./
# Install app dependencies
RUN npm install
# Copy sourcecode
COPY . .
# Start app
CMD [ "npm", "start" ]

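This Dockerfile copies the package files into the image, installs the NPM dependencies, and then copies in the application source code, producing our own image. We also need a .dockerignore file so that unneeded files are not copied in.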

node_modules/
npm-debug.log
books/
public/

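Note: node_modules is deliberately not copied, because npm install runs inside the container when the image is built. Copying node_modules from the host into the container can cause compatibility problems. For example, a bcrypt package installed on macOS is built for that system, and copying it into an Ubuntu container causes an operating-system mismatch.

2.5 Add the Base Files

Before testing the configuration, we need a few placeholder files in the application directories. Add the following to public/index.html: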

<html><body>Hello World From The Frontend Container</body></html>

Next, add the placeholder Node.js app in server/app.js.

const Koa = require('koa')
const app = new Koa()
app.use(async (ctx, next) => {
  ctx.body = 'Hello World From the Backend Container'
})
const port = process.env.PORT || 3000
app.listen(port, err => {
  if (err) console.error(err)
  console.log(`App Listening on Port ${port}`)
})

Finally, add the package.json Node.js configuration file:

{
  "name": "guttenberg-search",
  "version": "0.0.1",
  "description": "Source code for Elasticsearch tutorial using 100 classic open source books.",
  "scripts": {
    "start": "node --inspect=0.0.0.0:9229 server/app.js"
  },
  "repository": {
    "type": "git",
    "url": "git+https://github.com/triestpa/guttenberg-search.git"
  },
  "author": "patrick.triest@gmail.com",
  "license": "MIT",
  "bugs": {
    "url": "https://github.com/triestpa/guttenberg-search/issues"
  },
  "homepage": "https://github.com/triestpa/guttenberg-search#readme",
  "dependencies": {
    "elasticsearch": "13.3.1",
    "joi": "13.0.1",
    "koa": "2.4.1",
    "koa-joi-validate": "0.5.1",
    "koa-router": "7.2.1"
  }
}

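This file defines the application's start command and its Node.js dependencies.

Note: there is no need to run npm install yourself; the dependencies are installed automatically when the container image is built.

2.6 Try It Out

Everything is in place, so we can test it. From the project root, run docker-compose build to build the Node.js application container.

Then run docker-compose up to start the application stack.

Note: this step may take a while the first time, because Docker may need to download the base images. Later runs are much faster, since the base images are already cached locally.

Visit localhost:8080; you should see the "hello world" page from the frontend container.

Visit localhost:3000 to verify that the server returns its "hello world" message.

Finally, visit localhost:9200 to confirm that Elasticsearch is running. If it is, it should return output like this: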

{
  "name" : "SLTcfpI",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "iId8e0ZeS_mgh9ALlWQ7-w",
  "version" : {
    "number" : "6.1.1",
    "build_hash" : "bd92e7f",
    "build_date" : "2017-12-17T20:23:25.338Z",
    "build_snapshot" : false,
    "lucene_version" : "7.1.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

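If all three URLs respond correctly, congratulations: the whole application skeleton works. Now for the really fun part.

3. Connect to Elasticsearch

The first step is to connect to the local Elasticsearch instance.

3.0 Add the ES Connection Module

Add the following initialization code to server/connection.js: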

const elasticsearch = require('elasticsearch')
// Core ES variables for this project
const index = 'library'
const type = 'novel'
const port = 9200
const host = process.env.ES_HOST || 'localhost'
const client = new elasticsearch.Client({ host: { host, port } })
/** Check the ES connection status */
async function checkConnection () {
  let isConnected = false
  while (!isConnected) {
    console.log('Connecting to ES')
    try {
      const health = await client.cluster.health({})
      console.log(health)
      isConnected = true
    } catch (err) {
      console.log('Connection Failed, Retrying...', err)
    }
  }
}
checkConnection()

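Now rebuild the changed application with docker-compose build. Then run docker-compose up -d to restart the stack as a background process.

Once the application is up, run docker exec gs-api "node" "server/connection.js" on the command line to execute the script inside the container. You should see output like this: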

{ cluster_name: 'docker-cluster',
  status: 'yellow',
  timed_out: false,
  number_of_nodes: 1,
  number_of_data_nodes: 1,
  active_primary_shards: 1,
  active_shards: 1,
  relocating_shards: 0,
  initializing_shards: 0,
  unassigned_shards: 1,
  delayed_unassigned_shards: 0,
  number_of_pending_tasks: 0,
  number_of_in_flight_fetch: 0,
  task_max_waiting_in_queue_millis: 0,
  active_shards_percent_as_number: 50 }

If everything works, remove the checkConnection() call on the last line, since the final application will call it from outside the connection module.
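The checkConnection loop above retries immediately and forever when Elasticsearch is not yet reachable. As an optional variation, here is a minimal sketch of a retry helper that pauses between attempts and gives up after a fixed number of tries. The names `sleep`, `retryWithDelay`, `maxTries`, and `delayMs` are assumptions for illustration and are not part of the tutorial code or the Elasticsearch client.

```javascript
// Resolve after the given number of milliseconds.
function sleep (ms) {
  return new Promise(resolve => setTimeout(resolve, ms))
}

// Call `attempt` until it succeeds or `maxTries` is reached, pausing between tries.
// `attempt` is any function that returns a promise (hypothetical parameter name).
async function retryWithDelay (attempt, maxTries = 5, delayMs = 1000) {
  let lastError
  for (let tryNumber = 1; tryNumber <= maxTries; tryNumber++) {
    try {
      return await attempt()
    } catch (err) {
      lastError = err
      console.log(`Attempt ${tryNumber} of ${maxTries} failed`)
      // Only wait if another attempt is coming
      if (tryNumber < maxTries) await sleep(delayMs)
    }
  }
  throw lastError // All attempts failed; surface the last error to the caller
}
```

In connection.js it could, for example, wrap the health check as retryWithDelay(() => client.cluster.health({})), so that a slow-starting Elasticsearch container does not flood the log with failed attempts.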

3.1 Add a Helper Function to Reset the Index

In server/connection.js, add the following below checkConnection to make it easy to reset the index.

/** Clear the index, recreate it, and add mappings */
async function resetIndex () {
  if (await client.indices.exists({ index })) {
    await client.indices.delete({ index })
  }
  await client.indices.create({ index })
  await putBookMapping()
}

3.2 Add the Book Schema

Right after resetIndex, add the following function:

/** Add book section schema mapping to ES */
async function putBookMapping () {
  const schema = {
    title: { type: 'keyword' },
    author: { type: 'keyword' },
    location: { type: 'integer' },
    text: { type: 'text' }
  }
  return client.indices.putMapping({ index, type, body: { properties: schema } })
}

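This defines a mapping for the book index. An Elasticsearch index is roughly analogous to a SQL table or a MongoDB collection. A mapping lets us define each field of a document and its data type. Elasticsearch is schema-less, so technically we do not need to add a mapping, but doing so gives us more control over how the data is handled.

For example, we define two keyword fields, "title" and "author", and set the paragraph body to be a "text" field. The search engine treats these very differently: during a search, the engine looks through a text field for potential matches, whereas a keyword field is matched only on its exact, full content. The difference looks small, but it has a large effect on search behavior and speed.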
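The difference between the two field types can be approximated in plain JavaScript. This is a deliberately simplified sketch, assuming a lowercase word-splitting analyzer and an "all terms must match" rule; the function names are hypothetical, and real Elasticsearch matching also involves relevance scoring.

```javascript
// Simplified stand-in for a text analyzer: lowercase and split into word tokens.
function analyze (value) {
  return value.toLowerCase().match(/[a-z0-9]+/g) || []
}

// "keyword" behaviour: the whole field value must equal the query exactly.
function matchesKeyword (fieldValue, query) {
  return fieldValue === query
}

// "text" behaviour: every token of the analyzed query must appear among the field's tokens.
function matchesText (fieldValue, query) {
  const fieldTokens = new Set(analyze(fieldValue))
  return analyze(query).every(token => fieldTokens.has(token))
}

console.log(matchesKeyword('Heart of Darkness', 'darkness')) // false
console.log(matchesText('Heart of Darkness', 'darkness')) // true
console.log(matchesKeyword('Heart of Darkness', 'Heart of Darkness')) // true
```

This is why the book title is useful as an exact filter, while the paragraph text is what gets searched word by word.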

At the end of the file, export the functions and properties so that other modules can use them.

module.exports = {
  client, index, type, checkConnection, resetIndex
}

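4. Load the Source Data

We use data from Project Gutenberg, an online effort that provides free e-books. It includes 100 classic books, such as Around the World in Eighty Days, Romeo and Juliet, and The Odyssey.

4.1 Download the Book Data

The data for this article can be downloaded from:

https://cdn.patricktriest.com/data/books.zip. Extract it into a books/ subdirectory of the project root.

The same can be done from the command line: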

wget https://cdn.patricktriest.com/data/books.zip
unar books.zip

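4.2 Preview a Book

Open one of the books, for example 219-0.txt. It begins with the open-access license, followed by the title, author, release date, language, and character encoding.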

Title: Heart of Darkness
Author: Joseph Conrad
Release Date: February 1995 [EBook #219]
Last Updated: September 7, 2016
Language: English
Character set encoding: UTF-8

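Next comes the marker line *** START OF THIS PROJECT GUTENBERG EBOOK HEART OF DARKNESS ***, immediately followed by the actual content of the book.

At the end of the book there is a matching end marker, *** END OF THIS PROJECT GUTENBERG EBOOK HEART OF DARKNESS ***, followed by a more detailed version of the license.

In the next steps we will programmatically extract the metadata from each book and pull out the content between the *** markers.

4.3 Read the Data Directory

This section writes a script that reads the content of each book and adds it to Elasticsearch. The script lives in server/load_data.js.

First, get a list of every file in the books directory.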

const fs = require('fs')
const path = require('path')
const esConnection = require('./connection')
/** Clear ES index, parse and index all files from the books directory */
async function readAndInsertBooks () {
  try {
    // Clear previous ES index
    await esConnection.resetIndex()
    // Read books directory
    let files = fs.readdirSync('./books').filter(file => file.slice(-4) === '.txt')
    console.log(`Found ${files.length} Files`)
    // Read each book file, and index each paragraph in elasticsearch
    for (let file of files) {
      console.log(`Reading File - ${file}`)
      const filePath = path.join('./books', file)
      const { title, author, paragraphs } = parseBookFile(filePath)
      await insertBookData(title, author, paragraphs)
    }
  } catch (err) {
    console.error(err)
  }
}
readAndInsertBooks()

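Run docker-compose up -d --build to rebuild the image and update the application.

Run docker exec gs-api "node" "server/load_data.js" to execute the load_data script inside the container. You should see the Elasticsearch status output, after which the script exits with an error, because it calls a helper function (parseBookFile) that does not exist yet.

4.4 Read the Data Files

Next, define a new function in server/load_data.js that reads the metadata and content of each book: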

/** Read an individual book text file, and extract the title, author, and paragraphs */
function parseBookFile (filePath) {
  // Read text file
  const book = fs.readFileSync(filePath, 'utf8')
  // Find book title and author
  const title = book.match(/^Title:\s(.+)$/m)[1]
  const authorMatch = book.match(/^Author:\s(.+)$/m)
  const author = (!authorMatch || authorMatch[1].trim() === '') ? 'Unknown Author' : authorMatch[1]
  console.log(`Reading Book - ${title} By ${author}`)
  // Find Gutenberg metadata header and footer
  const startOfBookMatch = book.match(/^\*{3}\s*START OF (THIS|THE) PROJECT GUTENBERG EBOOK.+\*{3}$/m)
  const startOfBookIndex = startOfBookMatch.index + startOfBookMatch[0].length
  const endOfBookIndex = book.match(/^\*{3}\s*END OF (THIS|THE) PROJECT GUTENBERG EBOOK.+\*{3}$/m).index
  // Clean book text and split into array of paragraphs
  const paragraphs = book
    .slice(startOfBookIndex, endOfBookIndex) // Remove Gutenberg header and footer
    .split(/\n\s+\n/g) // Split each paragraph into its own array entry
    .map(line => line.replace(/\r\n/g, ' ').trim()) // Remove paragraph line breaks and whitespace
    .map(line => line.replace(/_/g, '')) // Gutenberg uses "_" to signify italics.  We'll remove it, since it makes the raw text look messy.
    .filter((line) => (line && line.length > 0)) // Remove empty lines
  console.log(`Parsed ${paragraphs.length} Paragraphs\n`)
  return { title, author, paragraphs }
}

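This function does the following:

Reads the file from the file system.
Uses regular expressions to extract the title and author.
Locates the *** markers to find the book's content.
Splits the content into paragraphs.
Cleans the data and removes empty lines.

Finally, it returns an object containing the title, the author, and the list of paragraphs.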
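One detail worth noting: parseBookFile falls back to 'Unknown Author' when the author line is missing, but a file without a Title line would make book.match(...)[1] throw. The following sketch shows one way to read such metadata lines with a fallback. The helper name `extractField`, the fallback strings, and the sample text are assumptions for illustration; the field name is assumed to be a plain word with no regex special characters.

```javascript
// Return the value of a "Name: value" metadata line, or a fallback when the
// line is missing or blank. `fieldName` is assumed to be a plain word.
function extractField (bookText, fieldName, fallback) {
  const match = bookText.match(new RegExp(`^${fieldName}:\\s(.+)$`, 'm'))
  return (match && match[1].trim() !== '') ? match[1].trim() : fallback
}

// Made-up sample header in the same shape as the book files
const sample = 'Title: Heart of Darkness\r\nAuthor: Joseph Conrad\r\nLanguage: English\r\n'

console.log(extractField(sample, 'Title', 'Unknown Title')) // Heart of Darkness
console.log(extractField(sample, 'Translator', 'Unknown')) // Unknown
```

The trim() call also removes a trailing carriage return, which can otherwise end up in the captured value for files with Windows-style line endings.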

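Run docker-compose up -d --build and docker exec gs-api "node" "server/load_data.js" again. The script should now print the title and author of each book it parses.

At this point the script extracts the title and author successfully, but it still fails for the same kind of reason: it calls a function that is not defined yet.

4.5 Index the Data Files in ES

As the last step, add the insertBookData function to load_data.js to insert the data extracted in the previous section into the Elasticsearch index.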

/** Bulk index the book data in Elasticsearch */
async function insertBookData (title, author, paragraphs) {
  let bulkOps = [] // Array to store bulk operations
  // Add an index operation for each section in the book
  for (let i = 0; i < paragraphs.length; i++) {
    // Describe action
    bulkOps.push({ index: { _index: esConnection.index, _type: esConnection.type } })
    // Add document
    bulkOps.push({
      author,
      title,
      location: i,
      text: paragraphs[i]
    })
    if (i > 0 && i % 500 === 0) { // Do bulk insert in 500 paragraph batches
      await esConnection.client.bulk({ body: bulkOps })
      bulkOps = []
      console.log(`Indexed Paragraphs ${i - 499} - ${i}`)
    }
  }
  // Insert remainder of bulk ops array
  await esConnection.client.bulk({ body: bulkOps })
  console.log(`Indexed Paragraphs ${paragraphs.length - (bulkOps.length / 2)} - ${paragraphs.length}\n\n\n`)
}

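This function indexes the paragraphs of a book along with the author, title, and paragraph-location metadata. It inserts the paragraphs with bulk operations, which is much more efficient than indexing each paragraph individually.

Indexing the paragraphs in batches lets the application run on low-spec machines (mine has only 1.7 GB of RAM). If you have a more capable machine (more than 4 GB of RAM), you probably do not need to worry about batching.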
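The batching logic can also be separated from the Elasticsearch call, which makes it easier to reason about. The sketch below is one possible arrangement, not the tutorial's implementation: `chunk`, `buildBulkBody`, and the `meta` fields `indexName` and `typeName` are hypothetical names, and the sample paragraphs are placeholders.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk (items, size) {
  const chunks = []
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size))
  }
  return chunks
}

// Build the flat bulk body (action entry, then document) for one batch of paragraphs.
// `indexName` and `typeName` stand in for the values exported by the connection module.
function buildBulkBody (batch, startLocation, meta) {
  const body = []
  batch.forEach((text, offset) => {
    body.push({ index: { _index: meta.indexName, _type: meta.typeName } })
    body.push({ author: meta.author, title: meta.title, location: startLocation + offset, text })
  })
  return body
}

const paragraphs = ['p0', 'p1', 'p2', 'p3', 'p4']
const batches = chunk(paragraphs, 2)
console.log(batches.length) // 3
```

Each batch would then be passed to a single bulk call, with startLocation advanced by the batch size, so that memory use stays bounded by the batch size rather than by the length of the book.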

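Run docker-compose up -d --build and docker exec gs-api "node" "server/load_data.js" once more. The script should now log each batch of paragraphs as it is indexed.

5. Search

Elasticsearch is now loaded with 100 books (roughly 230,000 paragraphs), so let's try some searches.

5.0 Simple HTTP Query

First, query http://localhost:9200/library/_search?q=text:Java&pretty, which runs a full-text search for the keyword "Java". The output should look like this: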

{
  "took" : 11,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 13,
    "max_score" : 14.259304,
    "hits" : [
      {
        "_index" : "library",
        "_type" : "novel",
        "_id" : "p_GwFWEBaZvLlaAUdQgV",
        "_score" : 14.259304,
        "_source" : {
          "author" : "Charles Darwin",
          "title" : "On the Origin of Species",
          "location" : 1080,
          "text" : "Java, plants of, 375."
        }
      },
      {
        "_index" : "library",
        "_type" : "novel",
        "_id" : "wfKwFWEBaZvLlaAUkjfk",
        "_score" : 10.186235,
        "_source" : {
          "author" : "Edgar Allan Poe",
          "title" : "The Works of Edgar Allan Poe",
          "location" : 827,
          "text" : "After many years spent in foreign travel, I sailed in the year 18-- , from the port of Batavia, in the rich and populous island of Java, on a voyage to the Archipelago of the Sunda islands. I went as passenger--having no other inducement than a kind of nervous restlessness which haunted me as a fiend."
        }
      },
      ...
    ]
  }
}

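The Elasticsearch HTTP interface is handy for checking that the data was inserted correctly, but exposing it directly to a web app would be dangerous. Operational API functions (such as directly adding and deleting documents) should not be exposed to the application. Instead, we will write a simple Node.js API that receives client requests and forwards the appropriate queries to Elasticsearch over a private network.

5.1 Query Script

This section shows how to send requests to Elasticsearch from the Node.js application. Start by creating a new file, server/search.js.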

const { client, index, type } = require('./connection')
module.exports = {
  /** Query ES index for the provided term */
  queryTerm (term, offset = 0) {
    const body = {
      from: offset,
      query: { match: {
        text: {
          query: term,
          operator: 'and',
          fuzziness: 'auto'
        } } },
      highlight: { fields: { text: {} } }
    }
    return client.search({ index, type, body })
  }
}

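This module defines a simple search function that runs a match query on the input term. The fields are:

from: paginates the results. Each query returns 10 results by default, so setting from to 10 returns results 10-20.
query: the term to search for.
operator: how the terms of the query are combined; with the "and" operator used here, a result must contain every term in the query.
fuzziness: the tolerance for misspellings (the fuzzy-matching level). The example uses 'auto', which lets Elasticsearch choose the allowed edit distance from the length of each term, up to a maximum of 2. The higher the value, the more corrections are allowed; with a fuzziness of 1, for example, a search for Patricc returns Patrick.
highlight: returns an extra field with each result that contains HTML marking the matched text.

Try adjusting these parameters to see how the results change; the Elastic Full-Text Query DSL documentation [1] has more details.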
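Fuzziness is expressed as an edit distance: the number of single-character changes needed to turn one term into another. The sketch below computes the classic Levenshtein distance to illustrate the idea. It is a simplified stand-in written for this explanation, with a hypothetical function name, and is not how Elasticsearch implements fuzzy matching internally.

```javascript
// Classic dynamic-programming Levenshtein distance
// (counts insertions, deletions, and substitutions).
function editDistance (a, b) {
  // previous[j] = distance between the first i-1 characters of a and the first j of b
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const current = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      current[j] = Math.min(
        previous[j] + 1, // deletion
        current[j - 1] + 1, // insertion
        previous[j - 1] + cost // substitution (or match)
      )
    }
    previous = current
  }
  return previous[b.length]
}

console.log(editDistance('patricc', 'patrick')) // 1
console.log(editDistance('java', 'javan')) // 1
```

A distance of 1 between "java" and "javan" is consistent with a fuzzy search for "java" also matching text that contains "Javan".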

6. API

This section builds the HTTP API that the frontend code will call.

6.0 API Server

Replace the contents of server/app.js with the following:

const Koa = require('koa')
const Router = require('koa-router')
const joi = require('joi')
const validate = require('koa-joi-validate')
const search = require('./search')
const app = new Koa()
const router = new Router()
// Log each request to the console
app.use(async (ctx, next) => {
  const start = Date.now()
  await next()
  const ms = Date.now() - start
  console.log(`${ctx.method} ${ctx.url} - ${ms}`)
})
// Log percolated errors to the console
app.on('error', err => {
  console.error('Server Error', err)
})
// Set permissive CORS header
app.use(async (ctx, next) => {
  ctx.set('Access-Control-Allow-Origin', '*')
  return next()
})
// ADD ENDPOINTS HERE
const port = process.env.PORT || 3000
app
  .use(router.routes())
  .use(router.allowedMethods())
  .listen(port, err => {
    if (err) throw err
    console.log(`App Listening on Port ${port}`)
  })

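This code imports the server dependencies and sets up simple logging and error handling for the Koa.js Node API server.

6.1 Link the Endpoint to the Query

This section adds an endpoint to the server that exposes the Elasticsearch query function.

In server/app.js, insert the following code after the // ADD ENDPOINTS HERE comment: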

/**
 * GET /search
 * Search for a term in the library
 */
router.get('/search', async (ctx, next) => {
    const { term, offset } = ctx.request.query
    ctx.body = await search.queryTerm(term, offset)
  }
)

Restart the server with docker-compose up -d --build. Then call the endpoint from a browser, for example: http://localhost:3000/search?term=java.

The result should look something like this:

{
    "took": 242,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 93,
        "max_score": 13.356944,
        "hits": [{
            "_index": "library",
            "_type": "novel",
            "_id": "eHYHJmEBpQg9B4622421",
            "_score": 13.356944,
            "_source": {
                "author": "Charles Darwin",
                "title": "On the Origin of Species",
                "location": 1080,
                "text": "Java, plants of, 375."
            },
            "highlight": {
                "text": ["<em>Java</em>, plants of, 375."]
            }
        }, {
            "_index": "library",
            "_type": "novel",
            "_id": "2HUHJmEBpQg9B462xdNg",
            "_score": 9.030668,
            "_source": {
                "author": "Unknown Author",
                "title": "The King James Bible",
                "location": 186,
                "text": "10:4 And the sons of Javan; Elishah, and Tarshish, Kittim, and Dodanim."
            },
            "highlight": {
                "text": ["10:4 And the sons of <em>Javan</em>; Elishah, and Tarshish, Kittim, and Dodanim."]
            }
        }
        ...
      ]
   }
}

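6.2 Input Validation

The endpoint is still fragile at this point. Next we check the input parameters so that invalid or missing input is detected and answered with an error.

We use the Joi and Koa-Joi-Validate libraries for this kind of validation: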

/**
 * GET /search
 * Search for a term in the library
 * Query Params -
 * term: string under 60 characters
 * offset: positive integer
 */
router.get('/search',
  validate({
    query: {
      term: joi.string().max(60).required(),
      offset: joi.number().integer().min(0).default(0)
    }
  }),
  async (ctx, next) => {
    const { term, offset } = ctx.request.query
    ctx.body = await search.queryTerm(term, offset)
  }
)

Now, if you restart the server and make a request with a missing parameter (http://localhost:3000/search), it returns an HTTP 400 error such as: Invalid URL Query - child "term" fails because ["term" is required].
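For readers who want to see what these rules amount to, here is a rough plain-JavaScript equivalent of the same checks, written without Joi. The function name `validateSearchQuery`, the return shape, and the error messages are assumptions for illustration; they do not reproduce Joi's actual output.

```javascript
// Validate and normalize raw query-string values (which always arrive as strings).
// Returns { value } on success or { error } with a human-readable message.
function validateSearchQuery (query) {
  const term = query.term
  if (typeof term !== 'string' || term.length === 0) {
    return { error: '"term" is required' }
  }
  if (term.length > 60) {
    return { error: '"term" must be at most 60 characters' }
  }
  // Default the offset to 0 when it is not supplied
  const offset = query.offset === undefined ? 0 : Number(query.offset)
  if (!Number.isInteger(offset) || offset < 0) {
    return { error: '"offset" must be a non-negative integer' }
  }
  return { value: { term, offset } }
}

console.log(validateSearchQuery({ term: 'java', offset: '20' })) // { value: { term: 'java', offset: 20 } }
console.log(validateSearchQuery({})) // { error: '"term" is required' }
```

A declarative schema library keeps such rules shorter and more consistent across endpoints, which is the reason the tutorial reaches for Joi instead of hand-written checks.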

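You can view the logs with docker-compose logs -f api.

7. Frontend Application

Now that the /search endpoint is in place, this section builds a simple frontend web app to test the API.

7.0 Vue.js

We use Vue.js for the frontend. Create a new file, /public/app.js: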

const vm = new Vue ({
  el: '#vue-instance',
  data () {
    return {
      baseUrl: 'http://localhost:3000', // API url
      searchTerm: 'Hello World', // Default search term
      searchDebounce: null, // Timeout for search bar debounce
      searchResults: [], // Displayed search results
      numHits: null, // Total search results found
      searchOffset: 0, // Search result pagination offset
      selectedParagraph: null, // Selected paragraph object
      bookOffset: 0, // Offset for book paragraphs being displayed
      paragraphs: [] // Paragraphs being displayed in book preview window
    }
  },
  async created () {
    this.searchResults = await this.search() // Search for default term
  },
  methods: {
    /** Debounce search input by 100 ms */
    onSearchInput () {
      clearTimeout(this.searchDebounce)
      this.searchDebounce = setTimeout(async () => {
        this.searchOffset = 0
        this.searchResults = await this.search()
      }, 100)
    },
    /** Call API to search for inputted term */
    async search () {
      const response = await axios.get(`${this.baseUrl}/search`, { params: { term: this.searchTerm, offset: this.searchOffset } })
      this.numHits = response.data.hits.total
      return response.data.hits.hits
    },
    /** Get next page of search results */
    async nextResultsPage () {
      if (this.numHits > 10) {
        this.searchOffset += 10
        if (this.searchOffset + 10 > this.numHits) { this.searchOffset = this.numHits - 10}
        this.searchResults = await this.search()
        document.documentElement.scrollTop = 0
      }
    },
    /** Get previous page of search results */
    async prevResultsPage () {
      this.searchOffset -= 10
      if (this.searchOffset < 0) { this.searchOffset = 0 }
      this.searchResults = await this.search()
      document.documentElement.scrollTop = 0
    }
  }
})

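The app is very simple: it defines some shared data properties and adds methods for fetching and paginating search results. Search input is debounced by 100 ms to keep the API from being called too often.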
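The offset arithmetic in nextResultsPage and prevResultsPage can be isolated into two small pure functions, which makes the clamping behaviour easier to check. This is a sketch of the same logic under the assumption of a fixed page size of 10; the names `PAGE_SIZE`, `nextOffset`, and `prevOffset` are hypothetical and do not appear in the app.

```javascript
const PAGE_SIZE = 10

// Offset of the next page, clamped so the last page still shows a full window of results.
function nextOffset (currentOffset, totalHits) {
  if (totalHits <= PAGE_SIZE) return currentOffset // Only one page: nothing to advance
  return Math.min(currentOffset + PAGE_SIZE, totalHits - PAGE_SIZE)
}

// Offset of the previous page, never below zero.
function prevOffset (currentOffset) {
  return Math.max(currentOffset - PAGE_SIZE, 0)
}

console.log(nextOffset(0, 93)) // 10
console.log(nextOffset(80, 93)) // 83
console.log(prevOffset(5)) // 0
```

With 93 hits, for example, paging forward from offset 80 lands on 83 rather than 90, so the final page shows results 83 to 92 instead of a short page of three.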

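Explaining how Vue.js works is beyond the scope of this article; if you want to learn more, see the official Vue.js documentation [2].

7.1 HTML

Replace /public/index.html with the following: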



<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Elastic Library</title>

7.2 CSS

Add a new file, /public/styles.css:

body { font-family: 'EB Garamond', serif; }
.mui-textfield > input, .mui-btn, .mui--text-subhead, .mui-panel > .mui--text-headline {
  font-family: 'Open Sans', sans-serif;
}
.all-caps { text-transform: uppercase; }
.app-container { padding: 16px; }
.search-results em { font-weight: bold; }
.book-modal > button { width: 100%; }
.search-results .mui-divider { margin: 14px 0; }
.search-results {
  display: flex;
  flex-direction: row;
  flex-wrap: wrap;
  justify-content: space-around;
}
.search-results > div {
  flex-basis: 45%;
  box-sizing: border-box;
  cursor: pointer;
}
@media (max-width: 600px) {
  .search-results > div { flex-basis: 100%; }
}
.paragraphs-container {
  max-width: 800px;
  margin: 0 auto;
  margin-bottom: 48px;
}
.paragraphs-container .mui--text-body1, .paragraphs-container .mui--text-body2 {
  font-size: 1.8rem;
  line-height: 35px;
}
.book-modal {
  width: 100%;
  height: 100%;
  padding: 40px 10%;
  box-sizing: border-box;
  margin: 0 auto;
  background-color: white;
  overflow-y: scroll;
  position: fixed;
  top: 0;
  left: 0;
}
.pagination-panel {
  display: flex;
  justify-content: space-between;
}
.title-row {
  display: flex;
  justify-content: space-between;
  align-items: flex-end;
}
@media (max-width: 600px) {
  .title-row{ 
    flex-direction: column; 
    text-align: center;
    align-items: center
  }
}
.locations-label {
  text-align: center;
  margin: 8px;
}
.modal-footer {
  position: fixed;
  bottom: 0;
  left: 0;
  width: 100%;
  display: flex;
  justify-content: space-around;
  background: white;
}

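7.3 Try It Out

Open localhost:8080. You should see a simple, paginated list of results. Try typing in a few search terms.

You do not need to rerun docker-compose up for these changes to take effect. The local public directory is mounted into the Nginx server container, so changes to the frontend files on the local system show up in the containerized app right away.

Clicking on a result does nothing yet, which means there is still some functionality to add.

8. Page Previews

It would be nice to click on any result and see the surrounding context from the book it came from.

8.0 Add the Elasticsearch Query

First, we need a simple query that fetches a range of paragraphs from a given book. Add the following inside module.exports in server/search.js: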

/** Get the specified range of paragraphs from a book */
getParagraphs (bookTitle, startLocation, endLocation) {
  const filter = [
    { term: { title: bookTitle } },
    { range: { location: { gte: startLocation, lte: endLocation } } }
  ]
  const body = {
    size: endLocation - startLocation,
    sort: { location: 'asc' },
    query: { bool: { filter } }
  }
  return client.search({ index, type, body })
}

This function returns the requested paragraphs of the given book, sorted by location.
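To inspect the request without a running cluster, the body construction can be pulled out into a pure function. The sketch below is a variation rather than the tutorial's code: `buildParagraphsBody` is a hypothetical name, the inputs are coerced with Number() on the assumption that they may arrive as query-string values, and it asks for end - start + 1 hits so that both ends of the inclusive range are returned, whereas getParagraphs above requests end - start.

```javascript
// Build the search body for an inclusive range of paragraph locations in one book.
// Inputs are coerced with Number() because query-string values arrive as strings.
function buildParagraphsBody (bookTitle, startLocation, endLocation) {
  const start = Number(startLocation)
  const end = Number(endLocation)
  return {
    size: end - start + 1, // gte/lte is inclusive on both ends
    sort: { location: 'asc' },
    query: {
      bool: {
        filter: [
          { term: { title: bookTitle } }, // Exact match on the keyword field
          { range: { location: { gte: start, lte: end } } }
        ]
      }
    }
  }
}

console.log(JSON.stringify(buildParagraphsBody('Heart of Darkness', '5', '15'), null, 2))
```

Because title is mapped as a keyword field, the term filter only matches the exact, full title, which is what is wanted when paging through a single book.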

8.1 Add the API Endpoint

Now link that function to an API endpoint. In server/app.js, add the following below the original /search endpoint:

/**
 * GET /paragraphs
 * Get a range of paragraphs from the specified book
 * Query Params -
 * bookTitle: string under 256 characters
 * start: positive integer
 * end: positive integer greater than start
 */
router.get('/paragraphs',
  validate({
    query: {
      bookTitle: joi.string().max(256).required(),
      start: joi.number().integer().min(0).default(0),
      end: joi.number().integer().greater(joi.ref('start')).default(10)
    }
  }),
  async (ctx, next) => {
    const { bookTitle, start, end } = ctx.request.query
    ctx.body = await search.getParagraphs(bookTitle, start, end)
  }
)

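8.2 Add the UI

This section adds the frontend logic for fetching and displaying a full page of the book around the matched text. Add the following to the methods block of /public/app.js: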

    /** Call the API to get current page of paragraphs */
    async getParagraphs (bookTitle, offset) {
      try {
        this.bookOffset = offset
        const start = this.bookOffset
        const end = this.bookOffset + 10
        const response = await axios.get(`${this.baseUrl}/paragraphs`, { params: { bookTitle, start, end } })
        return response.data.hits.hits
      } catch (err) {
        console.error(err)
      }
    },
    /** Get next page (next 10 paragraphs) of selected book */
    async nextBookPage () {
      this.$refs.bookModal.scrollTop = 0
      this.paragraphs = await this.getParagraphs(this.selectedParagraph._source.title, this.bookOffset + 10)
    },
    /** Get previous page (previous 10 paragraphs) of selected book */
    async prevBookPage () {
      this.$refs.bookModal.scrollTop = 0
      this.paragraphs = await this.getParagraphs(this.selectedParagraph._source.title, this.bookOffset - 10)
    },
    /** Display paragraphs from selected book in modal window */
    async showBookModal (searchHit) {
      try {
        document.body.style.overflow = 'hidden'
        this.selectedParagraph = searchHit
        this.paragraphs = await this.getParagraphs(searchHit._source.title, searchHit._source.location - 5)
      } catch (err) {
        console.error(err)
      }
    },
    /** Close the book detail modal */
    closeBookModal () {
      document.body.style.overflow = 'auto'
      this.selectedParagraph = null
    }

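These five functions provide the logic for loading and paginating through the book (ten paragraphs per page).

In /public/index.html, add the following UI code for the book page below the delimiter: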

    
    <div v-if="selectedParagraph" ref="bookModal" class="book-modal">
      <div class="paragraphs-container">
        <div class="title-row">
          <div class="mui--text-display2 all-caps">{{ selectedParagraph._source.title }}</div>
          <div class="mui--text-display1">{{ selectedParagraph._source.author }}</div>
        </div>
        <br>
        <div class="mui-divider"></div>
        <div class="mui--text-subhead locations-label">Locations {{ bookOffset - 5 }} to {{ bookOffset + 5 }}</div>
        <div class="mui-divider"></div>
        <br>
        <div v-for="paragraph in paragraphs">
          <div v-if="paragraph._source.location === selectedParagraph._source.location" class="mui--text-body2">
            <strong>{{ paragraph._source.text }}</strong>
          </div>
          <div v-else class="mui--text-body1">
            {{ paragraph._source.text }}
          </div>
          <br>
        </div>
      </div>
      <div class="modal-footer">
        <button class="mui-btn mui-btn--flat" v-on:click="prevBookPage()">Prev Page</button>
        <button class="mui-btn mui-btn--flat" v-on:click="closeBookModal()">Close</button>
        <button class="mui-btn mui-btn--flat" v-on:click="nextBookPage()">Next Page</button>
      </div>
    </div>

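Restart the app server (docker-compose up -d --build) and open localhost:8080. Clicking a search result now shows the context around the paragraph. If the result interests you, you can even keep reading the book from that point.

Congratulations! The main application is now complete. The completed application can be seen here [3].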

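9. Drawbacks of Elasticsearch

9.0 Resource Usage

Elasticsearch is computationally demanding. The official recommendation is a machine with 64 GB of RAM, and machines with less than 8 GB are strongly discouraged. Elasticsearch is an in-memory datastore, which makes queries fast but also consumes a lot of memory. In production, running Elasticsearch as a cluster of nodes is strongly recommended, to provide high availability, automatic sharding, and data redundancy.

I run the example above (search.patricktriest.com) on a 1.7 GB cloud instance ($15 per month), which is only just enough to run the Elasticsearch node. Sometimes the whole node hangs during the initial data load. In my experience, Elasticsearch is much more resource-hungry than traditional databases such as PostgreSQL and MongoDB, and it can get expensive if you need it to perform well.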

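9.1 Keeping Databases in Sync

For many applications, storing all of the data in Elasticsearch is not an ideal choice. Using ES as the primary transactional database is generally not recommended, because ES is not ACID-compliant (write operations can be lost when scaling up the system to ingest data). In many cases ES plays a specialized role, such as full-text search, and that role requires some data to be replicated from the primary database into Elasticsearch.

For example, suppose we store users in a PostgreSQL table but use ES to power user search. If a user named "Albert" decides to change his name to "Al", the change must be made in both the primary PostgreSQL database and the ES cluster.

Doing this can be tricky, and the right approach depends on your existing stack. There are many open-source options, ranging from a process that watches the MongoDB operation log and automatically syncs detected changes to ES, to a PostgreSQL plugin that creates a custom PSQL-based index and communicates with ES automatically.

If none of these options works for you, you can update the Elasticsearch index manually in your server code whenever the database changes. I do not consider this the best choice, though, because keeping ES in sync through custom business logic is complex and likely to introduce bugs.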
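A minimal sketch of the manual approach shows where the complexity comes from. Everything here is hypothetical: `primaryDb` and `searchIndex` are in-memory stand-ins with an async updateName method, not real PostgreSQL or Elasticsearch clients, and `renameUser` is an illustrative name.

```javascript
// In-memory stand-ins for the primary database and the search index (hypothetical).
const primaryUsers = new Map([[1, 'Albert']])
const indexedUsers = new Map([[1, 'Albert']])
const primaryDb = { async updateName (id, name) { primaryUsers.set(id, name) } }
const searchIndex = { async updateName (id, name) { indexedUsers.set(id, name) } }

// Update a user's name in the primary store first, then mirror it into the search index.
async function renameUser (db, index, userId, newName) {
  await db.updateName(userId, newName) // The primary database is the source of truth
  try {
    await index.updateName(userId, newName)
    return { synced: true }
  } catch (err) {
    // The primary write succeeded but the index is now stale;
    // a real system would need to queue a retry or reconcile later.
    console.error(`Search index update failed for user ${userId}`, err)
    return { synced: false }
  }
}

renameUser(primaryDb, searchIndex, 1, 'Al')
  .then(result => console.log(result, indexedUsers.get(1)))
```

Even in this tiny example, the failure branch leaves the two stores disagreeing, and every write path in the application would need the same care. That is the kind of bug-prone custom logic the paragraph above warns about.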

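The need to sync Elasticsearch with a primary database is less a weakness of ES than a consequence of the added architectural complexity. Adding a dedicated search engine to an application is well worth considering, but the trade-offs have to be weighed.

Conclusion

Full-text search is one of the most important features of a modern application, and also one of the hardest to implement well. Elasticsearch offers a way to build fast, customizable search, but there are alternatives. Apache Solr is a similar open-source option built on Apache Lucene, the same library at the core of Elasticsearch. Algolia is a search-as-a-service web platform that has been growing quickly; it is easier for beginners to get started with, at the cost of less customizability and potentially high costs later on.

A "search bar" feature is far from the only use case for Elasticsearch. ES is also a common tool for log storage and analysis, usually as part of the ELK stack (Elasticsearch, Logstash, Kibana). The flexible full-text search that ES provides is also useful for data-science tasks, such as correcting and normalizing the spelling of entries in a dataset, or searching a dataset.

Here are some ideas for extending this project:

Add more of your favorite books to the app and build your own private library search engine.
Build a plagiarism-detection engine by indexing papers from Google Scholar.
Build a spell-checking app by indexing the words of a dictionary into ES.
Build your own internet search engine to compete with Google by loading the Common Crawl Corpus into ES (note: at 5 billion pages, it is an enormous dataset).
Use Elasticsearch for journalism: search for specific names and terms in document leaks such as the Panama Papers and the Paradise Papers.

All of the code in this article is open source and can be found in the GitHub repository [4]. I hope you found it helpful.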

Related links:

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html
[2] https://vuejs.org/v2/guide/
[3] https://search.patricktriest.com
[4] https://github.com/triestpa/guttenberg-search

Original article: https://blog.patricktriest.com/text-search-docker-elasticsearch/
