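Solr Indexing and Basic Data Operations

阿新 • Published: 2019-01-15

1. Introduction

A Solr index can accept data from many different sources, including XML files, comma-separated value (CSV) files, data extracted from databases, and files in common formats such as MS Word or PDF.

There are three common ways to load data into a Solr index:
* Use the Solr Cell framework, built on Apache Tika, to ingest binary or structured files such as Office, Word, PDF, and other proprietary formats.
* Upload XML files by sending HTTP requests.
* Write a Java application that uses SolrJ. This may well be the best option.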

2. The Post Tool

2.1 Indexing XML
$ bin/post -h
$ bin/post -c gettingstarted *.xml
$ bin/post -c gettingstarted -p 8984 *.xml
$ bin/post -c gettingstarted -d '<delete><id>42</id></delete>'

2.2 Indexing CSV
$ bin/post -c gettingstarted *.csv

Indexing a tab-separated file:
$ bin/post -c signals -params "separator=%09" -type text/csv data.tsv

2.3 Indexing JSON
$ bin/post -c gettingstarted *.json

2.4 Indexing rich files
$ bin/post -c gettingstarted a.pdf
$ bin/post -c gettingstarted afolder/
$ bin/post -c gettingstarted -filetypes ppt,html afolder/

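3. Uploading Data with Index Handlers

Index Handlers are request handlers used to add, delete, and update documents in the index.
In addition to importing rich documents through the Tika plugin or structured data through the Data Import Handler, Solr natively supports indexing documents in XML, CSV, and JSON.

3.1 XML-formatted index updates

Requests must set Content-type: application/xml or Content-type: text/xml

(1) Adding documents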
<add>
<doc>
<field name="authors">Patrick Eagar</field>
<field name="subject">Sports</field>
<field name="price">12.40</field>
<field name="title" boost="2.0">Summer of the all-rounder: Test and championship cricket in England 1982</field>
</doc>
<doc boost="2.5">
...
</doc>
</add>
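
An <add> message like the one above can also be generated from application data instead of being written by hand. The following is a minimal sketch using only the Python standard library; build_add_xml is a hypothetical helper name, not part of Solr or any Solr client, and treating list values as multi-valued fields is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of dicts into a Solr <add> XML message.

    List values become repeated <field> elements (multi-valued fields).
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, list) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)  # ElementTree escapes &, <, > on output
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(build_add_xml([
        {"authors": "Patrick Eagar", "subject": "Sports", "price": 12.40},
    ]))
```

Building the message through an XML library rather than string concatenation keeps characters such as & and < in field values from breaking the request.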

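(2) XML update commands

- Commit and Optimize
The <commit> operation writes all documents loaded since the last commit to disk. Until a commit is issued, newly indexed documents are not visible to searchers.
A commit can be issued explicitly by sending a <commit/> message, or triggered automatically by the <autoCommit> settings in solrconfig.xml.
Parameters:
waitSearcher: block until a new searcher has been opened and registered as the main query searcher
expungeDeletes: merge away segments that contain deleted documents

The <optimize> operation asks Solr to merge its internal data structures in order to improve search performance. On a large index, optimization can take some time to complete.
Parameters:
waitSearcher: same meaning as for <commit>
maxSegments: merge the index down to at most this many segments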

<commit waitSearcher="false"/>
<commit waitSearcher="false" expungeDeletes="true"/>
<optimize waitSearcher="false"/>

- Delete operations
Documents can be deleted in two ways: "Delete by ID" (the uniqueKey field) or "Delete by Query".

A single message may contain several delete operations:
<delete>
<id>0002166313</id>
<id>0031745983</id>
<query>subject:sport</query>
<query>publisher:penguin</query>
</delete>
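
A <delete> message mixing IDs and queries can be assembled the same way. This is only a sketch; build_delete_xml is a hypothetical helper name introduced for illustration.

```python
import xml.etree.ElementTree as ET


def build_delete_xml(ids=(), queries=()):
    """Build one <delete> message holding any mix of IDs and queries."""
    delete = ET.Element("delete")
    for doc_id in ids:
        ET.SubElement(delete, "id").text = str(doc_id)
    for query in queries:
        ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


if __name__ == "__main__":
    print(build_delete_xml(ids=["0002166313"], queries=["subject:sport"]))
```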

- Rollback operation
Rolls back all adds and deletes made to the index since the last commit.
<rollback/>

- Running updates with curl

curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" \
--data-binary '
<add>
<doc>
<field name="authors">Patrick Eagar</field>
<field name="subject">Sports</field>
<field name="dd">796.35</field>
<field name="isbn">0002166313</field>
<field name="yearpub">1982</field>
<field name="publisher">Collins</field>
</doc>
</add>'

curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" \
--data-binary @myfile.xml

curl 'http://localhost:8983/solr/my_collection/update?stream.body=%3Ccommit/%3E'
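
The same POST can be issued from a script. The sketch below uses Python's urllib and only builds the request object; the function name make_update_request is hypothetical, and the host, port, and collection name are simply the ones used in the curl examples. Actually sending it requires a running Solr instance.

```python
import urllib.request


def make_update_request(base_url, collection, xml_body):
    """Build (but do not send) a POST request for the XML update handler."""
    url = f"{base_url}/solr/{collection}/update"
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


if __name__ == "__main__":
    req = make_update_request("http://localhost:8983", "my_collection", "<commit/>")
    # Sending needs a live server, so it is left commented out:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status, resp.read().decode("utf-8"))
    print(req.get_method(), req.full_url)
```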

(3) Using XSLT to Transform XML Index Updates

3.2 JSON-formatted index updates

Content-Type: application/json or Content-Type: text/json

- Adding documents
curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/my_collection/update/json/docs' --data-binary '
{
"id": "1",
"title": "Doc 1"
}'

curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/my_collection/update' --data-binary '
[
{
"id": "1",
"title": "Doc 1"
},
{
"id": "2",
"title": "Doc 2"
}
]'

curl 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary \
@example/exampledocs/books.json -H 'Content-type:application/json'

- Sending commands
curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/my_collection/update' --data-binary '
{
"add": {
"doc": {
"id": "DOC1",
"my_boosted_field": { /* use a map with boost/value for a boosted field */
"boost": 2.3,
"value": "test"
},
"my_multivalued_field": [ "aaa", "bbb" ] /* Can use an array for a multi-valued field */
}
},
"add": {
"commitWithin": 5000, /* commit this document within 5 seconds */
"overwrite": false, /* don't check for existing documents with the same uniqueKey */
"boost": 3.45, /* a document boost */
"doc": {
"f1": "v1", /* Can use repeated keys for a multi-valued field */
"f1": "v2"
}
},
"commit": {},
"optimize": { "waitSearcher":false },
"delete": { "id":"ID" }, /* delete by ID */
"delete": { "query":"QUERY" } /* delete by query */
}'

Simple delete-by-ID forms:
{ "delete":"myid" }
{ "delete":["id1","id2"] }

A delete can also name the _version_ of the document to be removed:
{
"delete": { "id":50, "_version_":12345 }
}
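
These delete commands are easy to produce with a JSON library. The following is a sketch with hypothetical helper names (delete_by_ids, delete_with_version); the map form carrying "id" and "_version_" reflects how the versioned delete is understood here and should be checked against the Solr version in use.

```python
import json


def delete_by_ids(*ids):
    """Return a JSON delete-by-ID command; a single ID is sent as a scalar."""
    return json.dumps({"delete": ids[0] if len(ids) == 1 else list(ids)})


def delete_with_version(doc_id, version):
    """Return a delete command that also names the expected _version_."""
    return json.dumps({"delete": {"id": doc_id, "_version_": version}})


if __name__ == "__main__":
    print(delete_by_ids("myid"))
    print(delete_by_ids("id1", "id2"))
    print(delete_with_version(50, 12345))
```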

Convenience request paths:
/update/json
/update/json/docs

- Transforming and indexing custom JSON

curl 'http://localhost:8983/solr/my_collection/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
-H 'Content-type:application/json' -d '
{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [
{
"subject": "Maths",
"test" : "term1",
"marks" : 90},
{
"subject": "Biology",
"test" : "term1",
"marks" : 86}
]
}'
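
To see what the split and f parameters are expected to do with this input, the transformation can be emulated locally. The sketch below is not Solr's implementation: split_and_map is a hypothetical function that handles only one level of nesting, and it assumes that fields mapped from the parent record are copied into every child document.

```python
def split_and_map(record, split_path, field_map):
    """Emulate split=/child with f=name:/path mappings for one nesting level.

    Fields mapped from the parent are copied into every child document.
    """
    child_key = split_path.strip("/")
    prefix = f"/{child_key}/"
    docs = []
    for child in record.get(child_key, []):
        doc = {}
        for name, path in field_map.items():
            if path.startswith(prefix):
                source, key = child, path[len(prefix):]
            else:
                source, key = record, path.lstrip("/")
            if key in source:
                doc[name] = source[key]
        docs.append(doc)
    return docs


if __name__ == "__main__":
    record = {
        "first": "John", "last": "Doe", "grade": 8,
        "exams": [
            {"subject": "Maths", "test": "term1", "marks": 90},
            {"subject": "Biology", "test": "term1", "marks": 86},
        ],
    }
    field_map = {
        "first": "/first", "last": "/last", "grade": "/grade",
        "subject": "/exams/subject", "test": "/exams/test", "marks": "/exams/marks",
    }
    for doc in split_and_map(record, "/exams", field_map):
        print(doc)
```

Under these assumptions the single input record becomes two documents, one per exam, each carrying the student's first, last, and grade values.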

3.3 CSV-formatted index updates

curl 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary \
@example/exampledocs/books.csv -H 'Content-type:application/csv'

curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&escape=%5c' \
--data-binary @/tmp/result.txt
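
The %09 and %5c in the second URL are the percent-encoded tab and backslash characters. Rather than encoding them by hand, a script can let the standard library do it. A small sketch, with csv_update_url as a hypothetical helper name:

```python
from urllib.parse import urlencode


def csv_update_url(base, **params):
    """Return the CSV update URL with percent-encoded query parameters."""
    return f"{base}/update/csv?{urlencode(params)}"


if __name__ == "__main__":
    # Tab as the separator, backslash as the escape character.
    print(csv_update_url("http://localhost:8983/solr",
                         commit="true", separator="\t", escape="\\"))
```

Note that urlencode emits upper-case hex digits (%5C), which is equivalent to the %5c written in the curl command.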

4. Uploading Data with Solr Cell using Apache Tika

The ExtractingRequestHandler uses Tika to accept uploads of binary files such as Word and PDF, extracting their content and indexing it.

curl \
'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' \
-F "myfile=@example/exampledocs/solr-word.pdf"

- Configuring the ExtractingRequestHandler

solrconfig.xml must first load the required JARs:
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />

Then define the handler in solrconfig.xml:
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="fmap.Last-Modified">last_modified</str>
<str name="uprefix">ignored_</str>
</lst>

<str name="tika.config">/my/path/to/tika.config</str>

<lst name="date.formats">
<str>yyyy-MM-dd</str>
</lst>
</requestHandler>

5. Uploading Structured Data Store Data with the Data Import Handler

Configure solrconfig.xml:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">/path/to/my/DIHconfigfile.xml</str>
</lst>
</requestHandler>

Configure DIHconfigfile.xml:
For a reference configuration, see example/example-DIH/solr/db/conf/db-data-config.xml

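6. Updating Parts of Documents

For documents in which only some content has changed, Solr supports two approaches to updating:
- Atomic updates, which change one or more fields without requiring the whole document to be re-indexed.
- Optimistic concurrency (also called optimistic locking), a feature of many NoSQL databases that makes an update conditional on the document's version.

6.1 Atomic Updates

The following modifiers are available:
set: set or replace the field value
add: add the given value(s) to a multi-valued field
remove: remove the given value(s) from a multi-valued field
removeregex: remove all values matching the given regular expression(s) from a multi-valued field
inc: increment a numeric value by the given amount

Existing document: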
{
"id":"mydoc",
"price":10,
"popularity":42,
"categories":["kids"],
"promo_ids":["a123x"],
"tags":["free_to_try","buy_now","clearance","on_sale"]
}

Apply this update command:
{
"id":"mydoc",
"price":{"set":99},
"popularity":{"inc":20},
"categories":{"add":["toys","games"]},
"promo_ids":{"remove":"a123x"},
"tags":{"remove":["free_to_try","on_sale"]}
}

The resulting document:
{
"id":"mydoc",
"price":99,
"popularity":62,
"categories":["kids","toys","games"],
"tags":["buy_now","clearance"]
}
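
The effect of the modifiers can be checked with a small local emulation. This is a sketch of the semantics shown in the example, not Solr's code: apply_atomic_update is a hypothetical function, it covers only set, inc, add, and remove, and it assumes a multi-valued field disappears once its last value is removed (as promo_ids does above).

```python
def apply_atomic_update(doc, update):
    """Apply set/inc/add/remove modifiers to a copy of doc (local emulation)."""
    result = dict(doc)
    for field, change in update.items():
        if not isinstance(change, dict):
            continue  # plain values such as "id" only identify the document
        for op, arg in change.items():
            args = arg if isinstance(arg, list) else [arg]
            if op == "set":
                result[field] = arg
            elif op == "inc":
                result[field] = result.get(field, 0) + arg
            elif op == "add":
                result[field] = list(result.get(field, [])) + args
            elif op == "remove":
                kept = [v for v in result.get(field, []) if v not in args]
                if kept:
                    result[field] = kept
                else:
                    result.pop(field, None)  # a field with no values disappears
            else:
                raise ValueError(f"unsupported modifier: {op}")
    return result


if __name__ == "__main__":
    existing = {"id": "mydoc", "price": 10, "popularity": 42,
                "categories": ["kids"], "promo_ids": ["a123x"],
                "tags": ["free_to_try", "buy_now", "clearance", "on_sale"]}
    update = {"id": "mydoc", "price": {"set": 99}, "popularity": {"inc": 20},
              "categories": {"add": ["toys", "games"]},
              "promo_ids": {"remove": "a123x"},
              "tags": {"remove": ["free_to_try", "on_sale"]}}
    print(apply_atomic_update(existing, update))
```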

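6.2 Optimistic Concurrency

Optimistic concurrency is a Solr feature that lets a client application make sure the document it is updating has not been modified by another client in the meantime.
It requires every indexed document to have a _version_ field, whose value is compared with the _version_ specified in the update command. Solr's default schema.xml already defines the _version_ field.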

<field name="_version_" type="long" indexed="false" stored="true" required="true" docValues="true"/>

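The usual optimistic concurrency workflow is:
(1) The client reads a document. Reading through /get ensures it has the latest version.
(2) The client modifies the document locally.
(3) The client submits the modified document to Solr through /update.
(4) If there is a version conflict (HTTP error 409), the client starts the process over.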
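
The four steps amount to a read-modify-write retry loop. The sketch below runs against an in-memory stand-in so that it needs no server; FakeStore, VersionConflict, and update_with_retry are all hypothetical names, and the fake store's small integer versions only imitate Solr's real _version_ values.

```python
class VersionConflict(Exception):
    """Stands in for an HTTP 409 response from /update."""


class FakeStore:
    """In-memory stand-in for Solr's /get and /update handlers."""

    def __init__(self):
        self.docs = {}
        self.clock = 1  # real Solr versions are much larger numbers

    def get(self, doc_id):
        return dict(self.docs[doc_id])

    def update(self, doc):
        current = self.docs.get(doc["id"])
        if current is not None and current["_version_"] != doc.get("_version_"):
            raise VersionConflict(doc["id"])
        self.clock += 1
        self.docs[doc["id"]] = {**doc, "_version_": self.clock}
        return self.clock


def update_with_retry(store, doc_id, modify, max_attempts=5):
    """Read, modify, write; start over whenever the version check fails."""
    for _ in range(max_attempts):
        doc = store.get(doc_id)          # step 1: read the latest version
        modify(doc)                      # step 2: change it locally
        try:
            return store.update(doc)     # step 3: submit with its _version_
        except VersionConflict:
            continue                     # step 4: conflict, repeat
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

Capping the number of attempts is a design choice of this sketch: under heavy contention an unbounded loop could spin indefinitely.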

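Update rules:
* If the _version_ value is greater than 1, the _version_ in the document must match the _version_ in the index.
* If the _version_ value equals 1, the document must exist and no version matching is performed; if the document does not exist, the update is rejected.
* If the _version_ value is less than 0, the document must not exist; if it does exist, the update is rejected.
* If the _version_ value equals 0, neither the document's existence nor its version is checked. An existing document is overwritten; otherwise the document is added.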
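
The four rules can be written as a single predicate. This is a sketch for illustration only; version_check is a hypothetical function, and existing is None when the document is not in the index.

```python
def version_check(requested, existing):
    """Return True if an update passes the _version_ rules listed above.

    requested: the _version_ sent with the update.
    existing:  the indexed document's _version_, or None if it is absent.
    """
    if requested > 1:
        return existing == requested   # must match exactly
    if requested == 1:
        return existing is not None    # must exist, any version
    if requested < 0:
        return existing is None        # must not exist
    return True                        # 0: no check at all


if __name__ == "__main__":
    print(version_check(999999, 1498562471222312960))   # wrong version
    print(version_check(1498562471222312960, 1498562471222312960))
```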

$ curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/techproducts/update?versions=true' --data-binary '
[ { "id" : "aaa" },
{ "id" : "bbb" } ]'
{"responseHeader":{"status":0,"QTime":6},
"adds":["aaa",1498562471222312960,
"bbb",1498562471225458688]}

$ curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/techproducts/update?_version_=999999&versions=true' \
--data-binary '
[{ "id" : "aaa",
"foo_s" : "update attempt with wrong existing version" }]'
{"responseHeader":{"status":409,"QTime":3},
"error":{"msg":"version conflict for aaa expected=999999 actual=1498562471222312960",
"code":409}}

$ curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/techproducts/update?_version_=1498562471222312960&versions=true&commit=true' \
--data-binary '
[{ "id" : "aaa",
"foo_s" : "update attempt with correct existing version" }]'
{"responseHeader":{"status":0,"QTime":5},
"adds":["aaa",1498562624496861184]}

$ curl 'http://localhost:8983/solr/techproducts/query?q=*:*&fl=id,_version_'
{
"responseHeader":{
"status":0,
"QTime":5,
"params":{
"fl":"id,_version_",
"q":"*:*"}},
"response":{"numFound":2,"start":0,"docs":[
{
"id":"bbb",
"_version_":1498562471225458688},
{
"id":"aaa",
"_version_":1498562624496861184}]
}}

7. De-duplication

Solr natively supports de-duplication through the Signature class, and new hash/signature implementations are easy to add.
The following Signature implementations are available:
MD5Signature: 128-bit hash
Lookup3Signature: 64-bit hash
TextProfileSignature: fuzzy hashing from Nutch

Configuration:
- solrconfig.xml
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">name,features,cat</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler" >
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
...
</requestHandler>

- schema.xml
<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />
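
The idea behind signature-based de-duplication is that documents agreeing on the configured fields (name, features, cat in the configuration above) hash to the same value. The sketch below illustrates this with Python's hashlib; md5_signature is a hypothetical function, and the way it feeds field names and values to the hash is an assumption, so its digests are not expected to match those produced by Solr's MD5Signature.

```python
import hashlib


def md5_signature(doc, fields):
    """Hash the chosen fields of doc into a 128-bit hex digest.

    Illustrative only: Solr's MD5Signature may feed bytes to the hash
    differently, so digests are not expected to match Solr's.
    """
    digest = hashlib.md5()
    for name in fields:
        values = doc.get(name, [])
        if not isinstance(values, list):
            values = [values]
        for value in values:
            digest.update(name.encode("utf-8"))
            digest.update(str(value).encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    fields = ["name", "features", "cat"]
    a = {"id": "1", "name": "Widget", "features": ["x", "y"], "cat": "tools"}
    b = {"id": "2", "name": "Widget", "features": ["x", "y"], "cat": "tools"}
    # Different ids, same signature fields: the two are treated as duplicates.
    print(md5_signature(a, fields) == md5_signature(b, fields))
```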

8. Detecting Languages During Indexing

The langid UpdateRequestProcessor can detect languages at index time and map text to language-specific fields. Solr supports two implementations:
* Tika's language detection feature: http://tika.apache.org/0.10/detection.html
* LangDetect language detection: http://code.google.com/p/language-detection/

<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
<lst name="defaults">
<str name="langid.fl">title,subject,text,keywords</str>
<str name="langid.langField">language_s</str>
</lst>
</processor>

<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
<lst name="defaults">
<str name="langid.fl">title,subject,text,keywords</str>
<str name="langid.langField">language_s</str>
</lst>
</processor>

9. Content Stream

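When a SolrRequestHandler is accessed through a URL path, the SolrQueryRequest object that holds the request parameters may also contain a list of ContentStreams carrying the bulk data for the request.

10. UIMA Integration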

Apache Unstructured Information Management Architecture (UIMA)
