If you are looking for a way to index a file in Elasticsearch and search through its contents, this post is for you. Elasticsearch can index many common file types (e.g. doc, docx, pdf, ppt, xls). The text is extracted with Tika, Apache's text extraction library.
The file needs to be encoded as Base64 and stored in a field of mapping type “attachment”. Unlike the String type, this mapping type is not available by default, so we have to add it using an external plugin.
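To make the document shape concrete, here is a minimal sketch of what ends up in the attachment field. The helper name attachment_doc and the use of the base64 command-line tool are assumptions for illustration; the field name "file" matches the mapping used later in this post.

```shell
#!/bin/sh
# attachment_doc is a hypothetical helper: it reads raw bytes on stdin and
# prints a JSON document with the Base64 text in a field named "file".
attachment_doc() {
  # tr strips the line breaks that some base64 tools insert into long
  # output, because a raw newline inside a JSON string is invalid.
  encoded=$(base64 | tr -d '\n')
  printf '{"file":"%s"}\n' "$encoded"
}

# Encode the five bytes of "Hello".
printf 'Hello' | attachment_doc
# prints {"file":"SGVsbG8="}
```

A real file would be passed the same way, for example attachment_doc < resume.pdf, where resume.pdf is a placeholder name.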
Below are the steps to index a file.
Install plugin:
<ElasticsearchDirectory>/bin/plugin install elasticsearch/elasticsearch-mapper-attachments/<version>
The attachment mapper version for each Elasticsearch version is listed below:
Elasticsearch | Attachment mapper
es-1.7 | 2.7.1
es-1.6 | 2.6.0
es-1.5 | 2.5.0
es-1.4 | 2.4.3
es-1.3 | 2.3.2
es-1.2 | 2.2.1
es-1.1 | 2.0.0
es-1.0 | 2.0.0
es-0.90 | 1.9.0
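If you script your installs, the version table can be turned into a small lookup. This is only a sketch: plugin_version and install_command are hypothetical helper names, and the script prints the install command instead of running it, so you can check it first.

```shell
#!/bin/sh
# Map an Elasticsearch release line to the attachment mapper version
# from the table in this post. plugin_version is a hypothetical helper.
plugin_version() {
  case "$1" in
    1.7)     echo "2.7.1" ;;
    1.6)     echo "2.6.0" ;;
    1.5)     echo "2.5.0" ;;
    1.4)     echo "2.4.3" ;;
    1.3)     echo "2.3.2" ;;
    1.2)     echo "2.2.1" ;;
    1.1|1.0) echo "2.0.0" ;;
    0.90)    echo "1.9.0" ;;
    *)       echo "unknown Elasticsearch version: $1" >&2; return 1 ;;
  esac
}

# Print (rather than run) the install command for a release line.
install_command() {
  version=$(plugin_version "$1") || return 1
  echo "bin/plugin install elasticsearch/elasticsearch-mapper-attachments/$version"
}

install_command 1.7
# prints bin/plugin install elasticsearch/elasticsearch-mapper-attachments/2.7.1
```

Run the printed command from your Elasticsearch directory. Versions not in the table make the helper fail with a message on stderr.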
Mapping field type
curl -XPUT http://localhost:9200/simplyjava/resume/_mapping -d '{
  "resume": {
    "properties": {
      "file": {
        "type": "attachment"
      }
    }
  }
}'
Analyzers can also be added to the attachment type.
Index file in Elasticsearch
Using Script:
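#!/bin/sh
# Slurp the whole file (-0777) and encode it as one Base64 line;
# the empty second argument keeps newlines out of the JSON string.
encoded=`cat <filename> | perl -MMIME::Base64 -0777 -ne 'print encode_base64($_, "")'`
# Wrap the encoded text in a JSON document with a single "file" field.
resume="{\"file\":\"${encoded}\"}"
echo "$resume" > resume.file
# Index the document into index "simplyjava", type "resume", id "user1".
curl -X POST "localhost:9200/simplyjava/resume/user1" -d @resume.file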
Using Java code:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Read the whole file into memory.
byte[] byteArray = Files.readAllBytes(Paths.get(<filepath>));
// Encode as Base64 (java.util.Base64 requires Java 8 or later).
String encodedFile = Base64.getEncoder().encodeToString(byteArray);
Map<String, Object> json = new HashMap<String, Object>();
json.put("file", encodedFile);
// "client" is an already initialized Elasticsearch Client.
client.prepareIndex("simplyjava", "resume", "user1").setSource(json).get();
Both the script and the Java code store the file as a Base64-encoded string in the type “resume” under the id “user1”, in the field “file”.
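Once the document is indexed, you can try searching its contents. The sketch below only builds the query body. search_body is a hypothetical helper name, and querying the field "file" directly is an assumption: depending on the plugin version, the extracted text may instead be exposed under a sub-field such as "file.content", so check the plugin documentation for your version.

```shell
#!/bin/sh
# search_body is a hypothetical helper that prints a match query against
# the attachment field. The field name "file" is an assumption (see above).
# The term is inserted verbatim, so it must not contain quotes or backslashes.
search_body() {
  printf '{"query":{"match":{"file":"%s"}}}\n' "$1"
}

search_body "java"
# prints {"query":{"match":{"file":"java"}}}

# To send it to a running node (not executed here):
# curl -XPOST "localhost:9200/simplyjava/resume/_search" -d "$(search_body java)"
```

If the query returns no hits for a word you know is in the file, the field name is the first thing to check.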