Sunday, 10 January 2016

Indexing a file in Elasticsearch using mapper attachment

If you are looking for a way to index a file in Elasticsearch and search through its contents, this post is for you. Elasticsearch can index files of many common types (e.g. doc, docx, pdf, ppt, xls). The text is extracted with Apache Tika, Apache's text extraction library.

The file must be base64-encoded and stored in a field whose mapping type is “attachment”. Unlike built-in types such as string, this mapping type is not available by default, so it has to be added through an external plugin.
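
The encoding step can be sketched in Java as follows. This is only a minimal sketch, assuming Java 8 or later (for java.util.Base64); the class name AttachmentSource and the method buildSource are made up for illustration, and the sample bytes stand in for real file content.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch only: AttachmentSource and buildSource are hypothetical names,
// not part of Elasticsearch or any other library.
class AttachmentSource {

    // Encode raw file bytes as one unbroken base64 string and wrap them
    // in a JSON object whose single field is named "file".
    static String buildSource(byte[] fileBytes) {
        String encoded = Base64.getEncoder().encodeToString(fileBytes);
        // The standard base64 alphabet (A-Z, a-z, 0-9, +, /, =) needs no JSON escaping.
        return "{\"file\":\"" + encoded + "\"}";
    }

    public static void main(String[] args) {
        // Stand-in for real file content; a real caller would read the bytes from disk.
        byte[] sample = "Hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(buildSource(sample)); // prints {"file":"SGVsbG8="}
    }
}
```

The encoder from Base64.getEncoder() does not insert line breaks, which matters here because a JSON string value cannot contain raw newlines.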

Below are the steps to index a file.

  Install plugin:

<ElasticsearchDirectory>/bin/plugin install elasticsearch/elasticsearch-mapper-attachments/<version>

The attachment mapper version for each Elasticsearch version is listed below:

Elasticsearch version    Attachment mapper version
es-1.7                   2.7.1
es-1.6                   2.6.0
es-1.5                   2.5.0
es-1.4                   2.4.3
es-1.3                   2.3.2
es-1.2                   2.2.1
es-1.1                   2.0.0
es-1.0                   2.0.0
es-0.90                  1.9.0
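
To pick the right plugin version programmatically, the version list above can be turned into a small lookup. The following is a rough sketch rather than an official tool: the class AttachmentPluginVersions and the method installCommand are hypothetical names, and only the 1.x entries of the list are included to keep it short.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: AttachmentPluginVersions and installCommand are hypothetical names.
class AttachmentPluginVersions {

    // Elasticsearch version -> attachment mapper version, copied from the list above.
    private static final Map<String, String> VERSIONS = new LinkedHashMap<>();
    static {
        VERSIONS.put("1.7", "2.7.1");
        VERSIONS.put("1.6", "2.6.0");
        VERSIONS.put("1.5", "2.5.0");
        VERSIONS.put("1.4", "2.4.3");
        VERSIONS.put("1.3", "2.3.2");
        VERSIONS.put("1.2", "2.2.1");
        VERSIONS.put("1.1", "2.0.0");
        VERSIONS.put("1.0", "2.0.0");
    }

    // Build the install command for a given Elasticsearch version,
    // relative to the Elasticsearch directory.
    static String installCommand(String esVersion) {
        String pluginVersion = VERSIONS.get(esVersion);
        if (pluginVersion == null) {
            throw new IllegalArgumentException(
                "No attachment mapper version listed for Elasticsearch " + esVersion);
        }
        return "bin/plugin install elasticsearch/elasticsearch-mapper-attachments/" + pluginVersion;
    }

    public static void main(String[] args) {
        System.out.println(installCommand("1.7"));
    }
}
```

Failing with an exception for an unlisted version is a deliberate choice in this sketch: installing a mismatched plugin version is better caught early than guessed at.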

Mapping field type

The index “simplyjava” must already exist before the mapping is added.

        curl -XPUT http://localhost:9200/simplyjava/resume/_mapping -d '{
                "resume": {
                        "properties": {
                                "file": {
                                        "type": "attachment"
                                }
                        }
                }
        }'

Analyzers can also be added to the attachment type.


   Index file in Elasticsearch

    Using Script:

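#!/bin/sh
# Slurp the whole file (-0777) and encode it as one base64 string with no
# line breaks (second argument ""), so it is valid inside a JSON string.
encoded=`cat <filename> | perl -MMIME::Base64 -0777 -ne 'print encode_base64($_, "")'`
# Wrap the encoded content in a JSON document with a single field "file".
resume="{\"file\":\"${encoded}\"}"
echo "$resume" > resume.file
# Index the document into index "simplyjava", type "resume", id "user1".
curl -X POST "localhost:9200/simplyjava/resume/user1" -d @resume.file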

   Using Java code:

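        // Requires Java 8+ for java.util.Base64; "client" is an existing Elasticsearch Client.
        import java.nio.file.Files;
        import java.nio.file.Paths;
        import java.util.Base64;
        import java.util.HashMap;
        import java.util.Map;

        // Read the whole file and encode it as a single-line base64 string.
        byte[] byteArray = Files.readAllBytes(Paths.get(<filepath>));
        String encodedFile = Base64.getEncoder().encodeToString(byteArray);

        // Build the document source: the "file" field holds the encoded content.
        Map<String, Object> json = new HashMap<String, Object>();
        json.put("file", encodedFile);

        // Index it into index "simplyjava", type "resume", with id "user1".
        client.prepareIndex("simplyjava", "resume", "user1").setSource(json).get();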

Both the script and the Java code store the file as a base64-encoded string in the type “resume”, under the id “user1”, in the field “file”.
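
Once a file is indexed, its contents can be searched. The sketch below only builds the request body for a query_string search, which could then be sent to localhost:9200/simplyjava/resume/_search. It assumes the default _all field is enabled so that the extracted text is reachable without naming a field; the class AttachmentSearch and its methods are hypothetical names, and the exact sub-field that holds the extracted text may differ between plugin versions.

```java
// Sketch only: AttachmentSearch, escapeJson and searchBody are hypothetical names.
class AttachmentSearch {

    // Escape backslashes and double quotes so the text is safe inside a JSON string.
    // Control characters are not handled in this sketch.
    static String escapeJson(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Build a query_string search body; with no field given, Elasticsearch
    // searches its default field (assumed here to be _all).
    static String searchBody(String text) {
        return "{\"query\":{\"query_string\":{\"query\":\"" + escapeJson(text) + "\"}}}";
    }

    public static void main(String[] args) {
        System.out.println(searchBody("java developer"));
    }
}
```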
