Apache Lucene Demo

Posted on 2017-01-29 | Edited on 2018-12-16 | In search

IndexFiles

As we discussed in the previous walk-through, the IndexFiles class creates a Lucene Index. Let’s take a look at how it does this.

The main() method parses the command-line parameters, then in preparation for instantiating IndexWriter, opens a Directory, and instantiates StandardAnalyzer and IndexWriterConfig.

for (int i = 0; i < args.length; i++) {
  if ("-index".equals(args[i])) {
    indexPath = args[i + 1];
    i++;
  } else if ("-docs".equals(args[i])) {
    docsPath = args[i + 1];
    i++;
  } else if ("-update".equals(args[i])) {
    create = false;
  }
}

The value of the -index command-line parameter is the name of the filesystem directory where all index information should be stored. If IndexFiles is invoked with a relative path in -index, or with -index omitted (so the default relative path "index" is used), the index path is created as a subdirectory of the current working directory, if it does not already exist. On some platforms the index may instead be created in a different directory, such as the user's home directory.

The -docs command-line parameter value is the location of the directory containing files to be indexed.

The -update command-line parameter tells IndexFiles not to delete the index if it already exists. When -update is not given, IndexFiles will first wipe the slate clean before indexing any documents.

Lucene Directorys are used by the IndexWriter to store information in the index. In addition to the FSDirectory implementation we are using, there are several other Directory subclasses that can write to RAM, to databases, etc.
(the index can live in files, in memory, in a database, and so on)

Directory dir = FSDirectory.open(Paths.get(indexPath));

Lucene Analyzers are processing pipelines that break up text into indexed tokens, a.k.a. terms, and optionally perform other operations on these tokens, e.g. downcasing, synonym insertion, filtering out unwanted tokens, etc. The Analyzer we are using is StandardAnalyzer, which creates tokens using the Word Break rules from the Unicode Text Segmentation algorithm specified in Unicode Standard Annex #29; converts tokens to lowercase; and then filters out stopwords. Stopwords are common language words such as articles (a, an, the, etc.) and other tokens that may have less value for searching. It should be noted that there are different rules for every language, and you should use the proper analyzer for each. Lucene currently provides Analyzers for a number of different languages (see the javadocs under lucene/analysis/common/src/java/org/apache/lucene/analysis).
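
To see what StandardAnalyzer actually produces, here is a minimal sketch that prints the tokens for a sample sentence (the field name "contents" and the sample text are arbitrary; imports from org.apache.lucene.analysis and java.io are omitted, as in the other snippets here):

Analyzer analyzer = new StandardAnalyzer();
try (TokenStream ts = analyzer.tokenStream("contents", new StringReader("The Quick Brown FOX!"))) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        // with the default stop set this prints: quick / brown / fox ("the" is a stopword)
        System.out.println(term.toString());
    }
    ts.end();
}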

The IndexWriterConfig instance holds all configuration for IndexWriter. For example, we set the OpenMode to use here based on the value of the -update command-line parameter.
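
In code, that looks roughly like this (a sketch of the relevant lines; create is the flag set by the argument parsing above):

IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
if (create) {
    // Create a new index, removing any previously indexed documents:
    iwc.setOpenMode(OpenMode.CREATE);
} else {
    // Add new documents to an existing index:
    iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
}
IndexWriter writer = new IndexWriter(dir, iwc);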

Looking further down in the file, after IndexWriter is instantiated, you should see the indexDocs() code. This recursive function crawls the directories and creates Document objects. The Document is simply a data object to represent the text content from the file as well as its creation time and location. These instances are added to the IndexWriter. If the -update command-line parameter is given, the IndexWriterConfig OpenMode will be set to OpenMode.CREATE_OR_APPEND, and rather than adding documents to the index, the IndexWriter will update them in the index by attempting to find an already-indexed document with the same identifier (in our case, the file path serves as the identifier); deleting it from the index if it exists; and then adding the new document to the index.

/** Indexes a single document */
static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
    try (InputStream stream = Files.newInputStream(file)) {
        // make a new, empty document
        Document doc = new Document();

        // Add the path of the file as a field named "path". Use a
        // field that is indexed (i.e. searchable), but don't tokenize
        // the field into separate words and don't index term frequency
        // or positional information:
        Field pathField = new StringField("path", file.toString(), Field.Store.YES);
        doc.add(pathField);

        // Add the last modified date of the file as a field named "modified".
        // Use a LongPoint that is indexed (i.e. efficiently filterable with
        // PointRangeQuery). This indexes to milli-second resolution, which
        // is often too fine. You could instead create a number based on
        // year/month/day/hour/minutes/seconds, down the resolution you require.
        // For example the long value 2011021714 would mean
        // February 17, 2011, 2-3 PM.
        doc.add(new LongPoint("modified", lastModified));

        // Add the contents of the file to a field named "contents". Specify a Reader,
        // so that the text of the file is tokenized and indexed, but not stored.
        // Note that FileReader expects the file to be in UTF-8 encoding.
        // If that's not the case searching for special characters will fail.
        doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));

        if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
            // New index, so we just add the document (no old document can be there):
            System.out.println("adding " + file);
            writer.addDocument(doc);
        } else {
            // Existing index (an old copy of this document may have been indexed) so
            // we use updateDocument instead to replace the old one matching the exact
            // path, if present:
            System.out.println("updating " + file);
            writer.updateDocument(new Term("path", file.toString()), doc);
        }
    }
}

Searching Files

The SearchFiles class is quite simple. It primarily collaborates with an IndexSearcher, StandardAnalyzer (which is used in the IndexFiles class as well), and a QueryParser. The query parser is constructed with an analyzer used to interpret your query text in the same way the documents are interpreted: finding word boundaries, downcasing, and removing useless words like 'a', 'an' and 'the'. The QueryParser produces a Query object, which is passed to the searcher. Note that it's also possible to programmatically construct a rich Query object without using the query parser. The query parser just enables decoding the Lucene query syntax into the corresponding Query object.

SearchFiles uses the IndexSearcher.search(query,n) method that returns TopDocs with max n hits. The results are printed in pages, sorted by score (i.e. relevance).

// Collect enough docs to show 5 pages
TopDocs results = searcher.search(query, 5 * hitsPerPage);
ScoreDoc[] hits = results.scoreDocs;
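
Putting the pieces together, the search side looks roughly like this (a sketch; queryString and hitsPerPage stand in for the values SearchFiles reads from its arguments):

QueryParser parser = new QueryParser("contents", analyzer);
Query query = parser.parse(queryString);

TopDocs results = searcher.search(query, 5 * hitsPerPage);
for (ScoreDoc hit : results.scoreDocs) {
    Document doc = searcher.doc(hit.doc);               // fetch the stored fields
    System.out.println(doc.get("path") + "  score=" + hit.score);
}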

lucene demo

Posted on 2017-01-28 | Edited on 2018-12-16 | In search

Build Index

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
import java.io.File;
import java.io.FileReader;
import java.nio.file.Paths;

public class Indexer {
    public IndexWriter writer;

    /**
     * Create the index writer.
     */
    public Indexer(String indexDir) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();                        // tokenizer
        IndexWriterConfig writerConfig = new IndexWriterConfig(analyzer); // writer configuration
        //Directory ramDirectory = new RAMDirectory();                    // keep the index in memory
        Directory directory = FSDirectory.open(Paths.get(indexDir));      // keep the index on disk
        writer = new IndexWriter(directory, writerConfig);                // the index writer itself
    }

    /**
     * Close the index writer.
     * @throws Exception
     */
    public void close() throws Exception {
        writer.close();
    }

    /**
     * Index every file under the given directory.
     * @param dataDir
     * @return
     * @throws Exception
     */
    public int index(String dataDir) throws Exception {
        File[] files = new File(dataDir).listFiles(); // the files in the data directory
        for (File file : files) {
            indexFile(file);
        }
        return writer.numDocs();
    }

    public void indexFile(File file) throws Exception {
        System.out.println("indexing file: " + file.getCanonicalPath()); // log the file being indexed
        Document document = getDocument(file); // one Document per file, like one table row
        writer.addDocument(document);          // write it into the index, like inserting a row
    }

    /**
     * Build one document record.
     * @param file
     * @return
     * @throws Exception
     */
    public Document getDocument(File file) throws Exception {
        Document document = new Document();    // a new, empty document
        document.add(new TextField("context", new FileReader(file)));                      // file content, like a table column
        document.add(new TextField("fileName", file.getName(), Field.Store.YES));          // file name
        document.add(new TextField("filePath", file.getCanonicalPath(), Field.Store.YES)); // file path
        return document;
    }

    public static void main(String[] args) {
        String indexDir = "E:\\LuceneIndex";
        String dataDir = "E:\\LuceneTestData";
        Indexer indexer = null;
        int indexSum = 0;
        try {
            indexer = new Indexer(indexDir);
            indexSum = indexer.index(dataDir);
            System.out.println("indexed " + indexSum + " files");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (indexer != null) {
                    indexer.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

Read & Query
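
The code block originally shown here repeated the Indexer verbatim, so here instead is a minimal read-and-query sketch matching the heading. It assumes the index built by the Indexer above and its field names ("context", "fileName", "filePath"); the Searcher class name and the sample query are illustrative only.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class Searcher {
    public static void search(String indexDir, String q) throws Exception {
        Directory directory = FSDirectory.open(Paths.get(indexDir)); // where the Indexer wrote the index
        IndexReader reader = DirectoryReader.open(directory);        // read-only view of the index
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer();                  // must match the analyzer used at index time
        QueryParser parser = new QueryParser("context", analyzer);   // query the file-content field
        Query query = parser.parse(q);
        TopDocs hits = searcher.search(query, 10);                   // top 10 hits, sorted by score
        System.out.println("matched " + hits.totalHits + " documents");
        for (ScoreDoc sd : hits.scoreDocs) {
            Document doc = searcher.doc(sd.doc);
            System.out.println(doc.get("fileName") + " : " + doc.get("filePath"));
        }
        reader.close();
        directory.close();
    }

    public static void main(String[] args) {
        try {
            search("E:\\LuceneIndex", "java");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}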


lucene concept

Posted on 2017-01-28 | Edited on 2018-12-16 | In search

Document

A Document describes a single document: an HTML page, an email, or a text file, for example. A Document is made up of a series of Fields. You can think of a database record as a Document and of its columns as Field objects.

Field

A Field describes one property of a Document; for example, an email's title and content can be described by two Fields.
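
For instance, an email could be modeled with two fields (a minimal sketch using the Lucene field types shown elsewhere in these posts; the field names are made up):

Document mail = new Document();
mail.add(new TextField("title", "Weekly report", Field.Store.YES));              // tokenized and stored
mail.add(new TextField("content", "All metrics are green ...", Field.Store.NO)); // tokenized, not stored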

Analyzer

Before a Document is indexed, its content must first be tokenized, and the Analyzer does that job. Analyzer is an abstract class with many implementations; each language calls for the right Analyzer. After analysis, the resulting tokens are handed to IndexWriter to build the index.

IndexWriter

IndexWriter is the core class Lucene uses to build an index; its job is to write every Document into the index.

Directory

This class represents the place where a Lucene index is stored. It is an abstract class with two main implementations: FSDirectory, which keeps the index in the file system, and RAMDirectory, which keeps it in memory.

Query

Query is an abstract class with many implementations, such as TermQuery, BooleanQuery, and PrefixQuery. Its task is to package the user's query string into a Query object that Lucene can recognize.
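
A sketch of building queries programmatically, using the field names from the Indexer example above (imports from org.apache.lucene.search and org.apache.lucene.index omitted):

Query byName = new TermQuery(new Term("fileName", "readme.txt")); // exact term match
Query byPrefix = new PrefixQuery(new Term("fileName", "read"));   // terms starting with "read"
Query combined = new BooleanQuery.Builder()
        .add(byName, BooleanClause.Occur.SHOULD)
        .add(byPrefix, BooleanClause.Occur.SHOULD)
        .build();                                                 // matches either clause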

IndexSearcher

IndexSearcher is used to search a built index. It opens an index in read-only mode, so many IndexSearcher instances can operate on a single index at the same time.

Hits

Hits is used to hold search results.

Apache Lucene Core

Posted on 2017-01-27 | Edited on 2020-09-17 | In search

Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
Apache Lucene is an open source project available for free download.

Features

Lucene offers powerful features through a simple API:

Scalable, High-Performance Indexing

  • over 150GB/hour on modern hardware
  • small RAM requirements — only 1MB heap
  • incremental indexing as fast as batch indexing
  • index size roughly 20-30% the size of text indexed

Powerful, Accurate and Efficient Search Algorithms

  • ranked searching — best results returned first
  • many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
  • fielded searching (e.g. title, author, contents)
  • sorting by any field
  • multiple-index searching with merged results
  • allows simultaneous update and searching
  • flexible faceting, highlighting, joins and result grouping
  • fast, memory-efficient and typo-tolerant suggesters
  • pluggable ranking models, including the Vector Space Model and Okapi BM25
  • configurable storage engine (codecs)

Cross-Platform Solution

  • Available as Open Source software under the Apache License which lets you use Lucene in both commercial and Open Source programs
  • 100%-pure Java
  • Implementations in other programming languages available that are index-compatible

The Apache Software Foundation

The Apache Software Foundation provides support for the Apache community of open-source software projects. The Apache projects are defined by collaborative consensus based processes, an open, pragmatic software license and a desire to create high quality software that leads the way in its field. Apache Lucene, Apache Solr, Apache PyLucene, Apache Open Relevance Project and their respective logos are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.

collection and array

Posted on 2017-01-26 | Edited on 2020-09-17 | In java

Collection to array

  1. Object[]

     Object[] listArray = list.toArray();

  2. specific array

     String[] listArray = (String[]) list.toArray(new String[0]);

ps. this doesn't work for an array of a generic type parameter, since you can't create a T[] directly

Array to collection

List<String> list = Arrays.asList(array);

ps. a primitive-type array can't be converted this way; the elements must be objects (see the sketch below)
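
A quick sketch of the primitive pitfall; Arrays.asList sees an int[] as one object, not three ints:

import java.util.Arrays;
import java.util.List;

public class AsListPitfall {
    public static void main(String[] args) {
        int[] primitives = {1, 2, 3};
        List<int[]> wrong = Arrays.asList(primitives); // a one-element list holding the array itself
        System.out.println(wrong.size());              // 1

        Integer[] boxed = {1, 2, 3};
        List<Integer> right = Arrays.asList(boxed);    // the elements are objects, so this works
        System.out.println(right.size());              // 3
    }
}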

gzip stream

Posted on 2017-01-26 | Edited on 2018-12-16 | In java

Use GZIPOutputStream (or ZipOutputStream) to wrap the output. Whether the destination is a file or a socket is not the point; we usually represent the data source as a generic InputStream and the compressed destination as a generic OutputStream.

private static final int BUFFER = 4096; // assumed buffer size; BUFFER was not defined in the original snippet

public static void compress(InputStream is, OutputStream os)
        throws Exception {

    GZIPOutputStream gos = new GZIPOutputStream(os);

    int count;
    byte[] data = new byte[BUFFER];
    while ((count = is.read(data, 0, BUFFER)) != -1) {
        gos.write(data, 0, count);
    }

    gos.finish();
    gos.flush();
    gos.close();
}
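
A hypothetical call site, compressing one file into a .gz file (the file names are placeholders):

try (InputStream in = new FileInputStream("data.txt");
     OutputStream out = new FileOutputStream("data.txt.gz")) {
    compress(in, out);
}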

url encoding

Posted on 2017-01-24 | Edited on 2020-09-17 | In tool
  • RFC 3986 section 2.2 reserved (January 2005)

    ! * ' ( ) ; : @ & = + $ , / ? # [ ]

  • RFC 3986 section 2.3 unreserved (January 2005)

    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    0 1 2 3 4 5 6 7 8 9 - _ . ~

  • RFC 2396 URI Generic Syntax reserved (August 1998)

    ;  /  ?  :  @  &  =  +  $  ,

  • RFC 2396 URI Generic Syntax unreserved (August 1998)

    alphanum  or  mark
    mark = - _ . ! ~ * ' ( )

Java uses the older RFC. For compatibility, java.net.URLEncoder leaves unescaped the set that all browsers treat as unreserved: the RFC 3986 unreserved set without '~', plus '*'.

/*
* Unreserved characters can be escaped without changing the
* semantics of the URI, but this should not be done unless the
* URI is being used in a context that does not allow the
* unescaped character to appear.
*/
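
A small check of what java.net.URLEncoder actually leaves alone; note that '~' is escaped, '*' is not, and space becomes '+':

import java.net.URLEncoder;

public class EncodeDemo {
    public static void main(String[] args) throws Exception {
        System.out.println(URLEncoder.encode("a-b_c.d~e*f g", "UTF-8"));
        // prints: a-b_c.d%7Ee*f+g
    }
}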

regular expression

/^((ht|f)tps?):\/\/[\w\-]+(\.[\w\-]+)+([\w\-\.,@?^=%&:\/~\+#]*[\w\-\@?^=%&\/~\+#])?$/

  • must start with http, https, ftp, or ftps
  • must not contain double-byte characters or characters outside the sets above
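
The same pattern transcribed into Java (a sketch; the Java string just drops the JavaScript-style delimiters and the '/' escapes):

import java.util.regex.Pattern;

public class UrlCheck {
    private static final Pattern URL = Pattern.compile(
        "^((ht|f)tps?)://[\\w\\-]+(\\.[\\w\\-]+)+([\\w\\-.,@?^=%&:/~+#]*[\\w\\-@?^=%&/~+#])?$");

    public static void main(String[] args) {
        System.out.println(URL.matcher("https://example.com/path?x=1").matches()); // true
        System.out.println(URL.matcher("example.com").matches());                  // false, no scheme
    }
}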

mybatis foreach error

Posted on 2017-01-24 | Edited on 2018-12-16 | In java

Parameter ‘__frch_item_0’ not found. Available parameters are [list]

Mybatis parameter in list

  1. Check whether parameterType is java.util.List. If it is, the foreach's collection attribute must be "list": when you pass a List instance or an array as the parameter object, MyBatis automatically wraps it in a Map keyed by name, with a List instance keyed as "list" and an array keyed as "array", and hands that map to the foreach (a corrected mapper is sketched below).
  2. Does the list passed to foreach contain any values?
  3. Is a property name inside the foreach misspelled?
  4. MyBatis declares the field auto-increment but MySQL does not.
  5. The item's property name is wrong.

ps: using a Map instead of a Bean saves work, but the catch is that if a column in the query result is null, the corresponding key will be missing (null)
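
For case 1, a corrected mapper looks roughly like this when the parameter is a bare List; the statement id and column names here are made up:

<select id="countByIds" resultType="_int" parameterType="list">
    select count(*) from users
    where id in
    <foreach item="item" collection="list" open="(" separator="," close=")">
        #{item}
    </foreach>
</select>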

hexo install

Posted on 2017-01-24 | Edited on 2019-01-09 | In blog

Install & theme

Use cnpm:
npm install -g cnpm --registry=https://registry.npm.taobao.org

sudo cnpm install hexo-cli -g
hexo init blog
cd blog

cnpm install
hexo server

# optional: theme and extra renderers (may skip)
git clone https://github.com/tufu9441/maupassant-hexo.git themes/maupassant
npm install hexo-renderer-jade --save
npm install hexo-renderer-sass --save

npm install hexo-tag-katex --save

Optimization

  • enter themes\landscape\layout\_partial, open head.ejs, and delete the fonts.googleapis.com reference on line 31

  • download jquery-2.0.3.min.js and put it into themes\landscape\source\js, then enter themes\landscape\layout\_partial, open after-footer.ejs, and change line 17 to reference /js/jquery-2.0.3.min.js.

mybatis foreach

Posted on 2017-01-24 | Edited on 2018-12-16 | In java
Option Description
item The object for the current iteration of the loop. Dotted property paths are supported, e.g. item.age or item.info.details. For a list or array it is the element itself; for a map it is the value. Required.
collection The object to iterate over. When the parameter itself is the collection, a List<?> is keyed as "list" by default, an array as "array", and a Map as "map". You can set your own key with @Param("keyName"), after which "list"/"array"/"map" no longer apply. The collection can also be a field of a parameter object: if User has a property List ids and the parameter is a User, then collection="ids"; if User has a property Ids ids, where Ids has a property List id, then collection="ids.id". In short, collection names whatever element you want to loop over. Required.
separator The separator placed between elements. For example, in in(), separator="," inserts the commas automatically, avoiding SQL errors such as in(1,2,) caused by hand-written commas. Optional.
open The opening token of the generated fragment, usually "(" paired with close=")". Commonly used with in() and values(). Optional.
close The closing token of the generated fragment, usually ")" paired with open="(". Optional.
index For a list or array, index is the element's position; for a map, index is the element's key. Optional.

select count(*) from users WHERE id in ( ? , ? )

<select id="countByUserList" resultType="_int" parameterType="list">
    select count(*) from users
    <where>
        id in
        <foreach item="item" collection="list" separator="," open="(" close=")" index="">
            #{item.id, jdbcType=NUMERIC}
        </foreach>
    </where>
</select>

insert into deliver select ?,? from dual union all select ?,? from dual

<insert id="addList">
    INSERT INTO DELIVER
    (
    <include refid="selectAllColumnsSql"/>
    )

    <foreach collection="deliverList" item="item" separator="UNION ALL">
        SELECT
        #{item.id, jdbcType=NUMERIC},
        #{item.name, jdbcType=VARCHAR}
        FROM DUAL
    </foreach>
</insert>

insert into string_string (key, value) values (?, ?) , (?, ?)

<insert id="ins_string_string">
    insert into string_string (key, value) values
    <foreach item="item" index="key" collection="map"
             open="" separator="," close="">(#{key}, #{item})</foreach>
</insert>

select count(*) from key_cols where col_a = ? AND col_b = ?

<select id="sel_key_cols" resultType="int">
    select count(*) from key_cols where
    <foreach item="item" index="key" collection="map"
             open="" separator="AND" close="">${key} = #{item}</foreach>
</select>

ps: pay close attention to the difference between $ and #: a $ parameter is spliced directly into the SQL text, while a # parameter is replaced with ? and the value is bound when the statement executes.
