Lucene is a very powerful search engine (see the Lucene wiki). Best of all, it is open source and comes with many useful sub-projects. This post introduces one of them, Tika, a library that extracts both the content and the metadata of documents; the supported file formats are listed here. Combined with Lucene Core and the Chinese IKAnalyzer, it lets you put together Chinese full-text search over your documents.
First, download the following:
Lucene 4.3.1
Tika
IKAnalyzer
Lucene 4.3.1
- After downloading, extract the archive and copy the following jars:
- lucene-queryparser-4.3.1.jar
- lucene-queries-4.3.1.jar
- lucene-core-4.3.1.jar
- lucene-analyzers-common-4.3.1.jar
Tika 1.4
- Download the source archive, and also download Maven, because Tika recommends building the jars yourself; see here for build instructions. Then copy the following jars:
- tika-xmp-1.4.jar
- tika-server-1.4.jar
- tika-parsers-1.4.jar
- tika-core-1.4.jar
- tika-bundle-1.4.jar
- tika-app-1.4.jar
- original-tika-app-1.4.jar
IKAnalyzer
- After downloading, extract the archive and copy the following jar:
- IKAnalyzer2012FF_ul.jar
TvDocumentVo.java: a value object (VO) that holds the content and metadata Tika parses out of each document.
import org.apache.tika.metadata.Metadata;

public class TvDocumentVo {

    private Metadata metadata;
    private String content;

    public Metadata getMetadata() {
        return metadata;
    }

    public void setMetadata(Metadata metadata) {
        this.metadata = metadata;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }
}
TvDocmentExtract.java: uses Tika to parse every file under a directory and collect the results into a List<TvDocumentVo>. P.S. This part will need tuning once the number or size of the files gets large.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class TvDocmentExtract {

    public List<TvDocumentVo> parseAllFilesInDirectory(File directory)
            throws IOException, SAXException, TikaException {
        List<TvDocumentVo> result = new ArrayList<TvDocumentVo>();
        for (File file : directory.listFiles()) {
            if (file.isDirectory()) {
                // recurse into sub-directories and keep their results
                result.addAll(parseAllFilesInDirectory(file));
            } else {
                Parser parser = new AutoDetectParser();
                Metadata metadata = new Metadata();
                ParseContext parseContext = new ParseContext();
                // enlarge this write limit if the files are very large
                ContentHandler handler = new BodyContentHandler(100 * 100 * 1024);
                InputStream stream = new FileInputStream(file);
                try {
                    parser.parse(stream, handler, metadata, parseContext);
                } finally {
                    stream.close(); // avoid leaking file handles when scanning many files
                }
                TvDocumentVo vo = new TvDocumentVo();
                // add the file's absolute path as an extra metadata property
                metadata.set("filename", file.getAbsolutePath());
                vo.setMetadata(metadata);
                vo.setContent(handler.toString());
                result.add(vo);
            }
        }
        return result;
    }
}
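As a quick sanity check of the extractor, something like the following sketch (run inside any main method that throws Exception; the folder path is only an example) lists every parsed file with its detected type. Content-Type is one of the standard metadata keys Tika fills in during parsing:

// minimal smoke test for TvDocmentExtract; "D:\\document" is just a sample path
TvDocmentExtract extractor = new TvDocmentExtract();
List<TvDocumentVo> docs = extractor.parseAllFilesInDirectory(new File("D:\\document"));
for (TvDocumentVo vo : docs) {
    // "filename" is the property we set ourselves; "Content-Type" comes from Tika's detector
    System.out.println(vo.getMetadata().get("filename")
            + " => " + vo.getMetadata().get("Content-Type"));
}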
TvIndexManagement.java: builds the index.
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IOContext.Context;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.exception.TikaException;
import org.wltea.analyzer.lucene.IKAnalyzer;
import org.xml.sax.SAXException;

public class TvIndexManagement {

    public static void main(String[] args) throws ParseException, IOException {
        TvIndexManagement tim = new TvIndexManagement();
        // first argument: directory for the index; second argument: directory of documents to extract
        tim.createIndex("D:\\index", "D:\\document");
    }

    public void createIndex(String indexDir, String filesPath) {
        this.ikanalyzerIndex(indexDir, filesPath);
    }

    public RAMDirectory readfsIndexToRam(String indexDir) throws IOException {
        Directory fsDir = FSDirectory.open(new File(indexDir));
        IOContext ioContext = new IOContext(Context.DEFAULT);
        return new RAMDirectory(fsDir, ioContext);
    }

    private void ikanalyzerIndex(String indexDir, String filesPath) {
        Analyzer analyzer = new IKAnalyzer(true); // see the IKAnalyzer documentation
        try {
            Directory index = FSDirectory.open(new File(indexDir));
            // alternative: build the index in memory instead of on disk
            // Directory index = new RAMDirectory(fsDir, new IOContext(Context.DEFAULT));
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, analyzer);
            config.setMaxBufferedDocs(1000); // purely for performance-tuning experiments
            config.setRAMBufferSizeMB(40);   // purely for performance-tuning experiments
            IndexWriter w = new IndexWriter(index, config);
            TvDocmentExtract tde = new TvDocmentExtract();
            try {
                List<TvDocumentVo> lists = tde.parseAllFilesInDirectory(new File(filesPath));
                for (TvDocumentVo vo : lists) {
                    this.addDocument(w, vo.getMetadata().get("filename"), vo.getContent());
                }
                w.close();
            } catch (SAXException ex) {
                Logger.getLogger(TvIndexManagement.class.getName()).log(Level.SEVERE, null, ex);
            } catch (TikaException ex) {
                Logger.getLogger(TvIndexManagement.class.getName()).log(Level.SEVERE, null, ex);
            }
        } catch (IOException ex) {
            Logger.getLogger(TvIndexManagement.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    private void addDocument(IndexWriter writer, String filename, String content) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("filename", filename, Field.Store.YES));
        doc.add(new TextField("content", content, Field.Store.YES));
        writer.addDocument(doc);
    }
}
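The readfsIndexToRam method above is not wired into the flow yet; one way it could be used, purely as a sketch, is to load the finished on-disk index into memory before searching, trading RAM for faster reads:

// sketch: search against a copy of the index held in RAM instead of on disk
TvIndexManagement tim = new TvIndexManagement();
RAMDirectory ramIndex = tim.readfsIndexToRam("D:\\index");
IndexReader reader = DirectoryReader.open(ramIndex);
IndexSearcher searcher = new IndexSearcher(reader);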
Full-text search example code:
String querystr = "合法";
Directory index = FSDirectory.open(new File("D:\\index"));
// query with the same analyzer that built the index
Analyzer analyzer = new IKAnalyzer(true);
QueryParser qp = new QueryParser(Version.LUCENE_43, "content", analyzer);
qp.setDefaultOperator(QueryParser.Operator.AND); // require every term to match
Query q = qp.parse(querystr);

int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println("Found " + hits.length + " hits ==> " + collector.getTotalHits());
for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    System.out.println((i + 1) + ". " + d.get("filename"));
}
reader.close();
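To see what IKAnalyzer is actually contributing here, a small sketch like this (the sample sentence is arbitrary; classes come from org.apache.lucene.analysis and java.io) prints the tokens it segments out. The QueryParser above matches queries against exactly these tokens:

Analyzer analyzer = new IKAnalyzer(true); // true = smart (coarse-grained) segmentation
TokenStream ts = analyzer.tokenStream("content", new StringReader("合法的全文檢索範例"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(term.toString()); // one segmented word per line
}
ts.end();
ts.close();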
I'm currently experimenting with ways to speed up index building without resorting to Hadoop; the basic idea is a queue-based approach. More on that next time.
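As a preview, one possible shape for that pipeline, purely a sketch and not the final code (run inside a main that throws Exception), is a producer-consumer pair: an extraction thread pushes TvDocumentVo objects onto a BlockingQueue while an indexing thread drains them into the IndexWriter, which is safe to share across threads in Lucene:

final BlockingQueue<TvDocumentVo> queue = new ArrayBlockingQueue<TvDocumentVo>(100);
final TvDocumentVo POISON = new TvDocumentVo(); // marks the end of the stream
final IndexWriter writer = new IndexWriter(FSDirectory.open(new File("D:\\index")),
        new IndexWriterConfig(Version.LUCENE_43, new IKAnalyzer(true)));

// producer: Tika extraction fills the queue, blocking while it is full
Thread producer = new Thread(new Runnable() {
    public void run() {
        try {
            for (TvDocumentVo vo : new TvDocmentExtract()
                    .parseAllFilesInDirectory(new File("D:\\document"))) {
                queue.put(vo);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            try { queue.put(POISON); } catch (InterruptedException ignored) { }
        }
    }
});

// consumer: drains the queue into the IndexWriter
Thread consumer = new Thread(new Runnable() {
    public void run() {
        try {
            TvDocumentVo vo;
            while ((vo = queue.take()) != POISON) {
                Document doc = new Document();
                doc.add(new TextField("filename", vo.getMetadata().get("filename"), Field.Store.YES));
                doc.add(new TextField("content", vo.getContent(), Field.Store.YES));
                writer.addDocument(doc);
            }
            writer.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
});

producer.start();
consumer.start();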