266 lines
11 KiB
Plaintext
266 lines
11 KiB
Plaintext
* There are packages available for most linux distributions through the usual channels.
|
|
* The Clucene Sourceforge website also has some distributions available.
|
|
|
|
Also in this document is information how to build from source, troubleshooting,
|
|
performance, and how to create a new distribution.
|
|
|
|
|
|
Building from source:
|
|
--------------------
|
|
|
|
Dependencies:
|
|
* CMake version 2.4.2 or later.
|
|
* A functioning and fairly new C++ compiler. We test mostly on GCC and Visual Studio 6+.
|
|
Anything other than that may not work.
|
|
* Something to unzip/untar the source code.
|
|
|
|
Build instructions:
|
|
1.) Download the latest sourcecode from http://www.sourceforge.net/projects/clucene
|
|
[Choose stable if you want the 'time tested' version of code. However, often
|
|
the unstable version will suite your needs more since it is newer and has had
|
|
more work put into it. The decision is up to you.]
|
|
2.) Unpack the tarball/zip/bzip/whatever
|
|
3.) Open a command prompt, terminal window, or cygwin session.
|
|
4.) Change directory into the root of the sourcecode (from now on referred to as <clucene>)
|
|
# cd <clucene>
|
|
5.) Create and change directory into an 'out-of-source' directory for your build.
|
|
[This is by far the easiest way to build, it has the benefit of being able to
|
|
create different types of builds in the same source-tree.]
|
|
# mkdir <clucene>/build-name
|
|
# cd <clucene>/build-name
|
|
6.) Configure using cmake. This can be done many different ways, but the basic syntax is
|
|
# cmake [-G "Script name"] ..
|
|
[Where "Script name" is the name of the scripts to build (e.g. Visual Studio 8 2005).
|
|
A list of supported build scripts can be found by]
|
|
# cmake --help
|
|
7.) You can configure several options such as the build type, debugging information,
|
|
mmap support, etc, by using the CMake GUI or by calling
|
|
# ccmake ..
|
|
Make sure you call configure again if you make any changes.
|
|
8.) Start the build. This depends on which build script you specified, but it would be something like
|
|
# make
|
|
or
|
|
# nmake
|
|
Or open the solution files with your IDE.
|
|
|
|
[You can also specify to just build a certain target (such as cl_test, cl_demo,
|
|
clucene-core (shared library), clucene-core-static (static library).]
|
|
9.) The binary files will be available in <clucene>build-name/bin
|
|
10.)Test the code. (After building the tests - this is done by default, or by calling make cl_test)
|
|
# ctest -V
|
|
11.)At this point you can install the library:
|
|
# make install
|
|
[There are options to do this from the IDE, but I find it easier to create a
|
|
distribution (see instructions below) and install that instead.]
|
|
or
|
|
# make cl_demo
|
|
[This creates the demo application, which demonstrates a simple text indexing and searching].
|
|
or
|
|
Adjust build values using ccmake or the Cmake GUI and rebuild.
|
|
|
|
12.)Now you can develop your own code. This is beyond the scope of this document.
|
|
Read the README for information about documentation or to get help on the mailinglist.
|
|
|
|
Other platforms:
|
|
----------------
|
|
Some platforms require specific actions to get cmake working. Here are some general tips:
|
|
|
|
Solaris:
|
|
I had problems when using the standard stl library. Using the -stlport4 switch worked. Had
|
|
to specify compiler from the command line: cmake -DCXX_COMPILER=xxx -stlport4
|
|
|
|
Building Performance
|
|
--------------------
|
|
Use of ccache will speed up build times a lot. I found it easiest to add the /usr/lib/ccache directory to the beginning of your paths. This works for most common compilers.
|
|
|
|
PATH=/usr/lib/ccache:$PATH
|
|
|
|
Note: you must do this BEFORE you configure the path, since you cannot change the compiler path after it is configured.
|
|
|
|
Installing:
|
|
-----------
|
|
CLucene is installed in CMAKE_INSTALL_PREFIX by default.
|
|
|
|
CLucene used to put config headers next to the library. this was done
|
|
because these headers are generated and are relevant to the library.
|
|
CMAKE_INSTALL_PREFIX was for system-independent files. the idea is that
|
|
you could have several versions of the library installed (ascii version,
|
|
ucs2 version, multithread, etc) and have only one set of headers.
|
|
in version 0.9.24+ we allow this feature, but you have to use
|
|
LUCENE_SYS_INCLUDES to specify where to install these files.
|
|
|
|
Troubleshooting:
|
|
----------------
|
|
|
|
'Too many open files'
|
|
Some platforms don't provide enough file handles to run CLucene properly.
|
|
To solve this, increase the open file limit:
|
|
|
|
On Solaris:
|
|
ulimit -n 1024
|
|
set rlim_fd_cur=1024
|
|
|
|
GDB - GNU debugging tool (linux only)
|
|
------------------------
|
|
If you get an error, try doing this. More information on GDB can be found on the internet
|
|
|
|
#gdb bin/cl_test
|
|
# gdb> run
|
|
when gdb shows a crash run
|
|
# gdb> bt
|
|
a backtrace will be printed. This may help to solve any problems.
|
|
|
|
Code layout
|
|
--------------
|
|
File locations:
|
|
* clucene-config.h is required and is distributed next to the library, so that multiple libraries can exist on the
|
|
same machine, but use the same header files.
|
|
* _HeaderFile.h files are private, and are not to be used or distributed by anything besides the clucene-core library.
|
|
* _clucene-config.h should NOT be used, it is also internal
|
|
* HeaderFile.h are public and are distributed and the classes within should be exported using CLUCENE_EXPORT.
|
|
* The exception to the internal/public conventions is if you use the static library. In this case the internal
|
|
symbols will be available (this is the way the tests program tests internal code). However this is not recommended.
|
|
|
|
Memory management
|
|
------------------
|
|
Memory in CLucene has been a bit of a difficult thing to manage because of the
|
|
unclear specification about who owns what memory. This was mostly a result of
|
|
CLucene's java-esque coding style resulting from porting from java to c++ without
|
|
too much re-writing of the API. However, CLucene is slowly improving
|
|
in this respect and we try and follow these development and coding rules (though
|
|
we dont guarantee that they are all met at this stage):
|
|
|
|
1. Whenever possible the caller must create the object that is being filled. For example:
|
|
IndexReader->getDocument(id, document);
|
|
As opposed to the old method of document = IndexReader->getDocument(id);
|
|
|
|
2. Clone always returns a new object that must be cleaned up manually.
|
|
|
|
Questions:
|
|
1. What should be the convention for an object taking ownership of memory?
|
|
Some documenting is available on this, but not much
|
|
|
|
Working with valgrind
|
|
----------------------
|
|
Valgrind reports memory leaks and memory problems. Tests should always pass
|
|
valgrind before being passed.
|
|
|
|
#valgrind --leak-check=full <program>
|
|
|
|
Memory leak tracking with dmalloc
|
|
---------------------------------
|
|
dmalloc (http://dmalloc.com/) is also a nice tool for finding memory leaks.
|
|
To enable, set the ENABLE_DMALLOC flag to ON in cmake. You will of course
|
|
have to have the dmalloc lib installed for this to work.
|
|
|
|
The cl_test file will by default print a low number of errors and leaks into
|
|
the dmalloc.log.txt file (however, this has a tendency to print false positives).
|
|
You can override this by setting your environment variable DMALLOC_OPTIONS.
|
|
See http://dmalloc.com/ or dmalloc --usage for more information on how to use dmalloc
|
|
|
|
For example:
|
|
# DMALLOC_OPTIONS=medium,log=dmalloc.log.txt
|
|
# export DMALLOC_OPTIONS
|
|
|
|
UPDATE: when i upgrade my machine to Ubuntu 9.04, dmalloc stopped working (caused
|
|
clucene to crash).
|
|
|
|
Performance with callgrind
|
|
--------------------------
|
|
Really simple
|
|
|
|
valgrind --tool=callgrind <command: e.g. bin/cl_test>
|
|
this will create a file like callgrind.out.12345. you can open this with kcachegrind or some
|
|
tool like that.
|
|
|
|
|
|
Performance with gprof
|
|
----------------------
|
|
Note: I recommend callgrind, it works much better.
|
|
|
|
Compile with gprof turned on (ENABLE_GPROF in cmake gui or using ccmake).
|
|
I've found (at least on windows cygwin) that gprof wasn't working over
|
|
dll boundaries, running the cl_test-pedantic monolithic build worked better.
|
|
|
|
This is typically what I use to produce some meaningful output after a -pg
|
|
compiled application has exited:
|
|
# gprof bin/cl_test-pedantic.exe gmon.out >gprof.txt
|
|
|
|
Code coverage with gcov
|
|
-----------------------
|
|
To create a code coverage report of the test, you can use gcov. Here are the
|
|
steps I followed to create a nice html report. You'll need the lcov package
|
|
installed to generate html. Also, I recommend using an out-of-source build
|
|
directory as there are lots of files that will be generated.
|
|
|
|
NOTE: you must have lcov installed for this to work
|
|
|
|
* It is normally recommended to compile with no optimisations, so change CMAKE_BUILD_TYPE
|
|
to Debug.
|
|
|
|
* I have created a cl_test-gcov target which contains the necessary gcc switches
|
|
already. So all you need to do is
|
|
# make test-gcov
|
|
|
|
If everything goes well, there will be a directory called code-coverage containing the report.
|
|
|
|
If you want to do this process manually, then:
|
|
# lcov --directory ./src/test/CMakeFiles/cl_test-gcov.dir/__/core/CLucene -c -o clucene-coverage.info
|
|
# lcov --remove clucene-coverage.info "/usr/*" > clucene-coverage.clean
|
|
# genhtml -o clucene-coverage clucene-coverage.clean
|
|
|
|
If both those commands pass, then there will be a clucene coverage report in the
|
|
clucene-coverage directory.
|
|
|
|
Benchmarks
|
|
----------
|
|
Very little benchmarking has been done on clucene. Andi Vajda posted some
|
|
limited statistics on the clucene list a while ago with the following results.
|
|
|
|
There are 250 HTML files under $JAVA_HOME/docs/api/java/util for about
|
|
6108kb of HTML text.
|
|
org.apache.lucene.demo.IndexFiles with java and gcj:
|
|
on mac os x 10.3.1 (panther) powerbook g4 1ghz 1gb:
|
|
. running with java 1.4.1_01-99 : 20379 ms
|
|
. running with gcj 3.3.2 -O2 : 17842 ms
|
|
. running clucene 0.8.9's demo : 9930 ms
|
|
|
|
I recently did some more tests and came up with these rough tests:
|
|
663mb (797 files) of Guttenberg texts
|
|
on a Pentium 4 running Windows XP with 1 GB of RAM. Indexing max 100,000 fields
|
|
- Jlucene: 646453ms. peak mem usage ~72mb, avg ~14mb ram
|
|
- Clucene: 232141. peak mem usage ~60, avg ~4mb ram
|
|
|
|
Searching indexing using 10,000 single word queries
|
|
- Jlucene: ~60078ms and used ~13mb ram
|
|
- Clucene: ~48359ms and used ~4.2mb ram
|
|
|
|
Distribution
|
|
------------
|
|
CPack is used for creating distributions.
|
|
* Create a out-of-source build as per usual
|
|
* Make sure the version number is correct (see <clucene>/CMakeList.txt, right at the top of the file)
|
|
* Make sure you are compiling in the correct release mode (check ccmake or the cmake gui)
|
|
* Make sure you enable ENABLE_PACKAGING (check ccmake or the cmake gui)
|
|
* Next, check that the package is compliant using several tests (must be done from a linux terminal, or cygwin):
|
|
# cd <clucene>/build-name
|
|
# ../dist-check.sh
|
|
* Make sure the source directory is clean. Make sure there are no unknown svn files:
|
|
# svn stat ..
|
|
* Run the tests to make sure that the code is ok (documented above)
|
|
* If all tests pass, then run
|
|
# make package
|
|
for the binary package (and header files). This will only create a tar.gz package.
|
|
and/or
|
|
# make package_source
|
|
for the source package. This will create a ZIP on windows, and tar.bz2 and tar.gz packages on other platforms.
|
|
|
|
There are also options for create RPM, Cygwin, NSIS, Debian packages, etc. It depends on your version of CPack.
|
|
Call
|
|
# cpack --help
|
|
to get a list of generators.
|
|
|
|
Then create a special package by calling
|
|
# cpack -G <GENERATOR> CPackConfig.cmake
|
|
|