Published on November 21st, 2018
ACHE – A Web Crawler For Domain-Specific Search
- Regular crawling of a fixed list of web sites
- Discovery and crawling of new relevant web sites through automatic link prioritization
- Configuration of different types of page classifiers (machine-learning, regex, etc.)
- Continuous re-crawling of sitemaps to discover new pages
- Indexing of crawled pages using Elasticsearch
- Web interface for searching crawled pages in real-time
- REST API and web-based user interface for crawler monitoring
- Crawling of hidden services using TOR proxies
Documentation
More info is available on the project's documentation.
Installation
You can either build ACHE from the source code, download the executable binary using conda, or use Docker to build an image and run ACHE in a container.
Build from source with Gradle
Prerequisite: You will need to install a recent version of Java (JDK 8 or later).
To build ACHE from source, you can run the following commands in your terminal:
git clone https://github.com/ViDA-NYU/ache.git
cd ache
./gradlew installDist
which will generate an installation package under ache/build/install/. You can then make the ache command available in the terminal by adding the ACHE binaries to the PATH environment variable:
export ACHE_HOME="{path-to-cloned-ache-repository}/build/install/ache"
export PATH="$ACHE_HOME/bin:$PATH"
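To confirm that the change took effect, a quick check along the following lines can help. This is a minimal sketch: `require_cmd` is a hypothetical helper name and not part of ACHE, and it relies only on the shell built-in `command -v`.

```shell
#!/bin/sh
# Hypothetical helper (not part of ACHE): succeed only if a command is on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

# Check for the ache launcher after exporting ACHE_HOME and PATH.
require_cmd ache || echo "Re-check ACHE_HOME and PATH, then retry."
```

Note that the two export lines only affect the current shell session; adding them to your shell's startup file is the usual way to make them permanent.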
Running using Docker
Prerequisite: You will need to install a recent version of Docker. See https://docs.docker.com/engine/installation/ for details on how to install Docker for your platform.
We publish pre-built Docker images on Docker Hub for each released version. You can run the latest image using:
docker run -p 8080:8080 vidanyu/ache:latest
Alternatively, you can build the image yourself and run it:
git clone https://github.com/ViDA-NYU/ache.git
cd ache
docker build -t ache .
docker run -p 8080:8080 ache
The Dockerfile exposes two data volumes so that you can mount a directory with your configuration files (at /config) and preserve the crawler's stored data (at /data) after the container stops.
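As an illustration of the two volumes, the sketch below mounts one host directory at /config and another at /data. The host directory names (`ache-config`, `ache-data`), the seed file name `seeds.txt`, and the helper name `run_ache_container` are placeholders chosen for this example. Passing `startCrawl` arguments after the image name is an assumption about how the image's entrypoint is set up, so check the project's documentation before relying on it.

```shell
#!/bin/sh
# Placeholder host directories for this sketch (not names ACHE requires).
CONFIG_DIR="$PWD/ache-config"   # expected to hold ache.yml, pageclassifier.yml, seeds.txt
DATA_DIR="$PWD/ache-data"       # crawler output is kept here after the container stops

mkdir -p "$CONFIG_DIR" "$DATA_DIR"

# Hypothetical wrapper around docker run.
# Assumption: the image accepts the same startCrawl arguments as the ache CLI.
run_ache_container() {
    docker run \
        -v "$CONFIG_DIR:/config" \
        -v "$DATA_DIR:/data" \
        -p 8080:8080 \
        vidanyu/ache:latest \
        startCrawl -c /config -s /config/seeds.txt -m /config -o /data
}

# Call run_ache_container once the configuration files are in place.
```

Keeping the data directory on the host means a new container can be started later without losing what was already crawled.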
Download with Conda
Prerequisite: You need to have the Conda package manager installed on your system.
If you use Conda, you can install ache from Anaconda Cloud by running:
conda install -c vida-nyu ache
NOTE: Only released tagged versions are published to Anaconda Cloud, so the version available through Conda may not be up-to-date. If you want to try the most recent version, please clone the repository and build from source or use the Docker version.
Running ACHE
Before starting a crawl, you need to create a configuration file named ache.yml. We provide some configuration samples in the repository's config directory that can help you to get started.
You will also need a page classifier configuration file named pageclassifier.yml. For details on how to configure a page classifier, refer to the page classifiers documentation.
After you have configured a classifier, the last thing you will need is a seed file, i.e., a plain text file containing one URL per line. The crawler will use these URLs to bootstrap the crawl.
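For illustration, a seed file can be created and sanity-checked from the shell as sketched below. The file name `my.seeds` and the two URLs are placeholders, and the check for an `http(s)://` prefix is only a convenience of this sketch, not a rule stated by ACHE.

```shell
#!/bin/sh
SEED_FILE="my.seeds"   # placeholder name

# Write one URL per line (placeholder URLs).
cat > "$SEED_FILE" <<'EOF'
https://example.com/
https://example.org/news/
EOF

# Count non-empty lines, and non-empty lines lacking an http(s) scheme.
total=$(awk 'NF { n++ } END { print n + 0 }' "$SEED_FILE")
bad=$(awk 'NF && $0 !~ /^https?:\/\// { n++ } END { print n + 0 }' "$SEED_FILE")

echo "seed URLs: $total, lines without an http(s) scheme: $bad"
```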
Finally, you can start the crawler using the following command:
ache startCrawl -o <data-output-path> -c <config-path> -s <seed-file> -m <model-path>
where:
- <config-path> is the path to the config directory that contains ache.yml.
- <seed-file> is the seed file that contains the seed URLs.
- <model-path> is the path to the model directory that contains the file pageclassifier.yml.
- <data-output-path> is the path to the data output directory.
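Because each path is expected to contain a specific file, a small pre-flight check can catch mistakes before the crawl starts. The sketch below is not part of ACHE: `check_crawl_inputs` is a hypothetical helper that only tests for the files named above.

```shell
#!/bin/sh
# Hypothetical helper (not part of ACHE): verify the inputs to startCrawl.
# Usage: check_crawl_inputs <config-path> <seed-file> <model-path>
check_crawl_inputs() {
    rc=0
    # The config directory must contain ache.yml.
    [ -f "$1/ache.yml" ] || { echo "missing $1/ache.yml" >&2; rc=1; }
    # The seed file must exist and be non-empty.
    [ -s "$2" ] || { echo "seed file $2 is missing or empty" >&2; rc=1; }
    # The model directory must contain pageclassifier.yml.
    [ -f "$3/pageclassifier.yml" ] || { echo "missing $3/pageclassifier.yml" >&2; rc=1; }
    return "$rc"
}
```

One way to use it is to chain it in front of the crawler command, for example `check_crawl_inputs <config-path> <seed-file> <model-path> && ache startCrawl ...`, so that the crawl only starts when all three inputs are present.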
Example of running ACHE using the sample pre-trained page classifier model and the sample seeds file available in the repository:
ache startCrawl -o output -c config/sample_config -s config/sample.seeds -m config/sample_model
The crawler will run and print the logs to the console. Hit Ctrl+C at any time to stop it (it may take some time). For long crawls, you should run ACHE in the background using a tool like nohup.
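A background run with nohup might look like the sketch below. `run_detached` is a hypothetical wrapper written for this example, and the log and PID file names are placeholders; only `nohup` itself and the `startCrawl` arguments come from the text above.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command under nohup, detached from the terminal.
# Usage: run_detached <log-file> <pid-file> <command> [args...]
run_detached() {
    log_file=$1
    pid_file=$2
    shift 2
    # Send stdout and stderr to the log so nohup does not create nohup.out.
    nohup "$@" > "$log_file" 2>&1 &
    # Remember the process ID so the job can be inspected or stopped later.
    echo $! > "$pid_file"
}

# Example (commented out so the snippet is safe to paste):
# run_detached crawl.log crawl.pid \
#     ache startCrawl -o output -c config/sample_config -s config/sample.seeds -m config/sample_model
# tail -f crawl.log           # follow the crawler logs
# kill "$(cat crawl.pid)"     # send SIGTERM to the background process
```

Writing the PID to a file is a design choice of this sketch: it lets you stop the crawl from a different terminal session after logging out and back in.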
Data Formats
ACHE can output data in multiple formats. The data formats currently available are:
- FILES (default) - raw content and metadata are stored in rolling compressed files of fixed size.
- ELASTICSEARCH - raw content and metadata are indexed in an Elasticsearch index.
- KAFKA - pushes raw content and metadata to an Apache Kafka topic.
- WARC - stores data using the standard format used by the Web Archive and Common Crawl.
- FILESYSTEM_HTML - only raw page content is stored in plain text files.
- FILESYSTEM_JSON - raw content and metadata are stored using JSON format in files.
- FILESYSTEM_CBOR - raw content and some metadata are stored using CBOR format in files.
For more details on how to configure data formats, see the data formats documentation page.
Bug Reports and Questions
We welcome user feedback. Please submit any suggestions, questions or bug reports using the GitHub issue tracker.
We also have a chat room on Gitter.
Contributing
Code contributions are welcome. We use a code style derived from the Google Style Guide, but with 4 spaces for tabs. An Eclipse Formatter configuration file is available in the repository.
Contact
- Aécio Santos [aecio.santos@nyu.edu]
- Kien Pham [kien.pham@nyu.edu]