How to crawl a web site in Solr (Solr 5+)

I'm just getting started with Solr, and I think it's pretty cool. It's a really good search application, and I want to spend my time learning it.

I don't think I'm alone in wanting to use Solr to index web sites: lots of them, and mostly sites other than my own. That's basically what has driven me to learn Solr and go through every tutorial I can find on it.

I eventually want to create specialized search engines based on specific topics (e.g. gardening, football, Chevy Camaro cars, etc.), which would require crawling external web sites.

Luckily, today I found out how to use Solr's built-in web crawling function. It basically goes like this:

 ./post -c gettingstarted http://www.netdip.com -recursive 1 -delay 1

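Run that command from the "bin" directory of the downloaded Solr distribution. Here -c names the collection to index into, -recursive 1 follows links one level deep from the start page, and -delay 1 waits one second between page fetches. I'm using 5.2 at the moment; the options may differ in other versions.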
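For a topic-specific search engine you would probably want to crawl several sites in one go. Here is a minimal sketch of how that might look, assuming you run it from Solr's "bin" directory. The example.com and example.org URLs, the crawl_cmd helper and the DRY_RUN switch are my own placeholders, not part of Solr. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: print (or run) one post crawl command per site.
# Assumptions: run from Solr's "bin" directory; the site URLs, the
# crawl_cmd helper and the DRY_RUN variable are placeholders.

COLLECTION="gettingstarted"
DEPTH=1   # -recursive: how many link levels to follow from the start page
DELAY=1   # -delay: seconds to wait between page fetches

# Build the command line for a single site (hypothetical helper).
crawl_cmd() {
  printf './post -c %s %s -recursive %s -delay %s\n' \
    "$COLLECTION" "$1" "$DEPTH" "$DELAY"
}

for site in http://www.example.com http://www.example.org; do
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run (the default): only show what would be executed.
    crawl_cmd "$site"
  else
    # Real run: crawl the site and index the pages into the collection.
    ./post -c "$COLLECTION" "$site" -recursive "$DEPTH" -delay "$DELAY"
  fi
done
```

Setting DRY_RUN=0 when running the script makes it actually crawl. Keeping the delay at a second or more seems polite when the sites aren't your own.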

It works well. With the default schema, though, you won't be able to see any of the body content of a web page stored in Solr.
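To see what actually got stored after a crawl, you can ask Solr for a few documents through its select handler. This is a rough sketch, assuming Solr is running locally on its default port 8983 and the collection is called gettingstarted. The select_url helper and the DRY_RUN switch are placeholders of mine. By default it only prints the query URL.

```shell
#!/bin/sh
# Sketch: fetch a few indexed documents to inspect their stored fields.
# Assumptions: Solr runs on localhost:8983 (its default port); the
# select_url helper and the DRY_RUN variable are placeholders.

# Build a match-all query URL for a collection (hypothetical helper).
# $1 = collection name, $2 = number of documents to return
select_url() {
  printf 'http://localhost:8983/solr/%s/select?q=*:*&rows=%s&wt=json&indent=true\n' \
    "$1" "$2"
}

URL="$(select_url gettingstarted 3)"

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Dry run (the default): only show the URL that would be requested.
  echo "$URL"
else
  # Real run: ask Solr for the documents and print the JSON response.
  curl "$URL"
fi
```

You can also paste the printed URL into a browser. The JSON response lists each document's stored fields, so you can check which parts of a page were kept.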

The information is from http://www.norconex.com/whats-new-in-solr-5/

I hope this helps someone. I don't think I'm the only one who's wondered how to crawl a web site in Solr.

NetDip
