Would elasticsearch or RavenDB be better for fueling a statistics engine/random forest?

nosql

(Note: this question exists on StackOverflow as well but I thought it might have a better reception here. If it proves this is the better place, I'll close/ask to migrate/link to this. Also, if it doesn't really belong here, I'd be happy to delete it.)

I've been looking at the following NoSQL databases for the next phase of my project: elasticsearch and RavenDB.

elasticsearch positions itself as primarily serving advanced search scenarios, while RavenDB positions itself as a document-oriented database.

Primarily, the documents will be about videos. Each video has a natural id, which will be the key of its document.

Around that, I add other content in fields which will not necessarily be scalar or flat, as the information will come from a number of different sources with different structures.

For example, there will be content from the video provider's Atom feeds, blog posts that have the video embedded in it, and other pieces of data from a data warehouse project.

There is no set structure across all of the items (each of them will be very domain-specific, actually); the only thing that relates them is the natural key of the video mentioned above.
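To make the document shape concrete, here is a minimal sketch in plain Python (not the RavenDB or elasticsearch API). The source names `atom_feed`, `blog_posts` and `warehouse`, the ids, and the payload fields are all made up for illustration; the only thing taken from the description above is that every record carries the video's natural id and otherwise has its own structure.

```python
from collections import defaultdict


def merge_sources(records):
    """Group (video_id, source, payload) records into one document per video id."""
    docs = defaultdict(dict)
    for video_id, source, payload in records:
        docs[video_id]["id"] = video_id   # the natural key of the video
        docs[video_id][source] = payload  # each source keeps its own structure
    return dict(docs)


# Hypothetical records from three differently shaped sources.
records = [
    ("vid-1", "atom_feed", {"title": "Intro", "tags": ["a", "b"]}),
    ("vid-1", "blog_posts", [{"url": "http://example.com/post", "excerpt": "embedded here"}]),
    ("vid-2", "warehouse", {"views": 120}),
]
docs = merge_sources(records)
```

Each source lands under its own field, so nothing forces the Atom data and the warehouse data to share a schema.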

That said, once I have this information in one of the above solutions, I'll want to do a number of things with it:

  • Cull it to help populate variables in a random forest in order to make classifications about the videos
  • Provide general search on the videos (general free-text, not based on the results of the random forest) through a web-based front end (ASP.NET MVC if you must know)
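For the first point, "culling" could look roughly like the sketch below: walk a nested document and pull out numeric leaves (plus list lengths) as flat, named variables that a random forest implementation could consume. This is only an assumption about what the variables might be; the field names are hypothetical, strings are simply skipped, and a real pipeline would need its own encoding for text and categories.

```python
def flatten(value, prefix=""):
    """Turn a nested document into a flat {dotted.name: number} mapping."""
    features = {}
    if isinstance(value, dict):
        for key, child in value.items():
            features.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        # Lists are summarized by their length only.
        features[prefix + "count"] = len(value)
    elif isinstance(value, (int, float)) and not isinstance(value, bool):
        features[prefix.rstrip(".")] = float(value)
    # Strings and other types are ignored in this sketch.
    return features


# Hypothetical video document with differently structured sources.
doc = {
    "atom_feed": {"title": "Intro", "duration": 95, "tags": ["a", "b"]},
    "warehouse": {"views": 120},
}
variables = flatten(doc)
```

The dotted names keep track of which source a variable came from, which helps when the sources do not share a structure.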

There are some requirements:

  • I will more than likely be in an ASP.NET shared web hosting environment. This means I'll have one machine, and won't have access to set up a service. Something embeddable will be very helpful.

  • The ASP.NET environment will be hosted in IIS, so the embeddable aspect will have to survive app-domain recycling.

  • I'll want to create new indexes based on the results of the statistical analysis, which I can easily facet on to help with the search on the site.

  • Support for autocomplete functionality (I know this isn't an "out-of-the-box" request, but being able to get to that point is important).

  • Rich synonym support (there are a number of them in the type of videos I'm indexing content around)
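To pin down what the facet, autocomplete and synonym requirements mean in practice, here is a toy in-memory sketch in Python. It is not how RavenDB, Lucene.NET or elasticsearch implement these features; the synonym map, the `category` field and the terms are invented purely for illustration.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical synonym map: each variant is rewritten to one canonical term.
SYNONYMS = {"clip": "video", "movie": "video"}


def normalize(text):
    """Lowercase, split on whitespace, and apply the synonym map."""
    return [SYNONYMS.get(token, token) for token in text.lower().split()]


def facet_counts(docs, field):
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)


def autocomplete(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms starting with `prefix` from a sorted list."""
    matches = []
    for term in sorted_terms[bisect_left(sorted_terms, prefix):]:
        if not term.startswith(prefix) or len(matches) == limit:
            break
        matches.append(term)
    return matches


terms = sorted(["video", "vid", "apple", "view"])
suggestions = autocomplete(terms, "vi")
```

A facet over an index derived from the classification results would then just be `facet_counts` on whatever field holds the predicted class.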

I'm also open to services, such as Truffler, although I do have concerns about the cost (and in Truffler's case, a little concerned about latency between the data centers, because the requests will come from the web host on the West coast, or from a back-end process on the East coast).

Additionally, I don't feel that one solution needs to fit all the requirements. I'm more than fine with having one serve one purpose and having another serve another purpose. Granted, migrations suck, but migrating between these two document stores is a little easier (and I don't expect them to use the same document structure, necessarily).

Best Answer

RavenDB embeds quite nicely into a .NET application and also allows you to create full-text (embedded) Lucene.NET indexes. Given your hosting constraints, elasticsearch won't be a viable option, since you'd need to run it as a service alongside your MVC application.

Lucene.NET does not support facets out of the box, but RavenDB comes to the rescue here too:
http://ravendb.net/documentation/faceted-search

RavenDB also allows you to control your Lucene.NET analyzers quite nicely: http://ravendb.net/documentation/how-indexes-work

Disclosure: I'm the author of NEST, the elasticsearch .NET client, so if anyone would try to sell you elasticsearch, it would be me :)