DSE Cassandra Decimal Type – Usage and Examples

cassandrasolr

I generated solrschema with dsetool on a table name seriesdata (its key is composed by seriesmetadata_id, initialtime; two decimal field).
These fields are defined as DecimalStrField but they are numeric valued (I have to execute range selection). I tried to change schema defining these fields as:

<fieldType class="org.apache.solr.schema.TrieLongField" name="TrieLongField"/>

but I receive this error:

[cassandra@bigdatalin-03 ~]$ dsetool reload_core timeseriesks.seriesdata schema=schema_data.xml solrconfig=solr_config_data.xml reindex=true -l cassandra -p cassandra
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: com.datastax.bdp.search.solr.CassandraIndexSchema$ValidationException: Mismatch between Solr key field [seriesmetadata_id] with type {TrieLongField{class=org.apache.solr.schema.TrieLongField,analyzer=org.apache.solr.schema.FieldType$DefaultAnalyzer,args={class=org.apache.solr.schema.TrieLongField}}] and Cassandra key alias [seriesmetadata_id] with type [decimal]
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:665)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:303)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:294)
at com.datastax.bdp.tools.SearchDseToolCommands.createOrReloadCore(SearchDseToolCommands.java:559)
at com.datastax.bdp.tools.SearchDseToolCommands.access$200(SearchDseToolCommands.java:59)
at com.datastax.bdp.tools.SearchDseToolCommands$ReloadCore.execute(SearchDseToolCommands.java:209)
at com.datastax.bdp.tools.DseTool.run(DseTool.java:126)
at com.datastax.bdp.tools.DseTool.run(DseTool.java:51)
at com.datastax.bdp.tools.DseTool.main(DseTool.java:186)

Why?
Thanks a lot!

Best Answer

Since you're using Datastax Enterprise search nodes, solr is integrated with cassandra. What has happened here is that when you changed your solr schema, you did not change the associated cassandra schema. The error message is telling you that there is a mismatch between your new solr schema and the cassandra schema for the same table. Check the cassandra schema for seriesdata and make it consistent with the solr schema.

Related Solutions

Cassandra – Query a column with collection type

You can index collection types in cassandra 2.1 and later. You are after:
SELECT * FROM <table> WHERE <field> CONTAINS <value_in_list/map/set>

Detailed example:

cqlsh> USE ks;
cqlsh:ks> CREATE TABLE data_points (
            id text PRIMARY KEY,
            created_at timestamp,
            previous_event_id varchar,
            properties map<text,text>
         );
cqlsh:ks> create index on data_points (properties);
cqlsh:ks> INSERT INTO data_points (id, properties) VALUES ('1', { 'fruit' : 'apple', 'band' : 'Beatles' });
cqlsh:ks> INSERT INTO data_points (id, properties) VALUES ('2', { 'fruit' : 'cherry', 'band' : 'Beatles' });
cqlsh:ks> SELECT * FROM data_points WHERE properties CONTAINS 'Beatles';

 id | created_at | previous_event_id | properties
----+------------+-------------------+----------------------------------------
  2 |       null |              null | {'band': 'Beatles', 'fruit': 'cherry'}
  1 |       null |              null |  {'band': 'Beatles', 'fruit': 'apple'}

(2 rows)

Word of warning, secondary indexes don't scale out well as they use a scatter/gather algorithm to find what you need, if you plan to use them for heavy tagging it might be better to denormalize the properties field int a separate table and carry out multiple queries.

Cassandra – Maintenance Guide

In general, a well designed cluster can live for YEARS without being touched. I've had clusters that ran for years hands-off. However, here are some guidelines:

Monitoring is hugely important:

1) Monitor latencies. Use opscenter or your favorite metrics tools to keep track of latencies. Latencies going up can be signs of problems coming, including GC pauses (more common in read workloads than write workloads), sstable problems, and the like.

2) Monitor sstable counts. SSTable counts will increase if you overrun compaction (each sstable is written exactly one time - deletes are handled by combining old sstables into new sstables through compaction).

3) Monitor node state changes (up/down,etc). If you see nodes flapping, investigate, as it's not normal.

4) Keep track of your disk usage - traditionally, you need to stay under 50% (especially if you use STCS compaction).

There are some basic things you should and shouldn't do regularly:

1) Don't explicitly run nodetool compact. You mention that you've done it, it's not fatal, but it does create very large sstables, which then are less likely to participate in compaction moving forward. You don't necessarily need to keep running it, but sometimes it may help to get rid of deleted/overwritten data.

2) nodetool repair is typically recommended every gc_grace_seconds (10 days by default). There are workloads where this is less important - the biggest reason you NEED repair is to make sure deletion markers (tombstones) are transmitted before they expire (they live for gc_grace_seconds, if a node is down when the delete happened, that data may come back to life without the repair!). If you don't issue deletes, and you query with sufficient consistency level (reads and writes at QUORUM, for example), you can actually live a life without repair.

3) If you are going to repair, consider using incremental repair, and repair small ranges at a time.

4) Compaction strategies matter - a lot. STCS is great for writes, LCS is great for reads. DTCS has some quirks.

5) Data models matter - just like RDBMS/SQL environments get into trouble as unindexed queries hit large tables, Cassandra can be problematic with very large rows/partitions.

6) Snapshots are cheap. Very cheap. Nearly instant, just hard links, they cost almost no disk space immediately. Use snapshot before you upgrade versions, especially major versions.

7) Be careful with deletes. As hinted in #2, delete creates more data on disk, and doesn't free it for AT LEAST gc_grace_seconds.

When all else fails:

I've seen articles that suggest Cassandra in prod requires a dedicated head to manage any sized cluster - I don't know that it's necessarily true, but if you're concerned, you may want to hire a third party consultant (TheLastPickle, Pythian) or have a support contract (Datastax) to give you some peace of mind.

Best Answer

Related Solutions

Cassandra – Query a column with collection type

Cassandra – Maintenance Guide

Related Question