Sunday, September 1, 2019

Memtable and SSTable In Cassandra


What are Memtable and SSTable in Cassandra?

Cassandra processes data at several stages on the write path, starting with the immediate logging of a write and ending with a write of data to disk (a small example write is shown after these steps):
  1. Logging data in the commit log
  2. Writing data to the memtable
  3. Flushing data from the memtable
  4. Storing data on disk in SSTables
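
For instance, an ordinary CQL write such as the one below (the keyspace, table, and column names are purely illustrative) goes through exactly these stages: it is appended to the commit log, added to the memtable, and only later flushed to an SSTable on disk.

         INSERT INTO my_keyspace.users (user_id, name)
         VALUES (uuid(), 'Alice');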

Memtable

When a write occurs, Cassandra stores the data in a memory structure called a memtable and, to provide configurable durability, also appends the write to the commit log on disk. The commit log receives every write made to a Cassandra node, and these durable writes survive permanently even if power fails on a node. The memtable is a write-back cache of data partitions that Cassandra looks up by key. The memtable stores writes in sorted order until it reaches a configurable limit, and is then flushed.

When memtable contents exceed a configurable threshold, or when the commit log space exceeds commitlog_total_space_in_mb, the memtable data, which includes indexes, is put in a queue to be flushed to disk. To flush the data, Cassandra sorts the memtable contents by partition key and then writes the data to disk sequentially. The process is extremely fast because it involves only a commit log append and a sequential write.

Data in the commit log is purged after its corresponding data in the memtable is flushed to an SSTable. The commit log is used to recover the memtable data in the event of a hardware failure.

SSTable

A sorted string table (SSTable) is an immutable data file to which Cassandra writes a memtable. Cassandra flushes all the data in a memtable to an SSTable once the memtable reaches its threshold. Consequently, a partition is typically stored across multiple SSTable files. A number of other SSTable structures exist to assist read operations.
For each SSTable, Cassandra creates these structures:
Data (Data.db): The SSTable data.
Primary Index (Index.db): Index of the row keys with pointers to their positions in the data file.
Bloom filter (Filter.db): A structure stored in memory that is checked before accessing the SSTable on disk, to test whether the requested row data can exist in this SSTable at all.
Compression Information (CompressionInfo.db): A file holding the uncompressed data length, chunk offsets and other compression information.
Statistics (Statistics.db): Statistical metadata about the content of the SSTable.
Digest (Digest.crc32, Digest.adler32, Digest.sha1): A file holding a checksum of the data file.
CRC (CRC.db): A file holding the CRC32 for chunks in an uncompressed file.
SSTable Index Summary (SUMMARY.db): A sample of the partition index stored in memory.
SSTable Table of Contents (TOC.txt): A file that lists all components of the SSTable.
Secondary Index (SI_.*.db): Built-in secondary index; multiple SIs may exist per SSTable.

References: DataStax documentation


Bloom Filters In Cassandra


In general, a Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; in other words, a query returns either "possibly in set" or "definitely not in set".

In Cassandra, Bloom filters are used to boost the performance of reads. They are non-deterministic algorithms for testing whether an element is a member of a set: it is possible to get a false-positive read from a Bloom filter, but not a false-negative.

Bloom filters work by mapping the values in a data set into a bit array and condensing a larger data set into a digest string using a hash function. The digest, by definition, uses a much smaller amount of memory than the original data would. The filters are stored in memory and are used to improve performance by reducing the need for disk access on key look-ups. Disk access is typically much slower than memory access. So, in a way, a Bloom filter is a special kind of cache.

When a query is performed, the Bloom filter is checked first before accessing disk. Because false-negatives are not possible, if the filter indicates that the element does not exist in the set, it certainly doesn’t; but if the filter thinks that the element is in the set, the disk is accessed to make sure.
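
For reference, the well-known approximation for the false-positive probability p of a Bloom filter that uses m bits and k hash functions after n elements have been inserted is roughly:

         p ≈ (1 - e^(-k*n/m))^k

which is why increasing the filter size (m) reduces false positives at the cost of memory, as Cassandra's tuning option described below exploits.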

  
Bloom filters are implemented by the org.apache.cassandra.utils.BloomFilter class. Cassandra provides the ability to increase Bloom filter accuracy (reducing the number of false positives) by increasing the filter size, at the cost of more memory. This false positive chance is tuneable per table.
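
In CQL this knob is exposed as the bloom_filter_fp_chance table property. A minimal sketch (the keyspace and table names here are just placeholders):

         ALTER TABLE my_keyspace.users
         WITH bloom_filter_fp_chance = 0.01;

Lower values mean fewer false positives but a larger, more memory-hungry filter.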


Figure: An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m = 18 and k = 3.




References: Wikipedia; Cassandra: The Definitive Guide

SQL Server 2016 Database Mail Is Not Working


If you are working with SQL Server 2016 on a machine that only has a .NET Framework version higher than 3.5 installed, you will not be able to send email using Database Mail.


As per Microsoft Support, there is a bug in the SQL Server 2016 setup that causes Database Mail not to work without .NET Framework 3.5. The workaround they provide for this bug is to create a DatabaseMail.exe.config file. The steps to create this file are as follows.



Create the DatabaseMail.exe.config file and drop it next to DatabaseMail.exe under the Binn folder. You can use notepad.exe or any other editor to create it. Just make sure you save it using UTF-8 encoding (in notepad.exe, select Save As... and in the Encoding combo box, select UTF-8):

         <?xml version="1.0" encoding="utf-8" ?>
         <configuration>
           <startup useLegacyV2RuntimeActivationPolicy="true">
             <supportedRuntime version="v4.0"/>
             <supportedRuntime version="v2.0.50727"/>
           </startup>
         </configuration>
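
Once the file is in place, you can verify that Database Mail works again by sending a test message with sp_send_dbmail; the profile name and recipient below are placeholders, so substitute your own:

         EXEC msdb.dbo.sp_send_dbmail
              @profile_name = 'YourMailProfile',
              @recipients   = 'you@example.com',
              @subject      = 'Database Mail test',
              @body         = 'Database Mail works after adding DatabaseMail.exe.config.';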


Note:
Sometimes I have observed that the config file disappears automatically after a reboot. To fix that, you can create a SQL Agent job that copies this file (DatabaseMail.exe.config) back into the Binn folder, and schedule it to run after every restart of the service. This way you can overcome the issue.
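
A minimal sketch of such a job is below. It assumes you keep a known-good copy of DatabaseMail.exe.config in a folder of your choice; the source and Binn paths are placeholders for your environment. The schedule uses freq_type = 64, which runs the job whenever the SQL Server Agent service starts:

         USE msdb;
         GO
         EXEC dbo.sp_add_job
              @job_name = N'Restore DatabaseMail.exe.config';
         EXEC dbo.sp_add_jobstep
              @job_name  = N'Restore DatabaseMail.exe.config',
              @step_name = N'Copy config file to Binn',
              @subsystem = N'CmdExec',
              @command   = N'copy /Y "D:\Scripts\DatabaseMail.exe.config" "C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\"';
         EXEC dbo.sp_add_schedule
              @schedule_name = N'On Agent startup',
              @freq_type     = 64;  -- run when SQL Server Agent starts
         EXEC dbo.sp_attach_schedule
              @job_name      = N'Restore DatabaseMail.exe.config',
              @schedule_name = N'On Agent startup';
         EXEC dbo.sp_add_jobserver
              @job_name = N'Restore DatabaseMail.exe.config';
         GO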




SQL Script to Find the Last Statistics Update Date for a Table


You may find the query below useful when you want to see the last statistics update date for a few selected tables in SQL Server.

Query:

SELECT
    sch.name + '.' + so.name AS 'Table',
    ss.name AS 'Statistic',
    CASE
        WHEN ss.auto_created = 0 AND ss.user_created = 0 THEN 'Index Statistic'
        WHEN ss.auto_created = 0 AND ss.user_created = 1 THEN 'User Created'
        WHEN ss.auto_created = 1 AND ss.user_created = 0 THEN 'Auto Created'
        WHEN ss.auto_created = 1 AND ss.user_created = 1 THEN 'Not Possible?'
    END AS 'Statistic Type',
    CASE
        WHEN ss.has_filter = 1 THEN 'Filtered Index'
        WHEN ss.has_filter = 0 THEN 'No Filter'
    END AS 'Filtered?',
    sp.last_updated AS 'Stats Last Updated',
    sp.rows AS 'Rows',
    sp.rows_sampled AS 'Rows Sampled',
    sp.unfiltered_rows AS 'Unfiltered Rows',
    sp.modification_counter AS 'Row Modifications',
    sp.steps AS 'Histogram Steps'
FROM sys.stats ss
JOIN sys.objects so ON ss.object_id = so.object_id
JOIN sys.schemas sch ON so.schema_id = sch.schema_id
OUTER APPLY sys.dm_db_stats_properties(so.object_id, ss.stats_id) AS sp
WHERE so.type = 'U'
  AND so.name IN ('Your_Table_Name') -- Enter your table name here
ORDER BY sp.last_updated DESC;
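
If the output shows a statistic with an old 'Stats Last Updated' value and a high 'Row Modifications' count, you can refresh it manually; one option (the table name below is a placeholder) is:

         UPDATE STATISTICS dbo.Your_Table_Name WITH FULLSCAN;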