
  • RE: Consistent backup of multiple MS SQL databases

    Note that STOPAT refers to a datetime value that each log record has. This has a resolution of 1/300 second (it always ends with .xx0, .xx3 or .xx7).

    Several things can happen between two such values.

    If you want a true "point in time", then you can use marked transactions. Your backup routine will be more complex, though, since you need to add those markers. Below are two articles from MS that discuss this:

    https://docs.microsoft.com/en-us/sql/relational-databases/backup-restore/recovery-of-related-databases-that-contain-marked-transaction?view=sql-server-ver16

    https://docs.microsoft.com/en-us/sql/relational-databases/backup-restore/use-marked-transactions-to-recover-related-databases-consistently?view=sql-server-ver16
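
    For illustration, a minimal sketch of the pattern (the database, table, and mark names below are placeholders, not taken from those articles):

    -- Mark the transaction in every database it touches; the mark lands in each log:
    BEGIN TRANSACTION NightlySync WITH MARK 'sync point across related databases';
        UPDATE db1.dbo.t1 SET processed = 1 WHERE processed = 0;
        UPDATE db2.dbo.t2 SET processed = 1 WHERE processed = 0;
    COMMIT TRANSACTION NightlySync;

    -- Later, restore each database's log to that exact mark instead of a STOPAT time:
    RESTORE LOG db1 FROM DISK = N'C:\backups\db1_log.trn'
        WITH STOPATMARK = 'NightlySync', RECOVERY;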

    posted in SQL
  • RE: Syntax Error in Boolean Null values

    You can change how COPY represents NULL values.

    For example, assuming the following table:

    CREATE TABLE hasbool (
       id integer PRIMARY KEY,
       is_cool boolean,
       name text NOT NULL
    );
    

    I can use blanks to represent NULL as in the following psql session:

    \pset null '∅'
    Null display is "∅".
    

    COPY hasbool FROM STDIN (FORMAT 'csv', NULL ' ');
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself, or an EOF signal.

    1,TRUE,joe
    2, ,paul
    \.
    COPY 2

    TABLE hasbool;

     id │ is_cool │ name
    ════╪═════════╪══════
      1 │ t       │ joe
      2 │ ∅       │ paul
    (2 rows)

    posted in SQL
  • RE: Aggregation group by date or group, pair index, MySQL, slow response

    That's one month's worth of data? Are you having any trouble with 20K inserts every 15 minutes? (I would expect not.)

    The table can be shrunk:

    • ad_group_name_id probably does not need to be a 4-byte INT; for 20K groups, SMALLINT UNSIGNED would take only 2 bytes (but overflow at 65K).
    • id (4 bytes) is unnecessary. If you remove it, change the PRIMARY KEY as noted below.
    • DECIMAL(8,2) takes 4 bytes. (6,2) would take 3 bytes (but top out at 9999.99).

    I recommend this pair of indexes to handle all 3 of your queries:

    PRIMARY KEY(date_to, ad_group_name_id),
    INDEX(ad_group_name_id, date_to, cost)
    

    A Summary table would work nicely for queries that cover whole days - midnight up to midnight. It would have 1 row per day per group, plus the SUM(cost). A simple aggregation at the end of each day would add the 20K new rows.

    Then use the Summary table instead of the main table for all 3 of your queries. Since it collapses 96 15-minute rows into 1 daily row per group, it scans up to 96 times less data, though in practice the speedup probably won't be more than 10x.

    It would have two indexes. (Note that day_to would be of type DATE.)

    PRIMARY KEY(day_to, ad_group_name_id),
    INDEX(ad_group_name_id, day_to, sum_cost)
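
    A minimal sketch of the Summary table and the end-of-day aggregation (main_table and the exact column types are assumptions, since the original DDL wasn't shown):

    CREATE TABLE cost_summary (
        day_to           DATE NOT NULL,
        ad_group_name_id SMALLINT UNSIGNED NOT NULL,
        sum_cost         DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (day_to, ad_group_name_id),
        INDEX (ad_group_name_id, day_to, sum_cost)
    );

    -- Run shortly after midnight to add yesterday's ~20K rows:
    INSERT INTO cost_summary (day_to, ad_group_name_id, sum_cost)
    SELECT DATE(date_to), ad_group_name_id, SUM(cost)
    FROM   main_table
    WHERE  date_to >= CURDATE() - INTERVAL 1 DAY
      AND  date_to <  CURDATE()
    GROUP BY DATE(date_to), ad_group_name_id;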
    

    If you do add the Summary table, you could consider dropping the secondary index from the main table unless it is needed for other lookups based primarily on ad_group_name_id.

    posted in SQL
  • RE: ReadOnly queries on secondary server do not return correct values

    The service first makes an update to a table (probably on the A server) and then a select with Read Only on the B server. Unfortunately, the select does not return the values with which an update was previously made. Only after about 1 to 1.5 seconds do the correct values appear.

    That's expected and is how the product was designed. Synchronous commit guarantees only that the log is hardened on the secondary; commit hardening does not include redo. Harden and Redo are two distinctly different processes for an AG in SQL Server.

    I don't quite understand how the two nodes A and B communicate in an AG, but I assumed that both nodes would receive the data at the same time when using "synchronous commit", right?

    The primary is responsible for capturing the "changes", which are packed into log records; these are in turn packed into log blocks. The log blocks are (oversimplifying here) closed and flushed on a commit, then copied and sent to the other replicas independently of each other. Depending on various factors, this may or may not be a speedy process, as it must traverse other layers and threads.

    Eventually the log block will make it to the secondary and be hardened there; depending on partner type, the secondary will either immediately send a progress message or wait. When the primary receives the progress message from the specific replica that sent it, various internal values are updated and other checks occur. That happens for each replica, regardless of partner type (sync/async).

    Thus, data is not received at the same time on all replicas, nor is there any guarantee it completed redo before the progress message is processed on the primary.
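
    You can observe that gap yourself. A minimal diagnostic sketch, run on the secondary replica (the DMV and columns are real; the comments are my interpretation):

    SELECT DB_NAME(database_id) AS database_name,
           last_hardened_lsn,   -- log block flushed to disk on this replica
           last_redone_lsn,     -- redo has applied the log up to here
           redo_queue_size,     -- KB of log still waiting to be redone
           redo_rate            -- KB/sec at which redo is being applied
    FROM   sys.dm_hadr_database_replica_states
    WHERE  is_local = 1;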

    SQL Server is returning valid and correct data as per the start of the read transaction based on snapshot isolation, just not the data that is wanted/expected.

    posted in SQL
  • Why Does SQL Server not have 200 buckets in the statistics histogram when there are >100k distinct values in the table

    Given that I am using the AdventureWorks sample database (https://docs.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver16&tabs=ssms), why does the statistics histogram for the index PK_TransactionHistory_TransactionID on table Production.TransactionHistory contain only 3 histogram "buckets" when there are 113k distinct values in that column?

    An example below:

    USE AdventureWorks2016
    

    /* ensure statistics are as accurate as they can be */
    UPDATE STATISTICS Production.TransactionHistory WITH FULLSCAN

    Then we can look at the updated histogram:

    /* look at the statistics for the primary key column */
    DBCC SHOW_STATISTICS (
        'Production.TransactionHistory', 
        'PK_TransactionHistory_TransactionID')
    WITH HISTOGRAM;
    

    and I see the output:

    [Screenshot: DBCC SHOW_STATISTICS histogram output showing 3 steps]

    Note the max and min Transaction IDs:

    SELECT MIN(TransactionID) FROM Production.TransactionHistory /* 100000 */
    SELECT MAX(TransactionID) FROM Production.TransactionHistory /* 213442 */
    

    SQL Server seems to have created a "bucket" for the max value, one for the min value, and one for all the values in between (which it knows are all distinct).

    I note that if I remove the primary key from this table

    ALTER TABLE Production.TransactionHistory DROP CONSTRAINT PK_TransactionHistory_TransactionID
    

    and then insert some duplicate values

    /* TransactionID is an identity column, so allow explicit values */
    SET IDENTITY_INSERT Production.TransactionHistory ON;

    INSERT INTO [Production].[TransactionHistory]
    (
        [TransactionID],
        [ProductID],
        [ReferenceOrderID],
        [ReferenceOrderLineID],
        [TransactionDate],
        [TransactionType],
        [Quantity],
        [ActualCost],
        [ModifiedDate]
    )
    VALUES
    (200001,1,1,1,GETDATE(),'P',1,1,GETDATE()),
    (200011,1,1,1,GETDATE(),'P',1,1,GETDATE()),
    (200021,1,1,1,GETDATE(),'P',1,1,GETDATE()),
    (200031,1,1,1,GETDATE(),'P',1,1,GETDATE());

    SET IDENTITY_INSERT Production.TransactionHistory OFF;
    

    Update the stats on the table and then look at the statistics for the column (rather than for the PK we have dropped):

    USE AdventureWorks2016
    

    /* ensure statistics are as accurate as they can be */
    UPDATE STATISTICS Production.TransactionHistory WITH FULLSCAN

    /* look at the statistics for the primary key column */
    DBCC SHOW_STATISTICS (
    'Production.TransactionHistory',
    'TransactionID')
    WITH HISTOGRAM;

    We still have two buckets, though DISTINCT_RANGE_ROWS has updated accordingly:

    [Screenshot: histogram output with updated DISTINCT_RANGE_ROWS]

    Why does SQL Server not make use of the 200 "buckets" available in a histogram here? Is it something to do with the resources required to fill the 8KB statistics page? Would using all 200 buckets mean the histogram may then need to be redefined whenever new data is added to the table?
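
    (As an aside: on SQL Server 2016 SP1 and later, the step count can also be read without DBCC via sys.dm_db_stats_histogram; a minimal sketch, reusing the statistics name from the DBCC call above:)

    SELECT COUNT(*) AS histogram_steps
    FROM   sys.stats AS s
    CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
    WHERE  s.object_id = OBJECT_ID(N'Production.TransactionHistory')
    AND    s.name = N'TransactionID';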

    posted in SQL
  • Availability Group - Core cluster resources switch to being owned by secondary node after windows patching reboots

    Some of our AG clusters (async, manual failover) shift ownership of their core cluster resources (e.g. the cluster IP) from the primary node to the secondary node, which is in a different datacenter, after both nodes are rebooted for monthly Windows patching. I can easily switch the resources back to the primary node in Cluster Manager, but is this something I should be worried about, say, if the secondary datacenter goes offline? The quorum would still be intact in that case, but is that core cluster IP address relevant to the functioning of the cluster and the AG? Thanks!

    posted in SQL
  • MySQL - Querying a table with optional filters and default value

    I have a table that looks like this:

    ID  KEY  FILTER_NAME  FILTER_VALUE  VALUE
    1   a    country      USA           value1
    2   a    inRockBand   true          value2
    3   a                               value3
    4   b    city         MADRID        value4
    5   b    isLawer      true          value5
    6   b                               value6

    I would like to create a query that receives generic (optional) filter parameters and fetches a single row per key accordingly: when the optional filters don't match, the query should return the row with a NULL filter name/value, as these are the default rows.

    For example:

    select * from my_table where filter_name = 'country' and filter_value = 'SPAIN'
    

    will return:

    ID  KEY  FILTER_NAME  FILTER_VALUE  VALUE
    3   a                               value3
    6   b                               value6

    while

    select * from my_table where filter_name = 'inRockBand' and filter_value = 'true'
    

    will return:

    ID  KEY  FILTER_NAME  FILTER_VALUE  VALUE
    2   a    inRockBand   true          value2
    6   b                               value6

    If several filter rows match for the same key, any one of them may be returned.
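
    One way to express this in MySQL 8 with a window function, sketched with placeholder names (my_table for the table; the 'inRockBand'/'true' literals stand in for the optional filter parameters):

    SELECT id, `key`, filter_name, filter_value, value
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (
                   PARTITION BY `key`
                   -- rows matching the filter sort before the NULL default row
                   ORDER BY (filter_name IS NOT NULL) DESC
               ) AS rn
        FROM my_table t
        WHERE (filter_name = 'inRockBand' AND filter_value = 'true')
           OR filter_name IS NULL
    ) ranked
    WHERE rn = 1;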

    posted in SQL
  • RE: Oracle How to get Execution Plan for SQL executed inside of PLSQL?

    You can look up the sql_id and child_number in v$sql. Something like:

    SELECT s.sql_id, s.child_number
    FROM   v$sql s
    WHERE  upper(s.sql_text) LIKE upper('SELECT COUNT(%FROM foo')
    AND    s.sql_text NOT LIKE '%v$sql%';
    

    Note that PL/SQL normalises static SQL for you to boost cursor sharing; it will be in upper case with much less white space.

    Once you've got these, you can pass them to the dbms_xplan.display_cursor function:
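
    For example (the sql_id and child number here are hypothetical values returned by the query above):

    SELECT *
    FROM   table(dbms_xplan.display_cursor(
               sql_id          => 'abcd1234efgh5',
               cursor_child_no => 0,
               format          => 'TYPICAL'));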

    posted in SQL
  • MySQL 8 trailing spaces being evaluated in equals comparison

    I am getting different behaviors after updating a 5.6 database to 8.

    On 5.6, if I run the following statement I get zero rows, which is what I expect to get:

    SELECT * FROM EntityCustomerContact where CustomerContactID <> TRIM(CustomerContactID);
    

    On 8.0 I get hundreds of rows returned. CustomerContactID is a varchar(32).

    My understanding is that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces.

    The update process consisted of importing a mysqldump from 5.6 into 8.0. The only other thing that I changed other than moving to 8.0 was updating the collation on all tables and columns from utf8mb4_unicode_ci to utf8mb4_0900_ai_ci. This is the first upgrade I've done moving to the utf8mb4_0900_ai_ci collation.
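
    For reference, MySQL 8 exposes each collation's pad behaviour in information_schema, which makes it possible to compare the old and new collations directly (a minimal check):

    SELECT collation_name, pad_attribute
    FROM   information_schema.collations
    WHERE  collation_name IN ('utf8mb4_unicode_ci', 'utf8mb4_0900_ai_ci');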

    I have performed a number of other upgrades from 5.6/7 to 8.0 and never encountered this issue. I ran my test query on some other 8.0 (specifically 8.0.29) databases that were upgraded in the same fashion, except for the collation change, and received the expected results.

    I have searched for possible mysqld settings to address this but have come up short.

    posted in SQL
  • Autocomplete using Postgres Optimization

    I'm trying to build an autocomplete feature using Postgres.

    Here are the steps that I followed:

    1. Since the autocomplete is to be done across 2 fields from different tables, I created a materialized view where I merged the two fields on which I'm searching.
    2. Created the pg_trgm extension.
    3. Created a GIN index on the new merged_fields column (this is the name of the new field).
    4. Queried for particular products like "ball" using the % similarity operator.

    The materialized view had a total of 3 million rows.

    Here's the list of issues that I encountered:

    1. The queries are slow; sometimes the planner chose a sequential scan, since a word as common as "ball" is present in almost 80% of the rows.
    2. I can't decide on a value for pg_trgm.similarity_threshold. Initially I left it at 0.3, but noticed that some records around 200 characters long were not treated as probable hits when I searched for "bat" (even though there are not that many entries for this word); see the threshold sketch after this list.
    3. As the number of search words increases, the query time also increases. I tested this by supplying one genuine term that exists in the db while the rest of the words did not.
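
    A minimal sketch of experimenting with the threshold, using the names from my setup (0.3 is the default):

    SHOW pg_trgm.similarity_threshold;

    -- Inspect the actual similarity scores for a term that should match:
    SELECT merged_fields, similarity(merged_fields, 'bat') AS sim
    FROM my_schema.mvw_autocomplete
    ORDER BY sim DESC
    LIMIT 10;

    -- Long strings dilute trigram similarity, so a lower threshold may be needed:
    SET pg_trgm.similarity_threshold = 0.1;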

    Here's the query that I'm using:

    create index on my_schema.mvw_autocomplete using gin (merged_fields gin_trgm_ops);
    

    explain analyze select * from my_schema.mvw_autocomplete where merged_fields % 'bat arkham knight';

    This query took 9 seconds to run, though it did use the index according to the query plan.

    Is there anything that I can do to improve the runtime and fix the other issues? Also, in the EXPLAIN ANALYZE output I see no mention of parallel workers for this query, whereas other queries that execute faster are making use of parallel workers.

    If anything is missing please let me know.

    Peace 🙂

    posted in SQL