B
Introduction:
You have two possibilities to do what you want - one works for versions of MySQL from 5.5 (uses aggregates) and upwards and the other works for MySQL 8 and up (uses window functions).
All of the code below is available on the fiddle https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=c72ea2c9ffb6ef10bcac19f8d6c0fba1 . NOTE: The fiddle is for MySQL version 8. If you wish to run versions 5.5 (or 5.6 or 5.7), please change the server in the dropdown at the top of the fiddle. I did this because EXPLAIN ANALYZE can only be used with MySQL > 8.0.18 - previous versions are unusable!
First create your table (you should provide this yourself in the form of a fiddle for this type of question). I made a few changes to your schema:
CREATE TABLE cas -- French for "case" - CASE is an SQL keyword and should not be used for table names!
(
case_id INTEGER NOT NULL,
cat TEXT NOT NULL,
birth DATE NOT NULL, -- store as a date - only requires 4 bytes
c_type TEXT NULL -- type is also a keyword, best avoided https://www.drupal.org/docs/develop/coding-standards/list-of-sql-reserved-words
);
and then populate it. I added a few records for testing purposes:
INSERT INTO cas (case_id, cat, birth) VALUES
(20033738, 'CASE_OWNER', '1996-04-08'),
(20033738, 'WIFE' , '1995-08-22'),
(20033831, 'CASE_OWNER', '1975-03-05'),
(20033831, 'CHILD' , '2005-03-19'),
(20033831, 'CHILD' , '2006-03-25'),
(20033831, 'CHILD' , '2010-05-20'),
(20033831, 'CHILD' , '2013-10-25'),
(20039301, 'CASE_OWNER', '1999-07-27'),
(20039301, 'WIFE' , '2001-07-05'),
(20039301, 'CHILD' , '2018-10-22'),
(20039334, 'CASE_OWNER', '1994-03-10'),
(30033333, 'CASE_OWNER', '1980-01-01'), -- added a single case owner with one child!
(30033333, 'CHILD' , '2012-09-01'),
(30044444, 'CASE_OWNER', '2015-08-10'), -- added a case owner < 21 yrs of age!
(30055555, 'CASE_OWNER', '1970-02-10'), -- added a couple whose children are all > 21!
(30055555, 'WIFE' , '1972-07-05'),
(30055555, 'CHILD' , '1995-11-22'),
(30055555, 'CHILD' , '1997-05-19'),
(30066666, 'CASE_OWNER', '1970-02-10'), -- added single case owner whose children
(30066666, 'CHILD' , '1989-07-05'), -- are all over 21!
(30066666, 'CHILD' , '1992-11-22'),
(30066666, 'CHILD' , '1994-05-19');
VERY IMPORTANT
An understanding of the CASE statement is critical for following the rest of this answer. The query will progress down through the CASE and when it comes across the first matching condition, it will perform the assignment and then drop out of the CASE and start again at the next record - this is a bit like the C (and other) programming language CONTINUE statement, used to break out of loops and start again with the next iteration.
This is why
a DEFAULT is important to track that your assignments are proceding correctly and that you haven't missed anything
you have to have a clear path for your conditions. When there are CASE statements within others, it can become very easy to confuse oneself!
1st form of the query (using window functions - only available in MySQL >= 8).
SELECT
case_id,
cat,
birth,
c_type,
CASE
-- 1st section:
WHEN (COUNT(case_id) OVER (PARTITION BY case_id)) = 1
THEN
CASE
WHEN (DATEDIFF(NOW(), birth) / 365.25) >= 21
THEN 'SINGLE_PERSON'
ELSE 'UNKNOWN'
END
-- 2nd section:
WHEN COUNT(case_id) OVER (PARTITION BY case_id) = 2
AND SUM(CASE WHEN cat = 'WIFE' THEN 1 ELSE 0 END) OVER (PARTITION BY case_id) = 1
THEN 'PAIR_NO_CHILD'
-- 3rd section:
WHEN COUNT(case_id) OVER (PARTITION BY case_id) >= 3
AND SUM(CASE WHEN cat = 'WIFE' THEN 1 ELSE 0 END) OVER (PARTITION BY case_id) = 1
THEN
CASE
WHEN MIN((DATEDIFF(NOW(), birth) / 365.25)) OVER (PARTITION BY case_id) < 21
THEN 'PAIR_WITH_CHILD'
ELSE 'OTHER'
END
-- 4th section:
WHEN COUNT(case_id) OVER (PARTITION BY case_id) >= 2
AND SUM(CASE WHEN cat = 'WIFE' THEN 1 ELSE 0 END) OVER (PARTITION BY case_id) = 0
THEN
CASE
WHEN MIN((DATEDIFF(NOW(), birth) / 365.25)) OVER (PARTITION BY case_id) < 21
THEN 'SINGLE_WITH_CHILD'
ELSE 'MULTIPLE'
END
ELSE 'No assigned type!' -- ALWAYS!! have a default - helps to keep track of where you may have missed a case!
END
-- 5th section:
FROM
cas
ORDER BY
case_id,
CASE
WHEN cat = 'CASE_OWNER' THEN 1
WHEN cat = 'WIFE' THEN 2
WHEN cat = 'CHILD' THEN 3
END,
birth DESC;
I'll go through it section by section as there are a couple of tricky bits!
The 1st section:
CASE
WHEN (COUNT(case_id) OVER (PARTITION BY case_id)) = 1
THEN
CASE
WHEN EXTRACT('YEAR' FROM AGE(birth)) >= 21
THEN 'SINGLE_PERSON'
ELSE 'UNKNOWN'
END
This covers
a) If there is only a CASE_OWNER in a CASE_ID, TYPE should be
a1) SINGLE_PERSON in Column TYPE which matches CASE_ID, IF CASE_OWNER is older than 21
a2) UNKNOWN in Column TYPE which matches CASE_ID, IF CASE_OWNER is younger than 21
This is an example of a CASE within a CASE! If there's only one record with a given case_id then, by definition, it must be the case owner! Then, we check their birthday and if they're over 21 (normal course of events), then set the value to SINGLE_PERSON, UNKNOWN otherwise!
The COUNT(case_id) OVER(... is an example of a https://www.postgresql.org/docs/13/tutorial-window.html . These are extremely powerful and well worth getting to know well (short intro https://buildingvts.com/understanding-postgres-window-functions-697bc0ff2ed4 ) - they will repay any effort spent learning them many times over!
There are other ways of calculating age https://stackoverflow.com/questions/2533890/how-to-get-an-age-from-a-d-o-b-field-in-mysql/16692149 - depending on the precision you require.
The 2nd section:
WHEN COUNT(case_id) OVER (PARTITION BY case_id) = 2
AND SUM(CASE WHEN cat = 'WIFE' THEN 1 ELSE 0 END) OVER (PARTITION BY case_id) = 1
THEN 'PAIR_NO_CHILD'
This covers the case:
b) If a CASE_ID has CAT CASE_OWNER AND WIFE (WITHOUT CHILD) for a CASE_ID, TYPE should be PAIR_NO_CHILD in every row that matches the specific CASE_ID.
The interesting snippet here is the SUM(CASE WHEN... construct which allows us to distinguish between case_ids which do and do not have a WIFE.
The 3rd section:
WHEN COUNT(case_id) OVER (PARTITION BY case_id) >= 3
AND SUM(CASE WHEN cat = 'WIFE' THEN 1 ELSE 0 END) OVER (PARTITION BY case_id) = 1
THEN
CASE
WHEN MIN(EXTRACT('YEAR' FROM AGE(birth))) OVER (PARTITION BY case_id) < 21
THEN 'PAIR_WITH_CHILD'
ELSE 'OTHER'
END
This covers the cases:
c) If a CASE_ID has CAT CASE_OWNER AND WIFE AND 1 or more CHILD(ren) for a CASE_ID, TYPE should be
c1) PAIR_WITH_CHILD if one or more CHILD(ren) are below 21 in every row that matches the specific CASE_ID.
c2) OTHER if all CHILD(ren) are are 21 in every row that matches the specific CASE_ID.
The 4th section:
WHEN COUNT(case_id) OVER (PARTITION BY case_id) >= 2
AND SUM(CASE WHEN cat = 'WIFE' THEN 1 ELSE 0 END) OVER (PARTITION BY case_id) = 0
THEN
CASE
WHEN MIN(EXTRACT('YEAR' FROM AGE(birth))) OVER (PARTITION BY case_id) < 21
THEN 'SINGLE_WITH_CHILD'
ELSE 'MULTIPLE'
END
This covers the cases:
d) If a CASE_ID has CAT CASE_OWNER AND 1 or more CHILD(ren) for a CASE_ID, TYPE should be
d1) SINGLE_WITH_CHILD in case one or more (not all) CHILD(ren) is/are below 21 in every row that matches the specific CASE_ID.
d2) MULTIPLE in case all CHILD(ren) are above 21.
The 5th section:
FROM
cas
ORDER BY
case_id,
CASE
WHEN cat = 'CASE_OWNER' THEN 1
WHEN cat = 'WIFE' THEN 2
WHEN cat = 'CHILD' THEN 3
END,
birth DESC;
Here, we use an ORDER BY with a CASE "embedded" in it. This allows us to have total control over the ordering of our records - given the requirements, this is a logical ordering method and is extremely helpful for testing.
Result:
case_id cat birth c_type c_t
20033738 CASE_OWNER 1996-04-08 PAIR_NO_CHILD
20033738 WIFE 1995-08-22 PAIR_NO_CHILD
20033831 CASE_OWNER 1975-03-05 SINGLE_WITH_CHILD
20033831 CHILD 2013-10-25 SINGLE_WITH_CHILD
...
... snipped for brevity
...
2nd form of the query (using aggregates and a subquery) - works from at least 5.5:
SELECT
s.case_id,
s.cat,
s.birth,
--
-- 1st section: these sections correspond to the sections in query 1 above.
CASE
WHEN s.c_cnt = 1
THEN
CASE
WHEN s.a_min >= 21
THEN 'SINGLE PERSON'
ELSE 'UNKNOWN'
END
--
-- 2nd section:
WHEN s.c_cnt = 2 AND s.w_cnt = 1
THEN 'PAIR_NO_CHILD'
--
-- 3rd section:
WHEN s.c_cnt >= 3 AND s.w_cnt = 1
THEN
CASE
WHEN s.a_min < 21
THEN 'PAIR_WITH_CHILD'
ELSE 'OTHER'
END
--
-- 4th section:
WHEN s.c_cnt >= 2 AND s.w_cnt = 0
THEN
CASE
WHEN s.a_min < 21
THEN 'SINGLE_WITH_CHILD'
ELSE 'MULTIPLE'
END
ELSE 'No assigned type!'
END AS c_t
FROM
(
SELECT
case_id,
cat,
birth,
(SELECT COUNT(c2.case_id)
FROM cas c2 WHERE c2.case_id = c1.case_id GROUP BY c2.case_id) AS c_cnt,
(SELECT SUM(CASE WHEN c3.cat = 'WIFE' THEN 1 ELSE 0 END)
FROM cas c3 WHERE c3.case_id = c1.case_id GROUP BY c3.case_id) AS w_cnt,
(SELECT FLOOR(MIN(DATEDIFF(NOW(), c4.birth) / 365.25))
FROM cas c4 WHERE c4.case_id = c1.case_id GROUP BY c4.case_id) AS a_min
FROM cas c1
ORDER BY
c1.case_id,
CASE
WHEN c1.cat = 'CASE_OWNER' THEN 1
WHEN c1.cat = 'WIFE' THEN 2
WHEN c1.cat = 'CHILD' THEN 3
END,
c1.birth DESC
) AS s;
Result:
Same as for query 1.
Just a couple of points to note:
As mentioned above, please always provide a fiddle with your questions when appropriate - normally if you want to display data of any sort!
PAIR_WITH_CHILD sounds incongruous - a "pair" normally refers to wildlife of some sort, or possibly domestic or farm animals, but not humans! However,both '"child"and"wife"` definitely refer to human beings. You might want to put "Couple with children" or similar!
I've included an UPDATE at the bottom of the fiddle.
So, to answer the questions:
My questions are:
Is this doable using SQL?
Yes, see above.
Should this be solved using SQL or using a programming language?
There's no reason not to use SQL in this case. SQL is now https://en.wikipedia.org/wiki/Turing_completeness , however just because you can to do something in a given language, doesn't mean that you should do it in that language.
There will come a point where you have very complex requirements where using SQL will lead to diminishing returns in terms of your effort vs. outcome - experience will tell you when it's better to use another tool!
If this is doable in SQL - how should it be done?
See above! A fiddle showing how to update using aggregates and a CTE is given https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=3c8f75c7e1b91bce3690dfea6ad9da29 .
Finally, a Performance analysis:
I looked at the plans (from MySQL >= and can't make much sense of them! The usual caveats about performance testing apply - there are only 22 records in this dataset. You should test your query/queries on your own full dataset with your own hardware and other system constraints. Just for the record however, on a locally installed instance of MySQL (8.0.27), Windows 11, 16GB of RAM, 8-core processor, 512GB NVMe drive I obtained these results:
MySQL:
Q2: 'Aggregates_no_ORDER_BY - beginning'; 0.187165 s
Q1: 'Window_no_ORDER_BY - end'; 0.229719 s
Q2: 'Aggregates_with_ORDER_BY - end'; 0.296987 s
Q1 'Window_with_ORDER_BY - end'; 0.344441 s
PostgreSQL (same machine) - using EnterpriseDB's 14.1 binary from https://www.enterprisedb.com/download-postgresql-binaries . See https://dbfiddle.uk/?rdbms=postgres_14&fiddle=984d7500df9282f395ab8f0c8329bb0e for a PostgreSQL fiddle with EXPLAIN (ANALYZE, BUFFERS, VERBOSE) .
Q1: Windows_order_by 1.328 ms
Q2: Windows-NO-order_by 1.35 ms
Q1: Aggregate_order_by 1.8 ms
Q2: Aggregate_no_order_by 2.7 ms
The results for MySQL don't appear to align with the complexity of the plans or the fact that the table has to be scanned 4 times (or does it?).
What is really puzzling is that MySQL is 140 times slower than PostgreSQL? Frankly, I'm baffled - you'll have to test for yourself.