Rating rows of a pandas DataFrame with a variable step
-
There is a DataFrame:
import pandas as pd

d = {'stores': ['AG21', 'AG41', 'AG85', 'AG45', 'AG31', 'AS25', 'AR81', 'AA43',
                'AG21', 'AD83', 'AA36', 'AG55', 'AT58', 'AD11', 'AH32', 'AE17'],
     'linear': [430, 145, 120, 180, 250, 250, 250, 320,
                376, 390, 420, 580, 350, 190, 125, 390]}
df = pd.DataFrame(data=d)
df = df.sort_values(by='linear')
df
The linear column holds values computed by other code, sorted in ascending order.
Ratings from 1 to 6 are then assigned to the rows manually. For example, for the DataFrame above, the linear column would get roughly the following ratings, assigned by eye:
import pandas as pd

d = {'stores': ['AG21', 'AG41', 'AG85', 'AG45', 'AG31', 'AS25', 'AR81', 'AA43',
                'AG21', 'AD83', 'AA36', 'AG55', 'AT58', 'AD11', 'AH32', 'AE17'],
     'linear': [430, 145, 120, 180, 250, 250, 250, 320,
                376, 390, 420, 580, 350, 190, 125, 390]}
df = pd.DataFrame(data=d)
df = df.sort_values(by='linear')
# ratings assigned by hand, in the order of the sorted rows
df['ratings'] = [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6]
df
The ratings are assigned approximately: after sorting, a row gets the same rating as the rows above it while the step between values is small, and when the step differs sharply, the rating increases.
The fifth or sixth rating is not always present. Example below:
import pandas as pd

d = {'stores': ['AG21', 'AG41', 'AG85', 'AG45', 'AG31', 'AS25', 'AR81', 'AA43'],
     'linear': [330, 145, 120, 180, 250, 150, 185, 320]}
df = pd.DataFrame(data=d)
df = df.sort_values(by='linear')
# ratings assigned by hand, in the order of the sorted rows
df['ratings'] = [1, 2, 2, 3, 3, 4, 5, 5]
df
How can the assignment of these ratings be automated?
-
If the values are already computed and you only need to split them into quantile bins, you can simply do this:
df["cat"] = pd.qcut(df["linear"], 6, labels=False).values+1
df:
   stores  linear  cat
2    AG85     120    1
14   AH32     125    1
1    AG41     145    1
3    AG45     180    2
13   AD11     190    2
4    AG31     250    2
5    AS25     250    2
6    AR81     250    2
7    AA43     320    4
12   AT58     350    4
8    AG21     376    4
9    AD83     390    5
15   AE17     390    5
10   AA36     420    6
0    AG21     430    6
11   AG55     580    6
If a category is missing (like 3 in this example), it means that under these conditions no values fall into the third sextile. If you need exactly six categories, I suggest first defining the intervals yourself and then using the pd.cut method:
import numpy as np

# 7 evenly spaced edges give 6 equal-width intervals
intervals = np.linspace(df["linear"].min(), df["linear"].max(), endpoint=True, num=7)
print(intervals)
# [120. 196.66666667 273.33333333 350. 426.66666667 503.33333333 580. ]
df["cat"] = pd.cut(df["linear"], intervals, labels=False, include_lowest=True) + 1
df:
   stores  linear  cat
2    AG85     120    1
14   AH32     125    1
1    AG41     145    1
3    AG45     180    1
13   AD11     190    1
4    AG31     250    2
5    AS25     250    2
6    AR81     250    2
7    AA43     320    3
12   AT58     350    3
8    AG21     376    4
9    AD83     390    4
15   AE17     390    4
10   AA36     420    4
0    AG21     430    5
11   AG55     580    6
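To check why the quantile split leaves category 3 empty on this data, one option is to ask pd.qcut for the bin edges it computed. The sketch below is only an illustration: it rebuilds the 16-row frame from the question, and the names codes, edges and counts are mine, not part of the original answer.

```python
import pandas as pd

d = {'stores': ['AG21', 'AG41', 'AG85', 'AG45', 'AG31', 'AS25', 'AR81', 'AA43',
                'AG21', 'AD83', 'AA36', 'AG55', 'AT58', 'AD11', 'AH32', 'AE17'],
     'linear': [430, 145, 120, 180, 250, 250, 250, 320,
                376, 390, 420, 580, 350, 190, 125, 390]}
df = pd.DataFrame(data=d).sort_values(by='linear')

# retbins=True makes qcut return the quantile edges alongside the bin codes
codes, edges = pd.qcut(df['linear'], 6, labels=False, retbins=True)

# count rows per category 1..6, keeping categories that received no rows
counts = (codes + 1).value_counts().reindex(range(1, 7), fill_value=0)

print(edges)   # 7 edges delimiting the 6 quantile bins
print(counts)  # category 3 shows a count of 0 for this data
```

The three identical values of 250 sit exactly on a bin edge, so they all land in category 2 and the next bin receives nothing.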
UPDATE
If you just need to divide the DataFrame into approximately equal parts,
***with a loss of statistical significance***,
a simple grouping will do:
import numpy as np
import pandas as pd

d = {'stores': ['AG21', 'AG41', 'AG85', 'AG45', 'AG31', 'AS25', 'AR81', 'AA43'],
     'linear': [330, 145, 120, 180, 250, 150, 185, 320]}
df = pd.DataFrame(data=d)
df = df.sort_values(by='linear')
chunks = 6
# group rows by their position in the sorted frame, scaled to `chunks` groups
df["cat"] = df.groupby(np.arange(len(df)) // (len(df) / chunks)).ngroup() + 1
print(df)
df:
  stores  linear  cat
2   AG85     120    1
1   AG41     145    1
5   AS25     150    2
3   AG45     180    3
6   AR81     185    4
4   AG31     250    4
7   AA43     320    5
0   AG21     330    6
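The grouping key used here is just the row position divided (with floor division) by the average chunk length. A minimal sketch of that arithmetic for 8 rows and 6 chunks follows; n_rows, keys and sizes are illustrative names, not part of the code above.

```python
import numpy as np
import pandas as pd

n_rows, chunks = 8, 6

# average chunk length is 8 / 6 = 1.33..., so floor division maps
# positions 0..7 to the group keys 0, 0, 1, 2, 3, 3, 4, 5
keys = np.arange(n_rows) // (n_rows / chunks)

# rows per group: sizes differ by at most one row
sizes = pd.Series(keys).value_counts().sort_index()

print(keys)
print(sizes)
```

Adding 1 to these keys gives the cat column shown above. The values in the linear column play no role in the split; only the position of each row after sorting matters.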