Sampling a dataset under several conditions (Python)
-
A DataFrame is given as input, where id is the user, url is the page viewed, and timestamp is the time of the page view.
   id     url            timestamp
0   a  page_1  2021-10-09 15:46:20
1   a  page_2  2021-10-09 15:47:20
2   a  page_3  2021-10-09 15:48:20
3   a  page_4  2021-10-09 15:49:20
4   a  page_2  2021-10-09 15:50:20
5   b  page_4  2021-10-09 15:18:20
6   b  page_3  2021-10-09 15:21:20
7   b  page_2  2021-10-09 15:22:20
8   b  page_1  2021-10-09 15:24:20
9   b  page_1  2021-10-09 15:26:20
Each user is guaranteed to visit page_2. For each user, all pages up to page_2 should be selected; if the user visited page_2 several times, all pages up to the last visit of page_2 must be included. The original dataset is very large, so I'd like to find a way faster than a plain loop. Thank you.
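That is, for the sample above the expected selection would presumably be rows 0-4 for user a and rows 5-7 for user b (everything up to and including each user's last page_2 visit), so the trailing page_1 visits of user b (rows 8-9) are dropped:

   id     url            timestamp
0   a  page_1  2021-10-09 15:46:20
1   a  page_2  2021-10-09 15:47:20
2   a  page_3  2021-10-09 15:48:20
3   a  page_4  2021-10-09 15:49:20
4   a  page_2  2021-10-09 15:50:20
5   b  page_4  2021-10-09 15:18:20
6   b  page_3  2021-10-09 15:21:20
7   b  page_2  2021-10-09 15:22:20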
-
My solution will only work if your data has the following format:
# file input.csv
id,url,timestamp
a,page_1,2021-10-09 15:46:20
a,page_2,2021-10-09 15:47:20
a,page_3,2021-10-09 15:48:20
a,page_4,2021-10-09 15:49:20
a,page_2,2021-10-09 15:50:20
b,page_4,2021-10-09 15:18:20
b,page_3,2021-10-09 15:21:20
b,page_2,2021-10-09 15:22:20
b,page_1,2021-10-09 15:24:20
b,page_1,2021-10-09 15:26:20
Solution:
import pandas as pd

def all_visited_pages_until(page, user_id, df):
    # Rows of the given user, ordered by visit time
    df_by_user = df[df.id == user_id]
    df_by_user = df_by_user.sort_values(by="timestamp")
    # Index label of the user's last visit to the target page
    last_row_index_by_page = df_by_user.url.where(df_by_user.url == page).last_valid_index()
    # .loc slicing is inclusive, so the row of that last visit is kept too
    return df_by_user.url.loc[:last_row_index_by_page]

def main():
    df = pd.read_csv("input.csv", sep=",")
    print(all_visited_pages_until("page_2", "a", df))

if __name__ == '__main__':
    main()
Program output:
0    page_1
1    page_2
2    page_3
3    page_4
4    page_2
Name: url, dtype: object
The output is a Series object. If necessary, it can be converted to a list.
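Since the question stresses that the original dataset is very large, a vectorized variant that handles all users at once with groupby (avoiding one call per user) may also help. This is a minimal sketch, not part of the original answer, assuming the column names id, url, timestamp from the sample:

import pandas as pd

df = pd.read_csv("input.csv", sep=",").sort_values(["id", "timestamp"])

target = "page_2"
# For each row, count how many visits to the target page occur at or after
# it within the same user's history (cumulative sum scanned from the end).
visits_at_or_after = df.url.eq(target).astype(int).iloc[::-1].groupby(df.id).cumsum().iloc[::-1]
# A positive count means the row lies at or before the user's last target visit.
result = df[visits_at_or_after > 0]
print(result)

For the sample data this keeps rows 0-4 for user a and rows 5-7 for user b.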
If all pages have the form page_i, where i is a number, then to optimize the storage of large data volumes the string page_i can accordingly be replaced by the number i itself. To do this, before calling the function all_visited_pages_until it is enough to write:

df.url = df.url.apply(lambda url: int(url.replace("page_", "")))
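After this conversion the target page would be passed as a number rather than a string, so the call from main() above would presumably become:

print(all_visited_pages_until(2, "a", df))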