Selecting rows from a dataset under several conditions (Python)



  • A DataFrame is given as input, where id is the user, url is the page viewed, and timestamp is the time of the visit.

       id     url            timestamp
    0   a  page_1  2021-10-09 15:46:20
    1   a  page_2  2021-10-09 15:47:20
    2   a  page_3  2021-10-09 15:48:20
    3   a  page_4  2021-10-09 15:49:20
    4   a  page_2  2021-10-09 15:50:20
    5   b  page_4  2021-10-09 15:18:20
    6   b  page_3  2021-10-09 15:21:20
    7   b  page_2  2021-10-09 15:22:20
    8   b  page_1  2021-10-09 15:24:20
    9   b  page_1  2021-10-09 15:26:20
    

    Each user is guaranteed to have visited page 2. For each user, I need to select all pages they viewed up to page 2. If the user visited page 2 several times, all pages up to the last visit to page 2 must be included. The original dataset is very large, so I'd like to find a way that is faster than iterating over the rows. Thank you.



  • My solution will only work if your data has the following format:

    # file input.csv
    id,url,timestamp
    a,page_1,2021-10-09 15:46:20
    a,page_2,2021-10-09 15:47:20
    a,page_3,2021-10-09 15:48:20
    a,page_4,2021-10-09 15:49:20
    a,page_2,2021-10-09 15:50:20
    b,page_4,2021-10-09 15:18:20
    b,page_3,2021-10-09 15:21:20
    b,page_2,2021-10-09 15:22:20
    b,page_1,2021-10-09 15:24:20
    b,page_1,2021-10-09 15:26:20
    

    Solution:

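    import pandas as pd


    def all_visited_pages_until(page, user_id, df):
        # Keep only this user's rows, in chronological order
        df_by_user = df[df.id == user_id]
        df_by_user = df_by_user.sort_values(by="timestamp")
        # Index label of the user's last visit to `page`
        last_row_index_by_page = df_by_user.url.where(df_by_user.url == page).last_valid_index()

        # .loc slicing is label-based and includes the end label;
        # only the url column is returned, as in the output shown below
        return df_by_user.loc[:last_row_index_by_page, "url"]


    def main():
        df = pd.read_csv("input.csv", sep=",")
        print(all_visited_pages_until("page_2", "a", df))


    if __name__ == "__main__":
        main()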

    Program output:

    0    page_1
    1    page_2
    2    page_3
    3    page_4
    4    page_2
    Name: url, dtype: object

    The output is a Series object. If necessary, it can be converted to a list.

    == Update ==

    If all pages have the form page_i, where i is a number, then to optimize storage of large data volumes page_i can be replaced with the number i.

    To do this, before calling the function all_visited_pages_until it is enough to write df.url = df.url.apply(lambda url: int(url.replace("page_", ""))). The page argument must then be passed as an integer as well, for example 2 instead of "page_2".
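
    Since the question asks for something faster than going through the users one by one, here is a minimal sketch of how the same selection could be done for all users at once. It is not part of the solution above: the function name pages_until_last_visit is hypothetical, and the sketch assumes that every user has visited the page at least once (as the question states) and that the timestamps can be parsed by pd.to_datetime.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": ["a"] * 5 + ["b"] * 5,
    "url": ["page_1", "page_2", "page_3", "page_4", "page_2",
            "page_4", "page_3", "page_2", "page_1", "page_1"],
    "timestamp": [
        "2021-10-09 15:46:20", "2021-10-09 15:47:20", "2021-10-09 15:48:20",
        "2021-10-09 15:49:20", "2021-10-09 15:50:20", "2021-10-09 15:18:20",
        "2021-10-09 15:21:20", "2021-10-09 15:22:20", "2021-10-09 15:24:20",
        "2021-10-09 15:26:20",
    ],
})


def pages_until_last_visit(df, page):
    """Hypothetical helper: for every user, keep the rows up to and
    including that user's last visit to `page`."""
    out = df.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    out = out.sort_values(["id", "timestamp"])
    # Time of each user's last visit to `page` (Series indexed by id)
    last_visit = out.loc[out["url"] == page].groupby("id")["timestamp"].max()
    # Broadcast that cutoff time back onto every row of the same user
    cutoff = out["id"].map(last_visit)
    # Rows sharing the cutoff timestamp are kept as well
    return out[out["timestamp"] <= cutoff]


result = pages_until_last_visit(df, "page_2")
print(result)
```

    On the sample data this keeps all five rows of user a (the last visit to page_2 is the final row) and the first three rows of user b. The filtering is done with one groupby and one comparison over the whole column, so no Python-level loop over users is needed.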


