How to remove unnecessary words and leave only numbers



  • I'm new to programming. I have a CSV file with 377,185 lines. All values in the baths column are str. How can I remove the word "Baths" from almost every row of this column? Doing 377 thousand rows by hand isn't realistic. I wanted to try regular expressions first, but to use them for searching, the numbers have to be in int or float format. Do I understand that correctly?

    id baths
    0   3.5 
    1   3 Baths
    2   2 Baths
    5   8 
    200 2
    


  • Use the Pandas module (https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). It's well suited to processing, analysing and visualizing tabular data:

    import pandas as pd  # https://pandas.pydata.org/docs/getting_started/install.html
    

    Parse the CSV file into a Pandas DataFrame:

    df = pd.read_csv("file.csv", sep=",")

    In the baths column, remove all characters except digits and the dot:

    df["baths"] = df["baths"].str.replace(r"[^\d.]", "", regex=True)

    Write the DataFrame back to a CSV file:

    df.to_csv("result.csv", index=False)
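
For a quick check without touching the real file, the same replacement can be tried on a small in-memory DataFrame. This is only a sketch: the sample rows are copied from the question, and the final pd.to_numeric step is an addition that is not part of the steps above, assuming numeric values are wanted afterwards.

```python
import pandas as pd

# Sample rows copied from the question; every baths value is a string (str).
df = pd.DataFrame({
    "id": [0, 1, 2, 5, 200],
    "baths": ["3.5 ", "3 Baths", "2 Baths", "8 ", "2"],
})

# Same replacement as above: drop everything except digits and the dot.
df["baths"] = df["baths"].str.replace(r"[^\d.]", "", regex=True)

# Assumption: numbers are wanted afterwards. With errors="coerce", anything
# that still cannot be parsed as a number becomes NaN instead of raising.
df["baths"] = pd.to_numeric(df["baths"], errors="coerce")

print(df["baths"].tolist())  # → [3.5, 3.0, 2.0, 8.0, 2.0]
```

Note that the replacement runs on the string values directly, so nothing has to be converted to int or float before the regular expression is applied.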


    The regular expression r"[^\d.]" means:

    1. Everything inside square brackets matches any one character from the set.
    2. If the first character inside the square brackets is ^, the set is negated, i.e. it matches any character other than those listed after the ^.
    3. \d stands for any digit from 0 to 9.
    4. . stands for a literal dot here. Inside square brackets it needs no escaping; outside them, an unescaped . matches any single character.

    Together, this means: find all characters other than digits and the dot and replace them with an empty string, i.e. remove them.
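
The same pattern can also be tried on plain Python strings with the standard re module, which shows that regular expressions work on text and the values do not need to be int or float first. A minimal sketch; keep_number and NOT_NUMERIC are hypothetical names chosen for illustration.

```python
import re

# Matches any single character that is NOT a digit or a dot.
NOT_NUMERIC = re.compile(r"[^\d.]")


def keep_number(text: str) -> str:
    """Return text with everything except digits and dots removed."""
    return NOT_NUMERIC.sub("", text)


# Sample values copied from the question.
samples = ["3.5 ", "3 Baths", "2 Baths", "8 ", "2"]
cleaned = [keep_number(s) for s in samples]
print(cleaned)  # → ['3.5', '3', '2', '8', '2']
```

The results are still strings; float(cleaned[0]) or a similar conversion is needed only if arithmetic is to be done with them.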

    PS: There are services that let you test a RegEx and explain how it works. For example: https://regex101.com/r/3MJlZ8/1

    https://docs.python.org/3/library/re.html


