Strip/trim all strings of a DataFrame

itgroup 2023. 7. 17. 20:52

I want to clean the values of a multi-type DataFrame in python/pandas by trimming the strings. I am currently doing it in two instructions:

import pandas as pd

df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])

df.replace(r'^\s+', '', regex=True, inplace=True)  # leading whitespace
df.replace(r'\s+$', '', regex=True, inplace=True)  # trailing whitespace

df.values

This is quite slow. What could I improve?

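You can use select_dtypes to select the string columns and then apply the str.strip function to them.

Note: the values must not be types such as dicts or lists, because columns holding them also have dtype object.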

df_obj = df.select_dtypes(['object'])
print (df_obj)
       0
0    a  
1    c  

df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
print (df)

   0   1
0  a  10
1  c   5

But if there are only a few columns, use str.strip on each one directly:

df[0] = df[0].str.strip()
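
The note above about non-string values in object columns can be made concrete. The following is a minimal sketch, assuming a hypothetical object column that mixes strings with an integer: the .str accessor yields NaN for elements that are not strings, while a per-element isinstance check leaves them untouched.

```python
import pandas as pd

# Hypothetical object column that mixes strings with a non-string value.
s = pd.Series(["  a  ", 10, "  c  "], dtype=object)

# The .str accessor returns NaN wherever the element is not a string.
via_accessor = s.str.strip()

# A per-element isinstance check keeps non-string values as they are.
via_check = s.apply(lambda v: v.strip() if isinstance(v, str) else v)

print(via_accessor.tolist())
print(via_check.tolist())
```

If a column may hold mixed values, the element-wise check is the safer choice; for columns that hold only strings, .str.strip() is the simpler one.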

Money Shot

Here is a compact version that uses applymap with a simple lambda expression, calling strip only when the value is a string:

df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

Full Example

A more complete example:

import pandas as pd


def trim_all_columns(df):
    """
    Trim whitespace from ends of each value across all series in dataframe
    """
    trim_strings = lambda x: x.strip() if isinstance(x, str) else x
    return df.applymap(trim_strings)


# simple example of trimming whitespace from data elements
df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])
df = trim_all_columns(df)
print(df)


>>>
   0   1
0  a  10
1  c   5
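
As far as I know, applymap was deprecated in pandas 2.1 in favor of DataFrame.map, so the same idea may need a small adjustment on newer versions. The sketch below is one way to handle both; trim_all_cells is a hypothetical name, not part of the original answer.

```python
import pandas as pd


def trim_all_cells(df):
    """Strip string cells element-wise; non-string cells pass through unchanged."""
    def strip_if_str(value):
        return value.strip() if isinstance(value, str) else value

    # Prefer DataFrame.map where the installed pandas provides it,
    # otherwise fall back to the older applymap.
    if hasattr(pd.DataFrame, "map"):
        return df.map(strip_if_str)
    return df.applymap(strip_if_str)


df = pd.DataFrame([["  a  ", 10], ["  c  ", 5]])
trimmed = trim_all_cells(df)
print(trimmed)
```

The function returns a new DataFrame and leaves the input unchanged, matching the behavior of trim_all_columns above.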

Working Example

Here is a working example hosted on Trinket: https://trinket.io/python3/e6ab7fb4ab

You can try:

df[0] = df[0].str.strip()

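or, more specifically, for all string columns:

non_numeric_columns = list(set(df.columns) - set(df._get_numeric_data().columns))
# Use the vectorized .str.strip() per column; str(x) would stringify the whole Series
df[non_numeric_columns] = df[non_numeric_columns].apply(lambda x: x.str.strip())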

If you really want to use a regex:

>>> df.replace(r'(^\s+|\s+$)', '', regex=True, inplace=True)
>>> df
   0   1
0  a  10
1  c   5

But it should be faster to do it like this:

>>> df[0] = df[0].str.strip()

You can use the apply function of the Series object:

>>> df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])
>>> df[0][0]
'  a  '
>>> df[0] = df[0].apply(lambda x: x.strip())
>>> df[0][0]
'a'

Note the use of strip rather than a regex, which is much faster.

Another option is to use the apply function of the DataFrame object:

>>> df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])
>>> df.apply(lambda x: x.apply(lambda y: y.strip() if isinstance(y, str) else y), axis=0)

   0   1
0  a  10
1  c   5

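Strip alone does not remove extra whitespace inside a string. To handle that, first replace each run of one or more whitespace characters with a single space, then strip. Together, these two steps remove both the extra inner whitespace and the leading and trailing whitespace.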

# Import packages
import re 

# First inspect the dtypes of the dataframe
df.dtypes

# First collapse each run of whitespace into a single space (inner and outer runs alike)
df = df.applymap(lambda x: re.sub(r'\s+', ' ', x) if isinstance(x, str) else x)


# Then strip leading and trailing whitespace from object (string) columns only;
# isinstance(x, object) is always True and would fail on numeric columns
df = df.apply(lambda x: x.str.strip() if x.dtype == object else x)
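
The same two steps can also be done with vectorized string methods instead of re.sub. This is a rough sketch under some assumptions: normalize_whitespace and the sample data are made up for illustration, and columns are treated as text when their dtype is object or a pandas string dtype.

```python
import pandas as pd


def normalize_whitespace(frame):
    """Collapse whitespace runs to one space, then strip both ends."""
    out = frame.copy()
    for col in out.columns:
        dtype = out[col].dtype
        # Treat object and pandas string dtypes as text columns (an assumption:
        # non-string values in an object column would become NaN here).
        if dtype == object or isinstance(dtype, pd.StringDtype):
            out[col] = out[col].str.replace(r"\s+", " ", regex=True).str.strip()
    return out


# Hypothetical sample data with inner and outer whitespace.
df = pd.DataFrame({"name": ["  John   Smith ", "Ann\t Lee"], "age": [30, 25]})
clean = normalize_whitespace(df)
print(clean)
```

Passing regex=True explicitly matters because the default for Series.str.replace differs between pandas versions.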

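@jezrael's answer looks good. However, if you also want the other columns (numeric, integer, etc.) in the final result, you have to merge the output back into the original DataFrame.

In that case, you can use this approach instead: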

df = df.apply(lambda x: x.str.strip() if x.dtype.name == 'object' else x, axis=0)

Thanks!

Benchmarks for the best answers:

bm = Benchmark()
df = pd.read_excel(
    path, 
    sheet_name=advantage_sheet_name, 
    parse_dates=True
)
bm.mark('Loaded')

# @jezrael 's answer (accepted answer)
dfClean_1 = df\
    .select_dtypes(['object'])\
    .apply(lambda x: x.str.strip())
bm.mark('Clean method 1')

# @Jonathan B. answer 
dfClean_2 = df\
    .applymap(lambda x: x.strip() if isinstance(x, str) else x)
bm.mark('Clean method 2')

#@MaxU - stop genocide of UA / @Roman Pekar answer 
dfClean_3 = df\
    .replace(r'\s*(.*?)\s*', r'\1', regex=True)
bm.mark('Clean method 3')

Results:

145.734375 - 145.734375 : Loaded
147.765625 - 2.03125 : Clean method 1
155.109375 - 7.34375 : Clean method 2
288.953125 - 133.84375 : Clean method 3
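
The Benchmark helper and the Excel file used above are not shown, so the numbers cannot be reproduced as posted. A rough way to repeat the comparison, assuming synthetic data and the standard timeit module (all function names here are hypothetical), might look like this:

```python
import timeit

import pandas as pd

# Synthetic data (an assumption; the original benchmark read an Excel file).
df = pd.DataFrame({"text": ["  a  ", "  c  "] * 5000, "num": [10, 5] * 5000})


def strip_vectorized(frame):
    # Method 1: vectorized .str.strip() on the text column only.
    out = frame.copy()
    out["text"] = out["text"].str.strip()
    return out


def strip_elementwise(frame):
    # Method 2: element-wise strip with an isinstance check.
    func = lambda v: v.strip() if isinstance(v, str) else v
    if hasattr(pd.DataFrame, "map"):
        return frame.map(func)
    return frame.applymap(func)


def strip_regex(frame):
    # Method 3: regex replacement anchored at both ends of each string.
    return frame.replace(r"^\s+|\s+$", "", regex=True)


for fn in (strip_vectorized, strip_elementwise, strip_regex):
    seconds = timeit.timeit(lambda: fn(df), number=3)
    print(f"{fn.__name__}: {seconds:.3f}s")
```

Timings depend on the data and the pandas version, so treat any result as indicative only. Note that the regex here is anchored, unlike the pattern in method 3 above.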

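How about this (for string columns):

df[col] = df[col].str.replace(" ","")

It never fails, but note that it removes every space, including those inside the string.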

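def trim(x):
    # Strip only object (string) columns; other dtypes are returned unchanged.
    # split(' ').str[0] would return '' for values with leading spaces.
    if x.dtype == object:
        x = x.str.strip()
    return x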

df = df.apply(trim)

Source URL: https://stackoverflow.com/questions/40950310/strip-trim-all-strings-of-a-dataframe
