Advanced Techniques That Supercharge Your Python Classes

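Python classes let you create objects that encapsulate complex data structures, workflows, pipelines, algorithms, or machine learning models. Object-oriented programming (OOP) brings modularity and reusability, which lets data scientists and machine learning engineers build flexible, scalable codebases. Structuring code into classes and objects also pays off when you revisit it later, whether to add new features, modify existing ones, or fix bugs.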

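Python has three common kinds of methods: instance methods, static methods, and class methods.

Instance methods are defined with self as the first parameter. They receive the instance implicitly, which lets them interact with its attributes. Because they can read and modify the data and configuration stored on the instance, they can carry out complex computations and logic while staying readable and maintainable.

Static methods are defined with the @staticmethod decorator. They belong to the class rather than to an instance and receive neither self nor cls, so they cannot reach an instance or its attributes through self. They are typically used for utility functions that make sense in the context of a particular class.

Finally, there are class methods, which are bound to the class rather than to an instance of it. They can modify class-level state, and that change applies to every instance. This article focuses on class methods and the extra OOP leverage they bring to our code. I will share a few tips aimed specifically at data science and machine learning applications, and I hope you can work them into your daily workflow.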
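To make the distinction concrete, here is a minimal sketch that puts the three kinds of methods side by side. The Report class and its method names are made up for illustration and are not part of the examples that follow.

```python
class Report:
    title = "Untitled"  # class attribute, shared by every instance

    def __init__(self, rows):
        self.rows = rows  # instance attribute

    def row_count(self):
        """Instance method: receives the instance as `self`."""
        return len(self.rows)

    @staticmethod
    def is_valid_row(row):
        """Static method: receives neither the instance nor the class."""
        return isinstance(row, dict) and len(row) > 0

    @classmethod
    def describe(cls):
        """Class method: receives the class itself as `cls`."""
        return f"{cls.__name__} titled {cls.title!r}"


report = Report(rows=[{"a": 1}, {"a": 2}])
print(report.row_count())       # 2
print(Report.is_valid_row({}))  # False
print(Report.describe())        # Report titled 'Untitled'
```

Note that the static method and the class method are both called on the class, with no instance needed.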

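What is a class method?

A practical example is a singleton class. Singleton is a design pattern that restricts a class to a single instance. Here is one implementation: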

 
class Singleton:
    """The famous Singleton class that can only have one instance."""

    _instance = None

    def __new__(cls, *args, **kwargs):
        """Create a new instance of the class if one does not already exist."""
        if cls._instance is not None:
            raise Exception("Singleton class can only have one instance.")
        # object.__new__ accepts only the class, so extra arguments are not forwarded
        cls._instance = super().__new__(cls)
        return cls._instance

# Create the first instance
single = Singleton()

# Create the second instance
double = Singleton()
 
# Error Output:
# Exception: Singleton class can only have one instance.

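Here the code fails when it tries to instantiate double, because it checks the class attribute _instance and finds that a Singleton instance already exists. We can confirm this by inspecting that attribute directly: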

 
Singleton._instance
# Output:
# <__main__.Singleton at 0x7f7c10491f30>

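But how did the state of the whole Singleton class change?

In the definition of __new__, the first parameter is cls, which stands for the class object. Strictly speaking, __new__ is a static method that Python special-cases so that it receives the class as its first argument, but for our purposes it behaves like a class method: it can change the state of the whole class, whereas a typical instance method only changes the state of one particular instance. So when we create the first instance (which calls __new__ implicitly), we can record on the class itself that it has already been used once.

The whole idea behind class methods is to let you define methods that are bound to the class itself rather than to its instances, which makes it possible to modify the behavior of the class and makes it more flexible.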
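A small sketch can show what "bound to the class" means in practice. The Pipeline class and its verbose flag are hypothetical. The point is only that a change made through a class method is seen by instances created both before and after the call.

```python
class Pipeline:
    verbose = False  # class-level flag shared by all instances

    @classmethod
    def set_verbose(cls, flag):
        """Change the flag on the class, not on one instance."""
        cls.verbose = flag

    def run(self):
        # self.verbose falls back to the class attribute when the
        # instance has no attribute of its own with that name.
        return "running (verbose)" if self.verbose else "running"


first = Pipeline()           # created before the change
Pipeline.set_verbose(True)
second = Pipeline()          # created after the change

print(first.run())   # running (verbose)
print(second.run())  # running (verbose)
```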

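In data science and machine learning this flexibility is invaluable. Class methods offer cleaner alternatives for instantiating classes that manage data processing, model configuration, or database connections, and they ultimately lead to more concise, maintainable, and scalable code.

Here are some practical use cases where @classmethod is especially useful.

Class methods are bound to the class itself rather than to instances of the class. They can change the state of the class, and the change applies to all current and future instances.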

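How to use class methods in data projects

1. Alternative constructors for data processors

Data processing classes are among the most typical classes in data projects and pipelines. Imagine you have a class called DataProcessor that handles a list of complicated data processing tasks. You would usually create an instance by initializing it with data that is already in memory and then process that data.

It looks like this: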

 
class DataProcessor:
    def __init__(self, data):
        self.data = data  # take data in from memory
 
    def process_data(self):
        # complicated code to process data in memory
        ...
 
# Using the class with initial data in memory
processor = DataProcessor(data=data)
processor.process_data()

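Now imagine you want to make the class more flexible so that it can read a CSV file from disk. If you simply add a method that reads the file, instantiation becomes awkward: you have to create the object with an empty data argument and then run the loading method to overwrite it.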

 
import pandas as pd

class DataProcessor:
    def __init__(self, data):
        self.data = data  # take data in from memory
 
    def process_data(self):
        # complicated code to process data in memory
        ...
    
    def from_csv(self, filepath):
        self.data = pd.read_csv(filepath)
 
# Using the class without initial data in memory
processor = DataProcessor(data=None)
processor.from_csv("path_to_your_file.csv")
processor.process_data()

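This works, but it is inefficient, verbose, and frankly ugly.

A better approach is to use @classmethod and define a class method from_csv() as an alternative constructor. It accepts a different input (a filepath instead of in-memory data), so we can create a DataProcessor instance directly from a CSV file. It looks like this: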

 
import pandas as pd

class DataProcessor:
    def __init__(self, data):
        self.data = data
 
    def process_data(self):
        ...
 
    @classmethod
    def from_csv(cls, filepath):
        data = pd.read_csv(filepath)
        return cls(data)
 
# Instantiating and using the class with classmethod
processor = DataProcessor.from_csv("path_to_your_file.csv")
processor.process_data()

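See how .from_csv() makes instantiation cleaner and more efficient? It is like having a secret window into the class: depending on your use case, you decide whether the data comes in through the door or the window (that is, whether the class takes data from memory, the default, or from a file path).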
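One detail worth spelling out is why the alternative constructor returns cls(data) rather than DataProcessor(data). The sketch below uses a hypothetical from_records constructor and a hypothetical TimeSeriesProcessor subclass, with plain lists instead of pandas so it runs on its own. Because cls is whichever class the method was called on, subclasses inherit a constructor that builds the right type.

```python
class DataProcessor:
    def __init__(self, data):
        self.data = data

    @classmethod
    def from_records(cls, records):
        """Hypothetical alternative constructor: build from any iterable of rows."""
        return cls(list(records))


class TimeSeriesProcessor(DataProcessor):
    def latest(self):
        """Return the most recent row."""
        return self.data[-1]


# The inherited constructor returns the subclass, not the base class.
ts = TimeSeriesProcessor.from_records(iter([1, 2, 3]))
print(type(ts).__name__)  # TimeSeriesProcessor
print(ts.latest())        # 3
```

Had from_records hard-coded DataProcessor(...), the call above would return a plain DataProcessor and ts.latest() would fail.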

This idea of alternative constructors can of course be extended. For example, to load data from a parquet file, add another class method for it.

 
import pandas as pd

class DataProcessor:
    def __init__(self, data):
        self.data = data
 
    def process_data(self):
        ...
 
    @classmethod
    def from_csv(cls, filepath):
        ...
 
    @classmethod
    def from_parquet(cls, filepath):
        data = pd.read_parquet(filepath)
        return cls(data)

We could also define a from_file factory method that detects the type of the incoming file and calls the matching loader.
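The article does not show from_file, so the following is only one possible sketch. It assumes the file type can be inferred from the file extension, and it imports pandas inside the loaders so the class itself can be defined without pandas installed. The dispatch table and the error message are illustrative choices.

```python
from pathlib import Path


class DataProcessor:
    def __init__(self, data):
        self.data = data

    @classmethod
    def from_csv(cls, filepath):
        import pandas as pd  # imported lazily; only needed when loading
        return cls(pd.read_csv(filepath))

    @classmethod
    def from_parquet(cls, filepath):
        import pandas as pd
        return cls(pd.read_parquet(filepath))

    @classmethod
    def from_file(cls, filepath):
        """Pick a loader based on the file extension (an assumed convention)."""
        loaders = {".csv": cls.from_csv, ".parquet": cls.from_parquet}
        suffix = Path(filepath).suffix.lower()
        if suffix not in loaders:
            raise ValueError(f"Unsupported file type: {suffix!r}")
        return loaders[suffix](filepath)
```

With this in place, DataProcessor.from_file("path_to_your_file.csv") would route to from_csv, and an unknown extension fails early with a clear error instead of a confusing parser exception.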

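2. Alternative constructors for model wrappers

The idea of alternative constructors extends easily to ML model wrappers. Suppose you have a model class called MyXGBModel that wraps the XGBoost library. It takes some parameters, initializes a model, and might handle training, evaluation, and other routine modeling tasks.

Below is a basic version that does not change the behavior of the XGBoost model in any interesting way.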

 
import xgboost
 
class MyXGBModel:
    def __init__(self, learning_rate=0.1, n_estimators=100, max_depth=3):
        self.learning_rate = learning_rate
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.model = self._create_model()
 
    def _create_model(self):
        model = xgboost.XGBClassifier(learning_rate=self.learning_rate,
                                      n_estimators=self.n_estimators,
                                      max_depth=self.max_depth)
        return model
 
# Usage example
model = MyXGBModel(learning_rate=0.05, n_estimators=200, max_depth=5)

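To initialize this model you usually pass a handful of arguments, as in the example above. Alternatively, you can pass them as a parameter dictionary like this: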

 
# Initializing with a dictionary of parameters
params = {'learning_rate': 0.05, 'n_estimators': 200, 'max_depth': 5}
model_from_dict = MyXGBModel(**params)

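What if the parameters live in a JSON config file rather than in an in-memory dictionary? One option is to load the JSON file, build the parameters, and pass them to the model.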

 
import json
 
# Load parameters from a json config file
with open('config.json', 'r') as file:
    params = json.load(file)
  
# the contents of params is the same  
# params = {'learning_rate': 0.05, 'n_estimators': 200, 'max_depth': 5}
 
# Now initializing with a dictionary of parameters
model_from_dict = MyXGBModel(**params)

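Again, this is a little ugly and overly verbose. Worse, it relies on a block of code outside the class. So let's define a class method to simplify and improve the process.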

 
class MyXGBModel:
    def __init__(self, learning_rate=0.1, n_estimators=100, max_depth=3):
        ...
 
    def _create_model(self):
        ...
 
    @classmethod
    def from_config_file(cls, file_path):
        import json
        with open(file_path, 'r') as file:
            config = json.load(file)
        return cls(**config)
 
# Usage
model_from_config = MyXGBModel.from_config_file('config.json')

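Doesn't that look nicer?

With the class method we pull all the parameters from the file in one step. The alternative constructor uses the parameters from the config file directly and removes the boilerplate outside the class. The new implementation is more concise, more direct, more maintainable, and easier for other developers to understand.

We can take this technique further and create convenience methods that load preconfigured models, which simplifies model initialization. The next section shows how.

3. Preconfigured models

The idea of initializing a model from a config file can be generalized to predefined settings for specific scenarios, so that a model can be initialized quickly. Say we want a quick_start model for fast iteration and a high_accuracy model for more demanding tasks. Class methods can provide these preconfigured options.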

 
class MyXGBModel:
    def __init__(self, learning_rate=0.1, n_estimators=100, max_depth=3):
        ...

    def _create_model(self):
        ...

    @classmethod
    def from_config_file(cls, file_path):
        ...

    @classmethod
    def quick_start(cls):
        default_params = {'learning_rate': 0.05, 'n_estimators': 100, 'max_depth': 4}
        return cls(**default_params)

    @classmethod
    def high_accuracy(cls):
        high_acc_params = {'learning_rate': 0.01, 'n_estimators': 500, 'max_depth': 10}
        return cls(**high_acc_params)

# Usage
quick_start_model = MyXGBModel.quick_start()
high_accuracy_model = MyXGBModel.high_accuracy()

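Class methods provide an elegant way to initialize a model with preconfigured settings. By defining one class method per preset, we can call it directly without specifying every parameter by hand. This greatly reduces boilerplate and makes the code more concise, readable, and maintainable.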
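If the number of presets grows, one possible variation is a single class method that looks presets up by name and lets the caller override individual values. This is a sketch rather than the article's design: ModelConfig is a stand-in that only stores hyperparameters (no XGBoost required), and PRESETS and from_preset are hypothetical names. The preset values are the ones used above.

```python
class ModelConfig:
    """Stand-in for the model wrapper: it only stores hyperparameters."""

    # Hypothetical preset table, reusing the values from the presets above.
    PRESETS = {
        "quick_start": {"learning_rate": 0.05, "n_estimators": 100, "max_depth": 4},
        "high_accuracy": {"learning_rate": 0.01, "n_estimators": 500, "max_depth": 10},
    }

    def __init__(self, learning_rate=0.1, n_estimators=100, max_depth=3):
        self.learning_rate = learning_rate
        self.n_estimators = n_estimators
        self.max_depth = max_depth

    @classmethod
    def from_preset(cls, name, **overrides):
        """Start from a named preset and apply any keyword overrides."""
        if name not in cls.PRESETS:
            raise ValueError(f"Unknown preset {name!r}; choose from {sorted(cls.PRESETS)}")
        return cls(**{**cls.PRESETS[name], **overrides})


model = ModelConfig.from_preset("quick_start", max_depth=6)
print(model.learning_rate, model.n_estimators, model.max_depth)  # 0.05 100 6
```

The trade-off is discoverability: named methods such as quick_start() show up in autocompletion, while a lookup table is easier to extend.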

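These class methods work much like the presets on a digital camera (landscape, portrait, night mode, and so on). You can still set aperture and shutter speed manually for a custom shot, but a preset pins those settings to values that suit a particular situation. The presets give us a starting point for modeling; in practice we would still run a proper grid search and hyperparameter tuning to find the best parameter set for a predictive model.

4. Dev and prod configurations for a database connector

Let's look at another practical use case for class methods: a database connector class. A caveat: I am not a database or DevOps engineer, and the code shown here may not be the most secure way to establish a database connection. If you plan to use this design pattern at work, consult your engineering team and follow their standards.

Next, let's see how to use class methods to build a typical database connector with predefined configurations for development (dev) and production (prod) environments.

Setting up and managing database connectors across multiple environments is always tricky, because each environment usually has its own settings. Switching between environments, especially in notebooks, can lead to many errors and inconsistencies. On top of that, if you handle sensitive data, database credentials need extra care for security reasons.

An elegant solution is to define a separate class method for each mode (dev and prod) that uses environment variables to start the connector with the right settings. This provides consistency and also improves the readability and robustness of the code.

To illustrate, let's define a DatabaseConnector class with class methods that set up connections for the dev and prod environments.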

 
import os
from getpass import getpass

class DatabaseConnector:
    def __init__(self, account, user, password):
        self.account = account
        self.user = user
        self._password = password  # Keep password private
        self._connection = None
 
    @classmethod
    def development_config(cls):
        return cls(
            account=os.environ.get("DEV_DB_ACCOUNT"),
            user=os.environ.get("DEV_DB_USER"),
            password=os.environ.get("DEV_DB_PASSWORD") or getpass("Enter Dev DB Password: ")
        )
 
    @classmethod
    def production_config(cls):
        return cls(
            account=os.environ.get("PROD_DB_ACCOUNT"),
            user=os.environ.get("PROD_DB_USER"),
            password=os.environ.get("PROD_DB_PASSWORD") or getpass("Enter Prod DB Password: ")
        )
 
    @property
    def params(self):
        # Exclude password from the public parameters
        return {
            "account": self.account,
            "user": self.user
        }
 
    def initialize_connection(self):
        if not self._connection:
            # create_database_connection is a placeholder: define it for your
            # specific database (e.g. postgres, snowflake, redshift, etc.)
            self._connection = create_database_connection(self.account, self.user, self._password)
 
    def query_data(self, query):
        if not self._connection:
            raise Exception("Database connection not initialized")
        return execute_query(self._connection, query)

To use this handy database connector, just instantiate the class through the appropriate class method:

 
# Dev Database
dev_db_conn = DatabaseConnector.development_config()
dev_db_conn.initialize_connection()
dev_data = dev_db_conn.query_data("SELECT * FROM dev_table_name")
 
# Prod Database
prod_db_conn = DatabaseConnector.production_config()
prod_db_conn.initialize_connection()
prod_data = prod_db_conn.query_data("SELECT * FROM prod_table_name")

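The only thing you need to do before running these lines is to store the database account, username, and password as environment variables once. The class takes care of the rest.

The class methods act as a switch between the dev and prod environments, with no need to reconfigure the whole database connection. That removes the boilerplate otherwise needed to access the data and saves time.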
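Since the dev and prod constructors differ only in the prefix of the environment variable names, one possible refinement is to route both through a shared helper. The sketch below is an assumption-laden variation and not the connector shown earlier: from_env is a hypothetical name, the connection logic is omitted, and missing variables raise an error instead of falling back to a getpass prompt.

```python
import os


class DatabaseConnector:
    def __init__(self, account, user, password):
        self.account = account
        self.user = user
        self._password = password  # kept out of the public attributes

    @classmethod
    def from_env(cls, prefix):
        """Hypothetical helper: read <prefix>_DB_ACCOUNT, _USER and _PASSWORD."""
        fields = ("account", "user", "password")
        names = {field: f"{prefix}_DB_{field.upper()}" for field in fields}
        missing = [name for name in names.values() if not os.environ.get(name)]
        if missing:
            raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
        return cls(**{field: os.environ[name] for field, name in names.items()})

    @classmethod
    def development_config(cls):
        return cls.from_env("DEV")

    @classmethod
    def production_config(cls):
        return cls.from_env("PROD")


# Demo values only; real credentials would be set outside the code.
os.environ.update({
    "DEV_DB_ACCOUNT": "acme",
    "DEV_DB_USER": "analyst",
    "DEV_DB_PASSWORD": "not-a-real-secret",
})
dev = DatabaseConnector.development_config()
print(dev.account, dev.user)  # acme analyst
```

Failing loudly on a missing variable suits non-interactive jobs; in a notebook, the getpass fallback from the original class may be more convenient.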
