1 Principles

1.1 B-trees

(1) m-way search trees

An m-way search tree is either an empty tree or a tree satisfying the following properties:

A node has at most m subtrees; a node holding n keys has the structure n, A0, (K1, A1), (K2, A2), ..., (Kn, An), where each Ai is a pointer to a subtree and each Ki is a key, with K1 < K2 < ... < Kn. All keys in subtree Ai (0 < i < n) are greater than Ki and smaller than Ki+1; all keys in subtree An are greater than Kn; all keys in subtree A0 are smaller than K1; and every subtree Ai is itself an m-way search tree.

(2) B-trees

A B-tree of order m is an m-way search tree that is either empty or satisfies the following properties: every node has at most m subtrees; the root has at least two subtrees; every non-terminal node other than the root has at least ⌈m/2⌉ subtrees; and all leaf nodes are on the same level.

1.2 Steps

For a step-by-step simulation of the procedure, see https://www.cnblogs.com/pinard/p/6179132.html (reference material).

BIRCH can recognize that the data in a data set are unevenly distributed: points lying in dense regions are clustered, while points lying in sparse regions are removed as outliers. In addition, BIRCH is an incremental clustering method: the clustering decision for each point is based only on the data points processed so far, not on the entire data set.

① Build a clustering feature (CF) tree. Scan all the data once and, within the given amount of memory and the recycled space on disk, build an initial in-memory CF tree that reflects the clustering information of the data set. Dense data are grouped into finer subclusters, while sparse data points are removed as outliers.

② Condense the CF tree (optional). This part bridges step ① and step ③. Similar to step ①, it rescans the leaf entries of the initial CF tree, removing more outliers and grouping the remaining entries into a smaller tree.

③ Global clustering. Run a global or semi-global clustering algorithm over all the leaf entries. An existing algorithm for clustering data points adapts easily to a set of subclusters, each represented by its clustering feature vector: compute the centroid of each subcluster and let the centroid stand for the subcluster. This step captures the main distribution pattern of the data.

④ Cluster refinement. Step ③ only summarizes the data coarsely, since the original data have been scanned just once, so the clusters still need refining. Use the cluster centers produced in the previous stage as seeds and redistribute the data points to their nearest seed to obtain a new set of clusters. This not only allows points belonging to a subcluster to migrate, it also ensures that all copies of a given data point end up in the same subcluster. It also offers an option for discarding outliers: if a point is too far from its nearest seed, it can be treated as an outlier and excluded from the result.

2 Parameter description

Function: sklearn.cluster.Birch

Parameters:
threshold (float, default=0.5): the radius of the subcluster obtained by merging a new sample with its closest subcluster must stay below this threshold; otherwise the subcluster is split.
branching_factor (int, default=50): maximum number of CF subclusters in each node.
n_clusters (int, default=3): number of clusters for the final clustering step. If None, the final clustering step is skipped and the subclusters are returned as they are; if an sklearn.cluster estimator is given, that model is fitted on the subclusters, treated as new samples.
compute_labels (bool, default=True): whether to compute labels for each fit.
copy (bool, default=True): whether to copy the given data; if set to False, the initial data may be overwritten.

Attributes:
root_: root of the CF tree.
dummy_leaf_: start pointer to all the leaf nodes.
subcluster_centers_: centroids of all subclusters read directly from the leaves.
subcluster_labels_: labels assigned to the subcluster centroids after the global clustering step.
labels_: labels of all the input data.

3 Implementation

See the scikit-learn example: https://scikit-learn.org/stable/auto_examples/cluster/plot_birch_vs_minibatchkmeans.html#sphx-glr-auto-examples-cluster-plot-birch-vs-minibatchkmeans-py
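To make the parameters above concrete, here is a minimal usage sketch; the blob data and the parameter values are illustrative choices of mine, not taken from the scikit-learn example:

import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Illustrative data: 1000 points drawn around 5 centers.
X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

# threshold bounds the subcluster radius, branching_factor caps the number of
# CF subclusters per node, n_clusters drives the final (global) clustering step.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = model.fit_predict(X)

print(model.subcluster_centers_.shape)      # centroids of the leaf subclusters
print(np.unique(model.subcluster_labels_))  # labels from the global step
print(np.unique(labels))                    # labels of the input samples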
4 Source code analysis

The source is in Anaconda3/Lib/site-packages/sklearn/cluster/birch.py.

(1) Prerequisite knowledge

The hasattr() function checks whether a class instance has an attribute or method with the given name and returns True or False: hasattr(obj, name), where obj is the instance and name is the attribute or method name.

The getattr() function returns the value of a named attribute of an instance: getattr(obj, name[, default]), where obj is the instance, name is the attribute name, and default is an optional fallback return value used when the lookup fails; without default a failed lookup raises AttributeError, otherwise the function returns the value given as default.

The setattr() function is somewhat more versatile: its basic use is to change the value of an attribute on an instance, but it can also dynamically add attributes or methods to the instance.

(2) The Birch class

Birch(BaseEstimator, TransformerMixin, ClusterMixin) inherits from mixins defined in sklearn's base module. fit delegates the core computation to _fit:

def fit(self, X, y=None):
    """Build a CF Tree for the input data.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Input data.

    y : Ignored
        Not used, present here for API consistency by convention.

    Returns
    -------
    self
        Fitted estimator.
    """
    self.fit_, self.partial_fit_ = True, False
    return self._fit(X)

def _fit(self, X):
    X = self._validate_data(X, accept_sparse='csr', copy=self.copy)
    threshold = self.threshold
    branching_factor = self.branching_factor

    if branching_factor <= 1:
        raise ValueError("Branching_factor should be greater than one.")
    n_samples, n_features = X.shape

    # If partial_fit is called for the first time or fit is called, we
    # start a new tree.
    partial_fit = getattr(self, 'partial_fit_')
    has_root = getattr(self, 'root_', None)
    if getattr(self, 'fit_') or (partial_fit and not has_root):
        # The first root is the leaf. Manipulate this object throughout.
        self.root_ = _CFNode(threshold=threshold,
                             branching_factor=branching_factor,
                             is_leaf=True,
                             n_features=n_features)

        # To enable getting back subclusters.
        self.dummy_leaf_ = _CFNode(threshold=threshold,
                                   branching_factor=branching_factor,
                                   is_leaf=True, n_features=n_features)
        self.dummy_leaf_.next_leaf_ = self.root_
        self.root_.prev_leaf_ = self.dummy_leaf_

    # Cannot vectorize. Enough to convince to use cython.
    if not sparse.issparse(X):
        iter_func = iter
    else:
        iter_func = _iterate_sparse_X

    # Iterate over the data, building the subclusters.
    for sample in iter_func(X):
        subcluster = _CFSubcluster(linear_sum=sample)
        split = self.root_.insert_cf_subcluster(subcluster)

        # If the root has to be split, the split routine returns two
        # subclusters, which are appended to a new root.
        if split:
            new_subcluster1, new_subcluster2 = _split_node(
                self.root_, threshold, branching_factor)
            del self.root_
            self.root_ = _CFNode(threshold=threshold,
                                 branching_factor=branching_factor,
                                 is_leaf=False,
                                 n_features=n_features)
            self.root_.append_subcluster(new_subcluster1)
            self.root_.append_subcluster(new_subcluster2)

    # Collect the centroids stored in the leaf nodes.
    centroids = np.concatenate([
        leaf.centroids_ for leaf in self._get_leaves()])
    self.subcluster_centers_ = centroids

    self._global_clustering(X)
    return self

Other functions:

Iterating over a sparse matrix (yields a densified row per sample):

def _iterate_sparse_X(X):
    """This little hack returns a densified row when iterating over a sparse
    matrix, instead of constructing a sparse matrix for every row that is
    expensive.
    """
    n_samples = X.shape[0]
    X_indices = X.indices
    X_data = X.data
    X_indptr = X.indptr

    for i in range(n_samples):
        row = np.zeros(X.shape[1])
        startptr, endptr = X_indptr[i], X_indptr[i + 1]
        nonzero_indices = X_indices[startptr:endptr]
        row[nonzero_indices] = X_data[startptr:endptr]
        yield row

Splitting a node: two empty subclusters and two CF nodes are created, and each node is attached as the child of one subcluster; if the node being split is a leaf, the leaf pointers are relinked; then the pairwise distances between the subcluster centroids (using their squared norms) are computed, the two subclusters farthest apart are chosen as seeds, and every subcluster is assigned to whichever seed it is closer to.

def _split_node(node, threshold, branching_factor):
    """The node has to be split if there is no place for a new subcluster
    in the node.
    1. Two empty nodes and two empty subclusters are initialized.
    2. The pair of distant subclusters are found.
    3. The properties of the empty subclusters and nodes are updated
       according to the nearest distance between the subclusters to the
       pair of distant subclusters.
    4. The two nodes are set as children to the two subclusters.
    """
    new_subcluster1 = _CFSubcluster()
    new_subcluster2 = _CFSubcluster()
    new_node1 = _CFNode(
        threshold=threshold, branching_factor=branching_factor,
        is_leaf=node.is_leaf,
        n_features=node.n_features)
    new_node2 = _CFNode(
        threshold=threshold, branching_factor=branching_factor,
        is_leaf=node.is_leaf,
        n_features=node.n_features)
    new_subcluster1.child_ = new_node1
    new_subcluster2.child_ = new_node2

    if node.is_leaf:
        if node.prev_leaf_ is not None:
            node.prev_leaf_.next_leaf_ = new_node1
        new_node1.prev_leaf_ = node.prev_leaf_
        new_node1.next_leaf_ = new_node2
        new_node2.prev_leaf_ = new_node1
        new_node2.next_leaf_ = node.next_leaf_
        if node.next_leaf_ is not None:
            node.next_leaf_.prev_leaf_ = new_node2

    dist = euclidean_distances(
        node.centroids_, Y_norm_squared=node.squared_norm_, squared=True)
    n_clusters = dist.shape[0]

    farthest_idx = np.unravel_index(
        dist.argmax(), (n_clusters, n_clusters))
    node1_dist, node2_dist = dist[(farthest_idx,)]

    node1_closer = node1_dist < node2_dist
    for idx, subcluster in enumerate(node.subclusters_):
        if node1_closer[idx]:
            new_node1.append_subcluster(subcluster)
            new_subcluster1.update(subcluster)
        else:
            new_node2.append_subcluster(subcluster)
            new_subcluster2.update(subcluster)
    return new_subcluster1, new_subcluster2

Retrieving the leaf nodes:

def _get_leaves(self):
    """Retrieve the leaves of the CF Node.

    Returns
    -------
    leaves : list of shape (n_leaves,)
        List of the leaf nodes.
    """
    leaf_ptr = self.dummy_leaf_.next_leaf_
    leaves = []
    while leaf_ptr is not None:
        leaves.append(leaf_ptr)
        leaf_ptr = leaf_ptr.next_leaf_
    return leaves

Global clustering: this step brings in the AgglomerativeClustering algorithm (covered separately).

def _global_clustering(self, X=None):
    """Global clustering for the subclusters obtained after fitting"""
    clusterer = self.n_clusters
    centroids = self.subcluster_centers_
    compute_labels = (X is not None) and self.compute_labels

    # Preprocessing for the global clustering.
    not_enough_centroids = False
    if isinstance(clusterer, numbers.Integral):
        clusterer = AgglomerativeClustering(n_clusters=self.n_clusters)
        # There is no need to perform the global clustering step.
        if len(centroids) < self.n_clusters:
            not_enough_centroids = True
    elif (clusterer is not None and not
          hasattr(clusterer, 'fit_predict')):
        raise ValueError("n_clusters should be an instance of "
                         "ClusterMixin or an int")

    # To use in predict to avoid recalculation.
    self._subcluster_norms = row_norms(
        self.subcluster_centers_, squared=True)

    if clusterer is None or not_enough_centroids:
        self.subcluster_labels_ = np.arange(len(centroids))
        if not_enough_centroids:
            warnings.warn(
                "Number of subclusters found (%d) by Birch is less "
                "than (%d). Decrease the threshold."
                % (len(centroids), self.n_clusters), ConvergenceWarning)
    else:
        # The global clustering step that clusters the subclusters of
        # the leaves. It assumes the centroids of the subclusters as
        # samples and finds the final centroids.
        self.subcluster_labels_ = clusterer.fit_predict(
            self.subcluster_centers_)

    if compute_labels:
        self.labels_ = self.predict(X)
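As _global_clustering shows, n_clusters may be None (no global step), an int (an AgglomerativeClustering with that many clusters is built internally), or any estimator exposing fit_predict. A small sketch of the three variants; the data and cluster counts are illustrative assumptions, not something from the original post:

from sklearn.cluster import AgglomerativeClustering, Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# n_clusters=None: skip the global step, subclusters are returned as they are.
raw = Birch(n_clusters=None).fit(X)

# n_clusters as an int: AgglomerativeClustering(n_clusters=4) is used internally.
agg_int = Birch(n_clusters=4).fit(X)

# n_clusters as an estimator with fit_predict: run on the subcluster centroids.
agg_est = Birch(n_clusters=AgglomerativeClustering(n_clusters=4)).fit(X)

# Without the global step every leaf subcluster keeps its own label.
print(len(set(raw.subcluster_labels_)), len(set(agg_int.subcluster_labels_)))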
(3) _CFNode

Parameters / attributes:
threshold (float): threshold used to decide whether a subcluster can absorb a new sample.
branching_factor (int): branching factor.
is_leaf (bool): whether the node is a leaf.
n_features (int): number of features.
subclusters_ (list): subclusters stored in this node.
prev_leaf_ (_CFNode): previous leaf node.
next_leaf_ (_CFNode): next leaf node.
init_centroids_: pre-allocated centroid array, shape (branching_factor + 1, n_features).
init_sq_norm_: pre-allocated squared-norm array, shape (branching_factor + 1,).
centroids_: view on the centroids currently in use.
squared_norm_: view on the squared norms currently in use.

_CFNode consists of three functions.

The first function, append_subcluster(self, subcluster), updates the CF statistics of the node:

def append_subcluster(self, subcluster):
    # Current number of subclusters held by the node.
    n_samples = len(self.subclusters_)
    # Append the new subcluster to the node.
    self.subclusters_.append(subcluster)
    # Store the new subcluster's centroid and squared norm in the
    # pre-allocated arrays.
    self.init_centroids_[n_samples] = subcluster.centroid_
    self.init_sq_norm_[n_samples] = subcluster.sq_norm_

    # Keep centroids and squared norm as views. In this way
    # if we change init_centroids and init_sq_norm_, it is
    # sufficient.
    self.centroids_ = self.init_centroids_[:n_samples + 1, :]
    self.squared_norm_ = self.init_sq_norm_[:n_samples + 1]

The second function, update_split_subclusters(self, subcluster, new_subcluster1, new_subcluster2), updates the node after a split:

def update_split_subclusters(self, subcluster,
                             new_subcluster1, new_subcluster2):
    """Remove a subcluster from a node and update it with the
    split subclusters.
    """
    ind = self.subclusters_.index(subcluster)
    self.subclusters_[ind] = new_subcluster1
    self.init_centroids_[ind] = new_subcluster1.centroid_
    self.init_sq_norm_[ind] = new_subcluster1.sq_norm_
    self.append_subcluster(new_subcluster2)

The third function, insert_cf_subcluster(self, subcluster), inserts a new CF subcluster into the node:

def insert_cf_subcluster(self, subcluster):
    """Insert a new subcluster into the node."""
    # If self.subclusters_ is empty, simply append the new subcluster.
    if not self.subclusters_:
        self.append_subcluster(subcluster)
        return False

    threshold = self.threshold
    branching_factor = self.branching_factor
    # We need to find the closest subcluster among all the
    # subclusters so that we can insert our new subcluster.
    # Compute the distance matrix.
    dist_matrix = np.dot(self.centroids_, subcluster.centroid_)
    dist_matrix *= -2.
    dist_matrix += self.squared_norm_
    closest_index = np.argmin(dist_matrix)
    closest_subcluster = self.subclusters_[closest_index]

    # If the subcluster has a child, we need a recursive strategy
    # to update the CF statistics.
    if closest_subcluster.child_ is not None:
        split_child = closest_subcluster.child_.insert_cf_subcluster(
            subcluster)

        if not split_child:
            # If it is determined that the child need not be split, we
            # can just update the closest_subcluster
            closest_subcluster.update(subcluster)
            self.init_centroids_[closest_index] = \
                self.subclusters_[closest_index].centroid_
            self.init_sq_norm_[closest_index] = \
                self.subclusters_[closest_index].sq_norm_
            return False

        # things not too good. we need to redistribute the subclusters in
        # our child node, and add a new subcluster in the parent
        # subcluster to accommodate the new child.
        else:
            new_subcluster1, new_subcluster2 = _split_node(
                closest_subcluster.child_, threshold, branching_factor)
            self.update_split_subclusters(
                closest_subcluster, new_subcluster1, new_subcluster2)

            if len(self.subclusters_) > self.branching_factor:
                return True
            return False

    # good to go!
    else:
        # If the merged subcluster's radius stays below the threshold,
        # merge and update the CF statistics.
        merged = closest_subcluster.merge_subcluster(
            subcluster, self.threshold)
        if merged:
            self.init_centroids_[closest_index] = \
                closest_subcluster.centroid_
            self.init_sq_norm_[closest_index] = \
                closest_subcluster.sq_norm_
            return False

        # not close to any other subclusters, and we still
        # have space, so add.
        elif len(self.subclusters_) < self.branching_factor:
            self.append_subcluster(subcluster)
            return False

        # We do not have enough space nor is it closer to an
        # other subcluster. We need to split.
        else:
            self.append_subcluster(subcluster)
            return True
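Note that the distance computation above leaves out the constant term: dist_matrix holds ||c_i||^2 - 2 * c_i . x rather than the full squared Euclidean distance. Since the missing ||x||^2 is the same for every centroid, the argmin is unchanged. A small check of this fact (the random data are my own illustration):

import numpy as np

rng = np.random.default_rng(0)
centroids = rng.normal(size=(6, 3))          # stands in for self.centroids_
squared_norm = (centroids ** 2).sum(axis=1)  # stands in for self.squared_norm_
x = rng.normal(size=3)                       # stands in for subcluster.centroid_

partial = squared_norm - 2.0 * centroids @ x   # what insert_cf_subcluster computes
full = ((centroids - x) ** 2).sum(axis=1)      # true squared distances

assert np.argmin(partial) == np.argmin(full)
print(np.argmin(partial), np.argmin(full))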
(4) _CFSubcluster

Parameters / attributes:
linear_sum (ndarray): sample used to initialize the subcluster.
n_samples_ (int): number of samples in the subcluster.
linear_sum_ (ndarray): linear sum of all samples in the subcluster.
squared_sum_ (float): sum of the squared l2 norms of all samples.
centroid_: centroid of the subcluster.
child_: child node.
sq_norm_: squared norm of the subcluster centroid.

_CFSubcluster consists of three functions.

The first function, update(self, subcluster), accumulates the statistics: sample count, linear sum, squared sum, centroid and squared norm:

def update(self, subcluster):
    self.n_samples_ += subcluster.n_samples_
    self.linear_sum_ += subcluster.linear_sum_
    self.squared_sum_ += subcluster.squared_sum_
    self.centroid_ = self.linear_sum_ / self.n_samples_
    self.sq_norm_ = np.dot(self.centroid_, self.centroid_)

The second function, merge_subcluster(self, nominee_cluster, threshold), merges nominee_cluster into the subcluster if the merged radius stays within the threshold:

def merge_subcluster(self, nominee_cluster, threshold):
    """Check if a cluster is worthy enough to be merged. If
    yes then merge.
    """
    new_ss = self.squared_sum_ + nominee_cluster.squared_sum_
    new_ls = self.linear_sum_ + nominee_cluster.linear_sum_
    new_n = self.n_samples_ + nominee_cluster.n_samples_
    new_centroid = (1 / new_n) * new_ls
    new_norm = np.dot(new_centroid, new_centroid)
    dot_product = (-2 * new_n) * new_norm
    sq_radius = (new_ss + dot_product) / new_n + new_norm
    if sq_radius <= threshold ** 2:
        (self.n_samples_, self.linear_sum_, self.squared_sum_,
         self.centroid_, self.sq_norm_) = \
            new_n, new_ls, new_ss, new_centroid, new_norm
        return True
    return False

The third function, radius(self), computes the radius of the subcluster:

def radius(self):
    """Return radius of the subcluster"""
    dot_product = -2 * np.dot(self.linear_sum_, self.centroid_)
    return sqrt(
        ((self.squared_sum_ + dot_product) / self.n_samples_) +
        self.sq_norm_)
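Both merge_subcluster and radius rely on the fact that the squared radius of a subcluster can be computed from its CF triple (n, LS, SS) alone: with centroid c = LS / n, sq_radius = SS / n - ||c||^2, which equals the mean squared distance of the samples to their centroid. A small numeric check (the random samples are my own illustration, not from the post):

import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(size=(20, 4))

n = samples.shape[0]
ls = samples.sum(axis=0)        # linear sum  (linear_sum_)
ss = (samples ** 2).sum()       # squared sum (squared_sum_)
centroid = ls / n
sq_norm = centroid @ centroid   # sq_norm_

sq_radius_cf = ss / n - sq_norm                                     # CF-triple formula
sq_radius_direct = ((samples - centroid) ** 2).sum(axis=1).mean()   # direct definition

assert np.isclose(sq_radius_cf, sq_radius_direct)
print(sq_radius_cf, sq_radius_direct)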