Web Crawler阅读笔记

Posted on 日 09 八月 2015 in Reading • Tagged with python, Crawler

看了下爬虫是怎么设计的，虽然材料有点老旧，但可以学到许多基础知识，WebCrawler

1 Overview

爬虫的目的：快速高效地收集尽可能多的有用的网页。

必须满足的特性：鲁棒性（Robustness），可能会遇到一些页面误导爬虫陷入无限循环中（honeypot）；礼节性（Politeness），遵守一些隐含的或者明确的策略调整访问页面的速率（如robots协议）。

应该满足的特性：分布式地（Distributed），可伸缩（Scalable），高效的（Performance and efficiency），保持最新（Freshness），功能易于扩展（Extensible）

2 Crawling

爬虫从一些种子集（seed set）URL开始爬取内容，将爬取的内容进行解析，抽取出文本放入文本索引器，抽取出url放入URL frontier。之后爬虫从url frontier获取一个url，继续爬取。

2.1 Architecture

basic craler architecture

distributing crawler architecture

2.2 DNS resolution

DNS解析是一个瓶颈。

解决方法：缓存 -> 标准库中的实现通常是同步的，一个请求得等到之前请求完成了才开始执行 -> 实现一个异步版本

2.3 URL frontier

（1）两个重要的考虑：高质量且更新频繁的页面需要优先作为频繁爬取的对象；关于礼节性，需避免在很短的时间间隔内重复爬取一个host。

（2）frontier分成两部分：

frontier

（3）过程：crawler从heap中获取一个队列的可访问时间，等到该时间时，获取该队列（设为j）的头URL（设为u），获取u的内容。结束后，判断这个队列是否为空。若空，则从front queue中获取URL（设为v）来填充它。对于选择哪个front queue，是按照优先级获取的（如一个随机获取的方案，按照优先级的高低，选中的概率不同），保持优先级高的更快流向back queue。先判断back queue是否已经有v的队列，有则放入该队列，并且继续到front queues中获取下一个，直到队列j不空为止。然后更新j的时间，放入heap中。

3 Distributing indexes

两种方式：（1）根据术语（关键字？）分割，即全局索引组织。（2）根据文档分割，即本地索引组织。

前者，将术语划分为各自独立的子集，每个子集存于一个节点。当查询时，把请求发给所有包含查询关键字的节点即可，最后将结果取交集。

后者，将文档划分为子集，每个节点存储一个子集。当查询时，需将请求发往所有节点，最后将结果取并集。

4 Connectivity servers

建图 -> 邻接表 -> 某一行可以由它之前的7行中的一行变换得到或者直接从空行产生，用一个3bits表示选择了哪一行，然后标记元素状态（修改，删除或添加） -> 选择由哪一行得到，需由一个启发函数判断两者的相似度，如果超过阈值则判断为相似，可以由这一行得到 -> 相似度阈值太高，则会导致许多需重新生成的行，太低则会导致恢复一行的数据需要使用到太多的间接行。

Leetcode做题笔记

Posted on 三 08 七月 2015 in 编码 • Tagged with Leetcode

Two Sum

The return numbers in vector should be increasely order.

Longest Substring Without Repeating Characters

Record the position of every characters so far.

Median of Two Sorted Arrays

Use the method of findKth
The start index should add to k

Longest Palindromic Substring

Use Manacher algorithm.reference
Time complexity O(N): the inner loop will take at most a total of N steps in the whole process. Why? Every step of outter loop, the mx(=id+len[id]) will be the most right index before this iteration and the inner loop will start checking charachers at position mx. Then update the mx. So the inner Loop will check every characher once.

Reverse Integer

Use the type of unsigned int to contain the abs(x), sign be the sign of x.
Use res to represent the reverse result so far and res = res * 10 + digit.
When the res is bigger than 214748364 and there is digit which do not reverse yet, it will be overflow. If the reverse is overflow, the reverse of digits exclude the fast digit of x must bigger than ((2 << 31) - 1) >> 1. Because if x have nine digits, then the most significant digit should be 1 or 2.

String to Integer (atoi)

The atoi reference
Use int isspace(int c) to check if character is a white-space and bool isdigit(char) to check if character is decimal digit.
And according to the problem, it requires that if the correct value is out of the range of representable values, INT_MAX (2147483647) or INT_MIN (-2147483648) is returned. But the atoi in standard library does not have this requirement.

Palindrome Number

The negative integer is not a palindrome.

Regular Expression Matching

DP, O(n * m)
f[i][j] = f[i-1][j-1] && match(s[j], p[i]), if p[i] != '*' && p[i+1] != '*'.
f[i][j] = f[i][j-1] || f[i-1][j-1] && match(s[j], p[i]), if p[i+1] == '*'.
f[i][j] = f[i-1][j], if p[i] == '*'.

Container With Most Water

Reference lccycc
assume that 0..i-1 and j+1..n-1 have all find their best match
assume height[i] < height[j]
if i match some k, i<k<j
answer = (k-i) * min(height[i], height[k]) <= (k-i) * height[i] < (j-i) * height[i];
so the best match of i is j. for k < i or k > j, they have find their best match. so no need to concern.

Implement strStr()

kmp
if the needle is empty string, should return 0.

Substring with Concatenation of All Words

O(len * n / len * len) = O(n * len), n is the length of s, m is the number of words, len is the length of word.
The straightforward thought is that we can scan every m * len long string start form each position in s and see if all the strings in words have been appeared only once using map data structure.
How to pick every m * len long string? Pick a start position in [0, len-1] and get the m * len start with it. When we have a m * len long string and the range is [i, j], then we can get another m * len long string by removing the [i, i+len-1] chars and append [j+1, j+len] chars.

Longest Valid Parentheses

First push -1 into stack.
Then Scan the s, when the char scaned now is ( then push the index into stack. When the char is ), pop the top element if the number of element in stack is more than one or change the top element to be the index now if the stack only contains one element, and calculate the i - top which is one of the candidates of the longest length.

Multiply Strings

Leading zero should be removed.

Ubuntu出现相同图标

Posted on 二 09 六月 2015 in 日常 • Tagged with Ubuntu

不知道什么时候误操作，导致dash里出现两个chrome图标：chrome图标，chrome打开新窗口图标。而且如果把第一个图标锁定到启动器，每次打开它又会产生第二个图标，神烦。

原因

原因就是，不知道什么时候，在用户的目录下创建了一个chrome打开新窗口图标的图标，使得系统中存在两个chrome图标。

解决方法

将~/.local/share/applications/google-chrome.desktop删除掉，这样就仅保留系统全局的chrome图标。

AspectJ学习之Introduction

Posted on 一 30 三月 2015 in Reading • Tagged with AspectJ

The AspectJ Programming Guide第一章的笔记

1. 四个constructs：pointcuts, advice, inter-type declarations and aspects

Pointcuts and advice dynamically affect program flow
inter-type declarations statically affects a program’s class hierarchy
and aspects encapsulate these new constructs.

A join point is a well-defined point in the program flow. A pointcut picks out certain join points and values at those points. A piece of advice is code that is executed when a join point is reached. These are the dynamic parts of AspectJ.（动态织入）

AspectJ also has different kinds of inter-type declarations that allow the programmer to modify a program’s static structure, namely, the members of its classes and the relationship between classes.（静态织入）

AspectJ’s aspect are the unit of modularity for crosscutting concerns. They behave somewhat like Java classes, but may also include pointcuts, advice and inter-type declarations.（类似于class）

2.定义pointcut点

全名：关键字call

call(void Point.setX(int))

逻辑语句：

call(void Point.setX(int)) || call(void Point.setY(int))

别名/命名：move()代表这个pointcut

pointcut move():
    call(void FigureElement.setXY(int,int)) ||
    call(void Point.setX(int))

pointcut的pointcut：关键字cflow，控制流（control-point），监测获取整个控制流

cflow(move())

3.advince：

Before advince：在执行pointcut的方法之前执行（但在获得参数后）
After advince：在执行pointcut的方法之后执行（但在return之前）
Around advince：貌似是可以指定在方法运行到什么地方的时候执行该advince（on a join point runs as the join point is reached, and has explicit control over whether the program proceeds with the join point.）

暴露pointcut的上下文：

格式如下：

after(FigureElement fe, int x, int y) returning:
    ...SomePointcut... {
    ...SomeBody...
}

表示这个AfterAdvince可以获得pointcut的环境中的三个参数。使用匿名pointcut：

after(FigureElement fe, int x, int y) returning:
    call(void FigureElement.setXY(int, int)) && target(fe) && args(x, y) {
    System.out.println(fe + " moved to (" + x + ", " + y + ")");
}

使用命名的pointcut：

pointcut setXY(FigureElement fe, int x, int y):
    call(void FigureElement.setXY(int, int)) && target(fe) && args(x, y);
after(FigureElement fe, int x, int y) returning: setXY(fe, x, y) {
    System.out.println(fe + " moved to (" + x + ", " + y + ").");
}

4. 类型间声明(Inter-type declarations)

是切入类和它们层次结构的声明。定义一个Point类的观察者：

aspect PointObserving {
    private Vector Point.observers = new Vector();
    ...
}

添加访问这个私有成员的方法

aspect PointObserving {
    private Vector Point.observers = new Vector();

    public static void addObserver(Point p, Screen s) {
        p.observers.add(s);
    }
    public static void removeObserver(Point p, Screen s) {
        p.observers.remove(s);
    }
    ...
}

添加切入点changes，以便观察变化

aspect PointObserving {
    private Vector Point.observers = new Vector();

    public static void addObserver(Point p, Screen s) {
        p.observers.add(s);
    }
    public static void removeObserver(Point p, Screen s) {
        p.observers.remove(s);
    }

    pointcut changes(Point p): target(p) &amp;&amp; call(void Point.set*(int));

    after(Point p): changes(p) {
        Iterator iter = p.observers.iterator();
        while ( iter.hasNext() ) {
            updateObserver(p, (Screen)iter.next());
        }
    }

    static void updateObserver(Point p, Screen s) {
        s.display(p);
    }
}

5. 切面（Aspect）

Aspects wrap up pointcuts, advice, and inter-type declarations in a a modular unit of crosscutting implementation. 类似一个类。

不可以new出一个Aspect，默认情况它是单例的。这意味着，如果需要保持状态的话，advince可以使用一个非静态域：

aspect Logging {
    OutputStream logStream = System.err;

    before(): move() {
        logStream.println("about to move");
    }
}

误删系统文件作死

Posted on 四 12 三月 2015 in 日常 • Tagged with Ubuntu

背景：今天网卡一直抽风，就怀疑是不是网卡驱动不稳定（因为这个网卡驱动是当时出的一个版本），就想起是不是去更新一下。

在更新无线网卡时，不慎（作死）删除了/lib/modules/3.13.0-32-generic/kernel/drivers/misc/eeprom/eeprom_93cx6.ko、/lib/modules/3.13.0-32-generic/kernel/lib/crc-ccitt.ko, /lib/modules/3.13.0-32-generic/kernel/net/wireless/cfg80211.ko, /lib/modules/3.13.0-32-generic/kernel/net/mac80211/mac80211.ko等，而且恢复不了。

网上搜了一下，尝试了ubuntu重启时进入recovery模式，还是修复不了。然后以为应该是驱动被删的问题，以这个关键字搜，都是一些第三方驱动的安装之类的。后来想到应该是属于内核模块，所以尝试了下更新内核，结果如愿以偿。看来对linux的熟悉程度还不够，一个问题还要绕好久。

在django中使用ImageField将图片存储在bcs中

Posted on 五 05 十二月 2014 in 日常 • Tagged with BAE, BCS, django

因为像bae一类的paas平台，一般都是需要将除了代码外的文件存储到其他专门的服务中。为了使本博能上传图片，就使用了bcs，在django中，ImageField默认时存储到磁盘中，为了能上传到云上，需要自己定制storage。

参考python-django如何在sae中使用自带imagefield和filefield

定制Storage

bcs的API使用官网提供的封装好的pybcs，其他就是继承并重载FileSystemStorage。

#-*- coding: UTF-8 -*-
from django.core.files.storage import FileSystemStorage
from django.core.exceptions import SuspiciousOperation
import pybcs
import logging
import uuid
import os

pybcs.init_logging(logging.INFO)
AK = 'your AK
SK = 'your SK'
BUCKET='your bucket'
bcs = pybcs.BCS('http://bcs.duapp.com/', AK, SK, pybcs.HttplibHTTPC)
bckt = bcs.bucket(BUCKET)

class BcsStorage(FileSystemStorage):

    def __init__(self, location=None, base_url=None):
        super(BcsStorage, self).__init__(location, base_url)

    @property
    def maxSize(self):
        """ max file size 2G """
        return 2 * 1024 * 1024 * 1024

    @property
    def fileTypes(self):
        """ the file types allowed """
        return '*'

    def makename(self, name):
        oname = os.path.basename(name)
        path = os.path.dirname(name)
        try:
            fname, hk = oname.split(".")
        except Except, e:
            fname, hk = oname, ''
        if hk:
            rname = "%s_%s.%s" % (str(uuid.uuid4()), fname, hk)
        else:
            rname = "%s_%s" % (str(uuid.uuid4()), fname)
        name = os.path.join(path, rname)
        return name

    def _save(self, name, content):
        hz = name.split(".")[-1]
        if self.fileTypes != '*':
            if hz.lower() not in self.fileTypes:
                print u"不支持的文件类型%s，仅支持%s" % (hz, self.fileTypes)
                raise SuspiciousOperation(u"不支持的文件类型%s，仅支持%s" % (hz, self.fileTypes))
        name = self.makename(name)
        if content.size > self.maxSize:
            raise SuspiciousOperation(u"文件大小超过限制")
        # object name must start with '/'
        if name.startswith('/'):
            o = bckt.object(name)
        else:
            o = bckt.object("/" + name)
        o.put(content)
        #let the object can be read by public
        o.make_public()
        return name

    def delete(self, name):
        # object name must start with '/'
        if name.startswith('/'):
            o = bckt.object(name)
        else:
            o = bckt.object("/" + name)
        o.delete()

class ImageStorage(BcsStorage):

    @property
    def fileTypes(self):
        return ['jpg', 'jpeg', 'png', 'gif']

修改pybcs

印象中应该只有一处有错，在common.py中需要修改shorten函数：

def shorten(s, l=80):
    if len(s)<=l:
        return s
    if hasattr(s, '__getitem__'):
        return s[:l-3] + '...'
    else:
        return 'data...'

使用方法

在定义model时，使用ImageField，并带上参数storage=ImageStorage()。同时需要重载delete，否则不会删除图片。

class PaperImage(models.Model): image = models.ImageField(verbose_name=ugettext(u'图片'), max_length=250, upload_to=UPLOAD_TO, storage=ImageStorage(), null=True, blank=True)

def __unicode__(self):
    return u"%s" % self.image

def delete(self, using=None):
    try:
        self.image.storage.delete(self.image.name)
    except Exception, e:
        print e
        pass
    super(PaperImage, self).delete(using=None)

修改require,txt

我使用的bae，环境中缺少PIL包，在require.txt添加一行，写上PIL

Ubuntu14.04安装MT7630E网卡

Posted on 五 05 十二月 2014 in 日常 • Tagged with Ubuntu

去年买的HP probook g440，网卡太奇怪，竟然linux低版本的内核不支持，今天发现在3.13~3.14版本支持了，所以整了个ubuntu14.04

安装MT7630E驱动

下载MT7630E_Wi-Fi_BT_Source_V3.14_20140625_v2.tar.gz，解压，按照readme说明安装

开机自动加载

很奇怪，重启后重视不能自动加载驱动，找了好久，终于发现将编译出来的文件：rt2x00lib.ko,rt2x00pci.ko,rt2x00mmio.ko,rt2800lib.ko,rt2800pci.ko复制到/lib/modules/$(uname -r)/kernel/net/rt2x00/中，并运行命令cd /lib/modules/$(uname -r)/;sudo depmod -a;即可

bae上部署django的notes

Posted on 五 21 十一月 2014 in 日常 • Tagged with BAE, django

这个网站就部署在bae上的，正好自己在学python，所以就学了下使用django弄个网页。本来很早就想写部署notes，因为官方的部署很不全，自己遇到好多坑，现在把还记得的重要过程记录一下。

这里的说明先假设在项目的根目录下创建创建了一个django的项目，名为blog，所有他的文件也就包含在/blog下面。

修改index.py

需要把项目的setting放入DJANGO_SETTINGS_MODULE变量中。

#-*- coding:utf-8 -*-
import os
import sys

from django.core.handlers.wsgi import WSGIHandler
from bae.core.wsgi import WSGIApplication
# settings放入变量中
os.environ['DJANGO_SETTINGS_MODULE'] = 'blog.settings'
#将项目目录加入到path中
path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'blog').replace('\\','/')
if path not in sys.path:
    sys.path.insert(1, path)

修改app.conf

app.conf主要是用来控制后端WEB server的处理逻辑的，可以参考bae技术博客

它可以直接处理静态请求，然后把动态请求都转发给后端的webServer。

handlers:
  # 静态请求直接返回静态资源
  - url : /static/(.*)
    script : /blog/blog/static/$1
  # 动态请求转发给index.py
  - url : /.*
    script: index.py

  - expire : .jpg modify 10 years
  - expire : .swf modify 10 years
  - expire : .png modify 10 years
  - expire : .gif modify 10 years
  - expire : .JPG modify 10 years
  - expire : .ico modify 10 years

数据库连接

使用数据库以及django自带的管理模块admin，需要将python目录下的/Lib/site-packages/django/contrib/admin文件夹拷贝到上面设置的静态文件路径下/blog/blog/static/。

settings.py需要修改下数据库连接设置：

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'DBNAME', 
        'USER': 'USERID',
        'PASSWORD': 'PWD',
        'HOST': 'sqld.duapp.com', 
        'PORT': '4050', 
    }
}

本地运行与上线发布

上线发布需要把settings.py设置为生成模式，debug=False。

因为装bae的本地开发环境比较麻烦，所以我在本地运行时，都是直接用runserver方式开启调试模式的，这时就需要设置静态文件的目录，以便获取静态文件

settings.py修改设置：

......
DEBUG=True
......
STATIC_ROOT = os.path.join(PROJECT_DIR, 'static')
STATIC_URL = '/static/'

import os
STATICFILES_DIRS = (
    ('css', os.path.join(STATIC_ROOT, 'css').replace('\\', '/')),
    ('img', os.path.join(STATIC_ROOT, 'img').replace('\\', '/')),
    ('js', os.path.join(STATIC_ROOT, 'js').replace('\\', '/')),
)

sublime配置及使用

Posted on 二 11 十一月 2014 in 日常 • Tagged with sublime

之前一直使用超哥推荐的Notepad++，在windows还是挺顺手的，后来黎大神推荐了sublime，界面酷炫，可以DIY各种功能，跨平台，简直甩前者N条街。但不能文件关联也有一些不便，所以目前是notepad++当记事本用，sublime写点小程序用。

sublime设置及使用

推荐Sublime Text 全程指南

编译&运行C++

默认的设置能编译但运行不了C++程序，把Packages/C++/C++.sublime-build修改为：

{
    "cmd": ["g++", "-std=c++11", "-Wall", "${file}", "-o", "${file_path}/${file_base_name}"],
    "file_regex": "^(..[^:]*):([0-9]+):?([0-9]+)?:? (.*)$",
    "working_dir": "${file_path}",
    "selector": "source.c, source.c++",

    "variants":
    [
        {
            "name": "Run",
            "shell": true,
            "cmd": ["start", "cmd", "/c", "${file_path}/${file_base_name} && pause"]
        }
    ]
}

这样就可以在命令行中运行程序输入数据。

个人习惯配置

可以参考之前提到的那篇文章，这里记录下自己的设置

{
            // 使光标闪动更加柔和
    "caret_style": "phase",
            // 显示空白字符
    "draw_white_space": "all",
            // 保存时自动增加文件末尾换行
    "ensure_newline_at_eof_on_save": true,
            // 字体大小
    "font_size": 16.0,
            // 高亮当前行
    "highlight_line": true,
            // 高亮修改过的标签
    "highlight_modified_tabs": true,
    "ignored_packages":
    [
        "Vintage"
    ],
            // 添加标尺
    "rulers":
    [
        80,
        100
    ],
            // tab大小设置为4
    "tab_size": 4,
    "theme": "Soda Dark.sublime-theme",
            // 使用空格代替tab
    "translate_tabs_to_spaces": true,
            // 保存时去除末尾空白字符
    "trim_trailing_white_space_on_save": true
}

gcj2013 B题 Manage your Energy

Posted on 五 07 十一月 2014 in 编码 • Tagged with GCJ

最近刷codejam的略感无力，本来以为今天刷gcj2013 round1A应该能轻松耍一下。。但是被第二题卡住，看来最近确实脑残了......

这里仅仅记录下脑残的思路是如何产生，以便日后回想自己的too young too naive的时光。。

NC One

看完题目第一感觉，找出最大的Vmax然后把所有e * Vmax，其他求和 * r。但总有种不可能这么简单的感觉，但还是先写出来试试。果然码完就觉得有问题。

NC Two

反省第一版的问题后，发现在最大的值所在位置之后的一段v里需要再另外找最大值。于是求f[i]，表示i点及其后的最大值，接着从前往后扫一遍，尽量将e用在f[i]==v[i]的点上。然后就以为已经解决了。。

正解

最后还是没找出问题所在，时间不多（还有许多砖要搬），就直接看题解了。。

对于每一个i，需要找出下一个比i大的位置j，那么肯定应该尽量把e用于j位置。。至于快速找出i的下一个比i大的位置，需要用到单调栈。

#include <iostream>
#include <cstring>
#include <cstdio>
#include <algorithm>
#include <stack>
using namespace std;

typedef long long LL;
int a[10004], nxt[10004];
stack<int> st;

int main() {
    int T, e, r, n;
    scanf("%d", &T);
    for(int cas = 1; cas <= T; cas++) {
        scanf("%d%d%d", &e, &r, &n);
        if(r > e) r = e;
        for(int i = 0; i < n; i++)
            scanf("%d", a + i);
        for(int i = 0; i < n; i++) {
            while(!st.empty() && a[i] > a[st.top()]) {
                nxt[st.top()] = i;
                st.pop();
            }
            st.push(i);
        }
        while(!st.empty()){
            nxt[st.top()] = -1;
            st.pop();
        }
        LL ans = 0;
        int now = e;
        for(int i = 0; i < n; i++) {
            if(nxt[i] == -1 || (nxt[i] - i) * (LL)r >= e) {
                ans += now * (LL)a[i];
                now = 0;
            } else if((nxt[i] - i) * (LL)r + now > e) {
                ans += ((nxt[i] - i) * (LL)r + now - e) * (LL)a[i];
                now = e - (nxt[i] - i) * (LL)r;
            }
            now += r;
            if(now > e) now = e;
        }
        printf("Case #%d: %lld\n", cas, ans);
    }
    return 0;
}

Older Posts