Thursday, February 14, 2013

Setup static IP





LAN settings

IP Address : Your static IP
Mask: 255.255.255.0

Add a static DHCP lease for each LAN machine you redirect to (so that its IP won't change in the future)

Virtual Servers

Add a redirect from the public (static) IP to the private (LAN) IP and port

Monday, January 28, 2013

Refine homicides


Dealing with messy data in Google Refine


Source data set:
https://github.com/baio/cour-compdata/tree/master/w4

First - download and unzip Google Refine (nothing mind-blowing)
Second - run it

Create new project -> open the file with data -> parse data as "Line-based text files"


Basic operations in GR:

Split column:

Edit columns -> Split into several columns -> choose separator

Clean up column data:

Edit cells -> Transform ... -> use the GR expression syntax (very simple, see examples at the bottom)

Add column based on other column:

Edit columns -> Add column based on this column

Work with example

First, split each row into two parts by <dt>, then split the first resulting column by ',' and the second column by <dd

Remove junk columns and clean up columns data.

GR Tricks

Before cleaning up (transforming) data, make it lower case (Edit cells -> Common transforms -> To lowercase).

Before transforming data, always look at the column facet (Facet -> Text facet); this gives a better perspective.

After a transform, check how many rows were affected (in most cases the number of affected rows should equal the number of rows in the dataset).

After transforming data, look at the column facet again to confirm everything went well.


Sometimes, instead of splitting a column into several, it is better to create copies of it (Add column based on this column) and then apply transforms to them.

When dealing with columns which contain qualitative values, filter them by all values in the list to confirm you caught all of them. For example, filtering the race column with ^((?!black|white|hispanic|asian).)*$ should leave only unknown races.

Don't forget to trim columns!
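
To see what the race filter above actually keeps, here is a small sketch in plain Python rather than in Refine itself. The sample values are made up for illustration; only the regular expression comes from the post. The negative lookahead rejects any value that contains one of the known races anywhere in the string, so only the leftovers pass.

```python
import re

# The filter expression from the post: matches only strings that do NOT
# contain any of the known race names at any position.
KNOWN_RACES_FILTER = re.compile(r"^((?!black|white|hispanic|asian).)*$")


def unknown_values(values):
    """Return the values the filter would leave visible."""
    return [v for v in values if KNOWN_RACES_FILTER.match(v)]


# Hypothetical, already lower-cased column values.
sample = ["black", "white", "hispanic male", "asian", "unknown", "native american", ""]
leftovers = unknown_values(sample)
print(leftovers)
```

Note that the empty string also passes the filter, which is one more reason to look at the text facet afterwards.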

The expressions to refine (Transform) columns:


link:
if(value.startsWith("<"), value.parseHtml().select("a")[0].htmlAttr("href"), "")

name:
if(value.startsWith("<"), value.parseHtml().select("a")[0].innerHtml(), value)

date:
value.match(/[^\d]*(\d+).*/)[0]

sex:
value.match(/.*([^e]male|female).*/)[0]

Expressions for other columns are simple.
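
The date and sex patterns can be tried outside Refine too. Below is a rough Python sketch, assuming that GREL's match() has to match the whole string and returns the capture groups (which is why re.fullmatch is used here). The function names and sample strings are mine, not part of the data set.

```python
import re


def first_number(value):
    """Python analogue of value.match(/[^\\d]*(\\d+).*/)[0]."""
    m = re.fullmatch(r"[^\d]*(\d+).*", value)
    return m.group(1) if m else None


def sex(value):
    """Python analogue of value.match(/.*([^e]male|female).*/)[0]."""
    m = re.fullmatch(r".*([^e]male|female).*", value)
    return m.group(1) if m else None


print(first_number("date: 12 jan"))      # the first run of digits
print(repr(sex("a black male, 20")))     # note the captured leading space
print(sex("white female victim"))
print(sex("male"))                       # nothing precedes "male", so no match
```

The [^e]male branch captures the character before "male" as well (usually a space), so the result still needs a trim, and a value that starts with "male" is not matched at all.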

Links :
http://code.google.com/p/google-refine/
http://code.google.com/p/google-refine/wiki/GRELStringFunctions


Sunday, January 27, 2013

Pushing subtree



Pushing a subtree is not so obvious -
git push remote branch:remote_branch
Pushed tags must have explicit names, different from the master branch or any other!


Sub-tree ref
Don't forget to commit the new files when you check out a new branch after read-subtree.

Sunday, August 12, 2012

Logs!

Problem.

Crawling sites and trying to extract structured information from raw data (such as HTML) is tough work, because the amount of explored data is large and the process runs for a long time. The only way to debug a crawling process is to collect log data and analyze it when necessary. The log data must be as detailed as possible, since it is hard to predict which data will be needed to analyze and debug the process. Log data also can't have a rigid structure, since each logging step has its own features; logged data is therefore semi-structured (JSON is the best choice of format).

Structure of the log.

Structure, huh? As written above, log data is semi-structured, which means each log record could have any structure, but for more efficient parsing it is better to have some base structure (common fields).


Each crawling iteration* consists of a number of steps; all steps that belong to the same iteration have the same id.


The project's implementation defines the following base structure:
  • id - id (uuid) of the crawling step
  • time - time of the log record
  • level - info | warning | error | fatal*
  • code - code of the step
    • info codes:
      • start - start of the iteration
      • complete - end of the iteration
      • request - start of a request to the target
      • response - response received from the target
      • found - found N records
    • warning codes:
      • codes which describe unexpected behavior in the workflow that has no direct impact on step execution.
    • error codes:
      • codes which describe errors during a step, in which case the iteration is considered failed and will be executed again (depending on configuration)

 * crawling iteration: process workflow - send request, get response, parse response, invoke next iteration

 *fatal level doesn't have predefined codes since it is always unexpected.
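
To make the base structure more concrete, here is a minimal Python sketch of how one log record could be built as a JSON line. It is an illustration, not the project's actual code: the function name make_log_record and the extra fields count and url are assumptions; only id, time, level and code come from the list above.

```python
import json
import uuid
from datetime import datetime, timezone

LEVELS = {"info", "warning", "error", "fatal"}


def make_log_record(iteration_id, level, code=None, **details):
    """Build one semi-structured log record as a JSON string.

    The common fields (id, time, level, code) are always present;
    anything step-specific goes into the free-form details.
    """
    if level not in LEVELS:
        raise ValueError("unknown level: %s" % level)
    record = dict(details)  # step-specific, free-form part
    record.update(
        {
            "id": iteration_id,  # shared by all steps of one iteration
            "time": datetime.now(timezone.utc).isoformat(),
            "level": level,
            "code": code,  # None for fatal, which has no predefined codes
        }
    )
    return json.dumps(record)


# Two steps of the same (hypothetical) iteration share one id.
iteration_id = str(uuid.uuid4())
start_line = make_log_record(iteration_id, "info", "start")
found_line = make_log_record(iteration_id, "info", "found", count=20, url="http://example.com/")
print(start_line)
print(found_line)
```

Keeping the common fields at the top level is what later lets a parser group records by iteration without knowing anything about the step-specific part.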

Logs ecosystem.

The application collects logs via the Loggly service; Loggly then (optionally) backs up the collected data to Amazon S3 storage, and Amazon EMR parses the logs when necessary.

Showcase.

When a crawling process completes, the data from its logs is extracted and moved to MongoDB. This gives a better understanding of the crawler workflow and makes it easier to detect problems during the process, because analyzing data stored in MongoDB is much more convenient than analyzing raw logs.
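
As a rough illustration of the kind of question that becomes easy once records share an iteration id, here is a sketch in plain Python (no MongoDB involved) that finds the iterations which logged an error. The sample records and the "timeout" code are made up; in the real setup the same grouping would be done by a query over the stored log records.

```python
from collections import defaultdict


def failed_iterations(records):
    """Return the sorted ids of iterations that logged an error or fatal record."""
    levels_by_id = defaultdict(set)
    for record in records:
        levels_by_id[record["id"]].add(record["level"])
    return sorted(
        iteration_id
        for iteration_id, levels in levels_by_id.items()
        if levels & {"error", "fatal"}
    )


# Hypothetical parsed log records: iteration "a" completed, "b" failed.
sample_records = [
    {"id": "a", "level": "info", "code": "start"},
    {"id": "a", "level": "info", "code": "complete"},
    {"id": "b", "level": "info", "code": "start"},
    {"id": "b", "level": "error", "code": "timeout"},
]
failed = failed_iterations(sample_records)
print(failed)
```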



Application structure


Saturday, July 21, 2012

Data

The biggest problem I have to solve for a successful project start is data integrity and consistency. The crawler starts in the wild and gathers a great amount of data, and there is absolutely no guarantee that the collected data is consistent and complete. Since the number of records is so large, the task of data validation is too complex for human-intuitive analysis alone.

From my point of view there are three steps which could help to solve this problem:
1. A robust test environment (the test tasks here should be sliced by complexity, from the simplest to more advanced)
2. Visualization of the gathered data, for a better intuitive understanding of whether it is OK or something is going wrong.
3. Machine learning algorithms to validate data on the basis of previous data (which is considered good)

Today I implemented the first basic steps in that direction:
1. Set up the simplest test environment.
The task of the test: crawl a single dom, with no schedulers or other "noise" factors. Store the data in an isolated "test" storage.


*craw.schedule must be excluded from the scheme

2. Visualize gathered test data.

http://twiturstat.apphb.com/

Wednesday, July 18, 2012

MongoDB

Since our system is built entirely upon the MongoDB database, I have been exploring ways to make it work faster for any user request.


And as the data grows, more issues arise.
Now (even in the staging phase), when the collection holds ~20 million records and 10GB of data on disk, it is crucial to maintain good indexing for the collection.
And here is what I discovered while trying to tune query performance: the order of the fields in the index key is very important.


For example (from mongoDB docs) - 


1. The sort column must be the last column used in the index.
2. The range query must also be the last column in an index. This is an axiom of 1 above.
3. Only use a range query or sort on one column.


But these rules are not always true.
For example, when you have a range query on a column and then sort by the same column, this column must be the first column in the index (the opposite of the 1st rule above).


For example, for the query:

db.tourism.find({ "country" : "eg", "checkin" : { "$gte" : ISODate("2012-07-17T20:00:00Z"), "$lte" : ISODate("2012-09-29T20:00:00Z") }, "nights" : { "$gte" : 1, "$lte" : 21 }, "stars" : { "$gte" : 1, "$lte" : 5 }, "eprice" : { "$gte" : 1, "$lte" : 2000 } }).sort({eprice : 1}).limit(20).explain()

the following index must be ensured:


db.tourism.ensureIndex({ "eprice" : 1, "country" : 1, "checkin" : 1, "nights" : 1, "stars" : 1 });


And the most interesting part is that when your queries have stricter ranges, for example -


db.tourism.find({ "country" : "eg", "checkin" : { "$gte" : ISODate("2012-07-17T20:00:00Z"), "$lte" : ISODate("2012-07-29T20:00:00Z") }, "nights" : { "$gte" : 1, "$lte" : 21 }, "stars" : { "$gte" : 1, "$lte" : 5 }, "eprice" : { "$gte" : 1, "$lte" : 2000 } }).sort({eprice : 1}).limit(20).explain()


then you could have a simple index (following the standard rules) like this:
db.tourism.ensureIndex({ "country" : 1, "checkin" : 1, "nights" : 1, "stars" : 1, "eprice" : 1 });


Query time for this index will be the same as for the index where eprice is the first column (this sent me the wrong way the first time I tried to set up indexes).


And don't forget to ensure similar indexes for queries with different sort columns (even when the query part is the same).
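
The two key orders discussed above can be generated with a small helper, which is handy when the same query part needs an index per sort column. This is only a sketch in plain Python: the helper index_keys is hypothetical, and it just builds (field, direction) lists in the shape that a driver such as pymongo accepts for create_index, assuming that is the driver in use.

```python
from typing import List, Tuple

ASCENDING = 1  # same numeric value MongoDB uses for ascending order

IndexKeys = List[Tuple[str, int]]


def index_keys(filter_fields, sort_field, sort_first):
    # type: (List[str], str, bool) -> IndexKeys
    """Return compound index keys with the sort field placed first or last."""
    others = [f for f in filter_fields if f != sort_field]
    ordered = [sort_field] + others if sort_first else others + [sort_field]
    return [(name, ASCENDING) for name in ordered]


FILTER_FIELDS = ["country", "checkin", "nights", "stars", "eprice"]

# Wide ranges: the sort column goes first (the first index in the post).
wide_range_index = index_keys(FILTER_FIELDS, "eprice", sort_first=True)
# Strict ranges: the "standard rules" order with the sort column last.
narrow_range_index = index_keys(FILTER_FIELDS, "eprice", sort_first=False)

print(wide_range_index)
print(narrow_range_index)
```

Calling the helper with another sort field (for example "nights") gives the key list for the corresponding index; whether each of them actually helps still has to be checked with explain().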


So indexing in MongoDB is for me now similar to machine learning: a little science, a little intuition and a lot of trying.


P.S. I still haven't tried the DEX indexing robot for MongoDB; maybe it could help with the problem, but I love trying to solve a problem by hand first.