Data-Avail
Glory to robots and bosons.
Thursday, February 14, 2013
Monday, January 28, 2013
Refine homicides
Dealing with messy data using Google Refine
Source data set:
https://github.com/baio/cour-compdata/tree/master/w4
First, download and unzip (nothing mind-blowing).
Second, run it.
Create new project -> open the file with the data -> parse data as Line-based text files
Basic operations in GR:
Split column:
Edit columns -> Split into several columns -> choose separator
Clean up column data:
Edit cells -> Transform ... -> use the GR expression syntax (GREL; very simple, see the examples at the bottom)
Add column based on other column:
Edit columns -> Add column based on this column
Working with the example
First, split each row into two parts by <dt>, then split the first resulting column by ',' and the second column by <dd
Remove junk columns and clean up the column data.
GR Tricks
Before cleaning up (transforming) data, convert it to lower case (Edit cells -> Common transforms -> To lowercase).
Before transforming data, always look at the column facet (Facet -> Text facet); it gives a better perspective.
After a transform, check how many rows were affected (in most cases the number of affected rows should equal the number of rows in the dataset).
Sometimes, instead of splitting a column into several, it is better to create copies of it (Add column based on this column) and then apply transforms to the copies.
When dealing with columns that contain qualitative (categorical) values, filter them against all known values to confirm you have caught them all. For example, filtering the race column with ^((?!black|white|hispanic|asian).)*$ should leave only the unknown races.
Don't forget to trim columns!
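The "filter by everything you already know" trick for the race column can be tried outside GR as well. Below is a minimal Python sketch of the same negative-lookahead filter; the function name and the sample values are made up for illustration, only the regular expression comes from the tip above.

```python
import re

# Known categories; anything that mentions none of them should surface as "unknown".
KNOWN_RACES = ("black", "white", "hispanic", "asian")

# Same idea as the GR filter: match only strings that contain none of the known values.
UNKNOWN_RACE = re.compile(r"^((?!" + "|".join(KNOWN_RACES) + r").)*$")


def uncaught_values(values):
    """Return the values that mention none of the known categories."""
    return [v for v in values if UNKNOWN_RACE.match(v.strip().lower())]


# Hypothetical column values, including untrimmed and mixed-case ones.
sample = ["Black", " white ", "Hispanic", "asian", "unknown", ""]
print(uncaught_values(sample))
```

Lower-casing and trimming before matching mirrors the other tips in this list; without them "Black" would slip through as an unknown value.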
Expressions for refining (Transform) the columns:
link:
if(value.startsWith("<"), value.parseHtml().select("a")[0].htmlAttr("href"), "")
name:
if(value.startsWith("<"), value.parseHtml().select("a")[0].innerHtml(), value)
date:
value.match(/[^\d]*(\d+).*/)[0]
sex:
value.match(/.*([^e]male|female).*/)[0]
Expressions for the other columns are simple.
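To check the date and sex patterns before running them over a whole column, they can be replayed in Python. This is only a sketch: it assumes that GREL's match() has to match the entire cell value (hence fullmatch below), and the sample strings are invented, not taken from the data set.

```python
import re

# The same patterns as in the GR expressions above.
DATE_RE = re.compile(r"[^\d]*(\d+).*")
SEX_RE = re.compile(r".*([^e]male|female).*")


def first_group(pattern, value):
    """Mimic value.match(/.../)[0]: first capture group, or None if there is no match."""
    m = pattern.fullmatch(value)
    return m.group(1) if m else None


# Hypothetical cell values, already lower-cased.
print(first_group(DATE_RE, "abc 42 xyz"))
print(repr(first_group(SEX_RE, "black male, 17 years old")))
print(first_group(SEX_RE, "white female"))
print(first_group(SEX_RE, "male, 20 years old"))
```

Two things to watch for: the "[^e]male" branch captures the character before "male" (usually a space), which is one more reason to trim the column afterwards, and a value that starts with "male" is not matched at all because there is no preceding character.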
Links:
http://code.google.com/p/google-refine/
http://code.google.com/p/google-refine/wiki/GRELStringFunctions
Sunday, January 27, 2013
Pushing subtree
Pushing a subtree is not so obvious:
git push remote branch:remote_branch
Pushed tags must have explicit names, different from the master branch or any other!
Sub-tree ref
Don't forget to commit new files when checking out a new branch after read-subtree.
Sunday, August 12, 2012
Logs!
Problem.
Crawling sites and extracting structured information from raw data (such as HTML) is tough work because of the large amount of data explored and the long time span of the process. The only way to debug a crawling process is to collect log data and analyze it when necessary. The log data must be as detailed as possible, since it is hard to predict which data will be needed to analyze and debug the process. Log data also cannot have a rigid structure, since each logging step has its own features; logged data is therefore semi-structured (JSON is the best choice of format).
Structure of the log.
Structure, huh? As written above, log data is semi-structured, which means each log record can have
any structure, but for more efficient parsing it is better to have some base structure (common fields).
Each crawling iteration* consists of a number of steps; all steps that belong to the same iteration share the same id.
The project's implementation defines the following base structure:
- id - id (uuid) of the crawling step
- time - time of the log record
- level - info | warning | error | fatal*
- code - codes of the step
- info codes:
- start - start of the iteration
- complete - end of the iteration
- request - start request to target
- response - get response from target
- found - found N records
- warning codes:
- codes that describe unexpected behavior in the workflow but have no direct impact on step execution.
- error codes:
- codes that describe errors during a step, in which case the iteration is considered failed and will be executed again (depending on configuration)
* crawling iteration: the process workflow - send request, get response, parse response, invoke next iteration
* the fatal level doesn't have predefined codes since it is always unexpected.
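A log record with this base structure could be produced roughly like this. It is a minimal Python sketch, not the project's actual logger: the function name, the ISO time format and the extra fields (count, url) are assumptions for illustration; only the id, time, level and code fields come from the list above.

```python
import json
import uuid
from datetime import datetime, timezone

LEVELS = {"info", "warning", "error", "fatal"}


def make_log_record(iteration_id, level, code=None, **details):
    """Build one JSON log line: common base fields plus free-form step details."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    record = {
        **details,  # semi-structured part, different for every step
        # Base fields are written last so details cannot overwrite them.
        "id": iteration_id,
        "time": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "code": code,  # fatal records have no predefined code, so None is allowed
    }
    return json.dumps(record)


# All steps of one iteration share the same id.
iteration_id = str(uuid.uuid4())
print(make_log_record(iteration_id, "info", "start"))
print(make_log_record(iteration_id, "info", "found", count=12, url="http://example.com/page"))
print(make_log_record(iteration_id, "info", "complete"))
```

Keeping the base fields flat at the top level of each record makes later parsing cheap, while the free-form details stay available for debugging.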
Logs ecosystem.
The application collects logs via the Loggly service; Loggly then (optionally) backs up the collected data to Amazon S3 storage, and Amazon EMR parses the logs when necessary.
Showcase.
When a crawling process completes, the data from its logs is extracted and moved to MongoDB for a better understanding of the crawler workflow and easier detection of problems during the process, because analyzing data stored in MongoDB is much more convenient than working with raw logs.
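The kind of question this makes easy to answer is "which iterations never completed or ended with an error?". Here is a small Python sketch of that analysis done directly on JSON log lines, before anything is loaded into a database; the function name and the sample records are hypothetical, and the field names follow the base structure described above.

```python
import json
from collections import defaultdict


def summarize_iterations(lines):
    """Group JSON log lines by iteration id and flag iterations that look unhealthy."""
    by_id = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        by_id[record["id"]].append(record)

    summary = {}
    for iteration_id, records in by_id.items():
        codes = [r.get("code") for r in records]
        levels = {r.get("level") for r in records}
        summary[iteration_id] = {
            "steps": len(records),
            "completed": "complete" in codes,
            "failed": bool(levels & {"error", "fatal"}),
        }
    return summary


# Hypothetical log lines for two iterations.
log_lines = [
    '{"id": "a", "level": "info", "code": "start"}',
    '{"id": "a", "level": "info", "code": "complete"}',
    '{"id": "b", "level": "info", "code": "start"}',
    '{"id": "b", "level": "error", "code": "timeout"}',
]
for iteration_id, info in summarize_iterations(log_lines).items():
    print(iteration_id, info)
```

The per-iteration summaries are plain dictionaries, so they could be stored as documents as they are.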
Saturday, July 21, 2012
Data
The biggest problem I have to solve for the project to start successfully is data integrity and consistency. The crawler starts in the wild and gathers a great amount of data, and there is absolutely no guarantee that the picked data is consistent and complete. Since the number of records is too large, the task of data validation here is too complex for just human, intuitive analysis.
From my point of view there are three steps that could help solve this problem:
1. A robust test environment (the test tasks here should be sliced by complexity, from the simplest to more advanced)
2. Visualization of the gathered data, for a better intuitive understanding of whether it is OK or something is going wrong.
3. Machine learning algorithms to validate data on the basis of previous data (which is considered good)
Today I implemented the first basic steps in that direction:
1. Set up the simplest test environment.
The task of the test: craw single dom, with no schedulers or other "noise" factors. Store the data in an isolated "test" storage.
*craw.schedule must be excluded from the scheme
2. Visualize the gathered test data.
http://twiturstat.apphb.com/
Wednesday, July 18, 2012
MongoDB
Since our system is built entirely upon the MongoDB database, I have been exploring ways to make it work faster for any user's request.
And as the data grows, more issues arise.
Now (even in the staging phase), with ~20 million records in the collection and 10GB of data on disk, it is crucial to maintain good indexing for the collection.
And here is what I discovered while trying to tune query performance:
the order of the fields in the index key is very important.
For example (from the MongoDB docs):
1. The sort column must be the last column used in the index.
2. The range query must also be the last column in an index. This is an axiom of 1 above.
3. Only use a range query or sort on one column.
But these rules are not always true.
For example, when you have a range query on a column and then sort by that same column, it must be the first column in the index (the opposite of the 1st rule above).
For example, for the query:
db.tourism.find({ "country" : "eg", "checkin" : { "$gte" : ISODate("2012-07-17T20:00:00Z"), "$lte" : ISODate("2012-09-29T20:00:00Z") }, "nights" : { "$gte" : 1, "$lte" : 21 }, "stars" : { "$gte" : 1, "$lte" : 5 }, "eprice" : { "$gte" : 1, "$lte" : 2000 } }).sort({eprice : 1}).limit(20).explain()
the following index must be ensured:
db.tourism.ensureIndex({ "eprice" : 1, "country" : 1, "checkin" : 1, "nights" : 1, "stars" : 1 });
The most interesting part is that when your queries have stricter ranges, for example:
db.tourism.find({ "country" : "eg", "checkin" : { "$gte" : ISODate("2012-07-17T20:00:00Z"), "$lte" : ISODate("2012-07-29T20:00:00Z") }, "nights" : { "$gte" : 1, "$lte" : 21 }, "stars" : { "$gte" : 1, "$lte" : 5 }, "eprice" : { "$gte" : 1, "$lte" : 2000 } }).sort({eprice : 1}).limit(20).explain()
then you could have a simple index (following the standard rules) such as:
db.tourism.ensureIndex({ "country" : 1, "checkin" : 1, "nights" : 1, "stars" : 1, "eprice" : 1 });
The query time for this index will be the same as for the index where eprice is the first column (this sent me the wrong way the first time I tried to set up indexes).
And don't forget to ensure similar indexes for queries with different sort columns (even when the query part is the same).
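Why the field order matters can be illustrated with a toy model. The Python sketch below is not MongoDB's query planner: it models an index as a sorted list, uses a simplified query (country, a checkin range, a price range, sort by eprice, limit 20) on random made-up data, and counts how many index entries each key order has to look at.

```python
import bisect
import random


def make_docs(n, seed=0):
    """Random toy documents; checkin is a day number, not a real date."""
    rng = random.Random(seed)
    return [
        {"country": rng.choice(["eg", "tr", "gr"]),
         "checkin": rng.randint(0, 99),
         "eprice": rng.randint(1, 3000)}
        for _ in range(n)
    ]


def query_price_first(docs, country, checkin_lo, checkin_hi, price_lo, price_hi, limit):
    """Index (eprice, country, checkin): entries are already in sort order, stop at limit."""
    index = sorted(docs, key=lambda d: (d["eprice"], d["country"], d["checkin"]))
    prices = [d["eprice"] for d in index]
    start = bisect.bisect_left(prices, price_lo)
    scanned, result = 0, []
    for d in index[start:]:
        if d["eprice"] > price_hi or len(result) == limit:
            break
        scanned += 1
        if d["country"] == country and checkin_lo <= d["checkin"] <= checkin_hi:
            result.append(d)
    return result, scanned


def query_price_last(docs, country, checkin_lo, checkin_hi, price_lo, price_hi, limit):
    """Index (country, checkin, eprice): scan the whole checkin range, then sort in memory."""
    index = sorted(docs, key=lambda d: (d["country"], d["checkin"], d["eprice"]))
    keys = [(d["country"], d["checkin"]) for d in index]
    start = bisect.bisect_left(keys, (country, checkin_lo))
    end = bisect.bisect_right(keys, (country, checkin_hi))
    candidates = index[start:end]
    matches = [d for d in candidates if price_lo <= d["eprice"] <= price_hi]
    matches.sort(key=lambda d: d["eprice"])
    return matches[:limit], len(candidates)


docs = make_docs(10000, seed=1)
for label, lo, hi in [("wide checkin range", 0, 99), ("narrow checkin range", 10, 11)]:
    _, first_scanned = query_price_first(docs, "eg", lo, hi, 1, 2000, 20)
    _, last_scanned = query_price_last(docs, "eg", lo, hi, 1, 2000, 20)
    print(label, "- eprice first:", first_scanned, "scanned; eprice last:", last_scanned, "scanned")
```

In this model the price-first order wins when the other ranges are wide, because it can stop after the first 20 hits, and loses its advantage when they are narrow. That is one plausible reading of the behavior described above, but on real data explain() remains the only reliable judge.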
So indexing in MongoDB is for me now similar to machine learning: a little science, a little intuition and a lot of trying.
P.S. I still haven't tried the Dex indexing robot for MongoDB; maybe it could help with the problem, but I love trying to solve a problem by hand first.