Automated Data Collection with R
A Practical Guide to Web Scraping and Text Mining
(Language: English)
A hands-on guide to web scraping and text mining for both beginners and experienced users of R
* Introduces fundamental concepts of the main architecture of the web and databases and covers HTTP, HTML, XML, JSON, SQL.
* Provides basic techniques to query...
Unfortunately already sold out
Free shipping
Book (hardcover)
63.00 €
Product details
Product information for "Automated Data Collection with R"
Blurb for "Automated Data Collection with R"
A hands-on guide to web scraping and text mining for both beginners and experienced users of R
* Introduces fundamental concepts of the main architecture of the web and databases and covers HTTP, HTML, XML, JSON, SQL.
* Provides basic techniques to query web documents and data sets (XPath and regular expressions).
* An extensive set of exercises is presented to guide the reader through each technique.
* Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management.
* Case studies are featured throughout along with examples for each technique presented.
* R code and solutions to exercises featured in the book are provided on a supporting website.
Table of contents for "Automated Data Collection with R"
Preface xv
1 Introduction 1
1.1 Case study: World Heritage Sites in Danger 1
1.2 Some remarks on web data quality 7
1.3 Technologies for disseminating, extracting, and storing web data 9
1.3.1 Technologies for disseminating content on the Web 9
1.3.2 Technologies for information extraction from web documents 11
1.3.3 Technologies for data storage 12
1.4 Structure of the book 13
Part One A Primer on Web and Data Technologies 15
2 HTML 17
2.1 Browser presentation and source code 18
2.2 Syntax rules 19
2.2.1 Tags, elements, and attributes 20
2.2.2 Tree structure 21
2.2.3 Comments 22
2.2.4 Reserved and special characters 22
2.2.5 Document type definition 23
2.2.6 Spaces and line breaks 23
2.3 Tags and attributes 24
2.3.1 The anchor tag <a> 24
2.3.2 The metadata tag <meta> 25
2.3.3 The external reference tag <link> 26
2.3.4 Emphasizing tags <b>, <i>, <strong> 26
2.3.5 The paragraphs tag <p> 27
2.3.6 Heading tags <h1>, <h2>, <h3>, ... 27
2.3.7 Listing content with <ul>, <ol>, and <dl> 27
2.3.8 The organizational tags <div> and <span> 27
2.3.9 The <form> tag and its companions 29
2.3.10 The foreign script tag <script> 30
2.3.11 Table tags <table>, <tr>, <td>, and <th> 32
2.4 Parsing 32
2.4.1 What is parsing? 33
2.4.2 Discarding nodes 35
2.4.3 Extracting information in the building process 37
Summary 38
Further reading 38
Problems 39
3 XML and JSON 41
3.1 A short example XML document 42
3.2 XML syntax rules 43
3.2.1 Elements and attributes 44
3.2.2 XML structure 46
3.2.3 Naming and special characters 48
3.2.4 Comments and character data 49
3.2.5 XML syntax summary 50
3.3 When is an XML document well formed or valid? 51
3.4 XML extensions and technologies 53
3.4.1 Namespaces 53
3.4.2 Extensions of XML 54
3.4.3 Example: Really Simple Syndication 55
3.4.4 Example: scalable vector graphics 58
3.5 XML and R in practice 60
3.5.1 Parsing XML 60
3.5.2 Basic operations on XML documents 63
3.5.3 From XML to data frames or lists 65
3.5.4 Event-driven parsing 66
3.6 A short example JSON document 68
3.7 JSON syntax rules 69
3.8 JSON and R in practice 71
Summary 76
Further reading 76
Problems 76
4 XPath 79
4.1 XPath--a query language for web documents 80
4.2 Identifying node sets with XPath 81
4.2.1 Basic structure of an XPath query 81
4.2.2 Node relations 84
4.2.3 XPath predicates 86
4.3 Extracting node elements 93
4.3.1 Extending the fun argument 94
4.3.2 XML namespaces 96
4.3.3 Little XPath helper tools 97
Summary 98
Further reading 99
Problems 99
5 HTTP 101
5.1 HTTP fundamentals 102
5.1.1 A short conversation with a web server 102
5.1.2 URL syntax 104
5.1.3 HTTP messages 106
5.1.4 Request methods 108
5.1.5 Status codes 108
5.1.6 Header fields 109
5.2 Advanced features of HTTP 116
5.2.1 Identification 116
5.2.2 Authentication 121
5.2.3 Proxies 123
5.3 Protocols beyond HTTP 124
5.3.1 HTTP Secure 124
5.3.2 FTP 126
5.4 HTTP in action 126
5.4.1 The libcurl library 127
Author portrait: Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis
Peter Meißner is a research associate in the Comparative Parliamentary Politics working group at the University of Konstanz.
Bibliographic details
- Authors: Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis
- 2014, 1st edition, 474 pages, dimensions: 17.4 x 25.1 cm, hardcover, English
- Publisher: Wiley & Sons
- ISBN-10: 111883481X
- ISBN-13: 9781118834817
- Publication date: 26 December 2014
Language: English