Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.
This paper presents a case study of analyzing and improving intercoder reliability in discourse tagging using statistical techniques. Bias-corrected tags are formulated and successfully used to guide a revision of the coding manual and develop an automatic classifier.
This book offers a highly accessible introduction to Natural Language Processing, the field that underpins a variety of language technologies, ranging from predictive text and email filtering to automatic summarization and translation. With Natural Language Processing with Python, you’ll learn how to write Python programs to work with large collections of unstructured text. You’ll access richly annotated datasets using a comprehensive range of linguistic data structures.
In this study, we use a complete dataset of property trades by institutional-grade
REITs, which are legally mandated to report such trades to the SEC in their 10-K and 10-Q filings,
thus providing complete trading information and eliminating selection bias. We augment
this information with a dataset of property trades made by portfolio managers of private entities,
such as commingled real-estate funds, who have legally committed to disclose this information to
a private data collector under a strict non-disclosure agreement.
Lastly, a new form of science is emerging. Each scientific discipline is generating huge data volumes, for example,
from accelerators (physics), telescopes (astronomy), remote sensors (earth sciences), and DNA microarrays (biology).
Simulations are also generating massive datasets. Organizing, analyzing, and summarizing these huge scientific datasets
poses a real challenge for DBMSs, as does the positioning and transfer of data...
Want to tap the power behind search rankings, product recommendations, social bookmarking, and online matchmaking? This fascinating book demonstrates how you can build Web 2.0 applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it.
It’s tough to argue with R as a high-quality, cross-platform, open source statistical software product—unless you’re in the business of crunching Big Data. This concise book introduces you to several strategies for using R to analyze large datasets. You’ll learn the basics of Snow, Multicore, Parallel, and some Hadoop-related tools, including how to find them, how to use them, when they work well, and when they don’t.
This paper analyzes the importance of retail consumers’ banking relationships for loan defaults
using a unique, comprehensive dataset of over one million loans made by savings banks in Germany.
We find that loans to retail customers who have a relationship with their savings bank prior to
applying for a loan default significantly less often than loans to customers with no prior relationship.
Two other papers focus specifically on building sovereign debt datasets. Jeanne and Guscina
(2006) collected data on emerging economy public debt, with details on the jurisdiction of
issuance, maturity, currency, and indexation. The data cover 19 emerging economies during
1980–2002 and were compiled mainly from official publications, supplemented by
information from questionnaires to country authorities and IMF data.
English noun/verb (N/V) pairs (e.g., contract, cement) have undergone complex patterns of change among three stress patterns over several centuries. We describe a longitudinal dataset of N/V pair pronunciations, from which we derive a set of properties that any computational model must account for. We analyze the dynamics of five dynamical-systems models of linguistic populations, each derived from a model of learning by individuals.