Introduction to Textual Analysis with a Computer
Analyzing Texts Using XML-based Tools
Stewart Arneil, Greg Newton
General Introduction
Goal: Get a sense of what is possible with an XML-based approach
Goal: Apply text analysis skills using XML-based tools
Session 1: Text Analysis Concepts and Exercise
Session 2: Text Analysis in XML
==>
Session 1: Text Analysis Concepts and Exercise
Study examples
Mark up exercises
Discuss importance of consistent rules
Wrap-up
==>
Mark up a Sample Document: Emma
Markup (broadly defined as cues to understanding the meaning of the words) is an ancient concept
What instances of "markup" can you see in this extract from Jane Austen's Emma?
Whitespace, Metadata, Abbreviation, Capitalization, Quotes, Dash, Period, Ellipsis, Comma, Italics, QuestionMark, ExclamationMark
==>
Problems with this kind of "markup"
What is metadata and what is not?
What does a period mean? What do italics mean?
Conceptual problems: this markup is implicit, not explicit; visual, not semantic
Practical problem: Traditional texts are not machine-readable.
==>
Mark up a Sample Document: Sonnet 130
Consider a perspective: literary scholar, linguist, librarian, manuscript scholar, student
What features are worth noting in this document for a given perspective?
Indicate instances in document - underline, circle, highlighter, whatever
Where a feature is absent, add it in the margin
==>
Importance of a Consistent Set of Rules
Does everybody come up with the same list of features?
Does everyone agree on what counts as an instance for each feature?
How do we handle markup on one document for different perspectives?
Imagine scaling up to many documents, all of which must be amenable to automated processing
==>
Rulesets and XML (eXtensible Markup Language)
Graphical representation of a poem as a hierarchical tree
Markup (narrowly defined): the process of embedding tags in an electronic text so as to distinguish the text's logical, syntactic, or structural components.
XML-based document is a hierarchical tree
Example of a simple XML document and ruleset
Real rulesets are complex, but comprehensible by computers (and by humans if need be)
The Text Encoding Initiative (TEI) provides a set of standard modular rulesets and tools for creating them
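To make the "hierarchical tree" idea concrete, here is a minimal Python sketch that parses a tiny XML poem and prints its element hierarchy. The sample document is hypothetical: the `poem` and `title` elements are made up for illustration, while `lg` (line group) and `l` (line) are borrowed from TEI naming. It is not the course's actual ruleset or sample file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (an assumption, not the workshop's file).
POEM_XML = """
<poem>
  <title>Sonnet 130</title>
  <lg type="quatrain">
    <l n="1">My mistress' eyes are nothing like the sun;</l>
    <l n="2">Coral is far more red than her lips' red;</l>
  </lg>
</poem>
"""

def outline(element, depth=0):
    """Return one indented line per element, showing the tree hierarchy."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(POEM_XML.strip())
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the document: the poem contains a title and a line group, and the line group contains the lines.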
==>
Wrap up Session 1
Each document is treated as a hierarchical tree of nodes
Each document is a node in a hierarchical tree of documents
Assignment: update your markup based on discussion; try another markup for a different audience
Optional: look ahead to Session 2: download oXygen, download the DTD, mark up text, validate
==>
Session 2: Text Analysis in XML
Watch demonstration of oXygen XML editor
Mark up sample text document with oXygen XML editor
Watch demonstration of presentation styles applied to XML
Review and discuss model sites
Wrap-up
==>
oXygen - an Editor for XML Files
Download from oXygen site and install
Tell oXygen to start a new document and what DTD to use for validation
Copy and paste text content
Add tags, note how program helps with this based on DTD
Confirm document is well-formed (hierarchical tree) and valid (tested against DTD)
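The difference between "well-formed" and "valid" can be sketched outside oXygen. The Python snippet below checks only well-formedness (properly nested tags forming a single tree) using the standard library parser. The helper name `is_well_formed` and the two sample strings are assumptions made for illustration. Checking validity against a DTD is a separate step that this parser does not perform, which is why the workshop relies on oXygen for it.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as a single, properly nested tree."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical fragments: the second closes its tags in the wrong order.
good = "<lg><l>My mistress' eyes are nothing like the sun;</l></lg>"
bad = "<lg><l>My mistress' eyes are nothing like the sun;</lg></l>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A document can be well-formed and still invalid, for example if it uses an element the ruleset does not allow in that position.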
==>
Start New Document with oXygen
Start the oXygen application as you would any other program
Choose New... from the File menu, choose "XML Document" in the New Document dialog box, and click OK.
Check "Use DTD..." checkbox (tells oXygen to use a ruleset)
Put "http://web.uvic.ca/hrd/engl500xml/material/tei_verse.dtd" in the systemID field (tells oXygen which ruleset file to use)
Put "tei_verse.dtd" in the publicID field
Ensure that "TEI.2" is in the Root field, click OK.
Oxygen User Guide (http://www.oxygenxml.com/doc/ug-standalone)
==>
Edit Your Sample Document in oXygen
Note any red wiggly lines indicating errors in your file
Insert appropriate tags - note how oXygen helps / constrains you (requires knowledge of ruleset and material)
Copy and paste, or type in, text
Confirm document is well-formed (blue check) and valid (red check)
Save
Oxygen User Guide (http://www.oxygenxml.com/doc/ug-standalone)
==>
Demonstration of Doing Something with Your XML File
Raw XML isn't user-friendly, but it can be manipulated in many ways
Example: apply a Cascading Style Sheet for presentation
The CSS has been written assuming the structure specified by the DTD
One stylesheet can be applied to any document that validates against the DTD
Example of Sonnet 130 XML with no styling, example of same XML file with styles applied (original image)
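The point that one set of presentation rules works for every document sharing the same structure can be sketched in Python. This is an analogy, not CSS: the `STYLE` table, the `render` helper, and both sample documents are hypothetical, and the "styling" is plain-text formatting (uppercase titles, indented lines) rather than anything a browser would do.

```python
import xml.etree.ElementTree as ET

# Hypothetical presentation rules keyed by element name,
# playing the role a stylesheet plays for the XML file.
STYLE = {
    "title": lambda text: text.upper(),
    "l": lambda text: "    " + text,
}

def render(xml_text):
    """Apply STYLE to every element that has a rule, in document order."""
    root = ET.fromstring(xml_text)
    out = []
    for element in root.iter():
        rule = STYLE.get(element.tag)
        if rule and element.text:
            out.append(rule(element.text.strip()))
    return out

sonnet = ("<poem><title>Sonnet 130</title><lg>"
          "<l>My mistress' eyes are nothing like the sun;</l></lg></poem>")
other = ("<poem><title>Another poem</title><lg>"
         "<l>A first line</l><l>A second line</l></lg></poem>")

print("\n".join(render(sonnet)))
print("\n".join(render(other)))
```

Because the rules refer only to element names, the same table formats both documents without modification, which is the property the slide attributes to a stylesheet written against the DTD.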
==>
Closing Discussion
Consider one of your areas of research interest
Which, if any, of the ideas or techniques we've discussed might be relevant?
==>
Conclusion
A research question and expertise in content analysis are needed to mark up documents well
XML: unambiguous, explicit tree structure for information, from an individual character to a huge corpus of documents
XML tags generally specify what things are, not what they look like
Ruleset (e.g. DTD, Schema): specifies which elements go where
Any document validated by a ruleset can be processed by any software that respects that ruleset (edit, search, transform, present)
For more: UVic Summer Institute workshops on text encoding
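As an illustration of the "search" case mentioned above, the following Python sketch looks for a word in every verse line across a small corpus. Everything here is an assumption for demonstration purposes: the file names, the `poem`/`lg`/`l` structure, the `n` attribute, and the `find_lines` helper. The match is a simple case-insensitive substring test, so it would also match longer words containing the search term.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-document corpus sharing one structure.
CORPUS = {
    "sonnet130.xml": (
        "<poem><lg>"
        "<l n='1'>My mistress' eyes are nothing like the sun;</l>"
        "<l n='2'>Coral is far more red than her lips' red;</l>"
        "</lg></poem>"
    ),
    "sonnet18.xml": (
        "<poem><lg>"
        "<l n='1'>Shall I compare thee to a summer's day?</l>"
        "<l n='2'>Thou art more lovely and more temperate:</l>"
        "</lg></poem>"
    ),
}

def find_lines(corpus, word):
    """Return (document name, line number) for each <l> containing word."""
    hits = []
    for name, xml_text in corpus.items():
        root = ET.fromstring(xml_text)
        for line in root.iter("l"):
            if word.lower() in (line.text or "").lower():
                hits.append((name, line.get("n")))
    return hits

matches = find_lines(CORPUS, "more")
print(matches)
```

The search code never needs to know which poem it is reading. It relies only on the shared structure, which is what a common ruleset guarantees.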
==>
Credits
Greg Newton - present content & demos, markup sample document
Stewart Arneil - present content & demos, write presentation
Ray Siemens - consult on pedagogy, provide sample document
Martin Holmes - consult on markup, provide template for presentation
URL of this presentation: http://web.uvic.ca/hrd/engl500xml
==>